Big data contains patterns that can inform companies about their customers and vendors and help improve their business processes. Some of the biggest companies in the world, like Facebook, have used the MapReduce framework for their cloud computing applications, often by running Hadoop, an open-source implementation of MapReduce. MapReduce was designed by Google for parallel distributed computing over big data.
Before MapReduce, companies had to hire data modelers and buy supercomputers to extract timely insights from big data. MapReduce has been an important development in helping businesses solve complex problems across big data sets, like determining the optimal price for products, understanding the return on advertising investment, making long-term predictions, and mining web clicks to inform product and service development.
MapReduce works across a network of low-cost commodity machines, making actionable business insights more accessible than ever before. It is a strong computational tool for solving problems that involve things like pattern matching, social network analysis, log analysis, and clustering.
The logic behind MapReduce is to divide big problems into small, manageable tasks that are then distributed to hundreds of thousands of server nodes. The server nodes operate in parallel to generate results. From a programming standpoint, this involves writing a map script that transforms the input data into a collection of key-value pairs, and a reduce script that aggregates all pairs sharing the same key. One challenge is the time it takes to convert the data into key-value pairs and shuffle them between nodes, which increases latency.
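To make the map and reduce steps concrete, here is a minimal sketch of the classic word-count example in plain Python. It simulates all three phases (map, shuffle, reduce) in a single process; in a real cluster, the map and reduce functions would run on many nodes, and the shuffle between them is the latency-heavy step mentioned above. The function names and sample documents are illustrative, not part of any real framework.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) key-value pair for every word."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key (the step that adds latency
    in a distributed setting, since pairs move between nodes)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine all values that share the same key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big insights", "big data tools"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'insights': 1, 'tools': 1}
```

Because each map call and each per-key reduction is independent, the framework can scale out simply by assigning different chunks of input and different keys to different machines.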
Hadoop is Apache’s open-source implementation of the MapReduce framework. In addition to the MapReduce distributed processing layer, Hadoop uses HDFS for reliable storage and YARN for resource management, and it is flexible in dealing with both structured and unstructured data. New nodes can be added to Hadoop without downtime, and if a machine goes down, its data can be easily recovered. Hadoop can be a cost-efficient solution for big data processing, allowing terabytes of data to be analyzed within minutes.
But cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform offer similar MapReduce components, with the operational complexity handled by the cloud vendor instead of the individual business. Hadoop was known for its tight coupling of computation with storage, but in place of HDFS, cloud-based object stores now handle storage while compute scales separately, and container orchestration technology like Kubernetes can take the place of YARN. With this shift to cloud vendors, there have been growing concerns about the long-term vision for Hadoop.
Hortonworks was a data software company that supported open-source software, primarily Hadoop. But in January 2019, Hortonworks closed an all-stock $5.2 billion merger with Cloudera. While Cloudera also supports open-source Hadoop, it sells a proprietary management suite intended to help with installation and deployment, whereas Hortonworks was 100% open source. In May 2019, another Hadoop provider, MapR, announced it was looking for a new source of funding. On June 6, 2019, Cloudera’s stock declined 43% and its CEO left the company.
Understanding the advantages and disadvantages of the MapReduce framework and Hadoop in big data analytics is helpful for making informed business decisions as this field continues to evolve. On the drawbacks of Hadoop, Monte Zweben, the CEO of Splice Machine, which builds relational databases on Hadoop, says: “When we need to transport ourselves to another location and need a vehicle, we go and buy a car. We don’t buy a suspension system, a fuel injector, and a bunch of axles and put the whole thing together, so to speak. We don’t go get the bill of materials.”
What do you think? Please DM me or leave your feedback in the comments below.
#Hadoop #MapReduce #CloudComputing