Seeing the Big Picture: MapReduce, Hadoop and the Cloud

Big data contains patterns that can inform companies about their customers and vendors, as well as help improve their business processes. Some of the biggest companies in the world, like Facebook, have used the MapReduce framework for their cloud computing applications, often by implementing Hadoop, an open-source implementation of MapReduce. MapReduce was designed by Google for parallel, distributed computing over big data.

Before MapReduce, companies needed to pay data modelers and buy supercomputers to get timely insights from big data. MapReduce has been an important development in helping businesses solve complex problems across big data sets, like determining the optimal price for products, understanding the return on advertising investment, making long-term predictions and mining web clicks to inform product and service development.

MapReduce works across a network of low-cost commodity machines, allowing actionable business insights to be more accessible than ever before. It is a strong computation tool for solving problems that involve pattern matching, social network analysis, log analysis and clustering.

The logic behind MapReduce is basically dividing big problems into small, manageable tasks that are then distributed to hundreds or thousands of server nodes. The server nodes operate in parallel to generate results. From a programming standpoint, this involves writing a map script that turns the data into a collection of key-value pairs, and a reduce script that runs over all pairs sharing the same key. One challenge is the time it takes to convert and break the data into these new key-value pairs, which increases latency.
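As a minimal sketch of that programming model (not tied to any particular framework), the classic word-count example below shows what a map script and a reduce script might look like in Python. The input lines and the grouping step are simplified stand-ins for what a real MapReduce runtime would shuffle and distribute across nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map script: turn a line of text into (word, 1) key-value pairs."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    """Reduce script: combine all counts that share the same key."""
    return (word, sum(counts))

# Simplified stand-in for the shuffle step a real framework performs.
lines = ["big data big insights", "big data needs big compute"]
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(results)  # [('big', 4), ('data', 2), ('insights', 1), ('needs', 1), ('compute', 1)]
```

In a real cluster, the map calls run on many machines at once and the framework groups the pairs by key before the reduce calls run, which is where the latency of converting and moving key-value pairs comes in.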

Hadoop is Apache’s open-source implementation of the MapReduce framework. In addition to the MapReduce distributed processing layer, Hadoop uses HDFS for reliable storage and YARN for resource management, and it has the flexibility to handle both structured and unstructured data. New nodes can be added to Hadoop without downtime, and if a machine goes down, its data can be easily retrieved. Hadoop can be a cost-efficient solution for big data processing, allowing terabytes of data to be analyzed within minutes.

But cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform offer similar MapReduce components, where the operational complexity is handled by the cloud vendors instead of individual businesses. Hadoop was known for its strong combination of computation with storage, but cloud-based object stores such as Amazon S3 can now take the place of HDFS, with container orchestration technology like Kubernetes filling the role of YARN. With this shift to cloud vendors, there have been increased concerns about the long-term vision for Hadoop.
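As a rough sketch of the storage side of that shift, the snippet below reads a job's input directly from an object store instead of HDFS. It assumes the boto3 AWS library and configured credentials, and the bucket and key names are made up purely for illustration.

```python
import boto3  # AWS SDK for Python; assumes credentials are already configured

# Hypothetical bucket and object key, used only for illustration.
s3 = boto3.client("s3")
response = s3.get_object(Bucket="example-analytics-bucket",
                         Key="logs/2019/06/clicks.txt")

# Read the object's contents and hand each line to a map step, the way a
# cloud-based job would consume input without running an HDFS cluster.
for line in response["Body"].read().decode("utf-8").splitlines():
    pass  # ... feed `line` to a map script like the word-count sketch above ...
```

The design point is the decoupling: storage scales in the object store and compute scales separately in containers, instead of both living on the same Hadoop nodes.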

Hortonworks was a data software company that supported open-source software, primarily Hadoop. In January 2019, Hortonworks closed an all-stock $5.2 billion merger with Cloudera. While Cloudera also supports open-source Hadoop, it sells a proprietary management suite that is supposed to help with both installation and deployment, whereas Hortonworks was 100% open-source. In May 2019, another Hadoop provider, MapR, announced it was looking for a new source of funding. On June 6, 2019, Cloudera’s stock declined 43% and the CEO left the company.

Understanding the advantages and disadvantages of the MapReduce framework and Hadoop in big data analytics is helpful for making informed business decisions as this field continues to evolve. On the drawbacks of Hadoop, Monte Zweben, the CEO of Splice Machine, which builds a relational database on Hadoop, says, “When we need to transport ourselves to another location and need a vehicle, we go and buy a car. We don’t buy a suspension system, a fuel injector, and a bunch of axles and put the whole thing together, so to speak. We don’t go get the bill of materials.”

What do you think? Please DM me or leave your feedback in the comments below.

#Hadoop #MapReduce #CloudComputing

Consequences of Multiplying the Internet of Things

Internet of Things (IoT) devices are multiplying as technology costs decrease and smart device sales increase. Generally speaking, if a device has an on and off switch, there is a good chance it will become part of the IoT movement. IoT architecture includes the sensors on devices, the Internet, and the people who use the applications.

IoT devices are connected through Internet infrastructure and a variety of wireless networks. Smart devices by themselves are not well suited to dealing with massive amounts of data, let alone learning from the data they receive and generate. Currently, the data from IoT devices is relatively basic because of the small computing power and limited storage capacity of most devices. However, that basic data gets transferred to a data processing center with more advanced computing capability, which produces the desired business insights.
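A toy sketch of that division of labor is shown below, with made-up device names and readings: the "device" side only packages a small reading, while the "processing center" side does the aggregation and analysis.

```python
from statistics import mean

def device_reading(device_id, temperature_c):
    """A constrained device only packages a small, basic reading."""
    return {"device": device_id, "temperature_c": temperature_c}

def processing_center(readings):
    """The central side has the computing power to aggregate and analyze."""
    per_device = {}
    for r in readings:
        per_device.setdefault(r["device"], []).append(r["temperature_c"])
    return {device: round(mean(values), 1) for device, values in per_device.items()}

# Hypothetical readings sent over the network by two thermostats.
readings = [
    device_reading("thermostat-1", 21.5),
    device_reading("thermostat-1", 22.0),
    device_reading("thermostat-2", 19.0),
]
print(processing_center(readings))  # {'thermostat-1': 21.8, 'thermostat-2': 19.0}
```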

IoT smart devices require unique addresses that allow them to connect to the Internet, and there are challenges in addressing all of these new endpoints as the number of smart devices grows. Internet Protocol version 4 (IPv4) has the capacity for about 4.3 billion addresses, while Gartner estimates that by 2020 the world will have over 26 billion connected devices. However, several thought leaders are proposing a unified addressing scheme for IoT that may help solve this bottleneck.
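The arithmetic behind that squeeze is easy to check. The short calculation below compares the 32-bit IPv4 address space (and, for contrast, the 128-bit IPv6 space) against the 26 billion device estimate cited above.

```python
# IPv4 uses 32-bit addresses; IPv6 uses 128-bit addresses.
ipv4_addresses = 2 ** 32          # about 4.29 billion
ipv6_addresses = 2 ** 128         # about 3.4 x 10^38
estimated_devices_2020 = 26_000_000_000

print(f"IPv4 address space: {ipv4_addresses:,}")
print(f"Devices needing to share each IPv4 address: "
      f"{estimated_devices_2020 / ipv4_addresses:.1f}")
print(f"IPv6 address space: {ipv6_addresses:.3e}")
```

Roughly six devices would have to share every single IPv4 address, which is why larger address spaces and unified addressing schemes matter for IoT.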

IoT applications also face bottlenecks around the quality of current artificial intelligence algorithms. For example, increasing the transparency and reducing the bias of algorithms continues to pique the interest of citizens and could pose challenges to some proprietary business models. With machine learning, producing training sets that are actually representative of the target populations also remains a challenge.

There are additional obstacles related to the physical path of the transmission media. For example, IoT devices can receive or transmit data using a variety of technologies, from RFID to Bluetooth. The common problems associated with these kinds of transmission media, from bandwidth limits to interference, also create problems for IoT. Optimizing transmission media to support and sustain networks is an ongoing challenge for IoT applications.

Security is also an ongoing concern for IoT, since the basic data feeds into a receiver on the Internet. Many IoT devices are low-powered, constrained devices, making them more susceptible to attack. The security challenges of IoT include ensuring that data has not been changed during transmission and protecting data from unwanted exposure. The World Economic Forum estimates that if a single cloud provider were successfully attacked, it could cause $50 billion to $120 billion of economic damage. With the growth of poorly protected devices on shared infrastructure, there is a wide attack surface for hackers, where IoT botnets could harness swarms of connected sensors such as thermometers, sprinklers and other devices. A recent State of IoT Security Research report found that 96 percent of businesses and 90 percent of customers think there should be IoT security regulations. As public confidence in security decreases while IoT sales increase, regulatory reform becomes more likely.
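One of the integrity concerns mentioned above, detecting whether data was changed in transit, can be illustrated with a standard message authentication code. The sketch below uses Python's built-in hmac module with a made-up shared key and sensor payload; it is only an illustration of the idea, not a prescription for any particular IoT stack.

```python
import hmac
import hashlib

# Hypothetical pre-shared key and sensor payload, for illustration only.
shared_key = b"demo-key-not-for-production"
payload = b'{"device": "thermostat-1", "temperature_c": 21.5}'

# Sender attaches an HMAC tag computed over the payload.
tag = hmac.new(shared_key, payload, hashlib.sha256).hexdigest()

def verify(message, received_tag):
    """Receiver recomputes the tag and compares in constant time."""
    expected = hmac.new(shared_key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_tag)

print(verify(payload, tag))                 # True: data arrived unchanged
print(verify(payload + b" tampered", tag))  # False: data was altered in transit
```

The catch for IoT is that even this lightweight check consumes compute and requires key management, which is exactly what low-powered, constrained devices struggle with.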

IoT allows businesses to solve problems and even delight their customers by leveraging the intelligence of connected devices. While there is always uncertainty and risk with new technology, and customer confidence in IoT may be hit or miss, the promise of IoT is a fully connected world where devices connect with each other and with people to enable action that has never before been possible.

#IoT #Cybersecurity #BigData

Web Evolution and Eliminating Performance Bottlenecks

If the Internet is a bookstore, the World Wide Web is the collection of books within that store. The Web is a collection of information that can be accessed via the Internet. The Web was created in 1989 by Sir Tim Berners-Lee and remained relatively quiet in its early years, but as users increased, companies like Google developed algorithms to better index content, which eventually led to the concept of SEO (a significant driver of the Internet today). Sir Tim Berners-Lee’s initial vision of the Web was explained in a document called “Information Management: A Proposal,” but today, with Facebook and social media, the Web has also become a communication tool.

Back in 1989, Sir Tim Berners-Lee described three fundamental technologies that are still foundational to the Web today: HTML, URI, and HTTP. HTML is the markup language of the Web, a URI is the address (commonly seen as a URL), and HTTP supports the retrieval of linked resources across the Web. These core technologies used in Web 1.0 are responsible for today’s large-scale web data. In Web 1.0, bottlenecks included web pages that were only understandable by a human. Web 1.0 was also slow, and pages needed to be refreshed often. In retrospect, it is easy to identify that Web 1.0 had servers as a major bottleneck and lacked a sound systems design with networked elements. Nonetheless, Web 1.0 is referred to as the “web of content” and was critical to the development of Web 2.0.
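Those three building blocks are still visible in a single page fetch: a URI names the resource, HTTP retrieves it, and HTML is what comes back. Here is a minimal sketch using Python's standard library against the reserved example.com domain.

```python
import http.client

# URI: https://example.com/  ->  host "example.com", path "/"
conn = http.client.HTTPSConnection("example.com")
conn.request("GET", "/")              # HTTP: ask the server for the resource
response = conn.getresponse()

html = response.read().decode("utf-8")  # HTML: the markup a browser would render
print(response.status)                   # e.g. 200
print(html[:60])                         # the start of the page's HTML
conn.close()
```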

Web 2.0 began in 1999 and let people contribute, modify, and aggregate content using a variety of applications, from blogs to wikis. This was revolutionary in the sense that the Web moved from being focused on content to being focused on a communication space, where content was created by individual users instead of just being produced for them. Web 2.0 embraced the reuse of collective information, crowdsourcing, and new methods of data aggregation. In terms of online architecture, Web 2.0 drove collaborative knowledge construction, where networking became more critical to driving user interaction. At the same time, issues around open access and the reuse of free data started to surface. Performance issues were encountered with frequent database access, which put a strain on Web 2.0’s scalability. The good news is that bottlenecks on the database server side were eased by the ability to keep databases in memory and by high-performance multi-core processors that supported enhanced multi-threading. Even so, with the benefits of Web 2.0’s flexible web design, creative reuse, and collaborative content development came new bottlenecks driven by the increased volume of user-generated content.
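A very simplified illustration of easing that database pressure is keeping hot results in memory and only falling through to the database on a miss. The database call below is a hypothetical placeholder, and the caching shown is just one of many techniques that reduced server-side load in this era.

```python
import functools

def fetch_from_database(user_id):
    """Hypothetical stand-in for an expensive database query."""
    print(f"Database hit for user {user_id}")
    return {"id": user_id, "name": f"user-{user_id}"}

@functools.lru_cache(maxsize=1024)
def get_user(user_id):
    # Repeated requests for the same user are served from memory,
    # so the database only sees the first request.
    return fetch_from_database(user_id)

get_user(42)   # prints "Database hit for user 42"
get_user(42)   # served from the in-memory cache, no database hit
```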

Web 3.0 started around 2003 and was termed the “web of context.” Web 3.0 is the era of defined data structures and of linking data to support knowledge searching and automation across a variety of applications. Web 3.0 is also referred to as the “semantic” Web, which was revolutionary in the sense that it shifted the focus so the Web could be read not only by people but also by machines. In this spirit, different models of data representation surfaced, like the concept of nodes, which led to the scaling of web data. One of the challenges of the Web 3.0 data models was that the processes of locating and extracting data turned into a bottleneck.
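A tiny sketch of the underlying idea, with made-up subjects and predicates: represent facts as subject-predicate-object triples so that a machine, not just a person, can follow the links between pieces of data.

```python
# Hypothetical subject-predicate-object triples, the core shape of linked data.
triples = [
    ("ShannonBlock", "authorOf", "WebEvolutionArticle"),
    ("WebEvolutionArticle", "topic", "SemanticWeb"),
    ("SemanticWeb", "alsoKnownAs", "Web 3.0"),
]

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the fields that were provided."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# A machine can traverse the links: what is this article about?
print(query(subject="WebEvolutionArticle", predicate="topic"))
```

Locating and extracting the right triples across billions of such links is exactly the kind of process that became a bottleneck for Web 3.0 data models.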

Web 4.0 began around 2012 and was named the “web of things.” Web 4.0 further evolved the Web into a symbiotic web focused more on the intersection of machines and humans. At this point, Internet of Things devices, smart home products and health monitoring devices started to contribute to big data. Mobile devices and wireless connections helped support data generation, and cloud computing took a stronghold in helping users both create and control their data. However, bottlenecks were created by the multitude of devices, gadgets and applications connected to Web 4.0, along with changing Internet of Things protocols and exponentially growing big data logs.

Web 5.0 is currently referred to as the “symbiont web” or the web of thoughts. It is designed in a decentralized manner where devices can find other interconnected devices. Web 5.0 creates personal servers for personal data stored on a smart device like a phone, tablet, or robot. This enables the smart device to scan the 3D virtual environment and use artificial intelligence to better support the user. The bottleneck in Web 5.0 becomes the memory and computational power each interconnected smart device needs in order to process the billions of data points required for artificial intelligence. Web 5.0 is recognized for emotional integration between humans and computers. However, the algorithms involved in understanding and predicting people’s behavior have also created a bottleneck for Web 5.0.

Where will Web evolution end? One thing is for sure: data generation is increasing year after year. To continue to get new functionality out of the evolving Web, new bottlenecks will need to be addressed. There are a variety of future considerations around anticipated bottlenecks, from encoding strategies to improving query performance. However, the best way to predict the future is to invent it.

#WebEvolution #WebPerformance #OnlineArchitecture #Innovation

About the Author

Shannon Block is an entrepreneur, mother and proud member of the global community. Her educational background includes a B.S. in Physics and a B.S. in Applied Mathematics from George Washington University and an M.S. in Physics from Tufts University, and she is currently completing her Doctorate in Computer Science. She has been the CEO of both for-profit and non-profit organizations. Follow her on Twitter @ShannonBlock or connect with her on LinkedIn.