Scalable and Intelligent Security Analytics: Splunk, Devo, IBM and McAfee

Organizations of any size can be victims of a cyber attack. Small and medium-sized organizations can be tempting targets because they may present fewer obstacles to attackers. On the other hand, large employers face challenges in thinking strategically about security governance structures and the dimensions of monitoring. Security analytics tools can help address common problems, but I have found that solutions vary depending on what you are trying to do. Many companies are subject to industry regulations such as the Health Insurance Portability and Accountability Act, the Payment Card Industry Data Security Standard, and the Sarbanes-Oxley Act, which create compliance requirements. Security analytics tools can help address those requirements while also mitigating the risk of data breaches and security attacks.

Below are some of the pros and cons of tools from Splunk, Devo, IBM, and McAfee, as well as their primary functions, such as anomaly detection, event correlation, and real-time analytics. Also, given the explosion of cloud computing, I considered each tool relative to the cloud computing environment. I also explored the security issues or targeted applications that each tool seeks to address, as well as the critical design considerations for its scalability.


Splunk offers a security intelligence platform that supports many security data sources, such as network security, endpoint solutions, malware and payload analysis, network data, asset management systems, and threat intelligence. Splunk has a security operations suite that covers real-time security monitoring, advanced threat detection, fraud analysis, and incident management. Their analytics-driven SIEM solution is focused on visibility, context, and efficiency, with a modern and flexible big data platform that uses machine learning to perform behavioral analytics. Splunk’s enterprise security solution is customizable with drill-down capability. Also, the Splunkbase app store has over 600 apps that can be leveraged with Splunk’s security products.

In terms of advantages, Splunk provides holistic solutions that can grow with users over time. Also, Splunk offers a broad array of partner integration services, in addition to many applications. In terms of disadvantages, Gartner has expressed some concerns about their licensing model and the high costs associated with implementation. Also, businesses that want an on-premises appliance have to engage a third-party provider. Another drawback is that Splunk’s advanced threat detection solutions were not ranked as highly as other top products in the marketplace.


Devo offers a next-gen SIEM solution that serves as a central hub for data and processes within the security operations center. They offer a cloud-native SIEM with flexible deployment models to help companies streamline security operations as they shift to the cloud. Devo’s capability is deployed through a scalable, extensible data analytics platform that can handle petabyte-scale data growth and real-time data analytics. Their solution offers holistic insight into growing attack surfaces, which helps organizations cope with an overwhelming number of security alerts while providing relevant context to prioritize investigation. The main features of their next-gen SIEM platform include:

  • Behavioral analytics – using behavioral analytics as the foundation for detection moves away from rules-based detection to better prioritize high impact threats with context to support intervening actions
  • Community collaboration – solutions focus on the relationship with peers and providers, sharing of proprietary intelligence
  • Analytics insight – SIEM can learn from analyst behavior to help automate investigations and enhance decision making with continual learning
  • Orchestration & automation – SIEM enables a rapid threat response by integrating automated, manual and repetitive processes to improve incident response workforce

Devo’s solutions involve applications in detecting and hunting high impact threats in real time, triaging and investigating high confidence alerts, and increasing signal with rich behavioral analytics, while enhancing speed through actionable insight. The Devo architecture parallelizes the data pipeline, allowing growth without decreased performance.

The advantages of the Devo Data Operations Platform include performance, scalability, accessibility, security, and cost-efficiency in a full-stack, multi-tenant platform. Their solutions offer the ability to find unusual behavior in real-time. A drawback of their solution is the lack of a comprehensive free trial and limited reviews as it relates to their newer products.


IBM’s silent security strategy includes the QRadar platform, which can be deployed as a virtual appliance, as SaaS, as infrastructure as a service, or as a traditional appliance. They also provide a hybrid option where the SaaS solution is hosted in the IBM Cloud, which includes remote monitoring from the managed security service operations centers. They offer a variety of user and entity behavior analytics functionality that is based on machine learning.

Their silent security model allows organizations to silently understand which people have access to data, detect insider threats with behavioral analytics, enforce the principle of least privilege, and protect data with multi-factor authentication. Their solution is focused on a seamless user experience with single sign-on and seamless authentication, and it leverages design thinking techniques to create solutions targeted to the user. Moreover, their application helps companies get ahead of issues related to compliance and regulation, delegate and simplify access recertification for LOBs, map roles to business activities, manage user data for GDPR, and secure transactions for PSD2. Their Silent Security product line helps companies secure their business, enable digital transformation, prove compliance, and provide security for business assets. An extension of the IBM QRadar Security Intelligence Platform, QRadar Behavior Analytics, runs on machine learning algorithms to help detect threats. It also includes a dashboard that lists risky users by name, identifying unusual activity by looking at QRadar-associated incidents that differ from those of their peers or that have invalid sequences of operations.

IBM Security Guardium, a complementary feature, provides end-to-end data security and compliance solutions. This feature includes around-the-clock data activity monitoring, data protection design, and configuring and customizing data policy settings. IBM Security Secret Server is used for protecting and auditing privileged account access and authentication secrets across the business. Also, IBM’s Cloud Identity and Security Access Manager can assess high-risk activities while also providing robust authentication features. IBM Managed Identity Services then help with handling user access and diagnosing root causes in IAM programs. IBM’s security solutions work across the security lifecycle for both onsite and cloud applications.

In terms of advantages, IBM’s QRadar program is a fit for medium and large businesses looking for core SIEM functionality or those that want a unified platform to manage several security solutions. In terms of disadvantages, according to Gartner, some IBM clients have turned to third-party solutions instead of IBM’s solutions. Also, QRadar’s UBA functionality can lag behind that of some other vendors. Another drawback is that the IBM Resilient incident response tool does not have native integration within the QRadar platform. Also, automation can only be accessed on IBM’s Incident Response Platform, and some threat-hunting capabilities are only available at premium pricing.


McAfee offers integrated tools for a variety of security needs. The McAfee Enterprise Security Manager provides a security framework that includes monitoring and threat defense features. Their solutions are built to streamline operations and synchronize device data loss prevention with the cloud, and they can be used with any cloud service. The McAfee MVISION Cloud service protects data while stopping threats in the cloud across SaaS, PaaS, and IaaS from a single, cloud-native enforcement point.

Main features include helping organizations meet security and compliance requirements when transferring information technology environments to the cloud while extending data loss prevention, threat protection, and application security across public, hybrid, and private cloud environments or software-defined data center environments. Another key feature of McAfee includes reviewing security responsibilities related to protecting user access, data, and network traffic. Their McAfee MVISION Cloud solution helps with enforcing data loss prevention policies in the cloud, preventing unauthorized sharing of sensitive data, blocking sync downloads, detecting compromised situations, encrypting cloud data, and auditing for misconfiguration. Their Cloud Security Maturity Dashboard includes a Cloud Security Report, Cloud Security Maturity Scores, and Quadrant and Cloud Security Recommendations.

In terms of advantages, McAfee provides solid central management, the GUI is user-friendly, it supports both Mac and Linux operating systems, it has a large user community, and deployment and administration are fairly straightforward. Also, McAfee’s solution has been recognized for its successful machine learning algorithms in preventing attacks. In terms of disadvantages, McAfee can sometimes require additional software, updates come from third-party applications, and the solution consumes considerable CPU and memory. Also, some customers have commented that the screen can hang while the system is scanning, affecting other operations. Additionally, some customers have noted concerns about cost relative to their requirements.


Overall, security analytics tools are essential in gathering, filtering, and integrating diverse security event data to holistically view the security of a company’s infrastructure. The security analytics market is changing fast with the merger of vendors, the addition of new capabilities, and the deployment of solutions in the cloud. While security analytics tools have a variety of capabilities, hopefully this post provided some initial insight into some of the popular products. While there is no single taxonomy for security analytics, most requirements include things like basic security analytics, significant enterprise use cases, a focus on advanced persistent threats and forensics, as well as a variety of security tools and services.

Streaming Data Solutions: Flink versus Spark

While real-time stream processing has been around for a while, businesses are now trying to quickly process larger volumes of streaming data. Streaming data is everywhere, from Twitter and sensors to stock ticker prices and weather. Streaming data comes in continuously, which poses challenges for processing it.


Flink was initially written in Java and Scala and exposes many Application Programming Interfaces (APIs), including the DataStream API. Flink was developed at a German university and became an Apache incubator project in 2014.


Similar, but different, Spark Streaming is one of the most used libraries in Apache Spark. Spark developers create streaming applications using the DataFrame or Dataset APIs, which are available in programming languages like Java, Python, and R. The product is essentially an extension of the core Spark API.


Similarities

Both Flink and Spark are big data systems that are fault-tolerant and built to scale. While both are in-memory processing engines that have the ability to write data to permanent storage, the goal is to keep data in memory for current usage. Both products enable programmers to use MapReduce functions and apply machine learning algorithms to streaming data. That is, both Flink and Spark are good at machine learning tasks that process large training and testing datasets across a distributed architecture. Also, both technologies can work with Kafka (a streaming product originally developed at LinkedIn), as well as Storm topologies.

Differences

Flink was made to be a streaming product, whereas Spark added streaming onto an existing service line. Spark was initially built for static data, whereas Flink can process batch operations by stopping the stream. With Spark, the stream data was initially divided into micro-batches that repeat in a continuous loop. This means that, for each batch, the file needs to be opened, processed, and then closed. However, in 2018, with Spark 2.3, Spark started to move away from the previous “micro-batch” approach. In contrast, Flink has, for some time, been breaking streaming data into finite sets at a checkpoint, which can be an advantage in terms of speed in running algorithms.

Flink’s Performance


Flink can be customized to have optimal performance. Specifically, code logic changes and configuration are relevant to performance. For example, event time or processing time can be considered as it relates to performance effectiveness.

Flink distinguishes between “processing time,” which is generated at each machine in a cluster, and time stamped at the entry point machine in a cluster. The time generated at the entry point machine is known as the “ingestion time,” whereas the “event time” is generated at the source when the event occurs. Several scholars recommend using event time because the event time is constant, which means operations can generate deterministic results regardless of throughput. On the other hand, the processing time is the time observed by the machine, so operations based on processing time are not deterministic. In practice, while events are thought of as real-time, using event time assumes that the clocks at event sources are synchronized, which is rarely the case. This challenges the assumption that the event time is monotonically increasing. Setting an allowed lateness solves the dropped events problem, but a large lateness value can still have a significant effect on performance. Without setting the lateness, events can be dropped due to incorrect timestamps.
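To make the lateness tradeoff concrete, here is a minimal plain-Python simulation of event-time windows. It is a sketch and does not use Flink's API. The function name, the integer timestamps, and the simplified watermark rule (the watermark is the largest event time seen so far) are all assumptions made for illustration.

```python
from collections import defaultdict


def count_by_event_time(event_times, window_size, allowed_lateness=0):
    """Count events per tumbling event-time window.

    event_times: integer event timestamps, listed in ARRIVAL order.
    An event is dropped when the watermark has already passed the end of
    its window plus the allowed lateness (a simplified rule, assumed here).
    """
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")
    for t in event_times:
        # Simplified watermark: the largest event time observed so far.
        watermark = max(watermark, t)
        window_start = (t // window_size) * window_size
        window_end = window_start + window_size
        if watermark >= window_end + allowed_lateness:
            dropped.append(t)  # too late, the window is already closed
        else:
            counts[window_start] += 1
    return dict(counts), dropped


# Events 3 and 9 arrive out of order because source clocks are not in sync.
arrivals = [1, 4, 12, 3, 14, 25, 9]
for lateness in (0, 5, 20):
    print(lateness, count_by_event_time(arrivals, 10, lateness))
```

With no allowed lateness both out-of-order events are dropped. A larger lateness keeps them, at the price of holding each window open (and its state in memory) for longer.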

Regardless of what approach is chosen, the key for efficient processing time is making sure the logic can handle events in the same event time window being split into smaller processing time windows. Researchers have also shown some performance efficiencies can be achieved by not breaking up complex events, but the tradeoff is the operators have to go through the dimensions in each event, and the event object is larger.

Spark’s Performance

In terms of Spark, identified bottlenecks include the network and disk I/O. The CPU can also be a bottleneck but is not as common. Resolving CPU bottlenecks is estimated to improve job completion time by 1-2%. Some of the challenges in managing Spark performance include that tasks can create bottlenecks on a variety of resources at different times. Also, concurrent tasks on a machine may compete for resources. Additionally, memory conditions can be a common issue since Spark’s traditional architecture is memory-centric. The causes of these performance setbacks often involve high concurrency, inefficient queries, and incorrect configurations. These issues can be mitigated with an understanding of both Spark and the data, realizing that Spark’s default configuration may not be the best for optimizing performance.

Final Thoughts

The importance of solutions like Flink and Spark lies in allowing businesses to make important decisions based on what is currently happening. No one framework solves all the problems, so it becomes a question of best fit. Understanding the system and resources can help in addressing performance bottlenecks. There are many stream processing applications, and it is essential to pick a framework that best meets the needs of the business, as not all products are the same. Flink and Spark are two of the popular open-source stream processing frameworks. Depending on the application, parameters need to be set correctly to meet performance goals. It is essential to understand the tradeoffs involved to get the best performance relative to business needs.

#Spark #Flink #Performance #StreamingData #BigData

What Should Be Keeping You Up At Night: Where is Big Data Stored?

The digital universe is expected to double in size every two years with machine-generated data experiencing a 50x faster growth rate than traditional business data. Big data has a lifecycle which includes:

  • Raw data
  • Collection
  • Filtering and classification
  • Data analysis
  • Storing
  • Sharing & publishing
  • Security
  • Retrieval, reuse, & discovery

However, viewing security as an isolated stage in the lifecycle can be misleading since the storing, sharing, publishing, retrieval, reuse, and discovery are all involved with security.

96% of organizations are estimated to use cloud computing in one way or another. Cloud computing is a distributed architecture model that can centralize several remote resources on a scalable platform. Since cloud computing offers data storage en masse, it is critical to think about security as it relates to storage. With storage, the primary security risks stem from both the location where the data is stored and the volume of the data.

Even if the data is stored in the cloud, it can be challenging to understand if those cloud vendors are storing all the data. Companies must ask not just about costs when selecting cloud vendors but also where their data is stored. Understanding where data is stored is fundamental to several other security and privacy-related issues. Reasons to understand where data is stored could be as simple as mitigating risks caused by geographic weather concerns. For example, if a hurricane hits Florida in a place where your data is stored, do you know if it has been backed up to a safe location? Also, how is the data in the data center protected, not just from weather events, but from intruders and cybercrime? Compliance regulations like the General Data Protection Regulation (GDPR) make the company responsible for the security of its data, even if that data is outsourced to the cloud. Before a company can really answer questions like who has access to data and to whom the company sent data, understanding where the data is stored is critical. This situation becomes more relevant in the event of a breach, which is likely a matter of when, not if, a breach will occur.

Data verification is also essential to ensure the data is accurate. Verifying the actual data stored in the cloud is not as easy as just downloading the entire data set to see if it has been stored with integrity, because of cost and local bandwidth. Over the years, some query authentication methods have addressed issues of correctness, completeness, and freshness. Basically, a set of data values can be authenticated by a binary hash tree. Verification of the data values is based on the hash value of the root of the tree: the customer iteratively computes all the hashes up the tree and checks whether the hash computed for the root matches the authentically published value. Creating automated processes has helped with data verification efforts, but the algorithms continue to evolve to support faster and larger-scale verification for different versions of the data.
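The hash-tree check described above can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's protocol. The function names, the choice of SHA-256, and the rule of duplicating the last hash when a level has an odd number of nodes are assumptions for this example.

```python
import hashlib


def _h(data: bytes) -> bytes:
    """Hash helper (SHA-256 is an assumption, any secure hash would do)."""
    return hashlib.sha256(data).digest()


def build_levels(values):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [_h(v) for v in values]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # odd count: pair the last node with itself
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def proof_for(levels, index):
    """Collect the sibling hashes needed to recompute the root for one leaf."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1  # neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(value, proof, published_root):
    """Recompute hashes up the tree and compare with the published root."""
    node = _h(value)
    for sibling, sibling_is_left in proof:
        node = _h(sibling + node) if sibling_is_left else _h(node + sibling)
    return node == published_root


values = [b"row-1", b"row-2", b"row-3"]
levels = build_levels(values)
root = levels[-1][0]  # the value the data owner would publish
print(verify(b"row-2", proof_for(levels, 1), root))
print(verify(b"tampered", proof_for(levels, 1), root))
```

The customer only needs one value plus a logarithmic number of sibling hashes, which is why this avoids downloading the whole data set.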

Security enforcement has been increasing with new global regulations. Companies in this space have faced enforcement, including in July 2019, when Marriott was fined EUR 100 million for failure to implement appropriate information security protocols, resulting in a breach of 339 million customer records. In the same month, British Airways was fined EUR 183 million for failure to implement appropriate information security protocols, which resulted in a breach of 500,000 customer records. While some of these fines may be a drop in the bucket for larger companies, smaller companies may be gambling on not investing in the needed systems because they do not think their organizations are high profile enough to face enforcement. However, as data fines continue to increase, now is the time to re-evaluate cyber defense positioning. Regardless of company size, all organizations can at least start a regular dialogue about understanding where their data is stored.

#Cyber #Security #BigData #DataStorage #Cloud

Machine Learning and Extracting Knowledge from Big Data

The Resource Description Framework (RDF) is essentially an application of Extensible Markup Language (XML) that helps describe Internet resources like a website and its content. RDF descriptions are called metadata since they are typically data about data, like a particular site map or the date a page was updated. RDF is based on the idea of a model built from statements about web resources. It is essential because the framework makes things easier for developers who build products using that metadata.
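As a small illustration of what an RDF statement looks like, the sketch below stores two statements about a page as subject-predicate-object triples and prints them in the line-based N-Triples syntax, chosen here only because it is shorter than the XML serialization. The `Triple` class, the `to_ntriples` helper, the example.org page, and the literal values are hypothetical. The two predicates are Dublin Core terms.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str
    obj_is_uri: bool = False  # False means the object is a plain literal


def _escape(literal: str) -> str:
    """Escape backslashes and double quotes inside a literal."""
    return literal.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(triples):
    """Serialize triples as N-Triples, one statement per line."""
    lines = []
    for t in triples:
        obj = f"<{t.obj}>" if t.obj_is_uri else f'"{_escape(t.obj)}"'
        lines.append(f"<{t.subject}> <{t.predicate}> {obj} .")
    return "\n".join(lines)


page = "https://example.org/index.html"  # hypothetical resource
metadata = [
    Triple(page, "http://purl.org/dc/terms/title", "Home page"),
    Triple(page, "http://purl.org/dc/terms/modified", "2019-07-01"),
]
print(to_ntriples(metadata))
```

Each line is metadata in the sense used above: data about the page, such as when it was last updated.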

A study by Casteleiro et al. (2016) explored the ability to derive distributed word representations from machine learning algorithms using terms from the Cardiovascular Disease Ontology. This study was critical because it demonstrated the benefits of using terms from ontology classes to obtain other term variants. The study opened up research into the feasibility of different methods that can scale with big data and enable automation of machine learning analysis.

Sajjad, Bajwa, and Kazmi’s (2019) research looked at rule engines and producing rules in the era of big data. They proposed a method to work with the semantic complexity in the rules and then automatically generate an RDF model of the rules to help in analyzing big data. Specifically, they used a machine learning technique to classify Semantics of Business Vocabulary and Rules (SBVR) rules and map them to the RDF model. Challenges for the research included the automatic parsing of the rules as well as their semantic interpretation. Also, mapping the vocabulary to the RDF syntax to verify the RDF schema proved successful, but challenging. However, their work did show that it was possible to check a set of big data rules for consistency through automated tools. These scholars also found a need for a method to semantically analyze rules to help with testing and validation as rules change. Their particular system builds an ontology model that can be useful in the interpretation of a set of rules. This research not only supports the semantic understanding of rules but also generates an RDF model of the rules that provides support for querying.

#MachineLearning #Knowledge #BigData #RDF #XML

References

Casteleiro, M. A., Demetriou, G., Read, W. J., Prieto, M. J. F., Maseda-Fernandez, D., Nenadic, G., … & Stevens, R. (2016). Deep Learning meets Semantic Web: A feasibility study with the Cardiovascular Disease Ontology and PubMed citations. In ODLS (pp. 1-6).

Sajjad, R., Bajwa, I. S., & Kazmi, R. (2019). Handling Semantic Complexity of Big Data using Machine Learning and RDF Ontology Model. Symmetry, 11(3), 309.

Detecting Bots with IP Size Distribution Analysis

Kylie Jenner reportedly makes $1 million per paid Instagram post, and Selena Gomez is a close second with over $800K per sponsored post. Just this year, location-based marketing is predicted to grow to $24.4 billion in ad spending. Nearly half of advertisers plan on using influencer marketing this year as real click rates can translate into purchased products and services.


As such, this market is ripe for cyber-attacks. However, one way to detect these hackers is to look at the IP size distribution, where the IP size is the number of users that are sharing the same source IP. IP size distributions are created from 1) actual users, 2) sponsored providers that provide fraudulent clicks, and 3) bot-masters with botnets. The good news is that most machine-generated attacks share an anomalous deviation from the expected IP size distribution.

However, bots are changing every day as they become more similar to human usage. Gen 1 bots surfaced from in-house scripts but can usually be detected by the absence of cookies. Gen 2 bots are scrappy and can typically be found by the absence of JavaScript firing. Gen 3 bots look like browsers (as compared to Gen 1 and 2 bots), but can still be detected using challenge tests and fingerprinting. However, Gen 4 bots look more like human usage with their non-linear mouse movements.


Security frameworks, supported by machine learning techniques, have been implemented to automatically detect and group deviations. Most Gen 4 bots can be detected with behavioral analysis. Frameworks aggregate statistics around network traffic to recommend investigations. For example, anomaly detection algorithms can be written to find unusual patterns that do not fit with expected behavior. Code can be written to run MapReduce in parallel, assigning a distinct cookie ID to each created click. Then a regression model can be used to compare the IP rates using a Poisson distribution, with a diverse explanatory model to count the unique cookies and measure the entropy relative to the distribution, so that the accurate IP size can be determined. This data can also be analyzed using linear regression and percentage regression techniques to help identify the true IP size.
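A much-simplified sketch of this idea is shown below. It groups click records by source IP in a map/reduce style, estimates each IP size as the number of distinct cookies, measures cookie entropy, and flags sizes that are very unlikely under a Poisson model. It replaces the regression step with a plain Poisson tail test, and the function names, the `expected_size` rate, the `alpha` threshold, and the sample click log are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def ip_sizes(clicks):
    """Estimate IP size as the number of distinct cookie IDs per source IP."""
    cookies_by_ip = defaultdict(set)
    for ip, cookie in clicks:  # "map": emit (ip, cookie) pairs
        cookies_by_ip[ip].add(cookie)
    return {ip: len(c) for ip, c in cookies_by_ip.items()}  # "reduce"


def cookie_entropy(clicks):
    """Shannon entropy (bits) of the cookie distribution behind each IP."""
    counts = defaultdict(Counter)
    for ip, cookie in clicks:
        counts[ip][cookie] += 1
    entropy = {}
    for ip, counter in counts.items():
        total = sum(counter.values())
        entropy[ip] = -sum(
            (n / total) * math.log2(n / total) for n in counter.values()
        )
    return entropy


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def flag_anomalies(sizes, expected_size, alpha=0.001):
    """Return IPs whose size is improbably large under the expected rate."""
    return sorted(
        ip for ip, n in sizes.items() if poisson_tail(n, expected_size) < alpha
    )


# Hypothetical click log of (source IP, cookie ID) pairs.
clicks = [("10.0.0.1", "c1"), ("10.0.0.1", "c1"), ("10.0.0.1", "c2"),
          ("10.0.0.2", "c3")]
clicks += [("10.0.0.9", f"bot{i}") for i in range(40)]

sizes = ip_sizes(clicks)
print(sizes)
print(flag_anomalies(sizes, expected_size=3.0))
```

In practice the expected rate would come from historical traffic for comparable IPs rather than a single constant.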


Some people have also leveraged historical data to help create accurate IP size distributions. In this day and age, even a lack of historical data or constant cache cleaning can be used as an input to machine learning techniques to find hackers. However, these methods do depend on securing the click data to run the code that finds the source of the fraudulent clicks or botnet behavior.


The next-generation bots are likely to have more advanced artificial intelligence (AI), making them harder to detect. As a result, AI-based bot detection algorithms need to stay on the leading edge to keep a fair playing field and prevent harm to society.

#Bots #IPDistributionSize #CyberSecurity #BigData

The Future of Wireless Charging

Do you have a drawer at home full of cords? The challenge is that while wireless charging is in demand, some strides still need to be made in terms of functionality. According to IHS, the wireless power market is estimated to grow to one billion charging units by 2020. It’s not a new concept, and early adoption was seen with electric toothbrushes.

In 2017, Disney Research showcased the ability to charge your device while the receiver is across the room, similar to how WiFi works on a computer. Apple tried to enter the marketplace a while back, but because of their charging needs, the phones got too hot. Also, the length of time it took to charge the device was originally an issue.

The way charging typically works is that a charging dock has a transmitting coil that induces current in a receiver coil on your smart device, which in turn charges the battery. Early products required exact positioning of the device on the charger to work, but then ‘free positioning’ became a popular concept in charging. The key to the first ‘free positioning’ concepts was the ability to have a transmittable back surface on the device. Some smartphones have a surface made of glass, which helped, but the obvious drawback is when you drop your phone. The industry continues to struggle with how to charge devices at a distance.

It is easy to imagine a world where wireless charging is available everywhere from hotels to airports. Some have even imagined roads embedded with wireless charging to support electric cars.

Different solutions have hit the market, from chargeable phone cases to Disney’s vision of wireless charging hotspots. Companies like Logitech and Corsair are currently selling wireless charging technology that is transmitted via a mouse pad. Also, Apple recently re-entered the market by filing a patent that shows how devices can transmit wireless power to nearby smart devices without using a wireless charging mat, instead just wirelessly connecting to your home computer.

This year could be a hot patent space for companies trying to find their foothold in this future market.

#Wireless #Charging #Patents

Cloud Computing: Stochastic Model Architecture, Barriers and Application

According to Gartner, the market for cloud computing will expand to 623 billion USD by 2030. With this growth, there is increased demand from cloud providers to make the best use of their resources in terms of performance efficiency. Three fairly common approaches to performance analysis include experiment-based performance analysis, discrete event-simulation-based performance analysis, and stochastic model-based performance analysis. However, the stochastic model-based performance analysis is the preferred method due to the lower cost point, as well as timeliness. Sakr and Gaber developed a simplified three-pool cloud architecture-based stochastic model to address some of the common barriers encountered with scaling.

Stochastic Model & Three-Pool Cloud Architecture

Sakr and Gaber proposed a simplified model based on Markov’s stochastic model to support a scalable and tractable solution at a lower cost. Their model leverages a three-pool cloud architecture that has several interacting sub-models, including the resource provisioning decision engine, virtual machine sub-models, and pool sub-models. The three-pool cloud architecture includes the concepts of the hot pool sub-model, warm pool sub-model, and cold pool sub-model. Which pool is used depends on how the response time and power consumption are sorted. Each sub-model can be used to represent a machine in the pool. The hot pool sub-model addresses the group needing maximum power with low response times. The warm pool sub-model addresses machines in sleep mode that are waiting for the next run. Finally, the cold pool sub-model has machines that are in the off state and have minimum power needs and longer response times. Figure 1 shows Sakr and Gaber’s model of what happens when there is a service request and how the resource provisioning decision engine tries to leverage the various pool sub-models.

Figure 1. Provisioning steps. Reprinted from Large Scale and Big Data: Processing and Management (p. 559), by S. Sakr and M. Gaber, 2014. Auerbach Publications.
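The provisioning flow can be sketched roughly as follows. This is a toy illustration of a decision engine that tries the hot, warm, and cold pools in turn. It is not the stochastic model itself, and the class names, the startup delays, the buffer size, and the machine counts are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    free_machines: int
    startup_delay: float  # hypothetical seconds before a machine can serve


class ProvisioningEngine:
    """Toy resource provisioning decision engine with a bounded buffer."""

    def __init__(self, pools, buffer_capacity):
        self.pools = pools  # ordered: hot, then warm, then cold
        self.buffer = deque()
        self.buffer_capacity = buffer_capacity

    def submit(self, request_id):
        """Accept a request into the buffer, or reject it if the buffer is full."""
        if len(self.buffer) >= self.buffer_capacity:
            return False
        self.buffer.append(request_id)
        return True

    def provision_next(self):
        """Serve the oldest buffered request from the first pool with capacity."""
        if not self.buffer:
            return None
        request_id = self.buffer.popleft()
        for pool in self.pools:
            if pool.free_machines > 0:
                pool.free_machines -= 1
                return (request_id, pool.name, pool.startup_delay)
        return (request_id, "rejected", None)  # no pool had a free machine


engine = ProvisioningEngine(
    pools=[Pool("hot", 1, 0.0), Pool("warm", 1, 5.0), Pool("cold", 0, 60.0)],
    buffer_capacity=3,
)
accepted = [engine.submit(f"r{i}") for i in range(1, 5)]
results = [engine.provision_next() for _ in range(3)]
print(accepted)
print(results)
```

The sketch shows both rejection paths: the fourth request is turned away because the buffer is full, and the third is rejected because no pool has a free machine.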

Resolving Barriers

The biggest challenge to the model is the potential for service request rejection. When the buffer is at capacity or the resource provisioning decision engine cannot find an available machine due to capacity issues, request rejection is possible. However, some understanding of the probability of request rejection can be helpful. For example, the longer the mean service time, the higher the likelihood of service request rejection. Therefore, if the capacity of the machines is increased, the potential for service rejection can decrease.
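The effect of service time and capacity on rejection can be illustrated with a far simpler stand-in than the three-pool model: the classic Erlang B formula for a loss system with a fixed number of machines and no buffer. Using it here is an assumption made purely for illustration, and the arrival rate and service times below are hypothetical.

```python
def erlang_b(servers: int, offered_load: float) -> float:
    """Blocking probability of an M/M/c/c loss system (Erlang B).

    offered_load = arrival rate * mean service time.
    Uses the standard recursion B(n) = a*B(n-1) / (n + a*B(n-1)), B(0) = 1.
    """
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


arrival_rate = 2.0  # hypothetical requests per minute
for mean_service_time in (1.0, 2.0, 4.0):  # hypothetical minutes
    load = arrival_rate * mean_service_time
    print(mean_service_time, round(erlang_b(5, load), 4))

# Adding machines at a fixed load lowers the rejection probability.
print([round(erlang_b(c, 4.0), 4) for c in (4, 6, 8)])
```

Longer mean service times raise the offered load and therefore the share of rejected requests, while extra capacity brings it back down.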

An additional challenge for the three-pool cloud architecture model is that, for it to work effectively, both service requests and machines have to be of the same type. However, virtual provisioning sub-models can be used with different machines where the machines are grouped into classes, and each class is then represented by a pool so that each individual pool is homogeneous.

Development of Applications

The advantage of the Stochastic Model and pool architecture is that the pools can be used strategically to identify performance bottlenecks through what-if analysis and capacity planning. For example, the Symbolic Hierarchical Automated Reliability and Performance Evaluator (SHARPE) is a modeling application that can look at performance, reliability, and availability. This application has been installed at over 450 sites and lets users choose different what-if analyses with alternative algorithms depending on their objectives. Also, three-pool architecture can be useful for data recovery efforts in the sense that mission-critical information can be kept in the hot pool for faster recovery, whereas less critical data can be stored in the other pools.

Demand for cloud services is growing exponentially, putting increased pressure on finding both timely and cost-effective solutions to address performance issues. The Stochastic Model is desirable because of its low cost and timeliness compared to other models. The Stochastic Model leverages a three-pool cloud architecture with pool sub-models to process service requests. While there are some challenges related to rejected service requests and heterogeneous machines, the model can be adapted to handle these challenges. Several applications have demonstrated the promise of the pool architecture in strategically managing performance bottlenecks.

#Cloud #ThreePoolCloudArchitecture #BigData

Seeing the Big Picture: MapReduce, Hadoop and the Cloud

Big data contains patterns that can inform companies about their customers and vendors, as well as help improve their business processes. Some of the biggest companies in the world, like Facebook, have used the MapReduce framework as a tool for their cloud computing applications, sometimes through Hadoop, an open-source implementation of MapReduce. MapReduce was designed by Google for parallel distributed computing of big data.

Before MapReduce, companies needed to pay data modelers and buy supercomputers to extract timely insights from big data. MapReduce has been an important development in helping businesses solve complex problems across big data sets, like determining the optimal price for products, understanding the return on investment of advertising, performing long-term predictions and mining web clicks to inform product and service development.

MapReduce works across a network of low-cost commodity machines, making actionable business insights more accessible than ever before. It is a strong computational tool for solving problems that involve things like pattern matching, social network analysis, log analysis and clustering.

The logic behind MapReduce is basically dividing big problems into small manageable tasks that are then distributed to hundreds or thousands of server nodes. The server nodes operate in parallel to generate results. From a programming standpoint, this involves writing a map script, where the data is mapped into a collection of key-value pairs, and writing a reduce script that runs over all pairs with the same key. One challenge is the time it takes to convert and break the data into the new key-value pairs, which increases latency.
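
The map and reduce steps described above can be sketched in plain Python with the familiar word-count example. This runs on a single machine and only mimics the programming model; it is not Hadoop's or Google's API, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) key-value pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values that share the same key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster, the map calls would run on many nodes at once and the shuffle would move data across the network, which is where much of the latency mentioned above comes from.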

Hadoop is Apache’s open-source implementation of the MapReduce framework. In addition to the MapReduce distributed processing layer, Hadoop uses HDFS for reliable storage and YARN for resource management, and it has flexibility in dealing with structured and unstructured data. New nodes can be added easily to Hadoop without downtime, and if a machine goes down, data can be easily retrieved. Hadoop can be a cost-efficient solution for big data processing, allowing terabytes of data to be analyzed within minutes.

However, cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform offer similar MapReduce components where the operational complexity is handled by the cloud vendor instead of the individual business. Hadoop was known for its strong combination of computation with storage, but cloud-based object stores from vendors like AWS can now take the place of HDFS while still allowing computation, and container orchestration technology like Kubernetes can be used instead of YARN. With this shift to cloud vendors, there have been increased concerns around the long-term vision for Hadoop.

Hortonworks was a data software company that supported open-source software, primarily Hadoop. But in January 2019, Hortonworks closed an all-stock $5.2 billion merger with Cloudera. While Cloudera also supports open-source Hadoop, it has a proprietary management suite that is supposed to help with both installation and deployment, whereas Hortonworks was 100% open-source. In May 2019, another Hadoop provider, MapR, announced they were looking for a new source of funding. On June 6, 2019, Cloudera’s stock declined 43% and the CEO left the company.

Understanding the advantages and disadvantages of the MapReduce framework and Hadoop in big data analytics is helpful for making informed business decisions as this field continues to evolve. In terms of the drawbacks of Hadoop, Monte Zweben, the CEO of Splice Machine, which creates relational databases for Hadoop, says, “When we need to transport ourselves to another location and need a vehicle, we go and buy a car. We don’t buy a suspension system, a fuel injector, and a bunch of axles and put the whole thing together, so to speak. We don’t go get the bill of materials.”

What do you think? Please DM me or leave your feedback in the comments below.

#Hadoop #MapReduce #CloudComputing

Consequences of Multiplying the Internet of Things

Internet of Things (IoT) devices are multiplying as technology costs decrease and smart device sales increase. Generally speaking, if a device has an on and off switch, there is a good chance that it will become part of the IoT movement. IoT architecture includes the sensors on devices, the Internet, and the people that use the applications.

IoT devices are connected through Internet infrastructure and different wireless networks. Smart devices by themselves are not that good at dealing with massive amounts of data, let alone learning from the data received and generated. Currently the data from IoT devices is relatively basic because of the small computing power and limited capacity to store data on most devices. However, that basic data gets transferred to a data processing center that has more advanced computing capability to produce desired business insights.

IoT smart devices require unique addresses that allow them to connect to the Internet. The growing number of smart devices makes it challenging to supply these new addresses. Internet Protocol version 4 (IPv4) has the capacity for about 4.3 billion addresses. Gartner estimates that by 2020, the world will have over 26 billion connected devices. However, several thought leaders are working on a unified addressing scheme for IoT that may help solve this bottleneck.
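
The size of the addressing gap can be checked with a few lines of arithmetic. The sketch below only assumes that an IPv4 address is 32 bits long and uses the 26 billion device estimate quoted above; the variable names are illustrative.

```python
def address_space(bits):
    """Number of distinct addresses available with the given address length."""
    return 2 ** bits


IPV4_BITS = 32
ipv4_addresses = address_space(IPV4_BITS)  # about 4.3 billion
projected_devices = 26_000_000_000         # estimate quoted in the text

# Devices left without a unique IPv4 address if each device needs its own.
shortfall = projected_devices - ipv4_addresses

# Average number of devices that would have to share each IPv4 address.
devices_per_address = projected_devices / ipv4_addresses
```

Under these numbers, roughly six devices would compete for every IPv4 address, which is why a different addressing scheme is needed.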

IoT applications also have bottlenecks around the quality of current artificial intelligence algorithms. For example, the call for increased transparency and reduced bias in algorithms continues to pique the interest of citizens and could pose challenges to some proprietary business models. With machine learning, producing training sets that are actually representative of targeted populations also remains a challenge.

There are some additional obstacles related to the physical path of the transmission media. For example, IoT devices can receive or transmit data using a variety of technologies, from RFID to Bluetooth. The common problems associated with these kinds of transmission media, from bandwidth to interference, also create problems for IoT. Optimizing transmission media remains a challenge for supporting and sustaining networks in IoT applications.

Security is also an ongoing concern for IoT since the basic data feeds into a receiver on the Internet. Many IoT devices are low-powered, constrained devices, making them more susceptible to attack. Security challenges of IoT include ensuring that the data has not been changed during transmission and protecting data from unwanted exposure. The World Economic Forum estimates that if a single cloud provider was successfully attacked, it could cause $50 billion to $120 billion of economic damage. With the growth of poorly protected devices on a shared infrastructure, there is a wide attack surface for hackers, where IoT botnets could push swarms of traffic through a variety of connected devices like thermometers, sprinklers and other sensors. A recent State of IoT Security Research report shared that 96 percent of businesses and 90 percent of customers think there should be IoT security regulations. As public confidence in security decreases while IoT sales increase, this is likely to result in regulatory reform.
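
One common way to detect whether data was changed during transmission is a keyed message authentication code. The sketch below uses HMAC-SHA256 from the Python standard library. It assumes the device and the receiver already share a secret key, and the function names are hypothetical. It addresses integrity only; protecting data from exposure would require encryption as well.

```python
import hashlib
import hmac


def sign(key: bytes, payload: bytes) -> str:
    """Compute an HMAC-SHA256 tag the device sends along with its reading."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(key: bytes, payload: bytes, tag: str) -> bool:
    """Recompute the tag on the receiver and compare in constant time."""
    return hmac.compare_digest(sign(key, payload), tag)
```

A receiver would call verify on each incoming reading and discard any payload whose tag does not match.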

IoT allows businesses to solve problems and even delight their customers by leveraging the intelligence of connected devices. While there is always uncertainty and risk involved with new technology, and customer confidence around IoT may be hit or miss, the promise of IoT is a fully connected world where devices connect together and with people to enable action that has never before been possible.

#IoT #Cybersecurity #BigData

Web Evolution and Eliminating Performance Bottlenecks

If the Internet is a bookstore, the World Wide Web is the collection of books within that store. The Web is a collection of information which can be accessed via the Internet. The Web was created in 1989 by Sir Tim Berners-Lee and remained quiet through the 1990s, but as users increased, companies like Google started to develop algorithms to better index content, which eventually led to the concept of SEO (a significant driver of the Internet today). Sir Tim Berners-Lee’s initial vision of the Web was explained in a document called “Information Management: A Proposal,” but today, with Facebook and social media, the focus has shifted and the Web has also become a communication tool.

Back in 1989, Sir Tim Berners-Lee wrote about three fundamental technologies that are still foundational to the Web today: HTML, URI, and HTTP. HTML refers to the markup language of the Web, URI is like the address or URL, and HTTP supports the retrieval of linked items across the Web. These core technologies used in Web 1.0 are responsible for today’s large-scale web data. Back in Web 1.0, bottlenecks included web pages that were only understandable by a human. Also, Web 1.0 was slow and pages needed to be refreshed often. In retrospect, it is easier to see that Web 1.0 had servers as a major bottleneck and lacked a sound systems design with networked elements. Nonetheless, Web 1.0 is referred to as the “web of content” and was critical to the development of Web 2.0.
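
To show how a URI and HTTP fit together, the sketch below splits a URI into its parts and builds the text of a simple HTTP GET request. It makes no network call. The HTTP/1.1 request format and the function name are assumptions chosen for illustration and are not taken from the original proposal.

```python
from urllib.parse import urlsplit


def build_get_request(uri):
    """Turn a URI into the text of a minimal HTTP/1.1 GET request."""
    parts = urlsplit(uri)
    path = parts.path or "/"  # an empty path means the site root
    if parts.query:
        path += "?" + parts.query
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
```

The server's response to such a request would typically carry an HTML document, which is the third of the three technologies.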

Web 2.0 began in 1999 and let people contribute, modify, and aggregate content using a variety of applications from blogs to wikis. This was revolutionary in the sense that the Web moved from being focused on content to being focused on the communication space, where content was created by individual users instead of just being produced for individual users. Web 2.0 embraced the reuse of collective information, crowdsourcing, and new methods for data aggregation. In terms of online architecture, Web 2.0 drove collaborative knowledge construction where networking became more critical to driving user interaction. At the same time, issues of open access and reuse of free data started to surface. Performance issues were encountered with frequent database access, which put a strain on Web 2.0’s scalability. The good news is that Web 1.0 bottlenecks on the database server side were eliminated by the ability to hold databases on RAM disks and by high-performance multi-core processors that supported enhanced multi-threading. Still, alongside the benefits of Web 2.0’s flexible web design, creative reuse, and collaborative content development, new bottlenecks were created by the increased volume of user-generated content.

Web 3.0 started around 2003 and was termed the “web of context.” Web 3.0 is the era of defined data structures and the linking of data to support knowledge searching and automation across a variety of applications. Web 3.0 is also still referred to as the “semantic” Web, which was revolutionary in the sense that it shifted the focus to having the Web read not only by people, but also by machines. In this spirit, different models of data representation surfaced, like the concept of nodes, which led to the scaling of web data. One of the challenges of the Web 3.0 data models was that the location and extraction processes turned into a bottleneck.
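
One way to picture machine-readable linked data is as a set of subject, predicate, object statements that connect nodes. The sketch below is a toy illustration in plain Python, not a real semantic web toolkit; the sample statements and the match function are hypothetical.

```python
# Each statement links two nodes with a named relationship.
TRIPLES = {
    ("TimBernersLee", "created", "Web"),
    ("Web", "uses", "HTML"),
    ("Web", "uses", "URI"),
    ("Web", "uses", "HTTP"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return all statements that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )
```

Note that match scans every statement for each query, which hints at why locating and extracting data can turn into a bottleneck as the number of statements grows.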

Web 4.0 began around 2012 and was named the “web of things.” Web 4.0 further evolved the concept of the Web into a symbiotic web that focused more on the intersection of machines and humans. At this point, Internet of Things devices, smart home devices and health monitoring devices started to contribute to big data. Mobile devices and wireless connections helped support data generation, and cloud computing took hold in helping users both create and control their data. However, bottlenecks were created by the multiple devices, gadgets and applications that were connected to Web 4.0, along with changing Internet of Things protocols and exponentially growing big data logs.

Web 5.0 is currently referred to as the “symbiont web” or the web of thoughts. It was designed in a decentralized manner where devices could start to find other interconnected devices. Web 5.0 creates personal servers for personal data stored on a smart device like a phone, tablet, or robot. This enables the smart device to scan the 3D virtual environment and use artificial intelligence to better support the user. The bottleneck in Web 5.0 becomes the memory and calculation power each interconnected smart device needs to process the billions of data points required for artificial intelligence. Web 5.0 is recognized for emotional integration between humans and computers. However, the algorithms involved in understanding and predicting people’s behavior have also created a bottleneck for Web 5.0.

Where will Web evolution end? One thing is for sure: data generation is increasing year after year. To continue to get new functionality out of the evolving Web, new bottlenecks need to be addressed. There are a variety of future considerations related to anticipated bottlenecks, from encoding strategies to improving querying performance. However, the best way to predict what will happen in the future is to invent it.

#WebEvolution #WebPerformance #OnlineArchitecture #Innovation

About the Author

Shannon Block is an entrepreneur, mother and proud member of the global community. Her educational background includes a B.S. in Physics and a B.S. in Applied Mathematics from George Washington University and an M.S. in Physics from Tufts University, and she is currently completing her Doctorate in Computer Science. She has been the CEO of both for-profit and non-profit organizations. Follow her on Twitter @ShannonBlock or connect with her on LinkedIn.