Streaming Data Solutions: Flink versus Spark

While real-time stream processing has been around for a while, businesses are now trying to quickly process larger volumes of streaming data. Streaming data is everywhere from Twitter, sensors, stock ticker prices, and weather. Streaming data comes in continuously, which poses challenges in processing streaming data.

No alt text provided for this image

Flink was initially written in Java and Scala and exposes many Application Programming Interfaces (APIs), including the DataStream API. Flink was developed by a German University and became an incubator project for Apache in 2014.

No alt text provided for this image

Similar, but different, Spark Streaming is one of the most used libraries in Apache Spark. Spark developers create streaming applications using DataFrames or Dataset API’s, which are available in programming languages like Java, Python, and R. The product is essentially an extension of the core Spark API.

No alt text provided for this image


Both Flink and Spark are big data systems that are fault-tolerant and built to scale data. While both Flink and Spark are in-memory databases and have ability to write data to permanent storage, the goal is to keep it in memory for current usage. Both products enable programmers to use MapReduce functions and apply machine learning algorithms with streaming data. That is, both Flink and Spark are good with machine learning in processing large training and testing datasets across a distributed architecture. Also, both technologies can work with Kafka (LinkedIn’s streaming product), as well as Storm topologies.


Flink was made to be a streaming product, whereas Spark added the steaming product onto an existing service line. Spark was initially built on static data, but Flink can process batch operations by stopping the streaming. With Spark, the stream data was initially divided into micro-batches that repeat in a continuous loop. This means that with the batch program, the file needs to be opened, processed, and then closed. However, in 2018, with Spark 2.3, Spark was able to start to move away from the previous “micro-batch” approach. In contrast, Flink has, for some time, been breaking streaming data into finite sets at a checkpoint, which can be an advantage in terms of speed in running algorithms.

Flink’s Performance

No alt text provided for this image

Flink can be customized to have optimal performance. Specifically, code logic changes and configuration are relevant to performance. For example, event time or processing time can be considered as it relates to performance effectiveness.

Flink breaks time into “processing time” generated at each machine in a cluster and “time making” at the entry point machine in a cluster. The time generated at the entry point machine in a cluster is also known as the “ingestion time” since it is generated at the time of an event. Several scholars recommend using event time because the event time is constant, which means operations can generate deterministic results regardless of throughput. On the other hand, the processing time is the time observed by the machine. Using this lens, the operations based on processing time are not deterministic. In practice, while events are thought of as real-time, there is the assumption that the clocks at event sources are synchronized, which is rarely the case. As such, this challenges the assumption that the event time is monotonically increasing, which means the allowed lateness solves the dropped events problem, but the large lateness value can still have a significant effect on performance. Without setting the lateness, events can then be dropped due to incorrect timestamps.

Regardless of what approach is chosen, the key for efficient processing time is making sure the logic can handle events in the same event time window being split into smaller processing time windows. Researchers have also shown some performance efficiencies can be achieved by not breaking up complex events, but the tradeoff is the operators have to go through the dimensions in each event, and the event object is larger.

Spark’s Performance

In terms of Spark, identified bottlenecks include the network and disk I/O. CPI can also be a bottleneck but is not as common. Resolving the CPU is estimated to improve the completion of job time by 1-2%. Some of the challenges in managing Spark performance include that tasks can create bottlenecks on a variety of resources and different times. Also, concurrent tasks on a machine may compete for resources. Additionally, memory conditions can be a common issue since Spark’s traditional architecture is memory-centric. The causes of these performance setbacks often involve high concurrency, inefficient queries, and incorrect configurations. These issues can be mitigated with an understanding of both Spark and the data, realizing that Spark’s default configuration may not be the best to optimize performance. 

Final Thoughts

The importance of solutions like Flink and Spark is about allowing businesses to make important decisions based on what is currently happening. No one framework solves all the problems, so it becomes a situation of the best fit. Understanding the system and resources can help in addressing performance bottlenecks. There are many stream processing applications, and it is essential to pick a framework that best meets the business’ needs, as not all products are the same. Flink and Spark are two of the popular open stream processing frameworks. Depending on the application, parameters need to be set correctly to meet performance goals. It is essential to understand the tradeoffs involved to get the best performance relative to business needs.

#Spark #Flink #Performance #StreamingData #BigData

What Should Be Keeping You Up At Night: Where is Big Data Stored?

The digital universe is expected to double in size every two years with machine-generated data experiencing a 50x faster growth rate than traditional business data. Big data has a lifecycle which includes:

  • Raw data
  • Collection
  • Filtering and classification
  • Data analysis
  • Storing
  • Sharing & publishing
  • Security
  • Retrieval, reuse, & discovery

However, viewing security as an isolated stage in the lifecycle can be misleading since the storing, sharing, publishing, retrieval, reuse, and discovery are all involved with security.

96% of organizations are estimated to use cloud computing in one way or another. Cloud computing is a distributed architecture model that can centralize several remote resources on a scalable platform. Since cloud computing offers data storage in mass, it is critical to think about security as it relates to storage. With storage, the primary security risks are caused by both the location to store the data and volume of the data. 

Even if the data is stored in the cloud, it can be challenging to understand if those cloud vendors are storing all the data. Companies must ask not just about costs when selecting cloud vendors but where their data is stored — understanding where data is stored is fundamental to several other security and privacy-related issues. Reasons to understand where data is stored could be as simple as mitigating risks caused by geographic weather concerns. For example, if a hurricane hits Florida in a place where your data is stored, do you know if it has been backed up to a safe location? Also, how is the data protected in the data center from not just weather events, but intruders and cybercrime? Compliance regulations like General Data Protection Regulation (GDPR) make the company responsible for the security its data, even if that data is outsourced to the cloud. Before a company can really answer questions like who has access to data and whom did the company send data to, understanding where the data is stored is critical. This situation becomes more relevant in the event of a breach, which is likely a discussion of when the breach occurs not if the breach will occur.

Data verification is also essential to ensure the data is accurate. In terms of verifying the actual data stored in the cloud, it is not as easy as just downloading the entire data set to see if it has been stored with integrity in the cloud because of cost and local bandwidth. There have been some query authentication methods that have addressed issues of correctness, completeness, and freshness over the years. Basically, a set of data values can be authenticated by a binary tree, verification is done on the data values based on the hash value of the root of the tree, and authenticity is done by the customer in iteratively computing all the hashes up the tree and checking if the hash has been computed for the root in a way that matches the authentically published value. Creating automated processes has helped with data verification efforts, but the algorithms continue to evolve to support faster and larger-scale verification for different version data.

Security enforcement has been increasing with new global regulations. There has been enforcement for companies in this space, including in July 2019, when Marriott was fined EUR 100 million for failure to implement appropriate information security protocols resulting in a breach of 339 million customer records. Also, in the same month this year, British Airways was fined EUR 183 million for failure to implement appropriate information security protocols that resulted in a breach of 500,000 customer records. While some of these fines may be a drop in the bucket for larger companies, smaller companies may be just taking a gamble on not investing in the needed systems because they do not think their organizations are high profile enough to be enforced. However, as data fines continue to increase – now is the time to re-evaluate cyber defense positioning. Regardless of company size, all organizations can at least start a regular dialogue about understanding where their data is stored. 

#Cyber #Security #BigData #DataStorage #Cloud

Machine Learning and Extracting Knowledge from Big Data

The Resource Description Framework is essentially an application of Extensible Markup Language (XML) that helps describe Internet resources like a website and its content. RDF descriptions are called metadata since they are typically data about data like the particular site map or date of page updating. RDF is based on the idea of a model that is developed between statements and web resources. It is essential because the framework makes it easier for developers that build a product using that metadata.

A study by Casteleiro et al. (2016) explored the ability to disturbed work functions from machine learning algorithms with the terms of Cardiovascular Disease Ontology. This study was critical because it demonstrated the benefits of using terms from ontology classes to obtain other term variants. The study opened up the research of the feasibility of different methods that can scale with big data and enable automation of machine learning analysis.

 Sajjad, Bajwa, and Kazmi’s (2019) research was already looking at rule engines and producing rules in the era of big data. They proposed a method to work with the semantic complexity in the rules and then do an automated generation of the RDF model of rules to help in analyzing big data.  Specifically, they used a machine learning technique to classify the Semantic of Business Vocabularies and Rules (SBVR) rule and map it to the RDF model. A challenge for the research included the automatic parsing of the rules as well as the semantic interpretation. Also, mapping the vocabulary to the RDF syntax to verify the RDF schema proven successful, but challenging. However, their work did show that it was possible to have consistency in checking a set of big data rules through automated tools. However, these scholars also found a need for a method to semantically analyze rules to help with the testing and validating as it relates to rule changes. Their particular system makes an ontology model that can be useful in the interpretation of a set of rules. This research supports both the semantic understanding of rules, but also generates the RFP model of rules that provides support for querying.

#MachineLearning #Knowledge #BigData #RDF #XML


Casteleiro, M. A., Demetriou, G., Read, W. J., Prieto, M. J. F., Maseda-Fernandez, D., Nenadic, G., … & Stevens, R. (2016). Deep Learning meets Semantic Web: A feasibility study with the Cardiovascular Disease Ontology and PubMed citations. In ODLS (pp. 1-6).

Sajjad, R., Bajwa, I. S., & Kazmi, R. (2019). Handling Semantic Complexity of Big Data using Machine Learning and RDF Ontology Model. Symmetry11(3), 309.

Detecting Bots with IP Size Distribution Analysis

Kylie Jenner reportedly makes $1 million per paid Instagram post, and Selena Gomez is a close second with over $800K per sponsored post. Just this year, location-based marketing is predicted to grow to $24.4 billion in ad spending. Nearly half of advertisers plan on using influencer marketing this year as real click rates can translate into purchased products and services.

No alt text provided for this image

As such, this market is ripe for cyber-attacks. However, one way to detect these hackers is to look at the IP size distribution or the number of users that are sharing the same source IP. IP size distributions are created from 1) actual users 2) sponsored providers that provide fraudulent clicks and 3) bot-masters with botnets. The good news is that most machine-generated attacks share an anomalous deviation from the expected IP size distribution. 

However, bots are changing every day as they become more similar to human usage. Gen 1 bots surfaced from in-house scripts but can usually be detected by the absence of cookies. Gen 2 bots are scrappy and can typically be found by the absence of JavaScript firing. Gen 3 bots look like browsers (as compared to Gen 1and 2 bots), but can still be detected using challenge tests and fingerprinting. However, Gen 4 bots look more like human usage with their non-linear mouse movements.

No alt text provided for this image

Security frameworks, supported by machine learning techniques, have been implemented to automatically detect and group deviations. Most detection methods for these Gen 4 bots can be detected with behavioral analysis. Frameworks aggregate statistics around network traffic for investigation recommendations. For example, anomaly detection algorithms can be written to find unusual patterns that do not fit with expected behavior. Code can be written to run MapReduce in parallel processing, assigning a distinct cookie ID for each created click. Then a regression model can be used to compare the IP rates using Poisson distribution with a diverse explanatory model to count the unique cookies and measure the entropy relative to the distribution so that the accurate IP size can be determined. This data can also be analyzed using linear regression and percentage regression techniques to help identify the true IP size.

No alt text provided for this image

Some people have also leveraged historical data in helping create accurate IP size distributions. In this day and age, even a lack of historical data or constant cache cleaning can be used as an input to machine learning techniques to find hackers. However, these methods do depending on securing the click data to run the code to find the source of the fraudulent clicks or bonet behavior.

No alt text provided for this image

The next-generation bots are likely to have more advanced artificial intelligence (AI) making them harder to detect. As a result, AI-based bot detections algorithms need to stay on the leading edge to keep a fair playing field and prevent harm to society.

#Bots #IPDistributionSize #CyberSecurity #BigData