While real-time stream processing has been around for a while, businesses are now trying to quickly process larger volumes of streaming data. Streaming data is everywhere from Twitter, sensors, stock ticker prices, and weather. Streaming data comes in continuously, which poses challenges in processing streaming data.
Flink was initially written in Java and Scala and exposes many Application Programming Interfaces (APIs), including the DataStream API. Flink was developed by a German University and became an incubator project for Apache in 2014.
Similar, but different, Spark Streaming is one of the most used libraries in Apache Spark. Spark developers create streaming applications using DataFrames or Dataset API’s, which are available in programming languages like Java, Python, and R. The product is essentially an extension of the core Spark API.
Both Flink and Spark are big data systems that are fault-tolerant and built to scale data. While both Flink and Spark are in-memory databases and have ability to write data to permanent storage, the goal is to keep it in memory for current usage. Both products enable programmers to use MapReduce functions and apply machine learning algorithms with streaming data. That is, both Flink and Spark are good with machine learning in processing large training and testing datasets across a distributed architecture. Also, both technologies can work with Kafka (LinkedIn’s streaming product), as well as Storm topologies.
Flink was made to be a streaming product, whereas Spark added the steaming product onto an existing service line. Spark was initially built on static data, but Flink can process batch operations by stopping the streaming. With Spark, the stream data was initially divided into micro-batches that repeat in a continuous loop. This means that with the batch program, the file needs to be opened, processed, and then closed. However, in 2018, with Spark 2.3, Spark was able to start to move away from the previous “micro-batch” approach. In contrast, Flink has, for some time, been breaking streaming data into finite sets at a checkpoint, which can be an advantage in terms of speed in running algorithms.
Flink can be customized to have optimal performance. Specifically, code logic changes and configuration are relevant to performance. For example, event time or processing time can be considered as it relates to performance effectiveness.
Flink breaks time into “processing time” generated at each machine in a cluster and “time making” at the entry point machine in a cluster. The time generated at the entry point machine in a cluster is also known as the “ingestion time” since it is generated at the time of an event. Several scholars recommend using event time because the event time is constant, which means operations can generate deterministic results regardless of throughput. On the other hand, the processing time is the time observed by the machine. Using this lens, the operations based on processing time are not deterministic. In practice, while events are thought of as real-time, there is the assumption that the clocks at event sources are synchronized, which is rarely the case. As such, this challenges the assumption that the event time is monotonically increasing, which means the allowed lateness solves the dropped events problem, but the large lateness value can still have a significant effect on performance. Without setting the lateness, events can then be dropped due to incorrect timestamps.
Regardless of what approach is chosen, the key for efficient processing time is making sure the logic can handle events in the same event time window being split into smaller processing time windows. Researchers have also shown some performance efficiencies can be achieved by not breaking up complex events, but the tradeoff is the operators have to go through the dimensions in each event, and the event object is larger.
In terms of Spark, identified bottlenecks include the network and disk I/O. CPI can also be a bottleneck but is not as common. Resolving the CPU is estimated to improve the completion of job time by 1-2%. Some of the challenges in managing Spark performance include that tasks can create bottlenecks on a variety of resources and different times. Also, concurrent tasks on a machine may compete for resources. Additionally, memory conditions can be a common issue since Spark’s traditional architecture is memory-centric. The causes of these performance setbacks often involve high concurrency, inefficient queries, and incorrect configurations. These issues can be mitigated with an understanding of both Spark and the data, realizing that Spark’s default configuration may not be the best to optimize performance.
The importance of solutions like Flink and Spark is about allowing businesses to make important decisions based on what is currently happening. No one framework solves all the problems, so it becomes a situation of the best fit. Understanding the system and resources can help in addressing performance bottlenecks. There are many stream processing applications, and it is essential to pick a framework that best meets the business’ needs, as not all products are the same. Flink and Spark are two of the popular open stream processing frameworks. Depending on the application, parameters need to be set correctly to meet performance goals. It is essential to understand the tradeoffs involved to get the best performance relative to business needs.
#Spark #Flink #Performance #StreamingData #BigData