Holaaaa!!! How are you, friends? We love engines, don’t we? Engines have simplified our day to day life, travel, work, etc. But why are we talking about engines while the topic is Apache Spark? Because Apache Spark is also a type of engine! It is a big-data processing engine that is equipped with machine learning. The need for big-data processing engines is quite clear. The data is increasing every second and we need to process it. High-Speed Internet connectivity for everyone is good, but I do believe that it has led to “data population”. While I don’t want to sound “traditional”, I still believe that there is a lot of unnecessary data into existence. In short, the present and future are loaded up with plenty of data. This means we will need more powerful data-processing engines in the future.
Coming back to Apache Spark, it is widely used. Why? The reason is that it is 100 times faster than Apache Hadoop. It is built around the concept of ease. The outcome is that Apache Spark is quick, simple, and versatile. It was started by Matei Zaharia at the UC-Berkeley’s AMPLab in 2009. By witnessing its potential, it was contributed to Apache in 2013. Apache Spark has the capacity to leverage loads of memory, optimize the code across entire pipelines, and reuse JVMs across tasks.
This is a quick and short description of Apache Spark. If you remember, we have already discussed Apache Spark in another article titled “How are Big Companies using Apache Spark”. In that article, we have also listed down big companies that use Apache Spark. So now it is time to discuss two such IT Giants in detail. Let us begin.
Data processing engines are smart. Their tasks aren’t just to find the data, they are also used for analytics. Facebook has a huge number of users and the data is updated almost every second. For data-driven decision making, Facebook uses analytics. As discussed, the user and product growth have grown tremendously. In fact, it is growing while you are reading this article! To handle this, the analytics engines operate on the datasets in tens of terabytes, even for a single query.
There is a platform known as Hive platform. It executes some of Facebook’s batch analytics. There is one more platform known as Corona. It is known for the custom MapReduce implementation. Other than these, Facebook uses its Presto footprint for ANSI-SQL based queries. Graph processing and machine learning are also used by Facebook.
To change from the old hive based infrastructure to the new Apache Spark wasn’t easy. Facebook uses real-time ranking in numerous ways. For example, raw feature values are generated offline with Hive and the data is loaded into the real-time affinity query system. The old Hive infrastructure was computationally “resource intensive”. So maintaining it was a challenge. The pipeline was divided into smaller Hives. As a first step, Facebook took fresh feature data and tried to improve its manageability. They selected one of the existing pipelines and migrated it to Spark.
Facebook is huge and we know it! Debugging at such a large scale is not easy. It eats up a lot of resources and is time-consuming. So Facebook started with a small sample of 50 GB compressed input. Yes, this is small for Facebook. Slowly, they scaled up to 300 GB, then 1 TB, and then 20 TB. With each step, they resolved the performance and stability issue. They faced the major challenge at 20 TB. Why? At 20 TB, they generated too many output files, each of them around 100 MB. Three out of 10 hours of job runtime were spent moving files from the staging directory to the final directory in HDFS.
There were only two options to tackle this situation. First, improve the batch renaming in HDFS in such a way that it supports the use case. Second, configure Spark to generate fewer output files. But there was a third alternative. They removed the two temporary tables that were used to store the pipeline’s intermediate output. The three Hive stages were combined into a single Spark job. It reads 60 TB of compressed data and performs a 90 TB shuffle/sort.
Running a single Spark job is not easy for a large pipeline. There were many improvements and optimizations to the core Spark infrastructure. Let us have a look at the major improvements that led to the deployment of one of the entity ranking pipelines.
For executing long-running jobs, the system should be fault-tolerant. What does fault tolerant mean? A lot of people misunderstand it. Being fault-tolerant doesn’t mean that there should be no faults. We can’t stop faults, but we can tolerate it or rather tackle it and recover from it. Spark tolerates machine reboots, but there were still other bugs that had to be addressed. For example, common failures are pretty common, right? So what were those issues and how were they handled?
Other than the above, there were some other reliability fixes like the unresponsive driver, excessive driver speculation, Timsort issue, etc.
After reliability improvements, the next thing was performance improvements. For this, Facebook shifted the performance related projects to Spark. The performance bottlenecks were solved with the help of Spark metrics. Tools like Spark UI Metrics, Jstack, and Spark Linux Perf/Flame Graph support were also used.
Soon Facebook was pleased to report that they built and deployed a faster manageable pipeline for the entity ranking systems. Now, Facebook started working with Spark for other projects.
eBay, the e-commerce giant creates a huge amount of data. With a lot of existing and new members, eBay is just growing. So what is the most important asset for eBay? Is it the seller or buyer? No, the most important asset is the data. The company doesn’t have an inventory like its competitors. They simply connect buyers and sellers. So data becomes the most important asset.
At eBay, different teams make use of the transactional and behavioral data. Some may be trying to display interesting items, some trying to help sellers understand buyers, etc, the task could be anything. But all of them use the data. I am not interested in sharing the figures of eBay right now, because anyone can find it on Google. We are going to discuss how eBay has used Spark.
eBay’s data platform has three main components. First is the data repository. Second are the data streams and the third is data services. So let us begin with the data repositories. eBay uses data repositories like Hadoop, Hive, and HBase. These repositories are supported with hardware from Teradata. It stores the data that has been created from daily transactions. So from shopping to shipping, everything is stored.
These were actually known as data warehouses. Yes, the concept of data warehouses is still there, but when the data is huge, it can’t be stored in a warehouse. So now, they are known as data lakes. Why? The data is unlimited, uneven, and not precisely predictable. One can predict that on Christmas, the transactions will go up. But how much? This is a mystery and no one can give an exact count. The data is changing and the structure is also changing.
Apache Hadoop is an important aspect for implementing data lakes. The data streams for the data lake are also important. Different teams like product teams and analysts want to see data, the way it is, without any alterations. For this, eBay has built connectors to Hadoop which process the streaming data with Storm and Spark clusters.
eBay has deployed more than 400 Kafka brokers. But it is believed that LinkedIn has the biggest number of Kafka deployment. In eBay, the product team requests the highest number of available data streams. But this is quite obvious, the company is based on the products! How about the data services? How are they handled? eBay handles the data services in their own unique way. They have created their own distributed analytics engine. It includes an SQL interface and multi-dimensional analysis on Hadoop. This was made open-source and is known as the Apache Kylin project. Actually, SQL interface and multi-dimensional analysis on Hadoop made it open source.
By now, eBay had realized that they have a commodity scale computation platform. They also realized that the MOLAP style cubes weren’t operational on a huge scale. One could never take a 100TB cube and keep scaling it. It was impossible to work in tune with the high data rate growth.
But now, eBay has all the necessary components. The raw data is in Hive, the processing capabilities are taken care by MapReduce or Spark. The cube storage is managed by HBase. So with little effort (as compared to the previous tasks), eBay was able to build MOLAP cubes. Within eBay, they have more than a dozen MOLAP cubes. The largest cubes can be approx.100TB consisting of 10 billion rows of data.
Kylin cubes have proved to be the savior for eBay. Earlier, the data used to refresh after three hours. But with Kylin cubes, the data refresh is every minutes or even seconds! So eBay certainly enjoys Kylin cubes and is quite valued in the company.
After all this, the final thing is to create Notebook views about the complex data being processed. So why is it important? It helps analysts to collaborate, co-operate, and quickly make decisions. The notebook is the most important thing for analysts. Queries are also included in it. So analysts who are good at writing queries include it in the Notebook. Others can use the query as per their requirement to achieve desired results.
Facebook and eBay need performant and scalable analytics for product development. This is only possible with Apache Spark as it has the unique ability to unify numerous analytics use cases into a single API. In this article, notice that both the companies literally challenged Spark. These companies are the leading IT companies and their tests cases aren’t easy! From basic level processing to real-time use cases, Spark successfully helped achieve precise results. All this, in less time and with more efficiency.
Spark has now moved beyond just the general processing framework. The general classes of applications are moving to Spark and this includes compute-intensive applications as well. The ones that need input from data streams like sensors and social data. The compute intensive applications have benefited a lot from the in-memory processing. The applications are intelligent and they provide advanced analytics to engage the end users such as healthcare providers.
Spark workloads have been deployed by numerous customer teams into production. It is now well-known for performance, maintainability, flexibility, and precision. Facebook and eBay are just two of them, many small companies have adopted Spark. Why wait then, enter the world of Spark and reach the next level.
Begin your journey with this Free YouTube Spark video tutorial: