The future of Apache Spark

At Strata + Hadoop World in London, the future of Apache Spark was a major topic of attention. The young platform, born originally with the humble objective “to spark an ecosystem in Mesos”, has become one of the most actively contributed-to projects in the Apache Software Foundation. This year it attracted hundreds of attendees interested in gaining a deeper understanding of the technology: how to develop applications on Spark, what the interesting use cases are, how to integrate it into an existing ecosystem, and how to run it in production.

The questions were numerous across all the panels, trainings and discussions. Spark has existed since 2010, but it only started gaining significant traction in 2013. Its relative novelty was clear: many of the attendees either had no relevant experience or had only toyed with it. In addition, for many it is not easy to understand where Spark fits in the big data landscape. Common questions concerned how Spark relates to other popular technologies such as Hadoop, Kafka, Cassandra and Elasticsearch.

One of the big advantages of Apache Spark is that it acts as a common platform and raises the level of abstraction. Until 2011, a big data project typically required stitching together four or five different technologies from the Hadoop ecosystem and writing hundreds of lines of low-level code. Additionally, some of these technologies felt heavy. This has changed with Spark and its surrounding ecosystem. Whereas Hadoop alone has more than 200 thousand lines of code, Spark is a much smaller project. Partly this can be explained by the languages involved: Hadoop was written in Java, a verbose language, while Spark is written in Scala, a more concise one. Mostly, however, it is because Spark was developed almost ten years after the famous MapReduce paper from Google, with the benefit of hindsight.
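To make the conciseness claim concrete, here is a minimal word-count sketch using Spark's Scala API; the application name, master URL and input path are placeholders rather than anything taken from the talks.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Placeholder app name and local master; on a cluster this would point at YARN or Mesos.
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        val counts = sc.textFile(args(0))      // read lines from the input path given on the command line
          .flatMap(_.split("\\s+"))            // split each line into words
          .map(word => (word, 1))              // pair every word with a count of one
          .reduceByKey(_ + _)                  // sum the counts per word across the cluster

        counts.take(10).foreach(println)       // print a small sample of the result on the driver
        sc.stop()
      }
    }

The equivalent job in classic MapReduce requires a mapper class, a reducer class and driver boilerplate, which is precisely the overhead Spark's higher-level API removes.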

In addition, Spark's platform approach is acting as a big magnet. Developers can create applications in some of the most popular languages available: Java, Scala, Python, other JVM languages and soon R. They can also make use of modules built on top of Spark by Databricks covering important use cases such as streaming, machine learning, graph processing and SQL-like data retrieval. Here, Databricks has been careful to avoid the kind of fast, organic growth in its libraries that can lead to lower-quality code or, in the case of MLlib, to algorithms that are not properly implemented and tested. This gives developers and companies alike significant peace of mind; other projects targeting machine learning on big data, such as Mahout, suffer from a lack of stringent quality control. Often it is not clear whether the algorithms available in such a library are ready for production or were just a weekend project of some enthusiastic contributor. It is clear that Databricks is aiming for wide enterprise adoption, and they seem to be heading in the right direction.

However, the question remains what we can expect from Spark this year and next. Patrick Wendell, Andy Konwinski, Sean Owen and other key contributors at Databricks and Cloudera discussed this vividly across many panels and personal conversations. The main takeaways can be grouped around the following projects:

Spark Streaming: This is a very exciting time in stream processing. There are a lot of open source tools in the market, such as Storm, Samza and Kafka. Although each of them has its own place, Spark does not want to be a direct competitor. Spark Streaming is rather aimed at users who want to run everything under one platform: the same code written for batch jobs should also run smoothly in streaming mode (a minimal sketch follows below). The panel acknowledged that other tools such as Storm are still more mature, and that Spark Streaming is not exactly real streaming but rather micro-batching (an RDD is created for every batch interval, typically on the order of a second). However, they assured the audience that the project is maturing very rapidly and that micro-batching is good enough for most use cases. (See: https://spark.apache.org/streaming/)

DataFrames: Although the API is undergoing serious changes and will only stabilise with Spark 1.5, DataFrames is clearly one of the most important projects at Databricks. The idea is to offer the same flexibility and feature-richness that users of R and Python already enjoy. Common transformations should be very intuitive: for example, instead of chaining a series of lambda functions to group records, one can simply write df.groupBy("user"), as sketched below. In addition, the API has been heavily optimised, so it is much faster than working with raw RDDs regardless of the language being used. (See: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html)
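As a rough illustration of the micro-batching model described above, the following sketch counts words arriving on a socket, producing one RDD per one-second batch. It assumes an existing SparkContext sc; the host and port are placeholders.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // One micro-batch (i.e. one RDD) is produced per batch interval, here one second.
    val ssc = new StreamingContext(sc, Seconds(1))

    val lines  = ssc.socketTextStream("localhost", 9999)   // placeholder source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    counts.print()          // print the counts computed for each micro-batch
    ssc.start()
    ssc.awaitTermination()

And a sketch of the DataFrame example mentioned above, contrasted with the equivalent RDD code. It assumes Spark 1.4 or later, an existing SQLContext, an RDD of case-class records called events, and a JSON file with a user field; all of these are placeholders.

    // RDD style: express the aggregation with explicit lambdas.
    val userCountsRdd = events.map(e => (e.user, 1L)).reduceByKey(_ + _)

    // DataFrame style: declare the aggregation and let the optimiser plan it.
    val df = sqlContext.read.json("events.json")
    val userCountsDf = df.groupBy("user").count()
    userCountsDf.show()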

Spark SQL: It is being positioned as a replacement for Hive and is one of the key features behind Spark. For this it leverages Spark DataFrames heavily and provides a higher level of abstraction that enables analysts and other less technical users to run queries on Spark. With SchemaRDDs (now DataFrames) it is possible to plug in different data sources and query them all with the same SQL-like queries. (See: https://spark.apache.org/sql/)
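For illustration, here is a minimal sketch of registering a DataFrame as a temporary table and querying it with plain SQL, assuming Spark 1.4 or later and an existing SparkContext sc; the Parquet path and column names are made up for the example.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val views = sqlContext.read.parquet("hdfs:///data/pageviews.parquet")  // placeholder path

    // Expose the DataFrame under a table name so it can be queried with plain SQL.
    views.registerTempTable("pageviews")

    val topPages = sqlContext.sql(
      "SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url ORDER BY hits DESC LIMIT 10")
    topPages.show()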

BlinkDB: Similar to Spark SQL, BlinkDB is an approximate query engine for running queries at scale. The idea is to answer queries extremely fast by trading accuracy for response time: it enables interactive queries over massive data by running them on data samples and presenting results annotated with meaningful error bars. Although it is currently not part of Spark and is still an experimental project, several of the features behind it are expected to make their way into Spark SQL. (See: http://blinkdb.org/)
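BlinkDB's own query syntax is not shown here; the following is only a conceptual sketch of the underlying idea, approximating an aggregate on a small sample of an RDD and attaching an error bar under the usual normal-approximation assumptions. It assumes an existing RDD[Double] called latencies.

    // Conceptual sketch, not the BlinkDB API: aggregate over a 1% sample and report an error bar.
    val sample = latencies.sample(withReplacement = false, fraction = 0.01, seed = 42L)

    val n      = sample.count()
    val mean   = sample.mean()
    val stderr = sample.stdev() / math.sqrt(n.toDouble)

    // Roughly a 95% confidence interval around the sampled mean.
    println(f"mean ≈ $mean%.2f ± ${1.96 * stderr}%.2f")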

MLlib: Mahout has clearly lost traction, and currently there is no single go-to library for machine learning on big data or on the JVM. MLlib tries to fill this void, especially for engineers who are not familiar with the theory behind machine learning but would like to include predictive features in their projects (a minimal sketch follows after the next item). However, the team is aware that the number of algorithms available is still limited, that there are some rough edges and that the documentation can be improved. (See: https://spark.apache.org/mllib/)

Mesos: Spark originally started as a support tool for Mesos. The project is very actively developed, and companies such as Twitter and Apple rely on it (as highlighted by Paco Nathan in the comments section). However, many of the original contributors are now at Databricks and focused on other tasks. For some projects Mesos might be ideal, but it is clear that YARN now has more momentum and is a good combination, especially when Spark is used together with Hadoop. (See: http://mesos.apache.org/)
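The MLlib sketch referenced above: training a k-means model on a file of space-separated numeric features. It assumes an existing SparkContext sc; the path and the choice of k are placeholders.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse each line of a (placeholder) feature file into a dense vector.
    val data = sc.textFile("hdfs:///data/features.txt")
      .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
      .cache()

    // Cluster the points into 5 groups, running at most 20 iterations.
    val model = KMeans.train(data, 5, 20)

    // Within-set sum of squared errors: a rough sanity check of cluster quality.
    println(s"WSSSE = ${model.computeCost(data)}")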

Tachyon: Some attendees wanted to know the relation between Tachyon and Spark. Spark is bundled with Tachyon, and both projects came originally from AMPLab, the laboratory in Berkeley where Spark was conceived. But Tachyon is a separate project and not directly related to Spark; moreover, Databricks does not work on it. External contributors are, however, working on its integration with Spark. (See: http://tachyon-project.org/)

Project Tungsten: This is a major optimisation project. It consists of three main initiatives: memory management and binary processing, cache-aware computation, and code generation. We can expect to see the first results already in Spark 1.4. (See: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html)

Language support: Spark applications can be written in Java, Scala and Python, and this year R will be added. It is also expected that PySpark will become more mature. Some companies are using PySpark in production, but many others still feel that the API is lacking. Due to differences in programming paradigms, one cannot expect all the APIs to have exactly the same functionality and features, and performance will vary across languages. The reality is that a basic knowledge of Scala is necessary to make full use of Spark and to optimise it.

Certification: Currently provided through O'Reilly, the Apache Spark certification is an important area of attention at Databricks. Most likely they will create more specialised certifications (e.g. with a focus on devops), and the existing one will be updated soon to reflect recent changes and new features. The current certification focuses on the inner details of Spark core and is a good way to demonstrate a solid understanding of the technology powering Spark. (See: http://www.oreilly.com/data/sparkcert.html)

In conclusion, although the Spark ecosystem is growing very fast and is now one of the biggest names in cluster computing, both speakers and attendees agreed that Hadoop and its ecosystem are not going anywhere. Both technologies are production-ready and used by hundreds of companies worldwide. Competition is healthy, and data engineers and data scientists alike should choose the right tool for the job.