It’s been two years since I wrote the first tutorial on how to set up a local Docker environment for running Spark Streaming jobs with Kafka. This post is a follow-up to the previous one, but a bit more advanced and up to date. It shows a basic working example of a Spark application that uses Spark SQL to process a data stream from Kafka. I’ll also show how to run the Spark application and how to set up a local development environment with all the components (ZooKeeper, Kafka) using Docker and docker-compose.
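A local ZooKeeper + Kafka stack of the kind described above can be sketched as a docker-compose file. This is a minimal illustration, not the exact setup from the post: the image names (`wurstmeister/zookeeper`, `wurstmeister/kafka`) and the advertised host are assumptions.

```yaml
version: '2'
services:
  zookeeper:
    image: wurstmeister/zookeeper   # image choice is an assumption
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka       # image choice is an assumption
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_HOST_NAME: localhost   # assumes clients run on the host
    depends_on:
      - zookeeper
```

With this in place, `docker-compose up -d` brings up both services, and a Spark application on the host can consume from `localhost:9092`.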
spark-shell is an interactive shell that comes with Spark’s distribution. The shell is useful for learning the API, quick experiments, prototyping, etc. And to do that you don’t need a cluster, or even a Spark distribution installed locally.
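As a taste of what such a quick experiment looks like, here is a hypothetical spark-shell session run in local mode (`sc`, the SparkContext, is predefined by the shell):

```
$ spark-shell --master "local[*]"

scala> val rdd = sc.parallelize(1 to 100)
scala> rdd.filter(_ % 2 == 0).count()
res0: Long = 50
```

No cluster is involved: `local[*]` runs Spark inside the shell process using all available cores.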
Recently I have been involved in reviewing one “Big Data” project. I can’t say that the volume of data processed by that app was all that big, but the people who developed it treated it as a real big data project. They used Hadoop simply because that is what you use for big data. Even the project’s codename had the phrase “Big Data” in it.
In short, the project was a data analytics application with ETL, a data warehouse and reporting. A variety of technologies were used to build it: Java, Hadoop, Python, Pandas and R, with a UI implemented in PHP and Shiny apps embedded as IFRAMEs. But honestly, the project was a real mess, and everything that could go wrong actually went wrong. Still, it is a good example of common problems and anti-patterns in data analytics apps, so I’d like to talk about the major ones.
This is a short note on how to deal with Parquet files with Spark.
Spark SQL provides built-in support for a variety of data formats, including JSON. Each new release of Spark contains enhancements that make using the DataFrames API with JSON data more convenient. At the same time, there are a number of tricky aspects that might lead to unexpected results. In this post I’ll show how to use Spark SQL to deal with JSON.
Column-oriented data stores have proven their worth for many analytical purposes, as shown by the RCFile and Apache ORC formats and their wide adoption in distributed data processing tools across the Hadoop ecosystem.
Another columnar format that has gained popularity lately is Apache Parquet, which is now a top-level Apache project. It is supported by many data processing tools, including Spark and Presto.
Recently I’ve been experimenting with storing data in the Parquet format, so I thought it might be a good idea to share a few examples. This post covers the basics of writing data into Parquet.
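The basic write-then-read round trip can be sketched as follows, assuming Spark 2.x; the sample data and output path are invented for the example:

```scala
import org.apache.spark.sql.SparkSession

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parquet-example")
      .getOrCreate()
    import spark.implicits._

    // Write a small DataFrame as Parquet; the output path is an assumption.
    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    df.write.mode("overwrite").parquet("/tmp/example.parquet")

    // Read it back; Parquet stores the schema in the file, so none is declared here.
    val back = spark.read.parquet("/tmp/example.parquet")
    back.show()

    spark.stop()
  }
}
```

Because Parquet is self-describing, the round trip preserves column names and types without any extra configuration, which is a big part of its appeal over plain text formats.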
If you’ve always wanted to try Spark Streaming but never found the time to give it a shot, this post provides easy steps for getting a development setup with Spark and Kafka using Docker.