This is a short note on how to work with Parquet files in Spark.
Previously I showed how to write Parquet files using just the Parquet library.
But Spark SQL has built-in support for the Parquet format, which makes processing Parquet files easy with the simple DataFrame API.
Reading a DataFrame from Parquet is as simple as:
val df = sqlContext.read.parquet("s3a://bucket/data/")
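The DataFrame picks up its schema from the Parquet metadata, so you can inspect what was loaded right away (the S3 path above is just an example bucket):

df.printSchema()
df.show(5)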
And writing data to Parquet files:
df.write.parquet("s3a://bucket/another_data/")
More advanced topics like partitioning and schema merging will be covered later.