This is a short note on working with Parquet files in Spark.
Previously I showed how to write Parquet files using just the Parquet library.
But Spark SQL has built-in support for the Parquet data format, which makes processing Parquet files easy through the simple DataFrame API.
Reading a DataFrame from Parquet is as simple as:

val df = sqlContext.read.parquet("s3a://bucket/data/")

And writing data to Parquet files:

df.write.parquet("s3a://bucket/another_data/")

More advanced topics like partitioning and schema merging will be covered later.
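For reference, here is a minimal, self-contained sketch that puts the read and write steps above together using the Spark 1.x SQLContext API. The bucket paths are placeholders, and the "year" column used in the filter step is purely an assumption for illustration; any DataFrame transformation could go in its place.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("parquet-round-trip")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read an existing Parquet dataset into a DataFrame.
    // The s3a path is a placeholder; any reachable Parquet directory works.
    val df = sqlContext.read.parquet("s3a://bucket/data/")

    // Transformations can be applied with the DataFrame API before writing back.
    // The "year" column is a hypothetical example column, not from the original post.
    val recent = df.filter(df("year") >= 2015)

    // Write the result back out as Parquet to another location.
    recent.write.parquet("s3a://bucket/another_data/")

    sc.stop()
  }
}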