Introduction to Spark DataFrame | Exploring Spark DataFrame API using PySpark | Hands-On

Before you start learning what a DataFrame is in Spark, I would recommend first learning about what Apache Spark is and its main abstraction, the RDD (Resilient Distributed Dataset).

Apache Spark

RDD(Resilient Distributed Datasets)

What is the problem with RDD?

  • An RDD transformation has no idea about the logic of the function passed to it 
  • Because of that, Spark cannot perform any optimization on RDD operations 

What is the need for DataFrame?

  1. It gives a table-like view of the data, which makes it user friendly 
  2. An important benefit of DataFrame is performance 
  3. A DataFrame query goes through the Catalyst Optimizer (where optimization takes place) before being sent for execution as RDDs

Create DataFrame in Apache Spark

There are two main ways in which you can create a DataFrame in Spark.

Create DataFrame using the Data Source API by reading data from input files (CSV, JSON, text file, Parquet, etc.)

Create DataFrame programmatically (through program code)

Create DataFrame with a few employee records

Happy Learning !!!
