Thursday, October 22, 2015

Apache Spark : Loading CSV file as DataFrame

CSV data can be analyzed in Apache Spark. The easiest way to do exploratory analysis of a CSV file is to load it in the Spark shell and convert it into a DataFrame.

After that, SQL can be used to analyse the data.

I am using Apache Spark 1.5.0 on a Windows machine. The CSV file used is available at Red Wine Quality Data.

Extract only the first three columns from the CSV file and print them.

  // Create an SQLContext and bring in the implicit RDD-to-DataFrame conversions
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._

  // Load the raw CSV as an RDD of lines and drop the header row
  val wineCSV = sc.textFile("winequality-red.csv")
  val header = wineCSV.first()
  val wineData = wineCSV.filter(_ != header)

  // Case class describing the first three columns of the data set
  case class WineData(fixedAcidity: Float, volatileAcidity: Float, citricAcid: Float)

  // The file is semicolon-delimited; split each line and map it to WineData
  val wineCols = wineData.map(_.split(";")).map(p => WineData(p(0).toFloat, p(1).toFloat, p(2).toFloat)).toDF()

  // Register the DataFrame as a temporary table so it can be queried with SQL
  wineCols.registerTempTable("wine")

  sqlContext.sql("SELECT * FROM wine").collect().foreach(println)
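Alternatively, the spark-csv package from Databricks can load the file directly as a DataFrame, skipping the manual header filtering and case class. This is a sketch, assuming the package has been added to the shell's classpath (the artifact version shown is an assumption; pick one matching your Scala build):

```scala
  // Sketch, assuming the shell was started with the spark-csv package, e.g.:
  //   spark-shell --packages com.databricks:spark-csv_2.10:1.2.0
  val wineDF = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")       // first line contains column names
    .option("delimiter", ";")       // the wine data set is semicolon-separated
    .option("inferSchema", "true")  // detect numeric column types automatically
    .load("winequality-red.csv")

  wineDF.registerTempTable("wine")
  sqlContext.sql("SELECT COUNT(*) FROM wine").show()
```

With inferSchema enabled the package makes an extra pass over the data to guess column types, so loading is a bit slower but the resulting DataFrame keeps all columns with proper numeric types.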
