Thursday, October 22, 2015

Apache Spark : Loading CSV file as DataFrame

CSV data can be analyzed in Apache Spark. The easiest way to do exploratory analysis of a CSV file is to load it in the Spark shell and convert it into a DataFrame.

After that, SQL can be used to analyse the data.

I am using Apache Spark 1.5.0 on a Windows machine. The CSV file used is available at Red Wine Quality Data.

Extract only the first three columns from the CSV file and print them.

  // Create an SQLContext and bring in the implicit RDD-to-DataFrame conversions
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._

  // Load the raw CSV as an RDD of lines and drop the header row
  val wineCSV = sc.textFile("winequality-red.csv")
  val header = wineCSV.first()
  val wineData = wineCSV.filter(_ != header)

  // Case class describing the first three columns of the data set
  case class WineData(fixedAcidity: Float, volatileAcidity: Float, citricAcid: Float)

  // The file is semicolon-delimited; split each line and map it to WineData
  val wineCols = wineData.map(_.split(";")).map(p => WineData(p(0).toFloat, p(1).toFloat, p(2).toFloat)).toDF()

  // Register the DataFrame as a temporary table so it can be queried with SQL
  wineCols.registerTempTable("wine")

  sqlContext.sql("SELECT * FROM wine").collect().foreach(println)
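Alternatively, the spark-csv package from Databricks can load the file directly as a DataFrame, skipping the manual header filtering and case class. This is a sketch, assuming the package has been added to the shell's classpath (the artifact version shown is an assumption; pick one matching your Scala build):

```scala
  // Sketch, assuming the shell was started with the spark-csv package, e.g.:
  //   spark-shell --packages com.databricks:spark-csv_2.10:1.2.0
  val wineDF = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")       // first line contains column names
    .option("delimiter", ";")       // the wine data set is semicolon-separated
    .option("inferSchema", "true")  // detect numeric column types automatically
    .load("winequality-red.csv")

  wineDF.registerTempTable("wine")
  sqlContext.sql("SELECT COUNT(*) FROM wine").show()
```

With inferSchema enabled the package makes an extra pass over the data to guess column types, so loading is a bit slower but the resulting DataFrame keeps all columns with proper numeric types.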
