CSV data can be analyzed in Apache Spark. The easiest way to do an exploratory analysis of CSV data is to use the Spark Shell and convert the CSV into a DataFrame. After that, SQL can be used to analyze the data.
I am using Apache Spark 1.5.0 on a Windows machine. The CSV file used is available at Red Wine Quality Data.
The task: extract only the first 3 columns from the CSV file and print them.
// Create an SQLContext from the existing SparkContext (sc) and
// import the implicits needed to convert an RDD to a DataFrame.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Load the CSV as an RDD of lines and drop the header row.
val wineCSV = sc.textFile("winequality-red.csv")
val header = wineCSV.first()
val wineData = wineCSV.filter(_ != header)

// The file is semicolon-delimited; map the first three columns onto
// a case class, then convert the resulting RDD to a DataFrame.
case class WineData(fixedAcidity: Float, volatileAcidity: Float, citricAcid: Float)
val wineCols = wineData.map(_.split(";")).map(p => WineData(p(0).toFloat, p(1).toFloat, p(2).toFloat)).toDF()

// Register the DataFrame as a temporary table so it can be queried with SQL.
wineCols.registerTempTable("wine")
sqlContext.sql("SELECT * FROM wine").collect().foreach(println)