CS 5542 BigData Lab Report #02

Amy Lin edited this page Mar 29, 2017 · 2 revisions

SPARK PROGRAMMING - Text Data


[ QUESTION ]

Write a Spark program for an interesting use case that takes text data as input; the program should use at least 2 Spark Transformations and 2 Spark Actions.

Present your use case in the MapReduce paradigm, as shown below for word count.
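To illustrate the MapReduce paradigm itself, here is a minimal word-count sketch using plain Scala collections (no Spark required); the input line is an assumed sample:

```scala
// Word count in the map-reduce paradigm, using plain Scala collections.
val lines = Seq("to be or not to be")

val counts = lines
  .flatMap(_.split(" "))        // MAP: split each line into words
  .map(word => (word, 1))       // MAP: emit (word, 1) pairs
  .groupBy(_._1)                // SHUFFLE: group pairs by key
  .map { case (word, pairs) =>  // REDUCE: sum the counts per key
    (word, pairs.map(_._2).sum)
  }

println(counts)  // Map(to -> 2, be -> 2, or -> 1, not -> 1)
```

The Spark version below follows the same three phases, but each step runs distributed across a cluster instead of on a local collection.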


[ IMPLEMENTATION ]

  • Import the library needed to access Spark -> import org.apache.spark.{SparkConf, SparkContext}
  • Create an object called "TextDataProcess" and define main with a String-array parameter (Array[String]).
  • Initialize Spark by setting the master and the application name -> setMaster and setAppName
  • Read in text data from an external file -> sc.textFile("textdata.txt")
  • Split each line into words and map every word to a count of 1; cache() pulls the data set into a cluster-wide in-memory cache for reuse -> textdata.flatMap(line => line.split(" ")).map(word => (word, 1)).cache()
  • Shuffle and reduce -> wc.reduceByKey(_ + _)
  • Output the result in text file format, repartitioning to balance the data -> output.repartition(1).saveAsTextFile("output.txt")
  • Print the formatted result in the IntelliJ console -> var s: String = "----------------\n Words : Count \n---------------- \n"; result.foreach { case (word, count) => s += word + " : " + count + "\n" }; println(s)
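The steps above can be assembled into one complete program. A sketch, assuming the SparkContext-era API, a local master, and the file names used above; collect() is added before the final foreach so the counts are printed on the driver:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TextDataProcess {
  def main(args: Array[String]): Unit = {
    // Initialize Spark: local master and application name
    val conf = new SparkConf().setMaster("local[*]").setAppName("TextDataProcess")
    val sc = new SparkContext(conf)

    // Read text data from an external file
    val textdata = sc.textFile("textdata.txt")

    // Transformations: split lines into words, map to (word, 1), cache in memory
    val wc = textdata
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .cache()

    // Shuffle and reduce: sum the counts per word
    val output = wc.reduceByKey(_ + _)

    // Action: write the result, repartitioned into a single output file
    output.repartition(1).saveAsTextFile("output.txt")

    // Action: collect the result to the driver and print a formatted table
    val result = output.collect()
    var s: String = "----------------\n Words : Count \n----------------\n"
    result.foreach { case (word, count) => s += word + " : " + count + "\n" }
    println(s)

    sc.stop()
  }
}
```

Here flatMap, map, and reduceByKey are the transformations, while saveAsTextFile and collect are the actions that trigger execution; reduceByKey also incurs the shuffle.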

<< Words in BOLD mark the Action, Transformation, and Shuffle-operation commands. >>