# Running the Wikipedia Spark SQL example on Amazon EMR

This document shows how to run a simple word count example against files stored in Amazon S3.

# Contents

This project has two files (minimal sketches of both appear at the end of this README):

- `build.sbt`: the file containing the build definition
- `WikiS3SparkSQL.scala`: our word count code

# Querying data in an S3 bucket

The sample data lives in the following S3 bucket:

```
s3://support.elasticmapreduce/bigdatademo/sample/wiki
```

Each line in the log file has four fields: `projectcode`, `pagename`, `pageviews`, and `bytes`. A sample of the type of data stored in Wikistat is shown below.

```
en Barack_Obama 997 123091092
en Barack_Obama%27s_first_100_days 8 850127
en Barack_Obama,_Jr 1 144103
en Barack_Obama,_Sr. 37 938821
en Barack_Obama_%22HOPE%22_poster 4 81005
en Barack_Obama_%22Hope%22_poster 5 102081
```

# Build using SBT

Download the two files and build the project using SBT. Be sure to maintain the following directory structure:

```
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/WikiS3SparkSQL.scala
```

# Submitting code to the cluster

Copy your project JAR to your Amazon EMR cluster running Spark, then run the following command from the command line. This submits our Spark job to the cluster and prints the results on screen.

```
MASTER=yarn-client /home/hadoop/spark/bin/spark-submit --class WikiS3SparkSQL /path/to/example-wikisparksql_2.10-1.0.jar
```
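
# Sketches of the project files

The contents of `build.sbt` are not reproduced above, so here is a minimal sketch of what it might contain. The project name, version, and Scala version are inferred from the JAR name `example-wikisparksql_2.10-1.0.jar`; the Spark version (1.3.1) is an assumption and should be adjusted to match your cluster.

```scala
// Hypothetical build definition; versions are assumptions.
name := "example-wikisparksql"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // "provided": the EMR cluster supplies Spark at runtime,
  // so Spark is excluded from the packaged JAR.
  "org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.3.1" % "provided"
)
```

With this in place, running `sbt package` from the project root should produce the JAR under `target/scala-2.10/`.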
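
Likewise, the source of `WikiS3SparkSQL.scala` is not shown above. Below is a hedged sketch of what a Spark SQL job over this data set could look like, assuming the Spark 1.3-era `SQLContext` API implied by the `yarn-client` submit command. The `WikiStat` case class and the specific query are illustrative, not the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Schema matching the four fields described above:
// projectcode, pagename, pageviews, bytes.
case class WikiStat(projectcode: String, pagename: String,
                    pageviews: Long, bytes: Long)

object WikiS3SparkSQL {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WikiS3SparkSQL"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Parse the space-separated log lines into WikiStat rows,
    // skipping any malformed lines.
    val wikiData = sc
      .textFile("s3://support.elasticmapreduce/bigdatademo/sample/wiki")
      .map(_.split(" "))
      .filter(_.length == 4)
      .map(f => WikiStat(f(0), f(1), f(2).toLong, f(3).toLong))
      .toDF()

    // Register the data as a temporary table so it can be queried with SQL.
    wikiData.registerTempTable("wikistats")

    // Example query: the ten most-viewed English-language pages.
    sqlContext
      .sql("""SELECT pagename, SUM(pageviews) AS views
              FROM wikistats
              WHERE projectcode = 'en'
              GROUP BY pagename
              ORDER BY views DESC
              LIMIT 10""")
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```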