# Large-Scale Machine Learning with Spark on Amazon EMR
This repository contains the sample code for the AWS Big Data blog post *Large-Scale Machine Learning with
Spark on Amazon EMR*. It demonstrates an example machine learning workflow using Spark and MLlib on Amazon EMR.

## Prerequisites
  - Amazon Web Services account
  - [AWS Command Line Interface (CLI)](http://aws.amazon.com/cli/)
  - [sbt](http://www.scala-sbt.org/)
  - [sbt-assembly](https://github.com/sbt/sbt-assembly)
  
## Building
```
sbt assembly
```

## Copying to S3
```
aws s3 cp spark-emr/target/scala-2.10/spark-emr-assembly-1.0.jar s3://your-bucket-name/$USER/spark/jars/spark-emr-assembly-1.0.jar
```
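To confirm that the jar landed where the EMR step expects it, you can list the object afterwards. The following is a minimal sketch, assuming the same placeholder bucket and key layout as the copy command above; `jar_uri` and `verify_upload` are hypothetical helper names, not part of this repository.

```shell
# Placeholder values: replace with your own bucket name.
BUCKET="your-bucket-name"
JAR_NAME="spark-emr-assembly-1.0.jar"

# Build the S3 URI of the uploaded jar (same layout as the copy command above).
jar_uri() {
  printf 's3://%s/%s/spark/jars/%s\n' "$BUCKET" "$USER" "$JAR_NAME"
}

# List the uploaded object; an empty listing means the jar is not there.
verify_upload() {
  aws s3 ls "$(jar_uri)"
}

# Example usage:
#   verify_upload
```

Keeping the URI in one helper makes it easier to reuse the exact same path in the `--steps` argument when creating the cluster.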

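## Example invocation

The command below launches a three-instance `m3.xlarge` cluster, runs the `ModelingWorkflow` class from the uploaded jar as a Spark step, and terminates the cluster once the step finishes. The last two step arguments are S3 locations: the demo input data and the output prefix. Replace `your-bucket-name` and `MyKeyName` with your own S3 bucket and EC2 key pair name.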

```
aws emr create-cluster \
  --name "exampleJob" \
  --ec2-attributes KeyName=MyKeyName \
  --auto-terminate \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --log-uri s3://your-bucket-name/$USER/spark/`date +%Y%m%d%H%M%S`/logs \
  --applications Name=Spark,Args=[-x] \
  --steps "Name=\"Run Spark\",Type=Spark,Args=[--deploy-mode,cluster,--master,yarn-cluster,--conf,spark.executor.extraJavaOptions=-XX:MaxPermSize=256m,--conf,spark.driver.extraJavaOptions=-XX:MaxPermSize=512m,--class,ModelingWorkflow,s3://your-bucket-name/$USER/spark/jars/spark-emr-assembly-1.0.jar,s3://support.elasticmapreduce/bigdatademo/intentmedia/,s3://your-bucket-name/$USER/spark/output/]"
```
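Once the cluster is launched, its progress can be followed from the CLI. The sketch below assumes you have noted the cluster ID that `aws emr create-cluster` prints; `j-XXXXXXXXXXXXX` is a placeholder, and `cluster_state` and `step_states` are hypothetical helper names, not part of this repository.

```shell
# Hypothetical helpers for following a cluster started with the command above.

# Print the cluster's current state (for example STARTING, RUNNING, TERMINATED).
cluster_state() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.Status.State' --output text
}

# Print the name and state of each step on the cluster.
step_states() {
  aws emr list-steps --cluster-id "$1" \
    --query 'Steps[].[Name,Status.State]' --output text
}

# Example usage (placeholder cluster ID):
#   cluster_state j-XXXXXXXXXXXXX
#   step_states j-XXXXXXXXXXXXX
```

Because the cluster is created with `--auto-terminate`, it shuts down after the step completes; the logs written under the `--log-uri` prefix are the place to look if the step fails.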