Reading LZO files with Spark
============================

# Goals of this document

Discussion and examples of working with LZO-compressed files in Spark.

# Background

LZO-compressed files are commonly used in big data processing. With traditional Hadoop MapReduce, LZO files are not splittable unless they are first indexed. LZO files can be indexed using the [twitter/hadoop-lzo](https://github.com/twitter/hadoop-lzo) project. The `hadoop-lzo.jar` is preinstalled on EMR AMIs at `/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar`.

Spark provides multiple methods to read in datasets, such as `.textFile()`. The problem with `.textFile()` and LZO compression is that this input format does not understand how to split LZO files, even when an index is available.

# Example

To read LZO datasets effectively, use the `.newAPIHadoopFile()` method, specifying the Hadoop LZO input format `com.hadoop.mapreduce.LzoTextInputFormat`, and then transform the returned key-value pairs into an RDD of lines for use in the application. Note that the LZO data should already be indexed ([see twitter/hadoop-lzo](https://github.com/twitter/hadoop-lzo)).

```
val files = sc.newAPIHadoopFile("s3:///