---
title: "Blazingtext preprocessing"
date: 2020-02-07T00:15:04-05:00
draft: false
algo: [blazingtext]
---

From the Sagemaker example [here](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/blazingtext_text_classification_dbpedia/blazingtext_text_classification_dbpedia.ipynb)...

> BlazingText expects a preprocessed text files in S3 with space separated tokens and each line of the file should contain a single sentence and the corresponding label(s) prefixed by "__label__".

## What does this mean?
Let's say you have a CSV file with 2 columns,

```html
CATEGORY,Text of document 1
CATEGORY,Text of document 2
CATEGORY,Text of document 3
```

A "document" here can be a sentence, a paragraph or several paragraphs.

> Our recommendation is that you provide nothing more than a paragraph. If you have anything more than these two columns, drop them.


Read the file using ```pandas``` like this:

```python
import pandas as pd

data = pd.read_csv('documents.txt', names = {'category','text'})
```

Modify the category column and write out a preprocessed file:

```python
data.category  = '__' + data.category + '__'

import nltk
nltk.download('punkt')
data.text.apply(lambda x: ' '.join(nltk.word_tokenize(str.lower(x))))

# Don't include headers or indices
data.to_csv('out.csv',index=False,header=False)
```

Your CSV file that is ready for blazing text will now look like...

```html
__CATEGORY__,Text of document 1
__CATEGORY__,Text of document 2
__CATEGORY__,Text of document 3
```

*Note* : Repeat this for all files that is part of your dataset.

Upload ```out.csv``` to an S3 localtion that looks like ```s3://bucketname/train/out.csv```

Also see [this link](../splittesttrain) for information on how to split this output file into two files, for training and testing.