---
title: "Comprehend custom preprocessing"
date: 2020-02-07T00:14:57-05:00
draft: false
algo: [comprehend]
---

The Comprehend [Custom Classifier](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification-training.html) supports two modes: multi-class and multi-label.

In multi-class classification, each document can have one and only one class assigned to it. The individual classes are mutually exclusive.

In multi-label classification, individual classes represent different categories, but these categories are somehow related and are not mutually exclusive.

>Comprehend custom classifier expects training data with exactly two columns in each row. Column one is one of the possible labels, Column two is the content of the document itself.

### What does it mean?

Let's say you have a CSV file with 2 columns. What it should contain depends on the mode.

### Multi-Class Mode:

```html
Sample label 1,Text of document 1
Sample label2,Text of document 2
samplelabel3,Text of document 3
```

A "document" here can be a sentence, a paragraph or several paragraphs.

> Our recommendation is that you provide nothing more than a paragraph.

If your file has more than these two columns, drop the extra columns or join them into a single column.

For the training dataset, the file format must conform to the following requirements:

- The file has exactly two columns in each row. Column one is one of the possible labels. Column two is the content of the document itself.
- No header.
- Format is UTF-8, with "\n" line endings.
- Labels must be uppercase. They can be multi-token, have white space, consist of multiple words connected by underscores or hyphens, or even contain a comma, as long as it is correctly escaped.
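The requirements above can be checked before training with a few lines of Python. This is only a minimal sketch using the standard library: `find_format_problems` is a hypothetical helper (not part of Comprehend or pandas), and it checks just the column count and the uppercase rule.

```python
import csv
import io


def find_format_problems(csv_text):
    """Return (row_number, message) pairs for rows that break the rules above."""
    problems = []
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        # Every row needs exactly two columns: label, then document text
        if len(row) != 2:
            problems.append((row_number, f"expected 2 columns, found {len(row)}"))
            continue
        label, _text = row
        # Labels must be uppercase
        if label != label.upper():
            problems.append((row_number, f"label is not uppercase: {label!r}"))
    return problems


# Illustrative input: row 2 has a mixed-case label, row 3 has unescaped commas
sample = (
    "SAMPLE_LABEL_1,Text of document 1\n"
    "Sample label2,Text of document 2\n"
    "SAMPLELABEL3,Text,with,extra,commas\n"
)
for problem in find_format_problems(sample):
    print(problem)
```

An empty result means every row passed these two checks; it does not guarantee that Comprehend will accept the file.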
Read the file using `pandas` like this:

```python
import pandas as pd

# The file has no header row, so name the columns explicitly.
# Use a list (not a set) so the column order is guaranteed.
data = pd.read_csv('file.csv', names=['label', 'text'])
```

Modify the label column and write out a preprocessed file:

```python
# Labels must be uppercase; use the .str accessor for element-wise string methods
data.label = data.label.str.upper()
# Replace spaces inside labels with underscores
data.label = data.label.str.replace(" ", "_", regex=False)

# Don't include headers or indices
data.to_csv('out.csv', header=False, index=False, escapechar='\\', doublequote=False, quotechar='"')
```

Your CSV file, now ready for Comprehend custom classifier training, will look like this:

```html
SAMPLE_LABEL_1,Text of document 1
SAMPLE_LABEL2,Text of document 2
SAMPLELABEL3,Text of document 3
```

### Multi-Label Mode:

```html
Sample label 1|Sample label2,Text of document 1
Sample label2,Text of document 2
Sample label 1|Sample label2|samplelabel3,Text of document 3
```

A "document" here can be a sentence, a paragraph or several paragraphs.

> Our recommendation is that you provide nothing more than a paragraph.

If your file has more than these two columns, drop the extra columns or join them into a single column.

For the training dataset, the file format must conform to the following requirements:

- The file has exactly two columns in each row. Column one is one, many, or all of the possible labels, each separated by a delimiter chosen from the available options. The default delimiter is bar (|). Column two is the content of the document itself.
- No header.
- Format is UTF-8, with "\n" line endings.
- Labels must be uppercase. They can be multi-token, have white space, consist of multiple words connected by underscores or hyphens, or even contain a comma, as long as it is correctly escaped.
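To see how the label column breaks down in multi-label mode, it can help to split it on the delimiter and count each label. The sketch below assumes the default `|` delimiter and uses only the standard library; `count_labels` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def count_labels(csv_text, delimiter="|"):
    """Count how many rows each individual label appears in."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # ignore rows that do not have exactly two columns
        # Column one may hold several labels separated by the delimiter
        for label in row[0].split(delimiter):
            counts[label.strip()] += 1
    return counts


sample = (
    "Sample label 1|Sample label2,Text of document 1\n"
    "Sample label2,Text of document 2\n"
    "Sample label 1|Sample label2|samplelabel3,Text of document 3\n"
)
for label, count in sorted(count_labels(sample).items()):
    print(label, count)
```

A label that shows up only once or twice, or two spellings of what should be the same label, is usually worth fixing before training.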
Read the file using `pandas` like this:

```python
import pandas as pd

# The file has no header row, so name the columns explicitly.
# Use a list (not a set) so the column order is guaranteed.
data = pd.read_csv('file.csv', names=['label', 'text'])
```

Modify the label column and write out a preprocessed file:

```python
# Uppercase every label; the "|" delimiter between labels is left untouched
data.label = data.label.str.upper()
# Replace spaces inside labels with underscores
data.label = data.label.str.replace(" ", "_", regex=False)

# Don't include headers or indices
data.to_csv('out.csv', header=False, index=False, escapechar='\\', doublequote=False, quotechar='"')
```

Your CSV file, now ready for Comprehend custom classifier training, will look like this:

```html
SAMPLE_LABEL_1|SAMPLE_LABEL2,Text of document 1
SAMPLE_LABEL2,Text of document 2
SAMPLE_LABEL_1|SAMPLE_LABEL2|SAMPLELABEL3,Text of document 3
```

*Note*: Repeat this for all files that are part of your dataset.
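One way to repeat these steps for every file is to wrap them in a function and loop over a directory. The sketch below is an assumption-laden illustration: it uses the standard-library `csv` module instead of pandas, the helper names and the `raw`/`processed` directory names are hypothetical, and it writes "\n" line endings to match the format requirement.

```python
import csv
from pathlib import Path


def preprocess_file(source, destination):
    """Uppercase labels, replace spaces with underscores, write a headerless CSV."""
    with open(source, newline="", encoding="utf-8") as src, \
            open(destination, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, lineterminator="\n", escapechar="\\", doublequote=False)
        for row_number, row in enumerate(csv.reader(src), start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 2:
                raise ValueError(f"{source}: row {row_number} has {len(row)} columns")
            label, text = row
            # The "|" delimiter of multi-label files is not affected by these calls
            writer.writerow([label.upper().replace(" ", "_"), text])


def preprocess_directory(input_dir, output_dir):
    """Preprocess every .csv file in input_dir and return the written paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(Path(input_dir).glob("*.csv")):
        destination = output_dir / source.name
        preprocess_file(source, destination)
        written.append(destination)
    return written


# Example call (directory names are placeholders):
# preprocess_directory("raw", "processed")
```

Writing to a separate output directory leaves the original files untouched, so the step can be rerun safely.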