# NLP: Text Classification

## Business Problem

Building and producing products that are actually adopted by customers and solve real problems for them is a historically challenging task. Today, imagine that you have joined the machine learning team on the Amazon e-commerce site! Your webpage is full of reviews from customers for each of your products. Your product owners want to know about a negative review *immediately*. Ideally, they'd like to know why the review was negative.

## Topic Modeling

Your research team just finished labeling a set of data for positive and negative reviews. Go ahead, you can put it into a model right away. It should work straight out of the box!

Your task is to identify topics, especially the negative ones. Download the data set as listed below, and extract the negative reviews. Next, load them into Amazon Comprehend. Spend some time reading about Comprehend here:

- https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html

You can do this either through the console or programmatically from your notebook. If you're doing this from your notebook, make sure your SageMaker execution role has permission to call Comprehend.

After you've extracted the topics, spend some time reading through them. Do they seem logical? Can you describe them in plain English? Give it a try; it's pretty tough!

## Second Phase

After using Comprehend, you can go one of two routes. In advanced cases, you can do both!

(1) Build your own topic modeler. SageMaker has two built-in topic modeling algorithms: Latent Dirichlet Allocation and Neural Topic Modeling. Pick one or both of them, and train them on your data. What do the resulting topics look like? Are they better or worse than the ones you found in Comprehend?

(2) Re-label your data. With sub-categories identified, or with a smaller subset of the most relevant ones, can you add additional labels to your data? Extra points for using SageMaker Ground Truth for this.
Can you build accurate models that can identify the sub-categories?

## Data Sets

The dataset you'll be working with comes directly from the Amazon review site. It is hosted on AWS and listed among the fast.ai course datasets at https://course.fast.ai/datasets. Navigate to this page and click download for **Amazon Reviews: Polarity**.

The Amazon reviews polarity dataset is constructed by taking review scores 1 and 2 as negative, and 4 and 5 as positive. Samples with a score of 3 are ignored. In the dataset, class 1 is negative and class 2 is positive. Each class has 1,800,000 training samples and 200,000 testing samples.

# Existing Research

Your research team just developed an innovative model that uses convolution to classify text. See this page for further details.

http://xzh.me/docs/charconvnet.pdf

# Sample Code

Code from your researchers is available here.

https://github.com/zhangxiangxiao/Crepe

Download your data from the site, upload it to an S3 bucket via the AWS console, and then run this block of code on your SageMaker notebook instance to read the data into a pandas data frame.

```python
import pandas as pd

# Use a relative Data/ directory so the copy and extract steps agree on the path
!mkdir -p Data
# Replace the bucket name with your own if you uploaded the archive elsewhere
!aws s3 cp s3://nlp-workshop-reviews/amazon_review_polarity_csv.tgz Data/
!tar -xvzf Data/amazon_review_polarity_csv.tgz

# The CSV has no header row, so supply the column names explicitly
df = pd.read_csv("amazon_review_polarity_csv/train.csv", names=["Label", "Title", "Review"])
```
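
Once the archive is extracted, the next step described above is pulling out the negative reviews for topic modeling. The sketch below is one possible way to do that, not the workshop's official solution. It streams rows with the standard `csv` module rather than pandas so the full training file does not need to fit in memory, and it writes one review per line, which matches Comprehend's one-document-per-line input format (`ONE_DOC_PER_LINE`). The function name `write_negative_reviews`, the `limit` parameter, and the sample rows are all made up for illustration.

```python
import csv
import io

NEGATIVE_LABEL = "1"  # per the dataset description: class 1 = negative, class 2 = positive


def write_negative_reviews(rows, out_file, limit=None):
    """Write each negative review to out_file, one review per line.

    rows yields (label, title, review) records, as csv.reader does for train.csv.
    Returns the number of reviews written.
    """
    written = 0
    for label, title, review in rows:
        if label != NEGATIVE_LABEL:
            continue
        # Collapse whitespace, including embedded newlines, so each document stays on one line
        text = " ".join(f"{title} {review}".split())
        out_file.write(text + "\n")
        written += 1
        if limit is not None and written >= limit:
            break
    return written


# Tiny in-memory sample standing in for train.csv (made-up rows, for illustration only)
sample_csv = io.StringIO(
    '"2","Great","Works as described."\n'
    '"1","Broke quickly","Stopped working\nafter two days."\n'
    '"1","Not worth it","Poor build quality."\n'
)
output = io.StringIO()
count = write_negative_reviews(csv.reader(sample_csv), output)
print(count)
print(output.getvalue())
```

To run this on the real data, open `amazon_review_polarity_csv/train.csv` with `newline=""`, pass `csv.reader(file)` as `rows`, and write to a local text file. Setting `limit` keeps the first experiments small before uploading the result to S3 for a Comprehend topic detection job.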