{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![MLU Logo](../data/MLU_Logo.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 2\n", "\n", "## Final Project: Tree-based Model for the IMDB Movie Review Dataset\n", "\n", "For the final project, build a tree-based model to predict the sentiment (positive or negative) of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/\n", "\n", "Use the notebooks from the class to implement the model, then train and test it with the corresponding datasets.\n", "\n", "You can follow these steps:\n", "1. Read the training and test data (Given)\n", "2. Train a classifier (Implement)\n", "3. Make predictions on your test dataset (Implement)\n", "\n", "You can use the __LogisticRegression__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html \n", "You can use the __DecisionTreeClassifier__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html \n", "You can use the __RandomForestClassifier__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html \n", "You can use the __GridSearchCV__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html \n", "You can use the __RandomizedSearchCV__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Reading the dataset\n", "\n", "We will use the __pandas__ library to read our dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### __Training data:__\n", "Let's read our training data. It has two fields, text and label. The label is 1 for positive reviews and 0 for negative reviews." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr>\n",
"      <th></th>\n",
"      <th>text</th>\n",
"      <th>label</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>This movie makes me want to throw up every tim...</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>Listening to the director's commentary confirm...</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>One of the best Tarzan films is also one of it...</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>Valentine is now one of my favorite slasher fi...</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>No mention if Ann Rivers Siddons adapted the m...</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " text label\n", "0 This movie makes me want to throw up every tim... 0\n", "1 Listening to the director's commentary confirm... 0\n", "2 One of the best Tarzan films is also one of it... 1\n", "3 Valentine is now one of my favorite slasher fi... 1\n", "4 No mention if Ann Rivers Siddons adapted the m... 0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "train_df = pd.read_csv('../data/final_project/imdb_train.csv', header=0)\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### __Test data:__" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr>\n",
"      <th></th>\n",
"      <th>text</th>\n",
"      <th>label</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>What I hoped for (or even expected) was the we...</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>Garden State must rate amongst the most contri...</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>There is a lot wrong with this film. I will no...</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>To qualify my use of \"realistic\" in the summar...</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>Dirty War is absolutely one of the best politi...</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " text label\n", "0 What I hoped for (or even expected) was the we... 0\n", "1 Garden State must rate amongst the most contri... 0\n", "2 There is a lot wrong with this film. I will no... 1\n", "3 To qualify my use of \"realistic\" in the summar... 1\n", "4 Dirty War is absolutely one of the best politi... 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "test_df = pd.read_csv('../data/final_project/imdb_test.csv', header=0)\n", "test_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Train a Classifier\n", "\n", "You can work with these models and hyperparameter search utilities:\n", "* __LogisticRegression__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html \n", "* __DecisionTreeClassifier__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html \n", "* __RandomForestClassifier__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html \n", "* __GridSearchCV__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html \n", "* __RandomizedSearchCV__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Implement this" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Make predictions on your test dataset" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Implement this" ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }