# Create a synthetic dataset based on the Tweets

Because the Twitter Privacy Policy does not allow us to distribute the original tweets, we create synthetic tweets based on the tweets annotated by the Diego Lab, a biomedical informatics lab at Arizona State University (ASU).

To create a synthetic tweet, we 1) translate the original tweet from English to German and back to English, which adds variability and noise to the original text; and 2) remove stop words if the round-trip translation returns the text unchanged.
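
The control flow of this step can be tried offline before calling any AWS service. The following is a minimal sketch, not the notebook's implementation: `fake_round_trip` is a hypothetical stand-in for the two Amazon Translate calls, `make_synthetic` is an illustrative name, and `STOP_WORDS` is a tiny hand-picked subset rather than NLTK's full English list. It mirrors the logic of `create_synthetic_text`, where stop words are removed only when the round trip returns the text unchanged.

```python
# Hypothetical offline sketch: no AWS calls are made here.
# STOP_WORDS is a small illustrative subset, not NLTK's English stop word list.
STOP_WORDS = {"i", "the", "a", "is", "my", "and", "to"}


def fake_round_trip(text):
    """Stand-in for English -> German -> English translation (fixed lookup)."""
    paraphrases = {
        "this drug gave me a headache": "this medicine gave me a headache",
    }
    # Text that is not in the lookup comes back unchanged.
    return paraphrases.get(text, text)


def make_synthetic(text, round_trip=fake_round_trip, stop_words=STOP_WORDS):
    """Return the round-trip text, or the text without stop words if unchanged."""
    result = round_trip(text)
    if result == text:
        # The round trip added no variation, so fall back to stop word removal.
        result = " ".join(word for word in text.split() if word not in stop_words)
    return result


print(make_synthetic("this drug gave me a headache"))
print(make_synthetic("the pill is fine"))
```

The first call returns the paraphrase from the stand-in translator; the second falls through to stop word removal because the stand-in returns the input unchanged.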

In [1]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
!pip install better_profanity

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [3]:
import boto3
from nltk.corpus import stopwords

translate = boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)

def get_translations(source_lang_code, destination_lang_code, text):
    result = translate.translate_text(Text=text, 
                                      SourceLanguageCode=source_lang_code, 
                                      TargetLanguageCode=destination_lang_code)
    TranslatedText = result.get('TranslatedText')
    return TranslatedText

def create_synthetic_text(text):
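    # translate the text to German and back to English to add noise
    translated_result = get_translations('en', 'de', text)
    result = get_translations('de', 'en', translated_result)
    
    # if the round trip left the text unchanged, remove stop words instead
    if text == result:
        stop_words = set(stopwords.words('english'))  # build the lookup set once, not once per word
        result = ' '.join(word for word in text.split() if word not in stop_words)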
        
    return result

Note: To run the cell above, ensure that the IAM role attached to the SageMaker notebook instance has permissions to invoke the Amazon Translate API. For more information on attaching the necessary policies and adding trusted entities, refer to the documentation: [Creating a role to delegate permissions to an AWS service](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-service.html)
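
As a rough illustration of what such a permission could look like, the sketch below builds a minimal IAM policy document in Python and prints it as JSON. It assumes the notebook only calls `translate_text`, so it grants the single action `translate:TranslateText`; the variable name `translate_policy` is illustrative, and your account may require a different or narrower policy.

```python
import json

# Assumption: only the TranslateText operation is needed by this notebook.
# Adjust the statement to match your organization's IAM conventions.
translate_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["translate:TranslateText"],
            "Resource": "*",
        }
    ],
}

# Print the policy as JSON so it can be pasted into the IAM console.
print(json.dumps(translate_policy, indent=2))
```

The printed JSON can be attached to the notebook instance's role as an inline policy, or used as a starting point for a managed policy.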

In [4]:
import pandas as pd
df = pd.read_csv("adr_classify_twitter_data.csv", lineterminator='\n')

In [5]:
df.head(0)

Unnamed: 0,text,tweet_id,user_id,label


In [6]:
# flag tweets that contain profanity
from better_profanity import profanity

df['has_profanity'] = df['text'].apply(profanity.contains_profanity)

In [7]:
# keep only the tweets without profanity; copy so later column assignments act on a new DataFrame
df = df[~df['has_profanity']].copy()

In [8]:
df['text_synthetic'] = df['text'].apply(create_synthetic_text)

In [9]:
df[df['text'] == df['text_synthetic']]['label'].value_counts()

0    33
1     1
Name: label, dtype: int64

In [10]:
# remove observations where the original text is the same as the augmented text
df_syn = df[df['text'] != df['text_synthetic']]

In [11]:
# remove Twitter usernames
import re
df_syn = df_syn.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
df_syn['text_synthetic'] = df_syn['text_synthetic'].apply(lambda ele: re.sub(r'@\S+', '', ele))
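
Removing `@` mentions with a regular expression leaves behind the whitespace that surrounded them. The sketch below shows one possible way to tidy that up. It uses the same "`@` followed by non-whitespace characters" pattern as the cell above and adds a whitespace-normalization step that is not part of the original notebook; the helper name `strip_mentions` is hypothetical.

```python
import re

# Same idea as the notebook's pattern: '@' followed by any run of non-whitespace.
MENTION = re.compile(r"@\S+")


def strip_mentions(text):
    """Remove @mentions, then collapse the leftover whitespace (an added step)."""
    without_mentions = MENTION.sub("", text)
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(without_mentions.split())


print(strip_mentions("@bob this med made me dizzy @alice"))
```

Because the pattern consumes every non-whitespace character after `@`, punctuation attached to a handle (for example `@alice:`) is removed along with it.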


In [12]:
df_syn = df_syn[['text_synthetic', 'label']]
df_syn.rename({'text_synthetic':'text'}, axis=1).to_csv("adr_classify_twitter_synthetic_data.csv", index=False)