{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Optional: Descriptive Statistics\n", "\n", "This lab is optional. \n", "\n", "The goal of this quick starter guide is to walk quickly through a time-series modeling pipeline. This tutorial does not go deep into the concepts of time-series modeling and only uses a very clean dataset. To use an analogy from learning music, the intention is to help the reader play a few basic songs on the instrument before being disheartened by music theory.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "from scipy.stats import pearsonr\n", "from sklearn.metrics import mean_squared_error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Strategy\n", "The dataset used in this tutorial contains tweet volumes for the hashtag **#AMZN**. We load the csv file directly into a `pandas` dataframe and run a few quick checks and descriptive statistics to get an idea of what the dataset is all about, before building a forecasting model in the follow-up tutorial. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Dataset\n", "The dataset comes from the [Numenta Anomaly Detection Benchmark](https://raw.githubusercontent.com/numenta/NAB/master/data/realTweets/Twitter_volume_AMZN.csv). It is based on real-time streaming data from Twitter for the **#AMZN** hashtag." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"https://raw.githubusercontent.com/numenta/NAB/master/data/realTweets/Twitter_volume_AMZN.csv\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data\n", "The dataset includes only two columns, *timestamp* and *value*. Each line represents a five-minute aggregate of AMZN tweet volume." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(url)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking for missing data\n", "`pandas.isna()` returns `True` or `False` for every record in the dataframe. `True` indicates that a value is missing at that position in the dataset. The output of `isna()` is a DataFrame, so we can aggregate the values using `sum` to retrieve the total number of missing values per column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are no missing values in the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Which year/month do we have data from?\n", "First we check which years and months we have data for. As the output below shows, the data is only from 2015 and only covers February through April.\n", "The February data also appears to be incomplete." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dt = pd.to_datetime(df.timestamp)\n", "dt.groupby([dt.dt.year, dt.dt.month]).count()" ] },
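{ "cell_type": "markdown", "metadata": {}, "source": [ "As an extra sanity check, we can look at the first and last timestamps to confirm the coverage period. This is a small sketch that simply reuses the `dt` series computed in the previous cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: the earliest and latest timestamps show where the data starts and ends\n", "dt.min(), dt.max()" ] },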
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Converting Timestamp to Readable Dates" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Format each timestamp as a human-readable date (weekday, month, day and time, without the year)\n", "df['timestamp_readable'] = df['timestamp'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S').ctime()[:-4]).astype('str')\n", "df.head()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will help us uncover seasonality events." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Descriptive Statistics\n", "Using `describe` we can get a quick descriptive report on the dataset. Since the data includes only one numerical column, we can improve the readability using `df.transpose()` or `df.T`.\n", "\n", "`df.describe` by default only includes numerical columns, but we could pass `include='all'` to the function to include all fields. In the case of the timestamp, the output would be of no interest to us, so we keep the default." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notable Observations**\n", "- There are 15831 rows of data.\n", "- min==0.0, therefore there exists at least one row with `value==0`. It is possible that missing values were replaced with zeros, so we would like to check that the total number of zeros is not significant. We run a count on `df.value==0`; if the output shows that only a small number of rows are zero, we accept those zeros as genuine, or at least ignorable.\n", "- In this case, only 28 rows have a value of zero, so we keep them as they are." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.value==0].count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting the Data\n", "The data shows signs of seasonality: we notice a peak for each day, most likely corresponding to daytime versus nighttime activity.\n", "We can also see that activity seems more limited on Saturdays and Sundays." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fourteen_days = 24*60*14//5 # Fourteen days of data sampled every 5 minutes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = df[:fourteen_days].plot(y='value', figsize=(20,10), sharex=False)\n", "o = ax.set_xticklabels([df.timestamp_readable[int(x)] if x >= 0 else '' for x in ax.get_xticks()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Are there any patterns in the data if we reduce granularity?\n", "Very often, short-term data is far too jittery and all that can be observed is white noise. There are several techniques to address this and to understand whether there are underlying patterns in the data; aggregation and smoothing are among them.\n", "\n", "In the next few cells, the data is plotted, smoothed with a moving average, and averaged per hour of the day in order to investigate whether there are patterns hidden in the data.\n", "\n", "The raw dataset counts the number of tweets every five minutes. Such fine-grained data is useful for short-term prediction of immediate actions and is similar to financial data such as stock prices.\n", "\n", "Aggregated data reduces granularity and helps us model seasonal and long-term non-stationary behaviour. This is closer to use cases such as forecasting weather patterns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting a few separate hours, hour by hour, to understand whether there are any patterns within an hour. As the data has a 5-minute frequency, 12 * 5 minutes represents 1 hour." ] },
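{ "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is a small sketch of this hour-by-hour view (it assumes the default integer index and the `value` column used above): each panel shows the twelve 5-minute samples that make up one hour." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: plot four consecutive hours, 12 five-minute samples per hour\n", "samples_per_hour = 12\n", "fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(14, 8), sharey=True)\n", "for i, ax in enumerate(axes.flatten()):\n", "    df[i * samples_per_hour:(i + 1) * samples_per_hour].plot(y='value', ax=ax, title='hour %d' % (i + 1))" ] },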
{ "cell_type": "markdown", "metadata": {}, "source": [ "We then plot a histogram of the data to understand how it is distributed. It could be observed from the previous plot that most values were in the 0-100 range; the histogram shows this more clearly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.hist(column=['value'], bins=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Daily Patterns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Parse the timestamps and use them as the index\n", "df.timestamp = pd.to_datetime(df.timestamp)\n", "df = df.set_index('timestamp')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.iloc[28]  # inspect the record at index 28, the sample closest to midnight (see next cell)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "offset = 28 # We offset by 28 to start the day closest to midnight" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting separate days, day by day, to understand daily patterns. Each day contains 288 five-minute samples (288 / 12 = 24 hours)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "timestamps_per_day = 24*60//5" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(14,14), sharey=True)\n", "\n", "df[offset : timestamps_per_day + offset].plot(ax=axes[0,0], title='day 1')\n", "df[1 * timestamps_per_day + offset: 2 * timestamps_per_day + offset].plot(ax=axes[0,1], title='day 2')\n", "df[2 * timestamps_per_day + offset: 3 * timestamps_per_day + offset].plot(ax=axes[1,0], title='day 3')\n", "df[3 * timestamps_per_day + offset: 4 * timestamps_per_day + offset].plot(ax=axes[1,1], title='day 4')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute a moving average to remove some of the noise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(14,14), sharey=True)\n", "\n", "df[offset : timestamps_per_day + offset].value.rolling(window=12).mean().plot(ax=axes[0,0], title='day 1')\n", "df[1 * timestamps_per_day + offset: 2 * timestamps_per_day + offset].value.rolling(window=12).mean().plot(ax=axes[0,1], title='day 2')\n", "df[2 * timestamps_per_day + offset: 3 * timestamps_per_day + offset].value.rolling(window=12).mean().plot(ax=axes[1,0], title='day 3')\n", "df[3 * timestamps_per_day + offset: 4 * timestamps_per_day + offset].value.rolling(window=12).mean().plot(ax=axes[1,1], title='day 4')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, if we average across all days, we can see the daily pattern even more clearly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "o = df.value.groupby(df.index.hour).mean().plot(title=\"Overall mean across the hours of the day\")" ] },
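{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick extra check (a small sketch along the same lines), we can also average across the days of the week to verify the earlier observation that activity drops on Saturdays and Sundays. `dayofweek` encodes Monday as 0 and Sunday as 6." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: mean volume per day of the week (0=Monday ... 6=Sunday)\n", "o = df.value.groupby(df.index.dayofweek).mean().plot(title=\"Overall mean across the days of the week\")" ] },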
{ "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook you have investigated the Twitter volume dataset. Now it is time to move to the [next](../part3/twitter_volume_forecast.ipynb) tutorial, where you will train a DeepAR model on this dataset." ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 4 }