---
title: "Rossman store sales forecast"
date: 2020-02-07T00:15:15-05:00
draft: false
algo: deepar
---

### Introduction

Dirk Rossmann GmbH is Germany's second-largest drug store chain (after dm-drogerie markt), with over 4000 stores in Europe.

You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

This data was obtained from Kaggle [here](https://www.kaggle.com/c/rossmann-store-sales/data) 

For folks who want to repeat this use case, the data is stored as a direct download [here](s3://easy-ml-pocs/rossman-sales-data/rossman-sales.csv)

The first few rows of the raw csv data looks like this:

```html
"Store","DayOfWeek","Date","Sales","Customers","Open","Promo","StateHoliday","SchoolHoliday"
1,5,2015-07-31,5263,555,1,1,"0","1"
2,5,2015-07-31,6064,625,1,1,"0","1"
3,5,2015-07-31,8314,821,1,1,"0","1"
4,5,2015-07-31,13995,1498,1,1,"0","1"
5,5,2015-07-31,4822,559,1,1,"0","1"
6,5,2015-07-31,5651,589,1,1,"0","1"
7,5,2015-07-31,15344,1414,1,1,"0","1"
8,5,2015-07-31,8492,833,1,1,"0","1"
9,5,2015-07-31,8565,687,1,1,"0","1"
```

### Assumptions

For this PoC, we will simplify the dataset by dropping a few columns. Many of the following assumptions may be inaccurate, but this is a PoC, and we want to exercise the provided code samples end-to-end. Let us try to forecast the ```Sales``` column for each store by ```Store``` ID, and ```Promo```. We will first and drop:

- ```DayOfWeek``` assuming most of the days have similar sales until the weekend
- ```Customers``` assuming this is number of customers, and we will not have access to this value until after the day
- ```Open``` assuming it the stores are open except during holidays, and we have a column that takes care of that
- ```SchoolHoliday```, since to prove this out, we will be considering other columns with dynamic values, like ```Promo```
- ```StateHoliday```, since the unique values in the column are [0,a,b,c]

### Step 1 - Preprocessing

#### Step 1.1 - Drop and rearrange columns
From the code on this page on [Select, drop or extract Columns](../../preprocessing/selecting), copy the following line:

```html
awk -F "|" '{ print $1 $3 $5 }' Folder/in*csv > outfile.csv
``` 
and adapt it for our use case:

```html
awk -F"," '{ {gsub(/\"/,"")}; print $3","$1","$4","$7}' rossman-sales.csv > outfile.csv
```

Now, the data in outfile.csv looks like this:


```html
Date,Store,Sales,Promo
2015-07-31,1,5263,1
2015-07-31,2,6064,1
2015-07-31,3,8314,1
2015-07-31,4,13995,1
2015-07-31,5,4822,1
2015-07-31,6,5651,1
2015-07-31,7,15344,1
2015-07-31,8,8492,1
2015-07-31,9,8565,1
```

Note that we also used the find-and-replace from the same [link](../../preprocessing/selecting) using ```awk``` to replace all the double-quotes (") by an empty string (basically remove all double-quotes)


#### Step 1.2 - Prepare data for DeepAR

From [this link](../../preprocessing/deepar/), you can create a new python file with the following contents:

```python
import pandas as pd
import jsonlines
from sklearn import preprocessing
le = preprocessing.LabelEncoder()


series = pd.read_csv('outfile.csv', parse_dates=[0], index_col=0)
series.sort_index(inplace=True)

target_column = 'Sales'
group_column = 'Store'

for col in series.columns:
    if col !=target_column:
        series[col] = le.fit_transform(series[col])

if series[group_column].nunique()==1:
    a = [series]
else:
    a = [v for k, v in series.groupby(group_column)]

out = []

for i in range(len(a)):
    dynamic_feat = []
    cat = []
    for col in a[0].columns:
        if col == target_column:
            target = a[0][col].values.tolist()
            start = str(a[0].index[0])

        else:
            if a[0][col].nunique()>=2: #if 2 or more values, add as dynamic feature
                dynamic_feat.append(a[0][col].values.astype(float).tolist())
            elif a[0][col].nunique()==1: #if 1 value, add as category
                cat.append(int(a[0][col][0]))
    out.append({'start':start, 'target':target, 'cat':cat, 'dynamic_feat':dynamic_feat})
    
with jsonlines.open('train-data.jsonl', mode='w') as writer:
    writer.write_all(out)
```

> The only things we changed are the name of the input file ('outfile.csv'), the name of the target column and group column.

Save it as ```dataprep.py```, and run ```python dataprep.py```

After running this file, you should get an output file called ```train-data.jsonl``` which looks like:

```html
{"start": "2013-01-01 00:00:00", "target": [0, 5530, 4327, 4486, 4997, 0, 7176, 5580, 5471, 4892, 4881, 4952, 0, 4717, 3900, 4008, 4044, 4127, 5182, 0, 5394, 5720, 5578, 5195, 5586, 5598, 0, 4055, 3725, 4601, 4709, 5633, 5970, 0, 7032, 6049, 6140, 5499, 5681, 5370, 0, 4409, 4015, 4252, 4241, 4809, 6154, 0, 6407, 5386, 5660, 5261, 5000, 5237, 0, 4038, 3794, 4558, 4676, 4611, 5350, 0, 7675, 6300, 5973, 5637, 5853, 5578, 0, 4949, 3853, 4341, 5108, 4925, 5003, 0, 7072, 6563, 5598, 5179, 5506, 5603, 0, 6729, 6686, 6660, 7285, 0, 7132, 0, 0, 5484, 4625, 4293, 4390, 5075, 0, 6046, 5514, 4903, 4366, .......
```

### Step 2 - Training

From [this link](../../training/deepar) on training with DeepAR, we first upload our ```train-data.jsonl``` file to S3, and make a note of this ```path```. You can also click the file on the S3 console, and hit the ```Copy  Path``` button.

![](/images/copypath.png)

Since this is daily data, change the '1H' frequency in the hyperparameters to '1D'.

Again from [this link](../../training/deepar) on training with DeepAR, copy all the code cells in a python file, or use a local jupyter notebook (make sure you have the right permissions and have run ```aws configure```); you can also run the following cells in a SageMaker notebook.

In this example, we use a SageMaker notebook and add these cells:

![](/images/startrosstraining.png)

At the end of training, you will see the following output:

```html
.
.
.
2020-03-12 21:42:57 Uploading - Uploading generated training model
2020-03-12 21:42:57 Completed - Training job completed
Training seconds: 156
Billable seconds: 156
```

### Step 3 - Deploy model

From [this link](../../inference/deepar) on deploying a model trained with DeepAR, do

```python
predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')
```

To forecast, follow instructions in the same [link](../../inference/deepar):

![](/images/predictrossdeepar.png)

### Done!