# Improving Forecast Accuracy with Machine Learning
## Sample Synthetic Data Creation Tooling

To test and demonstrate the Improving Forecast Accuracy with Machine Learning solution, it helps to have synthetic 
data representing a time series that a customer would want to forecast. Many repositories include example datasets 
but provide no mechanism for generating synthetic data. 

This tooling is useful for both:

 - generating sample data files in a format Amazon Forecast supports (so that users can compare them to their own 
   datasets)
 - creating "predictably random" forecastable datasets (which helps test that the forecast result matches the 
   expected value of the underlying stochastic process)
   
This README documents the process of generating synthetic data with the `create_synthetic_data.py` script.
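
To illustrate what "predictably random" means here, the minimal sketch below draws per-period counts from a Poisson process and checks that the observed daily mean approaches the configured rate. The Poisson model is an assumption made purely for illustration; this README does not state which distribution `create_synthetic_data.py` uses, and `poisson_sample` and `simulate` are hypothetical helpers, not part of the script.

```python
import math
import random
from typing import List


def poisson_sample(rng: random.Random, lam: float) -> int:
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate(rate_per_day: float, periods_per_day: int, days: int, seed: int = 0) -> List[int]:
    """Generate per-period counts whose expected daily total is rate_per_day."""
    rng = random.Random(seed)
    lam = rate_per_day / periods_per_day  # expected occurrences in one output period
    return [poisson_sample(rng, lam) for _ in range(periods_per_day * days)]


# 288 five-minute periods per day, 365 days, 60 expected occurrences per day
counts = simulate(rate_per_day=60, periods_per_day=288, days=365)
mean_per_day = sum(counts) / 365
print(f"mean occurrences per day: {mean_per_day:.2f}")  # should be close to 60
```

A forecast trained on data like this can be judged against the known rate, which is not possible with real sales data.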

## Getting Started

The `create_synthetic_data.py` script can be run from the command line: 

```
./create_synthetic_data.py --help
Usage: create_synthetic_data.py [OPTIONS] [INPUT]

  Create synthetic data for the items defined in INPUT (default:
  `config.yaml`)

Options:
  --start TEXT            start date or time, formatted as YYYY-MM-DD or YYYY-
                          MM-DD HH:MM:SS

  --length INTEGER RANGE  number of periods to output for each model defined
                          in the input configuration file

  --plot                  set this flag to output plots of each model
  --help                  Show this message and exit.
```

You will need to prepare a model configuration file (`config.yaml` by default), documented below, to generate data.

## Prerequisites
The following procedures assume that all OS-level configuration has been completed:

* Ensure Python 3.8+ is installed

## 1. Build the solution for deployment

Prepare a Python virtual environment:
```
# ensure Python 3 and virtualenv are installed
cd <repository_name>/source/synthetic
virtualenv .venv
source .venv/bin/activate
pip install -r ../requirements-build-and-test.txt
```

## 2. Prepare the configuration file `config.yaml`

The YAML formatted data file has certain required fields, documented below: 

```
--- 

# this retailer sells penne, two different brands of marinara, and one brand of alfredo sauce across two locations

output: 5min  # required - must be compatible with Amazon Forecast frequencies (Y|M|W|D|30min|15min|10min|5min|1min)   

no_sales_at_night: &no_sales_at_night [0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0]  # hourly adjustments must have length 24 (starting at 00:00)
higher_weekends: &higher_weekends [1,1,1,1,1,2,1.5]  # weekly adjustments must have length 7 (starting Monday)
higher_holiday: &higher_holidays [0.5,0.5,1,1,1,1,1,1,1,1,1.5,2.5]  # monthly adjustments must have length 12 (starting January)

models:                                 # required     - define each model in a list under this key
  # ottawa location 
  - name: penne x                       # required     - each model requires a name
    rate: 60                            # required     - each model requires a rate (expected occurrences per period)
    per: D                              # required     - period length for rate (must be compatible with Amazon Forecast frequencies (Y|M|W|D|30min|15min|10min|5min|1min))
    dimensions:                         # not required - each model can support optional forecast dimensions (commonly used to break down sales by location) 
      - name: store
        value: ottawa
    metadata:                           # not required - each model can support optional item metadata 
      - name: brand                     #                if metadata is defined, ensure it exists for each model under models
        value: brand x
    seasonalities:                      # not required - seasonalities to apply to the generated data
      hourly: *no_sales_at_night        #              - e.g. many retailers do not sell goods at night
      daily: *higher_weekends           #              - e.g. many retailers have higher sales rates on weekends
      monthly: *higher_holidays         #              - e.g. many retailers have higher sales in some months 
    dependencies:                       # not required - allows us to model complementary goods 
      marinara x:                       # required     - each dependency requires a name (in this case, this model depends on another model called `marinara x`)
        chance: .5                      # required     - each dependency requires a chance that a sale of this good results in a sale in the model using it
      marinara y:
        chance: .3
      alfredo y:
        chance: .9

# [...]
```
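
The rules stated in the comments above (the required `output`, `models`, `name`, `rate`, and `per` keys, seasonality lists of length 24, 7, and 12, and metadata being present for every model if it is present for any) can be checked before running the script. The sketch below is a hypothetical pre-flight check, not part of `create_synthetic_data.py`. It operates on a plain dictionary, which is what a YAML parser would be expected to return for this file.

```python
REQUIRED_MODEL_KEYS = ("name", "rate", "per")
SEASONALITY_LENGTHS = {"hourly": 24, "daily": 7, "monthly": 12}


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if "output" not in config:
        problems.append("missing required key: output")
    models = config.get("models")
    if not models:
        problems.append("missing required key: models")
        return problems
    for index, model in enumerate(models):
        label = model.get("name", f"models[{index}]")
        for key in REQUIRED_MODEL_KEYS:
            if key not in model:
                problems.append(f"{label}: missing required key: {key}")
        for kind, values in model.get("seasonalities", {}).items():
            expected = SEASONALITY_LENGTHS.get(kind)
            if expected is not None and len(values) != expected:
                problems.append(
                    f"{label}: {kind} seasonality has {len(values)} values, expected {expected}"
                )
    # If any model defines metadata, every model should define it.
    with_metadata = [bool(m.get("metadata")) for m in models]
    if any(with_metadata) and not all(with_metadata):
        problems.append("metadata is defined for some models but not all")
    return problems


example = {
    "output": "5min",
    "models": [
        {"name": "penne x", "rate": 60, "per": "D",
         "seasonalities": {"daily": [1, 1, 1, 1, 1, 2]}},  # deliberately one value short
        {"name": "marinara x", "per": "D"},  # deliberately missing rate
    ],
}
for problem in validate_config(example):
    print(problem)
```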

**Note:** When forecast dimensions are used, a dependency resolves to the model that has the same dimensions as the 
model declaring it. In the example above, the `marinara x` dependency of `penne x` refers to the `marinara x` model 
whose dimensions match those of `penne x`, that is, a `store` of `ottawa`. 
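
The matching rule in this note can be sketched as a lookup on both model name and dimensions. This illustrates the described behaviour under the assumption that dimensions are compared as an unordered set of name/value pairs. `dimension_key` and `find_dependency` are hypothetical helpers, the `toronto` store is invented for the example, and the script's actual implementation may differ.

```python
def dimension_key(model: dict) -> frozenset:
    """Reduce a model's dimensions to an order-independent, hashable key."""
    return frozenset((d["name"], d["value"]) for d in model.get("dimensions", []))


def find_dependency(models: list, dependent: dict, dependency_name: str):
    """Return the model named dependency_name whose dimensions match dependent, or None."""
    wanted = dimension_key(dependent)
    for candidate in models:
        if candidate["name"] == dependency_name and dimension_key(candidate) == wanted:
            return candidate
    return None


models = [
    # "toronto" is a made-up second location used only for this illustration
    {"name": "marinara x", "dimensions": [{"name": "store", "value": "toronto"}]},
    {"name": "marinara x", "dimensions": [{"name": "store", "value": "ottawa"}]},
    {"name": "penne x", "dimensions": [{"name": "store", "value": "ottawa"}]},
]
penne = models[2]
match = find_dependency(models, penne, "marinara x")
print(match["dimensions"])  # the ottawa entry, not the toronto one
```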

## 3. Run the script

Once your configuration is complete and saved as `config.yaml`, you can run the synthetic data generation tool. The 
example below generates one year of synthetic data for the configuration defined in the included configuration file.

Note the length: `105120 = (60 minutes * 24 hours * 365 days) / 5 minutes`, which is the number of 5-minute periods 
in a 365-day year. The output frequency is defined in the configuration file as 5 minutes. Adjust the length to suit 
your requirements.

`./create_synthetic_data.py --start 2000-01-01 --length 105120` 
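
The length arithmetic above generalises to other output frequencies and durations. The sketch below computes a value for `--length` from a start and end date; `periods_between` and `FREQUENCY_WIDTHS` are hypothetical names, not part of the script. It also shows that because 2000 is a leap year, covering the whole of calendar year 2000 takes 366 days of periods rather than 365.

```python
from datetime import datetime, timedelta

# Fixed-width output frequencies only; Y and M are omitted because their widths vary.
FREQUENCY_WIDTHS = {
    "W": timedelta(weeks=1),
    "D": timedelta(days=1),
    "30min": timedelta(minutes=30),
    "15min": timedelta(minutes=15),
    "10min": timedelta(minutes=10),
    "5min": timedelta(minutes=5),
    "1min": timedelta(minutes=1),
}


def periods_between(start: datetime, end: datetime, frequency: str) -> int:
    """Number of whole periods of the given frequency from start (inclusive) to end (exclusive)."""
    return (end - start) // FREQUENCY_WIDTHS[frequency]


start = datetime(2000, 1, 1)
print(periods_between(start, start + timedelta(days=365), "5min"))  # 105120
print(periods_between(start, datetime(2001, 1, 1), "5min"))         # 105408
```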

## 4. Consume the data

* The generated dataset is saved to `ts.csv` by default (new rows are appended to this file)
* The generated metadata is saved to `ts.metadata.csv` by default (new rows are appended to this file)
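
Because both files are appended to, running the script twice leaves rows from both runs in the same file. A minimal sketch for clearing previous output before a fresh run follows. It assumes the default file names listed above are written to the current working directory, and `remove_previous_output` is a hypothetical helper, not part of the script.

```python
from pathlib import Path

DEFAULT_OUTPUTS = ("ts.csv", "ts.metadata.csv")


def remove_previous_output(directory: str = ".") -> list:
    """Delete the default output files if present and return the names that were removed."""
    removed = []
    for name in DEFAULT_OUTPUTS:
        path = Path(directory) / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed
```

Calling `remove_previous_output()` before invoking the script gives each run a clean pair of output files.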

### Known issues

***

Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.