# Set Up the Dataset

The repo (https://github.com/gsoh/VED) was installed and mounted in `./VED/` by the Notebook setup.
The dynamic data is stored in two 7-Zip archives (Part1 and Part2).

## This extraction probably only needs to be done once

First, install the tools needed to unpack the archives.

In [None]:
!sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
!sudo yum-config-manager --enable epel
!sudo yum install -y epel-release
!sudo yum install -y p7zip

## Now extract and join the archives
!mkdir -p DynamicData 

!7za x VED/Data/VED_DynamicData_Part1.7z
!7za x VED/Data/VED_DynamicData_Part2.7z

!mv *.csv DynamicData/

### the vehicle IDs are in xlsx files. Convert them
!pip install xlsx2csv

!mkdir -p StaticData

!xlsx2csv 'VED/Data/VED_Static_Data_ICE&HEV.xlsx' StaticData/ICEHEV.csv
!xlsx2csv 'VED/Data/VED_Static_Data_PHEV&EV.xlsx' StaticData/PHEVEV.csv

# Data Organization

The list of vehicle IDs and reference information is in `StaticData` in two CSV files: one for ICE and HEV vehicles and another for PHEV and EV vehicles. Note that the column names differ slightly between the two (for example, `Vehicle Type` versus `EngineType`).
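
A quick way to see exactly which columns differ is to compare the two sets of column names. This is only a sketch: `column_differences` is a hypothetical helper, the two tiny frames stand in for the real CSV files, and the `VehId` column name is an assumption that is not confirmed anywhere in this notebook.

```python
import pandas as pd


def column_differences(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Return the column names that appear in only one of the two frames."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
    }


# Tiny stand-in frames with made-up values, shaped like the two static files.
# "VehId" is an assumed column name used for illustration only.
ice_sample = pd.DataFrame({"VehId": [2], "Vehicle Type": ["ICE"]})
ev_sample = pd.DataFrame({"VehId": [10], "EngineType": ["PHEV"]})

diff = column_differences(ice_sample, ev_sample)
print(diff)
```

Running the same helper on the real `StaticData` frames would show which columns need renaming before the two are combined.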

## If the data has already been extracted, you can start here

In the `DynamicData` folder, each of the 22 files holds one week of telemetry data for the cars.

### first, let's combine the StaticData into a consistent dataframe

In [None]:
import pandas as pd

vehiclesICE = pd.read_csv("StaticData/ICEHEV.csv")
vehiclesEV = pd.read_csv("StaticData/PHEVEV.csv")

# rename the EngineType column to match the ICE dataframe
vehiclesEV = vehiclesEV.rename(columns={"EngineType":"Vehicle Type"})

# combine the two sets of vehicle data into one dataframe
# (ignore_index avoids duplicate row labels from the two source files)
vehicles = pd.concat([vehiclesICE, vehiclesEV], ignore_index=True)

vehicles.head()

### now, combine all the weeks of data into one dataframe

In [None]:
from os import listdir
from os.path import isfile, join

dataDirectory = 'DynamicData'

# DataFrame.append was removed in pandas 2.0, so read each weekly file
# and concatenate them all in a single call instead
weeklyFiles = sorted(f for f in listdir(dataDirectory) if isfile(join(dataDirectory, f)))

telemetry = pd.concat(
 (pd.read_csv(join(dataDirectory, w)) for w in weeklyFiles),
 ignore_index=True)
 
telemetry.head()

In [None]:
len(telemetry)

In [None]:
telemetry.columns

In [None]:
trips = telemetry['Trip'].unique()
len(trips)

## Summary

We now have a `telemetry` dataframe with 22M rows, but not every column is populated in every row, because some signals are EV-specific and others are ICE-specific.

There are 4000 trips with latitude/longitude data that can be used.
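
To put a number on how incomplete each column is, one option is to compute the fraction of non-null values per column. The sketch below uses a small made-up frame in place of `telemetry`, and `column_completeness` is a hypothetical helper name, not something from the VED repo.

```python
import numpy as np
import pandas as pd


def column_completeness(df: pd.DataFrame) -> pd.Series:
    """Fraction of non-null values per column, least complete first."""
    return df.notna().mean().sort_values()


# Made-up rows: trip 1 looks like an ICE car, trip 2 like an EV.
sample = pd.DataFrame({
    "Trip": [1, 1, 2, 2],
    "Engine RPM[RPM]": [900.0, 1200.0, np.nan, np.nan],
    "HV Battery SOC[%]": [np.nan, np.nan, 55.0, 54.5],
    "Vehicle Speed[km/h]": [0.0, 12.0, 30.0, 28.0],
})

completeness = column_completeness(sample)
print(completeness)
```

Calling `column_completeness(telemetry)` on the full dataframe should make the EV-only and ICE-only columns stand out as the ones with low fractions.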

In [None]:
import matplotlib.pyplot as plt

soc = telemetry['HV Battery SOC[%]'].dropna()

# soc is a Series, so there is no y column to select
soc.plot(kind='hist')
plt.show()

Lots of 0s, even after dropping the NaNs (which took the count from 22M to 3M). They could come from ICE rows, but let's strip out the 0s first.

In [None]:
# keep only readings above the 2% threshold
socMinThresh = soc[soc > 2]
socMinThresh.plot(kind='hist')
plt.show()

2% seems to be the right threshold for filtering out the near-zero readings.
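
One way to sanity-check a cutoff like this is to look at what share of readings falls at or below a few candidate thresholds. This is a rough sketch on made-up SOC values; `share_at_or_below` is a hypothetical helper and the candidate thresholds are arbitrary picks for illustration.

```python
import pandas as pd


def share_at_or_below(values: pd.Series, thresholds) -> pd.Series:
    """For each threshold, the fraction of readings at or below it."""
    return pd.Series({t: (values <= t).mean() for t in thresholds})


# Made-up SOC readings standing in for telemetry['HV Battery SOC[%]'].dropna()
soc_sample = pd.Series([0.0, 0.0, 1.5, 40.0, 62.5, 80.0, 95.0, 100.0])

shares = share_at_or_below(soc_sample, [0, 1, 2, 5])
print(shares)
```

If the share stops growing between two neighbouring thresholds, the readings below that point are probably a separate cluster rather than the low tail of real battery levels.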

## Explore Trip Telemetry

Pick a random trip and plot its telemetry values.

In [None]:
import random

tripID = random.choice(trips)
print(f"looking at trip #{tripID}")

tripData = telemetry[telemetry['Trip'] == tripID]
tripData

### plot the telemetry for this trip

In [None]:
# tripData.plot(kind='line', x='Timestamp(ms)', y='Vehicle Speed[km/h]') #, y2='Engine RPM[RPM]', y3='Long Term Fuel Trim Bank 1[%]')


tripTel = tripData[['Timestamp(ms)','Vehicle Speed[km/h]','Engine RPM[RPM]','Long Term Fuel Trim Bank 1[%]']]
tripTel.plot(kind="scatter", x='Timestamp(ms)', y='Vehicle Speed[km/h]')
tripTel.plot(kind="scatter", x='Timestamp(ms)', y='Engine RPM[RPM]')
tripTel.plot(kind="scatter", x='Timestamp(ms)', y='Long Term Fuel Trim Bank 1[%]')
tripTel.plot(kind='scatter', x='Engine RPM[RPM]', y='Vehicle Speed[km/h]')
# plt.show()

# Prep for replay

The most helpful next step is to split this dataset by trip so that a simulated device can replay each trip.

In [None]:
!mkdir -p TripData

In [None]:
tripDir = 'TripData'

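# group once instead of re-filtering the full dataframe for every trip,
# then write each trip's rows in time order to its own CSV file
for tripID, tripRows in telemetry.groupby('Trip'):
 tripRows.sort_values(by=['DayNum','Timestamp(ms)']).to_csv(join(tripDir, str(tripID) + ".csv"))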
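
To give an idea of how a simulated device might consume one of these per-trip files, here is a minimal sketch of a replay loop. It is an assumption about the replay side, not code from this project: `replay_trip`, its `speedup` parameter, and the injectable `sleep` argument are all hypothetical, and the rows are assumed to be in replay order already (the per-trip files are written sorted).

```python
import time

import pandas as pd


def replay_trip(trip: pd.DataFrame, speedup: float = 1.0, sleep=time.sleep):
    """Yield each row of one trip, pausing for the gap between timestamps.

    Rows are assumed to be in replay order already. `speedup` > 1 replays
    faster than real time; `sleep` can be swapped out for testing.
    """
    previous_ms = None
    for record in trip.to_dict("records"):
        current_ms = record["Timestamp(ms)"]
        if previous_ms is not None:
            sleep((current_ms - previous_ms) / 1000.0 / speedup)
        previous_ms = current_ms
        yield record


# Made-up rows standing in for pd.read_csv of one per-trip file
sample_trip = pd.DataFrame({
    "Timestamp(ms)": [0, 1000, 2500],
    "Vehicle Speed[km/h]": [0.0, 12.0, 18.5],
})

# Record the pauses instead of actually sleeping, replaying at 10x speed
pauses = []
messages = list(replay_trip(sample_trip, speedup=10.0, sleep=pauses.append))
print(len(messages), pauses)
```

In a real device simulator each yielded record would be published to whatever endpoint the device talks to; that part is left out here because it depends on the target platform.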


In [None]:
!aws s3 cp TripData s3://connected-vehicle-datasource/ --recursive --acl public-read

In [None]:
!ls TripData