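---
title: "Split dataset to Train and Test"
date: 2020-02-11T23:23:17-05:00
draft: false
algo: [prep]
---

### Files are in a folder, and I like linux commands

The easiest way to do this is with the Linux ```awk``` command. Suppose you have a file called ```in.csv``` or a directory of files that look like

```text
Folder
├── in1.csv
├── in2.csv
├── .
├── .
└── in2000.csv
```

The command below works on whole lines, so the delimiter does not matter. To send roughly 20% of the lines to ```test_data.csv``` and the rest to ```train_data.csv```, run (in a Jupyter notebook, prefix the command with ```!```):

```bash
awk '{if( rand() <= 0.2){ print $0 > "test_data.csv"} else {print $0 > "train_data.csv"}}' Folder/in*csv
```

That's it! Header rows, if any, are treated like any other line and end up in one of the two files. Also note that many awk implementations produce the same ```rand()``` sequence on every run unless ```srand()``` is called first.

To count the number of lines in the resulting csv files, do:

```bash
wc -l test_data.csv train_data.csv
```

### How about in Python?

```python
from numpy.random import RandomState
import pandas as pd

df = pd.read_csv('C:/Dataset.csv')
rng = RandomState()  # unseeded; pass an integer, e.g. RandomState(42), for a repeatable split

# For a 70-30 split: sample 70% of the rows for training...
train = df.sample(frac=0.7, random_state=rng)
# ...and keep every row whose index was not sampled for testing
test = df.loc[~df.index.isin(train.index)]

# Write both sets without the index column or the header row
train.to_csv('train.csv', index=False, header=False)
test.to_csv('test.csv', index=False, header=False)
```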
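
For the folder-of-files case, the awk one-liner can also be mirrored in plain Python without loading everything into memory. The following is a minimal sketch, not a definitive implementation: the function name `split_lines` and its parameters (`pattern`, `train_path`, `test_path`, `test_frac`, `seed`) are hypothetical names chosen for illustration. Like the awk command, it treats every line the same way, so header rows are not handled specially.

```python
import glob
import random


def split_lines(pattern, train_path, test_path, test_frac=0.2, seed=None):
    """Send each input line to the test file with probability test_frac,
    otherwise to the train file. Returns (n_train, n_test)."""
    rng = random.Random(seed)  # pass a seed for a repeatable split
    n_train = n_test = 0
    with open(train_path, "w") as train, open(test_path, "w") as test:
        # sorted() keeps the file order stable across runs
        for path in sorted(glob.glob(pattern)):
            with open(path) as src:
                for line in src:
                    # make sure the last line of a file ends with a newline
                    if not line.endswith("\n"):
                        line += "\n"
                    if rng.random() < test_frac:
                        test.write(line)
                        n_test += 1
                    else:
                        train.write(line)
                        n_train += 1
    return n_train, n_test
```

Assuming the folder layout above, a call such as `split_lines("Folder/in*.csv", "train_data.csv", "test_data.csv", seed=42)` would write the two files and return the line counts. Because each line is assigned independently, the split is only approximately 80-20, as with the awk version.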