---
title: "Select, drop or extract Columns"
date: 2020-02-11T23:23:17-05:00
draft: false
algo: [prep]
---

### I am using a SQL database

Use SQL select before dumping out files for your ML project

```SQL
SELECT column1, column2 ,column3 INTO OUTFILE '/tmp/out.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  FROM my_table;
```

Or, use [Athena as shown here](https://docs.aws.amazon.com/athena/latest/ug/querying.html)


### Files are in a folder, and I like linux commands

The easiest way to do this is using the linux cut command. Suppose you have a file called ```in.csv``` or a directory of files that look like 

```text
Folder
├── in1.csv
├── in2.csv
├── .
├── .
└── in2000.csv
```

... and assuming the delimiter used is a comma (,), and you want to select the first three columns, do

```html
cut -d "," -f1-3 Folder/in*.csv > outfile.csv
```

If you need a specific list of columns, do:

```html
cut -d "," -f1-10,20-25,30-33 Folder/in*.csv > outfile.csv
```

You can also use ```awk``` on linux and MacOS to select and reorder columns:

```html
awk -F "|" '{ print $1","$3","$5 }' Folder/in*csv > outfile.csv
```

Here, we read from a pipe-delimited file, select **and** reorder columns, and then write a comma separated file.

To remove additional characters, like a double-quote, add a ```{gsub(/\"/,"")};``` before the print keyword.

### How about in Python?
```python
import pandas as pd
df = pd.read_csv('in.csv')

#Select columns named column1 and column2
df[['column1', 'column2']].to_csv('out.csv')

#Or select using column number
df.iloc[:, 0:2].to_csv('out.csv')
```