---
title: "c. Run single node data preprocessing with Slurm"
date: 2020-09-04T15:58:58Z
weight: 20
tags: ["preprocessing", "data", "ML", "srun", "slurm"]
---

In this section, you will run a data preprocessing step using the `fairseq` command line tool and `srun`. Fairseq provides the `fairseq-preprocess` command, which creates a vocabulary and binarizes the training dataset. For more information on the **Fairseq** command line tools, refer to [the documentation](https://fairseq.readthedocs.io/en/latest/command_line_tools.html).

#### Creating a Preprocessing Script

Create a `fairseq-preprocess` script in the _/lustre_ shared folder with the following commands:

```bash
cd /lustre
export TEXT=/lustre/wikitext-103
cat > preprocess.sh << EOF
#!/bin/bash
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir /lustre/data/wikitext-103 \
    --workers 48
EOF
chmod +x preprocess.sh
```

The main arguments are the **destination directory** and the **workers count**. Take note of the **destination directory**: you will use it as the path to the training data in the coming sections. The **workers** argument parallelizes the data preprocessing over CPUs. The compute fleet runs _p3dn.24xlarge_ instances, which provide 48 physical cores (96 vCPUs), hence 48 workers; a variant that derives the worker count from the node itself is sketched at the end of this section. This script is available to all compute nodes in the _/lustre_ directory and can be executed through an `srun` command.

#### Executing the Preprocessing Script

Before running the preprocessing, check that **SLURM** is available and the queue is empty by running `sinfo -ls` and `squeue -ls`. At this stage you should have _ZERO_ compute nodes and an empty queue. To preprocess the data on a new compute node, run the following commands:

```bash
cd /lustre
srun --exclusive -n 1 preprocess.sh
```

The `srun` command requests an allocation for one task (`-n 1`) and runs the job on a node with no other jobs running (`--exclusive`). For more information and options to control jobs in **SLURM**, check the [`srun` documentation](https://slurm.schedmd.com/srun.html). An `sbatch`-based alternative that does not tie the job to your terminal is sketched at the end of this section.

You will see the output of the preprocessing script in your terminal. With 48 workers, the preprocessing completes in approximately 2 minutes once the compute instance has initialized. Since the cluster starts with _ZERO_ compute nodes, it takes around 7 minutes to start one. If AWS ParallelCluster is unable to provision new Spot instances, the request for new instances is retried periodically.

Once the job completes, you will see screen output similar to the following:

![Preprocessing data output](/images/ml/srun_preprocess.png)

The preprocessed data is available to all compute nodes in the _/lustre_ directory. Run the following command to examine the data; a quick sanity check of the binarized dataset is also sketched below:

`ls -alh /lustre/data/wikitext-103`

Next, run multi-node, multi-GPU training using the preprocessed data.
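
As referenced above, here is a minimal sketch of a portable variant of the preprocessing script that derives the worker count from the node it runs on via `nproc`, instead of hard-coding 48. The script name `preprocess_auto.sh` is just an example; quoting the heredoc delimiter defers expansion of `$TEXT` and `$(nproc)` until the script actually runs on the compute node:

```bash
cd /lustre
# 'EOF' is quoted, so $TEXT and $(nproc) are expanded at run time on the compute node
cat > preprocess_auto.sh << 'EOF'
#!/bin/bash
TEXT=/lustre/wikitext-103
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir /lustre/data/wikitext-103 \
    --workers $(nproc)
EOF
chmod +x preprocess_auto.sh
```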
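
`srun` runs the job interactively and ties it to your terminal session. If you would rather queue the preprocessing and disconnect, a sketch using `sbatch` follows; the output file name is an example, and `%j` is replaced by the Slurm job ID:

```bash
cd /lustre
# Queue the job; output is written to a file instead of the terminal
sbatch --exclusive -n 1 --output=preprocess.%j.out preprocess.sh
# Monitor the queue, then follow the output once the node is up
squeue -ls
tail -f preprocess.*.out
```

If needed, cancel the queued job with `scancel` followed by the job ID reported by `sbatch`.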
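
Beyond listing the files, you can sanity-check the binarized dataset through fairseq's Python API. This is a sketch assuming a recent fairseq release in which `Dictionary.load` and `data_utils.load_indexed_dataset` are available:

```bash
python - << 'PY'
from fairseq.data import Dictionary, data_utils

# Load the vocabulary written by fairseq-preprocess
vocab = Dictionary.load('/lustre/data/wikitext-103/dict.txt')
# Load the binarized training split (train.bin / train.idx)
train = data_utils.load_indexed_dataset('/lustre/data/wikitext-103/train', vocab)
print(f'vocabulary size: {len(vocab)}')
print(f'training examples: {len(train)}')
PY
```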