# Mutual Information Co-training

This repository is the source code for the paper:

**MICO: Selective Search with Mutual Information Co-training**

In Proceedings of the International Conference on Computational Linguistics (COLING), 2022

*Zhanyu Wang, Xiao Zhang, Hyokun Yun, Choon Hui Teo and Trishul Chilimbi*

## Introduction

This is the package of Mutual Information Co-training (MICO) for end-to-end topic sharding. MICO uses BERT to generate sentence representations, and performs query routing and document assignment on top of these representations. The document assignment module in MICO outputs almost equal-sized clusters, and the query routing module routes each query to the cluster containing most (if not all) of its relevant documents. MICO achieves very high performance for topic sharding. You can test this package through the example usage below.

## Usage

You can save the command below as a bash file and run it in the current folder; you can also find and run it as `./example/scripts/run_mico.sh`. It takes less than 5 minutes to finish, and the results are saved in `./example/results/`. For this example experiment, the output folder is `example_pair_BERT-finetune_layer-1_CLS_TOKEN_maxlen64_bs64_lr-bert5e-6_lr2e-4_warmup1000_entropy5_seed1`, which contains:

- the final evaluation metrics, saved in `metrics.json`;
- the documents assigned to the clusters, saved as a dictionary in `clustered_docs.json`;
- the training and evaluation log files, `*.log`;
- the trained model, `*.pt`;
- the folder `./log` with Tensorboard results for visualization.

The `dataset_name` in the training command is set to `example` because an example dataset is included in `./example/data/example_dataset/`. You can change `train_folder_path` and `test_folder_path` according to your needs.

During training, `batch_size` is per GPU card. If the current choice of `batch_size` works on a machine with one GPU, it does not need to change when switching to a machine with several GPUs (each with the same GPU memory). This is because we use `DistributedDataParallel` from `PyTorch` to support multi-GPU training: one sub-process is assigned to each GPU, and each sub-process maintains its own dataloader and counts its own epoch number (hence people usually focus on the iteration number instead of the epoch number). On a 4-GPU machine, finishing one epoch in each process means training the model for 4 epochs in total. For a GPU with 16GB memory, `batch_size=64` is a good first try.

During testing, we use `DataParallel` from `PyTorch` for better efficiency (with multiple GPUs we only pass through the dataset once in total, much less work than with `DistributedDataParallel`), and `batch_size` is shared across all GPUs. For testing you can usually set a much larger `batch_size` than in training; e.g., with four GPUs (each with 16GB memory), you can use `batch_size=2048`. You can also evaluate a trained model directly by setting `--eval_only`.
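To make the multi-GPU batch-size semantics concrete, here is a minimal, self-contained PyTorch sketch. It is not taken from the MICO code base: the toy dataset, the linear model, and names such as `train_worker` and `batch_size_per_gpu` are illustrative assumptions.

```python
# Minimal sketch (not from the MICO code base): batch-size semantics
# under DistributedDataParallel (training) vs. DataParallel (testing).
# Assumes a machine with at least one CUDA GPU.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def toy_dataset(n=1024, dim=768):
    # Toy stand-in for the real sentence-representation data.
    return TensorDataset(torch.randn(n, dim), torch.randint(0, 64, (n,)))


def train_worker(rank, world_size, batch_size_per_gpu=64):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(768, 64).to(rank), device_ids=[rank])
    # Each process builds its own sampler and dataloader, so `batch_size`
    # is per GPU, and one pass of this loop covers only 1/world_size of
    # the data: "one epoch per process" = world_size epochs in total.
    dataset = toy_dataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=batch_size_per_gpu, sampler=sampler)
    for x, y in loader:
        loss = torch.nn.functional.cross_entropy(model(x.to(rank)), y.to(rank))
        loss.backward()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)

    # Testing with DataParallel: a single large batch is split across all
    # GPUs, so `batch_size` here is global (e.g. 2048 on four 16GB GPUs).
    dp_model = torch.nn.DataParallel(torch.nn.Linear(768, 64).cuda())
    with torch.no_grad():
        _ = dp_model(torch.randn(2048, 768).cuda())
```

The full training command from `./example/scripts/run_mico.sh` follows.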
```bash
#!/bin/bash
dataset_name=example
train_folder_path=./example/data/${dataset_name}_train_csv/
test_folder_path=./example/data/${dataset_name}_test_csv/
batch_size=64
selected_layer_idx=-1
pooling_strategy=CLS_TOKEN
max_length=64
lr=2e-4
lr_bert=5e-6
entropy_weight=5
num_warmup_steps=1000
seed=1

model_path=./example/results/${dataset_name}_pair_BERT-finetune_layer${selected_layer_idx}\
_${pooling_strategy}\
_maxlen${max_length}\
_bs${batch_size}\
_lr-bert${lr_bert}\
_lr${lr}\
_warmup${num_warmup_steps}\
_entropy${entropy_weight}\
_seed${seed}/

python -u ./main.py \
    --model_path=${model_path} \
    --train_folder_path=${train_folder_path} \
    --test_folder_path=${test_folder_path} \
    --dim_input=768 \
    --number_clusters=64 \
    --dim_hidden=8 \
    --num_layers_posterior=0 \
    --batch_size=${batch_size} \
    --lr=${lr} \
    --num_warmup_steps=${num_warmup_steps} \
    --lr_prior=0.1 \
    --num_steps_prior=1 \
    --init=0.0 \
    --clip=1.0 \
    --epochs=1 \
    --log_interval=10 \
    --check_val_test_interval=10000 \
    --save_per_num_epoch=100 \
    --num_bad_epochs=10 \
    --seed=${seed} \
    --entropy_weight=${entropy_weight} \
    --num_workers=0 \
    --cuda \
    --lr_bert=${lr_bert} \
    --max_length=${max_length} \
    --pooling_strategy=${pooling_strategy} \
    --selected_layer_idx=${selected_layer_idx}
```

## Visualize results with Tensorboard

To visualize the curves of the metrics calculated during training and evaluation, please use Tensorboard (for `PyTorch` we use `tensorboardX`, which is installed in the setting-up section). The results for each experiment are saved in the folder specified by `--model_path` in the bash command; that folder also contains log files in text format. After running the following command, you can open your browser at `localhost:14095` to view the training results.

```bash
# start tensorboard
tensorboard --logdir=./example/results/ --port=14095
```

## Memory profiling

Although we have adopted several techniques to decrease memory usage, you may still run into memory problems when running on a large-scale dataset. You can use the memory profiling method below to estimate how much memory you will need to run MICO. Some tips:

1. Setting `num_workers=0` is a good way to save memory, and it barely affects the training speed.
2. Running MICO on more GPUs automatically creates more sub-processes, and each sub-process may consume a lot of memory, so memory usage increases linearly with the number of GPUs. If needed, you can set `export CUDA_VISIBLE_DEVICES=0` to use only 1 GPU in training and save memory.

To use the memory profiling method below, please make sure the Python package `memory_profiler` is installed (if not, you can install it with `pip install memory_profiler`). It tracks the memory usage of Python code; for more details, see https://pypi.org/project/memory-profiler/. To track the memory usage of a run, you can try the command below.

```bash
mprof run --interval=10 --multiprocess --include-children './your_bash_file.sh'
```
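Besides tracking a whole run with `mprof`, `memory_profiler` can also report line-by-line memory usage for individual functions through its `@profile` decorator. The sketch below illustrates this mode; `load_batch` is a hypothetical example function, not part of the MICO code.

```python
# Line-by-line memory profiling with memory_profiler.
# `load_batch` is a hypothetical example, not a MICO function.
from memory_profiler import profile

import numpy as np


@profile
def load_batch(n=100_000, dim=768):
    # When the function runs, memory_profiler prints the memory
    # increment attributed to each line below.
    x = np.random.randn(n, dim).astype(np.float32)  # ~300 MB retained
    y = x * 2.0                                     # another ~300 MB
    return float(y.sum())


if __name__ == "__main__":
    load_batch()
```

Running this script directly prints a table with the per-line memory increments.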
While `mprof run` is executing, you can plot the memory usage over time with the command below. Please replace `mprofile_***.dat` with the name of the profiling result you want to plot (the latest `.dat` file is used if no file is specified). The figure will be saved as `memory_profile_result.png`.

```bash
mprof plot -o memory_profile_result.png --backend agg mprofile_***.dat
```

## Setting up a new EC2 machine

To set up a new EC2 machine to run the scripts, please use the commands below.

```bash
wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
bash ./Anaconda3-2021.05-Linux-x86_64.sh
source ~/.bashrc
conda install pytorch=1.7.1 cudatoolkit=9.2 -c pytorch
pip install -r requirements.txt
pip install memory_profiler
```

After downloading the data, you can replace the two folders (for training and testing data) in `./example/data/` with your two large-scale datasets. Then, you can modify and run the script `./example/scripts/run_mico.sh`.

## License

This project is licensed under the Apache-2.0 License.