# To SMOTE, or not to SMOTE?

This package includes the code required to repeat the experiments in the paper and to analyze the results.

> To SMOTE, or not to SMOTE?
>
> Yotam Elor and Hadar Averbuch-Elor

## Installation

```
# Create a new conda environment and activate it
conda create --name to-SMOTE-or-not -y python=3.7
conda activate to-SMOTE-or-not
# Install dependencies
pip install -r requirements.txt
```

## Running experiments

The data is not included with this package. See an example of running a single experiment with a dataset from `imbalanced-learn`:

```python
# Load the data
import pandas as pd
import numpy as np
from imblearn.datasets import fetch_datasets

data = fetch_datasets()["mammography"]
x = pd.DataFrame(data["data"])
y = np.array(data["target"]).reshape((-1, 1))

# Run the experiment
from experiment import experiment
from classifiers import CLASSIFIER_HPS
from oversamplers import OVERSAMPLER_HPS

results = experiment(
    x=x,
    y=y,
    oversampler={
        "type": "smote",
        "ratio": 0.4,
        "params": OVERSAMPLER_HPS["smote"][0],
    },
    classifier={
        "type": "cat",  # CatBoost
        "params": CLASSIFIER_HPS["cat"][0],
    },
    seed=0,
    normalize=False,
    clean_early_stopping=False,
    consistent=True,
    repeats=1,
)

# Print the results nicely
import json
print(json.dumps(results, indent=4))
```

To run all the experiments in our study, wrap the above in loops, for example:

```python
for dataset in datasets:
    x, y = load_dataset(dataset)  # this functionality is not provided
    for seed in range(7):
        for classifier, classifier_hp_configs in CLASSIFIER_HPS.items():
            for classifier_hp in classifier_hp_configs:
                for oversampler, oversampler_hp_configs in OVERSAMPLER_HPS.items():
                    for oversampler_hp in oversampler_hp_configs:
                        for ratio in [0.1, 0.2, 0.3, 0.4, 0.5]:
                            results = experiment(
                                x=x,
                                y=y,
                                oversampler={
                                    "type": oversampler,
                                    "ratio": ratio,
                                    "params": oversampler_hp,
                                },
                                classifier={
                                    "type": classifier,
                                    "params": classifier_hp,
                                },
                                seed=seed,
                                normalize=...,
                                clean_early_stopping=...,
                                consistent=...,
                                repeats=...,
                            )
```

## Analyze

Read the results from the compressed CSV file. As the results file is large, it is tracked using [git-lfs](https://git-lfs.github.com/). You might need to install git-lfs or download the file manually.

```python
import os
import pandas as pd

data_path = os.path.join(os.path.dirname(__file__), "../data/results.gz")
df = pd.read_csv(data_path)
```

Drop NaNs and filter for experiments with consistent classifiers, no normalization, and a single validation fold:

```python
df = df.dropna()
df = df[
    (df["consistent"] == True)
    & (df["normalize"] == False)
    & (df["clean_early_stopping"] == False)
    & (df["repeats"] == 1)
]
```

Select the best HP configurations according to AUC validation scores. `opt_metric` is the key used to select the best configuration. For example, for a-priori HPs use `opt_metric="test.roc_auc"`, and for validation-HPs use `opt_metric="validation.roc_auc"`. Additionally, calculate the average score and rank:

```python
from analyze import filter_optimal_hps

df = filter_optimal_hps(
    df, opt_metric="validation.roc_auc", output_metrics=["test.roc_auc"]
)
print(df)
```

Plot the results:

```python
from analyze import avg_plots

avg_plots(df, "test.roc_auc")
```

## Citation

```
@misc{elor2022smote,
    title={To SMOTE, or not to SMOTE?},
    author={Yotam Elor and Hadar Averbuch-Elor},
    year={2022},
    eprint={2201.08528},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This library is licensed under the MIT-0 License.
See the LICENSE file.