This is the GitHub repo for the EE-449: Deep Learning Mini Project for Group 39.
All the package requirements for this project are kept in the environment.txt file and can be installed by running:
python3 -m pip install -r environment.txt
If using Gnoto, you should create a virtual environment using this file instead.
The CAD dataset augmented with chain-of-thought sequences can be accessed at the following HuggingFace link.
The LLM used to produce chain-of-thought sequences for training the model can also be accessed on HuggingFace here.
Finally, the hate speech classification model trained in this project is located here.
The structure of the repo is the following:
└── DL_project
├── code
│ ├── configs
│ │ └── default.yaml
│ ├── cot_augment.ipynb
│ ├── cot.py
│ ├── data_io.py
│ ├── main.ipynb
│ ├── models.py
│ ├── upload_data.py
│ └── utils.py
├── data
│ └── .gitkeep
├── models
│ └── .gitkeep
├── .gitignore
├── environment.txt
├── LICENSE
└── README.md
code: Contains all the scripts, objects and methods used in this project.
configs: Contains all configuration dictionary files.
default.yaml: The default configuration settings used in this project.
cot_augment.ipynb: The main Jupyter notebook used to generate chain-of-thought annotations for the CAD dataset using the pretrained Mistral-7B-Instruct-v0.3 model.
cot.py: Contains the tokenize_and_merge_dataset() function which, when called, prepares an annotated dataset for training.
data_io.py: Contains various functions which import and export data to and from the HuggingFace repo.
main.ipynb: The main script used to train and validate the model, detailed later.
models.py: Defines the HybridCoTMistral class which is the model developed in this project, as well as all related methods for initialization, training, validation, and explicit CoT generation.
utils.py: Defines the function for reading the configuration dictionary.
The default.yaml configuration file contains all the hyperparameters for creating and training the model, as well as options for loading and saving. The important parameters for ensuring the main.ipynb script runs on your machine are:
device: Default is cuda; make sure a CUDA-capable device is available on your machine.
load_model: Set this to True if you want to load the model already trained in this study, or False if you want the script to finetune the model from scratch.
load_from_huggingface: Should be True if load_model is True; if False, the script will attempt to find the pretrained model in your local /models/ directory, which is empty on a first run.
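Putting these together, an excerpt of configs/default.yaml might look like the following. Only the keys discussed above are shown; the surrounding entries and exact formatting are assumptions, so consult the actual file for the full set of options.

```yaml
# Illustrative excerpt of configs/default.yaml (other entries omitted)
device: cuda                  # set to a device available on your machine
load_model: True              # load the trained model instead of finetuning
load_from_huggingface: True   # pull weights from HuggingFace, not local /models/
```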
Most configuration parameters set by the user are in the configuration .yaml file. Within the main script, all parameters which the user must define are set as global variables in CAPITAL LETTERS:
CONFIG_FILE: Specifies the configuration .yaml file to import from /configs/ (leave as default unless you want to use a new config file).
SAMPLE_LIMIT: The number of samples to load from the CAD dataset for training and testing. Keep this number low to reduce the runtime of the script when checking it.
SPOT_CHECK: If True, prints the contents of the annotated CAD dataset at the specified CHECK_INDEX. If False, skips this.
ENABLE_TEST: If True, prints the CoT from the dataset at the specified TEST_INDEX and generates a CoT with the non-finetuned, freshly loaded model (for a comparison of its functionality). If False, skips this.
PROBE_INDEX: After training or loading a pretrained model, this selects the index in the CAD dataset for the sample at which to print the prompt, expected CoT, and the generated CoT of the trained model.
HATE_SPEECH: A custom string which you can define to test the model's CoT generation capabilities on any text of your choice.
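The sketch below illustrates how a few of the capitalized globals above might gate behaviour in main.ipynb. It is not the project's actual code: the dataset and the spot_check helper are hypothetical placeholders, shown only to clarify the roles of SAMPLE_LIMIT, SPOT_CHECK, and CHECK_INDEX.

```python
# Illustrative sketch only; not the code from main.ipynb.
SAMPLE_LIMIT = 2   # number of samples to load
SPOT_CHECK = True  # if True, print one annotated sample
CHECK_INDEX = 0    # index of the sample to inspect

def spot_check(dataset, index):
    """Print and return the annotated sample at `index` (hypothetical helper)."""
    sample = dataset[index]
    print(sample)
    return sample

# Stand-in for the annotated CAD dataset
full_dataset = [
    {"text": "example post A", "cot": "reasoning steps ..."},
    {"text": "example post B", "cot": "reasoning steps ..."},
    {"text": "example post C", "cot": "reasoning steps ..."},
]
dataset = full_dataset[:SAMPLE_LIMIT]  # honour SAMPLE_LIMIT

if SPOT_CHECK:
    spot_check(dataset, CHECK_INDEX)
```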
The main script and configuration file are currently configured to load a small subset of the training dataset, load the sequentially trained model from HuggingFace, save the model's weights locally, run the validation using a 20% split of the loaded data, and print two examples of CoT generation.
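The 20% validation split mentioned above can be sketched with a small stdlib-only helper. This is an assumption about how such a split could be done, not the project's actual split logic, which may differ (e.g. it may be handled by the datasets library).

```python
import random

# Hypothetical helper: shuffle indices and hold out a fraction for validation.
def train_val_split(samples, val_fraction=0.2, seed=0):
    """Split `samples` into (train, val) lists with `val_fraction` held out."""
    rng = random.Random(seed)  # seeded for reproducibility
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_val = int(len(samples) * val_fraction)
    val = [samples[i] for i in idx[:n_val]]
    train = [samples[i] for i in idx[n_val:]]
    return train, val

train, val = train_val_split(list(range(100)))
print(len(train), len(val))  # 80 20
```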
To see the script train the model instead of load a pretrained one, change this entry in the configuration file:
load_model: False # if False, trains a new model instead