This repository contains the data and code for the paper *Does Compressing Activations Help Model Parallel Training?* (MLSys'24). Our code is based on [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), developed by NVIDIA.
Installation 🛠️ • Data 🗃️ • Checkpoint ⚙️ • Quick Start 🚀 • Contributing 🐜
## Installation

To get started, please first set up the environment:

```bash
pip install -r requirements.txt --find-links https://download.pytorch.org/whl/torch_stable.html
```

We use Python 3.9 and CUDA 11.3. If you are using different versions of Python and CUDA, please make sure the torch build you install is compatible with them.
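For example, under CUDA 11.3 you can pin a matching wheel explicitly (the torch version below is only illustrative; use any cu113 build compatible with the repository's requirements):

```bash
# Example only: pick a torch build that matches your CUDA toolkit (cu113 here).
pip install torch==1.12.1+cu113 --find-links https://download.pytorch.org/whl/torch_stable.html
```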
To install apex, proceed with the following steps:

```bash
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 22.04-dev
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

## Data

We provide two examples illustrating how to prepare data for fine-tuning and pre-training, respectively.
For fine-tuning, download the GLUE dataset:

```bash
python download_glue_data.py
```

Download the vocabulary file:

```bash
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt
```

For pre-training, download the Wikipedia dataset:

```bash
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```

Preprocess the Wikipedia dataset:

```bash
python -m wikiextractor.WikiExtractor -o output --json enwiki-latest-pages-articles.xml.bz2
cd tools
bash preprocess_wiki.sh
```

Download the vocabulary files:

```bash
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
```

## Checkpoint

Download the checkpoints:
```bash
cd examples
mkdir checkpoints
cd checkpoints
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O megatron_bert_345m_v0.1_cased.zip
unzip megatron_bert_345m_v0.1_cased.zip -d bert_345m
```

Split the checkpoints:
```bash
cd tools
bash split_single.sh
```

Note: the tensor parallelism degree and pipeline parallelism degree used when splitting the checkpoint must match the configuration used for fine-tuning.
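As a rough illustration (not part of the repository's scripts), the parallel degrees chosen here also determine how many GPUs the fine-tuning launch needs; the variable names and values below are only examples:

```python
# Illustrative only: the product of the tensor- and pipeline-parallel degrees
# must divide the total number of GPUs; the remainder becomes data parallelism.
tensor_mp_degree = 2     # example degree used when splitting the checkpoint
pipeline_mp_degree = 2   # example degree used when splitting the checkpoint
world_size = 8           # total GPUs passed to the distributed launcher

assert world_size % (tensor_mp_degree * pipeline_mp_degree) == 0
data_parallel_degree = world_size // (tensor_mp_degree * pipeline_mp_degree)
print(data_parallel_degree)  # -> 2
```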
## Quick Start

With the checkpoint split as described above, we can fine-tune BERT-345M (BERT-Large):

```bash
cd examples
bash finetune_mrpc_distributed_with_mp.sh
```

To use checkpoints from Huggingface, proceed with these steps:
- Implement the Transformer-based model using the Transformer modules provided by Megatron-LM.
- Download and preprocess the Huggingface checkpoints (a sketch of the download step is given after this list).
- Split the checkpoints for fine-tuning.
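For the download step, one option is to fetch the checkpoint with the `transformers` library. This is only an illustration: the output directory below is a hypothetical example, and `preprocess_hf_bert_checkpoint.py` may instead fetch the checkpoint itself.

```python
# Illustrative only: save a Huggingface BERT checkpoint locally. The target
# directory "checkpoints/hf_bert_base_cased" is an example, not a path the
# repository's scripts require.
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model.save_pretrained("checkpoints/hf_bert_base_cased")
tokenizer.save_pretrained("checkpoints/hf_bert_base_cased")
```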
Below, we walk through an example. Since the BERT-Base model is already implemented in our repository, we only demonstrate the final two steps.
Download and preprocess the Huggingface checkpoints:

```bash
python preprocess_hf_bert_checkpoint.py
```

Split the checkpoints:

```bash
bash split_single_hf.sh
```

Fine-tune BERT-Base:

```bash
cd examples
bash finetune_mrpc_bert_base_with_mp.sh
```

## Contributing

To extend our repository to additional Huggingface models, the models need to be implemented on top of Megatron-LM. The steps are:
- Implement the parallel MLP and parallel attention (please refer to `megatron/model/transformer.py`; a minimal sketch is given after this list).
- Implement the language model using the parallel MLP and parallel attention (please refer to `megatron/model/language_model.py`).
- Implement the model using the above language model together with the embedding and head (please refer to `megatron/model/bert_model.py` or `megatron/model/gpt_model.py`).
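As a rough starting point for the first step, the sketch below shows the general shape of a tensor-parallel MLP built from Megatron-LM's `ColumnParallelLinear` and `RowParallelLinear` layers. It is a minimal sketch, not the repository's implementation; constructor arguments and return values differ across Megatron-LM versions, so please refer to `megatron/model/transformer.py` for the actual code.

```python
# Minimal sketch of a tensor-parallel MLP in the style of ParallelMLP in
# megatron/model/transformer.py. Exact signatures vary across Megatron-LM
# versions; treat this as an outline, not a drop-in module.
import torch
import torch.nn.functional as F
from megatron import mpu


class MyParallelMLP(torch.nn.Module):
    def __init__(self, hidden_size, ffn_hidden_size,
                 init_method, output_layer_init_method):
        super().__init__()
        # First projection is column-parallel: each tensor-parallel rank
        # holds a slice of the FFN weight, so no gather is needed here.
        self.dense_h_to_4h = mpu.ColumnParallelLinear(
            hidden_size, ffn_hidden_size,
            gather_output=False, init_method=init_method)
        # Second projection is row-parallel: the partial results from each
        # rank are reduced inside the layer.
        self.dense_4h_to_h = mpu.RowParallelLinear(
            ffn_hidden_size, hidden_size,
            input_is_parallel=True, init_method=output_layer_init_method)

    def forward(self, hidden_states):
        # Both layers return (output, bias); the returned bias is None
        # unless skip_bias_add is enabled.
        intermediate, _ = self.dense_h_to_4h(hidden_states)
        intermediate = F.gelu(intermediate)
        output, output_bias = self.dense_4h_to_h(intermediate)
        return output, output_bias
```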
Authors: Song Bian*, Dacheng Li*, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman

Affiliations: University of Wisconsin-Madison, Carnegie Mellon University, MBZUAI, and Petuum Inc.
If you find the idea or code useful for your research, please consider citing our paper:

```bibtex
@article{bian2023does,
  title={Does compressing activations help model parallel training?},
  author={Bian, Song and Li, Dacheng and Wang, Hongyi and Xing, Eric P and Venkataraman, Shivaram},
  journal={arXiv preprint arXiv:2301.02654},
  year={2023}
}
```