- Linux: Fully supported and tested
- Windows: Supported with WSL2 (Windows Subsystem for Linux) or native installation
- macOS: Not supported. CuPy requires the CUDA Toolkit, which is unavailable on macOS because NVIDIA GPUs are not supported there.
Estimated setup time is 10-20 minutes, depending on network speed and platform.
The algorithm performs evaluation/de novo sequencing at a speed of roughly 280 spectra per GPU (tested on an NVIDIA H200).
We provide both Docker and Conda setup guides; choose between Option A (Docker) and Option B (Conda) below:
Python version: 3.10
⚠️ Note: This Docker container includes only the runtime environment — no models are included! Please make sure you have downloaded the required models from the official or designated sources and loaded them correctly as described in the documentation.
To run this project with Docker, you can choose one of the following two options.
You can directly pull the pre-built image from Docker Hub:
# Pull the CUDA 11 version
docker pull guomics2017/massnet-dda:cuda11_v1.0
# Pull the CUDA 12 version
docker pull guomics2017/massnet-dda:cuda12_v1.0
# Change directory to MassNet-DDA
cd MassNet-DDA
# CUDA 11 version
docker build -t massnet-dda:cuda11_v1.0 . -f Dockerfile_cuda11
# CUDA 12 version
docker build -t massnet-dda:cuda12_v1.0 . -f Dockerfile_cuda12
Create a new conda environment first:
conda create --name XuanjiNovo python=3.10
This will create a new conda environment.
Activate this environment by running:
conda activate XuanjiNovo
Then install the dependencies:
pip install -r ./requirements.txt
You need to install torch from PyTorch (https://pytorch.org/). It is advisable to install the full PyTorch package following the official installation instructions: first determine the CUDA version on your system (you can check it with the nvcc --version command), then run the installation command that matches it. For example:
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
⚠️ Note: The torch version should be 2.1.0.
Install gcc and g++:
conda install -c conda-forge gcc==11.4
conda install -c conda-forge cxx-compiler
Then install ctcdecode, which is the package for CTC beam-search decoding:
cd ctcdecode-master
pip install .
cd ..
If pip install . fails with "No module named 'torch'", try installing with
pip install . --no-build-isolation
to avoid potential dependency conflicts.
(if there are no errors, ignore the next block and proceed to CuPy install)
**Troubleshooting**
If you encounter C++ compiler (gcc/g++) version errors in the step above, install gcc with the version pinned:
conda install -c conda-forge gcc_linux-64=11.4
Notes on installing ctcdecode:
When installing ctcdecode (via pip install .), make sure that your CUDA and GCC versions are compatible. For example, CUDA 12.1 officially supports up to GCC 12.2. You can find this information in the official CUDA documentation:
https://docs.nvidia.com/cuda/archive/12.1.0/cuda-installation-guide-linux/
If the versions are not compatible, the installation may fail.
Then install imputer-pytorch for CTC curriculum sampling:
cd imputer-pytorch
pip install -e .
cd ..
Lastly, install CuPy to use our CUDA-accelerated precise mass-control decoding:
Please install the CuPy package in an environment with GPU access. If you are using a Slurm server, this means entering an interactive session (e.g., via sbatch) before installing CuPy. If you are on a machine that already has a GPU (verify with nvidia-smi), you can install it directly.
Check your CUDA version with the nvidia-smi command; the CUDA version appears in the top-right corner of its output.
| cuda version | command |
|---|---|
| v10.2 (x86_64 / aarch64) | pip install cupy-cuda102 |
| v11.0 (x86_64) | pip install cupy-cuda110 |
| v11.1 (x86_64) | pip install cupy-cuda111 |
| v11.2 ~ 11.8 (x86_64 / aarch64) | pip install cupy-cuda11x |
| v12.x (x86_64 / aarch64) | pip install cupy-cuda12x |
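As a sketch of the table above, here is a small helper (illustrative only, not part of the project) that maps a CUDA version string to the matching CuPy pip package name:

```python
def cupy_package_for(cuda_version: str) -> str:
    """Map a CUDA version string (as shown by nvidia-smi) to a CuPy pip package.

    Follows the table above; raises for versions the table does not cover.
    """
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    if (major, minor) == (10, 2):
        return "cupy-cuda102"
    if major == 11:
        if minor == 0:
            return "cupy-cuda110"
        if minor == 1:
            return "cupy-cuda111"
        if 2 <= minor <= 8:
            return "cupy-cuda11x"
    if major == 12:
        return "cupy-cuda12x"
    raise ValueError(f"No CuPy wheel listed for CUDA {cuda_version}")

print(cupy_package_for("12.1"))  # cupy-cuda12x
print(cupy_package_for("11.8"))  # cupy-cuda11x
```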
Some of the important settings in config.yaml under ./XuanjiNovo:
n_beam: Number of CTC-paths (beams) considered during inference. We recommend a value of 40. Controls the beam width during decoding, with higher values exploring more potential sequences at the cost of increased computation time.
refine_iters: Number of refinement iterations performed during forward pass (default: 3). Controls how many times the model refines its predictions. Higher values can lead to more refined predictions but increase computation time.
mass_control_tol: This setting is only used when PMC_enable is True. The tolerance of the PMC-decoded mass
from the mass measured by MS, when the mass-control algorithm (PMC) is used. For example, if this is set to 0.1, we will
only obtain peptides whose mass falls in the range [measured_mass - 0.1, measured_mass + 0.1]. The measured mass is
calculated as: (pepMass - 1.007276) * charge - 18.01, where pepMass and charge are given by the input spectrum file (MGF).
PMC_enable: Whether to use PMC decoding unit or not, either True or False.
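To make the mass-tolerance arithmetic concrete, here is a minimal sketch of the formula above (the function names are illustrative, not XuanjiNovo's API):

```python
PROTON_MASS = 1.007276  # constants taken from the formula in the text above
WATER_MASS = 18.01

def measured_mass(pep_mass: float, charge: int) -> float:
    """Measured peptide mass from the MGF pepMass and charge fields."""
    return (pep_mass - PROTON_MASS) * charge - WATER_MASS

def pmc_window(pep_mass: float, charge: int, tol: float):
    """Mass range [measured - tol, measured + tol] that PMC-decoded peptides must fall in."""
    m = measured_mass(pep_mass, charge)
    return (m - tol, m + tol)

lo, hi = pmc_window(500.5, 2, 0.1)
print(f"{lo:.6f} .. {hi:.6f}")  # ~980.875448 .. 981.075448
```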
mask_schedule: Controls the dynamic masking schedule during training through three parameters:
- initial_peek (default: 0.93): Initial peek factor at the start of training
- epoch_decay (default: 0.01): How much to reduce peek factor per epoch
- min_peek (default: 0.00): Minimum peek factor (won't decay below this)
The peek factor at any epoch is calculated as: max(initial_peek - current_epoch * epoch_decay, min_peek). This schedule controls the curriculum learning process, where higher initial_peek means more teacher forcing at the start, higher epoch_decay means faster transition to self-learning, and higher min_peek maintains some minimum level of teacher forcing.
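The schedule above can be computed directly; a minimal sketch (the function name is illustrative):

```python
def peek_factor(epoch: int,
                initial_peek: float = 0.93,
                epoch_decay: float = 0.01,
                min_peek: float = 0.00) -> float:
    """Peek factor at a given epoch: max(initial_peek - epoch * epoch_decay, min_peek)."""
    return max(initial_peek - epoch * epoch_decay, min_peek)

# With the defaults, teacher forcing decays linearly and clamps at min_peek:
print(peek_factor(0))    # 0.93
print(peek_factor(50))   # ~0.43
print(peek_factor(100))  # 0.0 (0.93 - 1.00 clamps to min_peek)
```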
n_peaks: Number of the most intense peaks to retain; any remaining peaks are discarded. We recommend a value of 800.
min_mz: Minimum peak m/z allowed; peaks with smaller m/z are discarded. We recommend a value of 1.
max_mz: Maximum peak m/z allowed; peaks with larger m/z are discarded. We recommend a value of 6500.
min_intensity: Minimum peak intensity allowed; less intense peaks are discarded. We recommend a value of 0.0.
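The four preprocessing settings amount to a simple filter over each spectrum; a minimal sketch (not the project's actual preprocessing code, assumed behavior based on the descriptions above):

```python
def filter_peaks(mz, intensity, n_peaks=800, min_mz=1.0, max_mz=6500.0, min_intensity=0.0):
    """Apply the config.yaml spectrum-preprocessing rules to one spectrum.

    Keeps peaks inside [min_mz, max_mz] with intensity >= min_intensity,
    then retains only the n_peaks most intense ones, in ascending m/z order.
    """
    peaks = [(m, i) for m, i in zip(mz, intensity)
             if min_mz <= m <= max_mz and i >= min_intensity]
    # Keep the n_peaks most intense peaks, then restore ascending-m/z order.
    peaks = sorted(peaks, key=lambda p: p[1], reverse=True)[:n_peaks]
    return sorted(peaks)

mz = [0.5, 120.1, 250.7, 7000.0, 300.2]
inten = [10.0, 5.0, 0.0, 9.0, 7.0]
# 0.5 is below min_mz, 7000.0 is above max_mz, 250.7 is below min_intensity
print(filter_peaks(mz, inten, n_peaks=2, min_intensity=1.0))
# [(120.1, 5.0), (300.2, 7.0)]
```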
log_level: Controls the verbosity of the model's logging output. Available levels (from most to least verbose):
- DEBUG: Detailed information for debugging and development
- INFO: General information about model operation (default)
- WARNING: Only potentially concerning situations
- ERROR: Only error messages that affect operation
- CRITICAL: Only critical errors that prevent operation
For minimal logging output (e.g., in production), use ERROR or CRITICAL. For development and debugging, use DEBUG
or INFO. Example in config.yaml:
# Logging configuration
log_level: "ERROR"  # Minimal logging for production use
⚠️ Note: All the following steps should be performed under the main directory. Do not use cd XuanjiNovo.
We provide a script to download and verify model checkpoints and test data. The script supports both Google Drive and HTTPS mirrors, with SHA-256 checksum verification to ensure file integrity.
- Model Checkpoints:
XuanjiNovo_100M_massnet.ckpt: Base model trained on MassNet
XuanjiNovo_130M_massnet_massivekb.ckpt: Model finetuned on 30M MassiveKB
- Test Data:
Bacillus.10k.mgf: Test MGF file
# Install download dependencies
pip install gdown requests tqdm
# Download all files (uses Google Drive by default)
python download_data.py
# Download specific files
python download_data.py --files XuanjiNovo_100M_massnet.ckpt Bacillus.10k.mgf
# Use HTTPS mirrors instead of Google Drive
python download_data.py --prefer-mirror
# Specify output directory
python download_data.py --output-dir ./data
The script will:
- Download files from either Google Drive or HTTPS mirrors
- Verify SHA-256 checksums to ensure file integrity
- Skip already downloaded files if checksums match
- Automatically retry with alternative source if primary source fails
If you prefer manual downloads, you can use either Google Drive or Hugging Face links:
Google Drive Links:
Hugging Face Links: All files are also available in our Hugging Face repository:
After manual download, (optional) verify file integrity:
# View expected checksums
cat checksums.txt
# Calculate checksum of downloaded file
sha256sum XuanjiNovo_100M_massnet.ckpt
XuanjiNovo supports two ways to configure its behavior:
- config.yaml: The primary configuration file where you can set all parameters
- Command-line arguments: Override specific config.yaml settings for quick adjustments
Command-line arguments take precedence over config.yaml settings. For example, if you set n_beams: 40 in config.yaml
but run with --n-beams 60, the value 60 will be used.
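The precedence rule can be pictured as a simple dictionary merge; a sketch of the idea (not XuanjiNovo's actual option handling):

```python
def effective_config(yaml_config: dict, cli_overrides: dict) -> dict:
    """Merge config.yaml values with command-line overrides.

    CLI values win wherever both sources define the same key;
    keys the CLI leaves as None fall back to config.yaml.
    """
    merged = dict(yaml_config)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged

cfg = effective_config({"n_beams": 40, "refine_iters": 3},
                       {"n_beams": 60, "refine_iters": None})
print(cfg)  # {'n_beams': 60, 'refine_iters': 3}
```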
Here's the complete help output for command-line options:
Usage: python -m XuanjiNovo.XuanjiNovo [OPTIONS]
Options:
--mode TEXT Operation mode: 'eval' for labeled data evaluation,
'denovo' for unlabeled data analysis [required]
--peak_path TEXT Path to input peak files (MGF format) [required]
--model TEXT Path to model checkpoint file [required]
--config TEXT Path to config file (default: config.yaml)
--output TEXT Path to output directory (default: current directory)
--batch_size INTEGER Batch size for inference (default: from config)
--n_beams INTEGER Number of beams for CTC beam search (default: from config)
--pmc-enable/--no-pmc-enable
Enable/disable Precise Mass Control module (default: from config)
--mass_control_tol FLOAT Mass tolerance for PMC module (default: from config)
--refine_iters INTEGER Number of refinement iterations (default: from config)
--gpu TEXT Comma-separated list of GPU IDs to use (default: all available)
--help Show this help message and exit
Examples:
# Evaluate labeled data
python -m XuanjiNovo.XuanjiNovo --mode=eval --peak_path=./data.mgf --model=./model.ckpt
# De novo analysis of unlabeled data with custom settings
python -m XuanjiNovo.XuanjiNovo --mode=denovo --peak_path=./unknown.mgf --model=./model.ckpt \
    --pmc-enable --mass_control_tol=0.1 --refine_iters=3 --n_beams=40
# Use specific GPUs (multi-GPU requires torchrun)
torchrun --nproc_per_node=2 -m XuanjiNovo.XuanjiNovo --mode=eval --peak_path=./data.mgf --model=./model.ckpt --gpu=0,1
Notes:
- The 'eval' mode requires labeled data and provides evaluation metrics
- The 'denovo' mode is for unlabeled data analysis
- All parameters can be set in config.yaml and optionally overridden via command line
- Command-line arguments take precedence over config.yaml settings
- GPU memory requirements depend on batch size and number of beams
- For permanent changes, prefer updating config.yaml
- Use command-line arguments for temporary overrides or quick experiments
Below is an example command to run the container (using the CUDA 11 version):
docker run --gpus all --rm \
-v /local/data:/data \
massnet-dda:cuda11_v1.0 \
--mode=eval \
--peak_path=/data/bacillus.10k.mgf \
--model=/data/XuanjiNovo_100M_massnet.ckpt \
--output /data/output_cuda11_v1.0
⚠️ Note: The -v flag maps the directory /local/data on the host machine to the path /data inside the docker. Please change /local/data to the actual path on your local machine where the data and model weights are stored.
Execute the command with your desired options. For example:
# Basic evaluation with settings from config.yaml in the default directory
python -m XuanjiNovo.XuanjiNovo --mode=eval --peak_path=./Bacillus.10k.mgf --model=./XuanjiNovo_100M_massnet.ckpt
# De novo analysis
python -m XuanjiNovo.XuanjiNovo --mode=denovo --peak_path=./Bacillus.10k.mgf --model=./XuanjiNovo_100M_massnet.ckpt
The model supports both single-GPU and multi-GPU execution. For multi-GPU training and inference, you must use torchrun instead of python:
# Single GPU usage
python -m XuanjiNovo.XuanjiNovo --mode=eval --gpu=0 # Use only GPU 0
# Multi-GPU usage with torchrun
# Note: --nproc_per_node should match the number of GPUs you want to use
torchrun --nproc_per_node=2 -m XuanjiNovo.XuanjiNovo --mode=eval --gpu=0,1 # Use 2 GPUs (0,1)
torchrun --nproc_per_node=4 -m XuanjiNovo.XuanjiNovo --mode=eval --gpu=0,1,2,3 # Use 4 GPUs
torchrun --nproc_per_node=8 -m XuanjiNovo.XuanjiNovo --mode=eval # Use all 8 GPUs
# Examples with full command:
# Training on 2 GPUs
torchrun --nproc_per_node=2 -m XuanjiNovo.XuanjiNovo \
--mode=train \
--peak_path=./training_data.mgf \
--model=./model.ckpt \
--gpu=0,1
# Evaluation on 4 GPUs
torchrun --nproc_per_node=4 -m XuanjiNovo.XuanjiNovo \
--mode=eval \
--peak_path=./test_data.mgf \
--model=./model.ckpt \
    --gpu=0,1,2,3
Important notes for multi-GPU usage:
- The --nproc_per_node value must match the number of GPUs specified in --gpu
- Training batch size is automatically scaled by the number of GPUs
- Data is distributed across GPUs using PyTorch's DistributedDataParallel
- Memory usage is balanced across selected GPUs
- For optimal performance, ensure all specified GPUs have similar specifications
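The first two notes above can be sketched as small checks (illustrative helpers under the stated assumptions, not part of the project):

```python
def check_gpu_args(nproc_per_node: int, gpu: str) -> list:
    """Validate the first rule above: one torchrun process per GPU listed in --gpu."""
    ids = [int(g) for g in gpu.split(",")]
    if len(ids) != nproc_per_node:
        raise ValueError(
            f"--nproc_per_node={nproc_per_node} but --gpu lists {len(ids)} GPUs")
    return ids

def effective_batch_size(per_gpu_batch: int, n_gpus: int) -> int:
    """Second rule: the training batch size scales with the number of GPUs."""
    return per_gpu_batch * n_gpus

print(check_gpu_args(2, "0,1"))     # [0, 1]
print(effective_batch_size(32, 2))  # 64
```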
For developers who want to contribute to the project, we have additional tools for code quality and consistency:
- Install development dependencies:
pip install -r requirements-dev.txt
- Set up pre-commit hooks:
pre-commit install
This will set up automatic checks that run before each commit:
- Code formatting with black
- Import sorting with isort
- Linting with flake8
- Static type checking with mypy
- Other checks like trailing whitespace and file endings
To run the checks manually on all files:
pre-commit run --all-files
To ensure the correctness of core components, we provide unit tests based on the pytest framework.
- Install pytest:
pip install pytest
- Run a specific test (e.g., de novo peptide processing):
pytest tests/test_denovo.py::test_denovo_process -v -s
For detailed documentation of the MSDT format, please refer to https://guomics.com/software/massnet/exp.html
