This project contains 4 workflows:
-
(1) to run hybrid assembly of genomes using both long and short sequencing reads;
-
(2) to run annotation of resistance genes, direct repeats and IS elements;
-
(3) to run genetic analysis of heteroresistant mutants;
-
(4) to run core-genome based phylogeny.
And several R notebooks to do machine learning, EDA and model comparison.
Snakemake workflow management system is required to run the workflows. Refer to the official documentation for installation instructions. The safest way to reproduce the results is to run the analysis inside containers. To build the required Singularuty/Apptainer containers read this. If you opt to not use containers, Snakemake will install and deploy all required software automatically using Mamba.
To download the code to your computer, clone the repository:
git clone https://github.com/andrewgull/HeteroRThen navigate to the HeteroR directory (in the following steps, we will assume that you are inside this directory).
-
The sequencing reads used for this project are deposited in NCBI SRA database under project number PRJNA1165464. Important naming convention: long reads have
.fastq.gzextension and paired short reads have.fq.gzextension. By default reads downloaded from SRA have names like "SRR followed by 8 digits", this project assumes theat all the read files are named after the in-house naming scheme which is "DA followed by 6 digits". To determine how the SRR numbers correspond to DA numbers, refer to the table in SRA Run Selector or toconfigs/da_srr_mapping.csv. The sequencing data can be placed anywhere on your computer. -
One more piece of data you need is CARD database. You can use the version we used for this project which is in
localDB.tgzarchive. Just unpack it and make sure it's inside the project's directory, i.e. insideHeteroR:
tar -xf localDB.tgzAlternatively, you can download the newest version of the database from the CARD website:
wget https://card.mcmaster.ca/latest/dataIf you choose to do so, the final results of the analysis might look a little different.
Once you have installed snakemake and downloaded the data (and built the containers), you can run the workflows.
Execution of the workflows is governed by the configuration files in the configs directory. The main config.yaml defines which workflow to run and where to find the data and container images.
You will need to edit this file to point to the location of the raw sequencing data.
To run the assembly pipeline, use this command:
snakemake --config run_assembly=True --use-apptainer --apptainer-args "--bind /path/to/data" --cores 10Then you can run the annotation pipeline:
snakemake --config run_annotation=True --use-apptainer --apptainer-args "--bind /path/to/data" --cores 10Alternatively, you can use the apptainer profile config file profiles/apptainer/config.yaml to avoid typing apptainer args every time:
snakemake --config run_annotation=True --profile profiles/apptainerAfter these pipelines have finished, you can run the three R notebooks (but not necessarily all of them):
- to generate features table:
notebooks/modelling/features.qmd(also, a pre-compiled table is available herenotebooks/modelling/data/features_strain.csv) - for exploratory data anlysis:
notebooks/modelling/EDA.qmd(pre-compiled HTML file is available herenotebooks/modelling/EDA.html.gz) - to run training and validation:
notebooks/modelling/training_and_validation.Rmd(pre-compiled HTML file is available herenotebooks/modelling/training_and_validation.html.gz) - for comparison and analysis of the models:
notebooks/modelling/models_analysis.Rmd(pre-compiled HTML file is available herenotebooks/modelling/models_analysis.html.gz)
To ensure that you use the same versions of R packages as were used in these notebooks, install renv package and run renv::restore() (here you can find renv documentation).
Overall, the idea is the same - just set the workflow name to True(and don't run more than one workflow at a time):
# mutants
snakemake --config run_mutants=True --profile profiles/apptainer
# phylogeny
snakemake --config run_phylogeny=True --profile profiles/apptainerNB: to find NCBI accession numbers of the reference strains used for the phylogenetic analysis, refer to file configs/strain_mapping/reference_strains.csv.
The raw sequencing reads used in this project are available from NCBI SRA under BioProjects
- PRJNA1165464 (474 parental strains),
- PRJNA1083935 and PRJNA1160527 (mutants).
The pre-compiled features table is available under notebooks/modelling/data/features_strain.csv
The final models (trained LLR and GBT) are available under notebooks/modelling/models.
- Assembly
- Annotation
- Mutants
- Phylogeny
Read this in case the assembly script crashes due to system library problems
(1) Hybrid assembly script will not work if your system doesn't have libgsl.so.25 and libcblas.so.3 (both required by bcftools which is required by medaka=1.6.0).
For some reason, the wrong versions of these libraries are installed by conda when solving the environment. In such case use files in workflow/libs - copy or link them to the environment's lib directory.
This should not be a problem if you use containers.



