A Python implementation of a part of compareDEtools.
comPyDEtools can ...
- Generate simulated datasets (KIRC, Bottomly, mKdB or mBdK)
- Run DE analysis (using `subprocess.run()`)
- Generate figures like Fig 2 in Baik 2020
and can't ...
- SEQC benchmark (like Fig 1 in Baik 2020)
- False positive count comparison (like Fig 3 in Baik 2020)
- etc
pip install git+https://github.com/136s/comPyDEtools.git
1. Make a condition file like `compydetools/data/synthetic_conditions.yaml`.
2. Run

       python -m compydetools condition.yaml  # specify your condition file made at step 1.

   or run in Python

       from compydetools.condition import CONDITION, set_condition
       from compydetools.core import Paper
       from compydetools.utils import run_commands

       set_condition("condition.yaml")  # specify your condition file made at step 1.
       paper = Paper(nrep=CONDITION.nrep)
       paper.generate_datasets()
       for anal_res in run_commands(CONDITION.analysis.cmds):
           print(anal_res)
       paper.make()

3. Check generated files
   - `input/`: simulated RNA-seq data
     - dataset structure
       - first line is the header
       - `Gene_ID` column: sequential numbers from 1 to the number of genes
       - `Gene_Symbol` column: "LOC" + `Gene_ID`
       - `Description` column: "up" (upregulated), "dn" (downregulated) or "ns" (not significant)
       - remaining columns: simulated expression counts for each sample; sample names are "TRT-*" (treatment sample) or "CTRL-*" (control sample), where * is a sequential number within each condition
     - dataset property
       - file path: `{simul_data}_{disp_type}_upFrac{frac_up}_{nsample}spc_{outlier_mode}_{nde}DE/{simul_data}_{disp_type}_upFrac{frac_up}_{nsample}spc_{outlier_mode}_{nde}DE_rep{seed}.tsv`
       - newline character: LF
       - encoding: UTF-8
   - `result/`: plots of performance comparison
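The dataset layout described above can be read with the standard library alone. The following is a minimal sketch, not code from comPyDEtools: the `read_dataset` helper and the tiny inline table are hypothetical, and it assumes the counts are written as integers.

```python
import csv
import io

# Hand-written rows in the documented layout (illustrative, not real simulation output).
EXAMPLE_TSV = (
    "Gene_ID\tGene_Symbol\tDescription\tTRT-1\tTRT-2\tCTRL-1\tCTRL-2\n"
    "1\tLOC1\tup\t120\t135\t40\t38\n"
    "2\tLOC2\tns\t15\t12\t14\t16\n"
    "3\tLOC3\tdn\t5\t7\t60\t55\n"
)

ANNOTATION_COLUMNS = ("Gene_ID", "Gene_Symbol", "Description")


def read_dataset(handle):
    """Parse a dataset into (rows, treatment_columns, control_columns)."""
    reader = csv.DictReader(handle, delimiter="\t")
    sample_columns = [c for c in reader.fieldnames if c not in ANNOTATION_COLUMNS]
    treatment = [c for c in sample_columns if c.startswith("TRT-")]
    control = [c for c in sample_columns if c.startswith("CTRL-")]
    rows = []
    for record in reader:
        row = {
            "Gene_ID": int(record["Gene_ID"]),
            "Gene_Symbol": record["Gene_Symbol"],
            "Description": record["Description"],
        }
        for column in sample_columns:
            row[column] = int(record[column])  # assumption: integer counts
        rows.append(row)
    return rows, treatment, control


rows, treatment, control = read_dataset(io.StringIO(EXAMPLE_TSV))
# Genes labelled "up" or "dn" are the true DEGs of the simulation.
true_degs = [r["Gene_Symbol"] for r in rows if r["Description"] in ("up", "dn")]
print(treatment, control, true_degs)
```

For a file on disk, the same helper can be called with `open(path, encoding="utf-8", newline="")` in place of the `io.StringIO` object.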
- `analysis`: configuration of DE analysis
  - `cmds`: a list of DE analysis commands
  - `res`: a regular expression matching the paths of result files
    - "{count_stem}" is replaced by the dataset path stem
    - "{method_type}" is replaced by `method_type`
  - `de_true`: column name of the true DEG regulation (up, dn or ns) in each result file (defaults to "Description")
  - `de_score`: column name of the DEG score, such as a p-value, in each result file (defaults to "padj")
  - `de_score_threshold`: threshold of `de_score` (genes whose `de_score` is lower than `de_score_threshold` are called DEGs)
- `dirs`: directories of generated files
  - `dataset`: generated simulated datasets
  - `result`: plots of performance comparison, CSV files of metrics values and a pickle of the `Paper` instance
- `simul_data`: KIRC, Bottomly, mKdB or mBdK
- `disp_type`: same or different
- `frac_up`: fraction of upregulated genes among DEGs (float, $[0, 1]$)
- `nsample`: number of samples per group (int, $\geq 3$)
- `outlier_mode`: D, R, OS, or DL
- `pde`: percent of DE genes among all genes (float, $(0, 100]$)
- `metrics_type`: auc, tpr, fdr, cutoff, f1score or kappa
  - if you want to add a metric, modify `const.Metrics` and `utils.calc_metrics()` in a fork or a PR
- `method_type`: specify your DE analysis method (defaults to {"deseq2": "Deseq2"})
  - comPyDEtools recognizes the type of DE analysis method only by the output folder path (`analysis.res` in the condition file)
- `nrep`: number of simulation repetitions under one condition (int, $\geq 3$)
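The value ranges and the two `res` placeholders listed above can be illustrated with a small sketch. Everything here is an assumption made for illustration: `check_condition` and `expand_res` are hypothetical helpers, not functions of comPyDEtools, the condition is modelled as a plain dict holding a list of values per key, and the `res` pattern is invented. The real layout is the one in `compydetools/data/synthetic_conditions.yaml`.

```python
import re

# Allowed values as documented for the condition file.
SIMUL_DATA = {"KIRC", "Bottomly", "mKdB", "mBdK"}
OUTLIER_MODE = {"D", "R", "OS", "DL"}


def check_condition(cond):
    """Return a list of problems found in a condition mapping (empty if none)."""
    problems = []
    for value in cond.get("simul_data", []):
        if value not in SIMUL_DATA:
            problems.append(f"simul_data: unknown value {value!r}")
    for value in cond.get("outlier_mode", []):
        if value not in OUTLIER_MODE:
            problems.append(f"outlier_mode: unknown value {value!r}")
    for value in cond.get("frac_up", []):
        if not 0 <= value <= 1:
            problems.append(f"frac_up: {value} is outside [0, 1]")
    for value in cond.get("nsample", []):
        if value < 3:
            problems.append(f"nsample: {value} is smaller than 3")
    for value in cond.get("pde", []):
        if not 0 < value <= 100:
            problems.append(f"pde: {value} is outside (0, 100]")
    if "nrep" in cond and cond["nrep"] < 3:
        problems.append(f"nrep: {cond['nrep']} is smaller than 3")
    return problems


def expand_res(res, count_stem, method_type):
    """Fill the two documented placeholders, then compile the result as a regex."""
    # Escaping the inserted text is a choice of this sketch, so that "." in a
    # dataset stem is matched literally.
    filled = res.replace("{count_stem}", re.escape(count_stem))
    filled = filled.replace("{method_type}", re.escape(method_type))
    return re.compile(filled)


condition = {
    "simul_data": ["KIRC"],
    "frac_up": [0.5, 0.7],
    "nsample": [3, 10],
    "outlier_mode": ["D"],
    "pde": [5, 30],
    "nrep": 3,
}
print(check_condition(condition))

# Hypothetical pattern and stem, only to show the substitution.
pattern = expand_res(r"output/{method_type}/{count_stem}.*\.tsv", "KIRC_rep1", "deseq2")
print(bool(pattern.fullmatch("output/deseq2/KIRC_rep1_result.tsv")))
```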
erDiagram
Paper |o--|{ Figure : "has a list of"
Figure |o--|{ Plot : "has a list of"
Plot ||--|{ DataPool : "has a list of"
DataPool ||--|{ Dataset : "has a list of"
DataPool ||--|{ Result : "has a list of"
Dataset ||--|| Result : ""
Paper {
int nrep "number of repetition in a data pool (3<=)"
int seed "global random seed"
list[Figure] figures
}
Figure {
Simul simul_data PK "simulation data (KIRC, Bottomly, mKdB or mBdK)"
Disp disp_type PK "dispersion type (same or different)"
float frac_up PK "fraction upregulated ([0, 1])"
list[Plot] plots
}
Plot {
int nsample PK "number of samples per condition (3<=)"
Outlier outlier_mode PK "outlier mode (D, R, OS, or DL)"
list[DataPool] datapools
}
DataPool {
float pde PK "percent of DE in all genes ((0, 100])"
list[Dataset] datasets
list[DataPool] datapools
}
Dataset {
int seed PK "random seed for each dataset generated from global seed"
DataFrame counts "simulated count matrix"
}
Result {
int seed PK "random seed for each dataset"
list[Method] method_types "a list of DE analysis methods to be compared"
list[Metrics] metrics_types "a list of metrics to compare DE analysis methods"
}
- `Paper` class represents all figures in the condition file
- `Figure` class represents a figure (like Fig 2)
- `Plot` class represents a subfigure (like Fig 2A)
- `DataPool` class represents the datasets generated under the same condition (contains `nrep` datasets)
- `Dataset` class represents a simulated count matrix
- `Result` class represents the results of a `Dataset` under each method and metric
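The nesting described above (a paper holds figures, a figure holds plots, a plot holds data pools, a data pool holds `nrep` datasets) can be sketched with plain dataclasses. These are simplified stand-ins written for illustration, not the classes in `compydetools.core`. The dict-of-lists condition and the way per-dataset seeds are drawn from the global seed with `random.Random` are assumptions of this sketch.

```python
import itertools
import random
from dataclasses import dataclass, field


@dataclass
class DataPool:
    pde: float
    seeds: list  # one seed per dataset, nrep entries


@dataclass
class Plot:
    nsample: int
    outlier_mode: str
    datapools: list = field(default_factory=list)


@dataclass
class Figure:
    simul_data: str
    disp_type: str
    frac_up: float
    plots: list = field(default_factory=list)


@dataclass
class Paper:
    nrep: int
    seed: int
    figures: list = field(default_factory=list)


def build_paper(cond, nrep, seed):
    """Expand a condition mapping into the Paper > Figure > Plot > DataPool tree."""
    rng = random.Random(seed)  # global seed; dataset seeds are drawn from it
    paper = Paper(nrep=nrep, seed=seed)
    figure_keys = itertools.product(cond["simul_data"], cond["disp_type"], cond["frac_up"])
    for simul_data, disp_type, frac_up in figure_keys:
        figure = Figure(simul_data, disp_type, frac_up)
        for nsample, outlier_mode in itertools.product(cond["nsample"], cond["outlier_mode"]):
            plot = Plot(nsample, outlier_mode)
            for pde in cond["pde"]:
                seeds = [rng.randrange(2**32) for _ in range(nrep)]
                plot.datapools.append(DataPool(pde, seeds))
            figure.plots.append(plot)
        paper.figures.append(figure)
    return paper


example = {
    "simul_data": ["KIRC", "Bottomly"],
    "disp_type": ["same"],
    "frac_up": [0.5],
    "nsample": [3, 10],
    "outlier_mode": ["D"],
    "pde": [5, 10, 30],
}
paper = build_paper(example, nrep=3, seed=0)
n_datasets = sum(
    len(pool.seeds)
    for figure in paper.figures
    for plot in figure.plots
    for pool in plot.datapools
)
print(len(paper.figures), n_datasets)
```

With this example condition the tree has 2 figures, 2 plots per figure, 3 data pools per plot and 3 datasets per pool, so the number of datasets to simulate is the product of the lengths of all condition lists times `nrep`.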
| property \ Class | Paper | Figure | Plot | DataPool | Dataset | Result |
|---|---|---|---|---|---|---|
| a list of | Figure | Plot | DataPool | Dataset, Result | | |
| number of repetitions (`nrep`) | 1 | 1 | 1 | 1 | | |
| simulation data (`simul_data`) | | 1 | 1 | 1 | 1 | 1 |
| dispersion type (`disp_type`) | | 1 | 1 | 1 | 1 | 1 |
| fraction upregulated (`frac_up`) | | 1 | 1 | 1 | 1 | 1 |
| number of samples (`nsample`) | | | 1 | 1 | 1 | 1 |
| outlier mode (`outlier_mode`) | | | 1 | 1 | 1 | 1 |
| percent of DE in all genes (`pde`) | | | | 1 | 1 | 1 |
| simulated count matrix | | | | | 1 | 1 |
| method type (`method_type`) | | | | | | * |
| metrics type (`metrics_type`) | | | | | | * |

Table: Class / property correspondence (*: many)
- `Simul` class is a list of simulation dataset names (`simul_data` in the condition file)
- `Disp` class is a list of dispersion conditions (`disp_type` in the condition file)
- `Outlier` class is a list of outlier modes (`outlier_mode` in the condition file)
- `Metrics` class is a list of performance comparison metrics (`metrics_type` in the condition file)
- `Method` class is a list of DE analysis methods (`method_type` in the condition file)
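To make the metrics concrete, the sketch below applies the rule stated for the condition file (a gene is called a DEG when its `de_score` is lower than `de_score_threshold`) and derives TPR, FDR and F1 from the true labels in the `de_true` column. It is an illustration under these assumptions only. The helper names are hypothetical, and `utils.calc_metrics()` may compute the metrics differently.

```python
def confusion_counts(truth, scores, threshold):
    """Count TP, FP, FN, TN from true labels ("up", "dn", "ns") and DE scores."""
    tp = fp = fn = tn = 0
    for label, score in zip(truth, scores):
        called = score < threshold  # a NaN score compares False, so it is not called
        is_deg = label in ("up", "dn")
        if called and is_deg:
            tp += 1
        elif called:
            fp += 1
        elif is_deg:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def tpr_fdr_f1(tp, fp, fn):
    """Return (TPR, FDR, F1); a ratio with an empty denominator is reported as 0.0."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return tpr, fdr, f1


# Made-up labels and adjusted p-values for five genes.
truth = ["up", "dn", "ns", "ns", "up"]
scores = [0.01, 0.20, 0.03, 0.80, 0.04]
counts = confusion_counts(truth, scores, threshold=0.05)
metrics = tpr_fdr_f1(*counts[:3])
print(counts, metrics)
```

In this toy input two of the three true DEGs fall below the threshold and one non-DEG does too, so the counts are 2 true positives, 1 false positive, 1 false negative and 1 true negative.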
This is a partial port of unistbig/compareDEtools.