README

Introduction

The main goal of this project is to analyze the relationship between medullary thyroid carcinoma (MTC) and associated gene/clinical data using machine learning. We employ a RandomForestClassifier to classify and evaluate models on various test datasets, and we use SHAP to interpret feature importance.

Core Workflow

Data Loading: Load gene features and clinical data from multiple Excel files.
Data Preprocessing: Fill missing values, convert categorical variables (e.g., tumor_grade, sex, Heredity, T, N, M, TNM stage) into numerical codes.
Feature Selection: Use RandomForestClassifier together with SelectFromModel to identify the most discriminative features.
Model Training & Evaluation:
- Run K-fold cross-validation to choose the best model on the training set.
- Evaluate on independent test sets to output performance metrics (precision, recall, confusion matrix, AUC, etc.).
SHAP Interpretability: Compute SHAP values to evaluate the contribution and impact of each feature on model predictions.

Project Structure

your_project/
│
├─ main.py         # Python script containing the core logic
└─ README.md

Note: The actual names of the Excel files depend on your local files.
The script assumes they reside in the data/ folder.

Requirements

Python 3.7+
Key Libraries:
- pandas
- numpy
- scikit-learn
- shap
- argparse

Install the dependencies via pip:

pip install pandas numpy scikit-learn shap argparse

Usage

Clone or Download
Clone or download this project to your local environment. Ensure the data/ directory contains the required Excel files.

Install Dependencies

pip install -r requirements.txt

Or manually:

pip install pandas numpy scikit-learn shap argparse

Run the Script
From the terminal, navigate to the project folder and execute:
```
python main.py --seed 42
```
Argument details:
- --seed: Random seed (default is 10).
Output
The script outputs:
- Cross-validation scores and the best validation score.
- Test set performance metrics (precision, recall, confusion matrix, AUC).
- Prediction results (predicted labels, probabilities, true labels).
- Selected important features.
- SHAP values for feature interpretation.

Key Functions

load_data()
Loads data from multiple Excel files into dictionaries for further processing.
print_metrics(y_true, y_pred, y_pred_proba)
Calculates and displays classification metrics including precision, recall, confusion matrix, and AUC.
print_metrics_conf_matrix(y_test, y_pred, y_pred_proba)
Similar to print_metrics but emphasizes the confusion matrix.
train(gene_discovery_data, gene_test1_data, gene_test2_data)
Trains a RandomForestClassifier with K-fold cross-validation, selects the best model, and evaluates on test datasets.
Clinical_train_with_slectfeature(gene_discovery_data, gene_test1_data, gene_test2_data, max_features=10, random_state=1)
Performs feature selection, model training, evaluation, and SHAP-based feature importance analysis.

Interpreting Results

Cross-validation
- Outputs accuracy scores for each fold in K-fold CV.
- Selects the model with the best overall cross-validation accuracy.
Test Set Performance
Metrics include precision, recall, F1 Score, confusion matrix, and AUC.
Feature Importance
Evaluates feature importance using both RandomForest’s built-in feature importances and SHAP values. Higher SHAP values indicate more influential features.

Frequently Asked Questions (FAQ)

Excel File Names or Versions Differ
Update file paths and sheet names in load_data() accordingly.
Library Compatibility Issues
Use virtual environments (e.g., conda, venv) to resolve potential dependency conflicts.
Missing Data Issues
Consider alternative strategies (e.g., median, mean, interpolation) if current filling strategy is unsuitable.
Categorical Variable Encoding
Update the dictionary r_dict in the script if category mappings need to be modified.

Contributing & Extensions

You are encouraged to modify the code by trying different models, feature selection strategies, or visualizations. Contributions, suggestions, and improvements are welcome via issues or pull requests.

License

You can release this project under your laboratory’s or company’s preferred license. If not specified, consider using the MIT License.

We hope this README helps you and others easily understand and use the project. For further questions or clarifications, please reach out.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
MTC_NMF_20250408_codecheck.Rmd		MTC_NMF_20250408_codecheck.Rmd
MTC_drug_screen_chhcodeCheck.Rmd		MTC_drug_screen_chhcodeCheck.Rmd
MTC_figure2.Rmd		MTC_figure2.Rmd
README.md		README.md
load_data.py		load_data.py
main_fs.py		main_fs.py
main_train.py		main_train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

README

Introduction

Core Workflow

Project Structure

Requirements

Usage

Key Functions

Interpreting Results

Frequently Asked Questions (FAQ)

Contributing & Extensions

License

About

Uh oh!

Releases

Packages

Contributors 2

Languages

guomics-lab/MTC

Folders and files

Latest commit

History

Repository files navigation

README

Introduction

Core Workflow

Project Structure

Requirements

Usage

Key Functions

Interpreting Results

Frequently Asked Questions (FAQ)

Contributing & Extensions

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages