Self-supervised autoencoder for structured clinical data that handles missing values via input masking and a type-aware loss. Embeddings feed a downstream ANN to predict cardiovascular death within 8 years on an IHD cohort.
The primary objective of this project is to develop an efficient method for representing complex clinical data in a lower-dimensional space. By leveraging autoencoders, we aim to generate embeddings that capture the underlying structure of the data while preserving important information about the patients' health status. Our ultimate goal is to handle datasets with missing values: the model should learn to impute them, producing consistent embeddings even for patients with incomplete records.
Our poster for MIEO was presented at the 20th International Conference on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2025), held at Politecnico di Milano, Milan, Italy, on 10–12 September 2025.
View the poster (PDF) · CIBB 2025 website / program
The main dataset used in this project is `OrmoniTiroidei3Aprile2024.xlsx`, which contains real clinical data related to thyroid disorders. We augment it with records from other datasets pertaining to the same patients, enhancing the richness and diversity of the data.
We use an autoencoder architecture to encode the clinical data into a lower-dimensional space. The autoencoder comprises an encoder network that compresses the input data into a latent-space representation and a decoder network that reconstructs the original input from that representation. By training the autoencoder on the augmented dataset, we aim to learn meaningful embeddings that capture the essence of the patients' health data. In particular, we focus on handling missing data: missing entries are masked at the input, and the model learns to impute them, so that patients with incomplete records still receive consistent embeddings.
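As an illustration of this masking idea, here is a minimal PyTorch sketch. The class and helper names are ours, and the zero-fill-plus-mask-concatenation scheme and the MSE/BCE split are assumptions for exposition, not necessarily the repository's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAutoencoder(nn.Module):
    """Sketch: missing entries are zero-filled and a binary observed-mask is
    concatenated to the input, so the encoder can tell 'missing' apart from
    a genuine zero."""

    def __init__(self, n_features: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * n_features, 64), nn.ReLU(),  # features + mask
            nn.Linear(64, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = torch.nan_to_num(x) * mask             # zero out missing entries
        z = self.encoder(torch.cat([x, mask], dim=1))
        return self.decoder(z)

def type_aware_loss(recon, target, mask, is_binary):
    """Reconstruction loss restricted to observed entries: squared error for
    continuous columns, BCE-with-logits for binary ones."""
    target = torch.nan_to_num(target)              # missing targets are masked out below
    per_entry = torch.zeros_like(recon)
    per_entry[:, ~is_binary] = (recon - target)[:, ~is_binary] ** 2
    per_entry[:, is_binary] = F.binary_cross_entropy_with_logits(
        recon[:, is_binary], target[:, is_binary], reduction="none"
    )
    return (per_entry * mask).sum() / mask.sum().clamp(min=1)
```

Because the loss is averaged only over observed entries, patients with many missing values still contribute a well-defined training signal.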
To evaluate the effectiveness of our embeddings, we employ them as features to train a neural network for a classification task. The targets for the classification task are specified in the `Cleaning_Data.ipynb` notebook, where we also perform data cleaning and preprocessing.
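A hedged sketch of this downstream step, assuming the embeddings are available as a NumPy array and using skorch (already a project dependency); `EmbeddingClassifier` and all hyperparameters here are illustrative:

```python
import numpy as np
import torch.nn as nn
from skorch import NeuralNetClassifier

class EmbeddingClassifier(nn.Module):
    """Small MLP head trained on the autoencoder embeddings."""

    def __init__(self, latent_dim: int = 16, n_hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 2),                # two classes: event / no event
        )

    def forward(self, X):
        return self.net(X)

clf = NeuralNetClassifier(
    EmbeddingClassifier,
    criterion=nn.CrossEntropyLoss,                 # module outputs raw logits
    max_epochs=50,
    lr=1e-3,
)
# embeddings: (n_patients, latent_dim) array produced by the encoder;
# y: binary target defined in Cleaning_Data.ipynb
# clf.fit(embeddings.astype(np.float32), y.astype(np.int64))
```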
- Angelo Nardone
- Davide Borghini
- Davide Marchi
- Giordano Scerra
To get started with the project, follow these steps:
- Clone the repository to your local machine.
- Install the necessary dependencies listed in the `requirements.txt` file.
- Explore the codebase and experiment with different configurations and parameters.
- Run the provided scripts to train the autoencoder on the augmented dataset and generate embeddings.
- Use the embeddings as features to train a neural network for the classification task specified in the `Cleaning_Data.ipynb` notebook.
To avoid the following `FutureWarning`, raised when skorch loads saved parameters via `torch.load`:

```
.venv/lib/python3.8/site-packages/skorch/net.py:2231: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True.
```

we changed the following in `skorch/net.py`:

From:

```python
load_kwargs = {'map_location': map_location}
```

To:

```python
load_kwargs = {'map_location': map_location, 'weights_only': True}
```
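Note that editing an installed copy of skorch is lost whenever the environment is rebuilt. If patching `site-packages` is not an option, a sketch of an alternative (ours, not what the repository does) that merely silences this specific warning at runtime, without changing loading behavior:

```python
import warnings

# Suppress only the torch.load FutureWarning surfaced through skorch;
# the message argument is a regex matched against the warning text.
warnings.filterwarnings(
    "ignore",
    message=r"You are using torch.load with weights_only=False.*",
    category=FutureWarning,
)
```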
