Marine Omics Demos

Marine metagenomics platform notebooks (NBs) to get the science started. This work is part of the FAIR-EASE project, specifically Pilot 5 for metagenomics, and aims to provide as many tools as possible for EMO-BON data.

Having problems, encountering bugs, or developing ideas? Open issues and PRs with your dream workflow suggestions.

Design principles

Flexibility: runs locally and on public JupyterHub servers (Binder and Google Colab), and is deployable to VREs and other Jupyter servers.
Simplicity over speed, although performance is still considered.
UDAL queries, which will eventually enable you to query many types of data apart from EMO-BON.
API calls to other services, such as Galaxy.

Workflow notebooks

Notebooks generate panel apps for user-friendly interactions. To start the dashboard, click the panel icon.

What if the dashboards are not enough for you? Each dashboard NB has an ..._interactive.ipynb sibling in the same directory, which is a good starting point. Combine the momics methods (repository marine-omics-methods) with panel widgets to extend the NB capabilities.

IMPORTANT NOTE

At the moment, NBs/dashboards loaded over Google Colab run FASTER than the Binder equivalents. However, in order to display a dashboard, or even separate dashboard components (widgets), you will need an ngrok account. In practice:

After registering, put your token in the .env file, which you upload to the root of your GColab session:
- NGROK_TOKEN="..........."
- Alternatively, define it manually in a new cell: os.environ["NGROK_TOKEN"] = "my_secret_token"
ngrok creates a tunnel to a separate URL which serves the dashboard.
You will be shown a link which opens the dashboard.
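The token lookup described above can be sketched in plain Python. This is only an illustration and not the repository's actual code: the helper name `resolve_ngrok_token` is hypothetical, and it assumes a simple `KEY="value"` layout in `.env`. It prefers a value already present in `os.environ` and falls back to the file.

```python
import os
from pathlib import Path


def resolve_ngrok_token(env_path=".env", environ=None):
    """Return NGROK_TOKEN from the environment or, failing that, from a .env file.

    Hypothetical helper for illustration; assumes one KEY="value" pair per line.
    """
    environ = os.environ if environ is None else environ
    token = environ.get("NGROK_TOKEN")
    if token:
        return token

    path = Path(env_path)
    if path.is_file():
        for line in path.read_text().splitlines():
            line = line.strip()
            # Skip blank lines, comments and anything that is not KEY=value.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            if key.strip() == "NGROK_TOKEN":
                return value.strip().strip("\"'")
    raise RuntimeError("NGROK_TOKEN is not set in os.environ or in the .env file")
```

In a Colab cell you would call `resolve_ngrok_token()` once and pass the result to the ngrok client before serving the dashboard.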

WF0, Landing page showing sequencing progress

General statistics of EMO-BON sequencing efforts. The total number of sampling events recently exceeded 1000. Unfortunately, leafmap widgets have a problem with the ngrok tunnel, so only Binder integration is possible.

WF1, Visualize metaGOflow pipeline intermediate products

The notebook is located at wf1_metagoflow/quality_control.ipynb.

There are almost 60 output files from the metaGOflow pipeline. This dashboard provides an interface to the most relevant, reasonably sized metaGOflow pipeline outputs, including:

fastp Quality Control report with interactive QC plots.
Read quality control for both trimmed and merged reads.
Interactive Krona plots from SSU and LSU taxonomy tables, respectively.
Functional annotation summaries expressed in number of reads matched to respective databases.

WF2, Genetic diversity

Basic diversity dashboard

The NB provides visualization of alpha and beta diversities of the metaGOflow analyses. It is located at wf2_diversity/diversities_panel.ipynb.
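To make the two terms concrete, here is a minimal plain-Python sketch of one alpha-diversity measure (the Shannon index) and one beta-diversity measure (Bray-Curtis dissimilarity). The choice of these two metrics, the function names and the toy counts are assumptions made for illustration; the source does not say which metrics the dashboard computes.

```python
import math


def shannon_index(counts):
    """Shannon alpha diversity (natural log) of one sample's taxon counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = (c / total for c in counts if c > 0)
    return -sum(p * math.log(p) for p in proportions)


def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two samples over the same taxa."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same taxa in the same order")
    total = sum(sample_a) + sum(sample_b)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(sample_a, sample_b)) / total


# Toy abundance table: rows are samples, columns are taxa (invented numbers).
table = {
    "sample_1": [10, 10, 0],
    "sample_2": [0, 10, 10],
}
alpha = {name: shannon_index(counts) for name, counts in table.items()}
beta = bray_curtis(table["sample_1"], table["sample_2"])
```

Alpha diversity is computed per sample, whereas beta diversity compares pairs of samples, which is why the latter is usually shown as a distance matrix or an ordination plot.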

ADVANCED diversity dashboard

Heavier, but contains pivot tables on the LSU and SSU taxonomy tables.

The tables pivot species according to certain pre-selected taxa. Select, filter and visualize PCoA of the taxonomy with respect to categorical variables. In addition, calculate PERMANOVA on those subsampled taxonomy selections.

Taxonomy Finder

Allows you to search for taxa across all the SSU and LSU taxonomy tables of EMO-BON and display them on the beta-diversity plot. It also returns the filtered taxonomy abundance table.

WF3, Biosynthetic gene clusters (BGCs)

You will need an account on the Galaxy earth-system server for these NBs to work. Your Galaxy access data should be stored as environment variables in the .env file at the root of the repository:

GALAXY_EARTH_URL="https://earth-system.usegalaxy.eu/"
GALAXY_EARTH_KEY="..."
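As a rough sketch of how these two variables might be consumed, the snippet below validates them and then opens a connection with bioblend, which the Technical notes name as the basis of the Galaxy support. The helper names are hypothetical, and the sketch assumes the `.env` values have already been loaded into the process environment.

```python
import os


def galaxy_credentials(environ=None):
    """Return (url, key) or raise if either variable is missing or empty."""
    environ = os.environ if environ is None else environ
    names = ("GALAXY_EARTH_URL", "GALAXY_EARTH_KEY")
    values = {name: environ.get(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return values["GALAXY_EARTH_URL"], values["GALAXY_EARTH_KEY"]


def connect_to_galaxy(environ=None):
    """Open a bioblend connection using the validated credentials."""
    # Imported lazily so the credential check works without bioblend installed.
    from bioblend.galaxy import GalaxyInstance

    url, key = galaxy_credentials(environ)
    return GalaxyInstance(url=url, key=key)
```

Failing early on a missing variable gives a clearer error than an authentication failure halfway through a job submission.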

Running GECCO jobs on Galaxy

BUG: For an unknown reason, the Binder version of the dashboard does not work. The dashboard in wf3_gene_clusters/bgc_run_gecco_job.ipynb illustrates the submission of jobs to Galaxy (GECCO tool).

Upload and run workflow.
Or start the workflow with existing data and in existing history on Galaxy.
Monitor the job.
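The monitoring step can be pictured as a simple polling loop. The sketch below is not the dashboard's implementation: `wait_for_job` is a hypothetical helper, the set of terminal states is an assumption, and `get_state` stands for any callable that returns the job state as a string (for example the `get_state` method of bioblend's jobs client).

```python
import time

# Assumed terminal job states; adjust to what your Galaxy server reports.
TERMINAL_STATES = {"ok", "error", "deleted"}


def wait_for_job(get_state, job_id, interval=10.0, timeout=3600.0, sleep=time.sleep):
    """Poll get_state(job_id) until a terminal state or the timeout is reached."""
    waited = 0.0
    while True:
        state = get_state(job_id)
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout:
            raise TimeoutError(
                f"job {job_id} still in state {state!r} after {waited:.0f} s"
            )
        sleep(interval)
        waited += interval
```

Passing `sleep` as a parameter keeps the loop easy to exercise in tests without actually waiting.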

Analyze GECCO BGC output

Upload local data or query the GECCO results from Galaxy.
Identify Biosynthetic Gene Clusters (BGCs).
Violin plot of the identified BGCs.
API calls to query Pfam protein domain descriptions.
- BUG: the progress bar does not update correctly.
Tokenize, embed, and cluster the domains by their textual descriptions using scikit-learn (KMeans).


Note: if data upload fails locally because of the file size, generate a Jupyter config file:

jupyter lab --generate-config

and then add the following to jupyter_lab_config.py:

c.ServerApp.tornado_settings = {'websocket_max_message_size': 150 * 1024 * 1024}
c.ServerApp.max_buffer_size = 150 * 1024 * 1024

Comparative GECCO BGC analysis

Compare two samples with respect to each other.
The intended analysis employing complex networks is possible to here for now.
Please open discussion as issues to help me improve this formulation.

WF4, Co-occurrence networks

Complex network analysis of either LSU or SSU taxonomy tables. Based on Spearman's correlation, the NB builds networks of significant interactions showing positive and negative associations between the taxa. The taxonomy table can be split into groups based on a selected categorical factor from the metadata table.
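The idea of turning rank correlations into network edges can be sketched in plain Python. This is a simplified illustration under stated assumptions: the function names, the toy abundances and the `threshold` cut-off on |rho| are all hypothetical, and the significance testing mentioned above (p-values, multiple-testing correction) is deliberately left out.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant taxon carries no rank information
    return cov / (var_x * var_y) ** 0.5


def cooccurrence_edges(abundances, threshold=0.8):
    """Return (taxon_a, taxon_b, rho) for pairs with |rho| >= threshold."""
    edges = []
    for a, b in combinations(sorted(abundances), 2):
        rho = spearman(abundances[a], abundances[b])
        if abs(rho) >= threshold:
            edges.append((a, b, rho))
    return edges


# Invented abundances of three taxa across five samples.
abundances = {
    "taxon_A": [1, 2, 3, 4, 5],
    "taxon_B": [2, 4, 5, 8, 9],
    "taxon_C": [9, 7, 6, 3, 1],
}
edges = cooccurrence_edges(abundances)
```

The sign of `rho` on each edge distinguishes positive from negative associations. Splitting the table by a metadata factor would amount to running `cooccurrence_edges` once per group of samples.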

WF5, Integrate MGnify pipeline and data

dependencies not yet fixed

The examples are heavily inspired by, and partly taken from, the MGnify project itself.

How to query data from the MGnify database and make basic plots such as Sankey diagrams: wf5_MGnify/query_data.ipynb
Protein families comparison?

Other ideas

Please reach out if you are interested in these or have your own proposal.

r-/k- communities (not started)
- Correlate with Essential Ocean Variables (EOVs)?
R, Julia (not started)
- Demonstrate usage of some relevant R and Julia packages, use DIVAnd or similar.
DL package? (not started)
- In the future, BC 2026 might have GPU support.
- Irrespective, try AI4EOSC perhaps? Q: Have not seen much, if any, metagenomics there though.

Installation

Local jupyter

Consider creating (and activating) a new virtual environment for the project

# if you are using conda
conda create -n "momics-demos" python=3.10  # or higher
conda activate momics-demos

# using venv is platform dependent (Unix)
python -m venv momics-demos
source momics-demos/bin/activate

# (Win)
python.exe -m venv momics-demos
momics-demos\Scripts\activate

Clone and install the repository

# clone the repository into newly created folder
git clone https://github.com/emo-bon/momics-demos.git

cd momics-demos

# install dependencies using pip
pip install -e .

Setup a jupyter kernel

ipython kernel install --user --name "momics-demos"

Start the jupyterlab

python -m jupyterlab

For existing Jupyter Hub server

Create a shared environment for all users (unless your system admin has already done so)

conda create -p <PATH>  # for example /srv/scratch/momics-demos

Each user needs to activate the environment and set up their own kernel. Launch a terminal session by clicking the ➕ icon and selecting Terminal.

conda activate /srv/scratch/momics-demos
ipython kernel install --user --name "momics-demos"

Select this kernel for the NBs which serve the dashboards. If you want to develop parallel NBs, you can set up another environment/kernel.

Technical notes

General

Currently venv is sufficient and there is no need to set up conda, because the dependencies are pure Python.
Utility functionalities are developed in parallel in the marine-omics-methods repo. They are currently not distributed on PyPI; install them with pip install git+https://github.com/emo-bon/marine-omics-methods.git.
(NOT implemented yet) Request access to the hosted version at the Blue cloud 2026 (BC) Virtual lab environment (VRE).

Dashboards

Dashboards are developed in panel.
- If you put the NB code in a script, you can serve the dashboard in the browser using panel serve app.py --dev.
- You can, however, serve the NB directly as well: panel serve app.ipynb --dev.
- Note: if you want to run on Google Colab, you will need pyngrok and an ngrok token.
- Binder integration is better in terms of running dashboards, but loading the repo might take time or crash, so GColab is a better option in that case.

Data

For statistics, we use pingouin and scikit-bio.
Data handling relies on pandas, numpy, etc. This might be upgraded to polars/FireDucks.

Galaxy

Galaxy support is built upon bioblend.

Visualization

Visualizations are now interactive using hvPlot.

