Prints progress while running and gets rid of an unnecessary dictionary. #450

ElekLamoureux · 2025-06-24T11:59:36Z

Progress is now displayed to the user while running. The unnecessary dictionary is gone; the code uses the generator's output directly instead.

sgalkina

The output file should be updated with each cluster; currently it's getting rewritten.

vamb/__main__.py

sgalkina

Fix variable naming and remove the global variable

vamb/__main__.py

sgalkina

Minor comments

vamb/__main__.py

vamb/vambtools.py

vamb/__main__.py

vamb/__main__.py

vamb/__main__.py

vamb/__main__.py

jakobnissen

I like it, thank you!
I'm a little worried that this is turning into spaghetti code, with the logic of writing bins being spread out between too many functions, that each do too many things. That is probably unavoidable to some extent, because we just DO want to do a lot of variations on the same concept.
So it would be best if we could rethink how to do this more elegantly. However, it may not be possible, and it's not a priority.
Thank you for your work here, @ElekLamoureux

jakobnissen · 2025-07-31T07:12:12Z

vamb/__main__.py

+            open(unsplit_path, "a") as unsplit_clusters_file,
+            open(split_path, "a") as split_clusters_file,


I think this should be "w" mode - when cluster_and_write_files is called, we expect new files to be created.
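
To illustrate the difference between the two modes, here is a minimal, self-contained sketch. The helper `write_run` and the file names are hypothetical and not part of vamb; the point is only that "a" keeps whatever an earlier run left in the file, while "w" starts from an empty file.

```python
import tempfile
from pathlib import Path


def write_run(path: Path, lines: list[str], mode: str) -> None:
    """Write `lines` to `path`, opening it with the given mode ("w" or "a")."""
    with open(path, mode) as file:
        for line in lines:
            file.write(line + "\n")


with tempfile.TemporaryDirectory() as tmpdir:
    appended = Path(tmpdir) / "appended.tsv"
    truncated = Path(tmpdir) / "truncated.tsv"

    # Simulate two separate runs that write to the same output path.
    for mode, path in (("a", appended), ("w", truncated)):
        write_run(path, ["first_run"], mode)
        write_run(path, ["second_run"], mode)

    appended_lines = appended.read_text().splitlines()
    truncated_lines = truncated.read_text().splitlines()

print(appended_lines)   # stale output from the first run is still present
print(truncated_lines)  # only the latest run remains
```

Note that a handle opened once in "w" mode and kept open for the whole loop still receives every cluster as it is written, so "w" does not conflict with writing clusters incrementally.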

jakobnissen · 2025-07-31T07:15:14Z

vamb/__main__.py

    bin_prefix: Optional[str],
    binsplitter: vamb.vambtools.BinSplitter,
    base_clusters_name: str,  # e.g. /foo/bar/vae -> /foo/bar/vae_unsplit.tsv
    clusters: dict[str, set[str]],


Consider making this argument Iterable[tuple[str, set[str]]]. As far as I can tell, we don't need an actual dict here. Adjust the calls to .items() in this function accordingly. Then, you can avoid creating the single element dict in write_clusters_table when calling this function.
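
A rough sketch of what this suggestion could look like. `count_contigs` is a hypothetical stand-in for the real function, which does more; it only shows that a parameter typed as an iterable of pairs accepts both `dict.items()` and a one-element list, so no throwaway dict is needed.

```python
from collections.abc import Iterable


def count_contigs(clusters: Iterable[tuple[str, set[str]]]) -> int:
    """Count contigs across all clusters, consuming the iterable once."""
    total = 0
    # Iterate over (name, contigs) pairs directly; no .items() call is needed.
    for _name, contigs in clusters:
        total += len(contigs)
    return total


clusters_dict = {"bin_1": {"c1", "c2"}, "bin_2": {"c3"}}

# A caller that already holds a dict passes its items view.
from_dict = count_contigs(clusters_dict.items())

# A caller with a single cluster passes a one-element list instead of a dict.
from_single = count_contigs([("bin_3", {"c4", "c5", "c6"})])

print(from_dict, from_single)
```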

jakobnissen · 2025-07-31T07:18:17Z

vamb/__main__.py

+    file_path: Optional[str],
+    clusters: dict[str, set[str]],
+    to_file: bool,
+):


Please add return type here for readability

jakobnissen · 2025-07-31T07:20:23Z

vamb/__main__.py

    # Open unsplit clusters and split them
    if binsplitter.splitter is not None:
-        split_path = Path(base_clusters_name + "_split.tsv")
        clusters = dict(binsplitter.binsplit(clusters.items()))


Consider not instantiating the dict here and using the iterator instead, if possible.
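
As a sketch of the idea, assuming the splitter can be expressed as a generator: `split_by_sample` below is a hypothetical stand-in, not `BinSplitter.binsplit`, and the naming scheme for split bins is an assumption. Chaining generators lets each cluster be split and written as soon as it is produced, instead of collecting everything in a dict first.

```python
from collections import defaultdict
from collections.abc import Iterable, Iterator


def split_by_sample(
    clusters: Iterable[tuple[str, set[str]]], separator: str
) -> Iterator[tuple[str, set[str]]]:
    """Lazily split each cluster by the sample prefix of its contig names."""
    for name, contigs in clusters:
        by_sample: dict[str, set[str]] = defaultdict(set)
        for contig in contigs:
            # Everything before the first separator is treated as the sample name.
            sample, _, _rest = contig.partition(separator)
            by_sample[sample].add(contig)
        for sample in sorted(by_sample):
            yield f"{sample}{separator}{name}", by_sample[sample]


events: list[str] = []


def produce_clusters() -> Iterator[tuple[str, set[str]]]:
    """Stand-in for the clusterer; records when each cluster is produced."""
    for name, contigs in (("1", {"s1Ca", "s2Cb"}), ("2", {"s1Cc"})):
        events.append(f"clustered {name}")
        yield name, contigs


# No dict is built: splitting and "writing" interleave with clustering.
for bin_name, _contigs in split_by_sample(produce_clusters(), "C"):
    events.append(f"wrote {bin_name}")

print(events)
```

The recorded events show cluster 2 being produced only after both split bins of cluster 1 have been handled.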

jakobnissen · 2025-07-31T07:27:29Z

vamb/vambtools.py

 def write_clusters(
-    io: IO[str],
-    clusters: Iterable[tuple[str, set[str]]],
+    io: IO[str], clusters: Iterable[tuple[str, set[str]]], print_line: bool = True


Rename to print_header
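
For illustration, a simplified sketch of a writer with the renamed flag. `write_clusters_sketch` is hypothetical and is not the function in vamb/vambtools.py; the header text and the return value are assumptions. The name `print_header` makes the call sites self-explanatory when several calls share one file handle.

```python
import io
from collections.abc import Iterable
from typing import IO


def write_clusters_sketch(
    file: IO[str],
    clusters: Iterable[tuple[str, set[str]]],
    print_header: bool = True,
) -> int:
    """Write clusters as TSV rows and return the number of rows written."""
    if print_header:
        # Assumed header line; the real column names may differ.
        file.write("clustername\tcontigname\n")
    n_rows = 0
    for name, contigs in clusters:
        for contig in sorted(contigs):
            file.write(f"{name}\t{contig}\n")
            n_rows += 1
    return n_rows


buffer = io.StringIO()
# Only the first call writes the header; later calls add rows to the same handle.
write_clusters_sketch(buffer, [("bin_1", {"c2", "c1"})], print_header=True)
write_clusters_sketch(buffer, [("bin_2", {"c3"})], print_header=False)
output = buffer.getvalue()
print(output)
```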

jakobnissen · 2025-07-31T07:27:41Z

vamb/__main__.py

+    file_handle: Optional[TextIO],
+    file_path: Optional[str],
+    clusters: dict[str, set[str]],
+    to_file: bool,


Rename the bool argument to print_header

jakobnissen · 2025-07-31T07:35:47Z

vamb/__main__.py

+                if processed_contigs >= comparer:
+                    comparer += progress_step
+                    progress += 10
+                    logger.info(f"{progress}% of contigs clustered")


The logic here is brittle. For example, what if a single bin has 25% of all contigs? Then comparer still only increments by 10%.
Instead, make a variable called "next_reporting_threshold" - a better name for "comparer". When updating it, update it to be the next 10% higher. E.g. if we moved from 15% of contigs to 32% of contigs, update it to be 40%. You can achieve this with -((processed_contigs * 10 + 1) // - num_contigs) / 10.
Also, fix up the logic for the progress variable: If a single bin brings us from e.g. 9% to 32%, it needs to print the update for 10, 20 and 30 in one go.
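
One possible way to sketch the requested behaviour. This is not the code that was merged: `percentages_to_report` is a hypothetical helper that tracks the threshold as an integer percentage rather than the fraction produced by the expression above, and it assumes `num_contigs` is greater than zero. The expression from the comment is included as `next_threshold_fraction` for comparison.

```python
def next_threshold_fraction(processed_contigs: int, num_contigs: int) -> float:
    """The expression from the review: next multiple of 0.1 strictly above the current fraction."""
    # -(a // -b) is ceiling division on integers.
    return -((processed_contigs * 10 + 1) // -num_contigs) / 10


def percentages_to_report(
    processed_contigs: int, num_contigs: int, next_reporting_threshold: int
) -> tuple[list[int], int]:
    """Return the percentages to report now and the updated threshold (in percent)."""
    percent_done = processed_contigs * 100 // num_contigs
    # Every multiple of 10 from the threshold up to the current percentage is due.
    reported = list(range(next_reporting_threshold, percent_done + 1, 10))
    if reported:
        next_reporting_threshold = reported[-1] + 10
    return reported, next_reporting_threshold


num_contigs = 100
bin_sizes = [9, 23, 8, 60]  # cumulative: 9%, 32%, 40%, 100%
processed_contigs = 0
next_reporting_threshold = 10
messages: list[str] = []

for size in bin_sizes:
    processed_contigs += size
    reported, next_reporting_threshold = percentages_to_report(
        processed_contigs, num_contigs, next_reporting_threshold
    )
    for percent in reported:
        messages.append(f"{percent}% of contigs clustered")

print(messages)
print(next_threshold_fraction(32, num_contigs))
```

With these bin sizes, the jump from 9% to 32% emits the 10, 20 and 30 messages together, and a bin that crosses no threshold emits nothing.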

jakobnissen · 2025-07-31T07:36:41Z

vamb/__main__.py

+                if processed_contigs == num_contigs:
+                    if binsplitter.splitter is not None:
+                        msg = f"\tClustered {processed_contigs} contigs in {total_split} split bins ({total_unsplit} clusters)"
+                    else:
+                        msg = f"\tClustered {processed_contigs} contigs in {total_unsplit} unsplit bins"
+                    logger.info(msg)
+                    elapsed = round(time.time() - begintime, 2)
+                    logger.info(f"\tWrote cluster file(s) in {elapsed} seconds.")
+
+                    if fasta_output is not None:
+                        logger.info(
+                            f"\tWrote {max(total_split, total_unsplit)} bins with {processed_contigs} sequences in {elapsed} seconds."
+                        )


Move this branch outside the for loop; there is no need to gate it behind an else branch.
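
A minimal sketch of the restructuring, using a hypothetical `cluster_summary` function rather than the real one. Placing the summary after the loop removes the `processed_contigs == num_contigs` check and also covers the case where there are no clusters at all.

```python
from collections.abc import Iterable


def cluster_summary(clusters: Iterable[tuple[str, set[str]]]) -> list[str]:
    """Process every cluster, then build the summary lines once, after the loop."""
    messages: list[str] = []
    n_clusters = 0
    processed_contigs = 0
    for _name, contigs in clusters:
        n_clusters += 1
        processed_contigs += len(contigs)
        # Per-cluster work (writing rows, progress updates) would happen here.

    # Runs exactly once, even when `clusters` is empty and the loop body never executes.
    messages.append(f"Clustered {processed_contigs} contigs in {n_clusters} bins")
    return messages


summary = cluster_summary([("a", {"x", "y"}), ("b", {"z"})])
empty_summary = cluster_summary([])
print(summary, empty_summary)
```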

jakobnissen · 2025-07-31T07:40:16Z

vamb/__main__.py

    # Cluster and output the Y clusters
    assert opt.common.clustering.max_clusters is None
-    write_clusters_and_bins(
+    export_binning_results(


Doesn't this function call miss a bunch of arguments?

jakobnissen · 2025-07-31T07:42:26Z

vamb/vambtools.py


        # Print bin to file
-        with open(directory.joinpath(binname + ".fna"), "wb") as file:
+        with open(directory.joinpath(str(binname) + ".fna"), "wb") as file:


Why is this str necessary? According to the function's type signature, it should already be a string. If it is always a string, do not use str here. If it is not always, then the type signature needs to be updated, or the caller needs to make sure the binname is always a string (in order to conform to the signature)

sgalkina requested changes Jun 24, 2025

vamb/__main__.py (outdated)

vamb/__main__.py (outdated)

Made it so that progress is printed to the screen while running. Also…

da72c55

… deleted the unnecessary use of a dictionary and replaced it by using the direct output of a generator.

sgalkina requested changes Jun 25, 2025

vamb/__main__.py (outdated)

vamb/__main__.py (outdated)

vamb/__main__.py (outdated)

sgalkina reviewed Jun 25, 2025

vamb/__main__.py (outdated)

vamb/vambtools.py (outdated)

sgalkina mentioned this pull request Jun 25, 2025

Output clusters during clustering #428

Closed

Removed a redundant line and cleaned up some messy code. Fixed output…

f8a26f9

…s files that only had one line.

jakobnissen requested changes Jun 25, 2025

vamb/__main__.py (outdated)

vamb/__main__.py (outdated)

vamb/__main__.py (outdated)

Elek Lamoureux added 3 commits June 25, 2025 15:47

Fixed variable names, removed comments, and changed the method of hea…

5b66138

…der managing

Changed some things to follow the codebase's conventions.

f299c74

Fixed output problem by showing percent done instead. Made it so file…

2a92899

…s aren't unnecessarily opened.

sgalkina reviewed Jun 30, 2025

vamb/__main__.py (outdated)

Removed function attributes

6cb1867

sgalkina reviewed Jul 3, 2025

vamb/__main__.py (outdated)

sgalkina reviewed Jul 3, 2025

vamb/__main__.py (outdated)

vamb/__main__.py (outdated)

Fixed incorrect output, and removed redundant variables. Added more d…

116be51

…escriptive variables.

sgalkina reviewed Jul 7, 2025

vamb/__main__.py (outdated)

vamb/__main__.py (outdated)

vamb/__main__.py (outdated)

sgalkina reviewed Jul 7, 2025

vamb/__main__.py (outdated)

sgalkina reviewed Jul 7, 2025

vamb/__main__.py (outdated)

vamb/__main__.py (outdated)

Elek Lamoureux added 2 commits July 7, 2025 14:49

Removed unnecessary return, fixed functionality for unsplit runs, impl…

ea7a849

…emented a function for dealing with files, removed a redundant print line.

Changed output appearance, added argument types to function definitio…

0a3afce

…n, changed the value of a variable. Formatted using ruff.

sgalkina reviewed Jul 8, 2025

vamb/__main__.py (outdated)

sgalkina reviewed Jul 8, 2025

vamb/__main__.py

sgalkina reviewed Jul 8, 2025

vamb/__main__.py (outdated)

sgalkina reviewed Jul 8, 2025

vamb/__main__.py (outdated)

sgalkina reviewed Jul 8, 2025

vamb/__main__.py (outdated)

sgalkina reviewed Jul 8, 2025

vamb/__main__.py (outdated)

sgalkina reviewed Jul 8, 2025

vamb/__main__.py (outdated)

vamb/__main__.py (outdated)

Elek Lamoureux added 3 commits July 8, 2025 10:51

Got minfasta output working and fixed minfasta without binsplitting. …

4f6c9b4

…Corrected formatting.

Reformatted using ruff

e9d3444

Fixed reporting errors

192c8bf

Fixed some function names for clarity and reverted incorrect changes.

9fd683a

sgalkina requested a review from jakobnissen July 30, 2025 13:03

Merge branch 'master' into clustering

a5f07fd

sgalkina self-requested a review July 31, 2025 07:07

sgalkina approved these changes Jul 31, 2025

jakobnissen approved these changes Jul 31, 2025

Made minor tweaks, replaced a dictionary with an iterable, added a wh…

4a1ff82

…ile loop to fix printing logic

sgalkina linked an issue Aug 1, 2025 that may be closed by this pull request

Output clusters during clustering #428

Closed

sgalkina assigned ElekLamoureux Aug 1, 2025

Elek Lamoureux added 3 commits August 1, 2025 10:57

Made sure clusters_with_prefix is always defined

58a86da

Fixed error with calling recluster

e17a817

Finally works for all cases

0b9c8e3

sgalkina merged commit a0056e2 into RasmussenLab:master Aug 11, 2025
8 checks passed

3 participants