@ElekLamoureux

Progress is now displayed to the user while running. I also removed the unnecessary dictionary and use the direct output of a generator instead.

Member

@sgalkina sgalkina left a comment

The output file should be updated with each cluster; currently it's getting rewritten.

… deleted the unnecessary use of a dictionary and replaced it by using the direct output of a generator.
Member

@sgalkina sgalkina left a comment

Fix the variable naming and remove the global variable.

Member

@sgalkina sgalkina left a comment

Minor comments

Elek Lamoureux added 2 commits July 7, 2025 14:49
…emented a function for dealing with files, removed a redundant print line.
…n, changed the value of a variable. Formatted using ruff.
@sgalkina sgalkina requested a review from jakobnissen July 30, 2025 13:03
@sgalkina sgalkina self-requested a review July 31, 2025 07:07
Member

@jakobnissen jakobnissen left a comment

I like it, thank you!
I'm a little worried that this is turning into spaghetti code, with the logic of writing bins spread out between too many functions that each do too many things. That is probably unavoidable to some extent, because we just DO want to do a lot of variations on the same concept.
So it would be best if we could rethink how to do this more elegantly. However, it may not be possible, and it's not a priority.
Thank you for your work here, @ElekLamoureux

vamb/__main__.py Outdated
Comment on lines 1219 to 1220
open(unsplit_path, "a") as unsplit_clusters_file,
open(split_path, "a") as split_clusters_file,
Member

I think this should be "w" mode - when cluster_and_write_files is called, we expect new files to be created.
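
A minimal sketch of the difference, not the PR's code: opening with "w" truncates the file once when the call starts, and each cluster is then written and flushed through the same open handle, so the file still grows cluster by cluster without keeping rows from an earlier run. The function name, the header text and the flush-per-cluster choice are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable


def write_clusters_streaming(
    path: Path, clusters: Iterable[tuple[str, set[str]]]
) -> int:
    """Write clusters to a fresh TSV file as they arrive; return the cluster count."""
    n_clusters = 0
    # "w" truncates any existing file once, when the call starts.
    with open(path, "w") as file:
        # Assumed header, for illustration only.
        file.write("clustername\tcontigname\n")
        for cluster_name, contigs in clusters:
            for contig in sorted(contigs):
                file.write(f"{cluster_name}\t{contig}\n")
            # Flush so partial results are visible on disk during a long run.
            file.flush()
            n_clusters += 1
    return n_clusters
```

With "a" instead of "w", a second call on the same path would append a second header and a second set of rows to the existing file.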

vamb/__main__.py Outdated
bin_prefix: Optional[str],
binsplitter: vamb.vambtools.BinSplitter,
base_clusters_name: str, # e.g. /foo/bar/vae -> /foo/bar/vae_unsplit.tsv
clusters: dict[str, set[str]],
Member

Consider making this argument Iterable[tuple[str, set[str]]]. As far as I can tell, we don't need an actual dict here. Adjust the calls to .items() in this function accordingly. Then, you can avoid creating the single-element dict in write_clusters_table when calling this function.
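
To illustrate the suggestion, here is a small sketch with a hypothetical helper, count_contigs, which is not a function from this PR. A parameter typed as an iterable of (name, contigs) pairs accepts both a dict's .items() and a one-element list, so a caller holding a single cluster does not need to build a dict first.

```python
from typing import Iterable


def count_contigs(clusters: Iterable[tuple[str, set[str]]]) -> int:
    """Sum cluster sizes; works for any iterable of (name, contigs) pairs."""
    return sum(len(contigs) for _name, contigs in clusters)


as_dict = {"cluster_1": {"s1C1", "s1C2"}, "cluster_2": {"s2C1"}}

# A caller that already has a dict passes .items().
n_from_dict = count_contigs(as_dict.items())
# A caller holding one cluster passes a one-element list, no dict needed.
n_from_single = count_contigs([("cluster_1", {"s1C1", "s1C2"})])
```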

vamb/__main__.py Outdated
file_path: Optional[str],
clusters: dict[str, set[str]],
to_file: bool,
):
Member

Please add a return type here for readability.

vamb/__main__.py Outdated
# Open unsplit clusters and split them
if binsplitter.splitter is not None:
split_path = Path(base_clusters_name + "_split.tsv")
clusters = dict(binsplitter.binsplit(clusters.items()))
Member

Consider not instantiating the dict here and using the iterator instead, if possible.
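
A rough sketch of what using the iterator could look like. split_by_sample is a hypothetical stand-in for the real bin splitter, and the "C" separator and the bin naming scheme are assumptions. The point is only that the generator is passed straight to its consumer instead of being wrapped in dict(...).

```python
from typing import Iterable, Iterator


def split_by_sample(
    clusters: Iterable[tuple[str, set[str]]], separator: str = "C"
) -> Iterator[tuple[str, set[str]]]:
    """Hypothetical splitter: yield one bin per (cluster, sample) pair."""
    for cluster_name, contigs in clusters:
        by_sample: dict[str, set[str]] = {}
        for contig in contigs:
            # The sample is assumed to be the text before the first separator.
            sample = contig.split(separator, 1)[0]
            by_sample.setdefault(sample, set()).add(contig)
        for sample in sorted(by_sample):
            yield f"{sample}{separator}{cluster_name}", by_sample[sample]


def consume(clusters: Iterable[tuple[str, set[str]]]) -> list[str]:
    """Iterate once, as a writer would, and return the bin names seen."""
    return [name for name, _contigs in clusters]


unsplit = [("1", {"s1C10", "s1C11", "s2C7"})]
# No dict(...) around the generator: bins are produced only as consume() asks for them.
names = consume(split_by_sample(unsplit))
```

An iterator can be traversed only once, so this fits only if the receiving function makes a single pass over the clusters.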

def write_clusters(
io: IO[str],
clusters: Iterable[tuple[str, set[str]]],
io: IO[str], clusters: Iterable[tuple[str, set[str]]], print_line: bool = True
Member

Rename to print_header

vamb/__main__.py Outdated
file_handle: Optional[TextIO],
file_path: Optional[str],
clusters: dict[str, set[str]],
to_file: bool,
Member

Rename the bool argument to print_header

vamb/__main__.py Outdated
Comment on lines 1268 to 1271
if processed_contigs >= comparer:
comparer += progress_step
progress += 10
logger.info(f"{progress}% of contigs clustered")
Member

The logic here is brittle. For example, what if a single bin has 25% of all contigs? Then comparer still only increments by 10%.
Instead, make a variable called "next_reporting_threshold", a better name for "comparer". When updating it, update it to be the next 10% higher. E.g. if we moved from 15% of contigs to 32% of contigs, update it to be 40%. You can achieve this with -((processed_contigs * 10 + 1) // -num_contigs) / 10.
Also, fix up the logic for the progress variable: if a single bin brings us from e.g. 9% to 32%, it needs to print the updates for 10, 20 and 30 in one go.
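
One possible shape for this, as a sketch rather than the code that was merged: it keeps the threshold in whole percent and uses a while loop instead of the closed-form expression above, so every threshold crossed by a large bin is reported. The function name and the list-of-messages return value are assumptions made so the example can run on its own, and num_contigs is assumed to be positive.

```python
from typing import Iterable


def progress_messages(bin_sizes: Iterable[int], num_contigs: int) -> list[str]:
    """Return one message per 10% threshold reached, in order."""
    messages: list[str] = []
    processed_contigs = 0
    next_reporting_threshold = 10  # in percent
    for size in bin_sizes:
        processed_contigs += size
        percent_done = processed_contigs * 100 // num_contigs
        # A large bin may cross several thresholds; report each of them.
        while next_reporting_threshold <= min(percent_done, 100):
            messages.append(f"{next_reporting_threshold}% of contigs clustered")
            next_reporting_threshold += 10
    return messages
```

For bins of 9 and then 23 contigs out of 100, the first bin reports nothing and the second reports 10%, 20% and 30% together, leaving the threshold at 40%.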

vamb/__main__.py Outdated
Comment on lines 1272 to 1284
if processed_contigs == num_contigs:
if binsplitter.splitter is not None:
msg = f"\tClustered {processed_contigs} contigs in {total_split} split bins ({total_unsplit} clusters)"
else:
msg = f"\tClustered {processed_contigs} contigs in {total_unsplit} unsplit bins"
logger.info(msg)
elapsed = round(time.time() - begintime, 2)
logger.info(f"\tWrote cluster file(s) in {elapsed} seconds.")

if fasta_output is not None:
logger.info(
f"\tWrote {max(total_split, total_unsplit)} bins with {processed_contigs} sequences in {elapsed} seconds."
)
Member

Move this branch outside the for loop. There is no need to gate it behind an else branch.
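
A simplified sketch of the restructuring, with a hypothetical function and message format: the counters are updated inside the loop, and the summary is built once after the loop ends, so no completion check is needed.

```python
def cluster_summary(bin_sizes: list[int]) -> str:
    """Count contigs and bins in the loop, then summarise once afterwards."""
    processed_contigs = 0
    total_bins = 0
    for size in bin_sizes:
        processed_contigs += size
        total_bins += 1
    # Runs exactly once after the loop ends, whatever the loop processed.
    return f"Clustered {processed_contigs} contigs in {total_bins} bins"
```

Placed after the loop, the summary is also produced when the input is empty or when the processed count never equals the expected total, two cases in which an equality check inside the loop would print nothing.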

# Cluster and output the Y clusters
assert opt.common.clustering.max_clusters is None
write_clusters_and_bins(
export_binning_results(
Member

Doesn't this function call miss a bunch of arguments?


# Print bin to file
with open(directory.joinpath(binname + ".fna"), "wb") as file:
with open(directory.joinpath(str(binname) + ".fna"), "wb") as file:
Member

Why is this str necessary? According to the function's type signature, it should already be a string. If it is always a string, do not use str here. If it is not always, then the type signature needs to be updated, or the caller needs to make sure the binname is always a string (in order to conform to the signature)

@sgalkina sgalkina linked an issue Aug 1, 2025 that may be closed by this pull request
@sgalkina sgalkina merged commit a0056e2 into RasmussenLab:master Aug 11, 2025
8 checks passed
Development

Successfully merging this pull request may close these issues.

Output clusters during clustering

3 participants