-
Notifications
You must be signed in to change notification settings - Fork 49
Prints progress while running and gets rid of an unneccessary dictionary. #450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sgalkina
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output file should be updated with each cluster, currently it's getting rewritten
… deleted the unnecessary use of a dictionary and replaced it by using the direct output of a generator.
sgalkina
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable naming and remove global variable
sgalkina
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments
…s files that only had one line.
…escriptive variables.
…emented a function for dealing with files, removed a redundant print line.
…n, changed the value of a variable. Formatted using ruff.
jakobnissen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, thank you!
I'm a little worried that this is turning into spaghetti code, with the logic of writing bins being spread out between too many functions, that each do too many things. That is probably unavoidable to some extent, because we just DO want to do a lot of variations on the same concept.
So it would be best if we could rethink how to do this more elegantly. However, it may not be possible, and it's not a priorty.
Thank you for your work here, @ElekLamoureux
vamb/__main__.py
Outdated
| open(unsplit_path, "a") as unsplit_clusters_file, | ||
| open(split_path, "a") as split_clusters_file, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be "w" mode - when cluster_and_write_files is called, we expect new files to be created.
vamb/__main__.py
Outdated
| bin_prefix: Optional[str], | ||
| binsplitter: vamb.vambtools.BinSplitter, | ||
| base_clusters_name: str, # e.g. /foo/bar/vae -> /foo/bar/vae_unsplit.tsv | ||
| clusters: dict[str, set[str]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider making this argument Iterable[tuple[str, set[str]]]. As far as I cal tell, we don't need an actual dict here. Adjust the calls to .items() in this function accordingly. Then, you can avoid creating the single element dict in write_clusters_table when calling this function
vamb/__main__.py
Outdated
| file_path: Optional[str], | ||
| clusters: dict[str, set[str]], | ||
| to_file: bool, | ||
| ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add return type here for readability
vamb/__main__.py
Outdated
| # Open unsplit clusters and split them | ||
| if binsplitter.splitter is not None: | ||
| split_path = Path(base_clusters_name + "_split.tsv") | ||
| clusters = dict(binsplitter.binsplit(clusters.items())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider not instantiating the dict here and using the iterator instead, if possible.
vamb/vambtools.py
Outdated
| def write_clusters( | ||
| io: IO[str], | ||
| clusters: Iterable[tuple[str, set[str]]], | ||
| io: IO[str], clusters: Iterable[tuple[str, set[str]]], print_line: bool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to print_header
vamb/__main__.py
Outdated
| file_handle: Optional[TextIO], | ||
| file_path: Optional[str], | ||
| clusters: dict[str, set[str]], | ||
| to_file: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename the bool argument to print_header
vamb/__main__.py
Outdated
| if processed_contigs >= comparer: | ||
| comparer += progress_step | ||
| progress += 10 | ||
| logger.info(f"{progress}% of contigs clustered") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here is brittle. For example, what if a single bin has 25% of all contigs? Then comparer still only increments by 10%.
Instead, make a variable called "next_reporting_threshold" - a better name for "comparer". When updating it, update it to be the next 10% higher. E.g. if we moved from 15% of contigs to 32% of contigs, update it to be 40%. You can achieve this with -((processed_contigs * 10 + 1) // - num_contigs) / 10.
Also, fix up the logic for the progress variable: If a single bin brings is from e.g. 9% to 32%, it needs to print the update for 10, 20 and 30 in one go.
vamb/__main__.py
Outdated
| if processed_contigs == num_contigs: | ||
| if binsplitter.splitter is not None: | ||
| msg = f"\tClustered {processed_contigs} contigs in {total_split} split bins ({total_unsplit} clusters)" | ||
| else: | ||
| msg = f"\tClustered {processed_contigs} contigs in {total_unsplit} unsplit bins" | ||
| logger.info(msg) | ||
| elapsed = round(time.time() - begintime, 2) | ||
| logger.info(f"\tWrote cluster file(s) in {elapsed} seconds.") | ||
|
|
||
| if fasta_output is not None: | ||
| logger.info( | ||
| f"\tWrote {max(total_split, total_unsplit)} bins with {processed_contigs} sequences in {elapsed} seconds." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this branch outside the for loop, no need to gate it behind an else branch
| # Cluster and output the Y clusters | ||
| assert opt.common.clustering.max_clusters is None | ||
| write_clusters_and_bins( | ||
| export_binning_results( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this function call miss a bunch of arguments?
vamb/vambtools.py
Outdated
|
|
||
| # Print bin to file | ||
| with open(directory.joinpath(binname + ".fna"), "wb") as file: | ||
| with open(directory.joinpath(str(binname) + ".fna"), "wb") as file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this str necessary? According to the function's type signature, it should already be a string. If it is always a string, do not use str here. If it is not always, then the type signature needs to be updated, or the caller needs to make sure the binname is always a string (in order to conform to the signature)
…ile loop to fix printing logic
Made it so that progress is displayed to the user while running and deleted the unnecessary use of a dictionary and replaced it by using the direct output of a generator.