@pettyalex

Use StringIO instead of repeated string concatenation for faster extraction line building.

My teammates are suffering with weeks-long runtimes on extract_tracts.py, so I profiled it with Scalene (https://github.com/plasma-umass/scalene) to see where the time goes. With uncompressed output, extract_tracts.py spent more than 80% of its time on these two lines alone:

output_lines[f"dos{j}"] += "\t" + str(counts[j])
output_lines[f"ancdos{j}"] += "\t" + str(anc_counts[j])

Repeated string concatenation like this is slow in Python because strings are immutable: each += builds a new string and copies the entire existing contents into it, so the total work grows quadratically with line length. Faster alternatives are .join(), which allocates the result once, and StringIO, which appends to a growable buffer instead of recopying everything on each write. Switching to StringIO significantly increases speed, although the exact gain will depend on your dataset and the machine you run it on.
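
To make the difference concrete, here is a minimal sketch of the two approaches side by side. It is not the actual extract_tracts.py code: the function names, the `rows` input, and the `num_ancs` parameter are hypothetical and only mirror the shape of the two lines quoted above.

```python
import io


def build_with_concat(rows, num_ancs):
    """Pattern from the profile: '+=' on a string stored in a dict."""
    output_lines = {f"dos{j}": "" for j in range(num_ancs)}
    for counts in rows:
        for j in range(num_ancs):
            # Each += creates a new string and copies the old contents.
            output_lines[f"dos{j}"] += "\t" + str(counts[j])
    return output_lines


def build_with_stringio(rows, num_ancs):
    """Same output, but each line is accumulated in an io.StringIO buffer."""
    buffers = {f"dos{j}": io.StringIO() for j in range(num_ancs)}
    for counts in rows:
        for j in range(num_ancs):
            # write() appends to the buffer without recopying earlier text.
            buffers[f"dos{j}"].write("\t" + str(counts[j]))
    # getvalue() materializes each finished line once at the end.
    return {key: buf.getvalue() for key, buf in buffers.items()}


# Hypothetical toy input: one list of per-ancestry counts per sample.
rows = [[0, 1, 1], [2, 0, 0], [1, 1, 0]]
assert build_with_concat(rows, 3) == build_with_stringio(rows, 3)
```

Both functions produce identical strings, so the swap should not change the output file. Only the cost of building each line changes.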

If you're open to this and other improvements, it would also be possible to use a library like https://pypi.org/project/xopen/ or anything similar to make compressed output 5x faster or more, by moving compression and decompression to another core and using a more efficient library to do the work.
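
As a rough sketch of what that could look like: xopen exposes an `xopen()` function that is used like the built-in `open()` and picks the compression format from the file extension. The helper names `open_output` and `write_lines` below are hypothetical, and the fallback to the standard-library gzip module is my own addition so the snippet still runs where xopen is not installed.

```python
import gzip

try:
    from xopen import xopen  # third-party: pip install xopen
except ImportError:
    xopen = None  # fall back to the standard library below


def open_output(path):
    """Open a .gz path for text writing, preferring xopen when available."""
    if xopen is not None:
        return xopen(path, mode="wt")
    return gzip.open(path, "wt")


def write_lines(path, lines):
    """Write an iterable of strings to a compressed file, one per line."""
    with open_output(path) as out:
        for line in lines:
            out.write(line + "\n")
```

Any speedup would need to be measured on real data. This only shows that the call site barely changes.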

@nirav572
Contributor

nirav572 commented Mar 5, 2025

Hi Alex, thank you for the suggestion. Could you please reach out to me at nirav.shah@bcm.edu?

@michaelofrancis

Hi, I ran into the same slowness issue and wrote this script, which splits the work of extract_tracts.py into an arbitrary number of chunks that can be run as parallel processes, then merges the chunk outputs at the end in numerical order of chunk index. The merged result is identical to the output of the original script.

Use:
python3 chunk_extract_tracts.py --vcf $vcf --msp $msp --num-ancs $n_anc --output-dir $out --total-chunks 1000 --chunk-index $i

chunk-extract-tracts.py.zip
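
For readers who do not want to open the zip, the general idea can be sketched as follows. This is not the attached script: it assumes a 0-based chunk index, contiguous row ranges, and hypothetical output names such as `chunk_10.txt`, none of which are confirmed above.

```python
import re


def chunk_bounds(n_rows, total_chunks, chunk_index):
    """Return the [start, stop) row range handled by one chunk.

    Rows are split as evenly as possible; the first `extra` chunks
    each take one additional row.
    """
    base, extra = divmod(n_rows, total_chunks)
    start = chunk_index * base + min(chunk_index, extra)
    stop = start + base + (1 if chunk_index < extra else 0)
    return start, stop


def numeric_chunk_order(names):
    """Sort chunk file names by their first embedded integer.

    A plain sorted() would put 'chunk_10' before 'chunk_2'.
    """
    def key(name):
        return int(re.search(r"(\d+)", name).group(1))

    return sorted(names, key=key)
```

Merging in numeric rather than lexicographic order is what keeps the concatenated output in the same row order as a single run.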
