Faster extraction via StringIO #45

pettyalex · 2025-03-03T21:48:21Z

Use StringIO instead of repeated string concatenation for faster extraction line building.

My teammates are suffering through weeks-long runtimes with extract_tracts.py, so I ran it under a profiler (Scalene, https://github.com/plasma-umass/scalene) to see where the time goes. With uncompressed output, extract_tracts.py spent more than 80% of its time on these two lines alone:

output_lines[f"dos{j}"] += "\t" + str(counts[j])
output_lines[f"ancdos{j}"] += "\t" + str(anc_counts[j])

Repeated string concatenation in Python is slow here because each += can copy the entire existing string into a new one, so the cost of building a long line grows roughly quadratically with its length. Faster options are .join(), which assembles the result in a single pass, or StringIO, which appends to an in-memory buffer rather than re-copying the string every time. The change to StringIO significantly increases speed, although the exact gain will depend on your dataset and the machine you run it on.
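To make the difference concrete, here is a minimal sketch of the three ways to build one tab-separated output line. The function names and the "dos" prefix are illustrative assumptions, not the actual code in extract_tracts.py or in this PR's diff.

```python
import io


def build_line_concat(prefix, values):
    """Repeated concatenation: each += may copy everything built so far."""
    line = prefix
    for v in values:
        line += "\t" + str(v)
    return line


def build_line_stringio(prefix, values):
    """StringIO: append pieces to an in-memory buffer, read it out once."""
    buf = io.StringIO()
    buf.write(prefix)
    for v in values:
        buf.write("\t")
        buf.write(str(v))
    return buf.getvalue()


def build_line_join(prefix, values):
    """join: collect the pieces first, then assemble the string in one pass."""
    return "\t".join([prefix, *map(str, values)])
```

All three return the same string; only the amount of copying differs. StringIO is the natural fit when values arrive incrementally inside a loop, as they do when a line is extended once per sample.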

If you're open to this and other improvements, a library such as https://pypi.org/project/xopen/ (or anything similar) could make compressed output 5x faster or more by moving compression and decompression to another core and using a more efficient compression library.
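As a rough sketch of what that could look like, the snippet below writes gzip-compressed text through xopen when it is installed and falls back to the standard-library gzip module otherwise. The helper names, the thread count, and the fallback are assumptions made for illustration; none of this is part of the PR.

```python
import gzip

try:
    from xopen import xopen  # third-party: pip install xopen
except ImportError:  # fall back to the standard library if xopen is missing
    xopen = None


def open_compressed_text(path, threads=4):
    """Open `path` for writing gzip-compressed text.

    xopen chooses the compression format from the file extension, so
    `path` is assumed to end in ".gz". With threads > 0 it can hand
    compression off to a separate thread or process.
    """
    if xopen is not None:
        return xopen(path, mode="wt", threads=threads)
    return gzip.open(path, mode="wt")


def write_lines(path, lines):
    """Write each line, newline-terminated, to a compressed file."""
    with open_compressed_text(path) as out:
        for line in lines:
            out.write(line + "\n")
```

The output is ordinary gzip either way, so downstream tools that read the .gz files should not need to change.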


nirav572 · 2025-03-05T17:35:24Z

Hi Alex, thank you for the suggestion. Could you please reach out to me at nirav.shah@bcm.edu?

michaelofrancis · 2025-05-20T19:56:57Z

Hi, I ran into the same slowness issue and wrote this script, which splits the extract_tracts.py run into an arbitrary number of chunks that can be processed in parallel, then merges the chunk outputs at the end in numerical order of the chunk index. The result is identical to the output of the original script.

Use:
python3 chunk_extract_tracts.py --vcf $vcf --msp $msp --num-ancs $n_anc --output-dir $out --total-chunks 1000 --chunk-index $i

chunk-extract-tracts.py.zip
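For readers who want the general idea without opening the attachment, here is a minimal sketch of the two pieces such a scheme needs: splitting the work into contiguous chunks and merging the results in numeric rather than lexicographic order. It is not taken from the attached script. The 0-based chunk index, the function names, and the "chunk<N>" file naming are all assumptions.

```python
import re


def chunk_bounds(n_items, total_chunks, chunk_index):
    """Return the half-open range [start, stop) for one chunk.

    Assumes chunk_index is 0-based. Items are spread as evenly as
    possible; the first (n_items % total_chunks) chunks get one extra.
    """
    base, extra = divmod(n_items, total_chunks)
    start = chunk_index * base + min(chunk_index, extra)
    stop = start + base + (1 if chunk_index < extra else 0)
    return start, stop


def merge_order(filenames):
    """Sort chunk files by the integer in 'chunk<N>' (hypothetical naming),
    so that chunk10 comes after chunk2 instead of before it."""
    return sorted(
        filenames,
        key=lambda name: int(re.search(r"chunk(\d+)", name).group(1)),
    )
```

Sorting numerically matters because a plain string sort would place chunk10 ahead of chunk2 and scramble the merged output.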

Commit 643a020: Use StringIO instead of repeated string concatenation for faster extraction line building.

pettyalex mentioned this pull request Mar 3, 2025

Speed up extract_tracts.py step for large vcf #44

Closed
