Faster extraction via StringIO #45
Open
Use StringIO instead of repeated string concatenation for faster extraction line building.
My teammates are suffering through weeks-long runtimes with extract_tracts.py, so I ran it under the Scalene profiler (https://github.com/plasma-umass/scalene) to see where the time goes. With uncompressed output, extract_tracts.py spent more than 80% of its time on these two lines alone:
Repeated string concatenation in Python is slow because each concatenation can copy the full contents of the existing string into a new one, so building a long line piece by piece does a lot of redundant copying. Faster alternatives are .join() or StringIO, which accumulate the pieces and produce the result without recopying everything on every append. The change to StringIO significantly increases speed, although the exact gain will depend on your dataset and the machine you're running it on.
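For anyone who wants to see the pattern in isolation, here is a minimal sketch of the three approaches. The function names and the tab-separated layout are made up for illustration and are not the actual code in extract_tracts.py.

```python
from io import StringIO


def build_line_concat(fields):
    """Repeated concatenation: each += may copy the whole string built so far."""
    line = ""
    for i, field in enumerate(fields):
        if i:
            line += "\t"
        line += field
    return line + "\n"


def build_line_stringio(fields):
    """StringIO: pieces are appended to an in-memory buffer, read out once."""
    buf = StringIO()
    for i, field in enumerate(fields):
        if i:
            buf.write("\t")
        buf.write(field)
    buf.write("\n")
    return buf.getvalue()


def build_line_join(fields):
    """str.join: the result is built in a single call."""
    return "\t".join(fields) + "\n"
```

All three return the same string, so the swap should not change the output, only how the line gets assembled. Timing them on your own data (for example with timeit) is the best way to check the gain.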
If you're open to this and other improvements, it would also be possible to use a library like https://pypi.org/project/xopen/ to make compressed output 5x faster or more, by moving compression and decompression to another core and using a more efficient compression library to do the work.
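A rough sketch of what that could look like, assuming xopen is installed. The helper name open_text_output and the fallback to the standard gzip module are my own assumptions for illustration, not part of this PR.

```python
import gzip

try:
    # Third-party package (pip install xopen); optional in this sketch.
    from xopen import xopen
except ImportError:
    xopen = None


def open_text_output(path, threads=1):
    """Hypothetical helper: open a .gz path for text writing.

    Uses xopen when available so compression can run outside the main
    thread; otherwise falls back to the standard-library gzip module.
    """
    if xopen is not None:
        return xopen(path, mode="wt", threads=threads)
    return gzip.open(path, mode="wt")
```

The returned object is used like a normal text file, e.g. `with open_text_output("out.txt.gz") as fh: fh.write(line)`, so the calling code would not need to change much.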