
@MillarCD
Contributor

Hello again, @wdecoster!

I was working on improving the performance of Chopper.

I noticed that the major bottleneck comes from printing records across all threads.
Although the filtering happens in parallel, printing is sequential: each thread must wait for exclusive access to standard output, and every write involves expensive system calls.

To solve this, I split the filter function into two tasks and assigned the threads different roles: processing records and writing records.

Writer: a dedicated worker that receives valid records and writes them to STDOUT.
It uses a buffered writer to reduce system calls and further improve performance. Only one thread acts as the writer.

Filter: threads that validate records based on quality and apply trimming.
When a record is valid, the thread sends it to the writer. All remaining threads act as filters.

I used one channel per worker, which slightly improved performance.
I also used the crossbeam-channel crate to read efficiently from multiple channels at once.

Finally, I created a separate function that runs sequentially when only one thread is requested.
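
To make the division of roles concrete, here is a minimal sketch of the pattern; it is not the actual Chopper code. The record type, the `passes_quality` check, and the chunking are placeholders, and the real implementation uses one channel per filter worker with crossbeam's `Select`, whereas this sketch uses a single shared channel for brevity.

```rust
use std::io::{BufWriter, Write};

use crossbeam_channel::unbounded;

fn main() {
    // Placeholder input; the real code streams FASTQ records instead.
    let records: Vec<String> = (0..1000)
        .map(|i| format!("@read{i}\nACGT\n+\nIIII\n"))
        .collect();

    let (tx, rx) = unbounded::<String>();

    std::thread::scope(|s| {
        // Writer: the only thread touching stdout, behind a buffer to cut syscalls.
        s.spawn(move || {
            let stdout = std::io::stdout();
            let mut out = BufWriter::new(stdout.lock());
            for record in rx {
                out.write_all(record.as_bytes()).expect("write failed");
            }
            out.flush().expect("flush failed");
        });

        // Filters: validate records in parallel and forward the ones that pass.
        for chunk in records.chunks(250) {
            let tx = tx.clone();
            s.spawn(move || {
                for record in chunk {
                    if passes_quality(record) {
                        tx.send(record.clone()).expect("send failed");
                    }
                }
            });
        }
        // Dropping the last sender lets the writer's receive loop terminate.
        drop(tx);
    });
}

// Stand-in for the real quality filtering and trimming logic.
fn passes_quality(_record: &str) -> bool {
    true
}
```

Having a single writer is also what makes it natural to keep output bookkeeping on one thread instead of sharing it across all workers.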

Benchmark

To evaluate performance improvements, I measured the runtime across different numbers of threads and percentages of records retained after filtering.

Based on my hardware specifications, I selected 1, 2, 4, and 6 threads.
For the filtering levels, I tested 0%, 50%, 80% and 100% of records kept (I tested different parameter values to achieve these results).

Each scenario was executed 20 times per Chopper version, and the average runtime was calculated.
The benchmark used a FASTQ file of approximately 9 GB.

Hardware Specifications

| Component | Specification |
| --- | --- |
| CPU | AMD Ryzen 5500U (6 cores / 12 threads) |
| RAM | 8 GB DDR4-3200 |

Results: Average Runtime per Number of Threads (in seconds)

| Records kept | Ch-0.11 (t1) | Ch-new (t1) | Ch-0.11 (t2) | Ch-new (t2) | Ch-0.11 (t4) | Ch-new (t4) | Ch-0.11 (t6) | Ch-new (t6) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0% | 87.72 | 87.92 | 54.48 | 88.14 | 37.78 | 41.23 | 33.72 | 32.56 |
| 50% | 119.43 | 105.29 | 81.00 | 98.60 | 63.05 | 53.65 | 99.61 | 45.01 |
| 80% | 137.58 | 110.24 | 92.75 | 102.28 | 94.70 | 53.94 | 165.41 | 46.68 |
| 100% | 156.94 | 124.41 | 113.08 | 108.40 | 154.70 | 72.54 | 217.89 | 64.99 |
*(figure: runtime_comparison)*

In summary, the proposed version significantly improves performance in most scenarios,
especially as the number of threads and the percentage of records kept increase.

The only exception occurs when using two threads, where the original version performs
slightly better due to lower contention when both threads write directly to STDOUT.
In the proposed implementation, the dedicated writer thread remains idle at times while
waiting for new records, which slightly impacts efficiency in this specific case.

Overall, the new design achieves better scalability and more consistent performance across
multiple thread configurations, reducing I/O bottlenecks and improving throughput for larger workloads.

Refactorings

To achieve this performance, I refactored several functions:

  1. filter function: it now initializes the main variables (Aligner and Trimmer) and then calls a dedicated function depending on whether the workflow is sequential or parallel.
    I also extracted the main logic that validates quality and applies trimming into a new function called get_valid_segments.

  2. write_record function: I created a new struct called WritableRecord with a method write_on_buffer that writes the valid segments into a buffer (see the sketch after this list).
    I also added a helper function record_to_string to generate the string representation of a record.

  3. Tests module: I moved all unit tests in main.rs into a dedicated module inside main.rs, following Rust conventions.
    I also added tests for get_valid_segments to verify the double validation of the minimum length.
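
As a rough illustration of points 1 and 2, the sketch below shows the shape such a struct could take; the field names and the exact FASTQ formatting are assumptions rather than the code in this PR. Making write_on_buffer generic over any `Write` is also what lets tests write into an in-memory buffer instead of stdout.

```rust
use std::io::{BufWriter, Write};

// Hypothetical fields; the real struct carries whatever the writer needs to print.
struct WritableRecord {
    id: String,
    seq: String,
    qual: String,
}

// Helper that renders one record as a FASTQ string.
fn record_to_string(rec: &WritableRecord) -> String {
    format!("@{}\n{}\n+\n{}\n", rec.id, rec.seq, rec.qual)
}

impl WritableRecord {
    // Generic over `W: Write`, so the destination can be stdout, a file, or a Vec<u8> in tests.
    fn write_on_buffer<W: Write>(&self, out: &mut BufWriter<W>) -> std::io::Result<()> {
        out.write_all(record_to_string(self).as_bytes())
    }
}

fn main() -> std::io::Result<()> {
    let rec = WritableRecord {
        id: "read1".into(),
        seq: "ACGT".into(),
        qual: "IIII".into(),
    };

    // Writing into an in-memory buffer keeps test output off stdout.
    let mut captured = Vec::new();
    {
        let mut out = BufWriter::new(&mut captured);
        rec.write_on_buffer(&mut out)?;
        out.flush()?;
    }
    assert_eq!(String::from_utf8_lossy(&captured), "@read1\nACGT\n+\nIIII\n");
    Ok(())
}
```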

Limitations

One of the main drawbacks of the new approach is related to testing.
Because the filter function writes through Stdout.lock, its unit tests print all records to standard output when they run.

To work around this, I marked these tests with #[ignore] for now, so they are skipped during regular test runs.

A possible improvement would be to test the entire program through an integration test or a GitHub workflow, capturing and validating the output without directly relying on Stdout.
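
For that last point, a hypothetical integration test could drive the compiled binary and capture its stdout, for example via Cargo's `CARGO_BIN_EXE_*` environment variable. The binary name, CLI flags, and input used below are assumptions for illustration, not Chopper's documented interface.

```rust
// tests/filter_output.rs (hypothetical)
use std::io::Write;
use std::process::{Command, Stdio};

#[test]
fn filter_writes_fastq_to_stdout() {
    // Cargo exposes the path of the crate's binary to integration tests.
    let mut child = Command::new(env!("CARGO_BIN_EXE_chopper"))
        .arg("--quality")
        .arg("10")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()
        .expect("failed to start chopper");

    // Feed a tiny FASTQ record on stdin instead of touching real files.
    child
        .stdin
        .as_mut()
        .expect("stdin was piped")
        .write_all(b"@read1\nACGT\n+\nIIII\n")
        .expect("failed to write stdin");

    // wait_with_output closes stdin before waiting, so the program sees EOF.
    let output = child.wait_with_output().expect("chopper did not finish");
    assert!(output.status.success());
    let stdout = String::from_utf8_lossy(&output.stdout);
    assert!(stdout.is_empty() || stdout.starts_with('@'));
}
```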

Final Comments

Thank you for taking the time to review these changes. I hope they provide a meaningful improvement.

- Use a `Rayon::scope` to spawn two tasks: one for filtering reads, one for printing.
- Introduce `WritableRecord` struct to hold data needed for printing.
- Move the `write_record` function into the `WritableRecord` implementation.
- Use BufWriter to reduce syscall overhead when writing to stdout.
- Move `output_reads_` counting to the writer thread to avoid parallel writes.
- Relax atomic ordering from `SeqCst` to `Relaxed` for read counting.
- Store pre-formatted FASTQ strings instead of `fastq::Record` to leverage
  parallel processing before sending through the channel.
…ng reads

- Add crossbeam crate to provide unbounded channels
- Use Select struct to manage multiple receivers
- Introduced `get_valid_segment` to encapsulate record validation logic.
- Added `sequential_filter` and `parallel_filter` for more efficient execution.
- Refactored `filter` to delegate to the appropriate function.
- Fixed typo: renamed variable `writter` to `writer`.
- Mark tests that call the `filter` function with `#[ignore]` to prevent them from running by default
- Move `record_to_string` outside `WritableRecord`
- Refactor `WritableRecord::write_on_buffer` to accept `BufWriter<W: Write>`
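
Regarding the `SeqCst` to `Relaxed` change listed above, here is a minimal sketch of relaxed read counting; the names are hypothetical and not taken from the Chopper source. The counter is only ever incremented and then read after the workers have finished, so it does not need to order any other memory.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static OUTPUT_READS: AtomicU64 = AtomicU64::new(0);

// Only increments happen concurrently; nothing else synchronizes through this
// counter, so `Relaxed` suffices and avoids the full fences implied by `SeqCst`.
fn count_kept_read() {
    OUTPUT_READS.fetch_add(1, Ordering::Relaxed);
}

// Read once the worker threads have been joined, e.g. for the final summary line.
fn total_kept_reads() -> u64 {
    OUTPUT_READS.load(Ordering::Relaxed)
}
```
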
@wdecoster
Owner

Hi @MillarCD

Wow, that's great. Let me look into all of this :-)
It's interesting that things behave so differently with two threads, but I understand. I believe most likely you will want to remove a few bad reads while keeping most of them, so the more realistic scenarios would be keeping 80 and 100% of the records (when there is more writing involved). Would you agree? In those cases, the two threads do not perform (much) worse. It would probably be worth investigating what happens with 3 threads - is then everything again in favor of your new implementation?

Wouter

@MillarCD
Contributor Author

Yes, from my limited experience working with ONT data, I usually keep around 90% of the reads.

Regarding the two-thread case, a lot of time is wasted waiting for new reads to write to standard output, while only one thread is actively filtering. So it ends up being a scenario very similar to using just one thread.

One possible solution could be to use asynchronous programming, so that the thread responsible for writing records could also filter reads when there are no reads available to write yet (maybe that could be the next pull request).

I also ran the comparison with three threads, and here are the results. Apparently, starting from three threads, the new approach improves performance.

*(figure: runtime_comparison_with_3_threads)*

@wdecoster
Owner

Awesome! Thanks for contributing!

@wdecoster merged commit b281f6e into wdecoster:master on Oct 28, 2025
1 check passed
@MillarCD
Contributor Author

Glad to contribute, and thanks for merging! 😁
