
@MillarCD
Contributor

Hello again, @wdecoster!

I was working on improving the performance of Chopper.

I noticed that the major bottleneck comes from printing records across all threads.
Although the filtering happens in parallel, printing is sequential: each thread must wait for exclusive access to standard output, and every write involves expensive system calls.

To solve this, I split the filter function into two tasks and assigned the threads different roles: processing records and writing records.

Writer: a dedicated worker that receives valid records and writes them to STDOUT.
It uses a buffered writer to reduce system calls and further improve performance. Only one thread acts as the writer.

Filter: threads that validate records based on quality and apply trimming.
When a record is valid, the thread sends it to the writer. All remaining threads act as filters.

I used one channel per worker, which slightly improved performance.
I also used the crossbeam-channel crate to read efficiently from multiple channels at once.

Finally, I created a separate function that runs sequentially when only one thread is requested.
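
To make the division of roles concrete, here is a minimal sketch of the pattern; it is not the actual Chopper code. The record type, the `passes_quality` check, and the chunking are placeholders, and the real implementation uses one channel per filter worker with crossbeam's `Select`, whereas this sketch uses a single shared channel for brevity.

```rust
use std::io::{BufWriter, Write};

use crossbeam_channel::unbounded;

fn main() {
    // Placeholder input; the real code streams FASTQ records instead.
    let records: Vec<String> = (0..1000)
        .map(|i| format!("@read{i}\nACGT\n+\nIIII\n"))
        .collect();

    let (tx, rx) = unbounded::<String>();

    std::thread::scope(|s| {
        // Writer: the only thread touching stdout, behind a buffer to cut syscalls.
        s.spawn(move || {
            let stdout = std::io::stdout();
            let mut out = BufWriter::new(stdout.lock());
            for record in rx {
                out.write_all(record.as_bytes()).expect("write failed");
            }
            out.flush().expect("flush failed");
        });

        // Filters: validate records in parallel and forward the ones that pass.
        for chunk in records.chunks(250) {
            let tx = tx.clone();
            s.spawn(move || {
                for record in chunk {
                    if passes_quality(record) {
                        tx.send(record.clone()).expect("send failed");
                    }
                }
            });
        }
        // Dropping the last sender lets the writer's receive loop terminate.
        drop(tx);
    });
}

// Stand-in for the real quality filtering and trimming logic.
fn passes_quality(_record: &str) -> bool {
    true
}
```

Having a single writer is also what makes it natural to keep output bookkeeping on one thread instead of sharing it across all workers.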

Benchmark

To evaluate performance improvements, I measured the runtime across different numbers of threads and percentages of records retained after filtering.

Based on my hardware specifications, I selected 1, 2, 4, and 6 threads.
For the filtering levels, I tested 0%, 50%, 80% and 100% of records kept (I tested different parameter values to achieve these results).

Each scenario was executed 20 times per Chopper version, and the average runtime was calculated.
The benchmark used a FASTQ file of approximately 9 GB.

Hardware Specifications

| Component | Specification |
| --- | --- |
| CPU | AMD Ryzen 5500U (6 cores / 12 threads) |
| RAM | 8 GB DDR4-3200 |

Results: Average Runtime per Number of Threads (in seconds)

| Records kept | Ch-0.11 (t1) | Ch-new (t1) | Ch-0.11 (t2) | Ch-new (t2) | Ch-0.11 (t4) | Ch-new (t4) | Ch-0.11 (t6) | Ch-new (t6) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0% | 87.72 | 87.92 | 54.48 | 88.14 | 37.78 | 41.23 | 33.72 | 32.56 |
| 50% | 119.43 | 105.29 | 81.00 | 98.60 | 63.05 | 53.65 | 99.61 | 45.01 |
| 80% | 137.58 | 110.24 | 92.75 | 102.28 | 94.70 | 53.94 | 165.41 | 46.68 |
| 100% | 156.94 | 124.41 | 113.08 | 108.40 | 154.70 | 72.54 | 217.89 | 64.99 |
*(figure: runtime_comparison)*

In summary, the proposed version significantly improves performance in most scenarios,
especially as the number of threads and the percentage of records kept increase.

The only exception occurs when using two threads, where the original version performs
slightly better due to lower contention when both threads write directly to STDOUT.
In the proposed implementation, the dedicated writer thread remains idle at times while
waiting for new records, which slightly impacts efficiency in this specific case.

Overall, the new design achieves better scalability and more consistent performance across
multiple thread configurations, reducing I/O bottlenecks and improving throughput for larger workloads.

Refactorings

To achieve this performance, I refactored several functions:

  1. filter function: it now initializes the main variables (Aligner and Trimmer) and then calls a dedicated function depending on whether the workflow is sequential or parallel.
    I also extracted the main logic that validates quality and applies trimming into a new function called get_valid_segments.

  2. write_record function: I created a new struct called WritableRecord with a method write_on_buffer that writes the valid segments into a buffer (see the sketch after this list).
    I also added a helper function record_to_string to generate the string representation of a record.

  3. Tests module: I moved all unit tests in main.rs into a dedicated module inside main.rs, following Rust conventions.
    I also added tests for get_valid_segments to verify the double validation of the minimum length.
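
As a rough illustration of points 1 and 2, the sketch below shows the shape such a struct could take; the field names and the exact FASTQ formatting are assumptions rather than the code in this PR. Making write_on_buffer generic over any `Write` is also what lets tests write into an in-memory buffer instead of stdout.

```rust
use std::io::{BufWriter, Write};

// Hypothetical fields; the real struct carries whatever the writer needs to print.
struct WritableRecord {
    id: String,
    seq: String,
    qual: String,
}

// Helper that renders one record as a FASTQ string.
fn record_to_string(rec: &WritableRecord) -> String {
    format!("@{}\n{}\n+\n{}\n", rec.id, rec.seq, rec.qual)
}

impl WritableRecord {
    // Generic over `W: Write`, so the destination can be stdout, a file, or a Vec<u8> in tests.
    fn write_on_buffer<W: Write>(&self, out: &mut BufWriter<W>) -> std::io::Result<()> {
        out.write_all(record_to_string(self).as_bytes())
    }
}

fn main() -> std::io::Result<()> {
    let rec = WritableRecord {
        id: "read1".into(),
        seq: "ACGT".into(),
        qual: "IIII".into(),
    };

    // Writing into an in-memory buffer keeps test output off stdout.
    let mut captured = Vec::new();
    {
        let mut out = BufWriter::new(&mut captured);
        rec.write_on_buffer(&mut out)?;
        out.flush()?;
    }
    assert_eq!(String::from_utf8_lossy(&captured), "@read1\nACGT\n+\nIIII\n");
    Ok(())
}
```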

Limitations

One of the main drawbacks of the new approach is related to testing.
Because the filter function writes through Stdout.lock, its unit tests print all records to standard output when they run.

To work around this, I marked these tests with #[ignore] for now, so they are skipped during regular test runs.

A possible improvement would be to test the entire program through an integration test or a GitHub workflow, capturing and validating the output without directly relying on Stdout.
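
For that last point, a hypothetical integration test could drive the compiled binary and capture its stdout, for example via Cargo's `CARGO_BIN_EXE_*` environment variable. The binary name, CLI flags, and input used below are assumptions for illustration, not Chopper's documented interface.

```rust
// tests/filter_output.rs (hypothetical)
use std::io::Write;
use std::process::{Command, Stdio};

#[test]
fn filter_writes_fastq_to_stdout() {
    // Cargo exposes the path of the crate's binary to integration tests.
    let mut child = Command::new(env!("CARGO_BIN_EXE_chopper"))
        .arg("--quality")
        .arg("10")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()
        .expect("failed to start chopper");

    // Feed a tiny FASTQ record on stdin instead of touching real files.
    child
        .stdin
        .as_mut()
        .expect("stdin was piped")
        .write_all(b"@read1\nACGT\n+\nIIII\n")
        .expect("failed to write stdin");

    // wait_with_output closes stdin before waiting, so the program sees EOF.
    let output = child.wait_with_output().expect("chopper did not finish");
    assert!(output.status.success());
    let stdout = String::from_utf8_lossy(&output.stdout);
    assert!(stdout.is_empty() || stdout.starts_with('@'));
}
```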

Final Comments

Thank you for taking the time to review these changes. I hope they provide a meaningful improvement.

- Use a `Rayon::scope` to spawn two tasks: one for filtering reads, one for printing.
- Introduce `WritableRecord` struct to hold data needed for printing.
- Move the `write_record` function into the `WritableRecord` implementation.
- Use BufWriter to reduce syscall overhead when writing to stdout.
- Move `output_reads_` counting to the writer thread to avoid parallel writes.
- Relax atomic ordering from `SeqCst` to `Relaxed` for read counting.
- Store pre-formatted FASTQ strings instead of `fastq::Record` to leverage
  parallel processing before sending through the channel.
…ng reads

- Add crossbeam crate to provide unbounded channels
- Use Select struct to manage multiple receivers
- Introduced `get_valid_segment` to encapsulate record validation logic.
- Added `sequential_filter` and `parallel_filter` for more efficient execution.
- Refactored `filter` to delegate to the appropriate function.
- Fixed typo: renamed variable `writter` to `writer`.
- Mark tests that call the `filter` function with `#[ignore]` to prevent them from running by default
- Move `record_to_string` outside `WritableRecord`
- Refactor `WritableRecord::write_on_buffer` to accept `BufWriter<W: Write>`
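
Regarding the `SeqCst` to `Relaxed` change listed above, here is a minimal sketch of relaxed read counting; the names are hypothetical and not taken from the Chopper source. The counter is only ever incremented and then read after the workers have finished, so it does not need to order any other memory.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static OUTPUT_READS: AtomicU64 = AtomicU64::new(0);

// Only increments happen concurrently; nothing else synchronizes through this
// counter, so `Relaxed` suffices and avoids the full fences implied by `SeqCst`.
fn count_kept_read() {
    OUTPUT_READS.fetch_add(1, Ordering::Relaxed);
}

// Read once the worker threads have been joined, e.g. for the final summary line.
fn total_kept_reads() -> u64 {
    OUTPUT_READS.load(Ordering::Relaxed)
}
```
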
@wdecoster
Owner

Hi @MillarCD

Wow, that's great. Let me look into all of this :-)
It's interesting that things behave so differently with two threads, but I understand. I believe most likely you will want to remove a few bad reads while keeping most of them, so the more realistic scenarios would be keeping 80 and 100% of the records (when there is more writing involved). Would you agree? In those cases, the two threads do not perform (much) worse. It would probably be worth investigating what happens with 3 threads - is then everything again in favor of your new implementation?

Wouter

@MillarCD
Contributor Author

Yes, from my limited experience working with ONT data, I usually keep around 90% of the reads.

Regarding the two-thread case, a lot of time is wasted waiting for new reads to write to standard output, while only one thread is actively filtering. So it ends up being a scenario very similar to using just one thread.

One possible solution could be to use asynchronous programming, so that the thread responsible for writing records could also filter reads when there are no reads available to write yet (maybe that could be the next pull request).

I also ran the comparison with three threads, and here are the results. Apparently, starting from three threads, the new approach improves performance.

*(figure: runtime_comparison_with_3_threads)*

@wdecoster
Owner

Awesome! Thanks for contributing!

@wdecoster merged commit b281f6e into wdecoster:master on Oct 28, 2025
1 check passed
@MillarCD
Contributor Author

Glad to contribute, and thanks for merging! 😁
