@brownag (Contributor) commented Sep 9, 2025

This PR adds an argument, raw, to pdf_text() (and poppler_pdf_text()) so that callers can select page::raw_order_layout instead of page::physical_layout.

I have several workflows that parse legacy PDF documents by running the pdftotext command-line tool with the -raw flag and then reading the output into R. I currently can't use pdftools::pdf_text() for these use cases because the pages have a complex multi-column format that is much easier to handle in the stream-order raw layout than in the physical layout.
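For illustration, a minimal sketch of how the new argument might be used. It assumes pdftools >= 3.6.0 and that raw is a logical flag defaulting to FALSE; the helper name compare_layouts and the file name in the usage comment are hypothetical and not part of the package.

```r
# Sketch only: assumes pdftools >= 3.6.0 and that `raw` is a logical flag
# (default FALSE) on pdf_text(), as described in this PR.

# Hypothetical helper: extract the text of every page in both layouts.
compare_layouts <- function(pdf_file) {
  # Fail early if the file is missing, before calling into pdftools.
  stopifnot(file.exists(pdf_file))
  list(
    # Default behaviour: text arranged to follow the physical page layout.
    physical = pdftools::pdf_text(pdf_file),
    # New argument from this PR: text in raw (content stream) order.
    raw = pdftools::pdf_text(pdf_file, raw = TRUE)
  )
}

# Usage with a hypothetical file name:
# layouts <- compare_layouts("legacy_report.pdf")
# cat(layouts$raw[1])
```

Comparing the two results for the first page of a multi-column document should make it clear which layout is easier to parse for a given file.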

I love the pdftools package and use it frequently. Thanks for your consideration!

@jeroen jeroen merged commit c863efc into ropensci:master Sep 9, 2025
9 checks passed
@jeroen (Member) commented Sep 10, 2025

This has landed on CRAN in pdftools 3.6.0

@brownag (Contributor, Author) commented Sep 10, 2025

Wonderful, thanks for all of your great work, @jeroen
