Support text extraction from eml files and other text extraction improvements #36

martinburchell · 2025-05-14T05:57:32Z

`document_to_text()` now supports .eml files, with attachments processed by any supported document converter. The one exception is the fallback strings converter, which we don't run on attachments because it would produce a lot of rubbish from images etc.
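To illustrate the attachment handling described above, here is a minimal sketch using only the standard library `email` package. It is not the code from this PR: `eml_to_text`, `build_example` and the `converters` mapping are hypothetical names, and keying converters by MIME type is an assumption made for brevity.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def eml_to_text(raw: bytes, converters: dict) -> str:
    """Hypothetical helper: body text plus text from convertible attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # multipart containers carry no content of their own
        content_type = part.get_content_type()
        payload = part.get_payload(decode=True) or b""
        if content_type == "text/plain" and not part.is_attachment():
            charset = part.get_content_charset() or "utf-8"
            chunks.append(payload.decode(charset, errors="replace"))
        elif content_type in converters:
            chunks.append(converters[content_type](payload))
        # Anything else (images etc.) is skipped: no "strings" fallback here.
    return "\n".join(chunks)


def build_example() -> bytes:
    """Build a small message with a body, a CSV attachment and an image."""
    msg = EmailMessage()
    msg["Subject"] = "Test"
    msg.set_content("Hello from the body")
    msg.add_attachment(
        b"a,b\n1,2\n", maintype="text", subtype="csv", filename="data.csv"
    )
    msg.add_attachment(
        b"\x89PNG not really an image",
        maintype="image",
        subtype="png",
        filename="pic.png",
    )
    return msg.as_bytes()
```

Calling `eml_to_text(build_example(), {"text/csv": lambda b: b.decode("utf-8")})` returns the body and the CSV content, while the image attachment contributes nothing.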

I've had to rename the `email` module to avoid a conflict with the Python standard library. At the same time I've renamed `json` and `profile` to avoid any future conflicts.

A few other minor changes to text extraction:

* Distinguish between missing and empty blobs
* Make the HTML parser more tolerant of dodgy encoding
* Fix warnings from BeautifulSoup
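As a rough illustration of what "tolerant of dodgy encoding" can mean in practice, the sketch below decodes bytes leniently and replaces lone surrogate characters before the text is passed on. Both function names are hypothetical, and this is not necessarily how the PR implements it.

```python
def decode_leniently(data: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode bytes, replacing invalid sequences rather than raising."""
    try:
        return data.decode(declared_encoding, errors="replace")
    except LookupError:
        # The declared encoding name is unknown to Python: fall back to UTF-8.
        return data.decode("utf-8", errors="replace")


def replace_lone_surrogates(text: str) -> str:
    """Replace characters that cannot be encoded as UTF-8 (lone surrogates)."""
    return text.encode("utf-8", errors="replace").decode("utf-8")
```

With these, `decode_leniently(b"caf\xe9")` gives `"caf\ufffd"` instead of raising `UnicodeDecodeError`, and `replace_lone_surrogates("\ud800abc")` gives `"?abc"`.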

RudolfCardinal

Looks good - thank you! I note:

* In `get_filelikeobject` you've changed `if not filename and not blob` to `if filename is None and blob is None`; that does remove a check for empty strings or empty byte-like objects. Perhaps for blob that was the intention, particularly for empty e-mail content? There's another instance of this later (line 1428) too.

martinburchell · 2025-05-14T08:34:30Z

> Looks good - thank you! I note:
>
> * In `get_filelikeobject` you've changed `if not filename and not blob` to `if filename is None and blob is None`; that does remove a check for empty strings or empty byte-like objects. Perhaps for blob that was the intention, particularly for empty e-mail content? There's another instance of this later (line 1428) too.

Yes, it was necessary for empty email content (supported by the tests), but I'll fix it so that there is a better error message for an empty filename.
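For anyone reading along, the difference between the two checks can be shown in isolation. The function names below are hypothetical, and the third variant is only a guess at what treating an empty filename as missing while allowing an empty blob might look like; it is not taken from the code.

```python
from typing import Optional


def is_missing_old(filename: Optional[str], blob: Optional[bytes]) -> bool:
    # Truthiness test: None, "" and b"" all count as "not supplied".
    return not filename and not blob


def is_missing_new(filename: Optional[str], blob: Optional[bytes]) -> bool:
    # Identity test: only None counts as "not supplied".
    return filename is None and blob is None


def is_missing_mixed(filename: Optional[str], blob: Optional[bytes]) -> bool:
    # Assumed combination: empty filename is rejected, empty blob is allowed.
    return not filename and blob is None
```

An empty blob such as an email with no body (`blob=b""`) is rejected by the old check but accepted by the other two, whereas an empty filename with no blob only slips through the second check.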

martinburchell added 30 commits April 29, 2025 13:36

Add tests __init__.py

04266f3

This will cause the tests directory to be treated as a package and allow two test files to have the same name in different directories.

Test document_to_text exceptions

6e9f343

Test document_to_text CSV extraction

9d78d2c

Test doc extraction

5a8d542

Test dot file extraction

a8f8cb5

Update docs

2cb2866

Test DOCX conversion

699645d

Test HTML conversion

78873eb

Test log file conversion

9219cab

Test ODT file conversion

04b0c37

Test PDF file conversion

21e2b81

Test RTF file conversion

82737b1

Install Faker when building docs and running tests

1427e82

Test TXT file conversion

d1f8977

Test XML and anything else converted to text

10b1ac0

Fix name clashes with python built-ins

37d3257

This is preventing us from doing this in `extract_text.py`: `from email.parser import Parser as EmailParser`, because `email` already exists as a module.
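The general mechanism can be reproduced outside the library. The sketch below (a hypothetical helper, not part of this PR) writes a local `email.py` into a temporary directory, puts that directory first on `sys.path` in a child interpreter, and tries the same import. It assumes the local module ends up ahead of the standard library on the path, which is one common way this kind of clash arises.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def import_with_local_email_module() -> subprocess.CompletedProcess:
    """Try 'from email.parser import Parser' with a local email.py shadowing it."""
    with tempfile.TemporaryDirectory() as tmp:
        # A plain module called email.py, standing in for a clashing module.
        (Path(tmp) / "email.py").write_text("# local module named email\n")
        code = (
            "import sys\n"
            f"sys.path.insert(0, {tmp!r})\n"
            "from email.parser import Parser\n"
        )
        # -S skips the site module so nothing imports the real package first.
        return subprocess.run(
            [sys.executable, "-S", "-c", code],
            capture_output=True,
            text=True,
        )
```

The child process exits with a non-zero status and a `ModuleNotFoundError`, because the local `email.py` is a plain module with no `parser` submodule.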

Ignore shadowing of python built-ins

4ba8610

Remove check for conflicting email import

b0520cb

Update docs

d1b00b0

Fixups following module renaming

dc511c7

extract_text.py type hints

be15403

Use html.parser for BeautifulSoup

c9a06ce

Support .eml text extraction

761e404

Replace deprecated BeautifulStoneSoup as advised

75b9ce6

Default to UTF-8 when no charset in emails

e58d8fd

Default to UTF-8 when no content type header in emails
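A minimal sketch of the UTF-8 default described by the two commits above, using the standard library only. `decode_part` is a hypothetical name, and the fallback for unknown charset names is an extra assumption rather than something stated in this PR.

```python
from email import message_from_bytes, policy
from email.message import Message


def decode_part(part: Message) -> str:
    """Decode a non-multipart message part, defaulting to UTF-8."""
    # get_content_charset() returns None when there is no Content-Type
    # header at all, or when the header carries no charset parameter.
    charset = part.get_content_charset() or "utf-8"
    payload = part.get_payload(decode=True) or b""
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # Assumed extra safety net: unknown charset name, fall back to UTF-8.
        return payload.decode("utf-8", errors="replace")


def parse(raw: bytes) -> Message:
    return message_from_bytes(raw, policy=policy.default)
```

For example, `decode_part(parse(b"Subject: x\r\n\r\ncaf\xc3\xa9\r\n"))` has no Content-Type header and is decoded as UTF-8, while a part declaring `charset=iso-8859-1` is decoded with that charset.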

4a11b49

Allow docx files to include document files with document[nn].xml form

5fb204f

I don't know if this is deviating from the standard but I have seen one example of this in the real world
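To make the `document[nn].xml` form concrete, here is a small sketch that lists matching entries in a .docx archive. The regular expression, the `word/` prefix and the function names are assumptions for illustration; the pattern used in the library may differ.

```python
import io
import re
import zipfile
from typing import List

# Assumed pattern: "document.xml" or "document" followed by digits.
DOCUMENT_XML = re.compile(r"^word/document\d*\.xml$")


def main_document_names(docx_bytes: bytes) -> List[str]:
    """Return archive members that look like main document parts."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return [name for name in zf.namelist() if DOCUMENT_XML.match(name)]


def build_fake_docx(names: List[str]) -> bytes:
    """Build an in-memory zip with empty XML members, for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in names:
            zf.writestr(name, "<xml/>")
    return buffer.getvalue()
```

Given an archive containing `word/document.xml`, `word/document2.xml` and `word/styles.xml`, only the first two are returned.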

Allow blobs to be empty when extracting text

de72344

It is possible to have an email with an empty body. Other scenarios (empty HTML, docx etc) are pretty unlikely

Fix docx filename generation to yield string, not bytes

87f7754

Fix missing return value

e17023e

martinburchell added 8 commits May 12, 2025 16:24

Workaround BeautifulSoup not handling empty byte array correctly

bdc9983

Note BS4 bug report

499f994

Replace illegal multibyte sequences when encoding emails

dc92a17

Handle invalid surrogate characters in HTML conversion

51e9295

Better names for test methods

fdccb76

Fix test comment

dba72a9

Update changelog

97b5a0a

Align version of faker-file used in docs to that used in tests

32cfc58

martinburchell requested a review from RudolfCardinal May 14, 2025 05:57

RudolfCardinal approved these changes May 14, 2025

Revert empty filename check when extracting text

b00e82e

martinburchell merged commit b344963 into master May 14, 2025
5 checks passed

martinburchell deleted the email-text-extraction branch May 14, 2025 08:57
