@martinburchell
Collaborator

document_to_text() now supports .eml files, with attachments processed by any of the supported document converters. The one exception is that we don't run the fallback strings converter on attachments, because we would end up with a lot of rubbish from images etc.
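For readers unfamiliar with the approach, here is a minimal sketch of the idea using only the standard library. It is not the code from this PR: `eml_to_text` and the `CONVERTERS` registry are hypothetical names, and the only "converter" registered is a plain-text decoder. It shows the body being extracted, attachments being dispatched by file extension, and unsupported attachments being skipped rather than passed to a fallback.

```python
import os
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Callable, Dict, List

# Hypothetical registry: file extension -> function turning bytes into text.
CONVERTERS: Dict[str, Callable[[bytes], str]] = {
    ".txt": lambda data: data.decode("utf-8", errors="replace"),
}


def eml_to_text(raw: bytes) -> str:
    """Extract text from an .eml blob: body first, then supported attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    parts: List[str] = []

    # The body may legitimately be missing or empty.
    body = msg.get_body(preferencelist=("plain",))
    if body is not None:
        parts.append(body.get_content())

    for attachment in msg.iter_attachments():
        ext = os.path.splitext(attachment.get_filename() or "")[1].lower()
        converter = CONVERTERS.get(ext)
        if converter is None:
            # No fallback converter: unsupported attachments (images etc.)
            # are skipped rather than scraped for strings.
            continue
        payload = attachment.get_payload(decode=True) or b""
        parts.append(converter(payload))

    return "\n".join(parts)


# Build a small message in memory to exercise the function.
demo = EmailMessage()
demo.set_content("Hello body")
demo.add_attachment(b"attached text", maintype="application",
                    subtype="octet-stream", filename="note.txt")
demo.add_attachment(b"\x89PNG\x00", maintype="image",
                    subtype="png", filename="pic.png")
text = eml_to_text(demo.as_bytes())
```

In this sketch the PNG attachment contributes nothing to `text`, which mirrors the decision not to run a strings-style fallback on attachments.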

I've had to rename the email module to avoid conflict with the Python standard library. At the same time I've renamed json and profile to avoid any future conflicts.

A few other minor changes to text extraction:

  • Distinguish between missing and empty blobs
  • Make the HTML parser more tolerant of dodgy encoding
  • Fix warnings from BeautifulSoup

This will cause the tests directory to be treated as a package and allow
two test files to have the same name in different directories.
Having our own module called email is preventing us from doing this in extract_text.py: `from email.parser import Parser as EmailParser`, because email already exists as a module.
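As an illustration of how such a name clash can arise, the sketch below creates a throwaway `email.py` in a temporary directory and asks Python's path-based finder which file `import email` would resolve to when that directory comes first on the search path. Whether this is exactly the mechanism that affected extract_text.py is an assumption; the temporary directory and file are purely illustrative.

```python
import importlib.machinery
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # A local module that happens to share its name with the stdlib package.
    local_path = os.path.join(tmpdir, "email.py")
    with open(local_path, "w") as f:
        f.write("# local module called 'email'\n")

    # Resolve "email" with the temporary directory searched first...
    shadowed_spec = importlib.machinery.PathFinder.find_spec(
        "email", [tmpdir] + sys.path
    )
    # ...and with the normal search path only.
    stdlib_spec = importlib.machinery.PathFinder.find_spec("email", sys.path)

    shadowed_origin = shadowed_spec.origin if shadowed_spec else None
    stdlib_origin = stdlib_spec.origin if stdlib_spec else None

print("with local module first:", shadowed_origin)
print("normal search path:     ", stdlib_origin)
```

Renaming the local module removes the ambiguity, so `email.parser` can only refer to the standard library.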
I don't know if this is deviating from the standard, but I have seen one example of this in the real world.
It is possible to have an email with an empty body. Other scenarios (empty HTML, docx etc.) are pretty unlikely.
Owner

@RudolfCardinal RudolfCardinal left a comment

Looks good - thank you! I note:

  • In `get_filelikeobject` you've changed `if not filename and not blob` to `if filename is None and blob is None`; that does remove a check for empty strings or empty byte-like objects. Perhaps for blob that was the intention, particularly for empty e-mail content? There's another instance of this later (line 1428) too.

@martinburchell
Collaborator Author

> Looks good - thank you! I note:
>
> * In `get_filelikeobject` you've changed `if not filename and not blob` to `if filename is None and blob is None`; that does remove a check for empty strings or empty byte-like objects. Perhaps for blob that was the intention, particularly for empty e-mail content? There's another instance of this later (line 1428) too.

Yes it was necessary for empty email content (supported by the tests) but I'll fix it so there is a better error message for an empty filename.
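The distinction under discussion can be sketched as follows. This is a hypothetical validator (`check_source` is not a function in the library, and the error messages are invented), showing how an `is None` test lets an empty blob through as valid content while an empty filename still gets its own, more specific error.

```python
from typing import Optional


def check_source(filename: Optional[str] = None,
                 blob: Optional[bytes] = None) -> str:
    """Hypothetical check: report which input would be used, or raise."""
    if filename is None and blob is None:
        # Nothing was supplied at all.
        raise ValueError("no filename and no blob specified")
    if filename is not None:
        if not filename:
            # Supplied but empty: almost certainly a caller error.
            raise ValueError("filename is an empty string")
        return "file"
    # blob may be b"" here: an empty blob is present, just empty
    # (for example an e-mail with an empty body).
    return "blob"


# `not blob` would treat b"" the same as None; `blob is None` does not.
empty_blob_result = check_source(blob=b"")
```

With the older `if not filename and not blob` test, the call above would have been rejected as if no input had been given.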

@martinburchell martinburchell merged commit b344963 into master May 14, 2025
5 checks passed
@martinburchell martinburchell deleted the email-text-extraction branch May 14, 2025 08:57