Support text extraction from eml files and other text extraction improvements #36
Merged
This will cause the tests directory to be treated as a package and allow two test files to have the same name in different directories.
This is preventing us from doing this in `extract_text.py`: `from email.parser import Parser as EmailParser`, because `email` already exists as a module.
I don't know if this is deviating from the standard but I have seen one example of this in the real world
It is possible to have an email with an empty body. Other scenarios (empty HTML, docx etc) are pretty unlikely
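The empty-body case can be reproduced with the standard library alone. The following is a minimal sketch, not the code from this PR: `body_text` and the sample messages are hypothetical names introduced for illustration. It shows that a message with headers but no body parses cleanly and yields an empty string rather than `None`.

```python
from email import message_from_string, policy

HEADERS = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Test\n"
    "\n"  # blank line ends the headers
)
RAW_EMPTY = HEADERS               # headers only, no body at all
RAW_WITH_BODY = HEADERS + "Hello\n"


def body_text(raw: str) -> str:
    """Return the plain-text (or HTML) body of a raw e-mail, or "" if none.

    Hypothetical helper for illustration only.
    """
    # policy.default makes the parser return an EmailMessage, which
    # provides get_body() and get_content().
    msg = message_from_string(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        # No text part found, e.g. a message made only of attachments.
        return ""
    return part.get_content()


if __name__ == "__main__":
    print(repr(body_text(RAW_EMPTY)))
    print(repr(body_text(RAW_WITH_BODY)))
```

Because the empty body comes back as a falsy but valid value, any downstream check that treats "falsy" as "missing" would reject it.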
RudolfCardinal approved these changes on May 14, 2025
RudolfCardinal (Owner) left a comment:
Looks good - thank you! I note:
- In `get_filelikeobject` you've changed `if not filename and not blob` to `if filename is None and blob is None`; that does remove a check for empty strings or empty byte-like objects. Perhaps for `blob` that was the intention, particularly for empty e-mail content? There's another instance of this later (line 1428) too.
Collaborator (Author)
Yes, it was necessary for empty email content (supported by the tests), but I'll fix it so there is a better error message for an empty filename.
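The distinction discussed in this exchange can be sketched as follows. This is an illustration under assumptions, not the merged fix: `get_filelike_sketch`, its signature, and its error messages are hypothetical. The idea is to test for `None` so that an empty `blob` (such as empty e-mail content) is accepted, while an empty `filename` still gets its own explicit error.

```python
import io
from typing import BinaryIO, Optional


def get_filelike_sketch(
    filename: Optional[str] = None, blob: Optional[bytes] = None
) -> BinaryIO:
    """Open a file, or wrap a blob, as a binary file-like object.

    Hypothetical stand-in used only to illustrate the None-vs-empty checks.
    """
    # "is None" rather than "not ...": b"" is falsy but is valid content.
    if filename is None and blob is None:
        raise ValueError("Specify either filename or blob")
    if filename is not None and blob is not None:
        raise ValueError("Specify only one of filename or blob")
    if filename is not None:
        # An empty filename can never be opened, so report it clearly.
        if not filename:
            raise ValueError("filename is an empty string")
        return open(filename, "rb")
    return io.BytesIO(blob)  # an empty blob yields an empty stream


if __name__ == "__main__":
    print(get_filelike_sketch(blob=b"").read())
    try:
        get_filelike_sketch(filename="")
    except ValueError as exc:
        print(f"Rejected: {exc}")
```

With the older truthiness test (`not filename and not blob`), the `blob=b""` call would have been rejected as if nothing had been supplied.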
`document_to_text()` now supports `.eml` files, with attachments processed by any supported document converters. The exception here is we don't try to run the fallback `strings` converter because we would end up with a lot of rubbish from images etc.

I've had to rename the `email` module to avoid conflict with the Python standard library. At the same time I've renamed `json` and `profile` to avoid any future conflicts.

A few other minor changes to text extraction:
BeautifulSoup