@garrethlee garrethlee commented Sep 27, 2024

Description

Refactored extraction logic to separate HTML cleaning and text extraction into distinct steps. This allows chaining the cleaning step from one library with the extraction step from another, enhancing flexibility and interoperability.

Context

  • Most extractors follow a two-step process:
    1. Clean raw HTML into a sanitized representation (usually a stripped-down version of the original HTML).
    2. Convert the cleaned HTML to plaintext.
  • Readability, for example, only provides an HTML cleaning method and lacks built-in plaintext conversion. To handle such cases, we now support chaining steps across libraries (e.g., clean_html from one library and extract from another).
  • Direct use cases are unaffected. Trafilatura's extract function, for example, still works on its own; its clean_html is only needed for interoperability, such as passing the cleaned HTML to Inscriptis for plaintext conversion.

Thus, we split extraction into the two phases described above, exposed as a clean_html method and an extract method on each Extractor.
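The chaining described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code in this PR: `HtmlCleaner`, `TextExtractor`, `ScriptStrippingCleaner`, `TagStrippingExtractor`, and `chain` are hypothetical names, and the regex-based cleaner only stands in for a real cleaning library such as Readability.

```python
import re
from html.parser import HTMLParser
from typing import Protocol


class HtmlCleaner(Protocol):
    """Hypothetical interface for phase 1: raw HTML -> sanitized HTML."""

    def clean_html(self, html: str) -> str: ...


class TextExtractor(Protocol):
    """Hypothetical interface for phase 2: HTML -> plaintext."""

    def extract(self, html: str) -> str: ...


class ScriptStrippingCleaner:
    """Toy cleaner (assumption): drops <script> and <style> elements only."""

    _NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)

    def clean_html(self, html: str) -> str:
        return self._NOISE.sub("", html)


class _TextCollector(HTMLParser):
    """Collects non-empty text nodes while the parser walks the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


class TagStrippingExtractor:
    """Toy extractor (assumption): joins all text nodes with single spaces."""

    def extract(self, html: str) -> str:
        collector = _TextCollector()
        collector.feed(html)
        collector.close()
        return " ".join(collector.parts)


def chain(cleaner: HtmlCleaner, extractor: TextExtractor, html: str) -> str:
    """Run one object's cleaning step, then another object's extraction step."""
    return extractor.extract(cleaner.clean_html(html))


RAW = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello</p><script>track()</script><p>world</p></body></html>"
)
result = chain(ScriptStrippingCleaner(), TagStrippingExtractor(), RAW)
unchained = TagStrippingExtractor().extract(RAW)
```

Here `result` contains only the paragraph text, while `unchained` still carries the script and style bodies, which is the gap the separate cleaning step is meant to close.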

Changes

  • Added clean_html as a standalone method in extractors.
  • Refactored applicable extractors to separate the cleaning and extraction steps.
  • Integrated new text extraction libraries (readabilipy, readability, resiliparse) to extend functionality and improve coverage.

garrethlee and others added 27 commits September 24, 2024 17:06
… initialization

- Added a default `clean_html` method to the `BaseExtractor` class, providing a warning for extractors that do not implement their own.
- Implemented specific `clean_html` methods in `Inscriptis`, `Justext`, `ReadabiliPy`, `Readability`, and `Trafilatura` extractors to handle HTML cleaning.
- Updated the `Inscriptis` extractor to accept a preprocessor during initialization.
- Modified the `extract` methods in `ReadabiliPy` and `Readability` to utilize the new `clean_html` method.
- Adjusted the `Justext` extractor to remove the default English language parameter from `get_stoplist`.
- Updated tests to reflect changes in extractor initialization and functionality.
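A default `clean_html` that warns, as described in the first bullet above, might look roughly like this. It is a sketch under assumptions and not the PR's implementation: the commit message does not say what the default returns, so returning the input unchanged is a guess, and `PlainExtractor` and `CommentStrippingExtractor` are hypothetical subclasses used only for illustration.

```python
import warnings


class BaseExtractor:
    """Sketch of a base class whose clean_html warns unless overridden."""

    def clean_html(self, html: str) -> str:
        # Assumption: the fallback returns the input untouched after warning.
        warnings.warn(
            f"{type(self).__name__} does not implement clean_html; "
            "returning the input HTML unchanged.",
            UserWarning,
            stacklevel=2,
        )
        return html

    def extract(self, html: str) -> str:
        raise NotImplementedError


class PlainExtractor(BaseExtractor):
    """Hypothetical extractor that relies on the default clean_html."""

    def extract(self, html: str) -> str:
        return html


class CommentStrippingExtractor(BaseExtractor):
    """Hypothetical extractor that provides its own cleaning step."""

    def clean_html(self, html: str) -> str:
        return html.replace("<!-- ad -->", "")

    def extract(self, html: str) -> str:
        return self.clean_html(html)


# Record warnings so the fallback behaviour can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unchanged = PlainExtractor().clean_html("<p>hi</p>")
    cleaned = CommentStrippingExtractor().clean_html("<p>hi</p><!-- ad -->")
```

Only the call on `PlainExtractor` triggers a warning; the subclass that overrides `clean_html` stays silent.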
@garrethlee garrethlee marked this pull request as ready for review December 21, 2024 23:50
@garrethlee garrethlee changed the title from "Add several open-source text extraction libraries" to "Add open-source text extraction libraries" Dec 21, 2024
3 participants