miyako/topic-document-to-text

topic-document-to-text

the goal is to develop text extraction tools for common file formats that are compatible with permissive BSD licensing.

| format | solution | compatibility | licensing | implementation |
|---|---|---|---|---|
| pdf | PDFium | | BSD | pdfium-parser |
| pdf | Xpdf | 🚫 | GPL | |
| pdf | poppler | 🚫 | GPL | |
| pdf | MuPDF | 🚫 | AGPL | |
| pdf | PoDoFo | ⚠️ | LGPL-2 | |
| docx | OPC | | BSD | opc-parser |
| xls | libxls | | BSD | |
| xls | FreeXL | ⚠️ | LGPL-2 | |
| xlsx | OPC | | BSD | opc-parser |
| pptx | OPC | | BSD | opc-parser |
| eml | GMime | ⚠️ | LGPL-2.1 | |
| eml | mimetic | | MIT | |
| eml | libcmime | | MIT | |
| ost | libpff | ⚠️ | LGPL-3.0 | pff-parser |
| pst | libpff | ⚠️ | LGPL-3.0 | pff-parser |
| pab | libpff | ⚠️ | LGPL-3.0 | pff-parser |
| msg | libgsf | ⚠️ | LGPL-2.1 | |
| msg | libolecf | ⚠️ | LGPL-3.0 | olecf-parser |
| html | tidy-html5 | | W3C | tidy-parser |
| html | lexbor | | Apache 2.0 | lexbor-parser |
| doc | libolecf | ⚠️ | LGPL-3.0 | olecf-parser |
| doc | antiword | 🚫 | GPL | |
| doc | librevenge [1] | | Apache-2.0 | |
| doc | catdoc | 🚫 | GPL | |
| doc | wvWare | 🚫 | GPL | |
| rtf | UnRTF | 🚫 | GPL | |
| rtf | librtf [2] | ⚠️ | LGPL-2.1 | rtf-parser |
| rtf | platform api | ✅️ | | rtf-parser |
| rtf | rtfreader | 🚫 | GPL | |
| ppt | libolecf | ⚠️ | LGPL-3.0 | olecf-parser |

compatibility

⚠️ conditionally compatible

LGPL 2.1 says:

Provide the object code or source code needed to relink the application with a modified version of the Library, so users can replace the Library in a statically linked or dynamically linked work.

each repository shall provide the object code (.a for macOS and .lib for Windows) and build tools (Xcode project for macOS and Visual Studio solution for Windows) for compatibility.

✅ fully compatible

BSD or MIT is permissive.

🚫 not compatible

GPL is copyleft; the library licensing propagates to the whole program.
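
the legend above can be read as a small lookup from licence name to compatibility marker. the following is a minimal sketch of that reading, not part of any of the parser repositories; the function name `classify` and the prefix matching are assumptions made for illustration. licences that the legend does not mention (for example Apache 2.0 or W3C) are deliberately left unclassified.

```python
from typing import Optional

# Markers copied from the legend; the mapping logic is an illustrative assumption.
FULLY = "✅"
CONDITIONAL = "⚠️"
INCOMPATIBLE = "🚫"


def classify(license_name: str) -> Optional[str]:
    """Return the legend marker for a licence string, or None if the legend does not cover it."""
    name = license_name.strip().upper()
    if name.startswith("LGPL"):
        # conditionally compatible: object code and build tools must be shipped for relinking
        return CONDITIONAL
    if name.startswith("AGPL") or name.startswith("GPL"):
        # copyleft: the licence propagates to the whole program
        return INCOMPATIBLE
    if name.startswith("BSD") or name.startswith("MIT"):
        # permissive
        return FULLY
    return None  # not covered by the legend
```

for example, `classify("LGPL-3.0")` returns the ⚠️ marker and `classify("Apache 2.0")` returns `None`.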

parser implementation

  • CLI for macOS (arm64 x86_64 universal library) and Windows (x86_64)
  • support both stdin/stdout and input/output files
  • support unicode file path
  • structured data in JSON
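
the input/output contract in the list above can be sketched as follows. this is a hypothetical front end written in Python for illustration only; the actual parsers are separate command-line tools, and the `-i`/`-o` flags, the `"text"` key, and the placeholder "parser" that merely decodes UTF-8 are all assumptions, not the real interface.

```python
import argparse
import json
import sys
from pathlib import Path
from typing import BinaryIO, Optional, Sequence


def read_input(path: Optional[str], stdin: BinaryIO) -> bytes:
    """Read the document from a file if a path is given, otherwise from stdin."""
    if path is None:
        return stdin.read()
    # pathlib accepts Unicode paths on both macOS and Windows
    return Path(path).read_bytes()


def to_json(text: str) -> str:
    """Wrap extracted text in a JSON object; the "text" key is a hypothetical schema."""
    # ensure_ascii=False keeps non-ASCII characters readable in the output
    return json.dumps({"text": text}, ensure_ascii=False)


def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="hypothetical extractor front end")
    parser.add_argument("-i", "--input", help="input file; stdin when omitted")
    parser.add_argument("-o", "--output", help="output file; stdout when omitted")
    args = parser.parse_args(argv)

    raw = read_input(args.input, sys.stdin.buffer)
    # Placeholder for a real parser: treat the bytes as UTF-8 text.
    result = to_json(raw.decode("utf-8", errors="replace"))

    if args.output is None:
        sys.stdout.buffer.write(result.encode("utf-8"))
    else:
        Path(args.output).write_text(result, encoding="utf-8")
    return 0
```

calling `main(["-i", "in.txt", "-o", "out.json"])` reads and writes files, while `main([])` reads stdin and writes stdout, so the same code path serves both modes.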

next step

once we have a reasonable set of parsers, we could consider using rust-based tools for chunking.
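
to make the chunking step concrete, here is a deliberately naive sketch that splits extracted plain text into chunks of at most a fixed number of words. it is a stand-in for illustration only: it is not semantic chunking and it is not how any rust-based tool works; the function name `chunk_words` and the word-count limit are assumptions.

```python
from typing import List


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # step through the word list in strides of max_words; the last chunk may be shorter
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```

a real chunker would typically respect sentence or paragraph boundaries and count tokens rather than words; this version only shows the shape of the input and output.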

final step

once we have a set of parsers and a tool to chunk extracted plain text, we can develop a tool that extracts, chunks, and vectorises text using 4D AIKit. the idea is to use the official component, not develop a custom fork.

Footnotes

  1. librevenge is a CFBF parser library. reportedly the libmwaw filter supports MacWrite, AppleWorks, etc. the libwps filter supports Microsoft Works. there are no known filters for Microsoft Word (.doc) documents.

  2. even with UTF-8 and surrogate pair patches, librtf is fundamentally flawed and not reliable for processing .rtf, especially Microsoft Outlook messages. the parser implementation now uses platform APIs (NSAttributedString and Riched20.dll).

  3. this CLI utility internally uses semchunk-rs to chunk text before extracting meaningful information with AI. it is not designed for chunking only.
