the goal is to develop text extractor tools for common file formats that are compatible with permissive BSD licensing.

| format | solution | compatibility | licensing | implementation |
|---|---|---|---|---|
| pdf | PDFium | ✅ | BSD | pdfium-parser |
| pdf | Xpdf | 🚫 | GPL | |
| pdf | poppler | 🚫 | GPL | |
| pdf | MuPDF | 🚫 | AGPL | |
| pdf | PoDoFo | | LGPL-2 | |
| docx | OPC | ✅ | BSD | opc-parser |
| xls | libxls | ✅ | BSD | |
| xls | FreeXL | | LGPL-2 | |
| xlsx | OPC | ✅ | BSD | opc-parser |
| pptx | OPC | ✅ | BSD | opc-parser |
| eml | GMime | | LGPL-2.1 | |
| eml | mimetic | ✅ | MIT | |
| eml | libcmime | ✅ | MIT | |
| ost | libpff | | LGPL-3.0 | pff-parser |
| pst | libpff | | LGPL-3.0 | pff-parser |
| pab | libpff | | LGPL-3.0 | pff-parser |
| msg | libgsf | | LGPL-2.1 | |
| msg | libolecf | | LGPL-3.0 | olecf-parser |
| html | tidy-html5 | ✅ | W3C | tidy-parser |
| html | lexbor | ✅ | Apache 2.0 | lexbor-parser |
| doc | libolecf | | LGPL-3.0 | olecf-parser |
| doc | antiword | 🚫 | GPL | |
| doc | | ✅ | Apache-2.0 | |
| doc | catdoc | 🚫 | GPL | |
| doc | wvWare | 🚫 | GPL | |
| rtf | UnRTF | 🚫 | GPL | |
| rtf | | | LGPL-2.1 | rtf-parser |
| rtf | platform api | ✅ | | rtf-parser |
| rtf | rtfreader | 🚫 | GPL | |
| ppt | libolecf | | LGPL-3.0 | olecf-parser |

LGPL 2.1 says:
> “Provide the object code or source code needed to relink the application with a modified version of the Library, so users can replace the Library in a statically linked or dynamically linked work.”
each repository shall provide the object code (.a for macOS and .lib for Windows) and build tools (Xcode project for macOS and Visual Studio solution for Windows) for compatibility.
✅ fully compatible
BSD or MIT is permissive.
🚫 not compatible
GPL is copyleft; the library licensing propagates to the whole program.
- CLI for macOS (`arm64` `x86_64` universal library) and Windows (`x86_64`)
- support both `stdIn`/`stdOut` and input/output file
- support unicode file path
- structured data in JSON
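
the I/O contract in the list above can be sketched as follows. this is a minimal illustration in Python, not the actual parser code: the `-i`/`-o` flags, the `run` and `extract_text` names, and the JSON shape (`type`, `text`) are all assumptions made for the example.

```python
import argparse
import json
import sys


def extract_text(data: bytes) -> str:
    """Placeholder extractor (hypothetical): a real parser decodes the file format here."""
    return data.decode("utf-8", errors="replace")


def run(argv, stdin=None, stdout=None) -> int:
    """Read from a file or stdin, write JSON to a file or stdout."""
    stdin = stdin if stdin is not None else sys.stdin.buffer
    stdout = stdout if stdout is not None else sys.stdout.buffer

    parser = argparse.ArgumentParser(prog="example-parser")
    parser.add_argument("-i", "--input", help="input file path; omit to read stdin")
    parser.add_argument("-o", "--output", help="output file path; omit to write stdout")
    args = parser.parse_args(argv)

    # Binary mode avoids newline and encoding translation on Windows.
    if args.input:
        with open(args.input, "rb") as f:
            data = f.read()
    else:
        data = stdin.read()

    # Hypothetical output schema; ensure_ascii=False keeps non-ASCII text readable.
    document = {"type": "text", "text": extract_text(data)}
    payload = json.dumps(document, ensure_ascii=False).encode("utf-8")

    if args.output:
        with open(args.output, "wb") as f:
            f.write(payload)
    else:
        stdout.write(payload)
    return 0
```

a real entry point would call `run(sys.argv[1:])`. passing the streams as parameters keeps the same code path testable for both the pipe and the file case.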
once we have a reasonable set of parsers, we could consider using rust-based tools for chunking.
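
to illustrate what the chunking step does to extracted plain text, here is a deliberately naive sketch. it is not the algorithm of any rust-based tool; the `chunk_text` name and the `max_chars` parameter are assumptions for the example, and a real chunker would split on semantic boundaries and count tokens rather than characters.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # The next word would overflow the chunk: close it and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

for example, `chunk_text("a bb ccc dddd", 4)` returns `["a bb", "ccc", "dddd"]`.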
once we have a set of parsers and a tool to chunk extracted plain text, we can develop a tool that extracts, chunks, and vectorises text using 4D AIKit. the idea is to use the official component, not to develop a custom fork.
Footnotes
1. `librevenge` is a CFBF parser library. reportedly the `libmwaw` filter supports MacWrite, AppleWorks, etc. the `libwps` filter supports Microsoft Works. there are no known filters for Microsoft Word (.doc) documents.
2. even with UTF-8 and surrogate pair patches, `librtf` is fundamentally flawed and not reliable for processing `.rtf`, especially Microsoft Outlook messages. the parser implementation now uses platform APIs (`NSAttributedString` and `Riched20.dll`).
3. this CLI utility internally uses `semchunk-rs` to chunk text before extracting meaningful information thanks to AI. it is not designed for chunking only.