topic-document-to-text

the goal is to develop text extraction tools for common file formats that are compatible with permissive BSD licensing.

format	solution	compatibility	licensing	implementation
pdf	PDFium	✅	BSD	pdfium-parser
pdf	Xpdf	🚫	GPL
pdf	poppler	🚫	GPL
pdf	MuPDF	🚫	AGPL
pdf	PoDoFo	⚠️	LGPL-2
docx	OPC	✅	BSD	opc-parser
xls	libxls	✅	BSD
xls	FreeXL	⚠️	LGPL-2
xlsx	OPC	✅	BSD	opc-parser
pptx	OPC	✅	BSD	opc-parser
eml	GMime	⚠️	LGPL-2.1
eml	mimetic	✅	MIT
eml	libcmime	✅	MIT
ost	libpff	⚠️	LGPL-3.0	pff-parser
pst	libpff	⚠️	LGPL-3.0	pff-parser
pab	libpff	⚠️	LGPL-3.0	pff-parser
msg	libgsf	⚠️	LGPL-2.1
msg	libolecf	⚠️	LGPL-3.0	olecf-parser
html	tidy-html5	✅	W3C	tidy-parser
html	lexbor	✅	Apache 2.0	lexbor-parser
doc	libolecf	⚠️	LGPL-3.0	olecf-parser
doc	antiword	🚫	GPL
doc	~~librevenge~~¹	✅	Apache-2.0
doc	catdoc	🚫	GPL
doc	wvWare	🚫	GPL
rtf	UnRTF	🚫	GPL
rtf	~~librtf~~²	⚠️	LGPL-2.1	rtf-parser
rtf	platform api	✅️		rtf-parser
rtf	rtfreader	🚫	GPL
ppt	libolecf	⚠️	LGPL-3.0	olecf-parser

compatibility

⚠️ conditionally compatible

LGPL 2.1 requires, in summary:

Provide the object code or source code needed to relink the application with a modified version of the Library, so users can replace the Library in a statically linked or dynamically linked work.

to meet this condition, each repository shall provide the object code (.a for macOS and .lib for Windows) and the build tools (Xcode project for macOS and Visual Studio solution for Windows) needed to relink against a modified library.

✅ fully compatible

BSD and MIT are permissive licenses.

🚫 not compatible

GPL is copyleft; the library licensing propagates to the whole program.
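
the three categories above can be sketched as a small lookup. this is only an illustration of how the table's compatibility column follows from the license column; the function name `classify` and the returned labels are hypothetical and not part of any repository listed here. Apache and W3C are grouped with the permissive licenses because the table marks them ✅.

```python
# hypothetical helper: maps a license name from the table to a compatibility label
PERMISSIVE_PREFIXES = ("BSD", "MIT", "APACHE", "W3C")


def classify(license_name: str) -> str:
    """Return the compatibility category for a license name such as 'LGPL-2.1'."""
    name = license_name.strip().upper()
    if name.startswith("LGPL"):
        # relinking obligations apply, so the library is usable only under conditions
        return "conditionally compatible"
    if name.startswith(("GPL", "AGPL")):
        # copyleft propagates to the whole program
        return "not compatible"
    if name.startswith(PERMISSIVE_PREFIXES):
        return "fully compatible"
    return "unknown"
```

a license that matches none of the prefixes is reported as `unknown` rather than guessed.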

parser implementation

CLI for macOS (arm64 x86_64 universal library) and Windows (x86_64)
support both stdin/stdout and input/output files
support Unicode file paths
structured data in JSON
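
the input/output contract listed above can be sketched as follows. this is a minimal illustration in Python, not the actual parser code: the `-i`/`-o` flags, the function names, and the JSON shape (`type`, `text`) are all assumptions, and the "extraction" step is a placeholder that simply decodes the input as UTF-8.

```python
import argparse
import json
import sys
from pathlib import Path


def read_input(path):
    """Read raw bytes from a file, or from stdin when no path is given."""
    if path is None:
        return sys.stdin.buffer.read()
    return Path(path).read_bytes()  # pathlib accepts Unicode paths


def to_json(text):
    """Wrap extracted text in a hypothetical JSON structure, keeping non-ASCII as-is."""
    return json.dumps({"type": "text", "text": text}, ensure_ascii=False)


def write_output(payload, path):
    """Write UTF-8 JSON to a file, or to stdout when no path is given."""
    if path is None:
        sys.stdout.buffer.write(payload.encode("utf-8"))
    else:
        Path(path).write_text(payload, encoding="utf-8")


def main(argv=None):
    parser = argparse.ArgumentParser(description="sketch of a document-to-text CLI")
    parser.add_argument("-i", "--input", help="input file (default: stdin)")
    parser.add_argument("-o", "--output", help="output file (default: stdout)")
    args = parser.parse_args(argv)

    data = read_input(args.input)
    # placeholder for real extraction: treat the input as UTF-8 text
    text = data.decode("utf-8", errors="replace")
    write_output(to_json(text), args.output)
    return 0
```

calling `main(["-i", "in.txt", "-o", "out.json"])` uses files, while `main([])` falls back to stdin and stdout.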

next step

once we have a reasonable set of parsers, we could consider using Rust-based tools for chunking.
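
to show what chunking means for extracted plain text, here is a minimal sketch. it is written in Python for brevity and is not one of the Rust-based tools mentioned above; the function name `chunk_text` and the character budget are assumptions. a real chunker would likely split on semantic boundaries rather than words.

```python
def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars is kept whole as its own oversize chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            # the word fits, or the chunk is empty and must accept the word anyway
            current = candidate
        else:
            # the word does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

each returned chunk could then be passed on to the vectorising step.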

final step

once we have a set of parsers and a tool to chunk extracted plain text, we can develop a tool that extracts, chunks, and vectorises text using 4D AIKit. the idea is to use the official component, not to develop a custom fork.

footnotes

¹ librevenge is a CFBF parser library. reportedly, the libmwaw filter supports MacWrite, AppleWorks, etc., and the libwps filter supports Microsoft Works. there are no known filters for Microsoft Word (.doc) documents.
² even with UTF-8 and surrogate pair patches, librtf is fundamentally flawed and not reliable for processing .rtf, especially Microsoft Outlook messages. the parser implementation now uses platform APIs (NSAttributedString and Riched20.dll).
³ this CLI utility internally uses semchunk-rs to chunk text before extracting meaningful information with the help of AI. it is not designed for chunking only.

miyako/topic-document-to-text
