This repo contains automation and helper scripts that convert the AK documentation from raw HTML source to markdown.
The AK documentation is currently stored in version control as raw HTML, CSS, and JS files. It relies on httpd features such as "Server Side Includes" (SSI), and on Handlebars JS, to render templated HTML content.
To host it as a markdown-based site, we use Hugo, a static site generator that builds from markdown, with the Docsy theme for structuring and styling the site.
The project now includes a fully automated, end-to-end workflow system that handles the entire conversion process from cloning the Kafka site repository to generating the final markdown structure for Hugo.
The workflow is broken down into composable, idempotent stages:
- Clone Stage - Clones/updates the Kafka website repository
- Pre-processing Stage - Converts raw HTML to intermediate markdown
- Post-processing Stage - Restructures markdown for better organization
- Validation Stage - Verifies the output meets expectations
Each stage is further broken down into individual steps that can be customized, reordered, or selectively applied.
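
To make the stage and step structure concrete, here is a minimal sketch of a step registry with a stage runner. It is not the code in main.py or workflow_steps.py: the names `STAGES`, `STAGE_ORDER`, `step`, and `run` are hypothetical, and the stage identifiers other than `pre-process` are assumed for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a stage name to its ordered list of steps.
STAGES: Dict[str, List[Callable[[dict], None]]] = {}
# Assumed stage identifiers; only "pre-process" appears in the CLI examples.
STAGE_ORDER = ["clone", "pre-process", "post-process", "validate"]


def step(stage: str):
    """Register a function as a step of the given stage."""
    def decorator(func):
        STAGES.setdefault(stage, []).append(func)
        return func
    return decorator


@step("pre-process")
def convert_html(ctx: dict) -> None:
    # Idempotent: running the step twice leaves the same state.
    ctx.setdefault("done", set()).add("convert_html")


@step("validate")
def check_output(ctx: dict) -> None:
    ctx.setdefault("done", set()).add("check_output")


def run(ctx: dict, start_stage: str = "clone", skip_validation: bool = False) -> List[str]:
    """Run stages in order from start_stage; return the names of the stages that ran."""
    ran = []
    for name in STAGE_ORDER[STAGE_ORDER.index(start_stage):]:
        if skip_validation and name == "validate":
            continue
        for func in STAGES.get(name, []):
            func(ctx)
        ran.append(name)
    return ran
```

Calling `run({}, start_stage="pre-process")` skips the clone stage, which mirrors what the `--start-stage` flag is described as doing.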
# Basic usage
python main.py --workspace ./my_workspace
# Start from a specific stage
python main.py --workspace ./my_workspace --start-stage pre-process
# Enable debug logging
python main.py --workspace ./my_workspace --debug
# Skip validation stage
python main.py --workspace ./my_workspace --skip-validation

- main.py - Main entry point and workflow orchestration
- workflow_steps.py - Fine-grained steps and processors for each stage
- process.yaml - Configuration for document sections and processing rules
- Composable - Each stage and step can run independently
- Idempotent - Safe to run multiple times
- Debuggable - Detailed logging and clear error messages
- Configurable - Easy to customize through process.yaml
- Extensible - New steps can be easily added to the registry
The workflow includes specialized processors for handling unique files:
- committers.md - Extracts committer information and generates data/committers.json
- powered-by.html - Extracts testimonials from the Powered By page and generates data/testimonials.json
These special processors can be run with:
python main.py --workspace ./my_workspace --start-stage special-files

The workflow automatically handles static content such as:
- Images - Both root level and version-specific image directories
- Diagrams - Diagram files used in documentation
- Logos - Brand assets and logos
- Generated Content - Auto-generated documentation files
- JavaDoc - Java API documentation
Static directories are:
- Identified via the static_dirs list in process.yaml
- Copied to the appropriate location in the static output directory
- Referenced correctly in markdown via the link updates in process.yaml
- Validated to ensure they exist and contain files
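
The copy-and-validate handling of static directories could look roughly like the sketch below. The function name `copy_static_dirs` and its parameters are hypothetical, and the list of directory names is passed in directly rather than read from process.yaml.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def copy_static_dirs(src_root: Path, static_out: Path, static_dirs: List[str]) -> List[str]:
    """Copy each listed directory into static_out; return the names that failed validation."""
    problems = []
    for name in static_dirs:
        src = src_root / name
        if not src.is_dir():
            problems.append(name)  # Listed in the config but missing on disk.
            continue
        dest = static_out / name
        # dirs_exist_ok lets a second run overwrite in place instead of failing.
        shutil.copytree(src, dest, dirs_exist_ok=True)
        # Validate: the copied directory must contain at least one file.
        if not any(p.is_file() for p in dest.rglob("*")):
            problems.append(name)
    return problems
```

Returning the list of problem directories, rather than raising on the first one, lets a validation step report every missing or empty directory in a single run.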
The pre-processing stage walks the directories and files of the AK site (raw HTML, CSS, JS, and other static content) and makes the first pass at converting them to markdown.
As part of this, we address the following:
- Sanitize input HTML. Some input HTML files contain character sequences such as \w and \c that break regex search-and-substitute operations. Escape them correctly.
- Several files place raw HTML inside <script> tags, to be rendered by Handlebars JS in the original documentation. We need to process these offline and convert them to plain HTML so that markdown conversion works.
- Process SSI tags. Wherever #include virtual SSI tags are present, insert a custom Hugo shortcode `{{ }}` that safely inserts the content of the specified HTML file at the point of reference.
- Convert the sanitized and pre-processed HTML files to markdown.
- Add markdown front matter so that hugo can process them.
- The original HTML source uses heading levels inconsistently, and many headings contain manually coded numeric prefixes. Uplevel headings where applicable and remove these numeric characters.
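
As an illustration of the last item, the heading cleanup can be sketched with a regular expression over the converted markdown. This is a simplified sketch, not the project's implementation: `normalize_headings` is a hypothetical name, and shifting headings so that the shallowest one becomes level 1 is an assumed policy.

```python
import re

# An ATX heading: leading #'s, optional manual numbering such as "1.2 " or "3. ", then the title.
HEADING = re.compile(r"^(#{1,6})\s+(?:\d+(?:\.\d+)*\.?\s+)?(.*)$")


def normalize_headings(markdown: str) -> str:
    """Strip manual numbering and shift headings so the shallowest becomes level 1."""
    lines = markdown.splitlines()
    levels = [len(m.group(1)) for m in map(HEADING.match, lines) if m]
    shift = min(levels) - 1 if levels else 0
    out = []
    for line in lines:
        m = HEADING.match(line)
        if m:
            # Rebuild the heading at its new level, without the numeric prefix.
            line = "#" * (len(m.group(1)) - shift) + " " + m.group(2)
        out.append(line)
    return "\n".join(out)
```

The sketch has known limits: it would also rewrite lines starting with `#` inside code blocks, and it strips a legitimate leading number from a title such as "2024 roadmap". A real implementation needs to guard against both.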
Converting the raw HTML files to markdown is only half the work. The files then need further processing so that we can rearrange and structure them for better readability and maintainability.
Based on the rules defined in process.yaml:
- There are multiple versions of the documentation stored in this repo.
- Each version has several sections that need to be organized and restructured.
- The rules define what these sections are.
- A few strategies are supported for organizing these sections:
  - arrange. Arranges the files from the pre-process phase in the order specified. Automatically assigns weights (so that they are ordered as needed) and takes care of renaming the files, placing them in the right location, and so on.
  - split_markdown_by_heading. Some files in the code base are very large, and for better readability and maintainability we split them into separate files within a given section. This collects the content between the specified heading levels, automatically generates intermediate files, and arranges them.
- Go through the links in the markdown files from the pre-process phase and modify them as specified in the link_updates section. Add prefix, replace, or substitute part of the link.
- Move generated and javadoc files into the static folder so that we don't have to convert them to markdown, and instead use the include-html shortcode to pull in the content from HTML.
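
The idea behind split_markdown_by_heading can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's code: it splits at a single heading level rather than between two configured levels, and the slug-based file names and the weight step of 10 are made up for the example.

```python
import re
from typing import List, Tuple


def split_markdown_by_heading(markdown: str, level: int = 2) -> List[Tuple[str, int, str]]:
    """Split at headings of the given level; return (filename, weight, content) per section."""
    marker = "#" * level + " "
    sections: List[Tuple[str, List[str]]] = []
    for line in markdown.splitlines():
        if line.startswith(marker):
            # A heading of the target level opens a new section.
            sections.append((line[len(marker):].strip(), [line]))
        elif sections:
            sections[-1][1].append(line)
        # Lines before the first matching heading are dropped in this sketch.
    result = []
    for index, (title, body) in enumerate(sections, start=1):
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        # Assumed weighting: increasing weights keep the original order.
        result.append((f"{slug}.md", index * 10, "\n".join(body)))
    return result
```

Deeper headings stay inside their parent section, so each generated file remains a self-contained unit that can be ordered by weight.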
After processing the HTML to markdown, you need to sync the output to your Hugo site. The project includes a dedicated sync script with different strategies for different content types.
# Dry run to see what will be synced
python sync_to_hugo.py --dry-run
# Sync to the Hugo site
python sync_to_hugo.py
# Or use the all-in-one script
./build_and_sync.sh

The sync script uses two strategies:
- REPLACE - Deletes destination directory and copies entire source directory (used for doc versions and static content)
- MERGE - Only copies files from source, preserving existing files in destination (used for blog, community, and data)
| Source | Strategy | Description |
|---|---|---|
| content/en/{doc_version} | REPLACE | All doc version directories (07, 08, 10-41, 0100-0110, etc.) |
| content/en/blog | MERGE | Blog posts |
| content/en/community | MERGE | Community content |
| data | MERGE | Data files (JSON, YAML) |
| static | REPLACE | Static assets (images, CSS, JS) |
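
The difference between the two strategies can be sketched with the standard library alone. This is not the code in sync_to_hugo.py: `sync_dir` is a hypothetical helper, and it assumes that under MERGE a file present on both sides is overwritten by the source copy, which the description above does not specify.

```python
import shutil
import tempfile
from pathlib import Path


def sync_dir(src: Path, dest: Path, strategy: str, dry_run: bool = False) -> str:
    """Sync src into dest using 'REPLACE' or 'MERGE'; return a description of the action."""
    if strategy not in ("REPLACE", "MERGE"):
        raise ValueError(f"unknown strategy: {strategy}")
    action = f"{strategy} {src} -> {dest}"
    if dry_run:
        return action  # Report only; touch nothing.
    if strategy == "REPLACE" and dest.exists():
        shutil.rmtree(dest)  # Drop stale files so dest mirrors src exactly.
    # For MERGE, dirs_exist_ok keeps files that exist only in dest.
    # Assumption: a file present on both sides is overwritten by the source copy.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return action
```

REPLACE suits generated output such as doc versions, where leftover files from an earlier run would be misleading. MERGE suits directories like blog or community that may hold files not produced by this workflow.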
# Sync to custom destination
python sync_to_hugo.py --dest /path/to/hugo/site
# Use custom config file
python sync_to_hugo.py --config /path/to/process.yaml
# Use custom source directory
python sync_to_hugo.py --source /path/to/workspace/output

For more details, see README_SYNC.md.
# Clone the repository
git clone https://github.com/hvishwanath/ak2md.git
cd ak2md
# Install dependencies
pip install -r requirements.txt
# Run the workflow
python main.py
# Sync to Hugo site
python sync_to_hugo.py --dry-run # Test first
python sync_to_hugo.py # Actually sync

For a complete end-to-end build and sync:
# All-in-one: process and sync
./build_and_sync.sh
# With custom destination
./build_and_sync.sh /path/to/hugo/site
# Dry run mode
./build_and_sync.sh /path/to/hugo/site true