About

This repo contains automation and helper scripts that convert the AK (Apache Kafka) documentation from raw HTML source to markdown.

How does it work?

The AK documentation is currently stored in version control as raw HTML, CSS, and JS files. It relies on httpd features such as Server Side Includes (SSI), together with Handlebars.js, to render templated HTML content.

To host this as a markdown-based site, we use Hugo, a static site generator that builds sites from markdown, with the Docsy theme for structuring and styling the site.

Automated End-to-End Workflow

The project now includes a fully automated, end-to-end workflow system that handles the entire conversion process from cloning the Kafka site repository to generating the final markdown structure for Hugo.

Workflow Architecture

The workflow is broken down into composable, idempotent stages:

Clone Stage - Clones/updates the Kafka website repository
Pre-processing Stage - Converts raw HTML to intermediate markdown
Post-processing Stage - Restructures markdown for better organization
Validation Stage - Verifies the output meets expectations

Each stage is further broken down into individual steps that can be customized, reordered, or selectively applied.
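
As a rough illustration of how ordered stages can support options like --start-stage and --skip-validation, here is a minimal sketch. It is not the code in main.py: only the stage name pre-process appears in this README, so the other stage names, the STAGE_ORDER list, and both function names are assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical stage names, ordered as in the stage list above.
STAGE_ORDER: List[str] = ["clone", "pre-process", "post-process", "validate"]


def select_stages(start_stage: str = "clone", skip_validation: bool = False) -> List[str]:
    """Return the stages to run, beginning at start_stage."""
    if start_stage not in STAGE_ORDER:
        raise ValueError(f"unknown stage: {start_stage!r}")
    stages = STAGE_ORDER[STAGE_ORDER.index(start_stage):]
    if skip_validation:
        stages = [name for name in stages if name != "validate"]
    return stages


def run_workflow(
    handlers: Dict[str, Callable[[], None]],
    start_stage: str = "clone",
    skip_validation: bool = False,
) -> List[str]:
    """Run each selected stage in order and return the names that ran."""
    ran = []
    for name in select_stages(start_stage, skip_validation):
        handlers[name]()  # each handler is expected to be safe to re-run
        ran.append(name)
    return ran
```

Because each stage is looked up by name, starting from a later stage simply skips the earlier handlers. That is only safe when the stages are idempotent, as described above.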

Using the Workflow

# Basic usage
python main.py --workspace ./my_workspace

# Start from a specific stage
python main.py --workspace ./my_workspace --start-stage pre-process

# Enable debug logging
python main.py --workspace ./my_workspace --debug

# Skip validation stage
python main.py --workspace ./my_workspace --skip-validation

Workflow Components

main.py - Main entry point and workflow orchestration
workflow_steps.py - Fine-grained steps and processors for each stage
process.yaml - Configuration for document sections and processing rules

Features

Composable - Each stage and step can run independently
Idempotent - Safe to run multiple times
Debuggable - Detailed logging and clear error messages
Configurable - Easy to customize through process.yaml
Extensible - New steps can be easily added to the registry
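
To make the "registry" idea concrete, the sketch below shows one common way to register steps by name with a decorator. The names STEP_REGISTRY, register_step, and run_steps are hypothetical and may not match what workflow_steps.py actually defines.

```python
from typing import Callable, Dict, Iterable

# Hypothetical registry: step name -> function that transforms document text.
STEP_REGISTRY: Dict[str, Callable[[str], str]] = {}


def register_step(name: str):
    """Decorator that adds a step function to the registry under `name`."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        if name in STEP_REGISTRY:
            raise ValueError(f"step already registered: {name}")
        STEP_REGISTRY[name] = func
        return func
    return decorator


@register_step("strip_trailing_whitespace")
def strip_trailing_whitespace(text: str) -> str:
    """Example step: remove trailing whitespace from every line."""
    return "\n".join(line.rstrip() for line in text.splitlines())


def run_steps(text: str, step_names: Iterable[str]) -> str:
    """Apply the named steps in order; the order is chosen by the caller."""
    for name in step_names:
        text = STEP_REGISTRY[name](text)
    return text
```

With this shape, adding a step means writing one decorated function, and reordering steps means changing the list of names passed to run_steps.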

Special File Processors

The workflow includes specialized processors for handling unique files:

committers.md - Extracts committer information and generates data/committers.json
powered-by.html - Extracts testimonials from the Powered By page and generates data/testimonials.json

These special processors can be run with:

python main.py --workspace ./my_workspace --start-stage special-files

Static Content Handling

The workflow automatically handles static content such as:

Images - Both root level and version-specific image directories
Diagrams - Diagram files used in documentation
Logos - Brand assets and logos
Generated Content - Auto-generated documentation files
JavaDoc - Java API documentation

Static directories are:

Identified via the static_dirs list in process.yaml
Copied to the appropriate location in the static output directory
Referenced correctly in markdown via the link updates in process.yaml
Validated to ensure they exist and contain files
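
The copy and validate steps for static directories could look roughly like the following. This is a sketch using only the standard library (Python 3.8+ for dirs_exist_ok). The function names and the way static_dirs is passed in are assumptions; the real workflow reads the list from process.yaml.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def copy_static_dirs(source_root: Path, static_root: Path, static_dirs: Iterable[str]) -> List[Path]:
    """Copy each listed directory from source_root into static_root."""
    copied = []
    for name in static_dirs:
        src = source_root / name
        if not src.is_dir():
            continue  # nothing to copy for this entry
        dest = static_root / name
        # dirs_exist_ok lets the copy be repeated without failing
        shutil.copytree(src, dest, dirs_exist_ok=True)
        copied.append(dest)
    return copied


def validate_static_dirs(paths: Iterable[Path]) -> List[Path]:
    """Return the directories that are missing or contain no files."""
    return [
        path for path in paths
        if not path.is_dir() or not any(f.is_file() for f in path.rglob("*"))
    ]
```

An empty return value from validate_static_dirs means every copied directory exists and holds at least one file.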

Pre-processing

This stage walks the directories and files of the AK site (raw HTML, CSS, JS, and other static content) and makes a first pass at converting them to markdown.

As part of this, we address the following:

Sanitize input HTML. Some input HTML files contain sequences such as \w and \c that break regex search-and-substitute operations. Escape them correctly.
Expand Handlebars templates. Several files place raw HTML inside <script> tags that Handlebars.js renders in the original documentation. These need to be processed offline and converted to plain HTML so that markdown conversion works.
Process SSI tags. Wherever #include virtual SSI tags are present, insert a custom Hugo shortcode (`{{ }}`) that safely inserts the content of the specified HTML file at the point of reference.
Convert the sanitized and pre-processed HTML files to markdown.
Add markdown front matter so that Hugo can process them.
Normalize headings. The original HTML uses heading levels inconsistently, and many headings contain hand-coded section numbers. Uplevel headings where applicable and remove these numbers.
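
Three of the steps above (SSI tags, heading cleanup, front matter) can be sketched with plain regular expressions. This is an illustration rather than the project's implementation. The shortcode form `{{< include-html file="..." >}}` is an assumption based on the include-html shortcode named under Post-processing, and all function names are hypothetical.

```python
import json
import re

# Apache SSI directive, e.g. <!--#include virtual="footer.html" -->
SSI_INCLUDE = re.compile(r'<!--\s*#include\s+virtual="([^"]+)"\s*-->')

# Markdown heading with an optional hand-written number, e.g. "### 1.2 Use Cases"
HEADING = re.compile(r'^(#{1,6})[ \t]+(?:\d+(?:\.\d+)*\.?[ \t]+)?(.*)$', re.MULTILINE)


def replace_ssi_includes(html: str) -> str:
    """Swap each SSI include for a Hugo shortcode (shortcode syntax is assumed)."""
    return SSI_INCLUDE.sub(lambda m: '{{< include-html file="%s" >}}' % m.group(1), html)


def normalize_headings(markdown: str, uplevel: int = 0) -> str:
    """Drop leading section numbers and raise every heading by `uplevel` levels."""
    def fix(match):
        level = max(1, len(match.group(1)) - uplevel)
        return "#" * level + " " + match.group(2)
    return HEADING.sub(fix, markdown)


def add_front_matter(markdown: str, title: str, weight: int) -> str:
    """Prepend YAML front matter with the standard Hugo title and weight fields."""
    # json.dumps yields a double-quoted string, which is also valid YAML
    front = f"---\ntitle: {json.dumps(title)}\nweight: {weight}\n---\n\n"
    return front + markdown
```

The heading pattern is deliberately simple: it would also strip a legitimate leading number (a heading such as "2019 Roadmap") and does not skip fenced code blocks, so a real step needs extra guards.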

Post-processing

Converting the raw HTML files to markdown is only half the work. The files need further processing to rearrange and structure them for better readability and maintainability.

Post-processing follows the rules defined in process.yaml:

There are multiple versions of the documentation stored in this repo.
Each version has several sections that need to be organized and restructured.
The rules define what these sections are.
A few strategies are supported for organizing these sections:
- arrange. Arranges the files from the pre-process phase in the order specified. Automatically assigns weights (so that the files are ordered as needed) and takes care of renaming the files and placing them in the right location.
- split_markdown_by_heading. Some files in the code base are very large, so for better readability and maintainability they are split into several files within a given section. Collects the content between the specified heading levels, generates the intermediate files automatically, and arranges them.
Links in the markdown files from the pre-process phase are modified as specified in the link_updates section: a prefix is added, or part of the link is replaced or substituted.
Generated and javadoc files are moved into the static folder so that they don't have to be converted to markdown; the include-html shortcode pulls in their HTML content instead.
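
The split_markdown_by_heading and weight-assignment ideas can be illustrated as follows. This is a simplified sketch: it splits at a single heading level rather than between a range of levels, and the function signatures, the weight step of 10, and the slug format are assumptions, not what the project necessarily does.

```python
import re
from typing import Dict, Iterable, List, Tuple


def split_markdown_by_heading(markdown: str, level: int = 2) -> List[Tuple[str, str]]:
    """Split markdown into (title, body) sections at headings of exactly `level`."""
    heading = re.compile(r'^#{%d}[ \t]+(.+?)[ \t]*$' % level)
    sections: List[Tuple[str, str]] = []
    title, body = None, []
    for line in markdown.splitlines():
        match = heading.match(line)
        if match:
            if title is not None:
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(1), []
        elif title is not None:
            body.append(line)  # text before the first heading is ignored here
    if title is not None:
        sections.append((title, "\n".join(body).strip()))
    return sections


def assign_weights(titles: Iterable[str], step: int = 10) -> Dict[str, int]:
    """Give each title an increasing weight so Hugo lists them in this order."""
    return {title: (index + 1) * step for index, title in enumerate(titles)}


def slugify(title: str) -> str:
    """Turn a section title into a file-name-friendly slug."""
    return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')
```

Each (title, body) pair can then be written to its own file named after slugify(title), with the weight from assign_weights placed in the front matter. Leaving gaps between weights makes it easy to insert a new section later without renumbering.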

Syncing to Hugo Site

After processing the HTML to markdown, you need to sync the output to your Hugo site. The project includes a dedicated sync script with different strategies for different content types.

Quick Start

# Dry run to see what will be synced
python sync_to_hugo.py --dry-run

# Sync to the Hugo site
python sync_to_hugo.py

# Or use the all-in-one script
./build_and_sync.sh

Sync Strategies

The sync script uses two strategies:

REPLACE - Deletes destination directory and copies entire source directory (used for doc versions and static content)
MERGE - Only copies files from source, preserving existing files in destination (used for blog, community, and data)
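
The difference between the two strategies can be sketched with the standard library (Python 3.8+). This is an illustration, not the code in sync_to_hugo.py. The function names are hypothetical, and the sketch assumes MERGE overwrites a destination file when the source has a file of the same name.

```python
import shutil
from pathlib import Path


def sync_replace(source: Path, dest: Path) -> None:
    """REPLACE: delete the destination directory, then copy the whole source."""
    if dest.exists():
        shutil.rmtree(dest)
    shutil.copytree(source, dest)


def sync_merge(source: Path, dest: Path) -> None:
    """MERGE: copy files from source; files that exist only in dest are kept."""
    # Assumption: a file present on both sides is overwritten by the source copy.
    shutil.copytree(source, dest, dirs_exist_ok=True)
```

REPLACE suits generated trees such as doc versions, where stale files should disappear. MERGE suits directories that may also hold hand-maintained files in the Hugo site.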

Sync Rules

| Source | Strategy | Description |
| --- | --- | --- |
| `content/en/{doc_version}` | REPLACE | All doc version directories (07, 08, 10-41, 0100-0110, etc.) |
| `content/en/blog` | MERGE | Blog posts |
| `content/en/community` | MERGE | Community content |
| `data` | MERGE | Data files (JSON, YAML) |
| `static` | REPLACE | Static assets (images, CSS, JS) |

Options

# Sync to custom destination
python sync_to_hugo.py --dest /path/to/hugo/site

# Use custom config file
python sync_to_hugo.py --config /path/to/process.yaml

# Use custom source directory
python sync_to_hugo.py --source /path/to/workspace/output

For more details, see README_SYNC.md.

Installation

# Clone the repository
git clone https://github.com/hvishwanath/ak2md.git
cd ak2md

# Install dependencies
pip install -r requirements.txt

# Run the workflow
python main.py

# Sync to Hugo site
python sync_to_hugo.py --dry-run  # Test first
python sync_to_hugo.py            # Actually sync

Complete Pipeline

For a complete end-to-end build and sync:

# All-in-one: process and sync
./build_and_sync.sh

# With custom destination
./build_and_sync.sh /path/to/hugo/site

# Dry run mode
./build_and_sync.sh /path/to/hugo/site true
