@richardwooding
Collaborator

Summary

This PR adds automatic file extension detection based on code block language identifiers. Extracted files now get appropriate extensions for their language: Go code → .go, Python → .py, JavaScript → .js, etc.

Changes

New Features

  • Auto-detection for 40+ languages: Comprehensive language-to-extension mapping
  • Backward compatible: --extension flag overrides auto-detection
  • Smart fallback: Unknown/missing languages default to .txt
  • Case-insensitive matching: "Go", "GO", "go" all work correctly

Implementation Details

New Files:

  • model/extensions.go: Language-to-extension mapper function
  • model/extensions_test.go: Comprehensive unit tests (50+ test cases)

Modified Files:

  • cmd/root.go: Updated filename generation to use language detection
  • cmd/root_test.go: Added 6 integration tests for new functionality
  • README.md: Comprehensive documentation with examples

Behavior

| Scenario | --extension flag | Code language | Result |
|---|---|---|---|
| Auto-detect | Not specified | go | .go |
| Auto-detect | Not specified | python | .py |
| User override | --extension txt | go | .txt |
| Unknown language | Not specified | foobar | .txt |
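
The precedence in the table above can be sketched in a few lines of Go. This is an illustration only, not the code from cmd/root.go: `resolveExtension` and `detectExtension` are hypothetical names, and the map holds just three entries as a stand-in for the PR's full mapper.

```go
package main

import (
	"fmt"
	"strings"
)

// detectExtension is a stand-in for the PR's language mapper; only a few
// entries are listed here for illustration.
func detectExtension(language string) string {
	known := map[string]string{"go": "go", "python": "py", "javascript": "js"}
	if ext, ok := known[strings.ToLower(language)]; ok {
		return ext
	}
	return "txt" // fallback for unknown or missing languages
}

// resolveExtension (hypothetical name) applies the precedence from the table:
// an explicit --extension value wins, otherwise the language decides.
func resolveExtension(flagValue string, flagSet bool, language string) string {
	if flagSet {
		return flagValue
	}
	return detectExtension(language)
}

func main() {
	fmt.Println(resolveExtension("", false, "go"))     // auto-detect
	fmt.Println(resolveExtension("", false, "Python")) // case-insensitive
	fmt.Println(resolveExtension("txt", true, "go"))   // user override
	fmt.Println(resolveExtension("", false, "foobar")) // unknown language
}
```

Passing a separate "flag was set" boolean, rather than comparing against the default value, lets a user explicitly request `--extension txt` and still have it treated as an override.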

Examples

Before (all files get .txt):

$ codeblocks -i example.md
Saving file: sourcecode-0.txt
Saving file: sourcecode-1.txt

After (auto-detected extensions):

$ codeblocks -i example.md
Saving file: sourcecode-0.go
Saving file: sourcecode-1.py

Override still works:

$ codeblocks -i example.md --extension txt
Saving file: sourcecode-0.txt
Saving file: sourcecode-1.txt

Testing

  • Unit tests: 52 test cases covering language mapping, case sensitivity, fallbacks
  • Integration tests: 6 new tests for auto-detection, override, mixed scenarios
  • All existing tests pass: No regressions
  • Manual testing: Verified with Go, Python, JavaScript, Rust, unknown languages
  • CI pipeline: All checks passing

Test results:

=== RUN   TestLanguageToExtension (52 subtests)
=== RUN   TestLanguageToExtensionConsistency
=== RUN   TestLanguageBasedExtensions
=== RUN   TestExtensionFlagOverride
=== RUN   TestUnknownLanguageFallback
=== RUN   TestMixedLanguagesAndExtensions
=== RUN   TestCaseInsensitiveLanguages
PASS
ok      github.com/spandigitial/codeblocks/model
ok      github.com/spandigitial/codeblocks/cmd

Documentation

  • 📝 Updated README with comprehensive "Language-Based File Extensions" section
  • 📝 Listed all 40+ supported languages
  • 📝 Documented override behavior and fallbacks
  • 📝 Added multiple usage examples
  • 📝 Updated feature list and flag descriptions

Backward Compatibility

Fully backward compatible

  • Users who don't specify --extension: Get new auto-detection behavior (better UX)
  • Users who use --extension: Behavior unchanged (override works as before)
  • Unknown languages: Fall back to .txt (same as before)
  • No breaking changes to CLI interface or config file format

Supported Languages

Automatically recognizes 40+ languages including:

  • Compiled: Go, Rust, C, C++, Java, Kotlin, Swift, C#
  • Scripting: Python, Ruby, Perl, PHP, Lua, R, Julia
  • Web: JavaScript, TypeScript, HTML, CSS, JSX, TSX, Vue, Svelte
  • Shell: Bash, Fish, PowerShell
  • Data: JSON, YAML, TOML, XML
  • Other: Dockerfile, Makefile, SQL, GraphQL, Markdown, LaTeX

Checklist

  • Code compiles and builds successfully
  • All unit tests pass
  • All integration tests pass
  • Manual testing completed
  • Documentation updated (README)
  • Backward compatibility verified
  • No breaking changes
  • CI pipeline passes

🤖 Generated with Claude Code

richardwooding and others added 2 commits January 9, 2026 12:03
Fixed a critical bug where multiple code blocks from the same markdown file
would overwrite each other, resulting in only the last block being saved.

Bug Details:
- Root cause: Logic error in cmd/root.go line 111
- Condition checked 'if l == 0' which is never true inside a loop
- Changed to 'if l == 1' to correctly detect single vs multiple blocks

Behavior Changes:
- Single code block: saves as 'sourcecode.txt' (no index)
- Multiple code blocks: saves as 'sourcecode-0.txt', 'sourcecode-1.txt', etc.
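
As a rough sketch of the corrected naming rule (not the actual code in cmd/root.go), the single-versus-multiple decision could look like the following. `blockFilename` is a hypothetical helper; `total` plays the role of the block count that the commit message calls `l`.

```go
package main

import "fmt"

// blockFilename (hypothetical helper) names the i-th of total code blocks.
// A lone block gets no index; multiple blocks get a zero-based index.
func blockFilename(prefix, ext string, i, total int) string {
	if total == 1 {
		return fmt.Sprintf("%s.%s", prefix, ext)
	}
	return fmt.Sprintf("%s-%d.%s", prefix, i, ext)
}

func main() {
	for _, total := range []int{1, 3} {
		for i := 0; i < total; i++ {
			fmt.Println(blockFilename("sourcecode", "txt", i, total))
		}
	}
}
```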

Test Coverage:
Added cmd/root_test.go with 8 comprehensive test cases:
1. TestSingleCodeBlock - Verifies single block naming
2. TestMultipleCodeBlocks - Verifies indexed naming for multiple blocks
3. TestNoCodeBlocks - Verifies empty markdown handling
4. TestDifferentLanguages - Verifies language tag preservation
5. TestCustomExtension - Verifies -e flag functionality
6. TestCustomPrefix - Verifies -f flag functionality
7. TestOutputDirectory - Verifies -o flag functionality
8. TestSpecialCharactersInCode - Verifies special character handling

All tests pass. Manual testing confirms the fix works correctly.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Automatically detect file extensions from code block language identifiers.
Go code gets .go, Python gets .py, JavaScript gets .js, etc.

Features:
- Auto-detection for 40+ common programming languages
- --extension flag overrides auto-detection for backward compatibility
- Unknown/missing languages fall back to .txt
- Case-insensitive language matching

Implementation:
- Added model.LanguageToExtension() with comprehensive language mapping
- Modified filename generation to use detected language
- Preserved --extension flag as override mechanism
- Added 6 integration tests and unit tests for language mapper

Backward Compatibility:
- Existing --extension flag behavior unchanged
- Default fallback still .txt for unknown languages
- Users can force uniform extensions with --extension flag

Tests:
- model/extensions_test.go: Unit tests for language mapper
- cmd/root_test.go: Integration tests for auto-detection, override, fallback
- All tests passing (14 tests, 1 skipped)

Documentation:
- Updated README with comprehensive examples
- Added supported language list
- Documented override behavior

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR adds automatic file extension detection based on code block language identifiers in markdown files. When extracting code blocks, files now receive appropriate extensions for their language (e.g., Go → .go, Python → .py) instead of always using .txt.

Key changes:

  • Language-to-extension mapping for 40+ programming languages
  • Backward compatible: --extension flag overrides auto-detection
  • Smart fallback to .txt for unknown languages

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
|---|---|
| model/extensions.go | New language-to-extension mapper with case-insensitive matching and fallback handling |
| model/extensions_test.go | Comprehensive unit tests with 52 test cases covering mappings, aliases, and edge cases |
| cmd/root.go | Updated filename generation to auto-detect extensions from language when flag not specified |
| cmd/root_test.go | Added 6 integration tests for auto-detection, override behavior, and mixed scenarios |
| README.md | Documentation with examples, supported languages list, and usage patterns |

cmd/root_test.go Outdated
	if l == 1 {
		return filenamePrefix + "." + extension
	}
	return filenamePrefix + "-" + string(rune('0'+i)) + "." + extension

Copilot AI Jan 9, 2026

This test also uses string concatenation with rune arithmetic (string(rune('0'+i))) to generate filenames. This approach only works correctly for i values 0-9. For i >= 10, this would produce incorrect Unicode characters instead of "10", "11", etc. Consider using fmt.Sprintf("%s-%d.%s", filenamePrefix, i, extension) for clarity and correctness, which matches the pattern used in the actual implementation.

Suggested change
- return filenamePrefix + "-" + string(rune('0'+i)) + "." + extension
+ return fmt.Sprintf("%s-%d.%s", filenamePrefix, i, extension)
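
To make the failure mode concrete, here is a small standalone comparison. The two helper names are made up for this illustration and do not appear in the PR.

```go
package main

import "fmt"

// suffixViaRune reproduces the rune-arithmetic approach from the test code.
// It yields a single character, so it is only correct for 0 through 9.
func suffixViaRune(i int) string {
	return string(rune('0' + i))
}

// suffixViaSprintf formats the index as a decimal number of any width.
func suffixViaSprintf(i int) string {
	return fmt.Sprintf("%d", i)
}

func main() {
	for _, i := range []int{9, 10, 11} {
		fmt.Printf("i=%d rune=%q sprintf=%q\n", i, suffixViaRune(i), suffixViaSprintf(i))
	}
}
```

Because `'0'` is code point 48, an index of 10 maps to code point 58, which is `:`, and 11 maps to `;`. A tenth block would therefore be expected under a name like `sourcecode-:.txt` rather than `sourcecode-10.txt`.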

Copilot uses AI. Check for mistakes.
Comment on lines 5 to 97
// LanguageToExtension maps common programming language identifiers to file extensions.
// It performs case-insensitive matching and returns "txt" for unknown languages.
func LanguageToExtension(language string) string {
	// Map common language identifiers to extensions
	extensionMap := map[string]string{
		// Compiled languages
		"go": "go",
		"golang": "go",
		"rust": "rs",
		"c": "c",
		"cpp": "cpp",
		"c++": "cpp",
		"java": "java",
		"kotlin": "kt",
		"swift": "swift",
		"csharp": "cs",
		"c#": "cs",
		"objc": "m",
		"haskell": "hs",
		"scala": "scala",

		// Scripting languages
		"python": "py",
		"python3": "py",
		"ruby": "rb",
		"perl": "pl",
		"php": "php",
		"lua": "lua",
		"r": "R",
		"julia": "jl",

		// Web languages
		"javascript": "js",
		"js": "js",
		"typescript": "ts",
		"ts": "ts",
		"html": "html",
		"css": "css",
		"scss": "scss",
		"sass": "sass",
		"less": "less",
		"jsx": "jsx",
		"tsx": "tsx",
		"vue": "vue",
		"svelte": "svelte",

		// Shell
		"bash": "sh",
		"sh": "sh",
		"shell": "sh",
		"zsh": "sh",
		"fish": "fish",
		"powershell": "ps1",
		"ps1": "ps1",

		// Data formats
		"json": "json",
		"yaml": "yaml",
		"yml": "yaml",
		"toml": "toml",
		"xml": "xml",
		"ini": "ini",
		"properties": "properties",

		// Markup
		"markdown": "md",
		"md": "md",
		"tex": "tex",
		"latex": "tex",

		// Database
		"sql": "sql",
		"postgres": "sql",
		"postgresql": "sql",
		"mysql": "sql",
		"sqlite": "sql",
		"plsql": "sql",
		"tsql": "sql",

		// Other
		"dockerfile": "Dockerfile",
		"docker": "Dockerfile",
		"makefile": "Makefile",
		"make": "Makefile",
		"graphql": "graphql",
		"protobuf": "proto",
		"proto": "proto",
		"diff": "diff",
		"patch": "patch",
	}

	// Convert to lowercase for case-insensitive matching
	ext, found := extensionMap[strings.ToLower(language)]

Copilot AI Jan 9, 2026

Creating a new map on every function call is inefficient. Since the extensionMap is constant and never modified, it should be declared as a package-level variable to avoid unnecessary allocations. This will improve performance when processing files with multiple code blocks.

Suggested change
// languageExtensionMap maps common programming language identifiers to file extensions.
// It is constant for the lifetime of the program and used by LanguageToExtension.
var languageExtensionMap = map[string]string{
	// Compiled languages
	"go": "go",
	"golang": "go",
	"rust": "rs",
	"c": "c",
	"cpp": "cpp",
	"c++": "cpp",
	"java": "java",
	"kotlin": "kt",
	"swift": "swift",
	"csharp": "cs",
	"c#": "cs",
	"objc": "m",
	"haskell": "hs",
	"scala": "scala",
	// Scripting languages
	"python": "py",
	"python3": "py",
	"ruby": "rb",
	"perl": "pl",
	"php": "php",
	"lua": "lua",
	"r": "R",
	"julia": "jl",
	// Web languages
	"javascript": "js",
	"js": "js",
	"typescript": "ts",
	"ts": "ts",
	"html": "html",
	"css": "css",
	"scss": "scss",
	"sass": "sass",
	"less": "less",
	"jsx": "jsx",
	"tsx": "tsx",
	"vue": "vue",
	"svelte": "svelte",
	// Shell
	"bash": "sh",
	"sh": "sh",
	"shell": "sh",
	"zsh": "sh",
	"fish": "fish",
	"powershell": "ps1",
	"ps1": "ps1",
	// Data formats
	"json": "json",
	"yaml": "yaml",
	"yml": "yaml",
	"toml": "toml",
	"xml": "xml",
	"ini": "ini",
	"properties": "properties",
	// Markup
	"markdown": "md",
	"md": "md",
	"tex": "tex",
	"latex": "tex",
	// Database
	"sql": "sql",
	"postgres": "sql",
	"postgresql": "sql",
	"mysql": "sql",
	"sqlite": "sql",
	"plsql": "sql",
	"tsql": "sql",
	// Other
	"dockerfile": "Dockerfile",
	"docker": "Dockerfile",
	"makefile": "Makefile",
	"make": "Makefile",
	"graphql": "graphql",
	"protobuf": "proto",
	"proto": "proto",
	"diff": "diff",
	"patch": "patch",
}

// LanguageToExtension maps common programming language identifiers to file extensions.
// It performs case-insensitive matching and returns "txt" for unknown languages.
func LanguageToExtension(language string) string {
	// Convert to lowercase for case-insensitive matching
	ext, found := languageExtensionMap[strings.ToLower(language)]
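
For readers who want to try the package-level pattern in isolation, here is a trimmed, runnable sketch. It uses lowercase, illustrative names (`extensionByLanguage`, `languageToExtension`) and only a handful of the entries above. The snippet in the review is cut off before the function ends, so the fallback branch here is written from the doc comment ("returns \"txt\" for unknown languages") and is an assumption about the real body.

```go
package main

import (
	"fmt"
	"strings"
)

// extensionByLanguage is built once at package initialization instead of on
// every call. Only a few entries from the PR's map are repeated here.
var extensionByLanguage = map[string]string{
	"go":         "go",
	"golang":     "go",
	"python":     "py",
	"r":          "R",
	"dockerfile": "Dockerfile",
}

// languageToExtension looks up the lowercased identifier and falls back to
// "txt" for unknown or empty input (assumed from the doc comment).
func languageToExtension(language string) string {
	if ext, found := extensionByLanguage[strings.ToLower(language)]; found {
		return ext
	}
	return "txt"
}

func main() {
	for _, lang := range []string{"Go", "GOLANG", "R", "foobar", ""} {
		fmt.Printf("%q -> %q\n", lang, languageToExtension(lang))
	}
}
```

Keys are stored in lowercase, while values keep whatever case the target file name needs, which is why `"r"` maps to `"R"`.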

cmd/root.go Outdated
Comment on lines 115 to 116
if !userSpecifiedExtension && block.Language != "" {
// Auto-detect extension from language

Copilot AI Jan 9, 2026

The condition checks if block.Language is not empty before calling LanguageToExtension, but LanguageToExtension already handles empty strings by returning "txt" (as tested in extensions_test.go line 76). This redundant check can be removed for cleaner code. If the intent is to skip auto-detection for empty languages and use the default extension instead, this needs to be documented, but it creates inconsistent behavior since empty and unknown languages would be handled differently.

Suggested change
- if !userSpecifiedExtension && block.Language != "" {
- 	// Auto-detect extension from language
+ if !userSpecifiedExtension {
+ 	// Auto-detect extension from language (handles empty and unknown languages)

cmd/root.go Outdated
		return fmt.Sprintf("%s-%d.%s", filenamePrefix, i, fileExtension)
	}
})
sourceCode.Save(outputDirectory)

Copilot AI Jan 9, 2026

The error returned by sourceCode.Save() is not being checked or handled. If saving a file fails, the error will be silently ignored and the loop will continue. This could lead to partial extraction where some files are saved and others are not, with no indication to the user. Consider capturing and handling this error, possibly by collecting errors and returning them, or at least logging them.

Suggested change
- sourceCode.Save(outputDirectory)
+ if err := sourceCode.Save(outputDirectory); err != nil {
+ 	return fmt.Errorf("failed to save source code to %s: %w", outputDirectory, err)
+ }
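
The suggestion above stops at the first failure. The comment also mentions collecting errors, which could be sketched as follows. Everything here is hypothetical scaffolding: `saver`, `fakeFile`, and `saveAll` are invented for the example, and `errors.Join` requires Go 1.20 or newer.

```go
package main

import (
	"errors"
	"fmt"
)

// saver is a hypothetical stand-in for the type that sourceCode.Save is
// defined on in the real code base.
type saver interface {
	Save(dir string) error
}

// fakeFile simulates a file whose save may fail.
type fakeFile struct {
	name string
	fail bool
}

func (f fakeFile) Save(dir string) error {
	if f.fail {
		return fmt.Errorf("cannot write %s", f.name)
	}
	return nil
}

// saveAll attempts every file and returns all failures joined together,
// or nil when every save succeeded.
func saveAll(files []saver, dir string) error {
	var errs []error
	for _, f := range files {
		if err := f.Save(dir); err != nil {
			errs = append(errs, fmt.Errorf("failed to save source code to %s: %w", dir, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	files := []saver{fakeFile{"a.go", false}, fakeFile{"b.py", true}, fakeFile{"c.js", true}}
	if err := saveAll(files, "out"); err != nil {
		fmt.Println(err)
	}
}
```

Whether to return early or continue and report everything is a design choice: returning early is simpler, while collecting lets the user see every file that failed in one run.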

README.md Outdated
Comment on lines 93 to 120
```bash
# Input markdown with different languages
$ cat example.md
```go
package main
func main() {
println("Hello from Go!")
}
```

```python
def greet():
print("Hello from Python!")
```

```javascript
function greet() {
console.log("Hello from JavaScript!");
}
```

# Extract with auto-detected extensions
$ codeblocks -i example.md
Saving file: sourcecode-0.go in /current/directory
Saving file: sourcecode-1.py in /current/directory
Saving file: sourcecode-2.js in /current/directory
```

Copilot AI Jan 9, 2026

This documentation example is malformed. Line 93 starts a bash code block, but line 96 introduces a nested markdown code fence (```go) within that bash block without proper escaping or closing. The example appears to be trying to show the contents of example.md, but the triple backticks on lines 96, 104, and 109 are being interpreted as markdown syntax rather than as literal text to display. Consider either: (1) using four backticks for the outer bash block to allow three backticks inside, or (2) restructuring to show the command output without embedding the markdown source, similar to the approach used in lines 156-166.

cmd/root_test.go Outdated
Comment on lines 112 to 119
	expectedFilename = filenamePrefix + "-" + string(rune('0'+i)) + "." + extension
}

sourceCode := codeBlock.ToSourceCode(func(block model.FencedCodeBlock) string {
	if l == 1 {
		return filenamePrefix + "." + extension
	}
	return filenamePrefix + "-" + string(rune('0'+i)) + "." + extension

Copilot AI Jan 9, 2026

This test uses string concatenation with rune arithmetic (string(rune('0'+i))) to generate filenames with numeric suffixes. This approach only works correctly for i values 0-9. For i >= 10, this would produce incorrect Unicode characters instead of "10", "11", etc. While the current tests only use small numbers of blocks, consider using fmt.Sprintf("%s-%d.%s", filenamePrefix, i, extension) for clarity and correctness, which matches the pattern used in the actual implementation.

Suggested change
- 	expectedFilename = filenamePrefix + "-" + string(rune('0'+i)) + "." + extension
- }
- sourceCode := codeBlock.ToSourceCode(func(block model.FencedCodeBlock) string {
- 	if l == 1 {
- 		return filenamePrefix + "." + extension
- 	}
- 	return filenamePrefix + "-" + string(rune('0'+i)) + "." + extension
+ 	expectedFilename = fmt.Sprintf("%s-%d.%s", filenamePrefix, i, extension)
+ }
+ sourceCode := codeBlock.ToSourceCode(func(block model.FencedCodeBlock) string {
+ 	if l == 1 {
+ 		return filenamePrefix + "." + extension
+ 	}
+ 	return fmt.Sprintf("%s-%d.%s", filenamePrefix, i, extension)

Addressed all 6 Copilot review comments (5 distinct issues; two comments flagged the same rune-arithmetic problem):

1. Performance: Move languageExtensionMap to package-level variable
   - Prevents map recreation on every function call
   - Improves efficiency for repeated calls

2. Error Handling: Add error checking for Save() calls
   - Now captures and returns errors with context
   - Prevents silent failures during file extraction

3. Numeric String Generation: Fix test code for indices ≥10
   - Replace string(rune('0'+i)) with fmt.Sprintf("%s-%d.%s", ...)
   - Ensures correct filename generation for 10+ code blocks

4. Simplify Logic: Remove redundant language check
   - LanguageToExtension() already handles empty strings
   - Simplified condition from "!userSpecifiedExtension && block.Language != """
   - Now just "!userSpecifiedExtension"

5. Documentation: Fix malformed markdown in README
   - Use quadruple backticks (````) for outer fence
   - Separate input and output sections for clarity
   - Fixes rendering issues with nested code blocks

All tests passing (14 tests, 1 skipped).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@richardwooding richardwooding merged commit bcc3dc6 into main Jan 9, 2026
1 check passed
@richardwooding richardwooding deleted the feature/language-based-extensions branch January 9, 2026 11:04