Add language-based file extension detection #8
Fixed a critical bug where multiple code blocks from the same markdown file would overwrite each other, resulting in only the last block being saved.

Bug Details:
- Root cause: Logic error in cmd/root.go line 111
- Condition checked 'if l == 0' which is never true inside a loop
- Changed to 'if l == 1' to correctly detect single vs multiple blocks

Behavior Changes:
- Single code block: saves as 'sourcecode.txt' (no index)
- Multiple code blocks: saves as 'sourcecode-0.txt', 'sourcecode-1.txt', etc.

Test Coverage:
Added cmd/root_test.go with 8 comprehensive test cases:
1. TestSingleCodeBlock - Verifies single block naming
2. TestMultipleCodeBlocks - Verifies indexed naming for multiple blocks
3. TestNoCodeBlocks - Verifies empty markdown handling
4. TestDifferentLanguages - Verifies language tag preservation
5. TestCustomExtension - Verifies -e flag functionality
6. TestCustomPrefix - Verifies -f flag functionality
7. TestOutputDirectory - Verifies -o flag functionality
8. TestSpecialCharactersInCode - Verifies special character handling

All tests pass. Manual testing confirms the fix works correctly.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
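The naming rule described in this commit can be sketched roughly as follows. This is a minimal illustration only: `fileName` and its parameters are hypothetical names, not the code in cmd/root.go, and `total` stands in for the block count that the commit message calls `l`.

```go
package main

import "fmt"

// fileName is a hypothetical helper, not the project's actual function.
// total is the number of code blocks found in the markdown file and
// index is the zero-based position of the current block.
func fileName(prefix, ext string, index, total int) string {
	if total == 1 {
		// A single block is saved without an index suffix.
		return fmt.Sprintf("%s.%s", prefix, ext)
	}
	// Several blocks each get their own index, so they cannot overwrite each other.
	return fmt.Sprintf("%s-%d.%s", prefix, index, ext)
}

func main() {
	fmt.Println(fileName("sourcecode", "txt", 0, 1))
	for i := 0; i < 3; i++ {
		fmt.Println(fileName("sourcecode", "txt", i, 3))
	}
}
```

Because the check runs inside a loop over the blocks, the count is always at least 1 there, which is why a comparison against 0 could never select the single-block branch.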
Automatically detect file extensions from code block language identifiers. Go code gets .go, Python gets .py, JavaScript gets .js, etc.

Features:
- Auto-detection for 40+ common programming languages
- --extension flag overrides auto-detection for backward compatibility
- Unknown/missing languages fall back to .txt
- Case-insensitive language matching

Implementation:
- Added model.LanguageToExtension() with comprehensive language mapping
- Modified filename generation to use detected language
- Preserved --extension flag as override mechanism
- Added 6 integration tests and unit tests for language mapper

Backward Compatibility:
- Existing --extension flag behavior unchanged
- Default fallback still .txt for unknown languages
- Users can force uniform extensions with --extension flag

Tests:
- model/extensions_test.go: Unit tests for language mapper
- cmd/root_test.go: Integration tests for auto-detection, override, fallback
- All tests passing (14 tests, 1 skipped)

Documentation:
- Updated README with comprehensive examples
- Added supported language list
- Documented override behavior

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Pull request overview
This PR adds automatic file extension detection based on code block language identifiers in markdown files. When extracting code blocks, files now receive appropriate extensions for their language (e.g., Go → .go, Python → .py) instead of always using .txt.
Key changes:
- Language-to-extension mapping for 40+ programming languages
- Backward compatible: --extension flag overrides auto-detection
- Smart fallback to .txt for unknown languages
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| model/extensions.go | New language-to-extension mapper with case-insensitive matching and fallback handling |
| model/extensions_test.go | Comprehensive unit tests with 52 test cases covering mappings, aliases, and edge cases |
| cmd/root.go | Updated filename generation to auto-detect extensions from language when flag not specified |
| cmd/root_test.go | Added 6 integration tests for auto-detection, override behavior, and mixed scenarios |
| README.md | Documentation with examples, supported languages list, and usage patterns |
cmd/root_test.go
Outdated
    if l == 1 {
        return filenamePrefix + "." + extension
    }
    return filenamePrefix + "-" + string(rune('0'+i)) + "." + extension
Copilot
AI
Jan 9, 2026
This test also uses string concatenation with rune arithmetic (string(rune('0'+i))) to generate filenames. This approach only works correctly for i values 0-9. For i >= 10, this would produce incorrect Unicode characters instead of "10", "11", etc. Consider using fmt.Sprintf("%s-%d.%s", filenamePrefix, i, extension) for clarity and correctness, which matches the pattern used in the actual implementation.
    - return filenamePrefix + "-" + string(rune('0'+i)) + "." + extension
    + return fmt.Sprintf("%s-%d.%s", filenamePrefix, i, extension)
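To make the failure mode concrete, here is a small sketch comparing the two approaches. The helper names `runeSuffix` and `sprintfSuffix` are made up for this illustration. The character '0' has code point 48, so adding 10 gives 58, which is ':' rather than "10".

```go
package main

import "fmt"

// runeSuffix mirrors the pattern flagged in the review: it converts a
// single code point to a string, so it is only a digit for i in 0..9.
func runeSuffix(i int) string {
	return string(rune('0' + i))
}

// sprintfSuffix formats the index as a decimal number of any length.
func sprintfSuffix(i int) string {
	return fmt.Sprintf("%d", i)
}

func main() {
	for _, i := range []int{0, 9, 10, 11} {
		fmt.Printf("i=%d rune=%q sprintf=%q\n", i, runeSuffix(i), sprintfSuffix(i))
	}
}
```

For i = 10 and i = 11 the rune version yields ":" and ";", which would end up in the generated filename.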
model/extensions.go
Outdated
| // LanguageToExtension maps common programming language identifiers to file extensions. | ||
| // It performs case-insensitive matching and returns "txt" for unknown languages. | ||
| func LanguageToExtension(language string) string { | ||
| // Map common language identifiers to extensions | ||
| extensionMap := map[string]string{ | ||
| // Compiled languages | ||
| "go": "go", | ||
| "golang": "go", | ||
| "rust": "rs", | ||
| "c": "c", | ||
| "cpp": "cpp", | ||
| "c++": "cpp", | ||
| "java": "java", | ||
| "kotlin": "kt", | ||
| "swift": "swift", | ||
| "csharp": "cs", | ||
| "c#": "cs", | ||
| "objc": "m", | ||
| "haskell": "hs", | ||
| "scala": "scala", | ||
|
|
||
| // Scripting languages | ||
| "python": "py", | ||
| "python3": "py", | ||
| "ruby": "rb", | ||
| "perl": "pl", | ||
| "php": "php", | ||
| "lua": "lua", | ||
| "r": "R", | ||
| "julia": "jl", | ||
|
|
||
| // Web languages | ||
| "javascript": "js", | ||
| "js": "js", | ||
| "typescript": "ts", | ||
| "ts": "ts", | ||
| "html": "html", | ||
| "css": "css", | ||
| "scss": "scss", | ||
| "sass": "sass", | ||
| "less": "less", | ||
| "jsx": "jsx", | ||
| "tsx": "tsx", | ||
| "vue": "vue", | ||
| "svelte": "svelte", | ||
|
|
||
| // Shell | ||
| "bash": "sh", | ||
| "sh": "sh", | ||
| "shell": "sh", | ||
| "zsh": "sh", | ||
| "fish": "fish", | ||
| "powershell": "ps1", | ||
| "ps1": "ps1", | ||
|
|
||
| // Data formats | ||
| "json": "json", | ||
| "yaml": "yaml", | ||
| "yml": "yaml", | ||
| "toml": "toml", | ||
| "xml": "xml", | ||
| "ini": "ini", | ||
| "properties": "properties", | ||
|
|
||
| // Markup | ||
| "markdown": "md", | ||
| "md": "md", | ||
| "tex": "tex", | ||
| "latex": "tex", | ||
|
|
||
| // Database | ||
| "sql": "sql", | ||
| "postgres": "sql", | ||
| "postgresql": "sql", | ||
| "mysql": "sql", | ||
| "sqlite": "sql", | ||
| "plsql": "sql", | ||
| "tsql": "sql", | ||
|
|
||
| // Other | ||
| "dockerfile": "Dockerfile", | ||
| "docker": "Dockerfile", | ||
| "makefile": "Makefile", | ||
| "make": "Makefile", | ||
| "graphql": "graphql", | ||
| "protobuf": "proto", | ||
| "proto": "proto", | ||
| "diff": "diff", | ||
| "patch": "patch", | ||
| } | ||
|
|
||
| // Convert to lowercase for case-insensitive matching | ||
| ext, found := extensionMap[strings.ToLower(language)] |
Copilot
AI
Jan 9, 2026
Creating a new map on every function call is inefficient. Since the extensionMap is constant and never modified, it should be declared as a package-level variable to avoid unnecessary allocations. This will improve performance when processing files with multiple code blocks.
| // languageExtensionMap maps common programming language identifiers to file extensions. | |
| // It is constant for the lifetime of the program and used by LanguageToExtension. | |
| var languageExtensionMap = map[string]string{ | |
| // Compiled languages | |
| "go": "go", | |
| "golang": "go", | |
| "rust": "rs", | |
| "c": "c", | |
| "cpp": "cpp", | |
| "c++": "cpp", | |
| "java": "java", | |
| "kotlin": "kt", | |
| "swift": "swift", | |
| "csharp": "cs", | |
| "c#": "cs", | |
| "objc": "m", | |
| "haskell": "hs", | |
| "scala": "scala", | |
| // Scripting languages | |
| "python": "py", | |
| "python3": "py", | |
| "ruby": "rb", | |
| "perl": "pl", | |
| "php": "php", | |
| "lua": "lua", | |
| "r": "R", | |
| "julia": "jl", | |
| // Web languages | |
| "javascript": "js", | |
| "js": "js", | |
| "typescript": "ts", | |
| "ts": "ts", | |
| "html": "html", | |
| "css": "css", | |
| "scss": "scss", | |
| "sass": "sass", | |
| "less": "less", | |
| "jsx": "jsx", | |
| "tsx": "tsx", | |
| "vue": "vue", | |
| "svelte": "svelte", | |
| // Shell | |
| "bash": "sh", | |
| "sh": "sh", | |
| "shell": "sh", | |
| "zsh": "sh", | |
| "fish": "fish", | |
| "powershell": "ps1", | |
| "ps1": "ps1", | |
| // Data formats | |
| "json": "json", | |
| "yaml": "yaml", | |
| "yml": "yaml", | |
| "toml": "toml", | |
| "xml": "xml", | |
| "ini": "ini", | |
| "properties": "properties", | |
| // Markup | |
| "markdown": "md", | |
| "md": "md", | |
| "tex": "tex", | |
| "latex": "tex", | |
| // Database | |
| "sql": "sql", | |
| "postgres": "sql", | |
| "postgresql": "sql", | |
| "mysql": "sql", | |
| "sqlite": "sql", | |
| "plsql": "sql", | |
| "tsql": "sql", | |
| // Other | |
| "dockerfile": "Dockerfile", | |
| "docker": "Dockerfile", | |
| "makefile": "Makefile", | |
| "make": "Makefile", | |
| "graphql": "graphql", | |
| "protobuf": "proto", | |
| "proto": "proto", | |
| "diff": "diff", | |
| "patch": "patch", | |
| } | |
| // LanguageToExtension maps common programming language identifiers to file extensions. | |
| // It performs case-insensitive matching and returns "txt" for unknown languages. | |
| func LanguageToExtension(language string) string { | |
| // Convert to lowercase for case-insensitive matching | |
| ext, found := languageExtensionMap[strings.ToLower(language)] |
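A condensed sketch of the suggested shape, with the map built once at package initialization and the function reduced to a lookup. The map here is deliberately abbreviated and the lowercase names `extByLanguage` and `languageToExtension` are stand-ins. The "txt" fallback follows the doc comment quoted above; the rest of the real function body is not shown in this review, so treat this as an assumption about its shape.

```go
package main

import (
	"fmt"
	"strings"
)

// extByLanguage is an abbreviated stand-in for the package-level map.
// It is allocated once, not on every call.
var extByLanguage = map[string]string{
	"go":         "go",
	"golang":     "go",
	"python":     "py",
	"javascript": "js",
}

// languageToExtension lowercases the identifier for case-insensitive
// matching and returns "txt" when the language is unknown or empty.
func languageToExtension(language string) string {
	if ext, found := extByLanguage[strings.ToLower(language)]; found {
		return ext
	}
	return "txt"
}

func main() {
	for _, lang := range []string{"Go", "PYTHON", "foobar", ""} {
		fmt.Printf("%q -> %q\n", lang, languageToExtension(lang))
	}
}
```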
cmd/root.go
Outdated
    if !userSpecifiedExtension && block.Language != "" {
        // Auto-detect extension from language
Copilot
AI
Jan 9, 2026
The condition checks if block.Language is not empty before calling LanguageToExtension, but LanguageToExtension already handles empty strings by returning "txt" (as tested in extensions_test.go line 76). This redundant check can be removed for cleaner code. If the intent is to skip auto-detection for empty languages and use the default extension instead, this needs to be documented, but it creates inconsistent behavior since empty and unknown languages would be handled differently.
    - if !userSpecifiedExtension && block.Language != "" {
    -     // Auto-detect extension from language
    + if !userSpecifiedExtension {
    +     // Auto-detect extension from language (handles empty and unknown languages)
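The precedence this suggestion leaves in place can be sketched as below. `resolveExtension` and `detect` are hypothetical names for illustration; `detect` is a tiny stand-in for model.LanguageToExtension and only assumes that it returns "txt" for empty and unknown languages, as the comment above states.

```go
package main

import (
	"fmt"
	"strings"
)

// detect is a stand-in for model.LanguageToExtension with a two-entry map.
func detect(language string) string {
	known := map[string]string{"go": "go", "python": "py"}
	if ext, ok := known[strings.ToLower(language)]; ok {
		return ext
	}
	return "txt"
}

// resolveExtension applies the override rule: an explicit flag value wins,
// otherwise the extension is detected from the block's language tag.
// Empty and unknown languages take the same path and both yield "txt".
func resolveExtension(language, flagValue string, userSpecified bool) string {
	if userSpecified {
		return flagValue
	}
	return detect(language)
}

func main() {
	fmt.Println(resolveExtension("go", "", false))
	fmt.Println(resolveExtension("go", "txt", true))
	fmt.Println(resolveExtension("", "", false))
	fmt.Println(resolveExtension("foobar", "", false))
}
```

Routing the empty case through the same lookup keeps empty and unknown languages consistent, which is the inconsistency the comment warns about.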
cmd/root.go
Outdated
            return fmt.Sprintf("%s-%d.%s", filenamePrefix, i, fileExtension)
        }
    })
    sourceCode.Save(outputDirectory)
Copilot
AI
Jan 9, 2026
The error returned by sourceCode.Save() is not being checked or handled. If saving a file fails, the error will be silently ignored and the loop will continue. This could lead to partial extraction where some files are saved and others are not, with no indication to the user. Consider capturing and handling this error, possibly by collecting errors and returning them, or at least logging them.
    - sourceCode.Save(outputDirectory)
    + if err := sourceCode.Save(outputDirectory); err != nil {
    +     return fmt.Errorf("failed to save source code to %s: %w", outputDirectory, err)
    + }
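The suggestion above returns on the first failure. The other option mentioned in the comment, collecting errors and returning them together, might look like this sketch. The `saver` interface and `fakeFile` type are invented for the example and are not the project's types; `errors.Join` requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// saver is a hypothetical interface standing in for the project's source code type.
type saver interface {
	Save(dir string) error
}

// fakeFile simulates a file whose Save call may fail.
type fakeFile struct {
	name string
	fail bool
}

func (f fakeFile) Save(dir string) error {
	if f.fail {
		return fmt.Errorf("cannot write %s", f.name)
	}
	return nil
}

// saveAll tries every file and reports all failures at once.
// errors.Join returns nil when no errors were collected.
func saveAll(files []saver, dir string) error {
	var errs []error
	for _, f := range files {
		if err := f.Save(dir); err != nil {
			errs = append(errs, fmt.Errorf("failed to save source code to %s: %w", dir, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	files := []saver{fakeFile{"a.go", false}, fakeFile{"b.py", true}}
	if err := saveAll(files, "out"); err != nil {
		fmt.Println(err)
	}
}
```

Which variant fits better depends on whether a partial extraction is acceptable to the user. Returning early is simpler, while joining errors reports every file that failed.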
README.md
Outdated
| ```bash | ||
| # Input markdown with different languages | ||
| $ cat example.md | ||
| ```go | ||
| package main | ||
| func main() { | ||
| println("Hello from Go!") | ||
| } | ||
| ``` | ||
|
|
||
| ```python | ||
| def greet(): | ||
| print("Hello from Python!") | ||
| ``` | ||
|
|
||
| ```javascript | ||
| function greet() { | ||
| console.log("Hello from JavaScript!"); | ||
| } | ||
| ``` | ||
|
|
||
| # Extract with auto-detected extensions | ||
| $ codeblocks -i example.md | ||
| Saving file: sourcecode-0.go in /current/directory | ||
| Saving file: sourcecode-1.py in /current/directory | ||
| Saving file: sourcecode-2.js in /current/directory | ||
| ``` |
Copilot
AI
Jan 9, 2026
This documentation example is malformed. Line 93 starts a bash code block, but line 96 introduces a nested markdown code fence (```go) within that bash block without proper escaping or closing. The example appears to be trying to show the contents of example.md, but the triple backticks on lines 96, 104, and 109 are being interpreted as markdown syntax rather than as literal text to display. Consider either: (1) using four backticks for the outer bash block to allow three backticks inside, or (2) restructuring to show the command output without embedding the markdown source, similar to the approach used in lines 156-166.
cmd/root_test.go
Outdated
        expectedFilename = filenamePrefix + "-" + string(rune('0'+i)) + "." + extension
    }

    sourceCode := codeBlock.ToSourceCode(func(block model.FencedCodeBlock) string {
        if l == 1 {
            return filenamePrefix + "." + extension
        }
        return filenamePrefix + "-" + string(rune('0'+i)) + "." + extension
Copilot
AI
Jan 9, 2026
This test uses string concatenation with rune arithmetic (string(rune('0'+i))) to generate filenames with numeric suffixes. This approach only works correctly for i values 0-9. For i >= 10, this would produce incorrect Unicode characters instead of "10", "11", etc. While the current tests only use small numbers of blocks, consider using fmt.Sprintf("%s-%d.%s", filenamePrefix, i, extension) for clarity and correctness, which matches the pattern used in the actual implementation.
    -     expectedFilename = filenamePrefix + "-" + string(rune('0'+i)) + "." + extension
    - }
    - sourceCode := codeBlock.ToSourceCode(func(block model.FencedCodeBlock) string {
    -     if l == 1 {
    -         return filenamePrefix + "." + extension
    -     }
    -     return filenamePrefix + "-" + string(rune('0'+i)) + "." + extension
    +     expectedFilename = fmt.Sprintf("%s-%d.%s", filenamePrefix, i, extension)
    + }
    + sourceCode := codeBlock.ToSourceCode(func(block model.FencedCodeBlock) string {
    +     if l == 1 {
    +         return filenamePrefix + "." + extension
    +     }
    +     return fmt.Sprintf("%s-%d.%s", filenamePrefix, i, extension)
Fixed all 6 issues raised by Copilot:
1. Performance: Move languageExtensionMap to package-level variable
- Prevents map recreation on every function call
- Improves efficiency for repeated calls
2. Error Handling: Add error checking for Save() calls
- Now captures and returns errors with context
- Prevents silent failures during file extraction
3. Numeric String Generation: Fix test code for indices ≥10
- Replace string(rune('0'+i)) with fmt.Sprintf("%s-%d.%s", ...)
- Ensures correct filename generation for 10+ code blocks
4. Simplify Logic: Remove redundant language check
- LanguageToExtension() already handles empty strings
- Simplified condition from "!userSpecifiedExtension && block.Language != """
- Now just "!userSpecifiedExtension"
5. Documentation: Fix malformed markdown in README
- Use quadruple backticks (````) for outer fence
- Separate input and output sections for clarity
- Fixes rendering issues with nested code blocks
All tests passing (14 tests, 1 skipped).
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Summary
This PR adds automatic file extension detection based on code block language identifiers. Extracted files now get appropriate extensions for their language: Go code → .go, Python → .py, JavaScript → .js, etc.
Changes
New Features
- Auto-detection for 40+ common programming languages
- --extension flag overrides auto-detection
- Unknown languages fall back to .txt
Implementation Details
New Files:
- model/extensions.go: Language-to-extension mapper function
- model/extensions_test.go: Comprehensive unit tests (50+ test cases)
Modified Files:
- cmd/root.go: Updated filename generation to use language detection
- cmd/root_test.go: Added 6 integration tests for new functionality
- README.md: Comprehensive documentation with examples
Behavior
| Code block language | Flag | Resulting extension |
|---|---|---|
| go | (none) | .go |
| python | (none) | .py |
| go | --extension txt | .txt |
| foobar | (none) | .txt |
Examples
Before (all files get .txt):
After (auto-detected extensions):
Override still works:
Testing
Test results:
Documentation
Backward Compatibility
✅ Fully backward compatible
- Without --extension: Get new auto-detection behavior (better UX)
- With --extension: Behavior unchanged (override works as before)
- Unknown languages: .txt (same as before)
Supported Languages
Automatically recognizes 40+ languages including:
Checklist
🤖 Generated with Claude Code