[Bug]: Claude Sonnet returns markdown-wrapped JSON despite JSON mode being enabled in generate_schema #1663

@aravindkarnam

crawl4ai version

0.7.8

Current Behavior

JsonCssExtractionStrategy.generate_schema() crashes with a JSON parsing error when an LLM (particularly Claude Sonnet) returns valid JSON wrapped in a markdown code block:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The method attempts to parse the raw LLM response directly, without handling common wrappers such as:

  • ```json\n{...}\n```
  • ```\n{...}\n```

Expected Behavior

The library should tolerate these common LLM response formats and parse the JSON successfully even when it is wrapped in a markdown code block, without requiring users to modify their code.
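One possible shape for the fix, as a minimal sketch: strip an optional markdown fence before handing the text to `json.loads`. The helper name `parse_llm_json` is hypothetical and is not part of crawl4ai; where it would be called inside `generate_schema()` is also an assumption, since I have not traced the library's internals.

```python
import json
import re

# Matches a whole response that is a single fenced block, with an optional
# language tag after the opening fence (e.g. ```json). Hypothetical helper,
# not crawl4ai API.
_FENCE_RE = re.compile(r"^```[a-zA-Z0-9_-]*\s*\n(.*?)\n?```\s*$", re.DOTALL)


def parse_llm_json(raw: str):
    """Parse JSON from an LLM response, tolerating a markdown code fence."""
    text = raw.strip()
    match = _FENCE_RE.match(text)
    if match:
        # Keep only the fenced payload; unfenced responses pass through as-is.
        text = match.group(1).strip()
    return json.loads(text)


if __name__ == "__main__":
    print(parse_llm_json('```json\n{"name": "schema"}\n```'))
    print(parse_llm_json('```\n{"name": "schema"}\n```'))
    print(parse_llm_json('{"name": "schema"}'))
```

This only covers the two wrappers listed above plus plain JSON. Responses with prose before or after the fence would still raise `JSONDecodeError` and would need a more lenient extraction step.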

Is this reproducible?

Yes

Code snippets

import asyncio
from crawl4ai import (
    CrawlerRunConfig,
    AsyncWebCrawler,
    BrowserConfig,
    JsonCssExtractionStrategy,
    LLMConfig,
)
from pprint import pprint


async def generate_schema():
    browser_config = BrowserConfig(
        headless=False,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.fastenal.com/product/details/925576347",
            config=CrawlerRunConfig(
                remove_overlay_elements=True,
                remove_forms=True,
                wait_for="div#pdp-details",
                css_selector="div#pdp-details",
            ),
        )
        schema = JsonCssExtractionStrategy.generate_schema(
            result.fit_html,
            llm_config=LLMConfig(
                provider="anthropic/claude-sonnet-4-5",
                api_token="env:ANTHROPIC_API_KEY",
            ),
            target_json_example={
                "part_name": """7" Dia x NH Arbor 60+ Grit Coarse Ceramic Purple Fiber Disc""",
                "fastenal_part_no": "925576347",
                "manufacturer_part_no": "7100349674",
                "unspsc": "31191506",
                "manufacturer": "3M",
                "brand": "CUBITRON",
                "attachment_type": "GL",
                "diameter": '7"',
                "arbor_size": "NH",
                "abrasive_material": "Ceramic",
                "grade": "Coarse",
                "grit": "60+",
                "backing_material": "Fiber",
                "coat_type": "Open",
                "color": "Purple",
                "type": "Fiber Disc",
                "operating_speed": "8600 rpm",
                "product_weight": ".1138",
                "uom": "each",
                "country_of_origin": "United States",
                "origin_note": "Origin is subject to change",
            },
        )
        return schema

pprint(asyncio.run(generate_schema()))

Labels

🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)
