Conversation
@gunir gunir commented Jan 15, 2026

Fix: #34

This is a classic "Mojibake" issue caused by double-decoding. The artifacts you are seeing (â†’ instead of →) occur when UTF-8 bytes are misinterpreted as Windows-1252 (Latin-1).

The function getRawBodyReader relies on golang.org/x/net/html/charset.NewReader to handle character encoding.

When the upstream website (e.g., data-star.dev) returns Content-Type: text/html without an explicit charset=utf-8 parameter, the Go charset library defaults to Windows-1252 (to be spec-compliant with legacy HTML).

Your proxy takes the valid UTF-8 bytes (e.g., E2 86 92 for →), "decodes" them as Windows-1252 (resulting in â, †, ’), and then re-encodes them as UTF-8 for the output.

The browser receives this re-encoded mess and displays â†’.

Both StreamRewrite and BufferRewrite unconditionally force the Content-Type: text/html; charset=utf-8 header because they assume the content has already been successfully converted to UTF-8 by getRawBodyReader.

Original Response: Content-Type: text/html (No charset specified).

getRawBodyReader: Sees no charset. The golang.org/x/net/html/charset library defaults to Windows-1252 for compatibility. It reads the valid UTF-8 bytes from the server as if they were Windows-1252 bytes.

Result: UTF-8 → (bytes E2 86 92) becomes the string â†’.

StreamRewrite: Sets Content-Type: text/html; charset=utf-8.

Browser: Sees the charset=utf-8 header and renders the string â†’ literally, character by character, instead of the arrow you wanted.
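The double-decoding walkthrough above can be reproduced in a few lines of Go. Here, mojibake is an illustrative helper that maps only the three byte values involved, not a real Windows-1252 decoder:

```go
package main

import "fmt"

// mojibake simulates decoding raw UTF-8 bytes as Windows-1252 and
// re-encoding the result as UTF-8. Only the three byte values needed
// for "→" are mapped here; a real decoder covers all 256.
func mojibake(raw []byte) string {
	cp1252 := map[byte]rune{0xE2: 'â', 0x86: '†', 0x92: '’'}
	out := make([]rune, 0, len(raw))
	for _, b := range raw {
		// Each UTF-8 continuation byte becomes its own character.
		out = append(out, cp1252[b])
	}
	return string(out) // re-encoded as UTF-8 on the way out
}

func main() {
	fmt.Println(mojibake([]byte("→"))) // prints "â†’"
}
```

The three bytes E2 86 92 that together encode one arrow in UTF-8 are each a standalone character in Windows-1252, which is exactly the corruption described above.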


Signed-off-by: Gunir <134402102+gunir@users.noreply.github.com>

coderabbitai bot commented Jan 15, 2026

Walkthrough

Change in httprewrite/rawbody.go modifies content-type parsing to fall back to "text/plain" on parse failure, treats missing charset as UTF-8 for passthrough and post-decompression, and routes decompression/closer logic to a local body variable instead of res.Body.

Changes

Content-type, charset, and body routing — httprewrite/rawbody.go:
- On Content-Type parse failure, use mimeType = "text/plain" rather than returning an error.
- If charset is missing/empty, treat it as UTF-8 for passthrough and decoding decisions.
- When decompression is used, ensure contentType includes charset=utf-8 if absent before creating decoder.
- Use header.Get(...) for header access.
- Redirect decompression input and resource cleanup to a local body variable instead of res.Body.
- Adjust error paths to account for updated contentType/charset handling and decoder creation.
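The parsing fallback summarized above can be sketched with Go's standard mime package. contentCharset is a hypothetical helper used only for illustration; the actual logic in rawbody.go may be structured differently:

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// contentCharset mirrors the described fallback: a Content-Type that
// fails to parse yields "text/plain", and a missing charset parameter
// is treated as UTF-8 (the modern web default).
func contentCharset(contentType string) (mimeType, charset string) {
	mimeType, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		// Parse failure: fall back instead of returning an error.
		return "text/plain", "utf-8"
	}
	// Accessing a nil map is safe in Go and returns the zero value "".
	cs := strings.ToLower(params["charset"])
	if cs == "" {
		cs = "utf-8"
	}
	return mimeType, cs
}

func main() {
	fmt.Println(contentCharset("text/html"))                     // text/html utf-8
	fmt.Println(contentCharset("text/html; charset=ISO-8859-1")) // text/html iso-8859-1
	fmt.Println(contentCharset(""))                              // text/plain utf-8
}
```

With this shape, the downstream decision (pass through vs. decode) only ever sees a normalized, lowercase charset.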

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks — ✅ 5 of 5 passed
  • Title check: ✅ Passed — the title accurately describes the main change (fixing broken encoding in rawbody.go due to Content-Type mishandling), which directly addresses the pull request's core objective.
  • Description check: ✅ Passed — the description provides excellent context about the mojibake issue, explains the root cause (Windows-1252 vs UTF-8), and references the linked issue (#34), though the template sections are left blank.
  • Linked Issues check: ✅ Passed — the PR directly addresses issue #34 by fixing the Content-Type/charset handling in rawbody.go so UTF-8 bytes are no longer misinterpreted as Windows-1252, which was causing mojibake on data-star.dev.
  • Out of Scope Changes check: ✅ Passed — all changes in rawbody.go are focused on the charset/encoding issue identified in #34; no unrelated modifications are present.
  • Docstring Coverage check: ✅ Passed — docstring coverage is 100.00%, above the required 80.00% threshold.


🧹 Recent nitpick comments
httprewrite/rawbody.go (1)

110-116: Pre-existing edge case: potential double-close when encoding == "".

When there's no compression but a non-UTF-8 charset is specified (e.g., charset=iso-8859-1), decompressReader returns res.Body unchanged (line 141). The multiCloser then contains [res.Body, res.Body], causing res.Body.Close() to be called twice.

This is a pre-existing issue (not introduced by this PR), and the PR's FIX 1 actually reduces the frequency of hitting this path. While double-closing may be idempotent in practice, it's not guaranteed.

♻️ Optional fix to avoid double-close
+	// Avoid adding res.Body twice when no decompression occurred
+	var closers []io.Closer
+	if decompressedReader != res.Body {
+		closers = []io.Closer{decompressedReader, res.Body}
+	} else {
+		closers = []io.Closer{res.Body}
+	}
+
 	return struct {
 		io.Reader
 		io.Closer
 	}{
 		decodedReader,
-		&multiCloser{[]io.Closer{decompressedReader, res.Body}},
+		&multiCloser{closers},
 	}, mimeType, nil

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f81e182 and 0725a57.

📒 Files selected for processing (1)
  • httprewrite/rawbody.go
🔇 Additional comments (3)
httprewrite/rawbody.go (3)

82-84: LGTM – graceful fallback for malformed Content-Type.

The fallback to "text/plain" is appropriate. Note that when parsing fails, params is nil, but accessing params["charset"] on line 88 safely returns "" in Go (nil map returns zero value), so the subsequent logic correctly handles this case.


86-91: LGTM – correctly prevents Windows-1252 default for uncompressed UTF-8 content.

This fix correctly addresses the mojibake issue from #34. When there's no compression and the charset is either UTF-8 or unspecified (modern web default), passing through the raw body avoids the charset.NewReader call that would incorrectly default to Windows-1252. The case-insensitive comparison via strings.ToLower is appropriate.


98-102: LGTM – correctly forces UTF-8 for compressed content without explicit charset.

This complements FIX 1 by handling the compressed content path. When charset is unspecified, explicitly setting charset=utf-8 in the contentType passed to charset.NewReader prevents the library from defaulting to Windows-1252. Good approach of modifying only the local variable without affecting the actual response header.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
httprewrite/rawbody.go (1)

110-116: Critical: body is nil in multiCloser – will panic on Close().

The multiCloser includes body which is nil. When the returned reader is closed, multiCloser.Close() will attempt to call Close() on a nil interface, causing a panic.

🐛 Proposed fix
 	return struct {
 		io.Reader
 		io.Closer
 	}{
 		decodedReader,
-		&multiCloser{[]io.Closer{decompressedReader, body}},
+		&multiCloser{[]io.Closer{decompressedReader, res.Body}},
 	}, mimeType, nil
🤖 Fix all issues with AI agents
In `@httprewrite/rawbody.go`:
- Around line 79-80: The code uses an undefined variable header when reading
headers (encoding := header.Get("Content-Encoding") and contentType :=
header.Get("Content-Type")), causing a compile error; update those calls to use
the response's header (res.Header.Get(...)) so they reference the existing
res.Header, keeping the same variable names encoding and contentType.
- Around line 86-91: The early return uses the named return variable body which
is never assigned and is nil; change the pass-through branch so it returns the
actual response body (res.Body) to callers instead of the zero-valued
body—either assign res.Body to body before returning or directly return the
response body along with mimeType and nil error in the branch that checks
charsetParam, encoding and params.
- Around line 93-96: The code passes an uninitialized nil variable "body" into
decompressReader, which will fail; locate where "body" is declared and
initialize it with the actual response payload (e.g., res.Body or a reader
wrapping the already-read bytes) before calling decompressReader(encoding),
ensuring errors are handled and any opened readers are closed to avoid leaks.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e44ec7 and f81e182.

📒 Files selected for processing (1)
  • httprewrite/rawbody.go
🔇 Additional comments (2)
httprewrite/rawbody.go (2)

82-84: LGTM on the fallback handling.

Falling back to text/plain when Content-Type parsing fails is reasonable. Accessing params["charset"] on a nil map safely returns an empty string in Go.


98-102: Good fix for the Windows-1252 default issue.

This correctly forces charset=utf-8 when the upstream response has compression but no explicit charset, preventing charset.NewReader from defaulting to Windows-1252 per HTML5 legacy behavior. This addresses the root cause described in issue #34.


Update rawbody.go - Fix some variable mismatches

Signed-off-by: Gunir <134402102+gunir@users.noreply.github.com>

gunir commented Jan 15, 2026

With this change the issue is 100% gone; here is a screenshot of testing with Zen:
(screenshot)

@anfragment anfragment self-assigned this Jan 16, 2026
	// [FIX 2] If we reach here (because of compression), ensure we don't
	// let charset.NewReader default to Windows-1252 if charset is missing.
	if charsetParam == "" {
		contentType = mimeType + "; charset=utf-8"
	}


Development

Successfully merging this pull request may close these issues.

Broken encoding on data-star.dev
