Use `beautifulsoup4` instead of `lxml` for URL previews #19301

clokep · 2025-12-11T13:42:43Z

Use beautifulsoup4 instead of lxml for URL previews. This offers some nicer APIs when parsing HTML and avoids using libxml2, which is unmaintained.

I haven’t done a full regression against commonly previewed sites, but I expect this will give similar (or better) results.

beautifulsoup also handles decoding the charset for us, which means less custom code.
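As a rough illustration of the charset point, here is a minimal sketch, assuming the `html.parser` backend and a made-up Latin-1 page; it is not the code from this PR. When given bytes rather than `str`, Beautiful Soup picks an encoding itself (it can use the `<meta charset>` declaration), so no manual decode step is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical page body: Latin-1 bytes, declared via <meta charset>.
raw_html = (
    b'<html><head><meta charset="iso-8859-1">'
    b"<title>caf\xe9</title></head><body></body></html>"
)

# Passing bytes (not str) lets Beautiful Soup choose the encoding itself.
soup = BeautifulSoup(raw_html, "html.parser")
page_title = soup.title.string

print(page_title)
print(soup.original_encoding)  # the encoding Beautiful Soup settled on
```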

MadLittleMods · 2025-12-29T16:37:32Z

changelog.d/19301.misc

@@ -0,0 +1 @@
+Switch to beautofulsoup4 from lxml for URL previews. Controbuted by @clokep.


Conflicts to resolve

MadLittleMods · 2025-12-29T16:38:02Z

changelog.d/19301.misc

@@ -0,0 +1 @@
+Switch to beautofulsoup4 from lxml for URL previews. Controbuted by @clokep.


Linting/CI is not passing for typechecking ❌: https://github.com/element-hq/synapse/actions/runs/20170566919/job/57904924809?pr=19301

Yes, I can't seem to get the same setup locally. Did something change with how to install the pinned packages?

poetry install --extras all is what I use.

Things have changed behind the scenes but shouldn't affect how you install as a developer:

Switch the build backend from poetry-core to maturin #19234

Update pyproject.toml to be compatible with other standard Python packaging tools #19137

That's what I did. Maybe I'll try creating a new virtualenv. 🤔

MadLittleMods · 2025-12-29T16:41:40Z

docs/setup/installation.md

-This also requires the optional `lxml` python dependency to be  installed. This
-in turn requires the `libxml2` library to be available - on  Debian/Ubuntu this
-means `apt-get install libxml2-dev`, or equivalent for your OS.


As a note, we can remove libxml2 from the flake.nix as well. Something for a nix user to do though ⏩

synapse/flake.nix

Line 103 in f79acff

libxml2

MadLittleMods · 2025-12-29T16:42:33Z

docs/setup/installation.md

-This also requires the optional `lxml` python dependency to be  installed. This
-in turn requires the `libxml2` library to be available - on  Debian/Ubuntu this
-means `apt-get install libxml2-dev`, or equivalent for your OS.


We can remove the libxml2-dev dependency in CI

synapse/.github/workflows/tests.yml

Lines 443 to 448 in f79acff

    # There aren't wheels for some of the older deps, so we need to install
    # their build dependencies
    - run: |
        sudo apt-get -qq update
        sudo apt-get -qq install build-essential libffi-dev python3-dev \
          libxml2-dev libxslt-dev xmlsec1 zlib1g-dev libjpeg-dev libwebp-dev

synapse/.github/workflows/tests.yml

Line 500 in f79acff

- run: sudo apt-get -qq install xmlsec1 libxml2-dev libxslt-dev

MadLittleMods · 2025-12-29T16:43:39Z

docs/setup/installation.md

-This also requires the optional `lxml` python dependency to be  installed. This
-in turn requires the `libxml2` library to be available - on  Debian/Ubuntu this
-means `apt-get install libxml2-dev`, or equivalent for your OS.


libxml2-devel is also mentioned in

synapse/docs/setup/installation.md

Lines 308 to 311 in f79acff

```sh
sudo dnf install libtiff-devel libjpeg-devel libzip-devel freetype-devel \
                 libwebp-devel libxml2-devel libxslt-devel libpq-devel \
                 python3-virtualenv libffi-devel openssl-devel python3-devel
```

MadLittleMods · 2025-12-29T17:24:58Z

synapse/media/oembed.py

-        return None
+        tag = soup.find(
+            "link",
+            rel=("alternate", "alternative"),


Where can I find this syntax?

I've looked through https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
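For reference, the closest documented form appears to be passing a list as an attribute filter, which matches if any item in the list matches. The sketch below uses a list; that the tuple in the diff behaves the same way is an assumption, not something the linked docs state. The HTML and URL are made up.

```python
from bs4 import BeautifulSoup

html = (
    "<head>"
    '<link rel="stylesheet" href="/style.css">'
    '<link rel="alternate" type="application/json+oembed" '
    'href="https://example.com/oembed?format=json">'
    "</head>"
)
soup = BeautifulSoup(html, "html.parser")

# A list filter matches when the attribute equals any of the items.
# `rel` is multi-valued, so each of its tokens is compared separately.
tag = soup.find("link", rel=["alternate", "alternative"])
oembed_href = tag["href"] if tag else None
print(oembed_href)
```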

MadLittleMods · 2025-12-29T17:29:16Z

synapse/media/preview_html.py

-            ),
+        # Check microdata for an image.
+        meta_image = soup.find(
+            "meta", itemprop=re.compile("image", re.I), content=NON_BLANK


Why itemprop?

Seems like we could do the normal image=re.I

itemprop is the key, image is the value.

So it's more obvious, can you share some example HTML that we're parsing?
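A minimal sketch of the kind of microdata markup this lookup targets, using made-up HTML. It assumes `NON_BLANK` is a regex requiring at least one non-whitespace character; its actual definition is not shown in this thread.

```python
import re

from bs4 import BeautifulSoup

# Assumption: stand-in for the PR's NON_BLANK, whose definition is not shown here.
NON_BLANK = re.compile(r"\S")

html = """
<head>
  <meta itemprop="name" content="Example page">
  <meta itemprop="image" content="">
  <meta itemprop="Image" content="https://example.com/cover.png">
</head>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword arguments filter on attributes: `itemprop` is the attribute name,
# and the regex is tested against its value ("image", case-insensitively).
meta_image = soup.find("meta", itemprop=re.compile("image", re.I), content=NON_BLANK)
image_url = meta_image["content"] if meta_image else None
print(image_url)
```

The second `<meta>` is skipped because its `content` is empty, and the first because its `itemprop` value does not contain "image".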

MadLittleMods · 2025-12-29T17:30:30Z

synapse/media/preview_html.py

+        title = soup.find(("title", "h1", "h2", "h3"), string=True)  # type: ignore[call-overload]
+        if title and title.string:
+            og["og:title"] = title.string.strip()


I assume we have tests to ensure this does the correct thing? string=True -> title.string and we end up with the title/heading content
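To make the expected behaviour concrete, here is a small sketch with made-up HTML. It assumes that combining a tag-name filter with `string=True` only matches tags whose `.string` is not `None`, so an empty `<title>` is skipped; a list is used for the names instead of the tuple in the diff.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title></title></head>"
    "<body><h1> Hello world </h1><h2>Subtitle</h2></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# The empty <title> has no string, so the first match is the <h1>.
title = soup.find(["title", "h1", "h2", "h3"], string=True)
og = {}
if title and title.string:
    og["og:title"] = title.string.strip()
print(og)
```

A test along these lines would confirm that the heading text, not the empty title, ends up in `og:title`.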

MadLittleMods · 2025-12-29T17:31:51Z

synapse/media/preview_html.py


-    if tree is None:
-        return
+    from bs4.element import NavigableString, Tag


Special reason for organizing the imports here?

Can we do it at the top like normal?

MadLittleMods · 2025-12-29T17:35:15Z

synapse/media/preview_html.py

-                if len(elements) > stack_limit:
-                    # We've hit our limit for working memory
-                    break


Why don't we care about this in the new implementation?

clokep added 3 commits December 10, 2025 11:48

Use BeautifulSoup instead of LXML directly.

6dec726

Dont use lxml

8e9e333

Update docs

a24d251

clokep requested a review from a team as a code owner December 11, 2025 13:42

clokep and others added 4 commits December 11, 2025 08:44

Create 19301.misc

d5332b0

Merge remote-tracking branch 'upstream/develop' into bs4

18d2746

Lint fixes

3813537

Fix-up references

5940217

MadLittleMods added the A-URL-Preview label Dec 29, 2025

MadLittleMods changed the title ~~Use beautiulsoup4 instead of lxml for URL previews~~ Use beautifulsoup4 instead of lxml for URL previews Dec 29, 2025

MadLittleMods reviewed Dec 29, 2025

