Contributor

@clokep commented Dec 11, 2025

Use beautifulsoup4 instead of lxml for URL previews. This offers some nicer APIs when parsing HTML and avoids using libxml, which is unmaintained.

I haven’t done a full regression against commonly previewed sites, but I expect this will give similar (or better) results.

beautifulsoup also handles decoding the charset for us, which means less custom code.
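
To illustrate the charset point, here is a minimal sketch (not taken from the PR) of passing raw bytes to BeautifulSoup and letting it pick up the declared encoding. The sample markup and variable names are made up for illustration, and it assumes the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical page body, encoded as windows-1251 and declaring that charset.
raw_body = (
    '<html><head><meta charset="windows-1251">'
    "<title>Привет</title></head><body></body></html>"
).encode("windows-1251")

# Passing bytes (not str) lets BeautifulSoup work out the encoding itself.
soup = BeautifulSoup(raw_body, "html.parser")

title_text = soup.title.string
detected = soup.original_encoding  # the encoding BeautifulSoup settled on
print(title_text, detected)
```

The caller never decodes the bytes by hand; whether this matches the old lxml path on every real-world page is a separate question that a regression run would have to answer.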

@clokep requested a review from a team as a code owner December 11, 2025 13:42
@MadLittleMods changed the title from "Use beautiulsoup4 instead of lxml for URL previews" to "Use beautifulsoup4 instead of lxml for URL previews" Dec 29, 2025
@@ -0,0 +1 @@
Switch to beautifulsoup4 from lxml for URL previews. Contributed by @clokep.
\ No newline at end of file
Contributor

Conflicts to resolve

Contributor Author

Yes, I can't seem to get the same setup locally. Did something change with how to install the pinned packages?

Contributor

`poetry install --extras all` is what I use.

Things have changed behind the scenes but shouldn't affect how you install as a developer:

Contributor Author

That's what I did. Maybe I'll try creating a new virtualenv. 🤔

Comment on lines -636 to -638
This also requires the optional `lxml` python dependency to be installed. This
in turn requires the `libxml2` library to be available - on Debian/Ubuntu this
means `apt-get install libxml2-dev`, or equivalent for your OS.
Contributor

As a note, we can remove libxml2 from the flake.nix as well. Something for a nix user to do though ⏩

libxml2

Comment on lines -636 to -638
This also requires the optional `lxml` python dependency to be installed. This
in turn requires the `libxml2` library to be available - on Debian/Ubuntu this
means `apt-get install libxml2-dev`, or equivalent for your OS.
Contributor

We can remove the libxml2-dev dependency in CI

# There aren't wheels for some of the older deps, so we need to install
# their build dependencies
- run: |
    sudo apt-get -qq update
    sudo apt-get -qq install build-essential libffi-dev python3-dev \
      libxml2-dev libxslt-dev xmlsec1 zlib1g-dev libjpeg-dev libwebp-dev

- run: sudo apt-get -qq install xmlsec1 libxml2-dev libxslt-dev

Comment on lines -636 to -638
This also requires the optional `lxml` python dependency to be installed. This
in turn requires the `libxml2` library to be available - on Debian/Ubuntu this
means `apt-get install libxml2-dev`, or equivalent for your OS.
Contributor

libxml2-devel is also mentioned in

```sh
sudo dnf install libtiff-devel libjpeg-devel libzip-devel freetype-devel \
  libwebp-devel libxml2-devel libxslt-devel libpq-devel \
  python3-virtualenv libffi-devel openssl-devel python3-devel
```

    return None
tag = soup.find(
    "link",
    rel=("alternate", "alternative"),
Contributor

Where can I find this syntax?

I've looked through https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
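
For reference, the call appears to combine two behaviours that, as far as I can tell, the linked page documents separately: keyword arguments filter on attributes, and a list used as a filter matches any of its items. A minimal sketch with made-up HTML follows. It uses a list rather than the tuple from the diff, since a list is the documented form.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the URLs are placeholders.
html = """
<html><head>
  <link rel="stylesheet" href="/style.css">
  <link rel="alternate" type="application/json+oembed"
        href="https://example.com/oembed?format=json">
</head></html>
"""
soup = BeautifulSoup(html, "html.parser")

# A list as the filter value means "match any of these". `rel` is a
# multi-valued attribute, so each of its values is compared separately.
tag = soup.find("link", rel=["alternate", "alternative"])
oembed_href = tag["href"] if tag is not None else None
print(oembed_href)
```

The stylesheet link is skipped because neither value appears in its `rel`.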

),
# Check microdata for an image.
meta_image = soup.find(
    "meta", itemprop=re.compile("image", re.I), content=NON_BLANK
Contributor

Why itemprop?

Seems like we could do the normal image=re.I

Contributor Author

itemprop is the key, image is the value.

Contributor

So it's more obvious, can you share some example HTML that we're parsing?
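
As an illustration of the kind of markup this lookup targets, here is a minimal sketch with made-up HTML. `NON_BLANK` is assumed to be a "contains a non-whitespace character" regex; the PR's actual definition may differ.

```python
import re

from bs4 import BeautifulSoup

# Assumed definition; the PR's actual NON_BLANK may differ.
NON_BLANK = re.compile(r"\S")

# Hypothetical microdata: one blank image entry, one usable one.
html = """
<html><head>
  <meta itemprop="image" content="">
  <meta itemprop="Image" content="https://example.com/cat.png">
</head></html>
"""
soup = BeautifulSoup(html, "html.parser")

# `itemprop` is the attribute name and the regex is matched against its
# value; `content` must contain at least one non-whitespace character.
meta_image = soup.find(
    "meta", itemprop=re.compile("image", re.I), content=NON_BLANK
)
image_url = meta_image["content"] if meta_image is not None else None
print(image_url)
```

The first `<meta>` is skipped because its `content` is empty, and the second matches despite the capitalised `Image` because of `re.I`.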

Comment on lines +204 to +206
title = soup.find(("title", "h1", "h2", "h3"), string=True)  # type: ignore[call-overload]
if title and title.string:
    og["og:title"] = title.string.strip()
Contributor

I assume we have tests to ensure this does the correct thing? string=True -> title.string and we end up with the title/heading content
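
A minimal sketch of the behaviour such a test could pin down, using made-up HTML and a list of names in place of the tuple from the diff. This is an illustration, not one of the PR's tests.

```python
from bs4 import BeautifulSoup

# Hypothetical page: empty <title>, an <h1> with mixed children, a plain <h2>.
html = """
<html><head><title></title></head>
<body>
  <h1>Hello <b>world</b></h1>
  <h2>  A plain heading  </h2>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# string=True only matches tags whose .string is not None, so the empty
# <title> and the <h1> with mixed children are both skipped.
title = soup.find(["title", "h1", "h2", "h3"], string=True)

og = {}
if title and title.string:
    og["og:title"] = title.string.strip()
print(og)
```

Note that a heading containing nested markup, like the `<h1>` here, is passed over in favour of a later plain heading, which may or may not be the desired fallback.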


if tree is None:
    return
from bs4.element import NavigableString, Tag
Contributor

Special reason for organizing the imports here?

Can we do it at the top like normal?

Comment on lines -482 to -484
if len(elements) > stack_limit:
    # We've hit our limit for working memory
    break
Contributor

Why don't we care about this in the new implementation?
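
For context on what the removed lines guard against, here is a minimal sketch of an explicit-stack tree walk with a cap on working memory. It is not the old Synapse code: the `Node` type, `iter_text` and the default limit are hypothetical, and only the `elements` and `stack_limit` names come from the diff.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML element."""

    text: str = ""
    children: List["Node"] = field(default_factory=list)


def iter_text(root: Node, stack_limit: int = 1000) -> Iterator[str]:
    """Yield text in document order, giving up if the stack grows too large."""
    elements = [root]
    while elements:
        node = elements.pop()
        if node.text:
            yield node.text
        # Push children in reverse so they are popped in document order.
        elements.extend(reversed(node.children))
        if len(elements) > stack_limit:
            # We've hit our limit for working memory
            break


tree = Node("a", [Node("b", [Node("c")]), Node("d")])
full_walk = list(iter_text(tree))

wide = Node("r", [Node("x"), Node("y"), Node("z")])
capped_walk = list(iter_text(wide, stack_limit=1))
print(full_walk, capped_walk)
```

Whether the new implementation needs an equivalent bound depends on how it traverses the tree, which this sketch does not address.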
