-
Notifications
You must be signed in to change notification settings - Fork 434
Use beautifulsoup4 instead of lxml for URL previews
#19301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
beautifulsoup4 instead of lxml for URL previews
| @@ -0,0 +1 @@ | |||
| Switch to beautofulsoup4 from lxml for URL previews. Controbuted by @clokep. No newline at end of file | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conflicts to resolve
| @@ -0,0 +1 @@ | |||
| Switch to beautofulsoup4 from lxml for URL previews. Controbuted by @clokep. No newline at end of file | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linting/CI is not passing for typechecking ❌: https://github.com/element-hq/synapse/actions/runs/20170566919/job/57904924809?pr=19301
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can't see to get the same setup locally. Did something change with how to install the pinned packages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poetry install --extras all is what I use.
Things have changed behind the scenes but shouldn't affect how you install as a developer:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I did. Maybe I'll try creating a new virtualenv. 🤔
| This also requires the optional `lxml` python dependency to be installed. This | ||
| in turn requires the `libxml2` library to be available - on Debian/Ubuntu this | ||
| means `apt-get install libxml2-dev`, or equivalent for your OS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a note, we can remove libxml2 from the flake.nix as well. Something for a nix user to do though ⏩
Line 103 in f79acff
| libxml2 |
| This also requires the optional `lxml` python dependency to be installed. This | ||
| in turn requires the `libxml2` library to be available - on Debian/Ubuntu this | ||
| means `apt-get install libxml2-dev`, or equivalent for your OS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove the libxml2-dev dependency in CI
synapse/.github/workflows/tests.yml
Lines 443 to 448 in f79acff
| # There aren't wheels for some of the older deps, so we need to install | |
| # their build dependencies | |
| - run: | | |
| sudo apt-get -qq update | |
| sudo apt-get -qq install build-essential libffi-dev python3-dev \ | |
| libxml2-dev libxslt-dev xmlsec1 zlib1g-dev libjpeg-dev libwebp-dev |
synapse/.github/workflows/tests.yml
Line 500 in f79acff
| - run: sudo apt-get -qq install xmlsec1 libxml2-dev libxslt-dev |
| This also requires the optional `lxml` python dependency to be installed. This | ||
| in turn requires the `libxml2` library to be available - on Debian/Ubuntu this | ||
| means `apt-get install libxml2-dev`, or equivalent for your OS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libxml2-devel is also mentioned in
synapse/docs/setup/installation.md
Lines 308 to 311 in f79acff
| ```sh | |
| sudo dnf install libtiff-devel libjpeg-devel libzip-devel freetype-devel \ | |
| libwebp-devel libxml2-devel libxslt-devel libpq-devel \ | |
| python3-virtualenv libffi-devel openssl-devel python3-devel |
| return None | ||
| tag = soup.find( | ||
| "link", | ||
| rel=("alternate", "alternative"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where can I find this syntax?
I've looked through https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
| ), | ||
| # Check microdata for an image. | ||
| meta_image = soup.find( | ||
| "meta", itemprop=re.compile("image", re.I), content=NON_BLANK |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why itemprop?
Seems like we could do the normal image=re.I
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
itemprop is the key, image is the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it's more obvious, can you share some example HTML that we're parsing?
| title = soup.find(("title", "h1", "h2", "h3"), string=True) # type: ignore[call-overload] | ||
| if title and title.string: | ||
| og["og:title"] = title.string.strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we have tests to ensure this does the correct thing? string=True -> title.string and we end up with the title/heading content
|
|
||
| if tree is None: | ||
| return | ||
| from bs4.element import NavigableString, Tag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Special reason for organizing the imports here?
Can we do it at the top like normal?
| if len(elements) > stack_limit: | ||
| # We've hit our limit for working memory | ||
| break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we care about this in the new implementation?
Use
beautifulsoup4instead oflxmlfor URL previews. This offers some nicer APIs when parsing HTML and avoids usinglibxml, which is unmaintained.I haven’t done a full regression against commonly previewed sites, but I expect this will give similar (or better) results.
beautiulsoup also handles decoding the charset for us, which is less custom code.