Peterbe.com

A blog and website by Peter Bengtsson

Filtered home page!
Currently only showing blog entries under the category: Python. Clear filter

fastest way to turn HTML into text in Python

08 January 2021 2 comments   Python


tl;dr; selectolax is best for stripping HTML down to plain text.

The problem is that I have 10,000+ HTML snippets that I need to index into Elasticsearch as plain text. (Before you ask, yes I know Elasticsearch has a html_strip text filter but it's not what I want/need to use in this context).
Turns out, stripping the HTML into plain text was actually quite expensive at that scale. So what's the most performant way?

PyQuery

from pyquery import PyQuery as pq

text = pq(html).text()

selectolax

from selectolax.parser import HTMLParser

text = HTMLParser(html).text()

regular expression

import re

regex = re.compile(r'<.*?>')
text = clean_regex.sub('', html)

Results

I wrote a script that iterated through 10,000 files that contains HTML snippets. Note! The snippets aren't complete <html> documents (with a <head> and <body> etc) Just blobs of HTML. The average size is 10,314 bytes (5,138 bytes median).

pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms
selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms
regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms

I've run it a bunch of times. The results are pretty stable.

Point is: selectolax is ~7 times faster than PyQuery

Regex? Really?

No, I don't think I want to use that. It makes me nervous without even attempting to dig up some examples where it goes wrong. It might work just fine for the most basic blobs of HTML. Actually, if the HTML is <p>Foo &amp; Bar</p>, I expect the plain text transformation should be Foo & Bar, not Foo &amp; Bar.

More pressing, both PyQuery and selectolax supports something very specific but important to my use case. I need to remove certain tags (and its content) before I proceed. For example:

<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<div style="display: none">This should also get stripped.</div>

That can never be done with a regex.

Version 2.0

So my requirement will probably change but basically, I want to delete certain tags. E.g. <div class="warning"> and <div class="hidden"> and <div style="display: none">. So let's implement that:

PyQuery

from pyquery import PyQuery as pq

_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
doc.remove('div.warning, div.hidden')
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()

selectolax

from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()

This actually works. When I now run the same benchmark for 10,000 of these are the new results:

pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms
selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms
regex
  Skip

Again, selectolax beats PyQuery by a factor of ~6.

Conclusion

Regular expressions are fast but weak in power. Makes sense.

This selectolax is very impressive.
I got the inspiration from this blog post which sets out to do something very similar to what I'm doing.

I hope this helps someone. Thank you Artem Golubin of selectolax and @lexborisov for Modest which selectolax is built upon.

Gcm - git checkout master or main

21 December 2020 1 comment   Python


I love git on the command line and I actually never use a GUI to navigate git branches. But sometimes, I need scripting to make abstractions that make life more convenient. What often happens is that I need to go back to the "main" branch. I write main in quotation marks because it's not always called main. Sometimes it's called master. And it's tedious to have to remember which one is the default.

So I wrote a script called Gcm:

#!/usr/bin/env python3
import subprocess


def run(*args):
    default_branch = get_default_branch()
    current_branch = get_current_branch()
    if default_branch != current_branch:
        checkout_branch(default_branch)
    else:
        print(f"Already on {default_branch}")
        return 1


def checkout_branch(branch_name):
    subprocess.run(f"git checkout {branch_name}".split())


def get_default_branch():
    res = subprocess.run(
        "git config --worktree --get git-checkout-default.default-branch".split(),
        check=False,
        capture_output=True,
    )
    if res.returncode == 0:
        return res.stdout.decode("utf-8").strip()

    origin_name = "origin"
    res = subprocess.run(
        f"git remote show {origin_name}".split(),
        check=True,
        capture_output=True,
    )
    for line in res.stdout.decode("utf-8").splitlines():
        if line.strip().startswith("HEAD branch:"):
            default_branch = line.replace("HEAD branch:", "").strip()
            subprocess.run(
                f"git config --worktree git-checkout-default.default-branch "
                f"{default_branch}".split(),
                check=True,
            )
            return default_branch

    raise NotImplementedError(f"No remote called {origin_name!r}")


def get_current_branch():
    res = subprocess.run("git branch --show-current".split(), capture_output=True)
    for line in res.stdout.decode("utf-8").splitlines():
        return line.strip()

    res = subprocess.run("git show -s --pretty=%d HEAD".split(), capture_output=True)
    for line in res.stdout.decode("utf-8").splitlines():
        return line.strip()

    raise NotImplementedError("Don't know what to do!")


if __name__ == "__main__":
    import sys

    sys.exit(run(*sys.argv[1:]))

It ain't pretty or a spiffy one-liner, but it works. It assumes that the repo has a remote called origin which doesn't matter if it's the upstream or your fork. Put this script into a file called ~/bin/Gcm and run chmox +x ~/bin/Gcm.

Now, whenever I want to go back to the main branch I type Gcm and it takes me there.

Gcm in action

It might seem silly, and it might not be for you, but I love it and use it many times per day. Perhaps by sharing this tip, it'll inspire someone else to set up something similar for themselves.

Why it's spelled with an uppercase G

I have a pattern (or rule?) that all scripts that I write myself are always capitalized like that. It avoids clashes with stuff I install with brew or other bash/zsh aliases.

For example:

ls -l ~/bin/RemoteVSCodePeterbecom.sh
ls -l ~/bin/Cleanupfiles
ls -l ~/bin/RandomString.py

UPDATE: July 2021

Jason @silverjam Mobarak on Twitter suggested an optimization. I've incorporated his suggestion into the original blog post above. See this twitter thread.
Essentially, it now uses git config --worktree to set and get the outcome of the default branch name so the next time it's needed it does not depend on network.
But watch out, if you, like me, change the default branch of some existing repos from master to main; then you'll need to run: git config --worktree --unset git-checkout-default.default-branch.

Generating random avatar images in Django/Python

28 October 2020 1 comment   Web development, Django, Python


tl;dr; <img src="/avatar.random.png" alt="Random avataaar"> generates this image:

Random avataaar
(try reloading to get a random new one. funny aren't they?)

When you use Gravatar you can convert people's email addresses to their mugshot.
It works like this:

<img src="https://www.gravatar.com/avatar/$(md5(user.email))">

But most people don't have their mugshot on Gravatar.com unfortunately. But you still want to display an avatar that is distinct per user. Your best option is to generate one and just use the user's name or email as a seed (so it's always random but always deterministic for the same user). And you can also supply a fallback image to Gravatar that they use if the email doesn't match any email they have. That's where this blog post comes in.

I needed that so I shopped around and found avataaars generator which is available as a React component. But I need it to be server-side and in Python. And thankfully there's a great port called: py-avataaars.

It depends on CairoSVG to convert an SVG to a PNG but it's easy to install. Anyway, here's my hack to generate random "avataaars" from Django:

import io
import random

import py_avataaars
from django import http
from django.utils.cache import add_never_cache_headers, patch_cache_control


def avatar_image(request, seed=None):
    if not seed:
        seed = request.GET.get("seed") or "random"

    if seed != "random":
        random.seed(seed)

    bytes = io.BytesIO()

    def r(enum_):
        return random.choice(list(enum_))

    avatar = py_avataaars.PyAvataaar(
        style=py_avataaars.AvatarStyle.CIRCLE,
        # style=py_avataaars.AvatarStyle.TRANSPARENT,
        skin_color=r(py_avataaars.SkinColor),
        hair_color=r(py_avataaars.HairColor),
        facial_hair_type=r(py_avataaars.FacialHairType),
        facial_hair_color=r(py_avataaars.FacialHairColor),
        top_type=r(py_avataaars.TopType),
        hat_color=r(py_avataaars.ClotheColor),
        mouth_type=r(py_avataaars.MouthType),
        eye_type=r(py_avataaars.EyesType),
        eyebrow_type=r(py_avataaars.EyebrowType),
        nose_type=r(py_avataaars.NoseType),
        accessories_type=r(py_avataaars.AccessoriesType),
        clothe_type=r(py_avataaars.ClotheType),
        clothe_color=r(py_avataaars.ClotheColor),
        clothe_graphic_type=r(py_avataaars.ClotheGraphicType),
    )
    avatar.render_png_file(bytes)

    response = http.HttpResponse(bytes.getvalue())
    response["content-type"] = "image/png"
    if seed == "random":
        add_never_cache_headers(response)
    else:
        patch_cache_control(response, max_age=60, public=True)

    return response

It's not perfect but it works. The URL to this endpoint is /avatar.<seed>.png and if you make the seed parameter random the response is always different.

To make the image not random, you replace the <seed> with any string. For example (use your imagination):

{% for comment in comments %}
  <img src="/avatar.{{ comment.user.id }}.png" alt="{{ comment.user.name }}">
  <blockquote>{{ comment.text }}</blockquote>
  <i>{{ comment.date }}</i>
{% endfor %}

I've put together this test page if you want to see more funny avatar combinations instead of doing work :)

hashin 0.15.0 now copes nicely with under_scores

15 June 2020 0 comments   Python

https://github.com/peterbe/hashin/pull/119


tl;dr hashin 0.15.0 makes package comparison agnostic to underscore or hyphens

See issue #116 for a fuller story. Basically, now it doesn't matter if you write...

hashin python_memcached

...or...

hashin python-memcached

And the same can be said about the contents of your requirements.txt file. Suppose it already had something like this:

python_memcached==1.59 \
    --hash=sha256:4dac64916871bd35502 \
    --hash=sha256:a2e28637be13ee0bf1a8

and you type hashin python-memcached it will do the version comparison on these independent of the underscore or hyphen.

Thank @caphrim007 who implemented this for the benefit of Renovate.

./bin/huey-isnt-running.sh - A bash script to prevent lurking ghosts

10 June 2020 0 comments   Python, Linux, Bash


tl;dr; Here's a useful bash script to avoid starting something when its already running as a ghost process.

Huey is a great little Python library for doing background tasks. It's like Celery but much lighter, faster, and easier to understand.

What cost me almost an hour of hair-tearing debugging today was that I didn't realize that a huey daemon process had gotten stuck in the background with code that wasn't updating as I made changes to the tasks.py file in my project. I just couldn't understand what was going on.

The way I start my project is with honcho which is a Python Foreman clone. The Procfile looks something like this:

elasticsearch: cd /Users/peterbe/dev/PETERBECOM/elasticsearch-7.7.0 && ./bin/elasticsearch -q
web: ./bin/run.sh web
minimalcss: cd minimalcss && PORT=5000 yarn run start
huey: ./manage.py run_huey --flush-locks --huey-verbose
adminui: cd adminui && yarn start
pulse: cd pulse && yarn run dev

And you start that with simply typing:

honcho start

When you Ctrl-C, it kills all those processes but somehow somewhere it doesn't always kill everything. Restarting the computer isn't a fun alternative.

So, to prevent my sanity from draining I wrote this script:

#!/usr/bin/env bash
set -eo pipefail

# This is used to make sure that before you start huey, 
# there isn't already one running the background.
# It has happened that huey gets lingering stuck as a 
# ghost and it's hard to notice it sitting there 
# lurking and being weird.

bad() {
    echo "Huey is already running!"
    exit 1
}

good() {
    echo "Huey is NOT already running"
    exit 0
}

ps aux | rg huey | rg -v 'rg huey' | rg -v 'huey-isnt-running.sh' && bad || good

(If you're wondering what rg is; it's short for ripgrep)

And I change my Procfile accordingly:

-huey: ./manage.py run_huey --flush-locks --huey-verbose
+huey: ./bin/huey-isnt-running.sh && ./manage.py run_huey --flush-locks --huey-verbose

There really isn't much rocket science or brain surgery about this blog post but I hope it inspires someone who's been in similar trenches that a simple bash script can make all the difference.

Check your email addresses in Python, as a whole

22 May 2020 0 comments   Python, MDN


So recently, in MDN, we changed the setting WELCOME_EMAIL_FROM. Seems harmless right? Wrong, it failed horribly in runtime and we didn't notice until it was in production. Here's the traceback:

SMTPSenderRefused: (552, b"5.1.7 The sender's address was syntactically invalid.\n5.1.7 see : http://support.socketlabs.com/kb/84 for more information.", '=?utf-8?q?Janet?=')
(8 additional frame(s) were not displayed)
...
  File "newrelic/api/function_trace.py", line 151, in literal_wrapper
    return wrapped(*args, **kwargs)
  File "django/core/mail/message.py", line 291, in send
    return self.get_connection(fail_silently).send_messages([self])
  File "django/core/mail/backends/smtp.py", line 110, in send_messages
    sent = self._send(message)
  File "django/core/mail/backends/smtp.py", line 126, in _send
    self.connection.sendmail(from_email, recipients, message.as_bytes(linesep='\r\n'))
  File "python3.8/smtplib.py", line 871, in sendmail
    raise SMTPSenderRefused(code, resp, from_addr)

SMTPSenderRefused: (552, b"5.1.7 The sender's address was syntactically invalid.\n5.1.7 see : http://support.socketlabs.com/kb/84 for more information.", '=?utf-8?q?Janet?=')

Yikes!

So, to prevent this from happening every again we're putting this check in:

from email.utils import parseaddr

WELCOME_EMAIL_FROM = config("WELCOME_EMAIL_FROM", ...)

# If this fails, SMTP will probably also fail.
assert parseaddr(WELCOME_EMAIL_FROM)[1].count('@') == 1, parseaddr(WELCOME_EMAIL_FROM)

You could go to town even more on this. Perhaps use the email validator within django but for now I'd call that overkill. This is just a decent check before anything gets a chance to go wrong.