2023-09-05

Regex Conjunctions

Most regular expression engines make it easy to match alternations (or disjunctions) with the | operator: to match either foo or bar, use foo|bar.

Few regex engines have any provision for conjunctions, and where they do, the syntax is often horrible. Awk, by contrast, makes it easy: /pat1/ && /pat2/ && /pat3/ matches only lines that match all three patterns.

$ cat <<EOF | awk '/bar/ && /foo/'
> foo bar
> bar
> barfy food
> barfly
> EOF
foo bar
barfy food

In the case of a Unix pipeline, the conjunction could also be expressed as a series of pipes: ... | grep pat1 | grep pat2 | grep pat3 | ....
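The same conjunction can be sketched in Python, reusing the sample lines from the Awk example. This is a minimal illustration, not a recommendation from the original post: the names lines, patterns, and conj are made up for the example. The first form tests each pattern separately; the second packs the conjunction into a single regex with lookaheads, which shows how awkward the syntax gets.

```python
import re

lines = ["foo bar", "bar", "barfy food", "barfly"]

# Form 1: a line matches only if every pattern is found somewhere in it.
patterns = [re.compile(p) for p in ("bar", "foo")]
matches = [line for line in lines if all(p.search(line) for p in patterns)]

# Form 2: one regex, with a lookahead per pattern anchored at the start.
conj = re.compile(r"^(?=.*bar)(?=.*foo)")
lookahead_matches = [line for line in lines if conj.search(line)]

print(matches)
print(lookahead_matches)
```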

The longest regex that I ever encountered was an enormous alternation—a true horror that shouldn't have …continue.

2023-09-02

Python Enums with Attributes

Python enumerations are useful for grouping related constants in a namespace. You can add additional behaviors to an enum class, but there isn't an easy and obvious way to add attributes to enum members.

from enum import Enum

class TileState(Enum):
    CORRECT = 1
    PRESENT = 2
    ABSENT  = 3

    def color(self):
        if self is self.CORRECT:
            return "Green"
        elif self is self.PRESENT:

…continue.
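
One approach documented in the standard library's enum module, and not necessarily the one the full post settles on, is to give each member a tuple value and unpack it in __init__. A minimal sketch follows; the colors for PRESENT and ABSENT and the attribute name number are assumptions made for illustration.

```python
from enum import Enum


class TileState(Enum):
    # Each member's value is a tuple; __init__ receives its elements.
    CORRECT = (1, "Green")
    PRESENT = (2, "Yellow")  # assumed color, not from the excerpt
    ABSENT = (3, "Grey")     # assumed color, not from the excerpt

    def __init__(self, number, color):
        self.number = number
        self.color = color


print(TileState.CORRECT.color)
print(TileState.ABSENT.number)
```

Note that with this pattern, each member's value is the whole tuple rather than the bare number.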

2022-12-31

fsymbols for Unicode weirdness

My display name on Twitter currently looks like @ɢᴇᴏʀɢᴇᴠʀᴇɪʟʟʏ@ᴛᴇᴄʜ.ʟɢʙᴛ, an attempt to route around Twitter's apparent censorship of Mastodon information.

I used the FSymbols Generators to produce several variants.

@𝕘𝕖𝕠𝕣𝕘𝕖𝕧𝕣𝕖𝕚𝕝𝕝𝕪@𝕥𝕖𝕔𝕙.𝕝𝕘𝕓𝕥
ʇqƃʅ.ɥɔǝʇ@ʎʅʅᴉǝɹʌǝƃɹoǝƃ@
@𝗀𝖾𝗈𝗋𝗀𝖾𝗏𝗋𝖾𝗂𝗅𝗅𝗒@𝗍𝖾𝖼𝗁.𝗅𝗀𝖻𝗍
@𝘨𝘦𝘰𝘳𝘨𝘦𝘷𝘳𝘦𝘪𝘭𝘭𝘺@𝘵𝘦𝘤𝘩.𝘭𝘨𝘣𝘵
@𝑔𝑒𝑜𝑟𝑔𝑒𝑣𝑟𝑒𝑖𝑙𝑙𝑦@𝑡𝑒𝑐ℎ.𝑙𝑔𝑏𝑡
@𝙜𝙚𝙤𝙧𝙜𝙚𝙫𝙧𝙚𝙞𝙡𝙡𝙮@𝙩𝙚𝙘𝙝.𝙡𝙜𝙗𝙩
@𝚐𝚎𝚘𝚛𝚐𝚎𝚟𝚛𝚎𝚒𝚕𝚕𝚢@𝚝𝚎𝚌𝚑.𝚕𝚐𝚋𝚝
@𝔤𝔢𝔬𝔯𝔤𝔢𝔳𝔯𝔢𝔦𝔩𝔩𝔶@𝔱𝔢𝔠𝔥.𝔩𝔤𝔟𝔱

Many of these variants come from Unicode Block "Mathematical Alphanumeric Symbols".
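
Python's unicodedata module can confirm where such characters come from. As a small sketch using the double-struck variant above (the variable names are illustrative), it looks up a character's official name and then folds the string back to plain ASCII with NFKC compatibility normalization.

```python
import unicodedata

fancy = "@𝕘𝕖𝕠𝕣𝕘𝕖𝕧𝕣𝕖𝕚𝕝𝕝𝕪@𝕥𝕖𝕔𝕙.𝕝𝕘𝕓𝕥"

# The official Unicode name of the first letter after the "@".
first_letter_name = unicodedata.name(fancy[1])

# NFKC replaces compatibility characters with their plain equivalents.
plain = unicodedata.normalize("NFKC", fancy)

print(first_letter_name)
print(plain)
```

Normalization of this kind will not undo every variant; the upside-down text, for example, is built from unrelated letters that have no compatibility mapping.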

There are a lot more things you can do with Unicode than just upside-down text.

2022-12-19

Backwards Ranges in Python

In Python, if you want to specify a sequence of numbers from a up to (but excluding) b, you can write range(a, b). This generates the sequence a, a+1, a+2, ..., b-1. You start at a and keep going until the next number would be b.

In Python 3, range is lazy and the values in the sequence do not materialize until you consume the range.

>>> range(3,12)
range(3, 12)
>>> list(range(3,12))
[3, 4, 5, 6, 7, 8, 9, 10, 11]

Trey Hunner makes the point that range is a lazy iterable rather than an iterator.
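
A short sketch of what that distinction means in practice (the variable names are illustrative): a range behaves like a sequence and can be iterated repeatedly, but it cannot be passed to next() directly.

```python
r = range(3, 12)

# A range knows its length and supports indexing and membership tests
# without ever building a list.
length = len(r)
first, last = r[0], r[-1]
has_seven = 7 in r

# It is an iterable, not an iterator: each iter() call creates a fresh
# iterator, so the same range can be looped over more than once.
first_pass = list(r)
second_pass = list(r)

it = iter(r)
first_from_iterator = next(it)

# Calling next() on the range itself fails, because it is not an iterator.
try:
    next(r)
    range_is_iterator = True
except TypeError:
    range_is_iterator = False
```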

You can also step by an increment other than one: range(a, b, s). This generates a, a+s, a+2*s, ..., b-s (assuming that …continue.

2022-01-31

Diffing a fragment of a file

A while back, I had extracted some code out of a large file into a separate file and made some modifications. I wanted to check that the differences were minimal. Let's say that the extracted code had been between lines 123 and 456 of large_old_file.

diff -u <(sed -n '123,456p;457q' large_old_file) new_file

What's happening here?

sed -n '123,456p' is printing lines 123–456 of large_old_file.
The 457q tells sed to abandon the file at line 457. Otherwise, it will keep reading all the way to the end.
The <(sed ...) is an example of process substitution. The output of the sed invocation becomes the first input of the diff command.
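
For comparison, the same idea can be sketched in Python with itertools.islice and difflib. This is an illustration only: the helper names fragment and diff_fragment are hypothetical, and the functions work on any iterable of lines rather than on the files named above. Like the 457q in the sed command, islice stops reading once the last wanted line has been produced.

```python
import difflib
from itertools import islice


def fragment(lines, first, last):
    """Return lines first..last (1-indexed, inclusive) without reading further."""
    return list(islice(lines, first - 1, last))


def diff_fragment(old_lines, new_lines, first, last):
    """Unified diff of a slice of old_lines against all of new_lines."""
    return list(
        difflib.unified_diff(
            fragment(old_lines, first, last),
            list(new_lines),
            fromfile="old_fragment",
            tofile="new_file",
            lineterm="",
        )
    )


old = [f"line {i}" for i in range(1, 11)]
new = ["line 3", "line 4 changed", "line 5"]
for row in diff_fragment(old, new, 3, 5):
    print(row)
```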

A similar example: Diff …continue.

2021-10-04

Accidentally Quadratic: Python List Membership

We had a performance regression in a test suite recently when the median test time jumped by two minutes.

We tracked it down to this (simplified) code fragment:

task_inclusions = [ some_collection_of_tasks() ]
invalid_tasks = [t.task_id() for t in airflow_tasks
                 if t.task_id() not in task_inclusions]

This looks fairly innocuous—and it was—until the size of the result returned from some_collection_of_tasks() jumped from a few hundred to a few thousand.
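
The cost of the membership test in the fragment above can be sketched as follows. This is a simplified stand-in, not the original code: the function names are hypothetical and plain strings replace the Airflow task objects. The usual remedy, assuming the IDs are hashable, is to build a set once and test membership against that.

```python
def invalid_ids_list(task_ids, inclusions):
    # "not in" on a list scans it element by element, so the total work
    # grows with len(task_ids) * len(inclusions).
    return [t for t in task_ids if t not in inclusions]


def invalid_ids_set(task_ids, inclusions):
    # Building the set costs one pass; each lookup is then constant time
    # on average.
    allowed = set(inclusions)
    return [t for t in task_ids if t not in allowed]


inclusions = [f"task_{i}" for i in range(0, 3000, 2)]
task_ids = [f"task_{i}" for i in range(3000)]

slow = invalid_ids_list(task_ids, inclusions)
fast = invalid_ids_set(task_ids, inclusions)
```

Both functions return the same result; only the amount of work differs.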

The in comparison operator conveniently works with all of Python's standard sequences and collections, but its efficiency varies. For a list and other sequences, in must search …continue.

2020-04-23

Now You Have 32 Problems

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

— Jamie Zawinski

A Twitter thread about very long regexes reminded me of the longest regex that I ever ran afoul of, a particularly horrible multilevel mess that had worked acceptably on the 32-bit .NET CLR, but brought the 64-bit CLR to its knees.

Whenever I ran our ASP.NET web application [on Win64], it would go berserk, eat up all 4GB of my physical RAM, push the working set of IIS's w3wp.exe to 12GB, and max out one of my 4 cores! The only way to maintain any sanity was to run iisreset every 20 minutes to …continue.

2017-12-26

Git Diff Tips

The Git Diff utility is much more functional than the standard command-line diff.

To see changes relative to the staging area (aka the index), use git diff.

To see staged changes, use git diff --staged (or --cached).

To see changes side by side within a line (where it makes sense), use the --color-words option.

To compare two arbitrary files in the file system, use git diff --no-index.

To try some other diff algorithms, use the --patience, --histogram, or --minimal options. The default diff algorithm is myers, which can be selected explicitly with --diff-algorithm=myers.

Lots more at the docs.

2017-02-21

OrderedDict Initialization

An OrderedDict is a Python dict which remembers insertion order. When iterating over an OrderedDict, items are returned in that order. Ordinary dicts return their items in an unspecified order.

Ironically, most of the ways of constructing an initialized OrderedDict end up breaking the ordering in Python 2.x and in Python 3.5 and below. Specifically, using keyword arguments or passing a dict (mapping) will not retain the insertion order of the source code.

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
>>> from collections import OrderedDict

>>> odict = OrderedDict()
>>> odict['one'] = 1
>>> odict['two'] = 2
>>> odict['three'] = 3
>>> odict['four'] = 4
>>> odict['five'] = 5
>>> odict.items()
[('one', 1), ('two', 2), ('three',

…continue.
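
One construction that does keep source order on every Python version is passing a sequence of (key, value) pairs, because a list is ordered before OrderedDict ever sees it. A minimal sketch, reusing the keys from the session above:

```python
from collections import OrderedDict

# A list of pairs is already ordered, so nothing is lost on the way in.
odict = OrderedDict([
    ("one", 1),
    ("two", 2),
    ("three", 3),
    ("four", 4),
    ("five", 5),
])

keys_in_order = list(odict)
print(keys_in_order)
```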

2017-02-18

HTML5 tables require tr inside thead

When I learned HTML tables back in the 90s, at some point I discovered the <thead> element for grouping the <th> column headers. What I missed was that there should be a <tr> element between the two. In other words, a well-formed HTML table with a header looks like this:

<table>
    <thead>
        <tr>
            <th>Name</th>
            <th>Value</th>
            <th>Date</th>

…continue.