George V. Reilly

Regex Conjunctions

Most regular expression engines make it easy to match al­ter­na­tions (or dis­junc­tions) with the | operator: to match either foo or bar, use foo|bar.

Few regex engines have any provisions for con­junc­tions, and the syntax is often horrible. Awk makes it easy to match /pat1/ && /pat2/ && /pat3/.

$ cat <<EOF | awk '/bar/ && /foo/'
> foo bar
> bar
> barfy food
> barfly
> EOF
foo bar
barfy food

In the case of a Unix pipeline, the con­junc­tion could also be expressed as a series of pipes: ... | grep pat1 | grep pat2 | grep pat3 | ....

The longest regex that I ever en­coun­tered was an enormous al­ter­na­tion—a true horror that shouldn't have continue.

Python Enums with Attributes

Python enu­mer­a­tions are useful for grouping related constants in a namespace. You can add additional behaviors to an enum class, but there isn't an easy and obvious way to add attributes to enum members.

class TileState(Enum):
    CORRECT = 1
    PRESENT = 2
    ABSENT  = 3

    def color(self):
        if self is self.CORRECT:
            return "Green"
        elif self is self.PRESENT:
          
continue.

fsymbols for Unicode weirdness

My display name on Twitter currently looks like @ɢᴇᴏʀɢᴇᴠʀᴇɪʟʟʏ@ᴛᴇᴄʜ.ʟɢʙᴛ, an attempt to route around Twitter's apparent censorship of Mastodon in­for­ma­tion.

I used the FSymbols Generators to produce several variants.

@𝕘𝕖𝕠𝕣𝕘𝕖𝕧𝕣𝕖𝕚𝕝𝕝𝕪@𝕥𝕖𝕔𝕙.𝕝𝕘𝕓𝕥
ʇqƃʅ.ɥɔǝʇ@ʎʅʅᴉǝɹʌǝƃɹoǝƃ@
@𝗀𝖾𝗈𝗋𝗀𝖾𝗏𝗋𝖾𝗂𝗅𝗅𝗒@𝗍𝖾𝖼𝗁.𝗅𝗀𝖻𝗍
@𝘨𝘦𝘰𝘳𝘨𝘦𝘷𝘳𝘦𝘪𝘭𝘭𝘺@𝘵𝘦𝘤𝘩.𝘭𝘨𝘣𝘵
@𝑔𝑒𝑜𝑟𝑔𝑒𝑣𝑟𝑒𝑖𝑙𝑙𝑦@𝑡𝑒𝑐ℎ.𝑙𝑔𝑏𝑡
@𝙜𝙚𝙤𝙧𝙜𝙚𝙫𝙧𝙚𝙞𝙡𝙡𝙮@𝙩𝙚𝙘𝙝.𝙡𝙜𝙗𝙩
@𝚐𝚎𝚘𝚛𝚐𝚎𝚟𝚛𝚎𝚒𝚕𝚕𝚢@𝚝𝚎𝚌𝚑.𝚕𝚐𝚋𝚝
@𝔤𝔢𝔬𝔯𝔤𝔢𝔳𝔯𝔢𝔦𝔩𝔩𝔶@𝔱𝔢𝔠𝔥.𝔩𝔤𝔟𝔱

Many of these variants come from Unicode Block "Math­e­mat­i­cal Al­phanu­mer­ic Symbols".

There are a lot more things you can do with Unicode than just upside-down text.

Backwards Ranges in Python

In Python, if you want to specify a sequence of numbers from a up to (but excluding) b, you can write range(a, b). This generates the sequence a, a+1, a+2, ..., b-1. You start at a and keep going until the next number would be b.

In Python 3, range is lazy and the values in the sequence do not ma­te­ri­al­ize until you consume the range.

>>> range(3,12)
range(3, 12)
>>> list(range(3,12))
[3, 4, 5, 6, 7, 8, 9, 10, 11]

Trey Hunner makes the point that range is a lazy iterable rather than an iterator.

You can also step by an increment other than one: range(a, b, s). This generates a, a+s, a+2*s, ..., b-s (assuming that continue.

Diffing a fragment of a file

A while back, I had extracted some code out of a large file into a separate file and made some mod­i­fi­ca­tions. I wanted to check that the dif­fer­ences were minimal. Let's say that the extracted code had been between lines 123 and 456 of large_old_­file.

diff -u <(sed -n '123,456p;457q' large_old_file) new_file

What's happening here?

A similar example: Diff continue.

Accidentally Quadratic: Python List Membership

We had a per­for­mance regression in a test suite recently when the median test time jumped by two minutes.

We tracked it down to this (simplified) code fragment:

task_inclusions = [ some_collection_of_tasks() ]
invalid_tasks = [t.task_id() for t in airflow_tasks
                 if t.task_id() not in task_inclusions]

This looks fairly in­nocu­ous—and it was—until the size of the result returned from some_­col­lec­tion_of_­tasks() jumped from a few hundred to a few thousand.

The in comparison operator con­ve­nient­ly works with all of Python's standard sequences and col­lec­tions, but its efficiency varies. For a list and other sequences, in must search continue.

Now You Have 32 Problems

Some people, when confronted with a problem, think “I know, I'll use regular ex­pres­sions.” Now they have two problems.

— Jaime Zawinksi

A Twitter thread about very long regexes reminded me of the longest regex that I ever ran afoul of, a par­tic­u­lar­ly horrible multilevel mess that had worked acceptably on the 32-bit .NET CLR, but brought the 64-bit CLR to its knees.

Whenever I ran our ASP.NET web ap­pli­ca­tion [on Win64], it would go berserk, eat up all 4GB of my physical RAM, push the working set of IIS's w3wp.exe to 12GB, and max out one of my 4 cores! The only way to maintain any sanity was to run iisreset every 20 minutes to continue.

Git Diff Tips

The Git Diff utility is much more functional than the standard command-line diff.

To see changes relative to the staging area (aka the index), use git diff.

To see staged changes, use git diff --staged (or --cached).

To see changes side by side on a line (where it makes sense), use the --color-word option.

To compare two arbitrary files in the file system, use git diff --no-index.

To try some other diff algorithms, use the --patience, --histogram, or --minimal options. The default diff algorithm is --myers.

Lots more at the docs.

OrderedDict Initialization

An Or­dered­Dict is a Python dict which remembers insertion order. When iterating over an Or­dered­Dict, items are returned in that order. Ordinary dicts return their items in an un­spec­i­fied order.

Ironically, most of the ways of con­struct­ing an ini­tial­ized Or­dered­Dict end up breaking the ordering in Python 2.x and in Python 3.5 and below. Specif­i­cal­ly, using keyword arguments or passing a dict (mapping) will not retain the insertion order of the source code.

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
>>> from collections import OrderedDict

>>> odict = OrderedDict()
>>> odict['one'] = 1
>>> odict['two'] = 2
>>> odict['three'] = 3
>>> odict['four'] = 4
>>> odict['five'] = 5
>>> odict.items()
[('one', 1), ('two', 2), ('three', 
continue.

HTML5 tables require tr inside thead

When I learned HTML tables back in the 90s, at some point I discovered the <thead> element for grouping the <th> column headers. What I missed was there should be a <tr> element between the two. In other words, a well-formed HTML table with a header looks like this:

<table>
    <thead>
        <tr>
            <th>Name</th>
            <th>Value</th>
            <th>Date</th>
        
continue.
Previous »