George V. Reilly

Accidentally Quadratic: Python List Membership

We had a per­for­mance regression in a test suite recently when the median test time jumped by two minutes.

We tracked it down to this (simplified) code fragment:

task_inclusions = [ some_collection_of_tasks() ]
invalid_tasks = [t.task_id() for t in airflow_tasks
                 if t.task_id() not in task_inclusions]

This looks fairly in­nocu­ous—and it was—until the size of the result returned from some_­col­lec­tion_of_­tasks() jumped from a few hundred to a few thousand.

The in comparison operator con­ve­nient­ly works with all of Python's standard sequences and col­lec­tions, but its efficiency varies. For a list and other sequences, in must search continue.

OrderedDict Initialization

An Or­dered­Dict is a Python dict which remembers insertion order. When iterating over an Or­dered­Dict, items are returned in that order. Ordinary dicts return their items in an un­spec­i­fied order.

Ironically, most of the ways of con­struct­ing an ini­tial­ized Or­dered­Dict end up breaking the ordering in Python 2.x and in Python 3.5 and below. Specif­i­cal­ly, using keyword arguments or passing a dict (mapping) will not retain the insertion order of the source code.

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
>>> from collections import OrderedDict

>>> odict = OrderedDict()
>>> odict['one'] = 1
>>> odict['two'] = 2
>>> odict['three'] = 3
>>> odict['four'] = 4
>>> odict['five'] = 5
>>> odict.items()
[('one', 1), ('two', 2), ('three', 

Alembic: Data Migrations

We use Alembic to perform schema migrations whenever we add (or drop) tables or columns from our databases. It's less well known that Alembic can also perform data migrations, updating existing data in tables.

Here's an example adapted from a migration I put together this afternoon. I added a non-NULL Boolean stooge column to the old_timers table, with a default value of FALSE. I wanted to update certain rows to have stooge=TRUE as part of the migration. The following works with PostgreSQL.

Note the server_de­fault=sa.false() in the de­c­la­ra­tion of the stooge column, which is needed to initially set all instances of stooge=FALSE. I then declare a table which has only the continue.


I recently learned from a Stack­Over­flow question that the rounding behavior in Python 3.x is different from Python 2.x:

The round() function rounding strategy and return type have changed. Exact halfway cases are now rounded to the nearest even result instead of away from zero. (For example, round(2.5) now returns 2 rather than 3.)

The “away from zero” rounding strategy is the one that most of us learned at school. The “nearest even” strategy is also known as “banker’s rounding”.

There are actually five rounding strategies defined in IEEE 754:

Mode / Example Value +11.5 +12.5 −11.5 −12.5
to nearest, ties to even +12.0 +12.0 −12.0 −12.0
to nearest, ties away from zero +12.0 +13.0 −12.0 −13.0
toward 0 (truncation) +11.0 +12.0 −11.0 −12.0
toward +∞ (ceiling) +12.0 +13.0 −11.0 −12.0
toward −∞ (floor) +11.0 +12.0 −12.0 −13.0

Further continue.

SQLAlchemy got me Killed

I ran a script this afternoon that died mys­te­ri­ous­ly without any output. It was using SQLAlchemy to query all the rows from a large table so that they could be trans­formed into JSON Lines to be loaded into Elas­tic­search. When I reran my script, I noticed this time that something had printed Killed at the very end.

A little research convinced me that the OOM Killer was the likely assassin. I looked in /var/log/kern.log and I found that my process had used up almost all of the 8GB on this system before being killed.

The query had to be the problem. A little more research led me to augment my continue.

Python: Joining URLs with posixpath.join

On Mac/Linux, os.path.join is an alias for posixpath.join, which always joins path segments with /. On Windows, os.path.join is an alias for ntpath.join, which always uses \. When dealing with URLs, we always want forward slashes, regardless of platform, so posixpath.join should be used to build URL paths.


from __future__ import print_function

from six.moves.urllib_parse import urljoin as abs_urljoin
from posixpath import join as path_urljoin

def urljoin(site, path):
    return abs_urljoin(site, path)

def test_join(site, path):
    result = urljoin(site, path)
    print("'{0}' + '{1}'\n\t-> '{2}'".format(site, path, result))
    return result

local_path = path_urljoin('2016', '07', '12', 'release', 'index.html')

test_join('', 'foo/bar/quux.js')
test_join('', local_path)
test_join('', local_path)
test_join('', local_path)


'' + 'foo/bar/quux.js'

Logging in Python: Don't use new-fangled format

Python 2.6 introduced the format method to strings. In general, format is now the preferred way to build strings instead of the old % formatting operator.

One exception is with the logging module, where the best practice is to use %s and %d. Why? First, %s is the idiomatic way to use logging, which was built years before format was introduced. Second, if there's a literal % in the in­ter­po­lat­ed values, logging will be unhappy, since there won't be cor­re­spond­ing arguments in the call. It won't fall over, since “The logging package is designed to swallow exceptions which occur while logging in production. This is so that errors which occur while handling logging continue.

HouseCanary PyCon2016 Progamming Challenge

Yesterday, while at PyCon, I whipped up a quick, brute-force answer to the House­Ca­nary PyCon2016 Progamming Challenge in a few minutes. That was sufficient to pass the first two test cases and win me a very pretty House­Ca­nary t-shirt.

The answer ran in O(n⁴) time, so it failed miserably on the larger problem sets in the third and fourth cases. I mulled it over and came up with an O(n²) solution that runs in reasonable time on the larger problem sets. On the second test case, input1.txt, runtime drops from 5.2s to 0.2s.

I submitted my new answer. I'll learn on Monday if I won the speed challenge.

FlyingCloud Documentation Updates

I made a number of updates to the Fly­ing­Cloud Doc­u­men­ta­tion tonight. I hope to give a lightning talk about Fly­ing­Cloud at PyCon on Monday evening or Tuesday morning, and I put together some slides for that too.

io.StringIO and UnicodeCSV DictWriter

I like to use io.StringIO rather than the older cStringIO.StringIO, as it's Python 3–ready io.StringIO is also a context manager: if you use it in a with statement, the string buffer is au­to­mat­i­cal­ly closed as you go out of scope.

I tried using io.StringIO with unicodecsv, as I wanted to capture the CSV output into a string buffer for use with unit tests. unicodecsv is a drop-in re­place­ment for Python's built-in csv module, which supports Unicode strings.

with io.StringIO() as csv_file:
    lines = csv_file.getvalue().split('\r\n')
    return lines[:-1]  # drop empty line after trailing \r\n

It failed horribly with TypeError: unicode argument expected, continue.

Previous »