We had a performance regression in a test suite recently
when the median test time jumped by two minutes.
We tracked it down to this (simplified) code fragment:
task_inclusions = [ some_collection_of_tasks() ]
invalid_tasks = [t.task_id() for t in airflow_tasks
if t.task_id() not in task_inclusions]
This looks fairly innocuous—and it was—until the size of the result returned from some_collection_of_tasks()
jumped from a few hundred to a few thousand.
The in comparison operator conveniently works
with all of Python's standard sequences and collections,
but its efficiency varies.
For a list and other sequences,
in must search …continue.
An OrderedDict is a Python dict which remembers insertion order.
When iterating over an OrderedDict, items are returned in that order.
Ordinary dicts return their items in an unspecified order.
Ironically, most of the ways of constructing an initialized OrderedDict
end up breaking the ordering in Python 2.x and in Python 3.5 and below.
Specifically, using keyword arguments or passing a dict (mapping)
will not retain the insertion order of the source code.
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
>>> from collections import OrderedDict
>>> odict = OrderedDict()
>>> odict['one'] = 1
>>> odict['two'] = 2
>>> odict['three'] = 3
>>> odict['four'] = 4
>>> odict['five'] = 5
>>> odict.items()
[('one', 1), ('two', 2), ('three',
…continue.
We use Alembic to perform schema migrations whenever we add (or drop) tables or columns
from our databases.
It's less well known that Alembic can also perform data migrations,
updating existing data in tables.
Here's an example adapted from a migration I put together this afternoon.
I added a non-NULL Boolean stooge column to the old_timers table,
with a default value of FALSE.
I wanted to update certain rows to have stooge=TRUE as part of the migration.
The following works with PostgreSQL.
Note the server_default=sa.false() in the declaration of the stooge column,
which is needed to initially set all instances of stooge=FALSE.
I then declare a table which has only the …continue.
I recently learned from a StackOverflow question
that the rounding behavior in Python 3.x is different from Python 2.x:
The round() function rounding strategy and return type have changed.
Exact halfway cases are now rounded to the nearest even result
instead of away from zero.
(For example, round(2.5) now returns 2 rather than 3.)
The “away from zero” rounding strategy is the one that most of us learned at school.
The “nearest even” strategy is also known as “banker’s rounding”.
There are actually five rounding strategies defined in IEEE 754:
Mode / Example Value |
+11.5 |
+12.5 |
−11.5 |
−12.5 |
to nearest, ties to even |
+12.0 |
+12.0 |
−12.0 |
−12.0 |
to nearest, ties away from zero |
+12.0 |
+13.0 |
−12.0 |
−13.0 |
toward 0 (truncation) |
+11.0 |
+12.0 |
−11.0 |
−12.0 |
toward +∞ (ceiling) |
+12.0 |
+13.0 |
−11.0 |
−12.0 |
toward −∞ (floor) |
+11.0 |
+12.0 |
−12.0 |
−13.0 |
Further …continue.
I ran a script this afternoon that died mysteriously without any output.
It was using SQLAlchemy to query all the rows from a large table
so that they could be transformed into JSON Lines to be loaded into Elasticsearch.
When I reran my script,
I noticed this time that something had printed Killed at the very end.
A little research convinced me that the OOM Killer was the likely assassin.
I looked in /var/log/kern.log
and I found that my process had used up almost all of the 8GB on this system
before being killed.
The query had to be the problem.
A little more research led me to augment my …continue.
On Mac/Linux, os.path.join is an alias for posixpath.join,
which always joins path segments with /.
On Windows, os.path.join is an alias for ntpath.join,
which always uses \.
When dealing with URLs, we always want forward slashes,
regardless of platform, so posixpath.join should be used to build URL paths.
Running:
from __future__ import print_function
from six.moves.urllib_parse import urljoin as abs_urljoin
from posixpath import join as path_urljoin
def urljoin(site, path):
return abs_urljoin(site, path)
def test_join(site, path):
result = urljoin(site, path)
print("'{0}' + '{1}'\n\t-> '{2}'".format(site, path, result))
return result
local_path = path_urljoin('2016', '07', '12', 'release', 'index.html')
test_join('https://www.example.com', 'foo/bar/quux.js')
test_join('https://www.example.com', local_path)
test_join('https://www.example.com/', local_path)
test_join('https://www.example.com/prefix', local_path)
Yields:
'https://www.example.com' + 'foo/bar/quux.js'
…continue.
Python 2.6 introduced the format method to strings.
In general, format is now the preferred way to build strings
instead of the old % formatting operator.
One exception is with the logging module,
where the best practice is to use %s and %d.
Why?
First, %s is the idiomatic way to use logging,
which was built years before format was introduced.
Second, if there's a literal % in the interpolated values,
logging will be unhappy,
since there won't be corresponding arguments in the call.
It won't fall over, since
“The logging package is designed to swallow exceptions which occur while logging in production.
This is so that errors which occur while handling logging …continue.
Yesterday, while at PyCon,
I whipped up a quick, brute-force answer
to the HouseCanary PyCon2016 Progamming Challenge
in a few minutes.
That was sufficient to pass the first two test cases
and win me a very pretty HouseCanary t-shirt.
The answer ran in O(n⁴) time, so it failed miserably on the larger problem sets
in the third and fourth cases.
I mulled it over and came up with an O(n²) solution that runs in reasonable time
on the larger problem sets.
On the second test case, input1.txt, runtime drops from 5.2s to 0.2s.
I submitted my new answer.
I'll learn on Monday if I won the speed challenge.
I made a number of updates to the FlyingCloud Documentation tonight.
I hope to give a lightning talk about FlyingCloud
at PyCon on Monday evening or Tuesday morning,
and I put together some slides for that too.
I like to use io.StringIO rather than the older cStringIO.StringIO,
as it's Python 3–ready
io.StringIO is also a context manager:
if you use it in a with statement,
the string buffer is automatically closed as you go out of scope.
I tried using io.StringIO with unicodecsv,
as I wanted to capture the CSV output into a string buffer
for use with unit tests.
unicodecsv is a drop-in replacement for Python's built-in csv module,
which supports Unicode strings.
with io.StringIO() as csv_file:
write_csv_rows(csv_file)
lines = csv_file.getvalue().split('\r\n')
return lines[:-1] # drop empty line after trailing \r\n
It failed horribly with
TypeError: unicode argument expected, …continue.
Previous »