# Accidentally Quadratic: Python List Membership

We had a performance regression in a test suite recently when the median test time jumped by two minutes.

We tracked it down to this (simplified) code fragment:

task_inclusions = [ some_collection_of_tasks() ] invalid_tasks = [t.task_id() for t in airflow_tasks if t.task_id() not in task_inclusions]

This looks fairly innocuous—and it was—until the size of the result returned from `some_collection_of_tasks()`
jumped from a few hundred to a few thousand.

The in comparison operator conveniently works
with all of Python’s standard sequences and collections,
but its efficiency varies.
For a `list` and other sequences,
`in` must search linearly through all the elements
until it finds a matching element
*or* the list is exhausted.
In other words, `x in some_list` takes *O*(*n*) time.
For a `set` or a `dict`, however,
`x in container` takes, on average, only *O*(1) time.
See Time Complexity for more.

The `invalid_tasks` list comprehension
was explicitly looping through one list, `airflow_tasks`,
and implicitly doing a linear search through `task_inclusions`
for each value of `t`.
The nested loop was hidden
and its effect only became apparent
when `task_inclusions` grew large.

The list comprehension was actually taking *O*(*n*^{2}) time.
When *n* was comparatively small (a few hundred),
this wasn’t a problem.
When *n* grew to several thousand,
it became a big problem.

This is a classic example of an accidentally quadratic algorithm. Indeed, Nelson describes a very similar problem with Mercurial changegroups.

This performance regression was compounded because this fragment of code
was being called thousands of times—I believe once for each task—
making the overall cost cubic, *O*(*n*^{3}).

The fix here is similar:
Use a `set` instead of a `list`
and get *O*(1) membership testing.
The `invalid_tasks` list comprehension now takes
the expected *O*(*n*) time.

task_inclusions = set( some_collection_of_tasks() ) invalid_tasks = [t.task_id() for t in airflow_tasks if t.task_id() not in task_inclusions]

More at Understanding Big-O Notation and the Big-O Cheat Sheet.