George V. Reilly

URLs from Unicode Strings

[Previously published at the now defunct MetaBrite Dev Blog.]

Some time ago, we made an ill-considered decision to use recipe names for image URLs, which simplified image management with our then-rudimentary tools. For example, the recipe named "Twisted Pasta With Browned Butter, Sage, and Walnuts" becomes a URL ending in "Twisted%20Pasta%20With%20Browned%20Butter%2C%20Sage%2C%20and%20Walnuts.jpg".

Life becomes more interesting when you escape the confines of 7-bit ASCII and use Unicode. How should u"Sautéed crème fraîche Provençale" be handled? The only reasonable thing to do is to first convert the Unicode string to UTF-8 and then percent-encode those octets as hex: "Saut%C3%A9ed%20cr%C3%A8me%20fra%C3%AEche%20Proven%C3%A7ale".
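The encoding described above can be sketched in Python 3 with the standard library's urllib.parse.quote. The helper name image_url_name and the ".jpg" suffix handling are assumptions made for illustration, not the code the original system used.

```python
from urllib.parse import quote


def image_url_name(recipe_name, extension=".jpg"):
    """Percent-encode a recipe name for use as the last URL path segment.

    Hypothetical helper: the name and the extension handling are
    illustrative assumptions.
    """
    # Encode to UTF-8 first, then percent-encode every octet that is not
    # an unreserved character. safe="" also escapes "/" and ",".
    utf8_octets = recipe_name.encode("utf-8")
    return quote(utf8_octets, safe="") + extension


ascii_name = image_url_name("Twisted Pasta With Browned Butter, Sage, and Walnuts")
unicode_name = image_url_name("Sautéed crème fraîche Provençale")
print(ascii_name)
print(unicode_name)
```

Passing safe="" matters here: by default quote leaves "/" unescaped, which would be wrong for a name that must stay within a single path segment.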

That seems reasonable, but it was giving us inconsistent results when the images were uploaded to an S3 bucket.

Obfuscating Passwords in URLs in Python

[Previously published at the now defunct MetaBrite Dev Blog.]

RFC 1738 allows passwords in URLs, in the form <scheme>://<username>:<password>@<host>:<port>/<url-path>. Although passwords in URLs are deprecated by RFC 3986 and other newer RFCs, they are occasionally useful. Several important packages in the Python world allow such URLs, including SQLAlchemy ('postgresql://scott:tiger@localhost:5432/mydatabase') and Celery ('amqp://guest:guest@localhost:5672//'). It's also useful to be able to log such URLs without exposing the password.

Python 2 has urlparse.urlparse (known as urllib.parse.urlparse in Python 3 and six.moves.urllib_parse.urlparse in the Six compatibility library) to split a URL into six components: scheme, netloc, path, parameters, query, and fragment. The netloc corresponds to <user>:<password>@<host>:<port>.
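As a quick illustration of that six-way split, here is the SQLAlchemy-style URL from above run through Python 3's urllib.parse.urlparse. This is only a sketch of the standard-library behavior; the variable name parts is arbitrary.

```python
from urllib.parse import urlparse

parts = urlparse("postgresql://scott:tiger@localhost:5432/mydatabase")

# The result is a named tuple with six fields.
print(parts.scheme)    # postgresql
print(parts.netloc)    # scott:tiger@localhost:5432
print(parts.path)      # /mydatabase
print(repr(parts.params), repr(parts.query), repr(parts.fragment))
```

Note that the whole <user>:<password>@<host>:<port> portion comes back as the single netloc string.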

Unfortunately, neither Python 2's nor Python 3's urlparse properly handles the userinfo (username + optional password in the netloc).
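One possible way to mask the password before logging is to split the netloc by hand and rebuild the URL. The sketch below is an assumption about how this could be done in Python 3, not necessarily the approach the original post goes on to describe; the function name obfuscate_password and the "****" mask are hypothetical.

```python
from urllib.parse import urlparse, urlunparse


def obfuscate_password(url, mask="****"):
    """Return url with any password in the netloc replaced by mask.

    Hypothetical helper for logging; URLs without a password are
    returned unchanged.
    """
    parts = urlparse(url)
    # Split on the last "@" so the host part is never mistaken for userinfo.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    if not at:
        return url  # no userinfo at all
    # Split on the first ":" so a password containing ":" is fully masked.
    username, colon, _password = userinfo.partition(":")
    if not colon:
        return url  # username only, nothing to hide
    masked_netloc = "{}:{}@{}".format(username, mask, hostport)
    return urlunparse(parts._replace(netloc=masked_netloc))


print(obfuscate_password("postgresql://scott:tiger@localhost:5432/mydatabase"))
print(obfuscate_password("amqp://guest:guest@localhost:5672//"))
```

Using rpartition for "@" and partition for ":" is a deliberate choice in this sketch: it keeps the host intact even if the userinfo contains an unescaped "@", and it treats everything after the first colon as the password.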