[Previously published at the now-defunct MetaBrite Dev Blog.]
Some time ago, we made an ill-considered decision to use recipe names for image URLs, which simplified image management with our then-rudimentary tools. For example, the recipe named "Twisted Pasta With Browned Butter, Sage, and Walnuts" becomes a URL ending in "Twisted%20Pasta%20With%20Browned%20Butter%2C%20Sage%2C%20and%20Walnuts.jpg".
Life becomes more interesting when you escape the confines of 7-bit ASCII and use Unicode. How should u"Sautéed crème fraîche Provençale" be handled? The only reasonable thing to do is to first encode the Unicode string as UTF-8 and then percent-encode those octets: "Saut%C3%A9ed%20cr%C3%A8me%20fra%C3%AEche%20Proven%C3%A7ale".
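As a rough illustration of that encoding step, here is a minimal Python 3 sketch using the standard library's `urllib.parse.quote`. The helper name `image_url_suffix` is hypothetical and is not taken from our codebase; it only shows how a recipe name could be turned into the tail of an image URL.

```python
from urllib.parse import quote


def image_url_suffix(recipe_name: str) -> str:
    """Hypothetical helper: percent-encode a recipe name for use in a URL.

    quote() encodes the string as UTF-8 by default and then
    percent-encodes every byte that is not an unreserved character.
    safe="" means that "/" is escaped as well.
    """
    return quote(recipe_name, safe="") + ".jpg"


# Plain ASCII: only the spaces and commas need escaping.
print(image_url_suffix("Twisted Pasta With Browned Butter, Sage, and Walnuts"))

# Non-ASCII: each accented letter becomes two percent-encoded UTF-8 bytes.
# Escapes are used so the composed code points are unambiguous in this source.
print(image_url_suffix("Saut\u00E9ed cr\u00E8me fra\u00EEche Proven\u00E7ale"))
```

The second call prints the same escaped name shown above, with ".jpg" appended.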
That seems reasonable, but it was giving us inconsistent results when the images were uploaded to an S3 bucket. When I investigated, I discovered that it was a problem of Unicode equivalence. There are two ways to represent, say, the e-acute character in Unicode: as a composed character, é = ‘LATIN SMALL LETTER E WITH ACUTE’ (U+00E9), or as a decomposed pair, e´ = ‘LATIN SMALL LETTER E’ (U+0065) and ‘COMBINING ACUTE ACCENT’ (U+0301). Thus the name of our fictitious recipe can be represented as u"Saut\u00E9ed cr\u00E8me fra\u00EEche Proven\u00E7ale" (composed, NFC) or u"Saute\u0301ed cre\u0300me frai\u0302che Provenc\u0327ale" (decomposed, NFD).
|Character||Composed (NFC) code point||NFC as UTF-8||Decomposed (NFD) code points||NFD as UTF-8|
|é||U+00E9||C3 A9||U+0065 U+0301||65 CC 81|
|è||U+00E8||C3 A8||U+0065 U+0300||65 CC 80|
|î||U+00EE||C3 AE||U+0069 U+0302||69 CC 82|
|ç||U+00E7||C3 A7||U+0063 U+0327||63 CC A7|
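The equivalence can be checked with Python 3's standard `unicodedata` module. This is only a sketch to make the two forms concrete; the variable names are made up for the example.

```python
import unicodedata

# The same recipe name, spelled with composed and decomposed code points.
composed = "Saut\u00E9ed cr\u00E8me fra\u00EEche Proven\u00E7ale"
decomposed = "Saute\u0301ed cre\u0300me frai\u0302che Provenc\u0327ale"

# They render identically but are different strings of different lengths.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 32 36

# Normalizing either one to a common form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# The UTF-8 bytes differ too, which is why the percent-encoded URLs differ.
print("\u00E9".encode("utf-8").hex())   # c3a9
print("e\u0301".encode("utf-8").hex())  # 65cc81
```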
When we uploaded our image files to S3 from a Mac, the accented letters in the S3 key names were stored as decomposed letter–accent pairs, while images uploaded from Linux had composed characters in their key names. This discrepancy was due to the Mac HFS+ filesystem, which stores filenames in decomposed form. (Linus Torvalds has a profane rant about HFS+.) Our client apps always used the composed form of the recipe name and generated image URLs accordingly. Because the two forms are different byte sequences, this led to 404s for recipe images that had been uploaded from a Mac. If S3 had handled the Unicode equivalence, we’d never have noticed.
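One way the mismatch could have been avoided is to normalize every name to a single form before it is used as a key or turned into a URL. The sketch below assumes NFC as the canonical form, since that is what our client apps produced. The function names `canonical_image_key` and `image_url_path` are hypothetical, and this is not the fix we adopted.

```python
import unicodedata
from urllib.parse import quote


def canonical_image_key(name: str) -> str:
    """Hypothetical helper: force a recipe name into composed (NFC) form."""
    return unicodedata.normalize("NFC", name) + ".jpg"


def image_url_path(name: str) -> str:
    """Hypothetical helper: percent-encode the canonical key for a URL."""
    return quote(canonical_image_key(name), safe="")


# A name as HFS+ would store it (decomposed) and as Linux would (composed).
from_mac = "Saute\u0301ed cre\u0300me frai\u0302che Provenc\u0327ale"
from_linux = "Saut\u00E9ed cr\u00E8me fra\u00EEche Proven\u00E7ale"

# After normalization, both uploads map to the same key and the same URL.
print(canonical_image_key(from_mac) == canonical_image_key(from_linux))  # True
print(image_url_path(from_mac))
```

The normalization would have to be applied both when uploading and when generating URLs; applying it on only one side leaves the original problem in place.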
We’re switching to using numeric identifiers for the recipe images—they’re much harder to get wrong.