George V. Reilly

Now You Have 32 Problems

Some people, when confronted with a problem, think “I know, I’ll use regular ex­pres­sions.” Now they have two problems.

— Jaime Zawinksi

A Twitter thread about very long regexes reminded me of the longest regex that I ever ran afoul of, a par­tic­u­lar­ly horrible multilevel mess that had worked acceptably on the 32-bit .NET CLR, but brought the 64-bit CLR to its knees.

Whenever I ran our ASP.NET web ap­pli­ca­tion [on Win64], it would go berserk, eat up all 4GB of my physical RAM, push the working set of IIS’s w3wp.exe to 12GB, and max out one of my 4 cores! The only way to maintain any sanity was to run iisreset every 20 minutes to gently kill the process.

WinDbg and Process Explorer showed that the rogue thread was stuck in a loop in mscorjit!Life­times­ListIn­te­ri­or­Block­sHelperIt­er­a­tive<GCIn­fo­Liv­eRe­cord­Ma­nip­u­la­tor>. I passed a minidump on to my former colleagues in IIS, who sent it to the CLR team. They said:

The only thing I can tell is that it is Regex, and some regex expression compiled down to a method with 456KB of IL. That is huge, and yes 12GB of RAM consumed for something like that is expected.

With that clue, I was able to track down the problem, a par­tic­u­lar­ly foul regex, built from a 10KB string, with 32 al­ter­nat­ing ex­pres­sions, each of which contains dozens of alternated subex­pres­sions. The string is built from many smaller strings, so it’s not obvious in the source just how ugly it is.

I never wrote a followup post explaining how I dealt with this beast.

The regex was used on the Cozi calendar to parse ap­point­ments in everyday language, such as “Ann/John Dinner out Friday at 8pm” or “John’s birthday every Dec. 7”. These would get translated into (possibly recurring) iCalendar ap­point­ments.

Some of the subex­pres­sions mentioned above looked like:

I’ve elided the in­ter­me­di­ate values but they were spelled out in the original. Some of the simpler subex­pres­sions were repeated several times, nested inside others.

This all screamed that a grammar and a real parser were needed, but the test suite also screamed here be dragons!

I resisted the temptation to rewrite the ap­point­ment parser from scratch with a proper grammar, or to experiment with a real natural language parser, though it remained on my personal todo list for the rest of my time at Cozi. We were migrating from C# to Python at that point, and the legacy ap­point­ment parser was one of the few remaining pieces that prevented us from shutting down the .NET servers.

Instead, I changed the ap­point­ment parser code so that it didn’t attempt to match the entire 10KB monster in one go. I looped through each of the 32 top-level dis­junc­tions, manually performing the al­ter­na­tion. If any one of those matched, then I had what I needed. Reducing the regexes to a few hundred characters each tamed the com­bi­na­to­r­i­al explosion of back­track­ing state.

Regexes definitely have a place, but do not try to implement a full grammar as a single regular expression.

blog comments powered by Disqus
Weirdest Birthday Ever » « Punctuating James Joyce