A question I hear quite frequently is, how do you match a number within a range in a regular expression? This is a pretty odd question, since regular expressions tend to deal with strings without worrying about the meaning of their content. Most of the time it turns out the person asking is trying to do all of their input validation using regular expressions, which I believe is a singularly bad idea.
Here's an example of what I'm talking about, to match an integer between 0 and 255 (like an octet of an IP address) we could use the following pattern:
25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[0-9][0-9]?
For the pragmatists out there, one arguably valid use case is searching log files for lines about a particular subnet. Anyway, I think that this is an interesting enough problem to tackle without worrying much about practical applications.
About a year ago I came across Justin Rogers' solution to this very problem in C#. As much as I admire his coding skills, his solution seemed to me rather verbose and heavy with special cases. I thought I could do better, or at least understand better how he ended up with that solution by tackling it myself.
I started by writing around 20 test cases, then porting his solution to PHP, so I would have a reference implementation to check my tests against. It took me a while to come up with my own working version, but I was happy to see it passing the carefully crafted tests. I was even happier that it was a fraction of the size of the reference implementation with no special cases to mar its elegance.
However I had a nasty surprise after adding a couple more tests, my solution mysteriously failed them. Evidently my hand written test cases (as usual, some random and some boundary counditions) weren't comprehensive enough. It took me quite a while to find general fixes, but then my happiness returned for a brief visit.
To quell a worry, I decided to generate a hundred more random test cases, AKA. fuzzing. Inevitably more failures emerged, about ten of the new tests failed.
With a decent number of test runs to study, it was quickly apparent that the major issue lay with ranges of the same length which shared a common prefix, e.g. 223-229. My limited test cases from before had led me in the wrong direction.
Eventually I had one failing test out of about 120, which illuminated a subtle error in the logic. After fixing that I ran the code against 50,000 more randomly generated test cases, and was finally assured that it was working.
The code below, I can quite confidently say, works fine. If you find a bug in it then please send me a self-addressed machete so I can thank you properly.
<?php function generateRange($min, $max) { $digit = '[0-9]'; $lenMax = strlen($max); $lenMin = strlen($min); $lenDiff = $lenMax - $lenMin; $min = str_pad($min, $lenMax, 0, STR_PAD_LEFT); $max = (string)$max; // find length of common prefix for ($i = 0; $i < $lenMin && $min[$i] == $max[$i]; $i++); $prefixLength = $i; // add non-conflicting ranges from each end for ($i = $lenMax, $j = 0; $i-- > 1 + $prefixLength; $j++) { $lower = $min[$i]; $upper = $max[$i]; // correct bounds if not final range if ($j) { ++$lower; --$upper; } // lower bound if ($lower < 10) { $char = $lower == 9 ? 9 : '[' . $lower . '-9]'; $pattern[] = ($j >= $lenMin ? '' : substr($min, $lenDiff, $i - $lenDiff)) . $char . str_repeat($digit, $j); } // upper bound if ($upper >= 0) { $char = $upper ? '[0-' . $upper . ']' : 0; $pattern[] = substr($max, 0, $i) . $char . str_repeat($digit, $j); } } // add middle range if (!$j || $max[$prefixLength] - $min[$prefixLength] > 1) { $prefix = substr($min, 0, $prefixLength); $lower = $min[$prefixLength]; $upper = $max[$prefixLength]; // correct bounds if not final range if ($j && $i == $prefixLength) { ++$lower; --$upper; } $char = $lower == $upper ? $lower : '[' . $lower . '-' . $upper . ']'; $pattern[] = $prefix . $char . str_repeat($digit, $lenMax - $prefixLength - 1); } return join('|', $pattern); } ?>
My solution ended up along the same lines as Justin's, and although I had achieved my original aim of understanding his solution better, the real educational experience here was the inadequacy of my initial tests. I think I've now thoroughly learnt the lesson of testing with a comprehensive range of input.
Bearing in mind that this is just one function with a relatively small domain of useful input: If you fuzzed your own applications, what obscure bugs might you find?
As an aside, when I came back to this recently having decided to write an article on the topic, I found an elegant solution, written in Python (of course), which uses the rather different approach of first partitioning the ranges with a recursive function and then constructing a regular expression for each one.
Out of curiosity I ported this to PHP too, but as you might expect it became decidedly less elegant and the depth of recursion segfaulted PHP on all but the most trivial input.
Photo by Jag arts.