Traditionally, regexps have always matched the leftmost, longest sequence of characters. For example, given the text ‘abxxxcd’, the regexp ‘x+’ could match one, two, or all three ‘x’ characters. By defining a regexp to match the longest possible sequence, we know that ‘x+’ will match all three ‘x’s. Such behavior may be termed greedy, since a regexp will match as many characters as possible.
The Perl language introduced “non-greedy,” or shortest-match operators. In the above example, ‘x+’ would only match a single ‘x’.
The 2024 POSIX standard introduced shortest-match operators into POSIX EREs. Syntactically, you create a shortest-match operator by appending a ‘?’ to one of the traditional regexp operators: ‘*’, ‘+’, ‘?’, or ‘{}’.
The shortest-match operators don’t always match what you might think would be the smallest sequence of characters, since the general, leftmost-longest rule still applies to the overall regexp.
To illustrate the differences, we use a program named shortest-match.awk.
This program takes several strings, and replaces the text matching a given
pattern with a single ‘X’. The program uses a number of gawk-specific
features which haven’t yet been described. We therefore delay presenting the
code until Demonstrating Shortest and Longest Match Operators.
For each string, the program prints the following:
Here is the output, with commentary:
"aaaxxxzzz"
shortpat: /x+?/, result: "aaaXxxzzz"
0: (s: 4, l: 1) -> "x"
longpat: /x+/, result: "aaaXzzz"
0: (s: 4, l: 3) -> "xxx"
For the shortest match, only one ‘x’ is replaced. For the longest match, every ‘x’ is replaced. This is pretty straightforward.
"aaaxxxyzzz"
shortpat: /x+?y/, result: "aaaXzzz"
0: (s: 4, l: 4) -> "xxxy"
longpat: /x+y/, result: "aaaXzzz"
0: (s: 4, l: 4) -> "xxxy"
In this case, the result is the same, since the longest possible string of characters is matched.
"aaaxxxxxxxxxxxxxxxxzzz"
shortpat: /(x+?)(x+)(x+?)(x+)/, result: "aaaXzzz"
0: (s: 4, l: 16) -> "xxxxxxxxxxxxxxxx"
1: (s: 4, l: 1) -> "x"
2: (s: 5, l: 13) -> "xxxxxxxxxxxxx"
3: (s: 18, l: 1) -> "x"
4: (s: 19, l: 1) -> "x"
longpat: /(x+)(x+)(x+)(x+)/, result: "aaaXzzz"
0: (s: 4, l: 16) -> "xxxxxxxxxxxxxxxx"
1: (s: 4, l: 13) -> "xxxxxxxxxxxxx"
2: (s: 17, l: 1) -> "x"
3: (s: 18, l: 1) -> "x"
4: (s: 19, l: 1) -> "x"
This case is more interesting. The same total amount of text is matched for both patterns. However, the text matched by the subexpressions differs between the two cases. In the longest-match case, the longest possible subexpressions match first.
"aaaxyxxyxxyxzzz"
shortpat: /((x+)(y+?)(x+))+/, result: "aaaXyxxyxzzz"
0: (s: 4, l: 4) -> "xyxx"
1: (s: 4, l: 4) -> "xyxx"
2: (s: 4, l: 1) -> "x"
3: (s: 5, l: 1) -> "y"
4: (s: 6, l: 2) -> "xx"
longpat: /((x+)(y+)(x+))+/, result: "aaaXzzz"
0: (s: 4, l: 9) -> "xyxxyxxyx"
1: (s: 10, l: 3) -> "xyx"
2: (s: 10, l: 1) -> "x"
3: (s: 11, l: 1) -> "y"
4: (s: 12, l: 1) -> "x"
This case needs careful examination. There is a ‘+’ on the outside, meaning that the total inner expression may match multiple times. For the longest-match expression, that indeed happens. The subexpressions two through four have the start and length values for the last occasion where the inner expression matched. The shortest-match expression only matched the inner expression once.
"aaaxyxxyxxyxzzz"
shortpat: /((x+)(y+?)(x+)){2}/, result: "aaaXyxzzz"
0: (s: 4, l: 7) -> "xyxxyxx"
1: (s: 7, l: 4) -> "xyxx"
2: (s: 7, l: 1) -> "x"
3: (s: 8, l: 1) -> "y"
4: (s: 9, l: 2) -> "xx"
longpat: /((x+)(y+)(x+)){2}/, result: "aaaXyxzzz"
0: (s: 4, l: 7) -> "xyxxyxx"
1: (s: 7, l: 4) -> "xyxx"
2: (s: 7, l: 1) -> "x"
3: (s: 8, l: 1) -> "y"
4: (s: 9, l: 2) -> "xx"
Here, the results are identical; both the nature of the string to be matched and the POSIX longest-leftmost rule, cause this to happen.
The MinRX regular expression matching engine, which is gawk’s
default matcher (see Selecting the Regexp Matching Engine), supports the shortest-match
operators. The older GNU matchers do not. Also, when invoked with the
--traditional option, shortest-match operators are not supported,
since traditional awk does not have these operators.