11.3.12 Demonstrating Shortest and Longest Match Operators

In Shortest Match, or “Non-greedy” Regexp Operators, we described the POSIX 2024 shortest-match operators and showed the output of a program that demonstrates the differences. Here is the program, shortest-match.awk:

# shortest-match.awk --- Demonstrate shortest match operators

BEGIN {
    text[1] = "aaaxxxzzz"
        shortpat[1] = @/x+?/
        longpat[1] = @/x+/
    text[2] = "aaaxxxyzzz"
        shortpat[2] = @/x+?y/
        longpat[2] = @/x+y/
    text[3] = "aaaxxxxxxxxxxxxxxxxzzz"
        shortpat[3] = @/(x+?)(x+)(x+?)(x+)/
        longpat[3] = @/(x+)(x+)(x+)(x+)/
    text[4] = "aaaxyxxyxxyxzzz"
        shortpat[4] = @/((x+)(y+?)(x+))+/
        longpat[4] = @/((x+)(y+)(x+))+/
    text[5] = "aaaxyxxyxxyxzzz"
        shortpat[5] = @/((x+)(y+?)(x+)){2}/
        longpat[5] = @/((x+)(y+)(x+)){2}/

    count = length(text)
    for (i = 1; i <= count; i++) {
        show(text[i], shortpat[i], longpat[i])
    }
}

# show --- show the results of using shortest- and longest-match operators

function show(text, shortpat, longpat,
          s_offsets, l_offsets, subresult)   # locals
{
    printf("\"%s\"\n", text)

    match(text, shortpat, s_offsets)
    subresult = gensub(shortpat, "X", 1, text)
    printf("\tshortpat: /%s/, result: \"%s\"\n", shortpat, subresult)
    dump(text, s_offsets)

    match(text, longpat, l_offsets)
    subresult = gensub(longpat, "X", 1, text)
    printf("\tlongpat: /%s/, result: \"%s\"\n", longpat, subresult)
    dump(text, l_offsets)
}

# dump --- dump out the start and length of subpattern matches

function dump(text, offsets,    i, n, s, l)
{
    n = length(offsets)
    for (i = 0; i <= n; i++) {
        if ((i, "start") in offsets) {
            s = offsets[i, "start"]
            l = offsets[i, "length"]
            printf("\t\t%d: (s: %d, l: %d) -> \"%s\"\n", i,
                s, l, substr(text, s, l))
        }
    }
}

The program makes use of strongly typed regexp constants (see Strongly Typed Regexp Constants), the optional third argument to the match() function, and the gensub() and substr() functions (see String-Manipulation Functions).
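
To see in isolation how the array argument to match() is filled in, here is a minimal sketch. It is not part of shortest-match.awk: the file name match-offsets.awk and the variable m are made up for illustration, and it uses only the ordinary longest-match operators, so it should run on any gawk version that supports the array argument to match() and the gensub() function.

```awk
# match-offsets.awk --- hypothetical sketch of match()'s array argument

BEGIN {
    text = "aaaxxxyzzz"

    # With an array as third argument, match() records the position
    # and length of the whole match (index 0) and of each
    # parenthesized subexpression (indices 1, 2, ...).
    if (match(text, /(x+)(y)/, m)) {
        for (i = 0; (i, "start") in m; i++)
            printf("%d: (s: %d, l: %d) -> \"%s\"\n", i,
                m[i, "start"], m[i, "length"],
                substr(text, m[i, "start"], m[i, "length"]))
    }

    # gensub() returns the modified string; text itself is unchanged.
    print gensub(/x+y/, "X", 1, text)
}
```

Run with gawk -f match-offsets.awk, this prints one line for the overall match ("xxxy", starting at position 4), one line for each of the two subexpressions ("xxx" and "y"), and finally the string "aaaXzzz" produced by gensub(). The dump() function above relies on the same (i, "start") and (i, "length") subscripts.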