Regex Backreference Tutorial

What is a regex backreference? Backreferencing is all about repeating characters or substrings. There are occasions where you may want to find repeated characters or substrings within your search string. Backreferences work in conjunction with capturing groups: a capturing group matches and remembers a piece of text, and a backreference later in the same regex matches that exact text again. There are two types of backreferences: numbered backreferences and named backreferences.
I'll start off with a simple example. Consider my grocery list: "Apples, bananas, and tootsie rolls."
What are the characters that repeat in my grocery list? "Apples, bananas, and tootsie rolls."
We can write a regex to find these repeating characters like this: "(p)\\1", or "(o)\\1", or "(l)\\1". We can also simplify the regex by using the \\w metacharacter: "(\\w)\\1". (The doubled backslashes are how these patterns are written inside a Java string literal.)

This is how a numbered backreference works: (capturing group)\\Number
The number after the backslash must match the number that was automatically assigned to the capturing group you want to backreference. Groups are numbered starting at 1, in the order of their opening parentheses from left to right.
The regex "(\\w)\\1" will find pp, oo, and ll.
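
To make the numbering concrete, here is a minimal sketch of numbered backreferences on the grocery list. The class name NumberedBackreferenceSketch and the helper findAll are my own names for illustration; they are not part of the tutorial's program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberedBackreferenceSketch {

    // Hypothetical helper: collects every match of regex in input,
    // in the order that Matcher.find() returns them.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String list = "Apples, bananas, and tootsie rolls.";

        // One group: group 1 captures a word character and
        // the backreference requires the same character again.
        System.out.println(findAll("(\\w)\\1", list));      // prints [pp, oo, ll]

        // Two groups: (t) is group 1 and (o) is group 2, so the two
        // backreferences replay group 2 first and then group 1.
        System.out.println(findAll("(t)(o)\\2\\1", list));  // prints [toot]
    }
}
```

The second pattern shows why the number matters: swapping the two backreferences would look for "toto" instead of "toot".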

This is how a named backreference works: (named group)\\k<name>
A named group is declared as (?<name>...). To backreference it, write \\k followed by the group name inside angle brackets: \\k<name>.
The regex "(?<test>\\w)\\k<test>" will find pp, oo, and ll.
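
As a small sketch of the named form, the example below uses the same group name "test" and reads the captured letter back with Matcher.group(String). The class name NamedBackreferenceSketch and the method doubledLetters are names I made up for illustration, and the sketch assumes Java 7 or later, where named groups are available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedBackreferenceSketch {

    // Hypothetical helper: returns the single letter captured by the
    // named group "test" for every doubled character in the input.
    static List<String> doubledLetters(String input) {
        List<String> letters = new ArrayList<>();
        Matcher m = Pattern.compile("(?<test>\\w)\\k<test>").matcher(input);
        while (m.find()) {
            // group("test") is the text the named group captured,
            // which is the same text the backreference matched again.
            letters.add(m.group("test"));
        }
        return letters;
    }

    public static void main(String[] args) {
        System.out.println(doubledLetters("Apples, bananas, and tootsie rolls.")); // prints [p, o, l]
    }
}
```

A named group still receives a number as well, so m.group(1) would return the same text as m.group("test") here.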

I can best explain these concepts using source code examples.



Open the command prompt (CMD; see Getting Started) and type in the following commands.

C:\Windows\System32>cd \
C:\>md Java
C:\>cd Java
C:\Java>
C:\Java>md RegexBackreference
C:\Java>cd RegexBackreference
C:\Java\RegexBackreference>Notepad RegexBackreference.java

Copy and paste, or type, the following code into Notepad and be sure to save the file when you are done.


import java.util.regex.*;

class RegexBackreference {
    public static void main(String[] args) {
        Matcher m = null;

        m = Pattern.compile("pp|oo|ll").matcher("Apples, bananas, and tootsie rolls.");
        while (m.find()) {
            System.out.println("pp|oo|ll = " + m.group());
        }
        System.out.println();

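        // Each alternative has its own group, so each backreference number
        // must match its group's position: (p) is 1, (o) is 2, (l) is 3.
        m = Pattern.compile("(p)\\1|(o)\\2|(l)\\3").matcher("Apples, bananas, and tootsie rolls.");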
        while (m.find()) {
            System.out.println("(p)\\1|(o)\\2|(l)\\3 = " + m.group());
        }
        System.out.println();

        m = Pattern.compile("([a-z])\\1").matcher("Apples, bananas, and tootsie rolls.");
        while (m.find()) {
            System.out.println("([a-z])\\1 = " + m.group());
        }
        System.out.println();

        m = Pattern.compile("(\\w)\\1").matcher("Apples, bananas, and tootsie rolls.");
        while (m.find()) {
            System.out.println("(\\w)\\1 = " + m.group());
        }
        System.out.println();

        m = Pattern.compile("(?<test>\\w)\\k<test>").matcher("Apples, bananas, and tootsie rolls.");
        while (m.find()) {
            System.out.println("(?<test>\\w)\\k<test> = " + m.group());
        }
        System.out.println();

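        // Group 1 captures two lowercase letters and the backreference
        // requires the same pair again (the "anan" in bananas).
        m = Pattern.compile("([a-z][a-z])\\1").matcher("Apples, bananas, and tootsie rolls.");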
        while (m.find()) {
            System.out.println("([a-z][a-z])\\1 = " + m.group());
        }
        System.out.println();

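        // The backreferences replay group 2 and then group 1,
        // so the match reads the same in both directions (toot).
        m = Pattern.compile("(t)(o)\\2\\1").matcher("Apples, bananas, and tootsie rolls. SAAS");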
        while (m.find()) {
            System.out.println("(t)(o)\\2\\1 = " + m.group());
        }
        System.out.println();

        m = Pattern.compile("(\\w)(\\w)\\2\\1").matcher("Apples, bananas, and tootsie rolls. SAAS");
        while (m.find()) {
            System.out.println("(\\w)(\\w)\\2\\1 = " + m.group());
        }
        System.out.println();

        m = Pattern.compile("(?<one>\\w)(?<two>\\w)\\k<two>\\k<one>").matcher("Apples, bananas, and tootsie rolls. SAAS");
        while (m.find()) {
            System.out.println("(?<one>\\w)(?<two>\\w)\\k<two>\\k<one> = " + m.group());
        }
        System.out.println();

        m = Pattern.compile("<[^>]{1,}>").matcher("<html> <body> Javac Java </body> </html>");
        while (m.find()) {
            System.out.println("<[^>]{1,}> = " + m.group());
        }
        System.out.println();

        m = Pattern.compile("<[^>]+>").matcher("<html> <body> Javac Java </body> </html>");
        while (m.find()) {
            System.out.println("<[^>]+> = " + m.group());
        }
        System.out.println();

        m = Pattern.compile("<[^>]+>.{0,}</[^>]+>").matcher("<html> <body> Javac Java </body> </html>");
        while (m.find()) {
            System.out.println("<[^>]+>.{0,}</[^>]+> = " + m.group());
        }
        System.out.println();

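        // Group 1 captures the opening tag name; the backreference requires
        // the closing tag to carry the same name.
        m = Pattern.compile("<([^>]+)>.*</\\1>").matcher("<html> <body> Javac Java </body> </html>");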
        while (m.find()) {
            System.out.println("<([^>]+)>.*</\\1> = " + m.group());
        }
        System.out.println();

    }
}

Now switch back to the command prompt (CMD) and type in javac RegexBackreference.java and press Enter.
Now type in java RegexBackreference and press Enter.


C:\Java\RegexBackreference>javac RegexBackreference.java
C:\Java\RegexBackreference>java RegexBackreference
See video for results
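
The last two patterns in the program both match the whole HTML string, so the benefit of the backreference is easy to miss. The sketch below compares them on a mismatched pair of tags. The inputs "<b>Java</b>" and "<b>Java</i>", the class name TagBackreferenceSketch, and the helper matchesWhole are my own assumptions for illustration and do not appear in the program above.

```java
import java.util.regex.Pattern;

class TagBackreferenceSketch {

    // Hypothetical helper: true when the regex matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String withoutBackreference = "<[^>]+>.*</[^>]+>";
        String withBackreference = "<([^>]+)>.*</\\1>";

        String balanced = "<b>Java</b>";
        String mismatched = "<b>Java</i>";

        // Without a backreference, any closing tag is accepted.
        System.out.println(matchesWhole(withoutBackreference, balanced));   // prints true
        System.out.println(matchesWhole(withoutBackreference, mismatched)); // prints true

        // With the backreference, the closing tag name must equal
        // the name captured from the opening tag.
        System.out.println(matchesWhole(withBackreference, balanced));      // prints true
        System.out.println(matchesWhole(withBackreference, mismatched));    // prints false
    }
}
```

This is only a demonstration of the backreference; it is not meant as a way to validate real HTML.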


Final thoughts

To reiterate, backreferencing is all about repeating patterns. If you found this video confusing, that is fine; you are learning. Sometimes it is best to wait a day and watch it again, and things tend to sink in.

