Replacement Text Tutorial: Backreferences
If your regular expression has named or numbered capturing groups, then you can reinsert the text matched by any of those capturing groups in the replacement text. Your replacement text can reference as many groups as you like, and can even reference the same group more than once. This makes it possible to rearrange the text matched by a regular expression in many different ways. As a simple example, the regex \*(\w+)\* matches a single word between asterisks, storing the word in the first (and only) capturing group. The replacement text <b>\1</b> replaces each regex match with the text stored by the capturing group between bold tags. Effectively, this search-and-replace replaces the asterisks with bold tags, leaving the word between the asterisks in place. This technique using backreferences is important to understand. Replacing *word* as a whole with <b>word</b> is far easier and far more efficient than trying to come up with a way to correctly replace the asterisks separately.
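As a minimal sketch of the example above, here is the same search-and-replace in Python, one of the flavors that supports the \1 syntax. The sample strings and variable names are made up for illustration.

```python
import re

# Regex from the text: a single word between literal asterisks,
# with the word stored in capturing group 1.
pattern = re.compile(r"\*(\w+)\*")

# Replacement text from the text: group 1 wrapped in bold tags.
replacement = r"<b>\1</b>"

result = pattern.sub(replacement, "This is *important* and *urgent*.")
print(result)  # This is <b>important</b> and <b>urgent</b>.

# Backreferences can also rearrange the matched text.
# Here the two captured words trade places around the equals sign.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")
print(swapped)  # value=key
```

The raw string prefix r keeps Python from interpreting the backslashes before the regex engine sees them.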
The \1 syntax for backreferences in the replacement text is borrowed from the syntax for backreferences in the regular expression. \1 through \9 are supported by the JGsoft applications, Delphi, Perl (though deprecated), Python, Ruby, PHP, R, Boost, and Tcl. Double-digit backreferences \10 through \99 are supported by the JGsoft applications, Delphi, Python, and Boost. If there are not enough capturing groups in the regex for the double-digit backreference to be valid, then all these flavors treat \10 through \99 as a single-digit backreference followed by a literal digit. The flavors that support single-digit backreferences but not double-digit backreferences also do this.
$1 through $99 for single-digit and double-digit backreferences are supported by the JGsoft applications, Delphi, .NET, Java, JavaScript, VBScript, PCRE2, PHP, Boost, std::regex, and XPath. These are also the variables that hold text matched by capturing groups in Perl. If there are not enough capturing groups in the regex for a double-digit backreference to be valid, then all these flavors except .NET, Perl, PCRE2, and std::regex treat $10 through $99 as a single-digit backreference followed by a literal digit.
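Putting curly braces around the digit, as in ${1}, isolates the digit from any literal digits that follow. For example, ${1}0 is a reference to group 1 followed by a literal zero rather than a reference to group 10. This works in the JGsoft applications, Delphi, .NET, Perl, PCRE2, PHP, Boost, and XRegExp.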
If your regular expression has named capturing groups, then you should use named backreferences to them in the replacement text. The regex (?'name'group) has one group called “name”. You can reference this group with ${name} in the JGsoft applications, Delphi, .NET, PCRE2, Java 7, and XRegExp. PCRE2 also supports $name without the curly braces, a syntax that is unique to PCRE2. In Perl 5.10 and later you can interpolate the variable $+{name}. Boost too uses $+{name} in replacement strings. ${name} does not work in any version of Perl.
In Python, if you have the regex (?P<name>group) then you can use its match in the replacement text with \g<name>. This syntax also works in the JGsoft applications and Delphi. Python and the JGsoft applications, but not Delphi, also support numbered backreferences using this syntax. In Python this is the only way to have a numbered backreference immediately followed by a literal digit.
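The following sketch shows the \g<name> syntax in Python's re module, along with the numbered form \g<1> used to place a literal digit right after a backreference. The date regex and sample strings are assumptions chosen for illustration.

```python
import re

# Three named groups; they are also numbered 1, 2, and 3.
date_regex = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named backreferences in the replacement text.
by_name = date_regex.sub(r"\g<day>/\g<month>/\g<year>", "2021-08-12")
print(by_name)  # 12/08/2021

# The same groups referenced by number with the \g<...> syntax.
by_number = date_regex.sub(r"\g<3>/\g<2>/\g<1>", "2021-08-12")
print(by_number)  # 12/08/2021

# \g<1>0 is group 1 followed by a literal zero.
# Writing \10 instead would be read as a reference to group 10.
followed_by_digit = re.sub(r"(\d)", r"\g<1>0", "1 2 3")
print(followed_by_digit)  # 10 20 30
```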
PHP and R support named capturing groups and named backreferences in regular expressions. But they do not support named backreferences in replacement texts. You’ll have to use numbered backreferences in the replacement text to reinsert text matched by named groups. To determine the numbers, count the opening parentheses of all capturing groups (named and unnamed) in the regex from left to right.
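The counting rule can be checked mechanically. The sketch below uses Python only to illustrate how groups are numbered. Python itself does support named backreferences in replacement text, so this is not a model of PHP or R, just a way to see which number belongs to which name. The regex and group names are made up for the example.

```python
import re

# Opening parentheses of capturing groups, left to right:
#   1: (\d+)          unnamed
#   2: (?P<word>...)  named
#   -: (?:x|y)        non-capturing, so it gets no number
#   3: (?P<tail>...)  named
regex = re.compile(r"(\d+)-(?P<word>[a-z]+)-(?:x|y)-(?P<tail>\w)")

group_count = regex.groups
name_to_number = dict(regex.groupindex)
print(group_count)     # 3
print(name_to_number)  # {'word': 2, 'tail': 3}

# Numbered backreferences reinsert the text matched by the named groups.
result = regex.sub(r"\3:\2:\1", "42-abc-x-z")
print(result)  # z:abc:42
```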
An invalid backreference is a reference to a number greater than the number of capturing groups in the regex or a reference to a name that does not exist in the regex. Such a backreference can be treated in three different ways. Delphi, Perl, Ruby, PHP, R, Boost, std::regex, XPath, and Tcl substitute the empty string for invalid backreferences. Java, XRegExp, PCRE2, and Python treat them as a syntax error. JavaScript (without XRegExp) and .NET treat them as literal text.
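Python is one of the flavors that treat an invalid backreference as an error. A small sketch of how that surfaces in practice follows. The helper name is_accepted is hypothetical, and catching IndexError alongside re.error is a defensive assumption about which exception type a given Python version raises for an unknown group name.

```python
import re

def is_accepted(pattern, replacement, subject):
    """Return True if Python performs the replacement without an error."""
    try:
        re.sub(pattern, replacement, subject)
    except (re.error, IndexError):
        # Raised for a group number that is too high
        # or a group name that does not exist in the regex.
        return False
    return True

valid = is_accepted(r"(a)", r"\1", "a")
bad_number = is_accepted(r"(a)", r"\2", "a")
bad_name = is_accepted(r"(a)", r"\g<missing>", "a")
print(valid, bad_number, bad_name)  # True False False
```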
The original JGsoft flavor replaced invalid backreferences with the empty string. But JGsoft V2 treats them as a syntax error. Applications using the V2 flavor all apply syntax coloring to replacement strings, highlighting invalid backreferences in red.
A non-participating capturing group is a group that did not participate in the match attempt at all. This is different from a group that matched an empty string. The group in a(b?)c always participates in the match. Its contents are optional but the group itself is not optional. The group in a(b)?c is optional. It participates when the regex matches abc, but not when the regex matches ac.
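The difference is visible when you inspect the match itself. In this Python sketch, a group that matched the empty string yields an empty string, while a group that did not participate yields None. The variable names are illustrative.

```python
import re

# The group always participates; its contents are optional.
empty_group = re.fullmatch(r"a(b?)c", "ac").group(1)
print(repr(empty_group))  # ''

# The group itself is optional and does not participate when matching "ac".
missing_group = re.fullmatch(r"a(b)?c", "ac").group(1)
print(repr(missing_group))  # None

# The optional group does participate when matching "abc".
present_group = re.fullmatch(r"a(b)?c", "abc").group(1)
print(repr(present_group))  # 'b'
```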
In most applications, there is no difference between a backreference in the replacement string to a group that matched the empty string and one to a group that did not participate. Both are replaced with an empty string. Two exceptions are Python and PCRE2. Both accept backreferences in the replacement string to optional capturing groups. But in PCRE2 the search-and-replace returns an error code if the capturing group happens not to participate in one of the regex matches. The same situation raises an exception in Python 3.4 and prior. Python 3.5 and later no longer raise the exception and substitute the empty string instead.
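A short sketch of the Python behavior, assuming Python 3.5 or later. On Python 3.4 and prior the same call would raise an exception for the first match instead of returning a result.

```python
import re

# The optional group participates in "abc" but not in "ac".
# Assumes Python 3.5 or later, where a backreference to a
# non-participating group inserts the empty string.
result = re.sub(r"a(b)?c", r"[\1]", "ac abc")
print(result)  # [] [b]
```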
In the JGsoft applications and Delphi, $+ inserts the text matched by the highest-numbered group that actually participated in the match. In Perl 5.18, the variable $+ holds the same text. When (a)(b)|(c)(d) matches ab, $+ is substituted with b. When the same regex matches cd, $+ inserts d. \+ does the same in the JGsoft applications, Delphi, and Ruby.
In .NET, VBScript, and Boost, $+ inserts the text matched by the highest-numbered group, regardless of whether it participated in the match or not. If it didn’t, nothing is inserted. In Perl 5.16 and prior, the variable $+ holds the same text. When (a)(b)|(c)(d) matches ab, $+ is substituted with the empty string. When the same regex matches cd, $+ inserts d.
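Python's re module has no $+ token in replacement strings, but both meanings of "highest-numbered group" can be emulated with a replacement function. This makes the difference between the two easy to see. The function names below are hypothetical and this is only a sketch of the two semantics, not how any of the flavors above implement them.

```python
import re

def last_participating(match):
    """Text of the highest-numbered group that took part in the match."""
    for index in range(match.re.groups, 0, -1):
        text = match.group(index)
        if text is not None:
            return text
    return ""  # no capturing group participated

def highest_numbered(match):
    """Text of the highest-numbered group, or '' if it did not participate."""
    if match.re.groups == 0:
        return ""  # the regex has no capturing groups at all
    return match.group(match.re.groups) or ""

regex = re.compile(r"(a)(b)|(c)(d)")

# "ab" is replaced with b (group 2), "cd" with d (group 4).
participating = regex.sub(last_participating, "ab cd")
print(repr(participating))  # 'b d'

# Group 4 did not participate in "ab", so nothing is inserted there.
regardless = regex.sub(highest_numbered, "ab cd")
print(repr(regardless))  # ' d'
```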
Boost 1.42 added additional syntax of its own invention for either meaning of highest-numbered group. $^N, $LAST_SUBMATCH_RESULT, and ${^LAST_SUBMATCH_RESULT} all insert the text matched by the highest-numbered group that actually participated in the match. $LAST_PAREN_MATCH and ${^LAST_PAREN_MATCH} both insert the text matched by the highest-numbered group regardless of whether it participated in the match.
Page URL: https://www.regular-expressions.info/replacebackref.html
Page last updated: 12 August 2021
Site last updated: 06 November 2024
Copyright © 2003-2024 Jan Goyvaerts. All rights reserved.