Lookbehind is often used to match certain text that is preceded by other text, without including the other text in the overall regex match. (?<=h)d matches only the second d in adhd. While a lot of regex flavors support lookbehind, most regex flavors only allow a subset of the regex syntax to be used inside lookbehind. Perl and Boost require the lookbehind to be of fixed length. PCRE and Ruby allow alternatives of different length, but still don’t allow quantifiers other than the fixed-length {n}.
To overcome the limitations of lookbehind, Perl 5.10, PCRE 7.2, Ruby 2.0, and Boost 1.42 introduced a new feature that can be used instead of lookbehind for its most common purpose. \K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
The JGsoft flavor has always supported unrestricted lookbehind, which is much more flexible than \K. Still, JGsoft V2 adds support for \K if you prefer this way of working.
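As a rough illustration, the Ruby sketch below runs the lookbehind form and the \K form against adhd. It assumes Ruby 2.0 or later, where \K is available. The helper name first_match is made up for this example and is not part of any library.

```ruby
# Return the text of the first match of pattern in text, or nil if none.
# (first_match is a hypothetical helper used only in this sketch.)
def first_match(pattern, text)
  m = pattern.match(text)
  m && m[0]
end

p first_match(/(?<=h)d/, "adhd")   # lookbehind form
p first_match(/h\Kd/, "adhd")      # \K form

# The part to the left of \K is matched normally, so it may use a quantifier.
p first_match(/h+\Kd/, "ahhhd")
```

Both forms should report only the d that follows h. The last call shows the kind of pattern that \K makes easy to write when the preceding text has a variable length.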
Let’s see how h\Kd works. The engine begins the match attempt at the start of the string. h fails to match a. There are no further alternatives to try. The match attempt at the start of the string has failed.
The engine advances one character through the string and attempts the match again. h fails to match d.
Advancing again, h matches h. The engine advances through the regex. The engine has now reached \K in the regex and the position between h and the second d in the string. \K does nothing other than to tell the engine that, if this match attempt ends up succeeding, it should pretend that the match attempt started at the present position between h and d, rather than between the first d and h where it really started.
The engine advances through the regex. d matches the second d in the string. An overall match is found. Because of the position saved by \K, the second d in the string is returned as the overall match.
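The outcome of this walkthrough can be inspected from Ruby. The sketch below is only an illustration: describe is a hypothetical helper that returns the matched text, the start offset the engine reports, and the text it treats as lying before the match.

```ruby
# Summarize what the engine reports for the first match of pattern in text.
# (describe is a hypothetical helper used only in this sketch.)
def describe(pattern, text)
  m = pattern.match(text)
  return nil unless m
  { text: m[0], start: m.begin(0), before: m.pre_match }
end

p describe(/h\Kd/, "adhd")
```

The reported start should be offset 3, the position between h and the second d, even though the match attempt itself began at offset 2. The h consumed before \K ends up in the text before the match.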
\K only affects the position returned after a successful match. It does not move the start of the match attempt during the matching process. The regex hhh\Kd matches the d in hhhhd. This regex first matches hhh at the start of the string. Then \K notes the position between hhh and hd in the string. Then d fails to match the fourth h in the string. The match attempt at the start of the string has failed.
Now the engine must advance one character in the string before starting the next match attempt. It advances from the actual start of the match attempt, which was at the start of the string. The position stored by \K does not change this. So the second match attempt begins at the position after the first h in the string. Starting there, hhh matches hhh, \K notes the position, and d matches d. Now, the position remembered by \K is taken into account, and d is returned as the overall match.
You can use \K pretty much anywhere in any regular expression. You should only avoid using it inside lookbehind. You can use it inside groups, even when they have quantifiers. You can have as many instances of \K in your regex as you like. (ab\Kc|d\Ke)f matches cf when preceded by ab. It also matches ef when preceded by d.
\K does not affect capturing groups. When (ab\Kc|d\Ke)f matches cf, the capturing group captures abc as if the \K weren’t there. When the regex matches ef, the capturing group stores de.
Because \K does not affect the way the regex engine goes through the matching process, it offers a lot more flexibility than lookbehind in Perl, PCRE, and Ruby. You can put anything to the left of \K, but you’re limited to what you can put inside lookbehind.
But this flexibility does come at a cost. Lookbehind really goes backwards through the string. This allows lookbehind to check for a match before the start of the match attempt. When the match attempt was started at the end of the previous match, lookbehind can match text that was part of the previous match. \K cannot do this, precisely because it does not affect the way the regex engine goes through the matching process.
If you iterate over all matches of (?<=a)a in the string aaaa, you will get three matches: the second, third, and fourth a in the string. The first match attempt begins at the start of the string and fails because the lookbehind fails. The second match attempt begins between the first and second a, where the lookbehind succeeds and the second a is matched. The third match attempt begins after the second a that was just matched. Here the lookbehind succeeds too. It doesn’t matter that the preceding a was part of the previous match. Thus the third match attempt matches the third a. Similarly, the fourth match attempt matches the fourth a. The fifth match attempt starts at the end of the string. The lookbehind still succeeds, but there are no characters left for a to match. The match attempt fails. The engine has reached the end of the string and the iteration stops. Five match attempts have found three matches.
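The lookbehind iteration can be reproduced with String#scan in Ruby, which resumes each search at the end of the previous match. This is a minimal sketch; match_starts is a hypothetical helper that records where each reported match begins.

```ruby
# Collect the start offset of every match that scan finds in text.
# (match_starts is a hypothetical helper used only in this sketch.)
def match_starts(pattern, text)
  starts = []
  text.scan(pattern) { starts << Regexp.last_match.begin(0) }
  starts
end

p match_starts(/(?<=a)a/, "aaaa")
```

The offsets collected should be 1, 2, and 3, corresponding to the second, third, and fourth a.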
Things are different when you iterate over a\Ka in the string aaaa. You will get only two matches: the second and the fourth a. The first match attempt begins at the start of the string. The first a in the regex matches the first a in the string. \K notes the position. The second a matches the second a in the string, which is returned as the first match. The second match attempt begins after the second a that was just matched. The first a in the regex matches the third a in the string. \K notes the position. The second a matches the fourth a in the string, which is returned as the second match. The third match attempt begins at the end of the string. a fails. The engine has reached the end of the string and the iteration stops. Three match attempts have found two matches.
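To make the difference visible, the sketch below steps through the matches of a\Ka by hand in Ruby, resuming each search where the previous match ended. The helper trace_matches and its hash keys are made up for this illustration.

```ruby
# Step through the matches manually. Each search resumes where the previous
# match ended, mirroring the iteration described above.
# (trace_matches is a hypothetical helper used only in this sketch.)
def trace_matches(pattern, text)
  steps = []
  pos = 0
  while pos <= text.length && (m = pattern.match(text, pos))
    steps << { resumed_at: pos, match_start: m.begin(0), match_end: m.end(0) }
    # Resume at the end of the match; step one character past an empty match
    # so that the loop always terminates.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  steps
end

trace_matches(/a\Ka/, "aaaa").each { |step| p step }
```

Each search resumes at offset 0 and then at offset 2, but the reported matches start at offsets 1 and 3. The a consumed to the left of \K is used up by each attempt, which is why the third a is never reported.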
Basically, you’ll run into this issue when the part of the regex before the \K can match the same text as the part of the regex after the \K. If those parts can’t match the same text, then a regex using \K will find the same matches as the same regex rewritten using lookbehind. In that case, you should use \K instead of lookbehind as that will give you better performance in Perl, PCRE, and Ruby.
Another limitation is that while lookbehind comes in positive and negative variants, \K does not provide a way to negate anything. (?<!a)b matches the string b entirely, because it is a “b” not preceded by an “a”. [^a]\Kb does not match the string b at all. When attempting the match, [^a] matches b. The regex has now reached the end of the string. \K notes this position. But now there is nothing left for b to match. The match attempt fails. [^a]\Kb is the same as (?<=[^a])b, which are both different from (?<!a)b.
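The difference between negative lookbehind and a negated character class followed by \K can be checked with a few one-character inputs. This Ruby sketch uses a hypothetical helper, matched_text, and is meant only as an illustration.

```ruby
# Return the text of the first match of pattern in text, or nil if none.
# (matched_text is a hypothetical helper used only in this sketch.)
def matched_text(pattern, text)
  m = pattern.match(text)
  m && m[0]
end

# Negative lookbehind succeeds at the very start of the string.
p matched_text(/(?<!a)b/, "b")

# [^a] must consume a real character, so a lone b cannot match.
p matched_text(/[^a]\Kb/, "b")
p matched_text(/(?<=[^a])b/, "b")

# With a non-a character in front of b, the \K form does match.
p matched_text(/[^a]\Kb/, "cb")
```

Only the negative lookbehind should match the lone b. The other two patterns need an actual character other than a in front of it.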
Page URL: https://www.regular-expressions.info/keep.html
Page last updated: 12 August 2021
Site last updated: 06 November 2024
Copyright © 2003-2024 Jan Goyvaerts. All rights reserved.