This reference page explains what the Unicode tokens do when used outside character classes. All of these except \X can also be used inside character classes. Inside a character class, these tokens add the characters that they normally match to the character class.
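
As an illustration of the character class behavior described above, here is a minimal sketch in Java, one of the flavors the table lists as supporting \p{L}. The class name `UnicodeClassDemo` and its helper methods are hypothetical names chosen for this example only.

```java
import java.util.regex.Pattern;

public class UnicodeClassDemo {
    // Outside a character class: exactly one code point in category L (letter).
    static final Pattern LETTER = Pattern.compile("\\p{L}");

    // Inside a character class: \p{L} and \p{Nd} each add the characters
    // they normally match, so the class accepts letters and decimal digits.
    static final Pattern LETTERS_OR_DIGITS = Pattern.compile("[\\p{L}\\p{Nd}]+");

    // Hypothetical helper: true if the whole string is one letter.
    static boolean isLetter(String s) {
        return LETTER.matcher(s).matches();
    }

    // Hypothetical helper: true if the whole string is letters and/or digits.
    static boolean isLettersOrDigits(String s) {
        return LETTERS_OR_DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String aGrave = "\u00E0"; // a with grave accent as a single code point

        System.out.println(isLetter(aGrave));                // true
        System.out.println(isLettersOrDigits(aGrave + "b9")); // true
        System.out.println(isLettersOrDigits(aGrave + " 9")); // false: space is neither
    }
}
```

The same pattern applies to the other tokens on this page except \X, which the text above excludes from character classes.
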
Feature | Syntax | Description | Example | JGsoft | .NET | Java | Perl | PCRE | PCRE2 | PHP | Delphi | R | JavaScript | VBScript | XRegExp | Python | Ruby | std::regex | Boost | Tcl ARE | POSIX BRE | POSIX ERE | GNU BRE | GNU ERE | Oracle | XML | XPath |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Grapheme | \X | Matches a single Unicode grapheme, whether encoded as a single code point or multiple code points using combining marks. A grapheme most closely resembles the everyday concept of a “character”. | \X matches à encoded as U+0061 U+0300, à encoded as U+00E0, ©, etc. | YES | no | 9 | YES | 5.0 | YES | 5.0.5 | YES | YES | no | no | no | no | 2.0 | no | ECMA extended egrep awk | no | no | no | no | no | no | no | no |
Code point | \uFFFF where FFFF are 4 hexadecimal digits | Matches a specific Unicode code point. | \u00E0 matches à encoded as U+00E0 only. \u00A9 matches © | YES | YES | YES | no | no | no | no | no | no | YES | YES | YES | 3.3 2.4 string | 1.9 | ECMA | no | YES | no | no | no | no | no | no | no |
Code point | \u{FFFF} where FFFF are 1 to 4 hexadecimal digits | Matches a specific Unicode code point. | \u{E0} matches à encoded as U+00E0 only. \u{A9} matches © | V2 | no | no | no | no | no | 7.0.0 string | no | no | no | no | 3 | no | 1.9 | no | no | no | no | no | no | no | no | no | no |
Code point | \xFFFF where FFFF are 4 hexadecimal digits | Matches a specific Unicode code point. | \x00E0 matches à encoded as U+00E0 only. \x00A9 matches © | no | no | no | no | no | no | no | no | no | no | no | no | no | no | string | no | 8.4–8.5 | no | no | no | no | no | no | no |
Code point | \x{FFFF} where FFFF are 1 to 4 hexadecimal digits | Matches a specific Unicode code point. | \x{E0} matches à encoded as U+00E0 only. \x{A9} matches © | YES | no | 7 | YES | YES | YES | YES | YES | YES | no | no | no | no | no | no | ECMA extended egrep awk | no | no | no | no | no | no | no | no |
Unicode category | \pL where L is a Unicode category | Matches a single Unicode code point in the specified Unicode category. | \pL matches à encoded as U+00E0; \pS matches © | YES | no | YES | YES | 5.0 | YES | 5.0.5 | YES | YES | no | no | 3 | no | no | no | no | no | no | no | no | no | no | no | no |
Unicode category | \PL where L is a Unicode category | Matches a single Unicode code point that is not in the specified Unicode category. | \PS matches à encoded as U+00E0; \PL matches © | YES | no | YES | YES | 5.0 | YES | 5.0.5 | YES | YES | no | no | 3 | no | no | no | no | no | no | no | no | no | no | no | no |
Unicode category | \p{L} where L is a Unicode category | Matches a single Unicode code point in the specified Unicode category. | \p{L} matches à encoded as U+00E0; \p{S} matches © | YES | YES | YES | YES | 5.0 | YES | 5.0.5 | YES | YES | no | no | YES | no | 1.9 | no | no | no | no | no | no | no | no | YES | YES |
Unicode category | \p{IsL} where L is a Unicode category | Matches a single Unicode code point in the specified Unicode category. | \p{IsL} matches à encoded as U+00E0; \p{IsS} matches © | YES | no | YES | YES | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no |
Unicode category | \p{Category} | Matches a single Unicode code point in the specified Unicode category. | \p{Letter} matches à encoded as U+00E0; \p{Symbol} matches © | YES | no | no | YES | no | no | no | no | no | no | no | YES | no | 1.9 | no | no | no | no | no | no | no | no | no | no |
Unicode category | \p{IsCategory} | Matches a single Unicode code point in the specified Unicode category. | \p{IsLetter} matches à encoded as U+00E0; \p{IsSymbol} matches © | YES | no | no | YES | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no |
Unicode script | \p{Script} | Matches a single Unicode code point that is part of the specified Unicode script. Each Unicode code point is part of exactly one script. Scripts never contain unassigned code points. | \p{Greek} matches Ω | YES | no | no | YES | 6.5 | YES | 5.1.3 | YES | YES | no | no | YES | no | 1.9 | no | no | no | no | no | no | no | no | no | no |
Unicode script | \p{IsScript} | Matches a single Unicode code point that is part of the specified Unicode script. Each Unicode code point is part of exactly one script. Scripts never contain unassigned code points. | \p{IsGreek} matches Ω | YES | no | 7 | YES | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no |
Unicode block | \p{Block} | Matches a single Unicode code point that is part of the specified Unicode block. Each Unicode code point is part of exactly one block. Blocks may contain unassigned code points. | \p{Arrows} matches any of the code points from U+2190 through U+21FF (← through ⇿) | YES | no | no | YES | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no |
Unicode block | \p{InBlock} | Matches a single Unicode code point that is part of the specified Unicode block. Each Unicode code point is part of exactly one block. Blocks may contain unassigned code points. | \p{InArrows} matches any of the code points from U+2190 through U+21FF (← through ⇿) | YES | no | YES | YES | no | no | no | no | no | no | no | 2–4 | no | 2.0 | no | no | no | no | no | no | no | no | no | no |
Unicode block | \p{IsBlock} | Matches a single Unicode code point that is part of the specified Unicode block. Each Unicode code point is part of exactly one block. Blocks may contain unassigned code points. | \p{IsArrows} matches any of the code points from U+2190 through U+21FF (← through ⇿) | YES | YES | no | YES | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | no | YES | YES |
Negated Unicode property | \P{Property} | Matches a single Unicode code point that does not have the specified property (category, script, or block). | \P{L} matches © | YES | YES | YES | YES | 5.0 | YES | 5.0.5 | YES | YES | no | no | YES | no | 1.9 | no | ECMA extended egrep awk | no | no | no | no | no | no | YES | YES |
Negated Unicode property | \p{^Property} | Matches a single Unicode code point that does not have the specified property (category, script, or block). | \p{^L} matches © | YES | no | no | YES | 5.0 | YES | 5.0.5 | YES | YES | no | no | YES | no | 1.9 | no | no | no | no | no | no | no | no | no | no |
Unicode property | \P{^Property} | Matches a single Unicode code point that does have the specified property (category, script, or block). Double negative is taken as positive. | \P{^L} matches q | V2 | no | no | YES | 5.0 | YES | 5.0.5 | YES | YES | no | no | no | no | 1.9 | no | no | no | no | no | no | no | no | no | no |
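
A few of the rows above can be tried out with a short Java sketch. It uses only syntax the table marks as supported by Java, and it assumes Java 9 or later because of \X. The class name `UnicodeTokenDemo` and the helpers `fullMatch` and `countMatches` are hypothetical names for this illustration, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeTokenDemo {
    // Hypothetical helper: true if the regex matches the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: number of successive matches found in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String precomposed = "\u00E0"; // a-grave as one code point
        String decomposed = "a\u0300"; // letter a plus combining grave accent

        // Code point escapes match the single-code-point encoding only.
        System.out.println(fullMatch("\\u00E0", precomposed)); // true
        System.out.println(fullMatch("\\x{E0}", precomposed)); // true
        System.out.println(fullMatch("\\u00E0", decomposed));  // false

        // The grapheme token treats either encoding as one unit (Java 9+).
        System.out.println(countMatches("\\X", precomposed)); // 1
        System.out.println(countMatches("\\X", decomposed));  // 1

        // Category, script, and block tokens test one code point at a time.
        System.out.println(countMatches("\\p{L}", decomposed));    // 1: only the base letter
        System.out.println(fullMatch("\\p{IsGreek}", "\u03A9"));   // true: Greek capital omega
        System.out.println(fullMatch("\\p{InArrows}", "\u2190"));  // true: leftwards arrow
        System.out.println(fullMatch("\\P{L}", "\u00A9"));         // true: copyright sign is not a letter
    }
}
```

Other flavors in the table accept different spellings for the same properties, so the pattern strings here are specific to Java and would need adjusting elsewhere.
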
Page URL: https://www.regular-expressions.info/refunicode.html
Page last updated: 13 August 2021
Site last updated: 06 November 2024
Copyright © 2003-2024 Jan Goyvaerts. All rights reserved.