Oniguruma Regular Expressions Version 6.9.4    2019/10/31

syntax: ONIG_SYNTAX_ONIGURUMA (default)


1. Syntax elements

  \       escape (enable or disable meta character)
  |       alternation
  (...)   group
  [...]   character class


2. Characters

  \t              horizontal tab         (0x09)
  \v              vertical tab           (0x0B)
  \n              newline (line feed)    (0x0A)
  \r              carriage return        (0x0D)
  \b              backspace              (0x08)
  \f              form feed              (0x0C)
  \a              bell                   (0x07)
  \e              escape                 (0x1B)
  \nnn            octal char             (encoded byte value)
  \o{17777777777} wide octal char        (character code point value)
  \uHHHH          wide hexadecimal char  (character code point value)
  \xHH            hexadecimal char       (encoded byte value)
  \x{7HHHHHHH}    wide hexadecimal char  (character code point value)
  \cx             control char           (character code point value)
  \C-x            control char           (character code point value)
  \M-x            meta  (x|0x80)         (character code point value)
  \M-\C-x         meta control char      (character code point value)

 (* \b as backspace is effective in character class only)


3. Character types

  .        any character (except newline)

  \w       word character

           Not Unicode:
             alphanumeric, "_" and multibyte char.

           Unicode:
             General_Category -- (Letter|Mark|Number|Connector_Punctuation)

  \W       non-word char

  \s       whitespace char

           Not Unicode:
             \t, \n, \v, \f, \r, \x20

           Unicode case:
             U+0009, U+000A, U+000B, U+000C, U+000D, U+0085(NEL),
             General_Category -- Line_Separator
                              -- Paragraph_Separator
                              -- Space_Separator

  \S       non-whitespace char

  \d       decimal digit char

           Unicode: General_Category -- Decimal_Number

  \D       non-decimal-digit char

  \h       hexadecimal digit char   [0-9a-fA-F]

  \H       non-hexdigit char

  \R       general newline  (* can't be used in character-class)
           "\r\n" or \n,\v,\f,\r  (* but doesn't backtrack from \r\n to \r)

           Unicode case:
             "\r\n" or \n,\v,\f,\r or U+0085, U+2028, U+2029

  \N       negative newline  (?-m:.)

  \O       true anychar      (?m:.)
           (* original function)

  \X       Text Segment    \X === (?>\O(?:\Y\O)*)

           The meaning of this operator changes depending on the setting of
           the option (?y{..}).

           \X doesn't check whether the matching start position is a
           boundary or not. Write \y\X if you want to ensure that it is.

           [Extended Grapheme Cluster mode] (default)
             Unicode case:
               See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]

             Not Unicode case:  \X === (?>\r\n|\O)

           [Word mode]
             Currently, this mode is supported in Unicode only.
             See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]


  Character Property

    * \p{property-name}
    * \p{^property-name}    (negative)
    * \P{property-name}     (negative)

    property-name:

     + works on all encodings
       Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
       Print, Punct, Space, Upper, XDigit, Word, ASCII

     + works on EUC_JP, Shift_JIS
       Hiragana, Katakana

     + works on UTF8, UTF16, UTF32
       See doc/UNICODE_PROPERTIES.



4. Quantifier

  greedy

    ?       1 or 0 times
    *       0 or more times
    +       1 or more times
    {n,m}   (n <= m) at least n but no more than m times
    {n,}    at least n times
    {,n}    at least 0 but no more than n times ({0,n})
    {n}     n times

  reluctant

    ??      0 or 1 times
    *?      0 or more times
    +?      1 or more times
    {n,m}?  (n <= m) at least n but not more than m times
    {n,}?   at least n times
    {,n}?   at least 0 but not more than n times (== {0,n}?)

  possessive (greedy and does not backtrack once matched)

    ?+      1 or 0 times
    *+      0 or more times
    ++      1 or more times
    {n,m}   (n > m) at least m but not more than n times

    {n,m}+, {n,}+, {n}+ are possessive operators in ONIG_SYNTAX_JAVA and
    ONIG_SYNTAX_PERL only.

    ex. /a*+/ === /(?>a*)/

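  The difference between the greedy and possessive forms can be sketched
  in Ruby. This is an illustration only: it assumes Ruby's Regexp engine
  (Onigmo, a fork of Oniguruma) treats *+ and (?>...) as described above.
  The constant names are hypothetical.

```ruby
# Assumption: Ruby's Regexp (Onigmo) stands in for Oniguruma here.
SUBJECT = "aaa"

# Greedy: a* first takes "aaa", then gives one "a" back so the final "a" can match.
GREEDY_MATCHES = /\Aa*a\z/.match?(SUBJECT)

# Possessive: a*+ takes "aaa" and never gives anything back, so the final "a" fails.
POSSESSIVE_MATCHES = /\Aa*+a\z/.match?(SUBJECT)

# Atomic group: the equivalent spelling given in the text, /a*+/ === /(?>a*)/.
ATOMIC_MATCHES = /\A(?>a*)a\z/.match?(SUBJECT)

puts "greedy:     #{GREEDY_MATCHES}"
puts "possessive: #{POSSESSIVE_MATCHES}"
puts "atomic:     #{ATOMIC_MATCHES}"
```

  Under these assumptions only the greedy pattern matches; the possessive
  and atomic patterns fail in the same way.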

5. Anchors

  ^       beginning of the line
  $       end of the line
  \b      word boundary
  \B      non-word boundary

  \A      beginning of string
  \Z      end of string, or before newline at the end
  \z      end of string
  \G      where the current search attempt begins
  \K      keep (keep start position of the result string)


  \y      Text Segment boundary
  \Y      Text Segment non-boundary

          The meaning of these operators(\y, \Y) changes depending on the setting
          of the option (?y{..}).

          [Extended Grapheme Cluster mode] (default)
            Unicode case:
              See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]

            Not Unicode:
              All positions except between \r and \n.

          [Word mode]
            Currently, this mode is supported in Unicode only.
            See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]



6. Character class

  ^...    negative class (lowest precedence)
  x-y     range from x to y
  [...]   set (character class in character class)
  ..&&..  intersection (low precedence, only higher than ^)

     ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

  * If you want to use '[', '-', or ']' as a normal character
    in character class, you should escape them with '\'.
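
  The intersection example above can be checked with a short Ruby sketch.
  It assumes Ruby's Regexp engine (Onigmo, a fork of Oniguruma) applies the
  same precedence rules; the constant names are hypothetical.

```ruby
# Assumption: Ruby's Regexp (Onigmo) stands in for Oniguruma here.
ALPHABET = ("a".."z").to_a.join

# [a-w] AND ([^c-g] OR z): '&&' binds more loosely than the set and the range.
INTERSECTED = ALPHABET.scan(/[a-w&&[^c-g]z]/).join

# The simplified class given in the text.
SIMPLIFIED = ALPHABET.scan(/[abh-w]/).join

# '[', '-' and ']' escaped with '\' so they are treated as normal characters.
BRACKETS = "a[b-c]".scan(/[\[\-\]]/)

puts INTERSECTED
puts INTERSECTED == SIMPLIFIED
p BRACKETS
```

  Ruby may print a "duplicated range" warning for the first pattern when
  warnings are enabled, because z is already covered by [^c-g].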


  POSIX bracket ([:xxxxx:], negate [:^xxxxx:])

    Not Unicode Case:

      alnum    alphabet or digit char
      alpha    alphabet
      ascii    code value: [0 - 127]
      blank    \t, \x20
      cntrl
      digit    0-9
      graph    include all of multibyte encoded characters
      lower
      print    include all of multibyte encoded characters
      punct
      space    \t, \n, \v, \f, \r, \x20
      upper
      xdigit   0-9, a-f, A-F
      word     alphanumeric, "_" and multibyte characters


    Unicode Case:

      alnum    Letter | Mark | Decimal_Number
      alpha    Letter | Mark
      ascii    0000 - 007F
      blank    Space_Separator | 0009
      cntrl    Control | Format | Unassigned | Private_Use | Surrogate
      digit    Decimal_Number
      graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
      lower    Lowercase_Letter
      print    [[:graph:]] | [[:space:]]
      punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
               Final_Punctuation | Initial_Punctuation | Other_Punctuation |
               Open_Punctuation
      space    Space_Separator | Line_Separator | Paragraph_Separator |
               U+0009 | U+000A | U+000B | U+000C | U+000D | U+0085
      upper    Uppercase_Letter
      xdigit   U+0030 - U+0039 | U+0041 - U+0046 | U+0061 - U+0066
               (0-9, a-f, A-F)
      word     Letter | Mark | Decimal_Number | Connector_Punctuation



7. Extended groups

  (?#...)            comment

  (?imxWDSPy-imxWDSP:subexp)  option on/off for subexp

                           i: ignore case
                           m: multi-line (dot (.) also matches newline)
                           x: extended form
                           W: ASCII only word (\w, \p{Word}, [[:word:]])
                              ASCII only word bound (\b)
                           D: ASCII only digit (\d, \p{Digit}, [[:digit:]])
                           S: ASCII only space (\s, \p{Space}, [[:space:]])
                           P: ASCII only POSIX properties (includes W,D,S)
                              (alnum, alpha, blank, cntrl, digit, graph,
                               lower, print, punct, space, upper, xdigit, word)

                           y{?}: Text Segment mode
                              This option changes the meaning of \X, \y, \Y.
                              Currently, this option is supported in Unicode only.

                              y{g}: Extended Grapheme Cluster mode (default)
                              y{w}: Word mode
                                    See [Unicode Standard Annex #29]

  (?imxWDSPy-imxWDSP)  isolated option

                      * It forms a group that extends to the next ')' or to
                        the end of the pattern.
                        /ab(?i)c|def|gh/ == /ab(?i:c|def|gh)/


  (?:subexp)         non-capturing group
  (subexp)           capturing group

  (?=subexp)         look-ahead
  (?!subexp)         negative look-ahead
  (?<=subexp)        look-behind
  (?<!subexp)        negative look-behind

                     Subexp of look-behind must be fixed-width.
                     But top-level alternatives can be of various lengths.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In negative look-behind, a capturing group isn't allowed,
                     but a non-capturing group (?:) is allowed.

                     * In look-behind and negative look-behind, support for
                       the ignore-case option is limited. Only conversion
                       between single characters is supported. (Conversion
                       of multiple characters in Unicode is not supported.)

  (?>subexp)         atomic group
                     no backtracks in subexp.

  (?<name>subexp), (?'name'subexp)
                     define named group
                     (Each character of the name must be a word character.)

                     Not only a name but also a number is assigned, as for a
                     capturing group.

                     Assigning the same name to two or more subexps is allowed.


  <Callouts>

  * Callouts of contents
  (?{...contents...})           callout in progress
  (?{...contents...}D)          D is a direction flag char
                                D = 'X': in progress and retraction
                                    '<': in retraction only
                                    '>': in progress only
  (?{...contents...}[tag])      tag assigned
  (?{...contents...}[tag]D)

                                * Escape characters have no effects in contents.
                                * contents is not allowed to start with '{'.

  (?{{{...contents...}}})       A run of n consecutive '}' characters is
                                allowed in contents when the callout is
                                delimited by (n+1) consecutive braces
                                {{{...}}}.

    Allowed tag string characters: _ A-Z a-z 0-9 (* first character: _ A-Z a-z)


  * Callouts of name
  (*name)
  (*name{args...})         with args
  (*name[tag])             tag assigned
  (*name[tag]{args...})

    Allowed name string characters: _ A-Z a-z 0-9 (* first character: _ A-Z a-z)
    Allowed tag  string characters: _ A-Z a-z 0-9 (* first character: _ A-Z a-z)


  <Absent functions>

  (?~absent)         Absent repeater    (* proposed by Tanaka Akira)
                     This works like .* (more precisely \O*), but it is
                     limited to the range that does not include a string
                     matching <absent>.
                     This is an abbreviation of (?~|(?:absent)|\O*).
                     \O* is used as a repeater.

  (?~|absent|exp)    Absent expression  (* original)
                     This works like "exp", but it is limited to the range
                     that does not include a string matching <absent>.

                     ex. (?~|345|\d*)  "12345678"  ==> "12", "1", ""

  (?~|absent)        Absent stopper (* original)
                     After this operator has been passed, the right end of
                     the string range is limited to the point that does not
                     include a string matching <absent>.

  (?~|)              Range clear
                     Clear the effects caused by Absent stoppers.

     * Nested Absent functions are not supported and the behavior
       is undefined.


  <if-then-else>

  (?(condition_exp)then_exp|else_exp)    if-then-else
  (?(condition_exp)then_exp)             if-then

               condition_exp can be a backreference number/name or a normal
               regular expression.
               When condition_exp is a backreference number/name, both then_exp
               and else_exp can be omitted.
               In that case it works as a backreference validity checker.

  [ Backreference validity checker ]   (* original)

    (?(n)), (?(-n)), (?(+n)), (?(n+level)) ...
    (?(<n>)), (?('-n')), (?(<+n>)) ...
    (?(<name>)), (?('name')), (?(<name+level>)) ...

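  The if-then-else form with a backreference number as the condition can be
  sketched in Ruby. This assumes Ruby's Regexp engine (Onigmo, a fork of
  Oniguruma) handles (?(n)then|else) as described above; the pattern and
  constant names are hypothetical.

```ruby
# Assumption: Ruby's Regexp (Onigmo) stands in for Oniguruma here.
# If group 1 (an opening '<') matched, require a closing '>';
# otherwise require a trailing ';'.
TAG_OR_STATEMENT = /\A(<)?\w+(?(1)>|;)\z/

# if-then without an else branch: a ')' is required only when '(' was seen.
OPTIONAL_PARENS = /\A(\()?\d+(?(1)\))\z/

CHECKS = {
  "<abc>" => TAG_OR_STATEMENT.match?("<abc>"),
  "abc;"  => TAG_OR_STATEMENT.match?("abc;"),
  "<abc;" => TAG_OR_STATEMENT.match?("<abc;"),
  "(123)" => OPTIONAL_PARENS.match?("(123)"),
  "123"   => OPTIONAL_PARENS.match?("123"),
  "(123"  => OPTIONAL_PARENS.match?("(123")
}

CHECKS.each { |input, matched| puts "#{input.inspect}: #{matched}" }
```

  The inputs with a consistent pair of delimiters match; "<abc;" and "(123"
  do not, because the then-branch is selected once group 1 has matched.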

8. Backreferences

  When we say "backreference a group," it actually means, "re-match the same
  text matched by the subexp in that group."

  \n  \k<n>     \k'n'     (n >= 1) backreference the nth group in the regexp
      \k<-n>    \k'-n'    (n >= 1) backreference the nth group counting
                          backwards from the referring position
      \k<+n>    \k'+n'    (n >= 1) backreference the nth group counting
                          forwards from the referring position
      \k<name>  \k'name'  backreference a group with the specified name

  When backreferencing with a name that is assigned to more than one group,
  the last group with the name is checked first; if it did not match, then
  the previous one with the name, and so on, until there is a match.

  * Backreference by number is forbidden if any named group is defined and
    ONIG_OPTION_CAPTURE_GROUP is not set.


  backreference with recursion level

    (n >= 1, level >= 0)

    \k<n+level>     \k'n+level'
    \k<n-level>     \k'n-level'

    \k<name+level>  \k'name+level'
    \k<name-level>  \k'name-level'

    Designates a group on the recursion level relative to the referring
    position.

    ex 1.

      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b>))\z/.match("reee")
      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")

      \k<b+0> refers to the (?<b>.) on the same recursion level as itself.

    ex 2.

      r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
      (?<element> \g<stag> \g<content>* \g<etag> ){0}
      (?<stag> < \g<name> \s* > ){0}
      (?<name> [a-zA-Z_:]+ ){0}
      (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
      (?<etag> </ \k<name+1> >){0}
      \g<element>
      __REGEXP__

      p r.match("<foo>f<bar>bbb</bar>f</foo>").captures


9. Subexp calls ("Tanaka Akira special")   (* original function)

  When we say "call a group," it actually means, "re-execute the subexp in
  that group."

  \g<n>     \g'n'     (n >= 1) call the nth group
  \g<0>     \g'0'     call zero (call the total regexp)
  \g<-n>    \g'-n'    (n >= 1) call the nth group counting backwards from
                      the calling position
  \g<+n>    \g'+n'    (n >= 1) call the nth group counting forwards from
                      the calling position
  \g<name>  \g'name'  call the group with the specified name

  * Left-most recursive calls are not allowed.

    ex. (?<name>a|\g<name>b)    => error
        (?<name>a|b\g<name>c)   => OK

  * Calls with a name that is assigned to more than one group are not
    allowed.

  * Call by number is forbidden if any named group is defined and
    ONIG_OPTION_CAPTURE_GROUP is not set.

  * The option status of the called group is always effective.

    ex. /(?-i:\g<name>)(?i:(?<name>a)){0}/.match("A")


10. Captured group

  Behavior of an unnamed group (...) changes with the following conditions.
  (But named group is not changed.)

  case 1. /.../     (named group is not used, no option)

     (...) is treated as a capturing group.

  case 2. /.../g    (named group is not used, 'g' option)

     (...) is treated as a non-capturing group (?:...).

  case 3. /..(?<name>..)../   (named group is used, no option)

     (...) is treated as a non-capturing group.
     numbered-backref/call is not allowed.

  case 4. /..(?<name>..)../G  (named group is used, 'G' option)

     (...) is treated as a capturing group.
     numbered-backref/call is allowed.

  where
    g: ONIG_OPTION_DONT_CAPTURE_GROUP
    G: ONIG_OPTION_CAPTURE_GROUP

  ('g' and 'G' options were discussed on the ruby-dev ML)



-----------------------------
A-1. Syntax-dependent options

   + ONIG_SYNTAX_ONIGURUMA
     (?m): dot (.) also matches newline

   + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
     (?s): dot (.) also matches newline
     (?m): ^ matches after newline, $ matches before newline

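   The ONIG_SYNTAX_ONIGURUMA meaning of (?m) can be sketched in Ruby. This
   assumes Ruby's Regexp engine (Onigmo, a fork of Oniguruma) gives (?m) the
   same "dot also matches newline" meaning; the constant names are
   hypothetical.

```ruby
# Assumption: Ruby's Regexp (Onigmo) stands in for Oniguruma here.
TEXT = "a\nb"

# Without the option, dot does not match the newline between "a" and "b".
PLAIN_DOT = /a.b/.match?(TEXT)

# With the m option on the whole pattern, dot also matches newline.
OPTION_M = /a.b/m.match?(TEXT)

# The same option switched on for a single subexp only.
INLINE_M = /a(?m:.)b/.match?(TEXT)

puts "plain dot: #{PLAIN_DOT}"
puts "/m:        #{OPTION_M}"
puts "(?m:.):    #{INLINE_M}"
```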

A-2. Original extensions

   + hexadecimal digit char type  \h, \H
   + true anychar \O
   + text segment boundary \y, \Y
   + backreference validity checker (?(...))
   + named group                  (?<name>...), (?'name'...)
   + named backref                \k<name>
   + subexp call                  \g<name>, \g<group-num>
   + absent expression            (?~|...|...)
   + absent stopper               (?~|...)


A-3. Missing features compared with perl 5.8.0

   + \N{name}
   + \l,\u,\L,\U,\C
   + (??{code})

   * \Q...\E
     This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.


A-4. Differences with Japanized GNU regex(version 0.12) of Ruby 1.8

   + add character property (\p{property}, \P{property})
   + add hexadecimal digit char type (\h, \H)
   + add look-behind
     (?<=fixed-width-pattern), (?<!fixed-width-pattern)
   + add possessive quantifier. ?+, *+, ++
   + add operations in character class. [], &&
     ('[' must be escaped to be used as a normal char in a character class.)
   + add named group and subexp call.
   + octal or hexadecimal number sequence can be treated as
     a multibyte code char in character class if multibyte encoding
     is specified.
     (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
   + allow the range of single byte char and multibyte char in character
     class.
     ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
   + effect range of isolated option is to next ')'.
     ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
   + isolated option is not transparent to previous pattern.
     ex. a(?i)* is a syntax error pattern.
   + allowed unpaired left brace as a normal character.
     ex. /{/, /({)/, /a{2,3/ etc...
   + negative POSIX bracket [:^xxxx:] is supported.
   + POSIX bracket [:ascii:] is added.
   + repeat of look-ahead is not allowed.
     ex. /(?=a)*/, /(?!b){5}/
   + Ignore case option is also effective for escape sequences.
     ex. /\x61/i =~ "A"
   + In the range quantifier, the number of the minimum is optional.
     /a{,n}/ == /a{0,n}/
     The omission of both minimum and maximum values is not allowed.
     /a{,}/
   + /{n}?/ is not a reluctant quantifier.
     /a{n}?/ == /(?:a{n})?/
   + invalid back reference is checked and raises an error.
     /\1/, /(a)\2/
   + Zero-width match in an infinite loop stops the repeat,
     then changes of the capture group status are checked as the stop
     condition.
     /(?:()|())*\1\2/ =~ ""
     /(?:\1a|())*/ =~ "a"

// END