Oniguruma Regular Expressions Version 5.9.1    2007/09/05

syntax: ONIG_SYNTAX_RUBY (default)


1. Syntax elements

  \       escape (enable or disable meta character meaning)
  |       alternation
  (...)   group
  [...]   character class


2. Characters

  \t           horizontal tab (0x09)
  \v           vertical tab   (0x0B)
  \n           newline        (0x0A)
  \r           return         (0x0D)
  \b           back space     (0x08)
  \f           form feed      (0x0C)
  \a           bell           (0x07)
  \e           escape         (0x1B)
  \nnn         octal char            (encoded byte value)
  \xHH         hexadecimal char      (encoded byte value)
  \x{7HHHHHHH} wide hexadecimal char (character code point value)
  \cx          control char          (character code point value)
  \C-x         control char          (character code point value)
  \M-x         meta  (x|0x80)        (character code point value)
  \M-\C-x      meta control char     (character code point value)

 (* \b is effective in character class [...] only)


3. Character types

  .        any character (except newline)

  \w       word character

           Not Unicode:
             alphanumeric, "_" and multibyte char.

           Unicode:
             General_Category -- (Letter|Mark|Number|Connector_Punctuation)

  \W       non word char

  \s       whitespace char

           Not Unicode:
             \t, \n, \v, \f, \r, \x20

           Unicode:
             0009, 000A, 000B, 000C, 000D, 0085(NEL),
             General_Category -- Line_Separator
                              -- Paragraph_Separator
                              -- Space_Separator

  \S       non whitespace char

  \d       decimal digit char

           Unicode: General_Category -- Decimal_Number

  \D       non decimal digit char

  \h       hexadecimal digit char   [0-9a-fA-F]

  \H       non hexadecimal digit char


  Character Property

    * \p{property-name}
    * \p{^property-name}    (negative)
    * \P{property-name}     (negative)

    property-name:

     + works on all encodings
       Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
       Print, Punct, Space, Upper, XDigit, Word, ASCII

     + works on EUC_JP, Shift_JIS
       Hiragana, Katakana

     + works on UTF8, UTF16, UTF32
       Any, Assigned, C, Cc, Cf, Cn, Co, Cs, L, Ll, Lm, Lo, Lt, Lu,
       M, Mc, Me, Mn, N, Nd, Nl, No, P, Pc, Pd, Pe, Pf, Pi, Po, Ps,
       S, Sc, Sk, Sm, So, Z, Zl, Zp, Zs,
       Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese,
       Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic,
       Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian,
       Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul,
       Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana,
       Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam,
       Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian,
       Oriya, Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac,
       Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan,
       Tifinagh, Ugaritic, Yi



4. Quantifier

  greedy

    ?       1 or 0 times
    *       0 or more times
    +       1 or more times
    {n,m}   at least n but not more than m times
    {n,}    at least n times
    {,n}    at least 0 but not more than n times ({0,n})
    {n}     n times

  reluctant

    ??      1 or 0 times
    *?      0 or more times
    +?      1 or more times
    {n,m}?  at least n but not more than m times
    {n,}?   at least n times
    {,n}?   at least 0 but not more than n times (== {0,n}?)

  possessive (greedy, and does not backtrack once the repetition has matched)

    ?+      1 or 0 times
    *+      0 or more times
    ++      1 or more times

    ({n,m}+, {n,}+, {n}+ are possessive operators in ONIG_SYNTAX_JAVA only)

    ex. /a*+/ === /(?>a*)/


5. Anchors

  ^       beginning of the line
  $       end of the line
  \b      word boundary
  \B      not word boundary
  \A      beginning of string
  \Z      end of string, or before newline at the end
  \z      end of string
  \G      matching start position


6. Character class

  ^...    negative class (lowest precedence operator)
  x-y     range from x to y
  [...]   set (character class in character class)
  ..&&..  intersection (low precedence, next above ^)

    ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

  * To use '[', '-' or ']' as a normal character in a character class,
    escape it with '\'.
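
The possessive-quantifier and character-class rules above can be tried from
Ruby, whose built-in Regexp engine descends from Oniguruma. The following is
a minimal sketch and not part of the original reference. The variable names
are illustrative, and the behavior is assumed to follow the Ruby syntax
described in this document.

```ruby
# Intersection: [a-w] AND ([^c-g] OR z), which reduces to a, b and h-w.
intersection = /[a-w&&[^c-g]z]/
matched = ('a'..'z').select { |ch| intersection.match?(ch) }.join
puts matched                          # abhijklmnopqrstuvw

# Escaped '[', '-' and ']' are ordinary members of the class.
brackets = /[\[\-\]]/
bracket_hits = "a[b-c]".scan(brackets)
p bracket_hits                        # ["[", "-", "]"]

# Possessive vs. greedy: a*+ never gives characters back, so a trailing "a"
# has nothing left to match. The atomic group (?>a*) behaves the same way.
greedy_pos     = /a*a/ =~ "aaa"       # 0   (a* backtracks by one "a")
possessive_pos = /a*+a/ =~ "aaa"      # nil
atomic_pos     = /(?>a*)a/ =~ "aaa"   # nil
p [greedy_pos, possessive_pos, atomic_pos]
```
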


  POSIX bracket ([:xxxxx:], negate [:^xxxxx:])

    Not Unicode Case:

      alnum    alphabet or digit char
      alpha    alphabet
      ascii    code value: [0 - 127]
      blank    \t, \x20
      cntrl
      digit    0-9
      graph    includes all multibyte encoded characters
      lower
      print    includes all multibyte encoded characters
      punct
      space    \t, \n, \v, \f, \r, \x20
      upper
      xdigit   0-9, a-f, A-F
      word     alphanumeric, "_" and multibyte characters


    Unicode Case:

      alnum    Letter | Mark | Decimal_Number
      alpha    Letter | Mark
      ascii    0000 - 007F
      blank    Space_Separator | 0009
      cntrl    Control | Format | Unassigned | Private_Use | Surrogate
      digit    Decimal_Number
      graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
      lower    Lowercase_Letter
      print    [[:graph:]] | [[:space:]]
      punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
               Final_Punctuation | Initial_Punctuation | Other_Punctuation |
               Open_Punctuation
      space    Space_Separator | Line_Separator | Paragraph_Separator |
               0009 | 000A | 000B | 000C | 000D | 0085
      upper    Uppercase_Letter
      xdigit   0030 - 0039 | 0041 - 0046 | 0061 - 0066
               (0-9, a-f, A-F)
      word     Letter | Mark | Decimal_Number | Connector_Punctuation



7. Extended groups

  (?#...)            comment

  (?imx-imx)         option on/off
                         i: ignore case
                         m: multi-line (dot(.) match newline)
                         x: extended form
  (?imx-imx:subexp)  option on/off for subexp

  (?:subexp)         not captured group
  (subexp)           captured group

  (?=subexp)         look-ahead
  (?!subexp)         negative look-ahead
  (?<=subexp)        look-behind
  (?<!subexp)        negative look-behind

                     The subexp of a look-behind must have a fixed
                     character length. Alternatives of different lengths
                     are allowed at the top level only.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In a negative look-behind, a captured group is not
                     allowed, but a shy group (?:) is.
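
The look-behind length rule above can be checked from Ruby. This is a minimal
sketch and not part of the original reference. It assumes Ruby's Regexp
engine keeps the restriction described here; the nested case is rescued
rather than assumed to fail, because an engine that relaxes the rule would
compile it. All variable names are illustrative.

```ruby
# Top-level alternatives of different lengths are accepted in look-behind.
top_level = Regexp.new('(?<=a|bc)x')
after_bc = top_level =~ "bcx"    # 2, "x" is preceded by "bc"
after_a  = top_level =~ "ax"     # 1, "x" is preceded by "a"

# A variable-length alternative below the top level is expected to be
# rejected when the pattern is compiled. Regexp.new signals that with
# RegexpError, so the message is captured instead of letting it propagate.
nested_source = '(?<=aaa(?:b|cd))x'
nested_error =
  begin
    Regexp.new(nested_source)
    nil
  rescue RegexpError => e
    e.message
  end

puts "top-level alternatives: #{[after_bc, after_a].inspect}"
puts "nested alternatives:    #{nested_error || 'compiled without error'}"
```
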

  (?>subexp)         atomic group
                     does not backtrack into subexp.

  (?<name>subexp), (?'name'subexp)
                     define named group
                     (All characters of the name must be word characters.)

                     A number is assigned as well as a name, as for a
                     captured group.

                     Assigning the same name to two or more subexps is
                     allowed. In that case a back reference is possible,
                     but a subexp call is not.


8. Back reference

  \n          back reference by group number (n >= 1)
  \k<n>       back reference by group number (n >= 1)
  \k'n'       back reference by group number (n >= 1)
  \k<-n>      back reference by relative group number (n >= 1)
  \k'-n'      back reference by relative group number (n >= 1)
  \k<name>    back reference by group name
  \k'name'    back reference by group name

  When a back reference uses a name that is defined more than once,
  the subexp with the largest group number is referred to preferentially.
  (If it has not matched, a group with a smaller number is referred to.)

  * Back reference by group number is forbidden if a named group is defined
    in the pattern and ONIG_OPTION_CAPTURE_GROUP is not set.


  back reference with nest level

    level: 0, 1, 2, ...

    \k<n+level>     (n >= 1)
    \k<n-level>     (n >= 1)
    \k'n+level'     (n >= 1)
    \k'n-level'     (n >= 1)

    \k<name+level>
    \k<name-level>
    \k'name+level'
    \k'name-level'

    Designates the nest level relative to the back reference position.

    ex 1.

      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")

    ex 2.

      r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
      (?<element> \g<stag> \g<content>* \g<etag> ){0}
      (?<stag> < \g<name> \s* > ){0}
      (?<name> [a-zA-Z_:]+ ){0}
      (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
      (?<etag> </ \k<name+1> >){0}
      \g<element>
      __REGEXP__

      p r.match('<foo>f<bar>bbb</bar>f</foo>').captures



9. Subexp call ("Tanaka Akira special")

  \g<name>    call by group name
  \g'name'    call by group name
  \g<n>       call by group number (n >= 1)
  \g'n'       call by group number (n >= 1)
  \g<-n>      call by relative group number (n >= 1)
  \g'-n'      call by relative group number (n >= 1)

  * A left-most recursive call is not allowed.
     ex. (?<name>a|\g<name>b)   => error
         (?<name>a|b\g<name>c)  => OK

  * Call by group number is forbidden if a named group is defined in the
    pattern and ONIG_OPTION_CAPTURE_GROUP is not set.

  * If the option status of the called group differs from that at the
    calling position, the group's option is effective.

    ex. (?-i:\g<name>)(?i:(?<name>a)){0}  matches "A"


10. Captured group

  The behavior of an unnamed group (...) changes under the following
  conditions. (Named groups are not affected.)

  case 1. /.../     (named group is not used, no option)

     (...) is treated as a captured group.

  case 2. /.../g    (named group is not used, 'g' option)

     (...) is treated as a non-captured group (?:...).

  case 3. /..(?<name>..)../   (named group is used, no option)

     (...) is treated as a non-captured group (?:...).
     numbered-backref/call is not allowed.

  case 4. /..(?<name>..)../G  (named group is used, 'G' option)

     (...) is treated as a captured group.
     numbered-backref/call is allowed.

  where
    g: ONIG_OPTION_DONT_CAPTURE_GROUP
    G: ONIG_OPTION_CAPTURE_GROUP

  ('g' and 'G' options are under discussion on the ruby-dev ML)



-----------------------------
A-1. Syntax-dependent options

   + ONIG_SYNTAX_RUBY
     (?m): dot(.) match newline

   + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
     (?s): dot(.) match newline
     (?m): ^ match after newline, $ match before newline


A-2. Original extensions

   + hexadecimal digit char type  \h, \H
   + named group                  (?<name>...), (?'name'...)
   + named backref                \k<name>
   + subexp call                  \g<name>, \g<group-num>


A-3. Missing features compared with Perl 5.8.0

   + \N{name}
   + \l,\u,\L,\U, \X, \C
   + (?{code})
   + (??{code})
   + (?(condition)yes-pat|no-pat)

   * \Q...\E
     This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.


A-4. Differences from Japanized GNU regex (version 0.12) of Ruby 1.8

   + add character property (\p{property}, \P{property})
   + add hexadecimal digit char type (\h, \H)
   + add look-behind
     (?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
   + add possessive quantifier. ?+, *+, ++
   + add operations in character class. [], &&
     ('[' must be escaped to be used as a usual char in character class.)
   + add named group and subexp call.
   + an octal or hexadecimal number sequence can be treated as
     a multibyte code char in character class if multibyte encoding
     is specified.
     (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
   + a range between a single byte char and a multibyte char is allowed
     in character class.
     ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
   + the effect of an isolated option extends to the next ')'.
     ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
   + an isolated option is not transparent to the preceding pattern.
     ex. a(?i)* is a syntax error pattern.
   + an incomplete left brace is allowed as a usual string.
     ex. /{/, /({)/, /a{2,3/ etc...
   + negative POSIX bracket [:^xxxx:] is supported.
   + POSIX bracket [:ascii:] is added.
   + repeat of look-ahead is not allowed.
     ex. /(?=a)*/, /(?!b){5}/
   + the ignore-case option is effective on characters given by number.
     ex. /\x61/i =~ "A"
   + in a range quantifier, the minimum may be omitted.
     /a{,n}/ == /a{0,n}/
     Omitting both the minimum and the maximum is not allowed. (/a{,}/)
   + /a{n}?/ is not a non-greedy operator.
     /a{n}?/ == /(?:a{n})?/
   + an invalid back reference is checked and causes an error.
     /\1/, /(a)\2/
   + a zero-length match in an infinite repeat stops the repeat;
     changes in the capture group status are then checked as the stop
     condition.
     /(?:()|())*\1\2/ =~ ""
     /(?:\1a|())*/ =~ "a"


A-5. Functions disabled in the default syntax

   + capture history

     (?@...) and (?@<name>...)

     ex. /(?@a)*/.match("aaa") ==> [<0-1>, <1-2>, <2-3>]

     see sample/listcap.c file.


A-6. Problems

   + Invalid encoding byte sequences are not checked.

     ex. UTF-8

     * An invalid first byte is treated as a character.
       /./u =~ "\xa3"

     * An incomplete byte sequence is not checked.
       /\w+/ =~ "a\xf3\x8ec"

// END