Oniguruma Regular Expressions Version 4.3.0    2006/08/17

syntax: ONIG_SYNTAX_RUBY (default)


1. Syntax elements

  \       escape (enable or disable meta character meaning)
  |       alternation
  (...)   group
  [...]   character class


2. Characters

  \t           horizontal tab (0x09)
  \v           vertical tab   (0x0B)
  \n           newline        (0x0A)
  \r           return         (0x0D)
  \b           back space     (0x08)
  \f           form feed      (0x0C)
  \a           bell           (0x07)
  \e           escape         (0x1B)
  \nnn         octal char            (encoded byte value)
  \xHH         hexadecimal char      (encoded byte value)
  \x{7HHHHHHH} wide hexadecimal char (character code point value)
  \cx          control char          (character code point value)
  \C-x         control char          (character code point value)
  \M-x         meta  (x|0x80)        (character code point value)
  \M-\C-x      meta control char     (character code point value)

 (* \b is effective in character class [...] only)


3. Character types

  .        any character (except newline)

  \w       word character

           Not Unicode:
             alphanumeric, "_" and multibyte char.

           Unicode:
             General_Category -- (Letter|Mark|Number|Connector_Punctuation)

  \W       non word char

  \s       whitespace char

           Not Unicode:
             \t, \n, \v, \f, \r, \x20

           Unicode:
             0009, 000A, 000B, 000C, 000D, 0085(NEL),
             General_Category -- Line_Separator
                              -- Paragraph_Separator
                              -- Space_Separator

  \S       non whitespace char

  \d       decimal digit char

           Unicode: General_Category -- Decimal_Number

  \D       non decimal digit char

  \h       hexadecimal digit char   [0-9a-fA-F]

  \H       non hexadecimal digit char


4. Quantifier

  greedy

    ?       1 or 0 times
    *       0 or more times
    +       1 or more times
    {n,m}   at least n but not more than m times
    {n,}    at least n times
    {,n}    at least 0 but not more than n times ({0,n})
    {n}     n times

  reluctant

    ??      1 or 0 times
    *?      0 or more times
    +?      1 or more times
    {n,m}?  at least n but not more than m times
    {n,}?   at least n times
    {,n}?   at least 0 but not more than n times (== {0,n}?)
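
  The difference between the greedy and reluctant forms listed above can be
  sketched in Ruby. This is only an illustration: it assumes Ruby's built-in
  Regexp (which uses an Oniguruma-derived engine) follows these quantifier
  rules, and the sample string and variable names are made up for the example.

```ruby
# Assumption: Ruby's built-in Regexp follows the quantifier rules listed above.
text = "<a><b>"

# Greedy + takes as much as it can, then gives characters back until the
# rest of the pattern matches.
greedy = text[/<.+>/]

# Reluctant +? takes as little as it can, then extends only when needed.
reluctant = text[/<.+?>/]

p greedy      # "<a><b>"
p reluctant   # "<a>"

# {,n} is shorthand for {0,n}.
upto_two = "aaaa"[/a{,2}/]
p upto_two    # "aa"
```

  The greedy pattern spans both tags because ".+" first runs to the end of the
  string; the reluctant one stops at the first ">" it can reach.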

  possessive (greedy, and does not backtrack once the repeat has matched)

    ?+      1 or 0 times
    *+      0 or more times
    ++      1 or more times

    ({n,m}+, {n,}+, {n}+ are possessive operators in ONIG_SYNTAX_JAVA only)

    ex. /a*+/ === /(?>a*)/


5. Anchors

  ^       beginning of the line
  $       end of the line
  \b      word boundary
  \B      not word boundary
  \A      beginning of string
  \Z      end of string, or before newline at the end
  \z      end of string
  \G      matching start position (*)

          * Ruby Regexp:
                 previous end-of-match position
                (This specification is not related to this library.)


6. Character class

  ^...    negative class (lowest precedence operator)
  x-y     range from x to y
  [...]   set (character class in character class)
  ..&&..  intersection (next-lowest precedence, after ^)

    ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

  * If you want to use '[', '-', ']' as a normal character
    in a character class, you should escape these characters with '\'.
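
  The intersection example and the escaping rule above can be checked with a
  short Ruby sketch. It assumes Ruby's built-in Regexp accepts "&&" and nested
  sets in a character class as described here; the variable names are
  illustrative only.

```ruby
# Assumption: Ruby's built-in Regexp supports && and nested sets in a
# character class, as described above.
intersect = /\A[a-w&&[^c-g]z]\z/

# Collect every lowercase ASCII letter accepted by the class.
# Expected set per the text: a, b, and h through w.
matched = ("a".."z").select { |ch| ch =~ intersect }.join
p matched    # "abhijklmnopqrstuvw"

# '[', '-' and ']' are escaped to be used as ordinary members of a class.
brackets = "a[b]-c".scan(/[\[\-\]]/)
p brackets   # ["[", "]", "-"]
```

  Note that "z" is not in the result: it is part of the right-hand operand of
  "&&" but falls outside [a-w], so the intersection drops it.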


  POSIX bracket ([:xxxxx:], negate [:^xxxxx:])

    Not Unicode Case:

      alnum    alphabet or digit char
      alpha    alphabet
      ascii    code value: [0 - 127]
      blank    \t, \x20
      cntrl
      digit    0-9
      graph    includes all multibyte encoded characters
      lower
      print    includes all multibyte encoded characters
      punct
      space    \t, \n, \v, \f, \r, \x20
      upper
      xdigit   0-9, a-f, A-F


    Unicode Case:

      alnum    Letter | Mark | Decimal_Number
      alpha    Letter | Mark
      ascii    0000 - 007F
      blank    Space_Separator | 0009
      cntrl    Control | Format | Unassigned | Private_Use | Surrogate
      digit    Decimal_Number
      graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
      lower    Lowercase_Letter
      print    [[:graph:]] | [[:space:]]
      punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
               Final_Punctuation | Initial_Punctuation | Other_Punctuation |
               Open_Punctuation
      space    Space_Separator | Line_Separator | Paragraph_Separator |
               0009 | 000A | 000B | 000C | 000D | 0085
      upper    Uppercase_Letter
      xdigit   0030 - 0039 | 0041 - 0046 | 0061 - 0066
               (0-9, a-f, A-F)


7. Extended groups

  (?#...)            comment

  (?imx-imx)         option on/off
                         i: ignore case
                         m: multi-line (dot(.) matches newline)
                         x: extended form
  (?imx-imx:subexp)  option on/off for subexp

  (?:subexp)         not captured group
  (subexp)           captured group

  (?=subexp)         look-ahead
  (?!subexp)         negative look-ahead
  (?<=subexp)        look-behind
  (?<!subexp)        negative look-behind

                     Subexp of look-behind must be of fixed character length.
                     Different character lengths are allowed in top-level
                     alternatives only.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In negative look-behind, a captured group isn't allowed,
                     but a shy group (?:) is allowed.

  (?>subexp)         atomic group
                     does not backtrack into subexp.
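
  The look-behind restriction and the atomic group above can be tried out in
  Ruby. This is a sketch under the assumption that Ruby's built-in Regexp
  behaves as this document describes; the sample strings and variable names
  are invented for the example. Whether the nested pattern is rejected may
  depend on the engine version, so its result is printed rather than stated.

```ruby
# Assumption: Ruby's built-in Regexp implements look-behind and atomic
# groups as described above.

# Top-level alternatives of different lengths are accepted in look-behind.
after_bc = "xbcd"[/(?<=a|bc)d/]
p after_bc      # "d"

# A variable-length alternative below the top level is described as not
# allowed; compile it from a string so a rejection can be rescued.
begin
  Regexp.new('(?<=aaa(?:b|cd))e')
  nested_accepted = true
rescue RegexpError
  nested_accepted = false
end
p nested_accepted

# An atomic group keeps everything it matched: (?>a*) consumes every "a",
# so the trailing "a" can never match.
plain_pos  = ("aaa" =~ /a*a/)
atomic_pos = ("aaa" =~ /(?>a*)a/)
p plain_pos     # 0
p atomic_pos    # nil
```

  The last two lines show the practical effect of "does not backtrack": the
  ordinary pattern gives one "a" back to finish the match, while the atomic
  version fails at every start position.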

  (?<name>subexp)    define named group
                     (All characters of the name must be word characters,
                     and the first character must not be a digit or upper case.)

                     A number is assigned as well as a name, as for a
                     captured group.

                     Assigning the same name to two or more subexps is allowed.
                     In this case, a subexp call cannot be performed, although
                     a back reference is possible.


8. Back reference

  \n          back reference by group number (n >= 1)
  \k<name>    back reference by group name

  When a back reference uses a name that is defined more than once,
  the subexp with the largest group number is referred to first.
  (If it did not match, a group with a smaller number is referred to.)

  * Back reference by group number is forbidden if a named group is defined
    in the pattern and ONIG_OPTION_CAPTURE_GROUP is not set.


  back reference with nest level

    (This function is disabled in Ruby 1.9.)

    \k<name+n>     n: 0, 1, 2, ...
    \k<name-n>     n: 0, 1, 2, ...

    Designates the nest level relative to the back reference position.

    ex 1.

      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")

    ex 2.

      r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
      (?<element> \g<stag> \g<content>* \g<etag> ){0}
      (?<stag> < \g<name> \s* > ){0}
      (?<name> [a-zA-Z_:]+ ){0}
      (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
      (?<etag> </ \k<name+1> >){0}
      \g<element>
      __REGEXP__

      p r.match('<foo>f<bar>bbb</bar>f</foo>').captures



9. Subexp call ("Tanaka Akira special")

  \g<name>    call by group name
  \g<n>       call by group number (n >= 1)

  * A left-most recursive call is not allowed.
     ex. (?<name>a|\g<name>b)   => error
         (?<name>a|b\g<name>c)  => OK

  * Call by group number is forbidden if a named group is defined in the
    pattern and ONIG_OPTION_CAPTURE_GROUP is not set.

  * If the option status of the called group differs from that of the calling
    position, the group's own option is effective.

    ex. (?-i:\g<name>)(?i:(?<name>a)){0}  matches "A"


10. Captured group

  The behavior of an unnamed group (...) depends on the following conditions.
  (Named groups are not affected.)

  case 1. /.../     (named group is not used, no option)

     (...) is treated as a captured group.

  case 2. /.../g    (named group is not used, 'g' option)

     (...) is treated as a non-captured group (?:...).

  case 3. /..(?<name>..)../   (named group is used, no option)

     (...) is treated as a non-captured group (?:...).
     numbered-backref/call is not allowed.

  case 4. /..(?<name>..)../G  (named group is used, 'G' option)

     (...) is treated as a captured group.
     numbered-backref/call is allowed.

  where
    g: ONIG_OPTION_DONT_CAPTURE_GROUP
    G: ONIG_OPTION_CAPTURE_GROUP

  ('g' and 'G' options are under discussion on the ruby-dev ML)

  These options are not implemented at the Ruby level.


-----------------------------
A-1. Syntax-dependent options

   + ONIG_SYNTAX_RUBY
     (?m): dot(.) matches newline

   + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
     (?s): dot(.) matches newline
     (?m): ^ matches after newline, $ matches before newline


A-2. Original extensions

   + hexadecimal digit char type  \h, \H
   + named group                  (?<name>...)
   + named backref                \k<name>
   + subexp call                  \g<name>, \g<group-num>


A-3. Missing features compared with Perl 5.8.0

   + [:word:]
   + \N{name}
   + \l,\u,\L,\U, \X, \C
   + (?{code})
   + (??{code})
   + (?(condition)yes-pat|no-pat)

   * \Q...\E
     This is effective in ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.

   * \p{property}, \P{property}
     This is effective in ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.
     Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
     Print, Punct, Space, Upper, XDigit, ASCII are supported.

     The prefix 'Is' on a property name is allowed in ONIG_SYNTAX_PERL only.
     ex. \p{IsXDigit}.

     The property negation operator is supported in ONIG_SYNTAX_PERL only.
     \p{^...}, \P{^...}


A-4. Differences from Japanized GNU regex (version 0.12) of Ruby

   + add hexadecimal digit char type (\h, \H)
   + add look-behind
     (?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
   + add possessive quantifier. ?+, *+, ++
   + add operations in character class. [], &&
     ('[' must be escaped to be used as a normal char in character class.)
   + add named group and subexp call.
   + octal or hexadecimal number sequence can be treated as
     a multibyte code char in character class if multibyte encoding
     is specified.
     (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
   + allow a range from a single byte char to a multibyte char in character
     class.
     ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
   + the effect of an isolated option extends to the next ')'.
     ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
   + isolated option is not transparent to the previous pattern.
     ex. a(?i)* is a syntax error pattern.
   + an incomplete left brace is allowed as an ordinary string.
     ex. /{/, /({)/, /a{2,3/ etc...
   + negative POSIX bracket [:^xxxx:] is supported.
   + POSIX bracket [:ascii:] is added.
   + repeat of look-ahead is not allowed.
     ex. /(?=a)*/, /(?!b){5}/
   + Ignore case option is effective for characters given by number.
     ex. /\x61/i =~ "A"
   + In the range quantifier, the minimum can be omitted.
     /a{,n}/ == /a{0,n}/
     Omitting both the minimum and the maximum at the same time
     is not allowed. (/a{,}/)
   + /a{n}?/ is not a non-greedy operator.
     /a{n}?/ == /(?:a{n})?/
   + invalid back reference is checked and causes an error.
     /\1/, /(a)\2/
   + A zero-length match in an infinite repeat stops the repeat;
     changes in the capture group status are then checked as the stop
     condition.
     /(?:()|())*\1\2/ =~ ""
     /(?:\1a|())*/ =~ "a"


A-5. Functions disabled in the default syntax

   + capture history

     (?@...) and (?@<name>...)

     ex. /(?@a)*/.match("aaa") ==> [<0-1>, <1-2>, <2-3>]

     see sample/listcap.c file.


A-6. Problems

   + Invalid encoding byte sequence is not checked in UTF-8.

     * Invalid first byte is treated as a character.
       /./u =~ "\xa3"

     * Incomplete byte sequence is not checked.
       /\w+/ =~ "a\xf3\x8ec"

// END