Oniguruma Regular Expressions Version 6.3.0    2017/05/19

syntax: ONIG_SYNTAX_RUBY (default)


1. Syntax elements

  \       escape (enable or disable meta character)
  |       alternation
  (...)   group
  [...]   character class


2. Characters

  \t              horizontal tab         (0x09)
  \v              vertical tab           (0x0B)
  \n              newline (line feed)    (0x0A)
  \r              carriage return        (0x0D)
  \b              backspace              (0x08)
  \f              form feed              (0x0C)
  \a              bell                   (0x07)
  \e              escape                 (0x1B)
  \nnn            octal char             (encoded byte value)
  \o{17777777777} wide octal char        (character code point value)
  \xHH            hexadecimal char       (encoded byte value)
  \x{7HHHHHHH}    wide hexadecimal char  (character code point value)
  \cx             control char           (character code point value)
  \C-x            control char           (character code point value)
  \M-x            meta  (x|0x80)         (character code point value)
  \M-\C-x         meta control char      (character code point value)

 (* \b as backspace is effective in character class only)


3. Character types

  .        any character (except newline)

  \w       word character

           Not Unicode:
             alphanumeric, "_" and multibyte char.

           Unicode:
             General_Category -- (Letter|Mark|Number|Connector_Punctuation)

  \W       non-word char

  \s       whitespace char

           Not Unicode:
             \t, \n, \v, \f, \r, \x20

           Unicode:
             0009, 000A, 000B, 000C, 000D, 0085(NEL),
             General_Category -- Line_Separator
                              -- Paragraph_Separator
                              -- Space_Separator

  \S       non-whitespace char

  \d       decimal digit char

           Unicode: General_Category -- Decimal_Number

  \D       non-decimal-digit char

  \h       hexadecimal digit char   [0-9a-fA-F]

  \H       non-hexdigit char


  Character Property

    * \p{property-name}
    * \p{^property-name}    (negative)
    * \P{property-name}     (negative)

    property-name:

     + works on all encodings
       Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
       Print, Punct, Space, Upper, XDigit, Word, ASCII

     + works on EUC_JP, Shift_JIS
       Hiragana, Katakana

     + works on UTF8, UTF16, UTF32
       See doc/UNICODE_PROPERTIES.



4. Quantifier

  greedy

    ?       1 or 0 times
    *       0 or more times
    +       1 or more times
    {n,m}   at least n but no more than m times
    {n,}    at least n times
    {,n}    at least 0 but no more than n times (== {0,n})
    {n}     n times

  reluctant

    ??      1 or 0 times
    *?      0 or more times
    +?      1 or more times
    {n,m}?  at least n but no more than m times
    {n,}?   at least n times
    {,n}?   at least 0 but no more than n times (== {0,n}?)

  possessive (greedy and does not backtrack once matched)

    ?+      1 or 0 times
    *+      0 or more times
    ++      1 or more times

    ({n,m}+, {n,}+, {n}+ are possessive operators in ONIG_SYNTAX_JAVA only)

    ex. /a*+/ === /(?>a*)/


5. Anchors

  ^       beginning of the line
  $       end of the line
  \b      word boundary
  \B      non-word boundary
  \A      beginning of string
  \Z      end of string, or before newline at the end
  \z      end of string
  \G      where the current search attempt begins


6. Character class

  ^...    negative class (lowest precedence)
  x-y     range from x to y
  [...]   set (character class in character class)
  ..&&..  intersection (low precedence, only higher than ^)

    ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

  * If you want to use '[', '-', or ']' as a normal character
    in a character class, you should escape them with '\'.


  POSIX bracket ([:xxxxx:], negate [:^xxxxx:])

    Not Unicode Case:

      alnum    alphabet or digit char
      alpha    alphabet
      ascii    code value: [0 - 127]
      blank    \t, \x20
      cntrl
      digit    0-9
      graph    include all of multibyte encoded characters
      lower
      print    include all of multibyte encoded characters
      punct
      space    \t, \n, \v, \f, \r, \x20
      upper
      xdigit   0-9, a-f, A-F
      word     alphanumeric, "_" and multibyte characters


    Unicode Case:

      alnum    Letter | Mark | Decimal_Number
      alpha    Letter | Mark
      ascii    0000 - 007F
      blank    Space_Separator | 0009
      cntrl    Control | Format | Unassigned | Private_Use | Surrogate
      digit    Decimal_Number
      graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
      lower    Lowercase_Letter
      print    [[:graph:]] | [[:space:]]
      punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
               Final_Punctuation | Initial_Punctuation | Other_Punctuation |
               Open_Punctuation
      space    Space_Separator | Line_Separator | Paragraph_Separator |
               0009 | 000A | 000B | 000C | 000D | 0085
      upper    Uppercase_Letter
      xdigit   0030 - 0039 | 0041 - 0046 | 0061 - 0066
               (0-9, a-f, A-F)
      word     Letter | Mark | Decimal_Number | Connector_Punctuation



7. Extended groups

  (?#...)            comment

  (?imx-imx)         option on/off
                         i: ignore case
                         m: multi-line (dot (.) also matches newline)
                         x: extended form
  (?imx-imx:subexp)  option on/off for subexp

  (?:subexp)         non-capturing group
  (subexp)           capturing group

  (?=subexp)         look-ahead
  (?!subexp)         negative look-ahead
  (?<=subexp)        look-behind
  (?<!subexp)        negative look-behind

                     Subexp of look-behind must be fixed-width.
                     But top-level alternatives can be of various lengths.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In negative look-behind, a capturing group isn't allowed,
                     but a non-capturing group (?:) is allowed.

  (?>subexp)         atomic group
                     no backtracks in subexp.

  (?<name>subexp), (?'name'subexp)
                     define named group
                     (Each character of the name must be a word character.)

                     A number is assigned as well as a name, as for a
                     capturing group.

                     Assigning the same name to two or more subexps is allowed.


8. Backreferences

  When we say "backreference a group," it actually means, "re-match the same
  text matched by the subexp in that group."

  \n  \k<n>     \k'n'     (n >= 1) backreference the nth group in the regexp
      \k<-n>    \k'-n'    (n >= 1) backreference the nth group counting
                          backwards from the referring position
      \k<name>  \k'name'  backreference a group with the specified name

  When backreferencing with a name that is assigned to more than one group,
  the last group with the name is checked first; if it did not match, the
  previous one with the name is checked, and so on, until there is a match.

  * Backreference by number is forbidden if any named group is defined and
    ONIG_OPTION_CAPTURE_GROUP is not set.


  backreference with recursion level

    (n >= 1, level >= 0)

    \k<n+level> \k'n+level'
    \k<n-level> \k'n-level'

    \k<name+level> \k'name+level'
    \k<name-level> \k'name-level'

    Designate a group at the recursion level relative to the referring position.

    ex 1.

      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b>))\z/.match("reee")
      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")

      \k<b+0> refers to the (?<b>.) at the same recursion level as itself.

    ex 2.

      r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
      (?<element> \g<stag> \g<content>* \g<etag> ){0}
      (?<stag> < \g<name> \s* > ){0}
      (?<name> [a-zA-Z_:]+ ){0}
      (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
      (?<etag> </ \k<name+1> >){0}
      \g<element>
      __REGEXP__

      p r.match("<foo>f<bar>bbb</bar>f</foo>").captures


9. Subexp calls ("Tanaka Akira special")

  When we say "call a group," it actually means, "re-execute the subexp in
  that group."

  \g<n>     \g'n'     (n >= 1) call the nth group
  \g<-n>    \g'-n'    (n >= 1) call the nth group counting backwards from
                      the calling position
  \g<name>  \g'name'  call the group with the specified name

  * Left-most recursive calls are not allowed.

    ex. (?<name>a|\g<name>b)    => error
        (?<name>a|b\g<name>c)   => OK

  * Calls with a name that is assigned to more than one group are not
    allowed.

  * Call by number is forbidden if any named group is defined and
    ONIG_OPTION_CAPTURE_GROUP is not set.

  * The option status of the called group is always effective.

    ex. /(?-i:\g<name>)(?i:(?<name>a)){0}/.match("A")


10. Captured group

  Behavior of an unnamed group (...) changes with the following conditions.
  (But the behavior of a named group does not change.)

  case 1. /.../     (named group is not used, no option)

     (...) is treated as a capturing group.

  case 2. /.../g    (named group is not used, 'g' option)

     (...) is treated as a non-capturing group (?:...).

  case 3. /..(?<name>..)../   (named group is used, no option)

     (...) is treated as a non-capturing group.
     numbered-backref/call is not allowed.

  case 4. /..(?<name>..)../G  (named group is used, 'G' option)

     (...) is treated as a capturing group.
     numbered-backref/call is allowed.

  where
    g: ONIG_OPTION_DONT_CAPTURE_GROUP
    G: ONIG_OPTION_CAPTURE_GROUP

  ('g' and 'G' options are discussed on the ruby-dev ML)



-----------------------------
A-1. Syntax-dependent options

   + ONIG_SYNTAX_RUBY
     (?m): dot (.) also matches newline

   + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
     (?s): dot (.) also matches newline
     (?m): ^ matches after newline, $ matches before newline


A-2. Original extensions

   + hexadecimal digit char type  \h, \H
   + named group                  (?<name>...), (?'name'...)
   + named backref                \k<name>
   + subexp call                  \g<name>, \g<group-num>


A-3. Missing features compared with perl 5.8.0

   + \N{name}
   + \l,\u,\L,\U, \X, \C
   + (?{code})
   + (??{code})
   + (?(condition)yes-pat|no-pat)

   * \Q...\E
     This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.


A-4. Differences from Japanized GNU regex (version 0.12) of Ruby 1.8

   + add character property (\p{property}, \P{property})
   + add hexadecimal digit char type (\h, \H)
   + add look-behind
     (?<=fixed-width-pattern), (?<!fixed-width-pattern)
   + add possessive quantifier. ?+, *+, ++
   + add operations in character class. [], &&
     ('[' must be escaped to be used as a normal char in a character class.)
   + add named group and subexp call.
   + octal or hexadecimal number sequence can be treated as
     a multibyte code char in character class if multibyte encoding
     is specified.
     (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
   + allow the range of single byte char and multibyte char in character
     class.
     ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
   + the effect of an isolated option extends to the next ')'.
     ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
   + isolated option is not transparent to previous pattern.
     ex. a(?i)* is a syntax error pattern.
   + an unpaired left brace is allowed as a normal character.
     ex. /{/, /({)/, /a{2,3/ etc...
   + negative POSIX bracket [:^xxxx:] is supported.
   + POSIX bracket [:ascii:] is added.
   + repeat of look-ahead is not allowed.
     ex. /(?=a)*/, /(?!b){5}/
   + the ignore-case option also applies to escape sequences.
     ex. /\x61/i =~ "A"
   + In the range quantifier, the minimum is optional.
     /a{,n}/ == /a{0,n}/
     The omission of both minimum and maximum values is not allowed.
     /a{,}/
   + /{n}?/ is not a reluctant quantifier.
     /a{n}?/ == /(?:a{n})?/
   + an invalid back reference is detected and raises an error.
     /\1/, /(a)\2/
   + A zero-width match in an infinite loop stops the repeat;
     changes in the capture group status are then checked as the stop condition.
     /(?:()|())*\1\2/ =~ ""
     /(?:\1a|())*/ =~ "a"


A-5. Features disabled in default syntax

   + capture history

     (?@...) and (?@<name>...)

     ex. /(?@a)*/.match("aaa") ==> [<0-1>, <1-2>, <2-3>]

     see sample/listcap.c file.


A-6. Problems

   + Invalid encoding byte sequence is not checked.

     ex. UTF-8

     * Invalid first byte is treated as a character.
       /./u =~ "\xa3"

     * Incomplete byte sequence is not checked.
       /\w+/ =~ "a\xf3\x8ec"

// END
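
A few of the constructs described in sections 3 to 9 can be checked from
Ruby, whose Regexp literals use the Ruby syntax. The script below is only a
sketch: it assumes that Ruby's built-in regex engine behaves as described in
this document, and the names BALANCED and CHECKS are made up for the
illustration. Each entry pairs the actual result with the result the text
above leads one to expect.

```ruby
# Assumption: Ruby's built-in Regexp follows the Ruby syntax described above.
# BALANCED and CHECKS are illustrative names, not part of any library.

# Section 9: a subexp call lets a group re-execute itself (balanced parentheses).
BALANCED = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

# Each entry maps a label to [actual result, result expected from the text].
CHECKS = {
  # Section 3: \h is a hexadecimal digit.
  '\h+ accepts "ff0A"'       => [/\A\h+\z/.match?("ff0A"), true],
  # Section 4: a possessive quantifier (or atomic group) gives nothing back,
  # so the trailing "a" has nothing left to match.
  'greedy a*a on "aaa"'      => [/a*a/.match?("aaa"), true],
  'possessive a*+a on "aaa"' => [/a*+a/.match?("aaa"), false],
  'atomic (?>a*)a on "aaa"'  => [/(?>a*)a/.match?("aaa"), false],
  # Section 4: {,n} is shorthand for {0,n}.
  'a{,2} accepts "aa"'       => [/\Aa{,2}\z/.match?("aa"), true],
  'a{,2} rejects "aaa"'      => [/\Aa{,2}\z/.match?("aaa"), false],
  # Section 5: \Z also matches before a final newline, \z does not.
  '\Z on "foo\n"'            => [/foo\Z/.match?("foo\n"), true],
  '\z on "foo\n"'            => [/foo\z/.match?("foo\n"), false],
  # Section 6: && intersects two classes (lowercase letters that are not vowels).
  'class intersection'       => ["regex".scan(/[a-z&&[^aeiou]]/), %w[r g x]],
  # Section 7: with the m option, dot also matches newline.
  'dot without m option'     => [/a.b/.match?("a\nb"), false],
  'dot with m option'        => [/a.b/m.match?("a\nb"), true],
  # Section 8: \k<name> re-matches the text captured by the named group.
  'named backreference'      => [/(?<q>['"]).*?\k<q>/.match(%q{say "hi" now})[0], '"hi"'],
  # Section 9: recursive subexp call.
  'balanced "(a(b)c)"'       => [BALANCED.match?("(a(b)c)"), true],
  'unbalanced "(a(b)c"'      => [BALANCED.match?("(a(b)c"), false],
}

CHECKS.each do |label, (actual, expected)|
  status = actual == expected ? "ok" : "UNEXPECTED"
  puts format("%-28s %-11s %p", label, status, actual)
end
```

Any line reported as UNEXPECTED would indicate that the engine in use
differs from the behavior described here, for example because a different
syntax (ONIG_SYNTAX_PERL, ONIG_SYNTAX_JAVA) or a different engine version is
in effect.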