Oniguruma Regular Expressions Version 6.9.4    2019/10/31

syntax: ONIG_SYNTAX_ONIGURUMA (default)


1. Syntax elements

  \       escape (enable or disable meta character)
  |       alternation
  (...)   group
  [...]   character class
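
  A minimal sketch of the four elements above, written in Ruby. Ruby's
  regexp engine (Onigmo) is a fork of Oniguruma and is assumed here to
  behave the same way for this basic syntax. The constant names are
  illustrative only.

```ruby
# '\' disables the meta character '.', so only a literal dot matches.
LITERAL_DOT = /a\.b/
# '|' is alternation inside a (...) group; [...] is a character class.
GROUPED_ALT = /\A(cat|dog)[0-9]\z/

p LITERAL_DOT.match?("a.b")      # => true
p LITERAL_DOT.match?("axb")      # => false
p GROUPED_ALT.match("dog7")[1]   # => "dog"
```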


2. Characters

  \t           horizontal tab         (0x09)
  \v           vertical tab           (0x0B)
  \n           newline (line feed)    (0x0A)
  \r           carriage return        (0x0D)
  \b           backspace              (0x08)
  \f           form feed              (0x0C)
  \a           bell                   (0x07)
  \e           escape                 (0x1B)
  \nnn         octal char             (encoded byte value)
  \o{17777777777} wide octal char     (character code point value)
  \uHHHH       wide hexadecimal char  (character code point value)
  \xHH         hexadecimal char       (encoded byte value)
  \x{7HHHHHHH} wide hexadecimal char  (character code point value)
  \cx          control char           (character code point value)
  \C-x         control char           (character code point value)
  \M-x         meta  (x|0x80)         (character code point value)
  \M-\C-x      meta control char      (character code point value)

 (* \b means backspace only inside a character class;
    elsewhere it is a word boundary)
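
  A few of these escapes can be checked from Ruby, as a sketch. It
  assumes Ruby's Onigmo engine treats them like Oniguruma does; the
  \x{...} and \o{...} forms are left out because Ruby regexp literals
  are not assumed to accept them. The hash name is hypothetical.

```ruby
# Each pattern is matched against a Ruby string holding the same character.
# A value of 0 means the pattern matched at offset 0.
CHAR_ESCAPE_CHECKS = {
  "tab"        => (/\A\t\z/     =~ "\t"),
  "escape"     => (/\A\e\z/     =~ "\e"),
  "hex byte"   => (/\A\x41\z/   =~ "A"),       # 0x41 is "A"
  "octal byte" => (/\A\101\z/   =~ "A"),       # octal 101 is 0x41
  "code point" => (/\A\u3042\z/ =~ "\u3042"),  # HIRAGANA LETTER A
  "backspace"  => (/\A[\b]\z/   =~ "\b"),      # \b inside a class is 0x08
}

CHAR_ESCAPE_CHECKS.each { |name, pos| puts "#{name}: #{pos.inspect}" }
```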


3. Character types

  .        any character (except newline)

  \w       word character

           Not Unicode:
             alphanumeric, "_" and multibyte char.

           Unicode:
             General_Category -- (Letter|Mark|Number|Connector_Punctuation)

  \W       non-word char

  \s       whitespace char

           Not Unicode:
             \t, \n, \v, \f, \r, \x20

           Unicode case:
             U+0009, U+000A, U+000B, U+000C, U+000D, U+0085(NEL),
             General_Category -- Line_Separator
                              -- Paragraph_Separator
                              -- Space_Separator

  \S       non-whitespace char

  \d       decimal digit char

           Unicode: General_Category -- Decimal_Number

  \D       non-decimal-digit char

  \h       hexadecimal digit char   [0-9a-fA-F]

  \H       non-hexdigit char

  \R       general newline  (* can't be used in character-class)
           "\r\n" or \n,\v,\f,\r  (* but doesn't backtrack from \r\n to \r)

           Unicode case:
             "\r\n" or \n,\v,\f,\r or U+0085, U+2028, U+2029

  \N       negative newline  (?-m:.)

  \O       true anychar      (?m:.)    (* original function)

  \X       Text Segment    \X === (?>\O(?:\Y\O)*)

           The meaning of this operator changes depending on the setting of
           the option (?y{..}).

           \X does not check whether the match start position is a boundary.
           Write \y\X to ensure that it is.

           [Extended Grapheme Cluster mode] (default)
             Unicode case:
               See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]

             Not Unicode case:  \X === (?>\r\n|\O)

           [Word mode]
             Currently, this mode is supported in Unicode only.
             See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]
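
  The sketch below tries '.', \h, \R and \X from Ruby on UTF-8 strings.
  It assumes Ruby's Onigmo engine agrees with Oniguruma for these four;
  \N, \O and the (?y{..}) modes are not used because they are not
  assumed to exist in Ruby. The constant names are illustrative.

```ruby
# \h matches one hexadecimal digit; "x" is not one.
HEX_DIGITS = "0x1F".scan(/\h/)         # => ["0", "1", "F"]

# '.' does not match a newline unless the m option is set.
DOT_ACROSS_NEWLINE = ("a\nb" =~ /a.b/) # => nil

# \R consumes "\r\n" as one newline instead of two.
CRLF_NEWLINES = "\r\n".scan(/\R/)      # => ["\r\n"]

# \X takes a base letter together with its combining mark (U+0301).
CLUSTERS = "e\u0301x".scan(/\X/)

p HEX_DIGITS, DOT_ACROSS_NEWLINE, CRLF_NEWLINES, CLUSTERS.size
```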


  Character Property

    * \p{property-name}
    * \p{^property-name}    (negative)
    * \P{property-name}     (negative)

    property-name:

     + works on all encodings
       Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
       Print, Punct, Space, Upper, XDigit, Word, ASCII

     + works on EUC_JP, Shift_JIS
       Hiragana, Katakana

     + works on UTF8, UTF16, UTF32
       See doc/UNICODE_PROPERTIES.
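
  As a sketch of the three property forms, the Ruby snippet below scans
  a short UTF-8 string. It assumes Ruby's Onigmo engine resolves these
  property names as Oniguruma does on UTF-8; PROPERTY_SAMPLE and the
  other constants are made-up names.

```ruby
PROPERTY_SAMPLE = "aZ9 \u3042"   # the last character is HIRAGANA LETTER A

LETTERS        = PROPERTY_SAMPLE.scan(/\p{Alpha}/)
NON_LETTERS    = PROPERTY_SAMPLE.scan(/\p{^Alpha}/)  # negative form 1
NON_LETTERS_P  = PROPERTY_SAMPLE.scan(/\P{Alpha}/)   # negative form 2
UPPERCASE      = PROPERTY_SAMPLE.scan(/\p{Upper}/)
HIRAGANA_CHARS = PROPERTY_SAMPLE.scan(/\p{Hiragana}/)

p LETTERS, NON_LETTERS, UPPERCASE, HIRAGANA_CHARS
```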



4. Quantifier

  greedy

    ?       1 or 0 times
    *       0 or more times
    +       1 or more times
    {n,m}   (n <= m)  at least n but no more than m times
    {n,}    at least n times
    {,n}    at least 0 but no more than n times ({0,n})
    {n}     n times

  reluctant

    ??      0 or 1 times
    *?      0 or more times
    +?      1 or more times
    {n,m}?  (n <= m)  at least n but not more than m times
    {n,}?   at least n times
    {,n}?   at least 0 but not more than n times (== {0,n}?)

  possessive (greedy and does not backtrack once matched)

    ?+      1 or 0 times
    *+      0 or more times
    ++      1 or more times
    {n,m}   (n > m)  at least m but not more than n times

    {n,m}+, {n,}+, {n}+ are possessive operators in ONIG_SYNTAX_JAVA and
    ONIG_SYNTAX_PERL only.

    ex. /a*+/ === /(?>a*)/
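
  The difference between the three families shows up on the same input.
  This is a sketch in Ruby, assuming its Onigmo engine follows the table
  above for these operators; the {n,m} form with n > m is not exercised.

```ruby
GREEDY     = "aaaa"[/a+/]       # => "aaaa"  (takes as much as possible)
RELUCTANT  = "aaaa"[/a+?/]      # => "a"     (takes as little as possible)
BOUNDED    = "aaaa"[/a{2,3}/]   # => "aaa"
UP_TO_TWO  = "aaaa"[/a{,2}/]    # => "aa"    ({,2} is {0,2})

# Possessive a*+ keeps every "a" and never gives one back,
# so the trailing "a" cannot match. The atomic group behaves the same.
POSSESSIVE = ("aaaa" =~ /a*+a/)     # => nil
ATOMIC     = ("aaaa" =~ /(?>a*)a/)  # => nil
# The plain greedy form backtracks one step and succeeds.
BACKTRACKS = ("aaaa" =~ /a*a/)      # => 0

p GREEDY, RELUCTANT, BOUNDED, UP_TO_TWO, POSSESSIVE, ATOMIC, BACKTRACKS
```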


5. Anchors

  ^       beginning of the line
  $       end of the line
  \b      word boundary
  \B      non-word boundary

  \A      beginning of string
  \Z      end of string, or before newline at the end
  \z      end of string
  \G      where the current search attempt begins
  \K      keep (keep start position of the result string)


  \y      Text Segment boundary
  \Y      Text Segment non-boundary

          The meaning of these operators (\y, \Y) changes depending on the
          setting of the option (?y{..}).

          [Extended Grapheme Cluster mode] (default)
            Unicode case:
              See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]

            Not Unicode:
              All positions except between \r and \n.

          [Word mode]
            Currently, this mode is supported in Unicode only.
            See [Unicode Standard Annex #29: http://unicode.org/reports/tr29/]
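
  The line and string anchors and \K can be compared side by side. The
  following is a sketch in Ruby, assuming its Onigmo engine matches
  Oniguruma for these anchors; \y and \Y are not used because they are
  not assumed to be available in Ruby.

```ruby
LINE_ANCHORS   = ("foo\nbar" =~ /^bar$/)   # => 4   (^ and $ work per line)
STRING_START   = ("foo\nbar" =~ /\Abar/)   # => nil (\A is the string start)
BEFORE_NEWLINE = ("foo\n" =~ /foo\Z/)      # => 0   (\Z allows a final \n)
STRICT_END     = ("foo\n" =~ /foo\z/)      # => nil (\z does not)
WORD_BOUNDARY  = ("concat cat" =~ /\bcat\b/) # => 7 (skips "cat" in "concat")
# \K drops "foo" from the reported match but still requires it.
KEPT_PART      = "foobar"[/foo\Kbar/]      # => "bar"

p LINE_ANCHORS, STRING_START, BEFORE_NEWLINE, STRICT_END, WORD_BOUNDARY, KEPT_PART
```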



6. Character class

  ^...    negative class (lowest precedence)
  x-y     range from x to y
  [...]   set (character class in character class)
  ..&&..  intersection (low precedence, only higher than ^)

    ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

  * To use '[', '-', or ']' as a normal character in a character class,
    escape it with '\'.
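
  The intersection example can be checked by listing which lowercase
  letters the class accepts. This is a sketch in Ruby, assuming its
  Onigmo engine applies the same precedence; Ruby may print a
  duplicated-range warning for this class when warnings are enabled.

```ruby
# Letters accepted by the example class: a, b and h through w.
ACCEPTED = ("a".."z").select { |c| c.match?(/[a-w&&[^c-g]z]/) }.join

# Negation applies to the whole class.
NOT_B = "abc".scan(/[^b]/)          # => ["a", "c"]

# '-' and ']' escaped so that they are ordinary members.
ESCAPED = "a-b]".scan(/[\-\]]/)     # => ["-", "]"]

p ACCEPTED, NOT_B, ESCAPED
```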


  POSIX bracket ([:xxxxx:], negate [:^xxxxx:])

    Not Unicode Case:

      alnum    alphabet or digit char
      alpha    alphabet
      ascii    code value: [0 - 127]
      blank    \t, \x20
      cntrl
      digit    0-9
      graph    include all of multibyte encoded characters
      lower
      print    include all of multibyte encoded characters
      punct
      space    \t, \n, \v, \f, \r, \x20
      upper
      xdigit   0-9, a-f, A-F
      word     alphanumeric, "_" and multibyte characters


    Unicode Case:

      alnum    Letter | Mark | Decimal_Number
      alpha    Letter | Mark
      ascii    0000 - 007F
      blank    Space_Separator | 0009
      cntrl    Control | Format | Unassigned | Private_Use | Surrogate
      digit    Decimal_Number
      graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
      lower    Lowercase_Letter
      print    [[:graph:]] | [[:space:]]
      punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
               Final_Punctuation | Initial_Punctuation | Other_Punctuation |
               Open_Punctuation
      space    Space_Separator | Line_Separator | Paragraph_Separator |
               U+0009 | U+000A | U+000B | U+000C | U+000D | U+0085
      upper    Uppercase_Letter
      xdigit   U+0030 - U+0039 | U+0041 - U+0046 | U+0061 - U+0066
               (0-9, a-f, A-F)
      word     Letter | Mark | Decimal_Number | Connector_Punctuation
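
  A short sketch of POSIX brackets and their negated form, in Ruby on an
  ASCII-only sample. It assumes Ruby's Onigmo engine classifies these
  characters as the tables above do; the constant names are illustrative.

```ruby
BRACKET_SAMPLE = "Ab1 _!"

ALNUM_CHARS     = BRACKET_SAMPLE.scan(/[[:alnum:]]/)
NON_ALNUM_CHARS = BRACKET_SAMPLE.scan(/[[:^alnum:]]/)
# "_" is Connector_Punctuation and "!" is Other_Punctuation.
PUNCT_CHARS     = BRACKET_SAMPLE.scan(/[[:punct:]]/)
# Several brackets can share one character class.
UPPER_OR_DIGIT  = BRACKET_SAMPLE.scan(/[[:upper:][:digit:]]/)

p ALNUM_CHARS, NON_ALNUM_CHARS, PUNCT_CHARS, UPPER_OR_DIGIT
```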



7. Extended groups

  (?#...)            comment

  (?imxWDSPy-imxWDSP:subexp)  option on/off for subexp

                           i: ignore case
                           m: multi-line (dot (.) also matches newline)
                           x: extended form
                           W: ASCII only word (\w, \p{Word}, [[:word:]])
                              ASCII only word bound (\b)
                           D: ASCII only digit (\d, \p{Digit}, [[:digit:]])
                           S: ASCII only space (\s, \p{Space}, [[:space:]])
                           P: ASCII only POSIX properties (includes W,D,S)
                              (alnum, alpha, blank, cntrl, digit, graph,
                               lower, print, punct, space, upper, xdigit, word)

                           y{?}: Text Segment mode
                              This option changes the meaning of \X, \y, \Y.
                              Currently, this option is supported in Unicode only.

                              y{g}: Extended Grapheme Cluster mode (default)
                              y{w}: Word mode
                              See [Unicode Standard Annex #29]

  (?imxWDSPy-imxWDSP)  isolated option

                      * It makes a group that extends to the next ')' or to
                        the end of the pattern.
                        /ab(?i)c|def|gh/ == /ab(?i:c|def|gh)/
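
  The i, m and x options can be tried from Ruby as a sketch. It assumes
  Ruby's Onigmo engine shares their meaning with ONIG_SYNTAX_ONIGURUMA;
  the W, D, S, P and y{..} options are not used because they are not
  assumed to exist in Ruby.

```ruby
GROUP_IGNORE_CASE    = /\Aa(?i:b)c\z/    # only "b" ignores case
ISOLATED_IGNORE_CASE = /\Aa(?i)b\z/      # applies from (?i) onwards
SWITCHED_OFF         = /\A(?-i:a)b\z/i   # "a" is case sensitive again
DOT_MATCHES_NEWLINE  = /\Aa(?m:.)b\z/    # m lets '.' match "\n"
EXTENDED_FORM        = /\A(?x: a b c )\z/ # x ignores the spaces

p GROUP_IGNORE_CASE.match?("aBc")     # => true
p GROUP_IGNORE_CASE.match?("aBC")     # => false
p ISOLATED_IGNORE_CASE.match?("AB")   # => false
p DOT_MATCHES_NEWLINE.match?("a\nb")  # => true
p EXTENDED_FORM.match?("abc")         # => true
```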


  (?:subexp)         non-capturing group
  (subexp)           capturing group

  (?=subexp)         look-ahead
  (?!subexp)         negative look-ahead
  (?<=subexp)        look-behind
  (?<!subexp)        negative look-behind

                     Subexp of look-behind must be fixed-width.
                     But top-level alternatives can be of various lengths.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In negative look-behind, capturing group isn't allowed,
                     but non-capturing group (?:) is allowed.

                     * In look-behind and negative look-behind, support for
                       the ignore-case option is limited. Only conversion
                       between single characters is supported. (Conversion
                       of multiple characters in Unicode is not supported.)

  (?>subexp)         atomic group
                     no backtracking into subexp.

  (?<name>subexp), (?'name'subexp)
                     define named group
                     (Each character of the name must be a word character.)

                     A number is also assigned, as with a capturing group.

                     Assigning the same name to two or more subexps is allowed.


  <Callouts>

  * Callouts of contents
  (?{...contents...})         callout in progress
  (?{...contents...}D)        D is a direction flag char
                              D = 'X': in progress and retraction
                                  '<': in retraction only
                                  '>': in progress only
  (?{...contents...}[tag])    tag assigned
  (?{...contents...}[tag]D)

                              * Escape characters have no effect in contents.
                              * contents is not allowed to start with '{'.

  (?{{{...contents...}}})     a run of n consecutive '}' is allowed in contents
                              enclosed by (n+1) consecutive braces {{{...}}}.

    Allowed tag string characters: _ A-Z a-z 0-9 (* first character: _ A-Z a-z)


  * Callouts of name
  (*name)
  (*name{args...})            with args
  (*name[tag])                tag assigned
  (*name[tag]{args...})

    Allowed name string characters: _ A-Z a-z 0-9 (* first character: _ A-Z a-z)
    Allowed tag  string characters: _ A-Z a-z 0-9 (* first character: _ A-Z a-z)


  <Absent functions>

  (?~absent)         Absent repeater    (* proposed by Tanaka Akira)
                     This works like .* (more precisely \O*), but it is
                     limited to the range that does not include a string
                     matching <absent>.
                     This is an abbreviation of (?~|(?:absent)|\O*).
                     \O* is used as a repeater.

  (?~|absent|exp)    Absent expression  (* original)
                     This works like "exp", but it is limited to the range
                     that does not include a string matching <absent>.

                     ex. (?~|345|\d*)  "12345678"  ==> "12", "1", ""

  (?~|absent)        Absent stopper (* original)
                     After passing this operator, the right range of the
                     string is limited to the point that does not include a
                     string matching <absent>.

  (?~|)              Range clear
                     Clear the effects caused by Absent stoppers.

     * Nested Absent functions are not supported and the behavior
       is undefined.
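
  The absent repeater is commonly used to match a delimited span such as
  a C comment. The sketch below is Ruby and assumes a Ruby version whose
  Onigmo engine supports (?~absent) (2.4.1 or later); the (?~|...) forms
  are not used because they are not assumed to exist in Ruby.

```ruby
# (?~\*/) matches any run of characters that does not contain "*/".
C_COMMENT = %r{/\*(?~\*/)\*/}
# For comparison, greedy .* runs on to the last "*/".
GREEDY_COMMENT = %r{/\*.*\*/}

COMMENT_SOURCE = "/* a */ b /* c */"

FIRST_COMMENT = COMMENT_SOURCE[C_COMMENT]       # => "/* a */"
OVERREACH     = COMMENT_SOURCE[GREEDY_COMMENT]  # => "/* a */ b /* c */"

p FIRST_COMMENT, OVERREACH
```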


  <if-then-else>

  (?(condition_exp)then_exp|else_exp)    if-then-else
  (?(condition_exp)then_exp)             if-then

               condition_exp can be a backreference number/name or a normal
               regular expression.
               When condition_exp is a backreference number/name, both then_exp and
               else_exp can be omitted.
               Then it works as a backreference validity checker.

  [ Backreference validity checker ]   (* original)

    (?(n)), (?(-n)), (?(+n)), (?(n+level)) ...
    (?(<n>)), (?('-n')), (?(<+n>)) ...
    (?(<name>)), (?('name')), (?(<name+level>)) ...
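
  A typical use of the if-then form is to require a closing delimiter
  only when the opening one was matched. This is a sketch in Ruby,
  assuming its Onigmo engine supports a group number as condition_exp; a
  normal regular expression as the condition is not exercised here.

```ruby
# Group 1 is the optional "(". The ")" is required only if group 1 matched.
OPTIONAL_PARENS = /\A(\()?\d+(?(1)\))\z/

p OPTIONAL_PARENS.match?("(12)")   # => true
p OPTIONAL_PARENS.match?("12")     # => true
p OPTIONAL_PARENS.match?("(12")    # => false
p OPTIONAL_PARENS.match?("12)")    # => false
```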



8. Backreferences

  When we say "backreference a group," it actually means, "re-match the same
  text matched by the subexp in that group."

  \n  \k<n>     \k'n'     (n >= 1) backreference the nth group in the regexp
      \k<-n>    \k'-n'    (n >= 1) backreference the nth group counting
                          backwards from the referring position
      \k<+n>    \k'+n'    (n >= 1) backreference the nth group counting
                          forwards from the referring position
      \k<name>  \k'name'  backreference a group with the specified name

  When backreferencing with a name that is assigned to more than one group,
  the last group with the name is checked first, if not matched then the
  previous one with the name, and so on, until there is a match.

  * Backreference by number is forbidden if any named group is defined and
    ONIG_OPTION_CAPTURE_GROUP is not set.
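
  The numbered, relative and named forms can be sketched in Ruby as
  below, assuming its Onigmo engine follows the same rules. The helper
  compiles? is hypothetical and only reports whether a pattern compiles.

```ruby
# Hypothetical helper: true if the pattern source compiles.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

DOUBLED_LETTER = "hello"[/(\w)\1/]                        # => "ll"
# The closing quote must be the same text as the opening one.
QUOTED         = "say 'it' now"[/(?<q>["']).*?\k<q>/]     # => "'it'"
# \k<-1> is the nearest group to the left, \k<-2> the one before it.
RELATIVE_ONE   = ("abb" =~ /(a)(b)\k<-1>/)                # => 0
RELATIVE_TWO   = ("aba" =~ /(a)(b)\k<-2>/)                # => 0
# A numbered backreference next to a named group is rejected.
NUMBER_WITH_NAME = compiles?('(?<x>a)\1')                 # => false

p DOUBLED_LETTER, QUOTED, RELATIVE_ONE, RELATIVE_TWO, NUMBER_WITH_NAME
```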


  backreference with recursion level

    (n >= 1, level >= 0)

    \k<n+level> \k'n+level'
    \k<n-level> \k'n-level'

    \k<name+level> \k'name+level'
    \k<name-level> \k'name-level'

    Designates a group at the recursion level relative to the referring
    position.

    ex 1.

      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b>))\z/.match("reee")
      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")

      \k<b+0> refers to the (?<b>.) at the same recursion level as itself.

    ex 2.

      r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
      (?<element> \g<stag> \g<content>* \g<etag> ){0}
      (?<stag> < \g<name> \s* > ){0}
      (?<name> [a-zA-Z_:]+ ){0}
      (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
      (?<etag> </ \k<name+1> >){0}
      \g<element>
      __REGEXP__

      p r.match("<foo>f<bar>bbb</bar>f</foo>").captures


9. Subexp calls ("Tanaka Akira special")   (* original function)

  When we say "call a group," it actually means, "re-execute the subexp in
  that group."

  \g<n>     \g'n'     (n >= 1) call the nth group
  \g<0>     \g'0'     call zero (call the total regexp)
  \g<-n>    \g'-n'    (n >= 1) call the nth group counting backwards from
                      the calling position
  \g<+n>    \g'+n'    (n >= 1) call the nth group counting forwards from
                      the calling position
  \g<name>  \g'name'  call the group with the specified name

  * Left-most recursive calls are not allowed.

    ex. (?<name>a|\g<name>b)    => error
        (?<name>a|b\g<name>c)   => OK

  * Calls with a name that is assigned to more than one group are not
    allowed.

  * Call by number is forbidden if any named group is defined and
    ONIG_OPTION_CAPTURE_GROUP is not set.

  * The option status of the called group is always effective.

    ex. /(?-i:\g<name>)(?i:(?<name>a)){0}/.match("A")
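
  The difference between calling a group and backreferencing it, and the
  left-most recursion rule, can be sketched in Ruby. This assumes Ruby's
  Onigmo engine follows the rules above; compiles? is a hypothetical
  helper that only reports whether a pattern compiles.

```ruby
# Hypothetical helper: true if the pattern source compiles.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

# A call re-executes the pattern \d; a backreference re-matches the text.
CALL_MATCHES    = ("1-2" =~ /(?<d>\d)-\g<d>/)   # => 0
BACKREF_MATCHES = ("1-2" =~ /(?<d>\d)-\k<d>/)   # => nil

# Recursion through a call: balanced parentheses.
BALANCED = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

LEFT_RECURSIVE  = compiles?('(?<name>a|\g<name>b)')    # => false
INNER_RECURSIVE = compiles?('(?<name>a|b\g<name>c)')   # => true

p CALL_MATCHES, BACKREF_MATCHES, BALANCED.match?("(a(b)c)"),
  LEFT_RECURSIVE, INNER_RECURSIVE
```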


10. Captured group

  Behavior of an unnamed group (...) changes with the following conditions.
  (But named group is not changed.)

  case 1. /.../     (named group is not used, no option)

     (...) is treated as a capturing group.

  case 2. /.../g    (named group is not used, 'g' option)

     (...) is treated as a non-capturing group (?:...).

  case 3. /..(?<name>..)../   (named group is used, no option)

     (...) is treated as a non-capturing group.
     numbered-backref/call is not allowed.

  case 4. /..(?<name>..)../G  (named group is used, 'G' option)

     (...) is treated as a capturing group.
     numbered-backref/call is allowed.

  where
    g: ONIG_OPTION_DONT_CAPTURE_GROUP
    G: ONIG_OPTION_CAPTURE_GROUP

  ('g' and 'G' options were discussed on the ruby-dev ML)



-----------------------------
A-1. Syntax-dependent options

   + ONIG_SYNTAX_ONIGURUMA
     (?m): dot (.) also matches newline

   + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
     (?s): dot (.) also matches newline
     (?m): ^ matches after newline, $ matches before newline


A-2. Original extensions

   + hexadecimal digit char type     \h, \H
   + true anychar                    \O
   + text segment boundary           \y, \Y
   + backreference validity checker  (?(...))
   + named group                     (?<name>...), (?'name'...)
   + named backref                   \k<name>
   + subexp call                     \g<name>, \g<group-num>
   + absent expression               (?~|...|...)
   + absent stopper                  (?~|...)


A-3. Missing features compared with perl 5.8.0

   + \N{name}
   + \l,\u,\L,\U,\C
   + (??{code})

   * \Q...\E
     This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.


A-4. Differences with Japanized GNU regex(version 0.12) of Ruby 1.8

   + add character property (\p{property}, \P{property})
   + add hexadecimal digit char type (\h, \H)
   + add look-behind
     (?<=fixed-width-pattern), (?<!fixed-width-pattern)
   + add possessive quantifier. ?+, *+, ++
   + add operations in character class. [], &&
     ('[' must be escaped to be used as a normal char in a character class.)
   + add named group and subexp call.
   + octal or hexadecimal number sequence can be treated as
     a multibyte code char in character class if multibyte encoding
     is specified.
     (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
   + allow the range of single byte char and multibyte char in character
     class.
     ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
   + effect range of isolated option is to next ')'.
     ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
   + isolated option is not transparent to previous pattern.
     ex. a(?i)* is a syntax error pattern.
   + allowed unpaired left brace as a normal character.
     ex. /{/, /({)/, /a{2,3/ etc...
   + negative POSIX bracket [:^xxxx:] is supported.
   + POSIX bracket [:ascii:] is added.
   + repeat of look-ahead is not allowed.
     ex. /(?=a)*/, /(?!b){5}/
   + The ignore-case option also applies to escape sequences.
     ex. /\x61/i =~ "A"
   + In the range quantifier, the minimum value is optional.
     /a{,n}/ == /a{0,n}/
     Omitting both the minimum and the maximum value is not allowed.
     /a{,}/
   + /{n}?/ is not a reluctant quantifier.
     /a{n}?/ == /(?:a{n})?/
   + invalid back reference is checked and raises error.
     /\1/, /(a)\2/
   + A zero-width match in an infinite loop stops the repeat;
     changes in the capture group status are then checked as the stop
     condition.
     /(?:()|())*\1\2/ =~ ""
     /(?:\1a|())*/ =~ "a"
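
  Two of the items above can be checked from a current Ruby, as a
  sketch. It assumes Ruby's Onigmo engine kept these behaviours from
  Oniguruma; compiles? is a hypothetical helper that only reports
  whether a pattern compiles.

```ruby
# Hypothetical helper: true if the pattern source compiles.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

# The ignore-case option also applies to the escape \x61 ("a").
ESCAPE_IGNORES_CASE = (/\x61/i =~ "A")    # => 0

# A backreference to a group that does not exist is rejected.
INVALID_BACKREF = compiles?('(a)\2')      # => false
VALID_BACKREF   = compiles?('(a)\1')      # => true

p ESCAPE_IGNORES_CASE, INVALID_BACKREF, VALID_BACKREF
```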
574
575// END
576