Oniguruma Regular Expressions Version 6.3.0    2017/05/19

syntax: ONIG_SYNTAX_RUBY (default)


1. Syntax elements

  \       escape (enable or disable meta character)
  |       alternation
  (...)   group
  [...]   character class


2. Characters

  \t           horizontal tab         (0x09)
  \v           vertical tab           (0x0B)
  \n           newline (line feed)    (0x0A)
  \r           carriage return        (0x0D)
  \b           backspace              (0x08)
  \f           form feed              (0x0C)
  \a           bell                   (0x07)
  \e           escape                 (0x1B)
  \nnn         octal char             (encoded byte value)
  \o{17777777777} wide octal char     (character code point value)
  \xHH         hexadecimal char       (encoded byte value)
  \x{7HHHHHHH} wide hexadecimal char  (character code point value)
  \cx          control char           (character code point value)
  \C-x         control char           (character code point value)
  \M-x         meta  (x|0x80)         (character code point value)
  \M-\C-x      meta control char      (character code point value)

 (* \b as backspace is effective in character class only)

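  The escapes above can be tried from Ruby. This is a minimal sketch, assuming
  Ruby's Regexp (built on Onigmo, an Oniguruma fork) treats these escapes the
  same way as ONIG_SYNTAX_RUBY; the sample strings and variable names are made
  up for illustration.

```ruby
# \t, \xHH and \e each name a single character.
tab_pos = "a\tb" =~ /\t/      # position of the tab
hex_pos = "xAy" =~ /\x41/     # \x41 is "A"
esc_pos = "\e[0m" =~ /\e/     # \e is ESC (0x1B)

# Inside a character class \b is backspace (0x08);
# outside a class it is the word-boundary anchor.
bs_pos   = "ab\bc" =~ /[\b]/  # finds the backspace character
word_pos = "ab cd" =~ /\bcd/  # finds "cd" at a word boundary

p [tab_pos, hex_pos, esc_pos, bs_pos, word_pos]
```

  Each variable holds the offset of the first match, or nil when there is none.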

3. Character types

  .        any character (except newline)

  \w       word character

           Not Unicode:
             alphanumeric, "_" and multibyte char.

           Unicode:
             General_Category -- (Letter|Mark|Number|Connector_Punctuation)

  \W       non-word char

  \s       whitespace char

           Not Unicode:
             \t, \n, \v, \f, \r, \x20

           Unicode:
             0009, 000A, 000B, 000C, 000D, 0085(NEL),
             General_Category -- Line_Separator
                              -- Paragraph_Separator
                              -- Space_Separator

  \S       non-whitespace char

  \d       decimal digit char

           Unicode: General_Category -- Decimal_Number

  \D       non-decimal-digit char

  \h       hexadecimal digit char   [0-9a-fA-F]

  \H       non-hexdigit char


  Character Property

    * \p{property-name}
    * \p{^property-name}    (negative)
    * \P{property-name}     (negative)

    property-name:

     + works on all encodings
       Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
       Print, Punct, Space, Upper, XDigit, Word, ASCII

     + works on EUC_JP, Shift_JIS
       Hiragana, Katakana

     + works on UTF8, UTF16, UTF32
       See doc/UNICODE_PROPERTIES.

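  A short illustration of \h, \w and the property syntax. This is a sketch in
  Ruby, assuming its Onigmo-based Regexp agrees with the table above for ASCII
  input; the inputs are invented for the example.

```ruby
# \h is the hexadecimal-digit type, [0-9a-fA-F].
hex_runs = "BEEF cafe zz".scan(/\h+/)

# \p{...} selects a property; \p{^...} and \P{...} both negate it.
first_non_alpha_a = "ab1" =~ /\p{^Alpha}/
first_non_alpha_b = "ab1" =~ /\P{Alpha}/

# \w covers alphanumerics and "_"; everything else separates words.
words = "foo_1 bar-2".scan(/\w+/)

p hex_runs, first_non_alpha_a, first_non_alpha_b, words
```

  The two negated forms are interchangeable, so both report the same offset.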


4. Quantifier

  greedy

    ?       1 or 0 times
    *       0 or more times
    +       1 or more times
    {n,m}   at least n but no more than m times
    {n,}    at least n times
    {,n}    at least 0 but no more than n times ({0,n})
    {n}     n times

  reluctant

    ??      1 or 0 times
    *?      0 or more times
    +?      1 or more times
    {n,m}?  at least n but not more than m times
    {n,}?   at least n times
    {,n}?   at least 0 but not more than n times (== {0,n}?)

  possessive (greedy and does not backtrack once matched)

    ?+      1 or 0 times
    *+      0 or more times
    ++      1 or more times

    ({n,m}+, {n,}+, {n}+ are possessive op. in ONIG_SYNTAX_JAVA only)

    ex. /a*+/ === /(?>a*)/

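  The three quantifier families differ only in how much they take and whether
  they give it back. The sketch below shows this in Ruby, assuming its
  Onigmo-based Regexp follows the same rules; the strings are illustrative.

```ruby
text = "<a><b>"

greedy    = text[/<.+>/]    # takes as much as possible
reluctant = text[/<.+?>/]   # takes as little as possible

# A possessive quantifier keeps everything it matched, so the
# trailing "a" can never match; the atomic group is equivalent.
plain      = "aaa" =~ /a*a/
possessive = "aaa" =~ /a*+a/
atomic     = "aaa" =~ /(?>a*)a/

# {,n} is shorthand for {0,n}.
up_to_two = "aaaa"[/a{,2}/]

p greedy, reluctant, plain, possessive, atomic, up_to_two
```

  Only the plain greedy form backtracks one "a" to let the final "a" match.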

5. Anchors

  ^       beginning of the line
  $       end of the line
  \b      word boundary
  \B      non-word boundary
  \A      beginning of string
  \Z      end of string, or before newline at the end
  \z      end of string
  \G      where the current search attempt begins

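  The line anchors and the string anchors are easy to confuse. The following
  is a sketch in Ruby, assuming its Regexp applies these anchors as listed
  above; the \G line relies on String#scan restarting each search where the
  previous match ended.

```ruby
s = "foo\nbar\n"

line_start = s =~ /^bar/     # ^ matches at the start of any line
str_start  = s =~ /\Abar/    # \A only at the start of the string
before_nl  = s =~ /bar\Z/    # \Z tolerates one trailing newline
at_end     = s =~ /bar\z/    # \z requires the very end

# \G pins each attempt to where the search started, so scan
# stops at the first gap instead of skipping ahead.
contiguous = "aaba".scan(/\Ga/)

p line_start, str_start, before_nl, at_end, contiguous
```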

6. Character class

  ^...    negative class (lowest precedence)
  x-y     range from x to y
  [...]   set (character class in character class)
  ..&&..  intersection (low precedence, only higher than ^)

    ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

  * If you want to use '[', '-', or ']' as a normal character
    in character class, you should escape them with '\'.


  POSIX bracket ([:xxxxx:], negate [:^xxxxx:])

    Not Unicode Case:

      alnum    alphabet or digit char
      alpha    alphabet
      ascii    code value: [0 - 127]
      blank    \t, \x20
      cntrl
      digit    0-9
      graph    include all of multibyte encoded characters
      lower
      print    include all of multibyte encoded characters
      punct
      space    \t, \n, \v, \f, \r, \x20
      upper
      xdigit   0-9, a-f, A-F
      word     alphanumeric, "_" and multibyte characters


    Unicode Case:

      alnum    Letter | Mark | Decimal_Number
      alpha    Letter | Mark
      ascii    0000 - 007F
      blank    Space_Separator | 0009
      cntrl    Control | Format | Unassigned | Private_Use | Surrogate
      digit    Decimal_Number
      graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
      lower    Lowercase_Letter
      print    [[:graph:]] | [[:space:]]
      punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
               Final_Punctuation | Initial_Punctuation | Other_Punctuation |
               Open_Punctuation
      space    Space_Separator | Line_Separator | Paragraph_Separator |
               0009 | 000A | 000B | 000C | 000D | 0085
      upper    Uppercase_Letter
      xdigit   0030 - 0039 | 0041 - 0046 | 0061 - 0066
               (0-9, a-f, A-F)
      word     Letter | Mark | Decimal_Number | Connector_Punctuation

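  The intersection example above can be verified by brute force over the
  lowercase alphabet. This is a sketch in Ruby, assuming its Regexp parses
  nested classes and && the same way; the helper names are hypothetical.

```ruby
# The example from the text: [a-w&&[^c-g]z] should equal [abh-w].
letters = ("a".."z").to_a
matched = letters.grep(/[a-w&&[^c-g]z]/).join
same    = letters.grep(/[abh-w]/).join

# POSIX brackets are written inside a character class.
hex = "#1F;"[/[[:xdigit:]]+/]

# '[', ']' and '-' are escaped to be taken literally.
literal = "a-b]".scan(/[\[\]\-]/)

p matched, same, hex, literal
```

  "z" drops out of the result because it is in ([^c-g] OR z) but not in [a-w].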


7. Extended groups

  (?#...)            comment

  (?imx-imx)         option on/off
                         i: ignore case
                         m: multi-line (dot (.) also matches newline)
                         x: extended form
  (?imx-imx:subexp)  option on/off for subexp

  (?:subexp)         non-capturing group
  (subexp)           capturing group

  (?=subexp)         look-ahead
  (?!subexp)         negative look-ahead
  (?<=subexp)        look-behind
  (?<!subexp)        negative look-behind

                     Subexp of look-behind must be fixed-width.
                     But top-level alternatives can be of various lengths.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In negative look-behind, capturing group isn't allowed,
                     but non-capturing group (?:) is allowed.

  (?>subexp)         atomic group
                     no backtracks in subexp.

  (?<name>subexp), (?'name'subexp)
                     define named group
                     (Each character of the name must be a word character.)

                     Not only a name but a number is assigned like a capturing
                     group.

                     Assigning the same name to two or more subexps is allowed.

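  A few of the group forms above in use. This is a sketch in Ruby, assuming
  its Onigmo-based Regexp handles look-around, scoped options and named groups
  as described; the inputs and variable names are invented.

```ruby
# Look-behind and look-ahead assert context without consuming it.
amount  = "price: $42"[/(?<=\$)\d+/]
no_bar  = "foobar" =~ /foo(?!bar)/
alt_len = "bcd" =~ /(?<=a|bc)d/   # top-level alternatives may differ in length

# Options can be scoped to a subexpression.
scoped_ok  = "Ab" =~ /(?i:a)b/
scoped_bad = "AB" =~ /(?i:a)b/

# In ONIG_SYNTAX_RUBY, (?m) lets dot match newline.
dot_nl = "a\nb" =~ /(?m:a.b)/

# A named group is reachable by name and also gets a number.
m = /(?<year>\d{4})-(?<month>\d{2})/.match("2017-05")

p amount, no_bar, alt_len, scoped_ok, scoped_bad, dot_nl, m[:year], m[1]
```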

8. Backreferences

  When we say "backreference a group," it actually means, "re-match the same
  text matched by the subexp in that group."

  \n  \k<n>     \k'n'     (n >= 1) backreference the nth group in the regexp
      \k<-n>    \k'-n'    (n >= 1) backreference the nth group counting
                          backwards from the referring position
      \k<name>  \k'name'  backreference a group with the specified name

  When backreferencing with a name that is assigned to more than one group,
  the last group with the name is checked first, if not matched then the
  previous one with the name, and so on, until there is a match.

  * Backreference by number is forbidden if any named group is defined and
    ONIG_OPTION_CAPTURE_GROUP is not set.


  backreference with recursion level

    (n >= 1, level >= 0)

    \k<n+level> \k'n+level'
    \k<n-level> \k'n-level'

    \k<name+level> \k'name+level'
    \k<name-level> \k'name-level'

    Designates a group at the recursion level relative to the referring
    position.

    ex 1.

      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b>))\z/.match("reee")
      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")

      \k<b+0> refers to the (?<b>.) on the same recursion level with it.

    ex 2.

      r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
      (?<element> \g<stag> \g<content>* \g<etag> ){0}
      (?<stag> < \g<name> \s* > ){0}
      (?<name> [a-zA-Z_:]+ ){0}
      (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
      (?<etag> </ \k<name+1> >){0}
      \g<element>
      __REGEXP__

      p r.match("<foo>f<bar>bbb</bar>f</foo>").captures

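  The backreference forms can be compared side by side. This is a sketch in
  Ruby, assuming its Onigmo-based Regexp supports the same \k syntax,
  including the recursion-level suffix; the test strings are illustrative.

```ruby
# \1 and \k<name> re-match the text a group captured.
doubled = "hello" =~ /(\w)\1/
named   = "abcabc" =~ /(?<part>abc)\k<part>/

# \k<-1> counts groups backwards from the reference.
relative = "abb" =~ /(a)(b)\k<-1>/

# With a recursion level, \k<b+0> reads the capture made at the
# same depth, which turns the pattern into a palindrome check.
palindrome = /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/
is_reer = palindrome.match?("reer")
is_reee = palindrome.match?("reee")

p doubled, named, relative, is_reer, is_reee
```

  Regexp#match? needs Ruby 2.4 or later; `!palindrome.match(str).nil?` is an
  equivalent on older versions.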

9. Subexp calls ("Tanaka Akira special")

  When we say "call a group," it actually means, "re-execute the subexp in
  that group."

  \g<n>     \g'n'     (n >= 1) call the nth group
  \g<-n>    \g'-n'    (n >= 1) call the nth group counting backwards from
                      the calling position
  \g<name>  \g'name'  call the group with the specified name

  * Left-most recursive calls are not allowed.

    ex. (?<name>a|\g<name>b)    => error
        (?<name>a|b\g<name>c)   => OK

  * Calls with a name that is assigned to more than one group are not
    allowed.

  * Call by number is forbidden if any named group is defined and
    ONIG_OPTION_CAPTURE_GROUP is not set.

  * The option status of the called group is always effective.

    ex. /(?-i:\g<name>)(?i:(?<name>a)){0}/.match("A")

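  Because a call re-executes the pattern instead of re-matching the text, it
  can express recursion. This is a sketch in Ruby, assuming its Onigmo-based
  Regexp implements \g<name> as described; the "paren" and "ch" group names
  are hypothetical.

```ruby
# \g<name> re-executes the group's pattern, so it can recurse.
balanced = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/
ok  = balanced.match?("(a(b)(c(d)))")
bad = balanced.match?("(a(b)")

# The non-left-recursive example from the text: a | b <name> c.
nested = "bbacc"[/(?<name>a|b\g<name>c)/]

# A call re-runs the pattern; a backreference re-matches the text.
call_ok    = "ab" =~ /\A(?<ch>[a-z])\g<ch>\z/
backref_ok = "ab" =~ /\A(?<ch>[a-z])\k<ch>\z/

p ok, bad, nested, call_ok, backref_ok
```

  The last two lines differ only in \g versus \k: the call accepts any second
  letter, while the backreference requires a second "a".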

10. Captured group

  Behavior of an unnamed group (...) changes with the following conditions.
  (The behavior of named groups does not change.)

  case 1. /.../     (named group is not used, no option)

     (...) is treated as a capturing group.

  case 2. /.../g    (named group is not used, 'g' option)

     (...) is treated as a non-capturing group (?:...).

  case 3. /..(?<name>..)../   (named group is used, no option)

     (...) is treated as a non-capturing group.
     numbered-backref/call is not allowed.

  case 4. /..(?<name>..)../G  (named group is used, 'G' option)

     (...) is treated as a capturing group.
     numbered-backref/call is allowed.

  where
    g: ONIG_OPTION_DONT_CAPTURE_GROUP
    G: ONIG_OPTION_CAPTURE_GROUP

  (The 'g' and 'G' options were discussed on the ruby-dev ML.)

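  Cases 1 and 3 can be observed from Ruby, which exposes neither the 'g' nor
  the 'G' option in its regexp literals. This is a sketch, assuming Ruby's
  Regexp uses the default behavior described above; the pattern strings are
  made up.

```ruby
# case 1: no named group, so (...) captures.
plain = /(\d+)-(\w+)/.match("12-ab").captures

# case 3: a named group is present, so the unnamed (...) behaves
# like (?:...) and only the named group is captured.
mixed = /(\d+)-(?<tail>\w+)/.match("12-ab").captures

# case 3 also rejects numbered backreferences at compile time.
numbered_error =
  begin
    Regexp.new('(?<x>a)(b)\1')
    nil
  rescue RegexpError => e
    e.class
  end

p plain, mixed, numbered_error
```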


-----------------------------
A-1. Syntax-dependent options

   + ONIG_SYNTAX_RUBY
     (?m): dot (.) also matches newline

   + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
     (?s): dot (.) also matches newline
     (?m): ^ matches after newline, $ matches before newline


A-2. Original extensions

   + hexadecimal digit char type  \h, \H
   + named group                  (?<name>...), (?'name'...)
   + named backref                \k<name>
   + subexp call                  \g<name>, \g<group-num>


A-3. Missing features compared with perl 5.8.0

   + \N{name}
   + \l,\u,\L,\U, \X, \C
   + (?{code})
   + (??{code})
   + (?(condition)yes-pat|no-pat)

   * \Q...\E
     This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.


A-4. Differences with Japanized GNU regex(version 0.12) of Ruby 1.8

   + add character property (\p{property}, \P{property})
   + add hexadecimal digit char type (\h, \H)
   + add look-behind
     (?<=fixed-width-pattern), (?<!fixed-width-pattern)
   + add possessive quantifier. ?+, *+, ++
   + add operations in character class. [], &&
     ('[' must be escaped to be used as a usual char in character class.)
   + add named group and subexp call.
   + octal or hexadecimal number sequence can be treated as
     a multibyte code char in character class if multibyte encoding
     is specified.
     (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
   + allow the range of single byte char and multibyte char in character
     class.
     ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
   + effect range of isolated option is to next ')'.
     ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
   + isolated option is not transparent to previous pattern.
     ex. a(?i)* is a syntax error pattern.
   + allow an unpaired left brace as a normal character.
     ex. /{/, /({)/, /a{2,3/ etc...
   + negative POSIX bracket [:^xxxx:] is supported.
   + POSIX bracket [:ascii:] is added.
   + repeat of look-ahead is not allowed.
     ex. /(?=a)*/, /(?!b){5}/
   + Ignore case option also applies to escape sequences.
     ex. /\x61/i =~ "A"
   + In the range quantifier, the number of the minimum is optional.
     /a{,n}/ == /a{0,n}/
     The omission of both minimum and maximum values is not allowed.
     /a{,}/
   + /{n}?/ is not a reluctant quantifier.
     /a{n}?/ == /(?:a{n})?/
   + invalid back reference is checked and raises error.
     /\1/, /(a)\2/
   + Zero-width match in an infinite loop stops the repeat,
     then changes of the capture group status are checked as stop condition.
     /(?:()|())*\1\2/ =~ ""
     /(?:\1a|())*/ =~ "a"

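  Several of the A-4 items are observable from Ruby today. This is a sketch,
  assuming Ruby's Onigmo-based Regexp kept these behaviors; patterns are built
  with Regexp.new where the literal form might confuse a reader.

```ruby
# An isolated option reaches to the next ')': (?:(?i)a|b) acts
# like (?:(?i:a|b)), so the "b" branch is case-insensitive too.
isolated = "B" =~ /(?:(?i)a|b)/

# Ignore-case applies to characters written as escapes.
escaped = "A" =~ /\x61/i

# An unpaired left brace is an ordinary character.
brace = "a{2,3" =~ Regexp.new('a{2,3')

# Invalid backreferences are rejected when the pattern is compiled.
invalid =
  begin
    Regexp.new('(a)\2')
    nil
  rescue RegexpError => e
    e.class
  end

p isolated, escaped, brace, invalid
```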

A-5. Features disabled in default syntax

   + capture history

     (?@...) and (?@<name>...)

     ex. /(?@a)*/.match("aaa") ==> [<0-1>, <1-2>, <2-3>]

     see sample/listcap.c file.


A-6. Problems

   + Invalid encoding byte sequence is not checked.

     ex. UTF-8

     * Invalid first byte is treated as a character.
       /./u =~ "\xa3"

     * Incomplete byte sequence is not checked.
       /\w+/ =~ "a\xf3\x8ec"

// END
