Oniguruma Regular Expressions Version 5.9.1    2007/09/05

syntax: ONIG_SYNTAX_RUBY (default)


1. Syntax elements

  \       escape (enable or disable meta character meaning)
  |       alternation
  (...)   group
  [...]   character class


2. Characters

  \t           horizontal tab (0x09)
  \v           vertical tab   (0x0B)
  \n           newline        (0x0A)
  \r           return         (0x0D)
  \b           back space     (0x08)
  \f           form feed      (0x0C)
  \a           bell           (0x07)
  \e           escape         (0x1B)
  \nnn         octal char            (encoded byte value)
  \xHH         hexadecimal char      (encoded byte value)
  \x{7HHHHHHH} wide hexadecimal char (character code point value)
  \cx          control char          (character code point value)
  \C-x         control char          (character code point value)
  \M-x         meta  (x|0x80)        (character code point value)
  \M-\C-x      meta control char     (character code point value)

 (* \b means back space only inside a character class [...];
    outside a class it is a word boundary)
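
  The two meanings of \b can be checked with a short Ruby sketch. It
  assumes a Ruby whose Regexp follows ONIG_SYNTAX_RUBY; the variable
  names are illustrative only.

```ruby
# \b is a word boundary outside a class and a back space (0x08) inside one.
backspace = "\b"                       # Ruby string escape for 0x08

outside_class = (/\b/ =~ backspace)    # no word character, so no boundary
inside_class  = (/[\b]/ =~ backspace)  # matches the back space itself

# Hexadecimal and octal escapes can both name the tab character.
tab_hex   = (/\x09/ =~ "a\tb")
tab_octal = (/\011/ =~ "a\tb")

p outside_class, inside_class, tab_hex, tab_octal
```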


3. Character types

  .        any character (except newline)

  \w       word character

           Not Unicode:
             alphanumeric, "_" and multibyte char.

           Unicode:
             General_Category -- (Letter|Mark|Number|Connector_Punctuation)

  \W       non word char

  \s       whitespace char

           Not Unicode:
             \t, \n, \v, \f, \r, \x20

           Unicode:
             0009, 000A, 000B, 000C, 000D, 0085(NEL),
             General_Category -- Line_Separator
                              -- Paragraph_Separator
                              -- Space_Separator

  \S       non whitespace char

  \d       decimal digit char

           Unicode: General_Category -- Decimal_Number

  \D       non decimal digit char

  \h       hexadecimal digit char   [0-9a-fA-F]

  \H       non hexadecimal digit char


  Character Property

    * \p{property-name}
    * \p{^property-name}    (negative)
    * \P{property-name}     (negative)

    property-name:

     + works on all encodings
       Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
       Print, Punct, Space, Upper, XDigit, Word, ASCII,

     + works on EUC_JP, Shift_JIS
       Hiragana, Katakana

     + works on UTF8, UTF16, UTF32
       Any, Assigned, C, Cc, Cf, Cn, Co, Cs, L, Ll, Lm, Lo, Lt, Lu,
       M, Mc, Me, Mn, N, Nd, Nl, No, P, Pc, Pd, Pe, Pf, Pi, Po, Ps,
       S, Sc, Sk, Sm, So, Z, Zl, Zp, Zs,
       Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese,
       Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic,
       Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian,
       Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul,
       Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana,
       Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam,
       Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian,
       Oriya, Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac,
       Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan,
       Tifinagh, Ugaritic, Yi
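
  A few of the character types and properties above, tried from Ruby.
  This is a sketch that assumes UTF-8 strings; the sample text and
  variable names are made up for illustration.

```ruby
# Sample text: Latin, Han and Hiragana characters (written as \u escapes).
text = "abc \u65E5\u672C\u8A9E \u3042\u3044"

hex_runs = "beef cafe xyz".scan(/\h+/)   # \h is [0-9a-fA-F]
digits   = "room 42".scan(/\d+/)         # \d is a decimal digit
not_word = "a_b c".scan(/\W/)            # "_" counts as a word character

han      = text.scan(/\p{Han}+/)         # script property
hiragana = text.scan(/\p{Hiragana}+/)

p hex_runs, digits, not_word, han, hiragana
```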



4. Quantifier

  greedy

    ?       1 or 0 times
    *       0 or more times
    +       1 or more times
    {n,m}   at least n but not more than m times
    {n,}    at least n times
    {,n}    at least 0 but not more than n times ({0,n})
    {n}     n times

  reluctant

    ??      1 or 0 times
    *?      0 or more times
    +?      1 or more times
    {n,m}?  at least n but not more than m times
    {n,}?   at least n times
    {,n}?   at least 0 but not more than n times (== {0,n}?)

  possessive (greedy, and does not backtrack once the repeat has matched)

    ?+      1 or 0 times
    *+      0 or more times
    ++      1 or more times

    ({n,m}+, {n,}+, {n}+ are possessive operators in ONIG_SYNTAX_JAVA only)

    ex. /a*+/ === /(?>a*)/
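
  The difference between the three kinds shows up when something must
  still match after the repeat. A minimal Ruby sketch (subject string
  and variable names are illustrative):

```ruby
subject = "aaa"

greedy    = subject[/a+/]     # takes as many as possible
reluctant = subject[/a+?/]    # takes as few as possible

# Greedy a* can give one "a" back so that the trailing "a" still matches.
backtracking = (/a*a/ =~ subject)
# Possessive a*+ keeps everything it consumed, so the trailing "a" fails.
possessive   = (/a*+a/ =~ subject)
# The atomic group form, listed above as equivalent, fails the same way.
atomic       = (/(?>a*)a/ =~ subject)

p greedy, reluctant, backtracking, possessive, atomic
```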


5. Anchors

  ^       beginning of the line
  $       end of the line
  \b      word boundary
  \B      not word boundary
  \A      beginning of string
  \Z      end of string, or before newline at the end
  \z      end of string
  \G      matching start position
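
  The line anchors and the string anchors differ only when the subject
  contains newlines. A short Ruby sketch with made-up subjects:

```ruby
line_end = "foo\n"

z_lower = (/foo\z/ =~ line_end)   # \z: only the very end of the string
z_upper = (/foo\Z/ =~ line_end)   # \Z: also just before a final newline

two_lines = "first\nsecond"

line_start   = (/^second$/ =~ two_lines)   # ^ and $ work line by line
string_start = (/\Asecond/ =~ two_lines)   # \A only at the string start

p z_lower, z_upper, line_start, string_start
```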


6. Character class

  ^...    negative class (lowest precedence operator)
  x-y     range from x to y
  [...]   set (character class in character class)
  ..&&..  intersection (low precedence; only ^ is lower)

    ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

  * To use '[', '-', ']' as normal characters in a character class,
    escape them with '\'.
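
  The intersection example above can be checked by scanning the
  alphabet. A Ruby sketch; the second and third patterns are extra
  illustrations that are not part of the original text.

```ruby
alphabet = ("a".."z").to_a.join

# The documented example: [a-w] AND ([^c-g] OR z).
kept = alphabet.scan(/[a-w&&[^c-g]z]/).join

# Nested negated set: lowercase letters that are not vowels.
consonants = "oniguruma".scan(/[a-z&&[^aeiou]]/).join

# Escaped '[', '-' and ']' are ordinary members of the class.
brackets = "a[b]-c".scan(/[\[\-\]]/).join

p kept, consonants, brackets
```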


  POSIX bracket ([:xxxxx:], negate [:^xxxxx:])

    Not Unicode Case:

      alnum    alphabet or digit char
      alpha    alphabet
      ascii    code value: [0 - 127]
      blank    \t, \x20
      cntrl
      digit    0-9
      graph    include all of multibyte encoded characters
      lower
      print    include all of multibyte encoded characters
      punct
      space    \t, \n, \v, \f, \r, \x20
      upper
      xdigit   0-9, a-f, A-F
      word     alphanumeric, "_" and multibyte characters


    Unicode Case:

      alnum    Letter | Mark | Decimal_Number
      alpha    Letter | Mark
      ascii    0000 - 007F
      blank    Space_Separator | 0009
      cntrl    Control | Format | Unassigned | Private_Use | Surrogate
      digit    Decimal_Number
      graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
      lower    Lowercase_Letter
      print    [[:graph:]] | [[:space:]]
      punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
               Final_Punctuation | Initial_Punctuation | Other_Punctuation |
               Open_Punctuation
      space    Space_Separator | Line_Separator | Paragraph_Separator |
               0009 | 000A | 000B | 000C | 000D | 0085
      upper    Uppercase_Letter
      xdigit   0030 - 0039 | 0041 - 0046 | 0061 - 0066
               (0-9, a-f, A-F)
      word     Letter | Mark | Decimal_Number | Connector_Punctuation
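
  POSIX brackets are written inside a character class, so the brackets
  are doubled. A Ruby sketch on an ASCII-only sample (names are
  illustrative):

```ruby
sample = "ab12_ \t"

letters     = sample.scan(/[[:alpha:]]+/)    # positive bracket
non_letters = sample.scan(/[[:^alpha:]]+/)   # negated bracket
blank_count = sample.scan(/[[:blank:]]/).length
hex_digits  = "0xBEEF".scan(/[[:xdigit:]]+/)

p letters, non_letters, blank_count, hex_digits
```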



7. Extended groups

  (?#...)            comment

  (?imx-imx)         option on/off
                         i: ignore case
                         m: multi-line (dot(.) matches newline)
                         x: extended form
  (?imx-imx:subexp)  option on/off for subexp

  (?:subexp)         non-captured group
  (subexp)           captured group

  (?=subexp)         look-ahead
  (?!subexp)         negative look-ahead
  (?<=subexp)        look-behind
  (?<!subexp)        negative look-behind

                     The subexp of a look-behind must have a fixed
                     character length. Different character lengths are
                     allowed only in top-level alternatives.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In a negative look-behind, a captured group isn't
                     allowed, but a shy group (?:) is allowed.

  (?>subexp)         atomic group
                     don't backtrack in subexp.

  (?<name>subexp), (?'name'subexp)
                     define named group
                     (All characters of the name must be word characters.)

                     Not only a name but also a number is assigned, as for
                     a captured group.

                     Assigning the same name to two or more subexps is
                     allowed. In this case, a subexp call cannot be
                     performed, although a back reference is possible.
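
  Some of the groups above in use. This is a Ruby sketch under the
  assumption of ONIG_SYNTAX_RUBY; the subjects and group names (year,
  month) are invented for the example.

```ruby
# Look-behind: digits preceded by "$", without consuming the "$".
amount = "price: $42"[/(?<=\$)\d+/]

# Top-level alternatives of different lengths are accepted in a look-behind.
after_prefix = (/(?<=a|bc)X/ =~ "bcX aX")

# Named groups can be read back by name.
date  = /(?<year>\d{4})\/(?<month>\d{2})/.match("2007/09/05")
year  = date[:year]
month = date[:month]

# Option limited to a subexp: only the "a" is matched case-insensitively.
scoped_hit  = (/(?i:a)b/ =~ "Ab")
scoped_miss = (/(?i:a)b/ =~ "AB")

p amount, after_prefix, year, month, scoped_hit, scoped_miss
```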


8. Back reference

  \n          back reference by group number (n >= 1)
  \k<n>       back reference by group number (n >= 1)
  \k'n'       back reference by group number (n >= 1)
  \k<-n>      back reference by relative group number (n >= 1)
  \k'-n'      back reference by relative group number (n >= 1)
  \k<name>    back reference by group name
  \k'name'    back reference by group name

  When a name is defined more than once, a back reference by that name
  refers preferentially to the subexp with the largest number.
  (If that group has not matched, a group with a smaller number is
  referred to.)

  * Back reference by group number is forbidden if a named group is defined
    in the pattern and ONIG_OPTION_CAPTURE_GROUP is not set.
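
  A Ruby sketch of back references by name and by number. It assumes
  that Ruby's Regexp does not set ONIG_OPTION_CAPTURE_GROUP, so the last
  pattern is expected to be rejected; all names are illustrative.

```ruby
# The closing quote must be the same character as the opening one.
quoted = /(?<q>['"]).*?\k<q>/

single = %q(say 'hi' now)[quoted]
mixed  = %q(say 'hi" now)[quoted]   # quotes do not pair up

# Numbered back reference in a pattern without named groups.
doubled = "hello"[/(.)\1/]

# Mixing a named group with a numbered back reference is an error.
numbered_error =
  begin
    Regexp.new('(?<c>.)\1')
    nil
  rescue RegexpError => e
    e.message
  end

p single, mixed, doubled, numbered_error
```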


  back reference with nest level

    level: 0, 1, 2, ...

    \k<n+level>     (n >= 1)
    \k<n-level>     (n >= 1)
    \k'n+level'     (n >= 1)
    \k'n-level'     (n >= 1)

    \k<name+level>
    \k<name-level>
    \k'name+level'
    \k'name-level'

    Designates the nest level relative to the back reference position.

    ex 1.

      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")

    ex 2.

      r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
      (?<element> \g<stag> \g<content>* \g<etag> ){0}
      (?<stag> < \g<name> \s* > ){0}
      (?<name> [a-zA-Z_:]+ ){0}
      (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
      (?<etag> </ \k<name+1> >){0}
      \g<element>
      __REGEXP__

      p r.match('<foo>f<bar>bbb</bar>f</foo>').captures
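
  The pattern of ex 1 accepts palindromes: \k<b+0> refers to the <b>
  captured at the same nest level as the back reference. A Ruby sketch
  that tries it on a few subjects (the subjects other than "reer" are
  added for illustration):

```ruby
palindrome = /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/

even_length = !palindrome.match("reer").nil?
odd_length  = !palindrome.match("level").nil?
mismatch    = !palindrome.match("reex").nil?

p even_length, odd_length, mismatch
```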



9. Subexp call ("Tanaka Akira special")

  \g<name>    call by group name
  \g'name'    call by group name
  \g<n>       call by group number (n >= 1)
  \g'n'       call by group number (n >= 1)
  \g<-n>      call by relative group number (n >= 1)
  \g'-n'      call by relative group number (n >= 1)

  * A left-most recursive call is not allowed.
     ex. (?<name>a|\g<name>b)   => error
         (?<name>a|b\g<name>c)  => OK

  * Call by group number is forbidden if a named group is defined in the
    pattern and ONIG_OPTION_CAPTURE_GROUP is not set.

  * If the option status of the called group differs from that of the
    calling position, the called group's options take effect.

    ex. (?-i:\g<name>)(?i:(?<name>a)){0}  match to "A"
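
  A subexp call lets a group refer to itself, which is enough to match
  nested parentheses. A Ruby sketch; the group name paren and the
  subjects are hypothetical, and the second part relies on the left-most
  recursion rule stated above.

```ruby
# A parenthesized group may contain plain characters or further groups.
balanced = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

closed   = !balanced.match("(a(b)c)").nil?
unclosed = !balanced.match("(a(b)c").nil?

# Left-most recursion is rejected when the pattern is compiled.
left_recursion_error =
  begin
    Regexp.new('(?<name>a|\g<name>b)')
    nil
  rescue RegexpError => e
    e.message
  end

p closed, unclosed, left_recursion_error
```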


10. Captured group

  Behavior of the unnamed group (...) changes with the following conditions.
  (The behavior of a named group does not change.)

  case 1. /.../     (named group is not used, no option)

     (...) is treated as a captured group.

  case 2. /.../g    (named group is not used, 'g' option)

     (...) is treated as a non-captured group (?:...).

  case 3. /..(?<name>..)../   (named group is used, no option)

     (...) is treated as a non-captured group (?:...).
     numbered-backref/call is not allowed.

  case 4. /..(?<name>..)../G  (named group is used, 'G' option)

     (...) is treated as a captured group.
     numbered-backref/call is allowed.

  where
    g: ONIG_OPTION_DONT_CAPTURE_GROUP
    G: ONIG_OPTION_CAPTURE_GROUP

  ('g' and 'G' options were discussed on the ruby-dev ML)
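
  Cases 1 and 3 can be observed from Ruby, assuming its Regexp sets
  neither of the two options. The patterns are made up for the sketch.

```ruby
# case 1: no named group, so both (...) groups capture.
unnamed_only = /(a)(b)/.match("ab").captures

# case 3: a named group is present, so the plain (b) does not capture.
with_named = /(?<first>a)(b)/.match("ab").captures

p unnamed_only, with_named
```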



-----------------------------
A-1. Syntax-dependent options

   + ONIG_SYNTAX_RUBY
     (?m): dot(.) matches newline

   + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
     (?s): dot(.) matches newline
     (?m): ^ matches after newline, $ matches before newline
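
  Under ONIG_SYNTAX_RUBY the m option only changes the dot; ^ and $
  are line anchors regardless. A Ruby sketch with an invented subject:

```ruby
subject = "a\nb"

dot_default  = (/a.b/ =~ subject)        # dot stops at the newline
dot_multi    = (/a.b/m =~ subject)       # m option: dot matches newline
inline_multi = (/(?m:a.b)/ =~ subject)   # same option set inline

caret_after_newline = (/^b/ =~ subject)  # ^ matches after newline without m

p dot_default, dot_multi, inline_multi, caret_after_newline
```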


A-2. Original extensions

   + hexadecimal digit char type  \h, \H
   + named group                  (?<name>...), (?'name'...)
   + named backref                \k<name>
   + subexp call                  \g<name>, \g<group-num>


A-3. Features lacking compared with Perl 5.8.0

   + \N{name}
   + \l,\u,\L,\U, \X, \C
   + (?{code})
   + (??{code})
   + (?(condition)yes-pat|no-pat)

   * \Q...\E
     This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.


A-4. Differences from Japanized GNU regex (version 0.12) of Ruby 1.8

   + add character property (\p{property}, \P{property})
   + add hexadecimal digit char type (\h, \H)
   + add look-behind
     (?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
   + add possessive quantifier. ?+, *+, ++
   + add operations in character class. [], &&
     ('[' must be escaped to be used as an ordinary char in a character
     class.)
   + add named group and subexp call.
   + octal or hexadecimal number sequence can be treated as
     a multibyte code char in character class if multibyte encoding
     is specified.
     (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
   + allow the range of single byte char and multibyte char in character
     class.
     ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
   + effect range of isolated option is to next ')'.
     ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
   + isolated option is not transparent to previous pattern.
     ex. a(?i)* is a syntax error pattern.
   + an incomplete left brace is allowed as an ordinary string.
     ex. /{/, /({)/, /a{2,3/ etc...
   + negative POSIX bracket [:^xxxx:] is supported.
   + POSIX bracket [:ascii:] is added.
   + repeat of look-ahead is not allowed.
     ex. /(?=a)*/, /(?!b){5}/
   + Ignore case option is effective for numbered characters.
     ex. /\x61/i =~ "A"
   + In the range quantifier, the minimum may be omitted.
     /a{,n}/ == /a{0,n}/
     Omitting both the minimum and the maximum at the same time
     is not allowed. (/a{,}/)
   + /a{n}?/ is not a non-greedy operator.
     /a{n}?/ == /(?:a{n})?/
   + invalid back reference is checked and causes an error.
     /\1/, /(a)\2/
   + Zero-length match in infinite repeat stops the repeat,
     then changes of the capture group status are checked as stop condition.
     /(?:()|())*\1\2/ =~ ""
     /(?:\1a|())*/ =~ "a"
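
  Three of the differences listed above, tried from Ruby. This is only
  a sketch; it assumes the Regexp in use follows this list, and the
  variable names are illustrative.

```ruby
# Omitted minimum: a{,2} behaves like a{0,2}.
upto_two = "aaa"[/a{,2}/]

# Ignore case also applies to a character written by its number.
numbered_ci = (/\x61/i =~ "A")

# A back reference to a group that does not exist is a compile error.
bad_backref =
  begin
    Regexp.new('(a)\2')
    nil
  rescue RegexpError => e
    e.message
  end

p upto_two, numbered_ci, bad_backref
```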


A-5. Functions disabled by the default syntax

   + capture history

     (?@...) and (?@<name>...)

     ex. /(?@a)*/.match("aaa") ==> [<0-1>, <1-2>, <2-3>]

     see sample/listcap.c file.


A-6. Problems

   + Invalid encoding byte sequence is not checked.

     ex. UTF-8

     * Invalid first byte is treated as a character.
       /./u =~ "\xa3"

     * Incomplete byte sequence is not checked.
       /\w+/ =~ "a\xf3\x8ec"
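
  Since the library does not check byte sequences, a caller may want to
  validate the subject before matching. A Ruby sketch using
  String#valid_encoding?, which is a Ruby method and not part of
  Oniguruma; the two subjects are the ones from the examples above.

```ruby
# The two problem subjects from above, tagged as UTF-8.
invalid_first = "\xa3".dup.force_encoding(Encoding::UTF_8)
incomplete    = "a\xf3\x8ec".dup.force_encoding(Encoding::UTF_8)
well_formed   = "abc"

# Validate before handing the subject to the regex engine.
checks = [invalid_first, incomplete, well_formed].map(&:valid_encoding?)

p checks
```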

// END