Oniguruma Regular Expressions Version 4.3.0    2006/08/17

syntax: ONIG_SYNTAX_RUBY (default)


1. Syntax elements

  \       escape (enable or disable meta character meaning)
  |       alternation
  (...)   group
  [...]   character class


2. Characters

  \t           horizontal tab (0x09)
  \v           vertical tab   (0x0B)
  \n           newline        (0x0A)
  \r           return         (0x0D)
  \b           back space     (0x08)
  \f           form feed      (0x0C)
  \a           bell           (0x07)
  \e           escape         (0x1B)
  \nnn         octal char            (encoded byte value)
  \xHH         hexadecimal char      (encoded byte value)
  \x{7HHHHHHH} wide hexadecimal char (character code point value)
  \cx          control char          (character code point value)
  \C-x         control char          (character code point value)
  \M-x         meta  (x|0x80)        (character code point value)
  \M-\C-x      meta control char     (character code point value)

 (* \b means back space in a character class [...] only;
    elsewhere it is a word boundary)

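The escapes above can be tried from Ruby. This is a minimal sketch,
assuming a Ruby (1.9 or later) whose Regexp follows ONIG_SYNTAX_RUBY;
the variable names are illustrative only.

```ruby
# A few escapes from section 2, checked against literal strings.
tab_pos  = "a\tb"   =~ /\t/     # \t   is a horizontal tab (0x09)
hex_pos  = "xAy"    =~ /\x41/   # \xHH: 0x41 is "A"
esc_pos  = "\e[0m"  =~ /\e/     # \e   is escape (0x1B)

# Inside [...] \b is a back space (0x08); outside it is a word boundary.
bs_pos   = "a\bc"    =~ /[\b]/
word_pos = "foo bar" =~ /\bbar/

p [tab_pos, hex_pos, esc_pos, bs_pos, word_pos]   # [1, 1, 0, 1, 4]
```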

3. Character types

  .        any character (except newline)

  \w       word character

           Not Unicode:
             alphanumeric, "_" and multibyte char.

           Unicode:
             General_Category -- (Letter|Mark|Number|Connector_Punctuation)

  \W       non word char

  \s       whitespace char

           Not Unicode:
             \t, \n, \v, \f, \r, \x20

           Unicode:
             0009, 000A, 000B, 000C, 000D, 0085(NEL),
             General_Category -- Line_Separator
                              -- Paragraph_Separator
                              -- Space_Separator

  \S       non whitespace char

  \d       decimal digit char

           Unicode: General_Category -- Decimal_Number

  \D       non decimal digit char

  \h       hexadecimal digit char   [0-9a-fA-F]

  \H       non hexadecimal digit char

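A short Ruby sketch of the character types above. It uses ASCII input
only, on the assumption that the Unicode behavior of \w, \s and \d may
differ between Ruby versions; the variable names are illustrative.

```ruby
words   = "foo_1 bar-2".scan(/\w+/)     # ["foo_1", "bar", "2"]
spaces  = "a \t\nb".scan(/\s/).length   # 3
digits  = "tel 03-1234".scan(/\d+/)     # ["03", "1234"]
hexes   = "0x1F".scan(/\h+/)            # ["0", "1F"]
non_hex = "0x1F".scan(/\H/)             # ["x"]

# "." does not match the newline, so only two characters are returned.
dots    = "ab\n".scan(/./)              # ["a", "b"]

p words, spaces, digits, hexes, non_hex, dots
```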

4. Quantifier

  greedy

    ?       1 or 0 times
    *       0 or more times
    +       1 or more times
    {n,m}   at least n but not more than m times
    {n,}    at least n times
    {,n}    at least 0 but not more than n times ({0,n})
    {n}     n times

  reluctant

    ??      1 or 0 times
    *?      0 or more times
    +?      1 or more times
    {n,m}?  at least n but not more than m times
    {n,}?   at least n times
    {,n}?   at least 0 but not more than n times (== {0,n}?)

  possessive (greedy; does not backtrack once the repeat has matched)

    ?+      1 or 0 times
    *+      0 or more times
    ++      1 or more times

    ({n,m}+, {n,}+, {n}+ are possessive operators in ONIG_SYNTAX_JAVA only)

    ex. /a*+/ === /(?>a*)/

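The three quantifier families differ only in how much they take and
whether they give it back. A minimal Ruby sketch, assuming a Regexp that
follows ONIG_SYNTAX_RUBY; the variable names are illustrative.

```ruby
s = "aaa"

greedy    = s[/a+/]       # "aaa": takes as much as possible
reluctant = s[/a+?/]      # "a":   takes as little as possible
bounded   = s[/a{,2}/]    # "aa":  {,2} is the same as {0,2}

# A greedy a* gives one "a" back so that the trailing "a" can match.
backtracks = /a*a/ =~ s       # 0
# A possessive a*+ keeps everything, so the trailing "a" cannot match.
possessive = /a*+a/ =~ s      # nil
# The atomic group form behaves the same way.
atomic     = /(?>a*)a/ =~ s   # nil

p greedy, reluctant, bounded, backtracks, possessive, atomic
```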

5. Anchors

  ^       beginning of the line
  $       end of the line
  \b      word boundary
  \B      not word boundary
  \A      beginning of string
  \Z      end of string, or before newline at the end
  \z      end of string
  \G      matching start position (*)

          * Ruby Regexp:
                 previous end-of-match position
                (This behavior is defined by Ruby, not by this library.)

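The difference between the line anchors and the string anchors shows up
as soon as the subject contains a newline. A minimal Ruby sketch; the
variable names are illustrative.

```ruby
text = "ab\ncd\n"

line_start = text =~ /^cd$/    # 3:   ^ and $ work per line
str_start  = text =~ /\Acd/    # nil: \A matches at the string start only
before_nl  = text =~ /cd\Z/    # 3:   \Z allows one newline at the end
very_end   = text =~ /cd\z/    # nil: \z is the very end of the string

boundary   = "foo bar" =~ /\bbar\b/   # 4
inside     = "foobar"  =~ /\Bbar/     # 3

# In Ruby, \G is the end of the previous match, so scan stops at "b".
contiguous = "aab a".scan(/\Ga/)      # ["a", "a"]

p line_start, str_start, before_nl, very_end, boundary, inside, contiguous
```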

6. Character class

  ^...    negative class (lowest precedence operator)
  x-y     range from x to y
  [...]   set (character class in character class)
  ..&&..  intersection (next-lowest precedence, after ^)

    ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

  * To use '[', '-', or ']' as a normal character in a character class,
    escape it with '\'.


  POSIX bracket ([:xxxxx:], negate [:^xxxxx:])

    Not Unicode Case:

    alnum    alphabet or digit char
    alpha    alphabet
    ascii    code value: [0 - 127]
    blank    \t, \x20
    cntrl
    digit    0-9
    graph    include all of multibyte encoded characters
    lower
    print    include all of multibyte encoded characters
    punct
    space    \t, \n, \v, \f, \r, \x20
    upper
    xdigit   0-9, a-f, A-F


    Unicode Case:

    alnum    Letter | Mark | Decimal_Number
    alpha    Letter | Mark
    ascii    0000 - 007F
    blank    Space_Separator | 0009
    cntrl    Control | Format | Unassigned | Private_Use | Surrogate
    digit    Decimal_Number
    graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
    lower    Lowercase_Letter
    print    [[:graph:]] | [[:space:]]
    punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
             Final_Punctuation | Initial_Punctuation | Other_Punctuation |
             Open_Punctuation
    space    Space_Separator | Line_Separator | Paragraph_Separator |
             0009 | 000A | 000B | 000C | 000D | 0085
    upper    Uppercase_Letter
    xdigit   0030 - 0039 | 0041 - 0046 | 0061 - 0066
             (0-9, a-f, A-F)

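The class operators and POSIX brackets above can be exercised as
follows. This is a sketch in Ruby with ASCII input only; the patterns
are illustrative and are not taken from the library's test suite.

```ruby
negated    = "abc".scan(/[^b]/)                  # ["a", "c"]
nested     = "a1_Z".scan(/[a-c[0-9]]/)           # ["a", "1"]

# Intersection: lower-case letters that are not vowels.
consonants = "hello".scan(/[a-z&&[^aeiou]]/)     # ["h", "l", "l"]

# '[', '-' and ']' are escaped to be taken literally.
escaped    = "x[-]y".scan(/[\[\-\]]/)            # ["[", "-", "]"]

# POSIX brackets are written inside a character class.
alphas     = "a1 b2".scan(/[[:alpha:]]/)         # ["a", "b"]
non_digits = "a1b2".scan(/[[:^digit:]]/)         # ["a", "b"]

p negated, nested, consonants, escaped, alphas, non_digits
```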

7. Extended groups

  (?#...)            comment

  (?imx-imx)         option on/off
                         i: ignore case
                         m: multi-line (dot(.) match newline)
                         x: extended form
  (?imx-imx:subexp)  option on/off for subexp

  (?:subexp)         non-capturing group
  (subexp)           captured group

  (?=subexp)         look-ahead
  (?!subexp)         negative look-ahead
  (?<=subexp)        look-behind
  (?<!subexp)        negative look-behind

                     The subexp of a look-behind must have a fixed character
                     length. Alternatives of different character lengths are
                     allowed at the top level only.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In a negative look-behind, a captured group is not
                     allowed, but a shy group (?:...) is.

  (?>subexp)         atomic group
                     don't backtrack in subexp.

  (?<name>subexp)    define named group
                     (All characters of the name must be word characters,
                     and the first character must not be a digit or an
                     upper-case letter.)

                     A number is assigned as well as a name, as for a
                     captured group.

                     Assigning the same name to two or more subexps is
                     allowed. In that case a back reference by that name is
                     possible, but a subexp call is not.

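A Ruby sketch of the extended groups, assuming a Regexp that follows
ONIG_SYNTAX_RUBY. The sample strings and variable names are
illustrative.

```ruby
inline_opt = "FOO" =~ /(?i)foo/                              # 0
# (?i:...) applies to "a" only, so the "b" stays case sensitive.
scoped_opt = ["Ab", "AB"].map { |s| s =~ /\A(?i:a)b\z/ }     # [0, nil]

prices     = "100 yen, 200 usd"
ahead      = prices[/\d+(?= yen)/]           # "100"
neg_ahead  = prices[/\b\d+\b(?! yen)/]       # "200"

amounts    = "$100 200"
behind     = amounts[/(?<=\$)\d+/]           # "100"
neg_behind = amounts[/(?<!\$)\b\d+/]         # "200"
# Alternatives of different lengths are fine at the top level.
alt_behind = "bcd" =~ /(?<=a|bc)d/           # 2

# The atomic group keeps every "a", so the following "ab" cannot match.
atomic     = "aaab" =~ /(?>a+)ab/            # nil

# A named group gets a number as well as a name.
m = /(?<year>\d{4})-(?<month>\d{2})/.match("2006-08")
p inline_opt, scoped_opt, ahead, neg_ahead, behind, neg_behind,
  alt_behind, atomic, m[:year], m[2]
```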

8. Back reference

  \n          back reference by group number (n >= 1)
  \k<name>    back reference by group name

  When a back reference uses a name that is defined more than once,
  the subexp with the largest group number is referred to first.
  (If it has not matched, a group with a smaller number is referred to.)

  * Back reference by group number is forbidden if a named group is defined
    in the pattern and ONIG_OPTION_CAPTURE_GROUP is not set.

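The rules above can be checked from Ruby, which does not set
ONIG_OPTION_CAPTURE_GROUP. A minimal sketch; the patterns and variable
names are illustrative.

```ruby
doubled = "hello"[/(\w)\1/]                    # "ll"
named   = "abab" =~ /(?<pair>ab)\k<pair>/      # 0

# The name "c" is defined twice; \k<c> refers to whichever one matched.
multi = %w[aa bb ab].map { |s| s.match?(/\A(?:(?<c>a)|(?<c>b))\k<c>\z/) }

# A numbered back reference next to a named group is rejected.
mixed_error = begin
  Regexp.new('(?<x>a)\1')
  nil
rescue RegexpError => e
  e.message
end

p doubled, named, multi, mixed_error
```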

  back reference with nest level

    (This function is disabled in Ruby 1.9.)

    \k<name+n>     n: 0, 1, 2, ...
    \k<name-n>     n: 0, 1, 2, ...

    Specifies the nest level relative to the position of the back reference.

    ex 1.

      /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")

    ex 2.

      r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
      (?<element> \g<stag> \g<content>* \g<etag> ){0}
      (?<stag> < \g<name> \s* > ){0}
      (?<name> [a-zA-Z_:]+ ){0}
      (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
      (?<etag> </ \k<name+1> >){0}
      \g<element>
      __REGEXP__

      p r.match('<foo>f<bar>bbb</bar>f</foo>').captures



9. Subexp call ("Tanaka Akira special")

  \g<name>    call by group name
  \g<n>       call by group number (n >= 1)

  * left-most recursive call is not allowed.
     ex. (?<name>a|\g<name>b)   => error
         (?<name>a|b\g<name>c)  => OK

  * Call by group number is forbidden if a named group is defined in the
    pattern and ONIG_OPTION_CAPTURE_GROUP is not set.

  * If the option status of the called group differs from that of the
    calling position, the called group's options take effect.

    ex. (?-i:\g<name>)(?i:(?<name>a)){0}  matches "A"

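A subexp call makes recursive patterns possible. The following Ruby
sketch matches balanced parentheses and shows the left-most recursion
error; the group name "paren" and the variable names are illustrative.

```ruby
# The group calls itself for every nested pair of parentheses.
balanced = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

closed   = balanced.match?("(a(b)c)")   # true
unclosed = balanced.match?("(a(b)c")    # false

# Call by number is possible while no named group is defined.
by_number = "abab" =~ /\A(ab)\g<1>\z/   # 0

# A left-most recursive call is rejected when the pattern is compiled.
left_rec_error = begin
  Regexp.new('(?<name>a|\g<name>b)')
  nil
rescue RegexpError => e
  e.message
end

p closed, unclosed, by_number, left_rec_error
```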

10. Captured group

  The behavior of an unnamed group (...) depends on the following conditions.
  (The behavior of a named group does not change.)

  case 1. /.../     (named group is not used, no option)

     (...) is treated as a captured group.

  case 2. /.../g    (named group is not used, 'g' option)

     (...) is treated as a non-capturing group (?:...).

  case 3. /..(?<name>..)../   (named group is used, no option)

     (...) is treated as a non-capturing group (?:...).
     numbered-backref/call is not allowed.

  case 4. /..(?<name>..)../G  (named group is used, 'G' option)

     (...) is treated as a captured group.
     numbered-backref/call is allowed.

  where
    g: ONIG_OPTION_DONT_CAPTURE_GROUP
    G: ONIG_OPTION_CAPTURE_GROUP

  ('g' and 'G' options were discussed on the ruby-dev ML)

  These options are not implemented at the Ruby level.

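Since Ruby exposes neither option, only case 1 and case 3 can be
observed from Ruby code. A minimal sketch; the group name "major" and
the variable names are illustrative.

```ruby
# case 1: no named group, so (...) captures.
plain = /(\d+)-(\d+)/.match("12-34")
plain_caps = plain.captures          # ["12", "34"]

# case 3: a named group is present, so the plain (...) does not capture.
mixed = /(?<major>\d+)-(\d+)/.match("12-34")
mixed_caps  = mixed.captures         # ["12"]
mixed_names = mixed.names            # ["major"]

p plain_caps, mixed_caps, mixed_names
```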

-----------------------------
A-1. Syntax-dependent options

   + ONIG_SYNTAX_RUBY
     (?m): dot(.) matches newline

   + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
     (?s): dot(.) matches newline
     (?m): ^ matches after newline, $ matches before newline

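In Ruby, which uses ONIG_SYNTAX_RUBY, the effect of (?m) can be seen as
follows. A minimal sketch; the variable names are illustrative.

```ruby
s = "a\nb"

default_dot = s =~ /a.b/         # nil: "." stops at the newline
multi_dot   = s =~ /(?m)a.b/     # 0:   (?m) lets "." match the newline
scoped_dot  = s =~ /a(?m:.)b/    # 0:   the same, limited to one subexp

# ^ and $ work per line with or without (?m) in this syntax.
line_anchor = s =~ /^b$/         # 2

p default_dot, multi_dot, scoped_dot, line_anchor
```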

A-2. Original extensions

   + hexadecimal digit char type  \h, \H
   + named group                  (?<name>...)
   + named backref                \k<name>
   + subexp call                  \g<name>, \g<group-num>


A-3. Features lacking compared with Perl 5.8.0

   + [:word:]
   + \N{name}
   + \l,\u,\L,\U, \X, \C
   + (?{code})
   + (??{code})
   + (?(condition)yes-pat|no-pat)

   * \Q...\E
     This is effective in ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.

   * \p{property}, \P{property}
     This is effective in ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.
     Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
     Print, Punct, Space, Upper, XDigit, ASCII are supported.

     The prefix 'Is' of a property name is allowed in ONIG_SYNTAX_PERL only.
     ex. \p{IsXDigit}.

     The negation operator of a property is supported in ONIG_SYNTAX_PERL
     only.
     \p{^...}, \P{^...}


A-4. Differences from the Japanized GNU regex (version 0.12) of Ruby

   + add hexadecimal digit char type (\h, \H)
   + add look-behind
     (?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
   + add possessive quantifiers: ?+, *+, ++
   + add operations in character class: [], &&
     ('[' must be escaped to be used as a normal char in a character class.)
   + add named group and subexp call.
   + an octal or hexadecimal number sequence can be treated as
     a multibyte code char in a character class if a multibyte encoding
     is specified.
     (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
   + allow a range from a single-byte char to a multibyte char in a
     character class.
     ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
   + the effect range of an isolated option extends to the next ')'.
     ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
   + an isolated option is not transparent to the previous pattern.
     ex. a(?i)* is a syntax error pattern.
   + an incomplete left brace is allowed as a normal string.
     ex. /{/, /({)/, /a{2,3/ etc...
   + negative POSIX bracket [:^xxxx:] is supported.
   + POSIX bracket [:ascii:] is added.
   + repeat of look-ahead is not allowed.
     ex. /(?=a)*/, /(?!b){5}/
   + the ignore-case option is effective for a character written as a
     numeric escape.
     ex. /\x61/i =~ "A"
   + in the range quantifier, the minimum number can be omitted.
     /a{,n}/ == /a{0,n}/
     Omitting both the minimum and the maximum at the same time
     is not allowed. (/a{,}/)
   + /a{n}?/ is not a non-greedy operator.
     /a{n}?/ == /(?:a{n})?/
   + an invalid back reference is checked and causes an error.
     /\1/, /(a)\2/
   + a zero-length match in an infinite repeat stops the repeat;
     changes of the capture group status are then checked as the stop
     condition.
     /(?:()|())*\1\2/ =~ ""
     /(?:\1a|())*/ =~ "a"

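A few of the differences listed above can be observed from Ruby. This
is a sketch covering three of the items only, assuming a Regexp that
follows ONIG_SYNTAX_RUBY; the variable names are illustrative.

```ruby
# The ignore-case option also applies to a numeric escape.
numeric_icase = /\x61/i =~ "A"                          # 0

# An incomplete left brace is an ordinary character.
literal_brace = Regexp.new("a{2,3").match?("xa{2,3")    # true

# A back reference to a group that does not exist is an error.
bad_backref = begin
  Regexp.new('(a)\2')
  nil
rescue RegexpError => e
  e.message
end

p numeric_icase, literal_brace, bad_backref
```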

A-5. Functions disabled in the default syntax

   + capture history

     (?@...) and (?@<name>...)

     ex. /(?@a)*/.match("aaa") ==> [<0-1>, <1-2>, <2-3>]

     see sample/listcap.c file.


A-6. Problems

   + Invalid byte sequences are not checked in UTF-8.

     * An invalid first byte is treated as a character.
       /./u =~ "\xa3"

     * An incomplete byte sequence is not checked.
       /\w+/ =~ "a\xf3\x8ec"

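Because the library does not validate its input, a caller may want to
check the encoding before matching. A minimal sketch in Ruby, assuming
String#valid_encoding? is available (Ruby 1.9 or later); safe_match is
a hypothetical helper name, not part of any library.

```ruby
# Hypothetical helper: refuse to match a string with invalid bytes.
def safe_match(pattern, str)
  return nil unless str.valid_encoding?
  pattern.match(str)
end

# The two problem strings from above plus one valid string,
# all tagged as UTF-8.
samples = ["\xa3", "a\xf3\x8ec", "abc"].map { |s| s.b.force_encoding("UTF-8") }

validity = samples.map(&:valid_encoding?)   # [false, false, true]
rejected = safe_match(/\w+/, samples[1])    # nil
accepted = safe_match(/\w+/, samples[2])    # matches "abc"

p validity, rejected, accepted && accepted[0]
```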

// END