Lines Matching refs:a

2 This file contains a concatenation of the PCRE man pages, converted to plain
3 text format for ease of searching with a text editor, or for use on systems
4 that do not have a man page processor. The small individual files that give
22 first release of a new API, known as PCRE2, with release numbers start-
31 The PCRE library is a set of functions that implement regular expres-
33 just a few differences. Some features that appeared in Python and PCRE
41 (including UTF-8 strings), and a second library that supports 16-bit
46 Starting with release 8.32 it is possible to compile a third separate
70 alternative function that matches the same compiled patterns in a dif-
72 advantages. For a discussion of the two matching algorithms, see the
75 PCRE is written in C and released as a C library. A number of people
77 Google Inc. have provided a comprehensive C++ wrapper for the 8-bit
87 tern and pcrecompat pages. There is a syntax summary in the pcresyntax
91 library is built. The pcre_config() function makes it possible for a
97 The libraries contain a number of undocumented internal functions and
102 is possible to control which external symbols are exported when a
109 If you are using PCRE in a non-UTF application that permits users to
110 supply arbitrary patterns for compilation, you should be aware of a
111 feature that allows users to turn on UTF support from within a pattern,
117 ity. If the data string is very long, such a check might use suffi-
124 option at compile time. This causes a compile time error if a pattern
125 contains a UTF-setting sequence.
132 Another way that performance can be hit is by running a pattern that
133 has a very large search tree against a string that will never match.
134 Nested unlimited repeats in a pattern are a common example. PCRE pro-
141 The user documentation for PCRE comprises a number of different sec-
142 tions. In the "man" format, each of these is a separate "man page". In
143 the HTML format, each is a separate page, linked from the index page.
147 is a program listing), are concatenated in pcre.txt, for ease of
159 pcredemo a demonstration C program that uses PCRE
176 In the "man" and HTML formats, there is also a short page for each C
186 Putting an actual email address here seems to have been a spam magnet,
314 Starting with release 8.30, it is possible to compile a PCRE library
328 tions from just one library. For example, if you want to study a pat-
355 int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
356 as "unsigned short int", but checks that it really is a 16-bit data
367 used for passing data to a callout function is pcre16_callout_block.
375 For every function in the 8-bit library there is a corresponding func-
376 tion in the 16-bit library with a name that starts with pcre16_ instead
378 extra function, pcre16_utf16_to_host_byte_order(). This is a utility
379 function that converts a UTF-16 character string to host byte order if
388 input string; a negative value specifies a zero-terminated string.
394 If byte_order is not NULL, a non-zero value of the integer to which it
428 define the same bits in the options word. There is a discussion about
452 A UTF-16 string can indicate its endianness by special code known as a
463 given when a compiled pattern is passed to a function that processes
464 patterns in the other mode, for example, if a pattern compiled with
481 If there is an error while compiling a pattern, the error text that is
488 The subject and mark fields in the callout block that is passed to a
645 Starting with release 8.32, it is possible to compile a PCRE library
660 want to study a pattern that was compiled with pcre32_compile(), you
686 int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32
687 as "unsigned int", but checks that it really is a 32-bit data type. If
698 used for passing data to a callout function is pcre32_callout_block.
706 For every function in the 8-bit library there is a corresponding func-
707 tion in the 32-bit library with a name that starts with pcre32_ instead
709 extra function, pcre32_utf32_to_host_byte_order(). This is a utility
710 function that converts a UTF-32 character string to host byte order if
719 input string; a negative value specifies a zero-terminated string.
725 If byte_order is not NULL, a non-zero value of the integer to which it
759 define the same bits in the options word. There is a discussion about
782 A UTF-32 string can indicate its endianness by special code known as a
792 The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed
793 to a function that processes patterns in the other mode, for example,
794 if a pattern compiled with pcre_compile() is passed to pcre32_exec().
809 If there is an error while compiling a pattern, the error text that is
816 The subject and mark fields in the callout block that is passed to a
867 PCRE is distributed with a configure script that can be used to build
873 systems. There is a lot more information about building PCRE without
876 consult this file as well as the README file if you are building in a
910 By default, a library called libpcre is built, containing functions
913 build a separate library, called libpcre16, in which strings are con-
951 will search for a C++ compiler and C++ header files. If it finds them,
974 utf8 is a synonym of --enable-utf.)
1013 this option is set for an unsupported architecture, a compile time
1014 error occurs. See the pcrejit documentation for a discussion of JIT
1026 the end of a line. This is the normal newline character on Unix-like
1032 to the configure command. There is also a --enable-newline-is-lf
1040 to the configure command. There is a fourth option, specified by
1045 CRLF as indicating a line ending. Finally, a fifth option, specified by
1058 By default, the sequence \R in a pattern matches any Unicode newline
1078 longer used is 10; it can be changed by adding a setting such as
1087 Within a compiled pattern, offset values are used to point from one
1090 two-byte values are used for these offsets, leading to a maximum size
1091 for a compiled pattern of around 64K. This is sufficient to handle all
1094 use three-byte or four-byte offsets by adding a setting such as
1099 16-bit library, a value of 3 is rounded up to 4. In these libraries,
1113 the maximum stack size. There is a discussion in the pcrestack docu-
1117 If you want to build a version of PCRE that works this way, add
1137 Internally, PCRE has a function called match(), which it calls repeat-
1138 edly (sometimes recursively) when matching a pattern with the
1140 function may be called during a single matching operation, a limit can
1141 be placed on the resources used by a single call to pcre_exec(). The
1143 tation. The default is 10 million, but this can be changed by adding a
1156 imposes no additional constraints. However, you can set a lower limit
1168 less than 256. By default, PCRE is built with a set of tables that are
1175 Instead, a program called dftables is compiled and run. This outputs
1186 character code is ASCII (or Unicode, which is a superset of ASCII).
1229 pcregrep uses an internal buffer to hold a "window" on the file it is
1231 it finds a match. The size of the buffer is controlled by a parameter
1240 this value by specifying a run-time option.
1250 library, and when its input is from a terminal, it reads it using the
1252 Note that libreadline is GPL-licensed, so if you distribute a binary of
1256 pcretest build. In many operating environments with a system-installed
1288 If your C compiler is gcc, you can build a version of PCRE that can
1289 generate a code coverage report for its test suite. To enable this, you
1296 Note that using ccache (a caching C compiler) is incompatible with code
1309 This creates a fresh coverage report for the PCRE test suite. It is
1371 in PCRE for matching a compiled regular expression against a given sub-
1374 the same as Perl's matching function, and provide a Perl-compatible
1380 pcre16_dfa_exec() and pcre32_dfa_exec() functions; they operate in a
1385 When there is only one possible way in which a given subject string can
1386 match a pattern, the two algorithms give the same answer. A difference
1402 The set of strings that are matched by a regular expression can be rep-
1403 resented as a tree structure. An unlimited repetition in the pattern
1404 makes the tree of infinite size, but it is still a tree. Matching the
1405 pattern to a given subject string (from a given starting point) can be
1406 thought of as a search of the tree. There are two ways to search a
1414 sions", the standard algorithm is an "NFA algorithm". It conducts a
1415 depth-first search of the pattern tree. That is, it proceeds along a
1417 required. When there is a mismatch, the algorithm tries any alterna-
1425 If a leaf node is reached, a matching string has been found, and at
1432 Because it ends up with a single path through the tree, it is rela-
1440 This algorithm conducts a breadth-first search of the tree. Starting
1444 matches. In Friedl's terminology, this is a kind of "DFA algorithm",
1445 though it is not implemented as a traditional finite state machine (it
1450 exception: when a lookaround assertion is encountered, the characters
1474 ter repeats at the end of a pattern (as well as internally). For exam-
1475 ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
1479 either use an ungreedy repeat ("a\d+?") or set the PCRE_NO_AUTO_POSSESS
1482 There are a number of features of PCRE regular expressions that are not
1488 sessive quantifiers can make a difference when what follows could also
1489 match what is quantified, for example in a pattern like this:
1491 ^a++\w!
1494 a non-possessive quantifier. Similarly, if an atomic group is present,
1495 it is matched as if it were a standalone pattern at the current point,
1508 4. For the same reason, conditional expressions that use a backrefer-
1509 ence as the condition or test for a specific group recursion are not
1521 matches a single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is
1523 through the subject string one character (not data unit) at a time, for
1527 are not supported. (*FAIL) is supported, and behaves like a failing
1536 1. All possible matches (at a single point in the subject) are automat-
1553 The alternative algorithm suffers from a number of disadvantages:
1724 documentation. Both of these APIs define a set of C function calls. A
1736 In a Windows environment, if you want to statically link an application
1737 program against a non-dll pcre.a file, you must define PCRE_STATIC
1744 a Perl-compatible manner. A sample program that demonstrates the sim-
1759 From release 8.32 there is also a direct interface for JIT execution,
1764 ble, is also provided. This uses a different algorithm for the match-
1765 ing. The alternative algorithm finds all possible matches (at a given
1773 convenience functions for extracting captured substrings from a subject
1787 The function pcre_maketables() is used to build a set of character
1794 The function pcre_fullinfo() is used to find out information about a
1795 compiled pattern. The function pcre_version() returns a pointer to a
1798 The function pcre_refcount() maintains a reference count in a data
1799 block containing a compiled pattern. This is provided for the benefit
1805 so a calling program can replace them if it wishes to intercept the
1813 this. It is a non-standard way of building PCRE, for use in environ-
1817 used, these functions are always called in a stack-like manner (last
1819 There is a discussion about PCRE's stack usage in the pcrestack docu-
1823 by the caller to a "callout" function, which PCRE will then call at
1824 specified points during a matching operation. Details are given in the
1828 set by the caller to a function that is called by PCRE whenever it
1829 starts to compile a parenthesized part of a pattern. When parentheses
1832 stacks can force a compilation error if the stack runs out. The func-
1839 strings: a single CR (carriage return) character, a single LF (line-
1847 system as its standard newline sequence. When PCRE is built, a default
1849 dard. When PCRE is run, the default can be overridden, either when a
1858 acter or pair of characters that indicate a line break". The choice of
1861 CRLF is a recognized line ending sequence, the match position advance-
1862 ment for a non-anchored pattern. There is more detail about this in the
1867 which is controlled in a similar way, but by separate options.
1878 The compiled form of a regular expression is not altered during match-
1889 The compiled form of a regular expression can be saved and re-used at a
1890 later time, possibly by a different program, and even on a host other
1892 pcreprecompile documentation, which includes a description of the
1893 pcre_pattern_to_host_byte_order() function. However, compiling a regu-
1894 lar expression with one version of PCRE for use with a different ver-
1902 The function pcre_config() makes it possible for a PCRE client to dis-
1908 information is required; the second argument is a pointer to a variable
1950 The output is a pointer to a zero-terminated "const char *" string. If
1971 matches any Unicode line ending sequence; a value of 1 means that \R
1972 matches only CR, LF, or CRLF. The default can be overridden when a pat-
1980 is either 2 or 4 and is still a number of bytes. For the 32-bit
1981 library, the value is either 2 or 4 and is still a number of bytes. The
1995 The output is a long integer that gives the maximum depth of nesting of
1996 parentheses (of any kind) in a pattern. This limit is imposed to cap
1997 the amount of system stack used when a pattern is compiled. It is spec-
2000 tion. For finer control over compilation stack usage, you can set a
2005 The output is a long integer that gives the default limit for the num-
2006 ber of internal matching function calls in a pcre_exec() execution.
2011 The output is a long integer that gives the default limit for the depth
2012 of recursion when calling the internal matching function in a
2039 to compile a pattern into an internal form. The only difference between
2041 errorcodeptr, via which a numerical error code can be returned. To
2045 The pattern is a C string terminated by a binary zero, and is passed in
2046 the pattern argument. A pointer to a single block of memory that is
2049 is a typedef for a structure whose contents are not externally defined.
2053 Although the compiled code of a PCRE regex is relocatable, that is, it
2055 fully relocatable, because it may contain a copy of the tableptr argu-
2071 if compilation of a pattern fails, pcre_compile() returns NULL, and
2072 sets the variable pointed to by errptr to point to a textual error mes-
2073 sage. This is a static string that is part of the library. You must not
2083 Note that the offset is in data units, not characters, even in a UTF
2084 mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
2088 codeptr argument is not NULL, a non-zero error code number is returned
2092 If the final argument, tableptr, is NULL, PCRE uses a default set of
2095 result of a call to pcre_maketables(). This value is stored with the
2100 This code fragment shows a typical straightforward call to pcre_com-
2137 ting an option when a compiled pattern is matched.
2143 changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
2154 If this bit is set, a dollar metacharacter in the pattern matches only
2155 at the end of the subject string. Without this option, a dollar also
2156 matches immediately before a newline at the end of the string (but not
2159 Perl, and no way to set it within a pattern.
2163 If this bit is set, a dot metacharacter in the pattern matches a char-
2164 acter of any value, including one that indicates a newline. However, it
2166 Without this option, a dot does not match when the current position is
2167 at a newline. This option is equivalent to Perl's /s option, and it can
2168 be changed within a pattern by a (?s) option setting. A negative class
2169 such as [^a] always matches newline characters, independent of the set-
2183 totally ignored except when escaped or inside a character class. How-
2185 introduce various parenthesized subpatterns, nor within a numerical
2187 between an item and a following quantifier and between a quantifier and
2188 a following + that indicates possessiveness.
2195 PCRE_EXTENDED also causes characters between an unescaped # outside a
2198 within a pattern by a (?x) option setting.
2201 options passed to pcre_compile() or by a special sequence at the start
2204 of comment is a literal newline sequence in the pattern; escape
2205 sequences that happen to represent a newline do not count.
2210 sequences in a pattern, for example within the sequence (?( that intro-
2211 duces a conditional subpattern.
2217 little use. When set, any backslash in a pattern that is followed by a
2219 these combinations for future expansion. By default, as in Perl, a
2220 backslash followed by a letter with no special meaning is treated as a
2223 controlled by this option. It can also be set by a (?X) option setting
2224 within a pattern.
2238 (1) A lone closing square bracket in a pattern causes a compile-time
2240 as a data character). Thus, the pattern AB]CD becomes illegal when this
2243 (2) At run time, a back reference to an unset subpattern group matches
2245 tive to fail). A pattern such as (\1)(a) succeeds when this option is
2246 set (assuming it can find an "a" in the subject), whereas it fails by
2249 (3) \U matches an upper case "U" character; by default \U causes a com-
2252 (4) \u matches a lower case "u" character unless it is followed by four
2254 code point to match. By default, \u causes a compile time error (Perl
2257 (5) \x matches a lower case "x" character unless it is followed by two
2259 code point to match. By default, as in Perl, a hexadecimal number is
2261 for example, \xz matches a binary zero character followed by z).
2266 line", PCRE treats the subject string as consisting of a single line of
2270 before a terminating newline (except when PCRE_DOLLAR_ENDONLY is set).
2272 metacharacter (.) does not match at a newline. This behaviour (for ^,
2279 changed within a pattern by a (?m) option setting. If there are no new-
2280 lines in a subject string, or no occurrences of ^ or $ in a pattern,
2299 when PCRE was built. Setting the first or the second specifies that a
2300 newline is indicated by a single character (CR or LF, respectively).
2301 Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
2321 treated as a number, giving eight possibilities. Currently only six are
2328 The only time that a line break in a pattern is specially recognized
2331 side a character class indicates a comment that lasts until after the
2349 optimization that, for example, turns a+b into a++b in order to avoid
2350 backtracks into a+ that can never be successful. However, if callouts
2353 a full unoptimized search and run all the callouts, but it is mainly
2380 not compatible with Perl. It can also be set by a (?U) option setting
2393 When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
2394 automatically checked. There is a discussion about the validity of
2399 effect of passing an invalid UTF-8 string as a pattern is undefined. It
2429 13 POSIX named classes are supported only within a class
2472 55 repeating a DEFINE group is not allowed
2474 57 \g is not followed by a braced, angle-bracketed, or quoted
2475 name/number or by a plain number
2476 58 a numbered reference must not be zero
2489 69 \k is not followed by a braced, angle-bracketed, or quoted name
2491 71 \N is not supported in a class
2504 84 group name must start with a non-digit
2516 If a compiled pattern is going to be used several times, it is worth
2518 matching. The function pcre_study() takes a pointer to a compiled pat-
2520 information that will help speed up matching, pcre_study() returns a
2521 pointer to a pcre_extra block, in which the study_data field points to
2525 pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con-
2527 passed; these are described below in the section on matching a pattern.
2534 returns a pcre_extra block even if studying did not find any additional
2551 JIT compilation is a heavyweight optimization. It can take some time
2553 terns the benefit of faster execution might be offset by a much slower
2559 The third argument for pcre_study() is a pointer for an error message.
2561 points to is set to NULL. Otherwise it is set to point to a textual
2562 error message. This is a static string that is part of the library. You
2566 When you are finished with a pattern, you can free the memory used for
2573 This is a typical way in which pcre_study() is used (except that in a
2583 &error); /* set to NULL or points to a message */
2590 Studying a pattern does two things: first, a lower bound for the length
2595 lower bound. You can find out the value in a calling program via the
2598 Studying a pattern is also useful for non-anchored patterns that do not
2599 have a single fixed starting character. A bitmap of possible starting
2600 bytes is created. This speeds up finding a position in the subject at
2618 There is a longer discussion of PCRE_NO_START_OPTIMIZE below.
2624 letters, digits, or whatever, by reference to a set of tables, indexed
2630 the PCRE_UCP option can be set when a pattern is compiled; this causes
2646 application that calls PCRE. These may be created in a different locale
2675 It is possible to pass a table pointer or NULL (indicating the use of
2677 sion below in the section on matching a pattern). This facility is pro-
2679 reloaded. Character tables are not saved with patterns, so if a non-
2682 match a pattern in a different locale from the one in which it was com-
2691 The pcre_fullinfo() function returns information about a compiled pat-
2695 The first argument for pcre_fullinfo() is a pointer to the compiled
2698 of information is required, and the fourth argument is a pointer to a
2712 anness error can occur if a compiled pattern is saved and reloaded on a
2713 different host. Here is a typical call of pcre_fullinfo(), to obtain
2740 Return a pointer to the internal default character tables within PCRE.
2744 passing a NULL table pointer.
2749 a non-anchored pattern. The name of this option refers to the 8-bit
2757 If there is a fixed first value, for example, the letter "c" from a
2765 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
2772 of a subject string or after any newline within the string. Otherwise
2790 a non-anchored pattern. The fourth argument should point to an int
2793 If there is a fixed first value, for example, the letter "c" from a
2798 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
2805 a subject string or after any newline within the string. Otherwise 0 is
2810 If the pattern was studied, and this resulted in the construction of a
2811 256-bit table indicating a fixed set of values for the first data unit
2812 in any matching string, a pointer to the table is returned. Otherwise
2820 variable. An explicit match is either a literal CR or LF character, or
2835 with a JIT option, or that the JIT compiler could not handle this par-
2841 If the pattern was successfully studied with a JIT option, return the
2843 ment should point to a size_t variable.
2848 any matched string, other than at its start, if such a value has been
2850 is no such value, -1 is returned. For anchored patterns, a last literal
2852 example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
2853 /^a\dz\d/ the returned value is -1.
2867 If the pattern set a match limit by including an item of the form
2878 Note that the simple assertions \b and \B require a one-character look-
2879 behind. \A also registers a one-character lookbehind, though it does
2881 least one character from the old segment is retained when a new segment
2883 might match incorrectly at the start of a new segment.
2887 If the pattern was studied and a minimum length for matching subject
2889 value is -1. The value is a number of characters, which in UTF mode may
2891 point to an int variable. A non-negative value is a lower bound to the
2905 first converting the name to a number in order to access the correct
2910 The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
2914 a pointer to the first entry of the table. This is a pointer to char in
2934 As a simple example of the name/number table, consider the following
2946 00 01 d a t e 00 ??
2947 00 05 d a y 00 ?? ??
2949 00 02 y e a r 00 ??
2966 Return a copy of the options with which the pattern was compiled. The
2989 If the pattern set a recursion limit by including an item of the form
2998 libraries). The fourth argument should point to a size_t variable. This
3003 the pcre structure. Studying a compiled pattern, with or without JIT,
3009 pointed to by the study_data field in a pcre_extra block. If pcre_extra
3011 ment should point to a size_t variable. The study_data field is set by
3013 section entitled "Studying a pattern" above). The format of the
3020 Returns 1 if there is a rightmost literal data unit that must exist in
3026 For anchored patterns, a last literal value is recorded only if it fol-
3028 /^a\d+z\d+/ the returned value is 1 (with "z" returned from
3029 PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.
3034 any matched string, other than at its start, if such a value has been
3043 The pcre_refcount() function is used to maintain a reference count in
3044 the data block that contains a compiled pattern. It is provided for the
3049 When a pattern is compiled, the reference count field is initialized to
3057 if a pattern is compiled on one host and then transferred to a host
3058 whose byte-order is different. (This seems a highly unlikely scenario.)
3067 The function pcre_exec() is called to match a subject string against a
3075 operates in a Perl-like manner. For specialist use there is also an
3082 later in different processes, possibly even on different hosts. For a
3085 Here is an example of a simple call to pcre_exec():
3101 If the extra argument is not NULL, it must point to a pcre_extra data
3102 block. The pcre_study() function returns such a block (when it doesn't
3139 The match_limit field provides a means of preventing PCRE from using up
3140 a vast amount of resources when running patterns that are not going to
3141 match, but which have a very large number of possibilities in their
3142 search trees. The classic example is a pattern that uses nested unlim-
3145 Internally, pcre_exec() uses a function called match(), which it calls
3147 imposed on the number of times this function is called during a match,
3152 When pcre_exec() is called with a pattern that was successfully studied
3153 with a JIT option, the way that the matching is executed is entirely
3155 that goes on for a very long time, and so the match_limit value is also
3156 used in this case (but in a different way) to limit how long the match-
3161 cases. You can override the default by supplying pcre_exec() with a
3167 start of a pattern of the form
3171 where d is a decimal number. However, such a setting is ignored unless
3177 the depth of recursion. The recursion depth is a smaller number than
3190 a pcre_extra block in which match_limit_recursion is set, and
3195 start of a pattern of the form
3199 where d is a decimal number. However, such a setting is ignored unless
3208 then reloaded, because the tables that were used to compile a pattern
3209 are not saved with it. See the pcreprecompile documentation for a dis-
3215 the behaviour of pcre_exec() is undefined. Therefore, when a pattern is
3222 set to point to a suitable variable. If the pattern contains any back-
3224 with a name to pass back, a pointer to the name string (zero termi-
3226 names are within the compiled pattern; if you wish to retain such a
3227 name you must copy it before freeing the memory of a compiled pattern.
3251 matching position. If a pattern was compiled with PCRE_ANCHORED, or
3273 ters. It may also alter the way the match position is advanced after a
3277 set, and a match attempt for an unanchored pattern fails when the cur-
3278 rent position is at a CRLF sequence, and the pattern contains no
3283 The above rule is a compromise that makes the most common cases work as
3291 An explicit match for CR or LF is either a literal appearance of one of
3297 is a valid newline sequence and explicit \r or \n escapes appear in the
3303 the beginning of a line, so the circumflex metacharacter should not
3311 of a line, so the dollar metacharacter should not match it nor (except
3312 in multiline mode) a newline immediately before it. Setting this with-
3319 An empty string is not considered to be a valid match if this option is
3324 a?b?
3326 is applied to a string not beginning with "a" or "b", it matches an
3329 rences of "a" or "b".
3335 anchored, such a match can occur only if the pattern contains \K.
3338 PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern
3341 matching a null string by first trying the match again at the same off-
3346 check to see if the newline convention recognizes CRLF as a newline,
3352 There are a number of optimizations that pcre_exec() uses at the start
3353 of a match, in order to speed up the process. For example, if it is
3354 known that an unanchored match must start with a specific character, it
3357 This means that a special item such as (*COMMIT) at the start of a pat-
3358 tern is not considered until after a suitable starting point for the
3361 tern is never actually used. The start-up optimizations are in effect a
3374 Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching
3379 When this is compiled, PCRE records the fact that a match must start
3389 mizations may be used. For example, a minimum length for the subject
3394 The minimum length for a match is one character. If the subject is
3404 When PCRE_UTF8 is set at compile time, the validity of the subject as a
3408 points to the start of a UTF-8 character. There is a discussion about
3411 PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
3416 tains a value that does not point to the start of a UTF-8 character (or
3423 making repeated calls to find all the matches in a single subject
3425 points to the start of a character (or the end of the subject). When
3426 PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
3434 patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
3441 caller is prepared to handle a partial match, but only if no complete
3445 case, if a partial match is found, pcre_exec() immediately returns
3447 other words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
3451 partial match was found is set as the first matching string. There is a
3457 The subject string is passed to pcre_exec() as a pointer in subject, a
3458 length in length, and a starting offset in startoffset. The units for
3465 zero, the search for a match starts at the beginning of the subject,
3467 offset must point to the start of a character, or the end of the sub-
3473 in the same subject by calling pcre_exec() again after a previous suc-
3474 cess. Setting startoffset differs from just passing over a shortened
3475 string and setting PCRE_NOTBOL in the case of a pattern that begins
3481 only if the current position in the subject is not a word boundary.)
3486 to be a word boundary. However, if pcre_exec() is passed the entire
3489 discover that it is preceded by a letter.
3491 Finding all the matches in a subject is tricky when the pattern can
3498 if the newline convention recognizes CRLF as a newline, and if so, and
3502 If a non-zero starting offset is passed when the pattern is anchored,
3509 In general, a pattern matches a certain portion of the subject, and in
3513 subpattern" is used for a fragment of a pattern that picks out a sub-
3517 Captured substrings are returned to the caller via a vector of integers
3519 tor is passed in ovecsize, which must be a non-negative number. Note:
3523 strings, each substring using a pair of integers. The remaining third
3526 The number passed in ovecsize should always be a multiple of three. If
3529 When a match is successful, information about captured substrings is
3532 element of each pair is set to the offset of the first character in a
3534 after the end of a substring. These values are always data unit off-
3545 value from a successful match is 1, indicating that just the first pair
3548 If a capturing subpattern is matched repeatedly, it is the last portion
3553 function returns a value of zero. If neither the actual string matched
3565 (a)(?:(b)c|bd)
3567 If a vector of 6 elements (allowing for only 1 captured substring) is
3569 captured string, thereby recording a vector overflow, before failing to
3574 than the maximum, a non-zero value is returned.
3577 subpatterns there are in a compiled pattern. The smallest size for
3583 if the string "abc" is matched against the pattern (a|(z))(bc) the
3598 is, if a pattern contains n capturing parentheses, no more than ovec-
3607 If pcre_exec() fails, it returns a negative number. The following are
3625 PCRE stores a 4-byte "magic number" at the start of the compiled code,
3626 to catch the case when it is passed a junk pointer and to detect when a
3634 compiled pattern. This error could be caused by a bug in PCRE or by
3639 If a pattern contains back references, but the ovector that is passed
3641 PCRE gets a block of memory at the start of matching to use for this
3657 The backtracking limit, as specified by the match_limit field in a
3664 use by callout functions that want to yield a distinctive error code.
3669 A string that contains an invalid UTF-8 byte sequence was passed as a
3673 ment, and a reason code is placed in the second element. The reason
3675 if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
3681 The UTF-8 byte sequence that was passed as a subject was checked and
3683 value of startoffset did not point to the beginning of a UTF-8 charac-
3694 PCRE_PARTIAL option was used with a compiled pattern containing items
3701 by a bug in PCRE or by overwriting of the compiled pattern.
3710 field in a pcre_extra structure (or defaulted) was reached. See the
3725 string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
3734 This error is returned when pcre_exec() detects a recursion loop within
3735 the pattern. Specifically, it means that either the whole pattern or a
3744 This error is returned when a pattern that was successfully studied
3745 using a JIT compile option is being matched, but the memory available
3751 This error is given if a pattern that was compiled by the 8-bit library
3752 is passed to a 16-bit or 32-bit library function, or vice versa.
3756 This error is given if a pattern that was compiled and saved is
3757 reloaded on a host with different endianness. The utility function
3758 pcre_pattern_to_host_byte_order() can be used to convert such a pattern
3763 This error is returned when a pattern that was successfully studied
3764 using a JIT compile option is being matched, but the matching mode
3772 This error is given if pcre_exec() is called with a negative value for
3786 first output vector element (ovector[0]) and a reason code is placed in
3796 The string ends with a truncated UTF-8 character; the code specifies
3820 A 4-byte character has a value greater than 0x10ffff; these code points
3825 A 3-byte character has a value in the range 0xd800 to 0xdfff; this
3836 for a value that can be represented by fewer bytes, which is invalid.
3842 The two most significant bits of the first byte of a character have the
3844 ond is 0). Such a byte can only validly occur as the second or subse-
3845 quent byte of a multi-byte character.
3849 The first byte of a character has the value 0xfe or 0xff. These values
3850 can never occur in a valid UTF-8 string.
3854 This error code was formerly used when the presence of a so-called
3856 that such characters should not cause a string to be rejected, and so
3881 A substring that contains a binary zero is correctly extracted and has
3882 a further zero added on the end, but the result is not, of course, a C
3883 string. However, you can process such a string by referring to the
3891 matched, ovector is a pointer to the vector of integer offsets that was
3899 The functions pcre_copy_substring() and pcre_get_substring() extract a
3904 buffersize, while for pcre_get_substring() a new block of memory is
3919 strings and builds a list of pointers to them. All this is done in a
3922 the list of string pointers. The end of the list is marked by a NULL
3930 When any of these functions encounter a substring that is unset, which
3933 empty string. This can be distinguished from a genuine zero-length sub-
3938 string_list() can be used to free the memory returned by a previous
3941 pcre_free, which of course could be called directly from a C program.
3942 However, PCRE is used in some situations where it is linked via a spe-
3963 To extract a substring by name, you first have to find associated num-
3966 (a+)b(?<xxx>\d+)...
3985 First, instead of a substring number, a substring name is given. Sec-
3986 ond, there is an extra argument, given at the start, which is a pointer
4009 When a pattern is compiled with the PCRE_DUPNAMES option, names for
4026 If you want to get full details of all captured substrings for a given
4034 tion entitled Information about a pattern above. Given all the rele-
4041 The traditional matching function uses a similar algorithm to Perl,
4042 which stops when it finds the first match, starting at a given point in
4050 What you have to do is to insert a callout right at the end of the pat-
4059 Matching certain patterns using pcre_exec() can use a lot of process
4070 combination of arguments, it returns instead a negative number whose
4089 The function pcre_dfa_exec() is called to match a subject string
4090 against a compiled pattern, using a matching algorithm that scans the
4095 a discussion of the two matching algorithms, and a list of features
4100 pcre_exec(), plus two extras. The ovector argument is used in a differ-
4108 workspace will be needed for patterns and subjects where there are a
4111 Here is an example of a simple call to pcre_dfa_exec():
4152 set as the first matching string in both cases. There is a more
4165 When pcre_dfa_exec() returns a partial match, it is possible to call it
4170 after a partial match. There is more discussion of this facility in the
4193 On success, the yield of the function is a number greater than zero,
4209 character repeats at the end of a pattern (as well as internally). For
4210 example, the pattern "a\d+" is compiled as if it were "a\d++" because
4214 cases, either use an ungreedy repeat ("a\d+?") or set the
4219 The pcre_dfa_exec() function returns a negative number when it fails.
4227 tern that it does not support, for instance, the use of \C or a back
4232 This return is given if pcre_dfa_exec() encounters a condition item
4233 that uses a back reference for the condition, or a test for recursion
4234 in a specific group. These are not supported.
4239 that contains a setting of the match_limit or match_limit_recursion
4250 When a recursive subpattern is processed, the matching function calls
4253 should be extremely rare, as a vector of size 1000 is used.
4304 PCRE provides a feature called "callout", which is a means of temporar-
4311 Within a regular expression, (?C) indicates the points at which the
4313 identified by putting a number less than 256 after the letter C. The
4319 If the PCRE_AUTO_CALLOUT option bit is set when a pattern is compiled,
4330 Notice that there is a callout before and after each parenthesis and
4331 alternation bar. If the pattern contains a conditional group whose con-
4333 before the condition. Such a callout may also be inserted explicitly,
4336 (?(?C9)(?=a)ab|de)
4342 matching. The pcretest program has a pattern qualifier (/C) that sets
4345 to optimize the performance of a particular pattern.
4355 that what follows cannot be part of the repeat. For example, a+[bc] is
4356 compiled as if it were a++[bc]. The pcretest output when this pattern
4362 +1 ^ a+
4367 into a+ and therefore the callouts that would be taken for the back-
4375 +1 ^ a+
4382 This time, when matching [bc] fails, the matcher backtracks into a+ and
4383 tries again, repeatedly, until a+ itself fails.
4395 If the pattern is studied, PCRE knows the minimum length of a matching
4396 string, and will immediately give a "no match" return without actually
4397 running a match if the subject is not long enough, or, for unanchored
4408 During matching, when PCRE reaches a callout point, the external func-
4411 to the callout function is a pointer to a pcre_callout or
4442 The offset_vector field is a pointer to the vector of offsets that was
4446 for extracting substrings after a match has completed. For the DFA
4469 tured substring. However, when a recursion exits, the value reverts to
4475 The callout_data field contains a value that is passed to a matching
4477 passed in the callout_data field of a pcre_extra or pcre[16|32]_extra
4479 in a callout block is NULL. There is a description of the pcre_extra
4489 bar, a closing parenthesis, or the end of the pattern, the length is
4498 callouts from pcre_exec() or pcre[16|32]_exec() it contains a pointer
4501 been passed. Instances of (*PRUNE) or (*THEN) without a name do not
4502 obliterate a previous (*MARK). In callouts from the DFA matching func-
4511 matching possibilities goes ahead, just as if a lookahead assertion had
4516 PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
4549 1. PCRE has only a subset of Perl's Unicode support. Details of what it
4553 they do not mean what you might think. For example, (?!a){3} does not
4554 assert that the next three characters are not "a". It just asserts that
4555 the next character is not "a" three times (in principle: PCRE optimizes
4565 they are not allowed in a pattern string because it is passed as a nor-
4567 the pattern to represent a binary zero.
4570 \U, and \N when followed by a character name or Unicode value. (\N on
4571 its own, matching a non-newline character, is supported.) In fact these
4612 Python, but unlike Perl. Captured values that are set outside a sub-
4614 There is a discussion that explains these differences in more detail in
4617 10. If any of the backtracking control verbs are used in a subpattern
4618 that is called as a subroutine (whether or not recursively), their
4621 if (*THEN) is present in a group that is called as a subroutine, its
4626 11. If a pattern contains more than one backtracking control verb, the
4628 A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
4636 captured strings when part of a pattern is repeated. For example,
4637 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
4641 pattern names is not as general as Perl's. This is a consequence of the
4643 ble to translate between numbers and names. In particular, a pattern
4644 such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
4652 example, between the ( and ? at the start of a subpattern. If the /x
4658 such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
4674 (a) Although lookbehind assertions in PCRE must match fixed length
4675 strings, each alternative branch of a lookbehind assertion can match a
4682 (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
4684 ignored. (Perl can be made to issue a warning.)
4688 lowed by a question mark they are.
4690 (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
4704 (j) Patterns compiled by PCRE can be saved and re-used at a later time,
4709 pcre16_dfa_exec() and pcre32_dfa_exec()) match in a different way and
4713 of a pattern that set overall options that cannot be changed within the
4741 by PCRE are described in detail below. There is a quick-reference syn-
4749 regular expressions in general are covered in a number of books, some
4759 match using a different algorithm that is not Perl-compatible. Some of
4769 set by special items at the start of a pattern. These are not Perl-com-
4781 strings, and a third library that supports 32-bit and UTF-32 character
4792 (*UTF) is a generic sequence that can be used with any of the
4793 libraries. Starting a pattern with such a sequence is equivalent to
4794 setting the relevant option. How setting a UTF mode affects pattern
4795 matching is mentioned in several places below. There is also a summary
4805 Another special sequence that may appear at the start of a pattern is
4809 less than 128 via a lookup table.
4813 If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
4816 the repeated item. For example, by default a+b is treated as a++b. For
4821 If a pattern starts with (*NO_START_OPT), it has the same effect as
4829 strings: a single CR (carriage return) character, a single LF (line-
4835 It is also possible to specify a newline convention by starting a pat-
4845 tion. For example, on a Unix system where LF is the default newline
4848 (*CR)a.b
4850 changes the convention to CR. That pattern matches "a\nb" because LF is
4851 no longer a newline. If more than one of these settings is present, the
4860 line sequences" below. A change of \R setting can be combined with a
4865 The caller of pcre_exec() can set a limit on the number of times the
4868 are provoked by patterns with huge matching trees (a typical example is
4869 a pattern with nested unlimited repeats) and to avoid running out of
4888 character code rather than ASCII or Unicode (typically a mainframe sys-
4896 A regular expression is a pattern that is matched against a subject
4897 string from left to right. Most characters stand for themselves in a
4898 pattern, and match the corresponding characters in the subject. As a
4903 matches a portion of a subject string that is identical to itself. When
4905 matched independently of case. In a UTF mode, PCRE always understands
4939 Part of a pattern that is in square brackets is called a "character
4940 class". In a character class the only metacharacters are:
4955 a character that is not a number or a letter, it takes away any special
4959 For example, if you want to match a * character, you write \* in the
4961 character would otherwise be interpreted as a metacharacter, so it is
4962 always safe to precede a non-alphanumeric with backslash to specify
4963 that it stands for itself. In particular, if you want to match a back-
4966 In a UTF mode, only ASCII numbers and letters have any special meaning
4967 after a backslash. All other characters (in particular, those whose
4970 If a pattern is compiled with the PCRE_EXTENDED option, most white
4971 space in the pattern (other than in a character class), and characters
4972 between a # outside a character class and the next newline, inclusive,
4973 are ignored. An escaping backslash can be used to include a white space
4976 If you want to remove the special meaning from a sequence of charac-
4993 end). If the isolated \Q is inside a character class, this causes an
4998 A second use of backslash provides a way of encoding non-printing char-
4999 acters in patterns in a visible manner. There is no restriction on the
5001 terminates a pattern, but when a pattern is being prepared by text
5006 \a alarm, that is, the BEL character (hex 07)
5020 The precise effect of \cx on ASCII characters is as follows: if x is a
5025 has a value greater than 127, a compile-time error occurs. This locks
5028 When PCRE is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gener-
5031 are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?.
5032 Any other character provokes a compile-time error. The sequence \@
5043 but because 127 is not a control character in EBCDIC, Perl makes it
5052 sequence \0\x\015 specifies two binary zeros followed by a CR character
5056 The escape \o must be followed by a sequence of octal digits, enclosed
5057 in braces. An error occurs if this is not the case. This escape is a
5063 a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
5067 The handling of a backslash followed by a digit other than 0 is compli-
5069 change. Outside a character class, PCRE reads the digit and any follow-
5070 ing digits as a decimal number. If the number is less than 8, or if
5072 in the expression, the entire sequence is taken as a back reference. A
5076 Inside a character class, or if the decimal number following \ is
5080 them to generate a data character. Any subsequent digits stand for
5086 \7 is always a back reference
5087 \11 might be a back reference, or another way of
5088 writing a tab
5089 \011 is always a tab
5090 \0113 is a tab followed by the character "3"
5091 \113 might be a back reference, otherwise the
5093 \377 might be a back reference, otherwise
5095 \81 is either a back reference, or the two
5099 syntax must not be introduced by a leading zero, because no more than
5104 number of hexadecimal digits may appear between \x{ and }. If a charac-
5105 ter other than a hexadecimal digit appears between \x{ and }, or if
5110 its. Otherwise, it matches a literal "x" character. In JavaScript
5112 must be followed by four hexadecimal digits; otherwise it matches a
5126 8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
5128 16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
5130 32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
5137 All the sequences that define a single character value can be used both
5138 inside and outside character classes. In addition, inside a character
5141 \N is not allowed in a character class. \B, \R, and \X are not special
5142 inside a character class. Like other unrecognized escape sequences,
5144 default, but cause an error if the PCRE_EXTRA option is set. Outside a
5152 PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and
5153 \u can be used to define a character by code point, as described in the
5158 The sequence \g followed by an unsigned or a negative number, option-
5165 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
5166 name or a number enclosed either in angle brackets or single quotes, is
5167 an alternative syntax for referencing a subpattern as a "subroutine".
5169 \g<...> (Oniguruma syntax) are not synonymous. The former is a back
5170 reference; the latter is a subroutine call.
5177 \D any character that is not a decimal digit
5179 \H any character that is not a horizontal white space character
5181 \S any character that is not a white space character
5183 \V any character that is not a vertical white space character
5187 There is also the single sequence \N, which matches a non-newline char-
5210 A "word" character is an underscore or any character that is a letter
5214 page). For example, in a French locale such as "fr_FR" in Unix-like
5279 Outside a character class, by default, the escape sequence \R matches
5290 sequence is treated as a single unit that cannot be split.
5303 specify these settings by starting a pattern string with one of the
5310 tion, but they can themselves be overridden by options given to a
5312 Perl-compatible, are recognized only at the very start of a pattern,
5314 present, the last one is used. They can be combined with a change of
5315 newline convention; for example, a pattern can start with:
5320 or (*UCP) special sequences. Inside a character class, \R is treated as
5332 \p{xx} a character with the xx property
5333 \P{xx} a character without the xx property
5334 \X a Unicode extended grapheme cluster
5341 does not match any characters, so always causes a match failure.
5344 A character from one of these sets can be matched using a script name.
5375 ified by a two-letter abbreviation. For compatibility with Perl, nega-
5376 tion can be specified by including a circumflex between the opening
5434 The special property L& is also supported: it matches a character that
5435 has the Lu, Ll, or Lt property, in other words, a letter that is not
5436 classified as a modifier or "other".
5458 to do a multistage table lookup in order to find a character's prop-
5473 That is, it matched a character without the "mark" property, followed
5479 cated kinds of composite character by giving each character a grapheme
5485 add additional characters according to the following rules for ending a
5493 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
5496 be followed by a V or T character; an LVT or T character may be followed
5497 only by a T character.
5529 ter that can be represented by a Universal Character Name in C++ and
5534 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
5546 is similar to a lookbehind assertion (described below). However, in
5558 assertions, but is ignored in negative assertions. Note that when a
5565 tion specifies a condition that has to be met at a particular point in
5566 a match, without consuming any characters from the subject string. The
5570 \b matches at a word boundary
5571 \B matches when not at a word boundary
5574 also matches before a newline at the end of the subject
5578 Inside a character class, \b has a different meaning; it matches the
5579 backspace character. If any other of these assertions appears in a
5585 A word boundary is a position in the subject string where the current
5588 string if the first or last character matches \w, respectively. In a
5591 PCRE nor Perl has a separate "start of word" or "end of word" metase-
5593 For example, the fragment \ba matches "a" at the start of a word.
5602 cating that matching is to start at a point other than the beginning of
5604 that \Z matches before a newline at the end of the string as well as at
5618 at a time, it cannot reproduce this behaviour.
5620 If all the alternatives of a pattern begin with \G, the expression is
5628 That is, they test for a particular condition being true without con-
5631 Outside a character class, in the default matching mode, the circumflex
5635 PCRE_MULTILINE option is unset. Inside a character class, circumflex
5638 Circumflex need not be the first character of the pattern if a number
5641 branch. If all possible alternatives start with a circumflex, that is,
5644 constructs that can cause a pattern to be anchored.)
5648 before a newline at the end of the string (by default). Note, however,
5650 last character of the pattern if a number of alternatives are involved,
5652 lar has no special meaning in a character class.
5659 PCRE_MULTILINE option is set. When this is the case, a circumflex
5661 the subject string. It does not match after a newline that ends the
5668 (where \n represents a newline) in multiline mode, but not otherwise.
5670 all branches start with ^ are not anchored in multiline mode, and a
5676 and end of the subject in both modes, and if all branches of a pattern
5683 Outside a character class, a dot in the pattern matches any one charac-
5684 ter in the subject string except (by default) a character that signi-
5685 fies the end of a line.
5687 When a line ending is defined as a single character, dot never matches
5695 PCRE_DOTALL option is set, a dot matches any one character, without
5701 newlines. Dot has no special meaning in a character class.
5703 The escape sequence \N behaves like a dot, except that it is not
5705 character except one that signifies the end of a line. Perl also uses
5711 Outside a character class, the escape sequence \C matches any one data
5712 unit, whether or not a UTF mode is set. In the 8-bit library, one data
5713 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
5714 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
5718 units, matching one unit with \C in a UTF mode means that the rest of
5719 the string may start with a malformed UTF character. This has undefined
5726 below) in a UTF mode, because this would make it impossible to calcu-
5731 a lookahead to check the length of the next character, as in this pat-
5732 tern, which could be used with a UTF-8 string (ignore white space and
5750 An opening square bracket introduces a character class, terminated by a
5753 a lone closing square bracket causes a compile-time error. If a closing
5754 square bracket is required as a member of the class, it should be the
5756 present) or escaped with a backslash.
5758 A character class matches a single character in the subject. In a UTF
5761 the first character in the class definition is a circumflex, in which
5763 If a circumflex is actually required as a member of the class, ensure
5764 it is not the first character, or escape it with a backslash.
5767 while [^aeiou] matches any character that is not a lower case vowel.
5768 Note that a circumflex is just a convenient notation for specifying the
5770 class that starts with a circumflex is not an assertion; it still con-
5771 sumes a character from the subject string, and therefore it fails if
5775 (0xffff) can be included in a class as a literal string of data units,
5778 When caseless matching is set, any letters in a class represent both
5779 their upper case and lower case versions, so for example, a caseless
5780 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
5781 match "A", whereas a caseful version would. In a UTF mode, PCRE always
5786 caseless matching in a UTF mode for characters 128 and above, you must
5793 PCRE_MULTILINE options is used. A class such as [^a] always matches one
5796 The minus (hyphen) character can be used to specify a range of charac-
5797 ters in a character class. For example, [d-m] matches any letter
5798 between d and m, inclusive. If a minus character is required in a
5799 class, it must be escaped with a backslash or appear in a position
5800 where it cannot be interpreted as indicating a range, typically as the
5801 first or last character in the class, or immediately after a range. For
5802 example, [b-d-z] matches letters in the range b to d, a hyphen charac-
5806 ter of a range. A pattern such as [W-]46] is interpreted as a class of
5807 two characters ("W" and "-") followed by a literal string "46]", so it
5808 would match "W46]" or "-46]". However, if the "]" is escaped with a
5810 preted as a class containing a range followed by two other characters.
5812 a range.
5814 An error is generated if a POSIX character class (see below) or an
5815 escape sequence other than one that defines a single character appears
5816 at a point where a range ending character is expected. For example,
5824 If a range that includes letters is used when caseless matching is set,
5826 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
5827 character tables for a French locale are in use, [\xc8-\xcb] matches
5833 \w, and \W may appear in a character class, and add the characters that
5837 appear outside a character class, as described in the section entitled
5838 "Generic character types" above. The escape sequence \b has a different
5839 meaning inside a character class; it matches the backspace character.
5840 The sequences \B, \N, \R, and \X are not special inside a character
5846 types to specify a more restricted set of characters than the matching
5849 character class should be read as "something OR something OR ..." and a
5853 backslash, hyphen (only where it can be interpreted as specifying a
5855 when it can be interpreted as introducing a POSIX class name, or for a
5895 The name "word" is a Perl extension, and "blank" is a GNU extension
5897 by a ^ character after the colon. For example,
5902 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
5954 [a[:<:]b] provokes error for an unrecognized POSIX class name. This
5957 that \b matches at the start and the end of a word (see "Simple asser-
5958 tions" above), and in a Perl-style pattern the preceding or following
5975 are within a subpattern (defined below), "succeeds" means matching the
5983 within the pattern by a sequence of Perl option letters enclosed
5992 ble to unset these options by preceding the letter with a hyphen, and a
5995 is also permitted. If a letter appears both before and after the
6005 a pattern, PCRE extracts it into the global options (and it will there-
6008 An option change within a subpattern (see below for a description of
6012 (a(?i)b)c
6020 (a(?i)b|c)
6036 is a generic version that can be used with any of the libraries. How-
6044 nested. Turning part of a pattern into a subpattern does two things:
6046 1. It localizes a set of alternatives. For example, the pattern
6053 2. It sets up the subpattern as a capturing subpattern. This means
6070 helpful. There are often times when a grouping subpattern is required
6071 without a capturing requirement. If an opening parenthesis is followed
6072 by a question mark and a colon, the subpattern does not do any captur-
6082 As a convenient shorthand, if any option settings are required at the
6083 start of a non-capturing subpattern, the option letters may appear
6098 Perl 5.10 introduced a feature whereby each alternative in a subpattern
6099 uses the same numbers for its capturing parentheses. Such a subpattern
6100 starts with (?| and is itself a non-capturing subpattern. For example,
6105 Because the two alternatives are inside a (?| group, both sets of cap-
6109 not all, of one of a number of alternatives. Inside a (?| group, paren-
6117 / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
6120 A back reference to a numbered subpattern uses the most recent value
6126 In contrast, a subroutine call to a numbered subpattern always refers
6132 If a condition test for a subpattern's having matched refers to a non-
6152 In PCRE, a subpattern can be named in one of three ways: (?<name>...)
6159 must start with a non-digit. Named capturing parentheses are still
6162 to-number translation table from a compiled pattern. There is also a
6163 convenience function for extracting a captured substring by name.
6165 By default, a name must be unique within a pattern, but it is possible
6170 named parentheses can match. Suppose you want to match the name of a
6171 weekday, either as a 3-letter abbreviation or as the full name, and in
6181 There are five capturing substrings, but only one is ever set after a
6182 match. (An alternative way of solving this problem is to use a "branch
6190 If you make a back reference to a non-unique named subpattern from
6199 If you make a subroutine call to a non-unique named subpattern, the one
6204 If you use a named reference in a condition test (see the section about
6205 conditions below), either to check whether a subpattern has matched, or
6225 a literal data character
6230 an escape such as \d or \pL that matches a single character
6231 a character class
6232 a back reference (see next section)
6233 a parenthesized subpattern (including assertions)
6234 a subroutine call to a subpattern (recursive or otherwise)
6236 The general repetition quantifier specifies a minimum and maximum num-
6238 (braces), separated by a comma. The numbers must be less than 65536,
6243 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
6255 matches exactly 8 digits. An opening curly bracket that appears in a
6256 position where a quantifier is not allowed, or one that does not match
6257 the syntax of a quantifier, is taken as a literal character. For exam-
6258 ple, {,6} is not a quantifier, but a literal string of four characters.
6262 of which is represented by a two-byte sequence in a UTF-8 string. Simi-
6272 have a {0} quantifier are omitted from the compiled pattern.
6281 It is possible to construct infinite loops by following a subpattern
6282 that can match no characters with a quantifier that has no upper limit,
6285 (a?)*
6310 However, if a quantifier is followed by a question mark, it ceases to
6318 matches. Do not confuse this use of question mark with its use as a
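The effect of a question mark after a quantifier can be sketched as below. Python's re module is used in place of PCRE (an assumption), and the sample subject is illustrative. Greedy and lazy repeats behave the same way in both engines for a pattern this simple.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs on to the last */ in the subject, giving one long match.
greedy = re.findall(r"/\*.*\*/", subject)

# Lazy: .*? takes as few characters as possible, so each match stops
# at the first */ that allows the pattern to succeed.
lazy = re.findall(r"/\*.*?\*/", subject)

print(len(greedy), len(lazy))
```

Only the number of characters consumed changes. The quantifier's minimum and maximum stay the same.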
6329 can be made greedy by following them with a question mark. In other
6332 When a parenthesized subpattern is quantified with a minimum repeat
6333 count that is greater than 1 or with a limited maximum, more memory is
6337 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
6342 first. PCRE normally treats such a pattern as though it were preceded
6350 When .* is inside capturing parentheses that are the subject of a back
6351 reference elsewhere in the pattern, a match at the start may fail where
6352 a later one succeeds. Consider, for example:
6357 ter. For this reason, such a pattern is not implicitly anchored.
6360 ing .* is inside an atomic group. Once again, a match at the start may
6361 fail where a later one succeeds. Consider this pattern:
6363 (?>.*?a)b
6368 When a capturing subpattern is repeated, the value captured is the sub-
6378 /(a|(b))+/
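A short sketch of what a repeated capturing subpattern leaves behind, again using Python's re module as a stand-in for PCRE (an assumption). The first subject string is illustrative. The second case is the nested pattern listed above, matched against "aba".

```python
import re

# The captured value is the substring matched by the final iteration.
m1 = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# Nested group: (b) takes no part in the final iteration, yet it keeps
# the value it was given in an earlier one.
m2 = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)

print(last_iteration, outer, inner)
```

The nested case shows why a set inner group does not prove that it took part in the last repetition.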
6387 to be re-evaluated to see if a different number of repeats allows the
6401 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
6402 the means for specifying that once a subpattern has matched, it is not
6407 is a kind of special parenthesis, starting with (?> as in this example:
6412 tains once it has matched, and a failure further into the pattern is
6416 An alternative description is that a subpattern of this type matches
6421 such as the above example can be thought of as a maximizing repeat that
6429 atomic group is just a single repeated item, as in the example above, a
6430 simpler notation, called a "possessive quantifier" can be used. This
6431 consists of an additional + character following a quantifier. Using
6436 Note that a possessive quantifier can be used with an entire group, for
6442 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
6444 meaning of a possessive quantifier and the equivalent atomic group,
6445 though there may be a performance difference; possessive quantifiers
6456 A++B because there is no point in backtracking into a sequence of A's
6459 When a pattern contains an unlimited repeat inside a subpattern that
6461 atomic group is the only way to avoid some failing matches taking a
6472 it takes a long time before reporting failure. This is because the
6474 * repeat in a large number of ways, and all have to be tried. (The
6475 example uses [!?] rather than a single character at the end, because
6477 when a single character is used. They remember the last single charac-
6478 ter that is required for a match, and fail early if it is not present
6489 Outside a character class, a backslash followed by a digit greater than
6490 0 (and possibly further digits) is a back reference to a capturing sub-
6495 it is always taken as a back reference, and causes an error only if
6499 reference" of this type can make sense when a repetition is involved
6503 It is not possible to have a numerical "forward back reference" to a
6504 subpattern whose number is 10 or more using this syntax because a
6505 sequence such as \50 is interpreted as a character defined in octal.
6507 details of the handling of digits following a backslash. There is no
6512 following a backslash is to use the \g escape sequence. This escape
6513 must be followed by an unsigned number or a negative number, optionally
6522 digits follow the reference. A negative number is a relative reference.
6527 The sequence \g{-1} is a reference to the most recently started captur-
6536 the subpattern itself (see "Subpatterns as subroutines" below for a way
6566 There may be more than one back reference to the same subpattern. If a
6567 subpattern has not actually been used in a particular match, any back
6570 (a|(bc))\2
6572 always fails if it starts to match "a" rather than "bc". However, if
6573 the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
6576 Because there may be many capturing parentheses in a pattern, all dig-
6577 its following a backslash are taken as part of a potential back refer-
6578 ence number. If the pattern continues with a digit character, some
6586 fails when the subpattern is first used, so, for example, (a\1) never
6590 (a|b\1)+
6592 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
6597 the example above, or by a quantifier with a minimum of zero.
6600 treated as an atomic group. Once the whole group has been matched, a
6607 An assertion is a test on the characters following or preceding the
6650 matches a word followed by a semicolon, but does not include the semi-
6665 If you want to force a matching failure at some point in a pattern, the
6669 is a synonym for (?!).
6679 contents of a lookbehind assertion are restricted such that all the
6680 strings it matches must have a fixed length. However, if there are sev-
6691 strings are permitted only at the top level of a lookbehind assertion.
6704 of a lookbehind assertion to get round the fixed-length restriction.
6711 In a UTF mode, PCRE does not allow the \C escape (which matches a sin-
6712 gle data unit even in a UTF mode) to appear in lookbehind assertions,
6718 lookbehinds, as long as the subpattern matches a fixed-length string.
6723 end of subject strings. Consider a simple pattern such as
6727 when applied to a long string that does not match. Because matching
6728 proceeds from left to right, PCRE will look for each "a" in the subject
6735 (because there is no following "a"), it backtracks to match all but the
6737 again the search for "a" covers the entire string, from right to left,
6743 entire string. The subsequent lookbehind assertion does a single test
6745 For long strings, this approach makes a significant difference to the
6756 the subject string. First there is a check that the previous three
6757 characters are all digits, and then there is a check that the same
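Two lookbehind assertions applied at the same point can be sketched as follows. Python's re module stands in for PCRE (an assumption), and the subjects are illustrative. The pattern checks that the previous three characters are digits, then checks independently that the same three characters are not "999".

```python
import re

# Both assertions are tested at the position just before "foo".
pattern = re.compile(r"(?<=\d{3})(?<!999)foo")

hit = pattern.search("123foo")        # three digits that are not 999
miss_999 = pattern.search("999foo")   # second assertion rejects this
miss_short = pattern.search("12foo")  # fewer than three digits precede

print(hit.group(), hit.start(), miss_999, miss_short)
```

Neither assertion consumes characters, so the match itself is just "foo".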
6784 It is possible to cause the matching process to obey a subpattern con-
6786 on the result of an assertion, or whether a specific capturing subpat-
6795 tives in the subpattern, a compile-time error occurs. Each of the two
6805 ences to recursion, a pseudo-condition called DEFINE, and assertions.
6807 Checking for a used subpattern by number
6809 If the text between the parentheses consists of a sequence of digits,
6810 the condition is true if a capturing subpattern of that number has pre-
6814 native notation is to precede the digits with a plus or minus sign. In
6820 is not used; it provokes a compile-time error.)
6831 third part is a conditional subpattern that tests whether or not the
6834 yes-pattern is executed and a closing parenthesis is required. Other-
6836 In other words, this pattern matches a sequence of non-parentheses,
6839 If you were embedding this pattern in a larger one, you could use a
6847 Checking for a used subpattern by name
6849 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
6854 Rewriting the above example to use a named subpattern gives this:
6858 If the name used in a condition of this kind is a duplicate, the test
6865 name R, the condition is true if a recursive call to the whole pattern
6866 or any subpattern has been made. If digits or a name preceded by amper-
6871 the condition is true if the most recent recursion is into a subpattern
6873 recursion stack. If the name used in a condition of this kind is a
6888 example, a pattern to match an IPv4 address such as "192.168.23.245"
6894 The first part of the pattern is a DEFINE group inside which another
6896 an IPv4 address (a number less than 256). When matching takes place,
6897 this part of the pattern is skipped because DEFINE acts like a false
6900 ing on a word boundary at each end.
6905 assertion. This may be a positive or negative lookahead or lookbehind
6909 (?(?=[^a-z]*[a-z])
6910 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
6912 The condition is a positive lookahead assertion that matches an
6913 optional sequence of non-letters followed by a letter. In other words,
6914 it tests for the presence of at least one letter in the subject. If a
6924 by PCRE. In both cases, the start of the comment must not be in a char-
6926 ters such as (?: or a subpattern name or number. The characters that
6927 make up a comment play no part in the pattern matching.
6929 The sequence (?# marks the start of a comment that continues up to the
6931 PCRE_EXTENDED option is set, an unescaped # character also introduces a
6935 a compiling function or by a special sequence at the start of the pat-
6937 Note that the end of this type of comment is a literal newline sequence
6938 in the pattern; escape sequences that happen to represent a newline do
6945 for a newline in the pattern. The sequence \n is still literal at this
6952 Consider the problem of matching a string in parentheses, allowing for
6954 that can be done is to use a pattern that matches up to some fixed
6958 For some time, Perl has provided a facility that allows regular expres-
6975 A special item that consists of (? followed by a number greater than
6976 zero and a closing parenthesis is a recursive subroutine call of the
6978 subpattern. (If not, it is a non-recursive subroutine call, which is
6979 described in the next section.) The special item (?R) or (?0) is a
6988 substrings which can either be a sequence of non-parentheses, or a
6989 recursive match of the pattern itself (that is, a correctly parenthe-
6990 sized substring). Finally there is a closing parenthesis. Note the use
6991 of a possessive quantifier to avoid backtracking into sequences of non-
6994 If this were part of a larger pattern, you would not want to recurse
7002 In a larger pattern, keeping track of parenthesis numbers can be
7006 words, a negative number counts capturing parentheses leftwards from
7025 nested unlimited repeats, and so the use of a possessive quantifier for
7032 it yields "no match" quickly. However, if a possessive quantifier is
7033 not used, the match runs for a very long time indeed because there are
7037 At the end of a match, the values of capturing parentheses are those
7038 from the outermost level. If you want to obtain intermediate values, a
7045 which is the last value taken on at the top level. If a capturing sub-
7047 unset, even if it was (temporarily) set at a deeper level during the
7050 If there are more than 15 capturing parentheses in a pattern, PCRE has
7051 to obtain extra memory to store data during a recursion, which it does
7063 In this pattern, (?(R) is the start of a conditional subpattern, with
7070 In PCRE (like Python, but unlike Perl), a recursive subpattern call is
7073 alternatives and there is a subsequent matching failure. This can be
7074 illustrated by the following pattern, which purports to match a palin-
7076 "a", "aba", "abcba", "abcdcba"):
7080 The idea is that it either matches a single character, or two identical
7081 characters surrounding a sub-palindrome. In Perl, this pattern works;
7092 subpattern 2 matched, which was "a". This fails. Because the recursion
7105 remaining alternative is at a deeper recursion level, which PCRE cannot
7115 When a deeper recursion has matched a single character, it cannot be
7128 as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
7130 ing into sequences of non-word characters. Without this, PCRE takes a
7132 Perl takes so long that you think it has gone into a loop.
7135 ject string does not start with a palindrome that is shorter than the
7143 cessing is in the handling of captured values. In Perl, when a subpat-
7144 tern is called recursively or as a subroutine (see the next section),
7149 ^(.)(\1|a(?2))
7153 to match "b", the second alternative matches "a" and then recurses. In
7161 If the syntax for a recursive subpattern call (either by number or by
7163 like a subroutine in a programming language. The called subpattern may
7185 atomic groups. That is, once a subroutine has matched some of the sub-
7187 natives and there is a subsequent matching failure. Any capturing
7191 Processing options such as case-independence are fixed when a subpat-
7192 tern is defined, so if it is used as a subroutine, such options cannot
7203 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
7204 name or a number enclosed either in angle brackets or single quotes, is
7205 an alternative syntax for referencing a subpattern as a subroutine,
7212 PCRE supports an extension to Oniguruma: if a number is preceded by a
7213 plus or a minus sign it is taken as a relative reference. For example:
7218 synonymous. The former is a back reference; the latter is a subroutine
7224 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
7225 Perl code to be obeyed in the middle of matching a regular expression.
7227 strings that match the same pair of parentheses when there is a repeti-
7230 PCRE provides a similar feature, but of course it cannot obey arbitrary
7237 Within a regular expression, (?C) indicates the points at which the
7239 callout points, you can put a number less than 256 after the letter C.
7245 If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, call-
7247 are all numbered 255. If there is a conditional group in the pattern
7252 (?(?C9)(?=a)abc|def)
7257 During matching, when PCRE reaches a callout point, the external func-
7263 By default, PCRE implements a number of optimizations at compile time
7266 options that disable the relevant optimizations. More details, and a
7273 Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
7275 and subject to change or removal in a future version of Perl". It goes
7283 differently depending on whether or not a name is present. A name is
7284 any sequence of characters that does not include a closing parenthesis.
7288 the colon were not there. Any number of these verbs may occur in a
7293 the traditional matching functions, because these use a backtracking
7294 algorithm. With the exception of (*FAIL), which behaves like a failing
7296 encountered by a DFA matching function.
7306 may know the minimum length of matching subject, or that a particular
7308 running of a match, any included backtracking verbs will not, of
7321 be followed by a name.
7326 of the pattern. However, when it is inside a subpattern that is called
7327 as a subroutine, only that subpattern is ended successfully. Matching
7328 then continues at the outer level. If (*ACCEPT) is triggered in a posi-
7329 tive assertion, the assertion succeeds; in a negative assertion, the
7342 This verb causes a matching failure, forcing backtracking to occur. It
7349 a+(?C)(*FAIL)
7356 There is one verb whose main purpose is to track how a match was
7357 arrived at, though it also has a secondary use in conjunction with
7363 instances of (*MARK) as you like in a pattern, and their names do not
7366 When a match succeeds, the name of the last-encountered (*MARK:NAME),
7382 ple it indicates which of the two alternatives matched. This is a more
7386 If a verb with a name is encountered in a positive assertion that is
7391 After a partial match or a failed match, the last encountered name in
7411 a backtrack to the verb, a failure is forced. That is, backtracking
7422 when the verb is not in a subroutine or an assertion. Subsequent sec-
7427 This verb, which may not be followed by a name, causes the whole match
7428 to fail outright if there is a later matching failure that causes back-
7430 attempts to find a match by advancing the starting point take place. If
7432 has been passed pcre_exec() is committed to finding a match at the cur-
7435 a+(*COMMIT)b
7437 This matches "xxaab" but not "aacaab". It can be thought of as a kind
7440 forces a match failure.
7442 If there is more than one backtracking verb in a pattern, a different
7444 (*COMMIT) during a match does not always guarantee that a match must be
7447 Note that (*COMMIT) at the start of a pattern is not the same as an
7457 For this pattern, PCRE knows that any match must start with "a", so the
7458 optimization skips along the subject to "a" before applying the pattern
7470 the subject if there is a later matching failure that causes backtrack-
7488 This verb, when given without a name, is like (*PRUNE), except that if
7492 it cannot be part of a successful match. Consider:
7494 a+(*SKIP)b
7498 skips on to start the next attempt at "c". Note that a possessive quan-
7511 a matching name is found, the (*SKIP) is ignored.
7518 This verb causes a skip to the next innermost alternative when back-
7521 that it can be used for a pattern-based if-then-else block:
7529 quently BAZ fails, there are no more alternatives, so there is a back-
7538 A subpattern that does not contain a | character is just a part of the
7539 enclosing alternative; it is not a nested alternation with only one
7540 alternative. The effect of (*THEN) extends beyond such a subpattern to
7547 If A and B are matched, but there is a failure in C, matching does not
7554 The effect of (*THEN) is now confined to the inner subpattern. After a
7559 Note that a conditional subpattern is not considered as having two
7561 character in a conditional subpattern has a different meaning. Ignoring
7564 ^.*? (?(?=a) a | b(*THEN)c )
7567 ungreedy, it initially matches zero characters. The condition (?=a)
7572 the match fails. (If there was a backtrack into .*?, allowing it to
7585 If more than one backtracking verb is present in a pattern, the one
7600 If there is a matching failure to the right, backtracking onto (*PRUNE)
7602 a backtrack onto (*COMMIT).
7609 /(a(*COMMIT)b)+ac/
7619 (*ACCEPT) in a positive assertion causes the assertion to succeed with-
7620 out any further processing. In a negative assertion, (*ACCEPT) causes
7624 in a positive assertion. In particular, (*THEN) skips to the next
7629 changing a positive assertion into a negative assertion changes its
7630 result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg-
7642 (*FAIL) in a subpattern called as a subroutine has its normal effect:
7645 (*ACCEPT) in a subpattern called as a subroutine causes the subroutine
7649 (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine
7688 document contains a quick-reference summary of the syntax.
7693 \x where x is non-alphanumeric is a literal x
7699 \a alarm, that is, the BEL character (hex 07)
7721 \d a decimal digit
7722 \D a character that is not a decimal digit
7723 \h a horizontal white space character
7724 \H a character that is not a horizontal white space character
7725 \N a character that is not a newline
7726 \p{xx} a character with the xx property
7727 \P{xx} a character without the xx property
7728 \R a newline sequence
7729 \s a white space character
7730 \S a character that is not a white space character
7731 \v a vertical white space character
7732 \V a character that is not a vertical white space character
7733 \w a "word" character
7734 \W a "non-word" character
7735 \X a Unicode extended grapheme cluster
7799 represented by a Universal Character Name
7855 You can use \Q...\E inside a character class.
7881 \B not a word boundary
7937 The following are recognized only at the very start of a pattern or
7958 option settings with a similar syntax.
7970 option setting with a similar syntax.
7983 Each top-level branch of a lookbehind must be of a fixed length.
8043 The following act only when a subsequent match failure causes a back-
8044 track to reach them. They all force a match failure, but they differ in
8123 the library will be a bit bigger, but the additional run time overhead
8133 category properties such as Lu for an upper case letter or Nd for a
8137 supported. For example, \p{L} matches a letter. Its Perl synonym,
8160 other words, the whole surrogate thing is a fudge for UTF-16 which
8166 and pcre_dfa_exec() also pass back this information, as well as a more
8172 mance, for example in the case of a long subject string that is being
8180 If you want to disable the check for a subject string you must pass
8198 well as a more detailed reason code if the caller has provided memory
8221 well as a more detailed reason code if the caller has provided memory
8244 4. The dot metacharacter matches one UTF character instead of a single
8247 5. The escape sequence \C can be used to match a single byte in UTF-8
8248 mode, or a single 16-bit data unit in UTF-16 mode, or a single 32-bit
8254 JIT optimization is requested for a UTF pattern that contains \C, it
8265 in terms of \w and \W. If you really want to test for a wider sense of,
8311 Just-in-time compiling is a heavyweight optimization that can greatly
8315 necessarily mean many calls of a matching function; if the pattern is
8317 positions in the subject, even for a single call. Therefore, if the
8355 ever, a simple program does not need to check this in order to use JIT.
8356 The normal API is implemented in a way that falls back to the interpre-
8358 sible performance, there is also a "fast path" API that is JIT-spe-
8363 test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
8381 For a program that may be linked with pre-8.20 versions of PCRE, you
8419 is passed a pcre_extra block containing a pointer to JIT code of the
8428 actually used for a particular match, you should arrange for a JIT
8430 "Controlling the JIT stack" below, even if you do not need to supply a
8431 non-default JIT stack. Such a callback function is called whenever JIT
8436 ated. You can find out if JIT execution is available after studying a
8443 Once a pattern has been studied, with or without JIT, it can be used as
8454 The only unsupported pattern items are \C (match a single data unit)
8455 when running in a UTF mode, and a callout immediately before an asser-
8456 tion condition in a conditional group.
8461 When a pattern is matched using JIT execution, the return values are
8465 ling the JIT stack" below for a discussion of JIT stack usage. For com-
8471 searching a very large pattern tree goes on for too long, as it is in
8481 saved (in a file or database) and restored later like the bytecode and
8482 other data of a compiled pattern. Saving and restoring compiled pat-
8485 run pcre_study() on a saved and restored pattern, and thereby recreate
8493 When the compiled JIT code runs, it needs a block of memory to use as a
8501 The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
8502 are a starting size and a maximum size, and it returns a pointer to an
8504 The pcre_jit_stack_free() function can be used to free a stack that is
8509 a maximum stack size of 512K to 1M should be more than enough for any
8519 The extra argument must be the result of studying a pattern with
8527 a valid JIT stack, the result of calling pcre_jit_stack_alloc().
8529 (3) If callback is not NULL, it must point to a function that is
8531 order to set up a JIT stack. If the return from the callback
8533 return value must be a valid JIT stack, the result of calling
8539 determine whether a match operation was executed by JIT or by the
8544 matched sequentially in the same thread. In a multithread application,
8545 if you do not specify a JIT stack, or if you assign or pass back NULL
8546 from a callback, that is thread-safe, because each thread has its own
8547 machine stack. However, if you assign or pass back a non-NULL JIT
8548 stack, this must be a different stack for each thread so that the
8554 assign the same stack to all compiled patterns, and use a global mutex
8558 This is a suggestion for how a multithreaded program that needs to set
8567 Use a one-line callback function
8572 argument is non-NULL and points to a pcre_extra block that is the
8573 result of a successful study with PCRE_STUDY_JIT_COMPILE etc.
8580 PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack
8589 Modern operating systems have a nice feature: they can reserve an
8593 allocate 1M address space, and use only a single memory page (usually
8597 (3) Who "owns" a JIT stack?
8600 or anything else. The user program must ensure that if a stack is used
8604 grams is to allocate a stack for each thread, and return this stack
8607 (4) When should a JIT stack be freed?
8609 You can free a JIT stack at any time, as long as it will not be used by
8610 pcre_exec() again. When you assign the stack to a pattern, only a
8613 call pcre_exec() with a pattern pointing to an already freed stack, as
8614 that will cause SEGFAULT. (Also, do not free a stack currently used by
8615 pcre_exec() in another thread). You can also replace the stack for a
8617 assigning a replacement.
8619 (5) Should I allocate/free a stack every time before/after calling
8625 this without keeping a list of the currently JIT studied patterns.
8628 if a pattern causes stack overflow with a stack of 1M? Is that 1M kept
8631 Especially on embedded systems, it might be a good idea to release mem-
8633 the moment. Probably a function call which returns with the currently
8635 ory (shrinking the stack) would be a good idea if someone needs this.
8637 (7) This is too much of a headache. Isn't there any better solution for
8646 This is a single-threaded example that specifies a JIT stack without
8647 using a callback.
8673 pcre_exec() does have a performance impact. Programs that are written
8675 possible performance, can instead use a "fast path" API to call JIT
8681 must point to a JIT stack. The JIT stack arrangements described above
8684 When you call pcre_exec(), as well as testing for invalid options, a
8687 immediate error is given. Also, unless PCRE_NO_UTF[8|16|32] is set, a
8724 In normal use of PCRE, if the subject string that is passed to a match-
8730 Consider, for example, an application where a human is required to type
8731 in data for a field with specific formatting requirements. An example
8732 might be a date in the form ddmmmyy, defined by this pattern:
8738 raise an error as soon as a mistake is made, by beeping and not
8740 ate feedback is likely to be a better user interface than a check that
8747 matching functions. For backwards compatibility, PCRE_PARTIAL is a syn-
8749 options is whether or not a partial match is preferred to an alterna-
8763 has not been set for a match, the interpretive matching code is used.
8765 Setting a partial matching option disables two of PCRE's standard opti-
8766 mizations. PCRE remembers the last literal data unit in a pattern, and
8768 string. This optimization cannot be used for a subject string that
8770 minimum length of a matching string, and does not bother to run the
8777 A partial match occurs during a call to pcre_exec() or
8783 of inspecting characters before the start of a matched substring. The
8785 empty string can always be matched; without such a restriction there
8786 would always be a partial match of an empty string at the end of the
8789 If there are at least two slots in the offsets vector when a partial
8792 to the end of the subject so that a substring can easily be identified.
8805 subject string is "xyzabc12", the first two offsets after a partial
8810 What happens when a partial match is identified depends on which of the
8816 identifies a partial match, the partial match is remembered, but match-
8821 This option is "soft" because it prefers a complete match over a par-
8822 tial match. All the various matching items in a pattern behave as if
8825 of the subject is treated as a non-alphanumeric.
8842 PCRE_ERROR_PARTIAL is returned as soon as a partial match is found,
8844 is "hard" because it prefers an earlier partial match over a later com-
8854 special case of a truncated character at the end of the subject,
8861 trated by a pattern such as:
8867 "dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
8874 In this case the result is always a complete match because that is
8875 found first, and matching never continues after finding a complete
8891 tern, there is the possibility of a partial match, again provided that
8896 are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
8917 If a pattern ends with one of the sequences \b or \B, which test for word
8923 This matches "cat", provided there is a word boundary at either end. If
8924 the subject string is "the cat", the comparison of the final "t" with a
8925 following character cannot take place, so a partial match is found.
8927 subject when the last character is a letter, so a complete match is
8943 repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
8946 PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
8952 If the escape sequence \P is present in a pcretest data line, the
8953 PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
8974 If the escape sequence \P is present more than once in a pcretest data
8980 When a partial match has been found using a DFA matching function, it
9002 That means that, for an unanchored pattern, if a continued match fails,
9003 it is not possible to try again at a new starting point. All this
9006 the result is no match, even though there would be a match for "aug23"
9010 subject and try a new complete match.
9022 to restart the previous match with a new segment of data. Instead, new
9028 not treat the end of a segment as the end of the subject when matching
9043 with \b or \B, the string that is returned for a partial match includes
9044 characters that precede the start of what would be returned for a com-
9054 1. If the pattern contains a test for the beginning of a line, you need
9056 does start at the beginning of a line. There is also a PCRE_NOTEOL
9061 in the offsets that are returned for a partial match. However a lookbe-
9072 From release 8.33, there is a more accurate way of deciding which char-
9081 This indicates that the matching process that gave a partial match
9082 started at offset 5, but the characters "123a" were all inspected. The
9084 shows that we need only keep "123a", and the next match attempt can be
9085 started at offset 3 (that is, at "a") when further characters have been
9091 Partial match at offset 5: 123a
9093 3. Because a partial match must always contain at least one character,
9094 what might be considered a partial match of an empty string actually
9095 gives a "no match" result. For example:
9101 If the next segment begins "cx", a match should be found, but this will
9103 this reason, a "no match" result should be interpreted as "partial
9106 4. Matching a subject string that is split into multiple segments may
9112 PCRE_PARTIAL_SOFT) a partial match result is given only when there are
9114 been found, continuation to a new subject segment is no longer possi-
9128 The first data line passes the string "dogsb" to a standard matching
9130 a partial match for "dogsbody", the result is not PCRE_ERROR_PARTIAL,
9131 because the shorter string "dog" is a complete match. Similarly, when
9132 the subject is presented to a DFA matching function in several parts
9135 "dogsbody" is presented as a single string, a DFA matching function
9156 If the first part of the subject is "ABC123", a partial match of the
9158 the second alternative, because such a match does not start at the same
9160 "7890" does not yield a match because only those alternatives that
9168 where no string can be a partial match for both alternatives. This is
9169 not a problem if a standard matching function is used, because the
9180 tions. Another possibility is to work with two buffers. If a partial
9182 PCRE_DFA_RESTART is used on the second buffer, you can then try a new
9209 If you are running an application that uses a large number of regular
9210 expression patterns, it may be useful to store them in a precompiled
9214 If you are using private tables, it is a little bit more complicated.
9218 If you save compiled patterns to a file, you can copy them to a differ-
9223 NESS if they detect a pattern with the wrong endianness.
9225 Compiling regular expressions with one version of PCRE for use with a
9227 saving and restoring a compiled pattern loses any JIT optimization
9233 The value returned by pcre[16|32]_compile() points to a single block of
9238 8-bit library that compiles a pattern and writes it to a file. It
9239 assumes that the variable fd refers to a file that is open for output:
9254 the 256 possible byte values. On systems that make a distinction
9258 If you want to write more than one pattern to a file, you will have to
9259 devise a way of separating them. For binary data, preceding each pat-
9262 binary, one pattern to a line.
9264 Saving compiled patterns in a file is only one possible way of storing
9265 them for later use. They could equally well be saved in a database, or
9270 study data in a similar way to the compiled pattern itself. However, if
9274 pcre[16|32]_study() returns a pointer to a pcre[16|32]_extra data
9275 block. Its format is defined in the section on matching a pattern in
9280 ber to check that pcre[16|32]_study() did return a non-NULL value
9286 Re-using a precompiled pattern is straightforward. Having reloaded it
9291 However, if you passed a pointer to custom character tables when the
9293 you must now pass a similar pointer to pcre[16|32]_exec() or
9295 tern will obviously be nonsense. A field in a pcre[16|32]_extra block
9296 is used to pass this data, as described in the section on matching a
9314 optimization, that data cannot be saved, and so is lost by a
9321 update to a new PCRE release, though not all updates actually require
9349 cessing time. The way you express your pattern as a regular expression
9355 Patterns are compiled by PCRE into a reasonably efficient interpretive
9357 there is one case where the memory usage of a compiled pattern can be
9358 unexpectedly large. If a parenthesized subpattern has a quantifier with
9359 a minimum greater than 1 and/or a limited maximum, the whole subpattern
9372 is not usually a problem. However, if the numbers are large, and par-
9380 limit on a compiled pattern is 64K data units, and this is reached with
9394 as atomic groups into which there can be no backtracking if there is a
9396 rewriting automatically. Furthermore, there is a noticeable loss of
9398 grouping is not a problem and the loss of speed is acceptable, this
9417 ciently than others. It is more efficient to use a character class like
9418 [aeiou] than a set of single-character alternatives such as
9419 (a|e|i|o|u). In general, the simplest construction that provides the
9421 contains a lot of useful general discussion about optimizing regular
9422 expressions for efficient performance. This document contains a few
9426 slow, because PCRE has to use a multi-stage table lookup whenever it
9427 needs a character's property. If you can find an alternative pattern
9435 when matched with a traditional matching function; the performance loss
9436 is less with a DFA matching function, and in both cases there is not
9439 When a pattern begins with .* not in parentheses, or in parentheses
9440 that are not the subject of a backreference, and the PCRE_DOTALL option
9442 only at the start of a subject string. However, if PCRE_DOTALL is not
9444 does not then match a newline, and if the subject string contains new-
9450 matches the subject "first\nand second" (where \n stands for a newline
9455 If you are using such a pattern with subject strings that do not con-
9459 a newline to restart at.
9462 take a long time to run when applied to a string that does not match.
9465 ^(a+)*
9477 (a+)*b
9479 where a literal character follows. Before embarking on the standard
9480 matching procedure, PCRE checks that there is a "b" later in the sub-
9485 (a+)*\d
9487 with the pattern above. The former gives a failure almost instantly
9488 when applied to a whole line of "a" characters, whereas the latter
9492 an atomic group or a possessive quantifier.
9533 This set of functions provides a POSIX-style API for the PCRE regular
9534 expression 8-bit library. See the pcreapi documentation for a descrip-
9542 called pcreposix.a, so can be accessed by adding -lpcreposix to the
9550 easier to slot in PCRE as a replacement library. Other POSIX options
9576 The function regcomp() is called to compile a pattern into an internal
9577 form. The pattern is a C string terminated by a binary zero, and is
9578 passed in the argument pattern. The preg argument is a pointer to a
9579 regex_t structure that is used as a base for storing information about
9606 passed for compilation to the native function. In addition, when a pat-
9637 by a negative class such as [^a] (they are).
9653 then PCRE was never intended to be a POSIX engine. The following table
9660 newline matches [^a] yes not changeable
9670 newline matches [^a] yes REG_NEWLINE
9677 no way to stop newline from matching [^a].
9686 The function regexec() is called to match a compiled pattern preg
9687 against a given string, which is by default terminated by a zero byte
9711 have a terminating NUL located at string + pmatch[0].rm_eo (there need
9712 not actually be a NUL at that location), regardless of the value of
9713 nmatch. This is a BSD extension, compatible with but not specified by
9715 software intended to be portable to other systems. Note that a non-zero
9736 A successful match yields a zero return; various error codes are
9743 The regerror() function maps a non-zero errorcode from either regcomp()
9744 or regexec() to a printable message. If preg is not NULL, the error
9746 by a binary zero is placed in errbuf. The length of the message,
9753 Compiling a regular expression causes memory to be allocated and asso-
9755 memory, after which preg may no longer be used as a compiled expres-
9797 The "FullMatch" operation checks that supplied text matches a supplied
9809 Example: creating a temporary RE object:
9812 You can pass in a "const char*" or a "string" for "text". The examples
9813 below tend to use a const char*. You can, as in the different examples
9814 above, store the RE object explicitly in a variable or use a temporary
9852 a. "text" matches "pattern" exactly;
9857 c. The "i"th argument has a suitable type for holding the
9859 void * NULL for the "i"th argument, or a non-void * NULL
9866 return false (because the empty string is not a valid number):
9869 pcrecpp::RE::FullMatch("abc", "[a-z]+(\\d+)?", &number);
9875 NOTE: Do not use no_arg, which is used internally to mark the end of a
9876 list of optional arguments, as a placeholder for missing arguments, as
9883 potentially meaningful characters in a string. The returned string,
9884 used as a regular expression, will exactly match the original string.
9889 Note that it's legal to escape a character even if it has no special
9890 meaning in a regular expression -- so this function does that. (This
9901 Example: simple search for a string:
9904 Example: find first number in a string:
9915 string to be treated as UTF-8 text, still a byte stream but potentially
9920 of a multi-byte character.
9940 RE_Options, as a vehicle to pass such modifiers to a RE class. Cur-
9959 For a full account on how each modifier works, please check the PCRE
9974 functions. Setting match_limit to a non-zero value will limit the exe-
9976 or taking an eternity to return a result. A value of 5000 is good
9977 enough to stop stack blowup in a 2MB thread stack. Setting match_limit
9984 Normally, to pass one or more modifiers to a RE class, you declare a
9986 a RE constructor. Example:
9993 ments and creates a set of flags that are off by default. The optional
10007 convenience functions that return a RE_Options class with the appropri-
10012 through the pains of declaring a RE_Options object and setting several
10013 options, there is a parallel method that gives you such ability on the
10015 each of them returns a reference to its class object. For example, to
10016 pass PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
10029 regular expressions at the front of a string and skip over them as they
10030 match. This requires use of the "StringPiece" type, which represents a
10031 sub-range of a real string. Like RE, StringPiece is defined in the
10034 Example: read lines of the form "var = value" from a string.
10036 pcrecpp::StringPiece input(contents); // Wrap in a StringPiece
10050 could extract all words from a string by repeatedly calling
10057 By default, if you pass a pointer to a numeric value, the corresponding
10058 text is interpreted as a base-10 number. You can instead wrap the
10059 pointer with a call to one of the operators Hex(), Octal(), or CRadix()
10065 int a, b, c, d;
10068 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
10071 will leave 64 in a, b, c, and d.
10085 pattern matches and a replacement occurs, false otherwise.
10099 non-matching portions of "text" are ignored. Returns true iff a match
10128 do not have a copy of the PCRE distribution, you can save this listing
10140 subject string. The logic is a little bit tricky because of the possi-
10151 to the command line. For example, on a Unix-like system that has PCRE
10153 using a command like this:
10158 In a Windows environment, if you want to statically link the program
10159 against a non-dll pcre.a file, you must uncomment the line that defines
10170 Note that there is a much more comprehensive test program, called
10173 as a simple coding example.
10179 ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
10214 The maximum length of a compiled pattern is approximately 64K data
10229 can be no more than 65535 capturing subpatterns. There is, however, a
10235 There is a limit to the number of forward references to subsequent sub-
10241 The maximum length of a name for a named subpattern is 32 characters, and
10244 The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or
10248 The maximum length of a subject string is the largest positive number
10252 the size of a subject string that can be processed by certain patterns.
10253 For a discussion of stack issues, see the pcrestack documentation.
10282 back up and try a different alternative if the first one fails. As
10285 circumstances, for example, whenever a parenthesized sub-pattern is
10289 as a* it may be called several times at the same level, after matching
10290 different numbers of a's. Furthermore, in a number of cases where the
10292 result of the current call (a "tail recursion"), the function is just
10304 way, and uses recursion only when there is a regular expression recur-
10331 either one character that is not "<" or a "<" that is not followed by
10332 "inet". However, each time a parenthesis is processed, a recursion
10333 occurs, so this formulation uses a stack frame for each matched charac-
10334 ter. For a long string, a lot of stack is required. Consider now this
10341 sion happens only when a "<" character that is not followed by "inet"
10354 up points when pcre[16|32]_exec() is running. This makes it run a lot
10368 in total and recursively. If a limit is exceeded, pcre[16|32]_exec()
10376 As a very rough rule of thumb, you should reckon on about 500 bytes per
10381 In Unix-like environments, the pcretest test program has a command line
10384 the smallest limits that allow a particular pattern to match a given
10390 The actual amount of stack used per recursion can vary quite a lot,
10401 mation about stack use is given in a line like this:
10405 The value is approximate because some recursions need a bit more (up to
10414 In Unix-like environments, there is not often a problem with the stack
10422 though sometimes a more explicit error message is given. You can nor-
10437 is also possible to set a stack size when linking a program. There is a