-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcredemo program. There are separate text files for the pcregrep and
pcretest commands.
-----------------------------------------------------------------------------


PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions


INTRODUCTION

       The  PCRE  library is a set of functions that implement regular expres-
       sion pattern matching using the same syntax and semantics as Perl, with
       just  a few differences. Some features that appeared in Python and PCRE
       before they appeared in Perl are also available using the  Python  syn-
       tax,  there  is  some  support for one or two .NET and Oniguruma syntax
       items, and there is an option for requesting some  minor  changes  that
       give better JavaScript compatibility.

       Starting with release 8.30, it is possible to compile two separate PCRE
       libraries:  the  original,  which  supports  8-bit  character   strings
       (including  UTF-8  strings),  and a second library that supports 16-bit
       character strings (including UTF-16 strings). The build process  allows
       either  one  or both to be built. The majority of the work to make this
       possible was done by Zoltan Herczeg.

       Starting with release 8.32 it is possible to compile a  third  separate
       PCRE library, which supports 32-bit character strings (including UTF-32
       strings). The build process allows any combination of the 8-,  16-  and
       32-bit libraries to be built. The work to make this possible  was  done
       by Christian Persch.

       The  three  libraries  contain identical sets of functions, except that
       the names in the 16-bit library start with pcre16_  instead  of  pcre_,
       and  the  names  in  the  32-bit  library start with pcre32_ instead of
       pcre_. To avoid over-complication and reduce the documentation  mainte-
       nance load, most of the documentation describes the 8-bit library, with
       the differences for the 16-bit and  32-bit  libraries  described  sepa-
       rately  in  the  pcre16  and  pcre32  pages. References to functions or
       structures of the  form  pcre[16|32]_xxx  should  be  read  as  meaning
       "pcre_xxx  when  using  the  8-bit  library,  pcre16_xxx when using the
       16-bit library, or pcre32_xxx when using the 32-bit library".

       The current implementation of PCRE corresponds approximately with  Perl
       5.12,  including  support  for  UTF-8/16/32 encoded strings and Unicode
       general category properties. However, UTF-8/16/32 and  Unicode  support
       has to be explicitly enabled; it is not the default. The Unicode tables
       correspond to Unicode release 6.2.0.

       In addition to the Perl-compatible matching function, PCRE contains  an
       alternative  function that matches the same compiled patterns in a dif-
       ferent way. In certain circumstances, the alternative function has some
       advantages.   For  a discussion of the two matching algorithms, see the
       pcrematching page.

       PCRE is written in C and released as a C library. A  number  of  people
       have  written  wrappers and interfaces of various kinds. In particular,
       Google Inc.  have provided a comprehensive C++ wrapper  for  the  8-bit
       library.  This  is  now  included as part of the PCRE distribution. The
       pcrecpp page has details of this interface.  Other  people's  contribu-
       tions  can  be  found in the Contrib directory at the primary FTP site,
       which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details of exactly which Perl regular expression features are  and  are
       not supported by PCRE are given in separate documents. See the pcrepat-
       tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
       page.

       Some  features  of  PCRE can be included, excluded, or changed when the
       library is built. The pcre_config() function makes it  possible  for  a
       client  to  discover  which  features are available. The features them-
       selves are described in the pcrebuild page. Documentation about  build-
       ing  PCRE  for various operating systems can be found in the README and
       NON-AUTOTOOLS_BUILD files in the source distribution.

       The libraries contain a number of undocumented internal  functions  and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names all begin with "_pcre_" or "_pcre16_" or "_pcre32_", which
       hopefully will not provoke any name clashes. In some  environments,  it
       is  possible  to  control  which  external  symbols are exported when a
       shared library is built, and in these cases  the  undocumented  symbols
       are not exported.


SECURITY CONSIDERATIONS

       If  you  are  using PCRE in a non-UTF application that permits users to
       supply arbitrary patterns for compilation, you should  be  aware  of  a
       feature that allows users to turn on UTF support from within a pattern,
       provided that PCRE was built with UTF support. For  example,  an  8-bit
       pattern  that  begins  with  "(*UTF8)" or "(*UTF)" turns on UTF-8 mode,
       which interprets patterns and subjects as strings of  UTF-8  characters
       instead  of  individual 8-bit characters.  This causes both the pattern
       and any data against which it is matched to be checked for UTF-8 valid-
       ity.  If  the  data  string is very long, such a check might use suffi-
       ciently many resources as to cause your  application  to  lose  perfor-
       mance.

       The  best  way  of  guarding  against  this  possibility  is to use the
       pcre_fullinfo() function to check the compiled  pattern's  options  for
       UTF.

       If  your  application  is one that supports UTF, be aware that validity
       checking can take time. If the same data string is to be  matched  many
       times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second
       and subsequent matches to save redundant checks.

       Another way that performance can be hit is by running  a  pattern  that
       has  a  very  large search tree against a string that will never match.
       Nested unlimited repeats in a pattern are a common example.  PCRE  pro-
       vides some protection against this: see the PCRE_EXTRA_MATCH_LIMIT fea-
       ture in the pcreapi page.


USER DOCUMENTATION

       The user documentation for PCRE comprises a number  of  different  sec-
       tions.  In the "man" format, each of these is a separate "man page". In
       the HTML format, each is a separate page, linked from the  index  page.
       In  the  plain  text format, all the sections, except the pcredemo sec-
       tion, are concatenated, for ease of searching. The sections are as fol-
       lows:

         pcre              this document
         pcre16            details of the 16-bit library
         pcre32            details of the 32-bit library
         pcre-config       show PCRE installation configuration information
         pcreapi           details of PCRE's native C API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcrecpp           details of the C++ wrapper for the 8-bit library
         pcredemo          a demonstration C program that uses PCRE
         pcregrep          description of the pcregrep command (8-bit only)
         pcrejit           discussion of the just-in-time optimization support
         pcrelimits        details of size and other limits
         pcrematching      discussion of the two matching algorithms
         pcrepartial       details of the partial matching facility
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible C API for the 8-bit library
         pcreprecompile    details of saving and re-using precompiled patterns
         pcresample        discussion of the pcredemo program
         pcrestack         discussion of stack usage
         pcresyntax        quick syntax reference
         pcretest          description of the pcretest testing command
         pcreunicode       discussion of Unicode and UTF-8/16/32 support

       In  addition,  in the "man" and HTML formats, there is a short page for
       each C library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam  magnet,
       so  I've  taken  it away. If you want to email me, use my two initials,
       followed by the two digits 10, at the domain cam.ac.uk.


REVISION

       Last updated: 11 November 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------


PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

       #include <pcre.h>


PCRE 16-BIT API BASIC FUNCTIONS

       pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
            int *errorcodeptr,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre16_extra *pcre16_study(const pcre16 *code, int options,
            const char **errptr);

       void pcre16_free_study(pcre16_extra *extra);

       int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
            PCRE_SPTR16 subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
            PCRE_SPTR16 subject, int length, int startoffset,
            int options, int *ovector, int ovecsize,
            int *workspace, int wscount);


PCRE 16-BIT API STRING EXTRACTION FUNCTIONS

       int pcre16_copy_named_substring(const pcre16 *code,
            PCRE_SPTR16 subject, int *ovector,
            int stringcount, PCRE_SPTR16 stringname,
            PCRE_UCHAR16 *buffer, int buffersize);

       int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
            int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
            int buffersize);

       int pcre16_get_named_substring(const pcre16 *code,
            PCRE_SPTR16 subject, int *ovector,
            int stringcount, PCRE_SPTR16 stringname,
            PCRE_SPTR16 *stringptr);

       int pcre16_get_stringnumber(const pcre16 *code,
            PCRE_SPTR16 name);

       int pcre16_get_stringtable_entries(const pcre16 *code,
            PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);

       int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
            int stringcount, int stringnumber,
            PCRE_SPTR16 *stringptr);

       int pcre16_get_substring_list(PCRE_SPTR16 subject,
            int *ovector, int stringcount, PCRE_SPTR16 **listptr);

       void pcre16_free_substring(PCRE_SPTR16 stringptr);

       void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);


PCRE 16-BIT API AUXILIARY FUNCTIONS

       pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);

       void pcre16_jit_stack_free(pcre16_jit_stack *stack);

       void pcre16_assign_jit_stack(pcre16_extra *extra,
            pcre16_jit_callback callback, void *data);

       const unsigned char *pcre16_maketables(void);

       int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
            int what, void *where);

       int pcre16_refcount(pcre16 *code, int adjust);

       int pcre16_config(int what, void *where);

       const char *pcre16_version(void);

       int pcre16_pattern_to_host_byte_order(pcre16 *code,
            pcre16_extra *extra, const unsigned char *tables);


PCRE 16-BIT API INDIRECTED FUNCTIONS

       void *(*pcre16_malloc)(size_t);

       void (*pcre16_free)(void *);

       void *(*pcre16_stack_malloc)(size_t);

       void (*pcre16_stack_free)(void *);

       int (*pcre16_callout)(pcre16_callout_block *);


PCRE 16-BIT API 16-BIT-ONLY FUNCTION

       int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
            PCRE_SPTR16 input, int length, int *byte_order,
            int keep_boms);


THE PCRE 16-BIT LIBRARY

       Starting  with  release  8.30, it is possible to compile a PCRE library
       that supports 16-bit character strings, including  UTF-16  strings,  as
       well  as  or instead of the original 8-bit library. The majority of the
       work to make  this  possible  was  done  by  Zoltan  Herczeg.  The  two
       libraries contain identical sets of functions, used in exactly the same
       way. Only the names of the functions and the data types of their  argu-
       ments  and results are different. To avoid over-complication and reduce
       the documentation maintenance load,  most  of  the  PCRE  documentation
       describes  the  8-bit  library,  with only occasional references to the
       16-bit library. This page describes what is different when you use  the
       16-bit library.

       WARNING:  A  single  application can be linked with both libraries, but
       you must take care when processing any particular pattern to use  func-
       tions  from  just one library. For example, if you want to study a pat-
       tern that was compiled with  pcre16_compile(),  you  must  do  so  with
       pcre16_study(), not pcre_study(), and you must free the study data with
       pcre16_free_study().


THE HEADER FILE

       There is only one header file, pcre.h. It contains prototypes  for  all
       the functions in all libraries, as well as definitions of flags, struc-
       tures, error codes, etc.


THE LIBRARY NAME

       In Unix-like systems, the 16-bit library is called libpcre16,  and  can
       normally  be  accessed by adding -lpcre16 to the command for linking an
       application that uses PCRE.


STRING TYPES

       In the 8-bit library, strings are passed to PCRE library  functions  as
       vectors  of  bytes  with  the  C  type "char *". In the 16-bit library,
       strings are passed as vectors of unsigned 16-bit quantities. The  macro
       PCRE_UCHAR16  specifies  an  appropriate  data type, and PCRE_SPTR16 is
       defined as "const PCRE_UCHAR16 *". In very  many  environments,  "short
       int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
       as "unsigned short int", but checks that it really  is  a  16-bit  data
       type.  If  it is not, the build fails with an error message telling the
       maintainer to modify the definition appropriately.
342
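       The width check described under STRING TYPES can be sketched  in  plain
       C.  This is an illustration only, not PCRE's own build-time check:  the
       names sketch_uchar16, sketch_uchar16_is_16_bits and sketch_strlen16 are
       hypothetical  stand-ins,  and  the negative-array-size trick is just one
       common pre-C11 way of making a build fail.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for PCRE_UCHAR16; not the library's own typedef. */
typedef unsigned short sketch_uchar16;

/* Compile-time check: the array size becomes -1, which is a compile error,
   unless the type is exactly 16 bits wide. */
typedef char sketch_uchar16_is_16_bits[
    (sizeof(sketch_uchar16) * CHAR_BIT == 16) ? 1 : -1];

/* Length of a zero-terminated 16-bit string, counted in 16-bit units. */
size_t sketch_strlen16(const sketch_uchar16 *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```

       On a platform where "unsigned short" is not 16 bits wide, this  fragment
       would  refuse  to compile, which mirrors the build failure that the text
       describes.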
343
STRUCTURE TYPES

       The types of the opaque structures that are used  for  compiled  16-bit
       patterns  and  JIT stacks are pcre16 and pcre16_jit_stack respectively.
       The  type  of  the  user-accessible  structure  that  is  returned   by
       pcre16_study()  is  pcre16_extra, and the type of the structure that is
       used for passing data to a callout  function  is  pcre16_callout_block.
       These structures contain the same fields, with the same names, as their
       8-bit counterparts. The only difference is that pointers  to  character
       strings are 16-bit instead of 8-bit types.


16-BIT FUNCTIONS

       For  every function in the 8-bit library there is a corresponding func-
       tion in the 16-bit library with a name that starts with pcre16_ instead
       of  pcre_.  The  prototypes are listed above. In addition, there is one
       extra function, pcre16_utf16_to_host_byte_order(). This  is  a  utility
       function  that converts a UTF-16 character string to host byte order if
       necessary. The other 16-bit  functions  expect  the  strings  they  are
       passed to be in host byte order.

       The input and output arguments of pcre16_utf16_to_host_byte_order() may
       point to the same address, that is, conversion in place  is  supported.
       The output buffer must be at least as long as the input.

       The  length  argument  specifies the number of 16-bit data units in the
       input string; a negative value specifies a zero-terminated string.

       If byte_order is NULL, it is assumed that the string starts off in host
       byte  order. This may be changed by byte-order marks (BOMs) anywhere in
       the string (commonly as the first character).

       If byte_order is not NULL, a non-zero value of the integer to which  it
       points  means  that  the input starts off in host byte order, otherwise
       the opposite order is assumed. Again, BOMs in  the  string  can  change
       this. The final byte order is passed back at the end of processing.

       If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
       copied into the output string. Otherwise they are discarded.

       The result of the function is the number of 16-bit  units  placed  into
       the  output  buffer,  including  the  zero terminator if the string was
       zero-terminated.
388
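       The    argument    rules    for    pcre16_utf16_to_host_byte_order()
       can  be  made  concrete  with  a  small stand-alone re-creation. What
       follows is a sketch written from the description above, not the  library
       source:  sketch_utf16_to_host  and  sketch_uchar16 are hypothetical names,
       and the sketch assumes that a BOM read in the wrong  byte  order  shows
       up as the unit 0xfffe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PCRE_UCHAR16. */
typedef unsigned short sketch_uchar16;

/* Sketch of the described behaviour: convert to host byte order, let BOMs
   switch the current order, and optionally drop them. Returns the number
   of units written, including the terminator for zero-terminated input. */
int sketch_utf16_to_host(sketch_uchar16 *output, const sketch_uchar16 *input,
                         int length, int *byte_order, int keep_boms)
{
    /* NULL byte_order means "assume the input starts in host order". */
    int host_order = (byte_order != NULL) ? *byte_order : 1;
    int written = 0;
    int i;

    assert(output != NULL && input != NULL);

    if (length < 0) {                 /* zero-terminated: count the units */
        length = 0;
        while (input[length] != 0)
            length++;
        length++;                     /* include the terminator itself */
    }

    for (i = 0; i < length; i++) {
        sketch_uchar16 c = input[i];  /* read before writing: in-place safe */

        if (c == 0xfeff || c == 0xfffe) {
            host_order = (c == 0xfeff);   /* 0xfffe is a byte-swapped BOM */
            if (keep_boms)
                output[written++] = 0xfeff;
        } else if (host_order) {
            output[written++] = c;
        } else {
            output[written++] = (sketch_uchar16)(((c >> 8) | (c << 8)) & 0xffff);
        }
    }

    if (byte_order != NULL)
        *byte_order = host_order;     /* pass back the final byte order */
    return written;
}
```

       Because the write index never overtakes the read index, passing the same
       buffer as input and output is safe here, which matches  the  in-place
       conversion that the text says is supported.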
389
SUBJECT STRING OFFSETS

       The offsets within subject strings that are returned  by  the  matching
       functions are in 16-bit units rather than bytes.
394
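       The practical consequence of unit-based offsets is that they  can  be
       used  directly  for  pointer  arithmetic on the 16-bit subject, but must
       be multiplied by the unit size before being passed to  byte-oriented
       routines  such  as  memcpy().  The  sketch  below  assumes  the usual
       PCRE convention  that  the  first  two  ovector  entries  hold  the
       start  and end offsets of the whole match; sketch_copy_match16 and
       sketch_uchar16 are hypothetical names, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PCRE_UCHAR16. */
typedef unsigned short sketch_uchar16;

/* Copy the substring described by ovector[0]..ovector[1] (offsets in 16-bit
   units) into dst and zero-terminate it. Returns the substring length in
   units, or -1 if the pair is unset/invalid or dst is too small. */
int sketch_copy_match16(const sketch_uchar16 *subject, const int *ovector,
                        sketch_uchar16 *dst, int dstsize)
{
    int start, end, units;

    assert(subject != NULL && ovector != NULL && dst != NULL);

    start = ovector[0];
    end = ovector[1];
    if (start < 0 || end < start)
        return -1;

    units = end - start;
    if (units + 1 > dstsize)          /* need room for the terminator */
        return -1;

    /* Unit offsets index the array directly; memcpy() needs a byte count. */
    memcpy(dst, subject + start, (size_t)units * sizeof(sketch_uchar16));
    dst[units] = 0;
    return units;
}
```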
395
NAMED SUBPATTERNS

       The  name-to-number translation table that is maintained for named sub-
       patterns uses 16-bit characters.  The  pcre16_get_stringtable_entries()
       function returns the length of each entry in the table as the number of
       16-bit data units.


OPTION NAMES

       There   are   two   new   general   option   names,   PCRE_UTF16    and
       PCRE_NO_UTF16_CHECK,     which     correspond    to    PCRE_UTF8    and
       PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
       define  the  same bits in the options word. There is a discussion about
       the validity of UTF-16 strings in the pcreunicode page.

       For the pcre16_config() function there is an  option  PCRE_CONFIG_UTF16
       that  returns  1  if UTF-16 support is configured, otherwise 0. If this
       option  is  given  to  pcre_config()  or  pcre32_config(),  or  if  the
       PCRE_CONFIG_UTF8  or  PCRE_CONFIG_UTF32  option is given to pcre16_con-
       fig(), the result is the PCRE_ERROR_BADOPTION error.


CHARACTER CODES

       In 16-bit mode, when  PCRE_UTF16  is  not  set,  character  values  are
       treated in the same way as in 8-bit, non-UTF-8 mode, except, of course,
       that they can range from 0 to 0xffff instead of 0  to  0xff.  Character
       types  for characters less than 0xff can therefore be influenced by the
       locale in the same way as before.  Characters greater  than  0xff  have
       only one case, and no "type" (such as letter or digit).

       In  UTF-16  mode,  the  character  code  is  Unicode, in the range 0 to
       0x10ffff, with the exception of values in the range  0xd800  to  0xdfff
       because  those  are "surrogate" values that are used in pairs to encode
       values greater than 0xffff.

       A UTF-16 string can indicate its endianness by a special code known  as
       a  byte-order  mark  (BOM).  The  PCRE  functions  do  not handle this,
       expecting strings to be in host byte order. A utility  function  called
       pcre16_utf16_to_host_byte_order()  is  provided  to help with this (see
       above).
438
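       The surrogate arithmetic mentioned under CHARACTER CODES is the  standard
       UTF-16 encoding rule and can be sketched as follows. The function name
       sketch_utf16_encode and the type sketch_uchar16 are hypothetical;  this
       is not a PCRE function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PCRE_UCHAR16. */
typedef unsigned short sketch_uchar16;

/* Encode one Unicode code point as UTF-16 in host byte order.
   Returns the number of 16-bit units written (1 or 2), or 0 if the value
   is outside 0..0x10ffff or falls in the surrogate range 0xd800..0xdfff. */
int sketch_utf16_encode(unsigned long cp, sketch_uchar16 out[2])
{
    assert(out != NULL);

    if (cp > 0x10fffful || (cp >= 0xd800ul && cp <= 0xdffful))
        return 0;

    if (cp <= 0xfffful) {             /* fits in a single unit */
        out[0] = (sketch_uchar16)cp;
        return 1;
    }

    cp -= 0x10000ul;                  /* 20 bits remain to be split */
    out[0] = (sketch_uchar16)(0xd800ul | (cp >> 10));     /* high surrogate */
    out[1] = (sketch_uchar16)(0xdc00ul | (cp & 0x3fful)); /* low surrogate  */
    return 2;
}
```

       For example, the code point 0x1f600 is above 0xffff and so needs a pair;
       with this sketch it becomes the units 0xd83d and 0xde00.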
439
ERROR NAMES

       The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16  corre-
       spond  to  their  8-bit  counterparts.  The error PCRE_ERROR_BADMODE is
       given when a compiled pattern is passed to a  function  that  processes
       patterns  in  the  other  mode, for example, if a pattern compiled with
       pcre_compile() is passed to pcre16_exec().

       There are new error codes whose names  begin  with  PCRE_UTF16_ERR  for
       invalid  UTF-16  strings,  corresponding to the PCRE_UTF8_ERR codes for
       UTF-8 strings that are described in the section entitled "Reason  codes
       for  invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
       are:

         PCRE_UTF16_ERR1  Missing low surrogate at end of string
         PCRE_UTF16_ERR2  Invalid low surrogate follows high surrogate
         PCRE_UTF16_ERR3  Isolated low surrogate
         PCRE_UTF16_ERR4  Non-character
458
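       The three surrogate-related conditions in the list above can be  illus-
       trated  with  a  small scanner. This is a sketch of the conditions as
       they are worded here, not PCRE's validity checker: the function and the
       SKETCH_ result codes are hypothetical, the error offsets it reports are
       a choice made for this example, and the "Non-character" case  is  left
       out because this page does not say which values it covers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PCRE_UCHAR16. */
typedef unsigned short sketch_uchar16;

/* Hypothetical result codes, numbered to echo PCRE_UTF16_ERR1..ERR3. */
enum {
    SKETCH_UTF16_OK   = 0,
    SKETCH_UTF16_ERR1 = 1,  /* missing low surrogate at end of string  */
    SKETCH_UTF16_ERR2 = 2,  /* high surrogate not followed by low one  */
    SKETCH_UTF16_ERR3 = 3   /* isolated low surrogate                  */
};

/* Scan length units; on error store the offending offset (in units). */
int sketch_check_utf16(const sketch_uchar16 *s, size_t length,
                       size_t *erroffset)
{
    size_t i;

    assert(s != NULL && erroffset != NULL);

    for (i = 0; i < length; i++) {
        unsigned int c = s[i];

        if ((c & 0xf800u) != 0xd800u)
            continue;                      /* not a surrogate at all */

        if ((c & 0x0400u) != 0) {          /* low surrogate seen first */
            *erroffset = i;
            return SKETCH_UTF16_ERR3;
        }
        if (i + 1 >= length) {             /* high surrogate at the end */
            *erroffset = i;
            return SKETCH_UTF16_ERR1;
        }
        if ((s[i + 1] & 0xfc00u) != 0xdc00u) {
            *erroffset = i;
            return SKETCH_UTF16_ERR2;
        }
        i++;                               /* step over the low surrogate */
    }
    return SKETCH_UTF16_OK;
}
```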
459
ERROR TEXTS

       If there is an error while compiling a pattern, the error text that  is
       passed  back by pcre16_compile() or pcre16_compile2() is still an 8-bit
       character string, zero-terminated.


CALLOUTS

       The subject and mark fields in the callout block that is  passed  to  a
       callout function point to 16-bit vectors.


TESTING

       The  pcretest  program continues to operate with 8-bit input and output
       files, but it can be used for testing the 16-bit library. If it is  run
       with the command line option -16, patterns and subject strings are con-
       verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
       library  functions  are used instead of the 8-bit ones. Returned 16-bit
       strings are converted to 8-bit for output. If neither  the  8-bit  nor
       the 32-bit library was compiled, pcretest defaults to 16-bit  and  the
       -16 option is ignored.

       When PCRE is being built, the RunTest script that is  called  by  "make
       check"  uses  the  pcretest  -C  option to discover which of the 8-bit,
       16-bit and 32-bit libraries has been built, and runs the  tests  appro-
       priately.
488
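       The 8-bit to 16-bit conversion mentioned under TESTING can be  pictured,
       for  the  non-UTF case, as widening each byte to one 16-bit unit. The
       sketch below is not pcretest's code, and sketch_widen_8_to_16  is  a
       hypothetical  name; a real converter would also have to decode multi-byte
       UTF-8 sequences when UTF mode is in use, which this sketch does not
       attempt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PCRE_UCHAR16. */
typedef unsigned short sketch_uchar16;

/* Widen a zero-terminated 8-bit string to 16-bit units, one unit per byte.
   dstsize is the capacity of dst in units. Returns the number of units
   written, not counting the terminator, or -1 if dst is too small. */
int sketch_widen_8_to_16(const char *src, sketch_uchar16 *dst, size_t dstsize)
{
    size_t i;

    assert(src != NULL && dst != NULL);

    if (dstsize == 0)
        return -1;

    for (i = 0; src[i] != '\0'; i++) {
        if (i + 1 >= dstsize)              /* keep room for the terminator */
            return -1;
        /* Go through unsigned char so bytes >= 0x80 do not sign-extend. */
        dst[i] = (sketch_uchar16)(unsigned char)src[i];
    }
    dst[i] = 0;
    return (int)i;
}
```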
489
490NOT SUPPORTED IN 16-BIT MODE
491
492       Not all the features of the 8-bit library are available with the 16-bit
493       library. The C++ and POSIX wrapper functions  support  only  the  8-bit
494       library, and the pcregrep program is at present 8-bit only.
495
496
497AUTHOR
498
499       Philip Hazel
500       University Computing Service
501       Cambridge CB2 3QH, England.
502
503
504REVISION
505
506       Last updated: 08 November 2012
507       Copyright (c) 1997-2012 University of Cambridge.
508------------------------------------------------------------------------------
509
510
511PCRE(3)                                                                PCRE(3)
512
513
514NAME
515       PCRE - Perl-compatible regular expressions
516
517       #include <pcre.h>
518
519
520PCRE 32-BIT API BASIC FUNCTIONS
521
522       pcre32 *pcre32_compile(PCRE_SPTR32 pattern, int options,
523            const char **errptr, int *erroffset,
524            const unsigned char *tableptr);
525
526       pcre32 *pcre32_compile2(PCRE_SPTR32 pattern, int options,
527            int *errorcodeptr,
528            const char **errptr, int *erroffset,
529            const unsigned char *tableptr);
530
531       pcre32_extra *pcre32_study(const pcre32 *code, int options,
532            const char **errptr);
533
534       void pcre32_free_study(pcre32_extra *extra);
535
536       int pcre32_exec(const pcre32 *code, const pcre32_extra *extra,
537            PCRE_SPTR32 subject, int length, int startoffset,
538            int options, int *ovector, int ovecsize);
539
540       int pcre32_dfa_exec(const pcre32 *code, const pcre32_extra *extra,
541            PCRE_SPTR32 subject, int length, int startoffset,
542            int options, int *ovector, int ovecsize,
543            int *workspace, int wscount);
544
545
546PCRE 32-BIT API STRING EXTRACTION FUNCTIONS
547
548       int pcre32_copy_named_substring(const pcre32 *code,
549            PCRE_SPTR32 subject, int *ovector,
550            int stringcount, PCRE_SPTR32 stringname,
551            PCRE_UCHAR32 *buffer, int buffersize);
552
553       int pcre32_copy_substring(PCRE_SPTR32 subject, int *ovector,
554            int stringcount, int stringnumber, PCRE_UCHAR32 *buffer,
555            int buffersize);
556
557       int pcre32_get_named_substring(const pcre32 *code,
558            PCRE_SPTR32 subject, int *ovector,
559            int stringcount, PCRE_SPTR32 stringname,
560            PCRE_SPTR32 *stringptr);
561
562       int pcre32_get_stringnumber(const pcre32 *code,
563            PCRE_SPTR32 name);
564
565       int pcre32_get_stringtable_entries(const pcre32 *code,
566            PCRE_SPTR32 name, PCRE_UCHAR32 **first, PCRE_UCHAR32 **last);
567
568       int pcre32_get_substring(PCRE_SPTR32 subject, int *ovector,
569            int stringcount, int stringnumber,
570            PCRE_SPTR32 *stringptr);
571
572       int pcre32_get_substring_list(PCRE_SPTR32 subject,
573            int *ovector, int stringcount, PCRE_SPTR32 **listptr);
574
575       void pcre32_free_substring(PCRE_SPTR32 stringptr);
576
577       void pcre32_free_substring_list(PCRE_SPTR32 *stringptr);
578
579
580PCRE 32-BIT API AUXILIARY FUNCTIONS
581
582       pcre32_jit_stack *pcre32_jit_stack_alloc(int startsize, int maxsize);
583
584       void pcre32_jit_stack_free(pcre32_jit_stack *stack);
585
586       void pcre32_assign_jit_stack(pcre32_extra *extra,
587            pcre32_jit_callback callback, void *data);
588
589       const unsigned char *pcre32_maketables(void);
590
591       int pcre32_fullinfo(const pcre32 *code, const pcre32_extra *extra,
592            int what, void *where);
593
594       int pcre32_refcount(pcre32 *code, int adjust);
595
596       int pcre32_config(int what, void *where);
597
598       const char *pcre32_version(void);
599
600       int pcre32_pattern_to_host_byte_order(pcre32 *code,
601            pcre32_extra *extra, const unsigned char *tables);
602
603
604PCRE 32-BIT API INDIRECTED FUNCTIONS
605
606       void *(*pcre32_malloc)(size_t);
607
608       void (*pcre32_free)(void *);
609
610       void *(*pcre32_stack_malloc)(size_t);
611
612       void (*pcre32_stack_free)(void *);
613
614       int (*pcre32_callout)(pcre32_callout_block *);
615
616
617PCRE 32-BIT API 32-BIT-ONLY FUNCTION
618
619       int pcre32_utf32_to_host_byte_order(PCRE_UCHAR32 *output,
620            PCRE_SPTR32 input, int length, int *byte_order,
621            int keep_boms);
622
623
624THE PCRE 32-BIT LIBRARY
625
626       Starting  with  release  8.32, it is possible to compile a PCRE library
627       that supports 32-bit character strings, including  UTF-32  strings,  as
628       well as or instead of the original 8-bit library. This work was done by
629       Christian Persch, based on the work done  by  Zoltan  Herczeg  for  the
630       16-bit  library.  All  three  libraries contain identical sets of func-
631       tions, used in exactly the same way.  Only the names of  the  functions
632       and  the  data  types  of their arguments and results are different. To
633       avoid over-complication and reduce the documentation maintenance  load,
634       most  of  the PCRE documentation describes the 8-bit library, with only
635       occasional references to the 16-bit and  32-bit  libraries.  This  page
636       describes what is different when you use the 32-bit library.
637
638       WARNING:  A  single  application  can  be linked with all or any of the
639       three libraries, but you must take care when processing any  particular
640       pattern  to  use  functions  from just one library. For example, if you
641       want to study a pattern that was compiled  with  pcre32_compile(),  you
642       must do so with pcre32_study(), not pcre_study(), and you must free the
643       study data with pcre32_free_study().
644
645
646THE HEADER FILE
647
648       There is only one header file, pcre.h. It contains prototypes  for  all
649       the functions in all libraries, as well as definitions of flags, struc-
650       tures, error codes, etc.
651
652
653THE LIBRARY NAME
654
655       In Unix-like systems, the 32-bit library is called libpcre32,  and  can
656       normally  be  accesss  by adding -lpcre32 to the command for linking an
657       application that uses PCRE.
658
659
660STRING TYPES
661
662       In the 8-bit library, strings are passed to PCRE library  functions  as
663       vectors  of  bytes  with  the  C  type "char *". In the 32-bit library,
664       strings are passed as vectors of unsigned 32-bit quantities. The  macro
665       PCRE_UCHAR32  specifies  an  appropriate  data type, and PCRE_SPTR32 is
666       defined as "const PCRE_UCHAR32 *". In very many environments, "unsigned
667       int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32
668       as "unsigned int", but checks that it really is a 32-bit data type.  If
669       it is not, the build fails with an error message telling the maintainer
670       to modify the definition appropriately.
671
672
673STRUCTURE TYPES
674
675       The types of the opaque structures that are used  for  compiled  32-bit
676       patterns  and  JIT stacks are pcre32 and pcre32_jit_stack respectively.
677       The  type  of  the  user-accessible  structure  that  is  returned   by
678       pcre32_study()  is  pcre32_extra, and the type of the structure that is
679       used for passing data to a callout  function  is  pcre32_callout_block.
680       These structures contain the same fields, with the same names, as their
681       8-bit counterparts. The only difference is that pointers  to  character
682       strings are 32-bit instead of 8-bit types.
683
684
68532-BIT FUNCTIONS
686
687       For  every function in the 8-bit library there is a corresponding func-
688       tion in the 32-bit library with a name that starts with pcre32_ instead
689       of  pcre_.  The  prototypes are listed above. In addition, there is one
690       extra function, pcre32_utf32_to_host_byte_order(). This  is  a  utility
691       function  that converts a UTF-32 character string to host byte order if
692       necessary. The other 32-bit  functions  expect  the  strings  they  are
693       passed to be in host byte order.
694
695       The input and output arguments of pcre32_utf32_to_host_byte_order() may
696       point to the same address, that is, conversion in place  is  supported.
697       The output buffer must be at least as long as the input.
698
699       The  length  argument  specifies the number of 32-bit data units in the
700       input string; a negative value specifies a zero-terminated string.
701
702       If byte_order is NULL, it is assumed that the string starts off in host
703       byte  order. This may be changed by byte-order marks (BOMs) anywhere in
704       the string (commonly as the first character).
705
706       If byte_order is not NULL, a non-zero value of the integer to which  it
707       points  means  that  the input starts off in host byte order, otherwise
708       the opposite order is assumed. Again, BOMs in  the  string  can  change
709       this. The final byte order is passed back at the end of processing.
710
711       If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
712       copied into the output string. Otherwise they are discarded.
713
714       The result of the function is the number of 32-bit  units  placed  into
715       the  output  buffer,  including  the  zero terminator if the string was
716       zero-terminated.
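
       The behaviour described above can be sketched in standard C. This is an
       illustration written from the description, not the library's source:
       sketch_utf32_to_host and sketch_swap32 are hypothetical names, and the
       parameters simply follow the order used in the text (output, input,
       length, byte_order, keep_boms).

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the four bytes of a 32-bit unit. */
static uint32_t sketch_swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Hypothetical re-statement of the documented behaviour. Returns the
   number of 32-bit units written to output. */
int sketch_utf32_to_host(uint32_t *output, const uint32_t *input,
                         int length, int *byte_order, int keep_boms)
{
    /* NULL byte_order means "assume host order at the start". */
    int host_order = (byte_order != NULL) ? *byte_order : 1;
    int written = 0;
    int i;

    if (length < 0) {
        /* Zero-terminated input: count the units plus the terminator. */
        length = 0;
        while (input[length] != 0) length++;
        length++;
    }

    for (i = 0; i < length; i++) {
        uint32_t c = input[i];
        if (c == 0x0000feffu || c == 0xfffe0000u) {
            /* A BOM that reads unswapped means the following data is in
               host order; a byte-reversed BOM means the opposite. */
            host_order = (c == 0x0000feffu);
            if (keep_boms != 0) output[written++] = 0x0000feffu;
        } else {
            output[written++] = host_order ? c : sketch_swap32(c);
        }
    }

    /* Pass the final byte order back to the caller. */
    if (byte_order != NULL) *byte_order = host_order;
    return written;
}
```

       Because each unit is read before anything is written at or beyond its
       position, the sketch also works when output and input point to the
       same buffer.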
717
718
719SUBJECT STRING OFFSETS
720
721       The offsets within subject strings that are returned  by  the  matching
722       functions are in 32-bit units rather than bytes.
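
       As an illustration of working with offsets counted in 32-bit units, the
       sketch below copies a matched substring out of a subject. The function
       name sketch_copy_match32 is hypothetical; start and end stand for a
       pair of offsets such as a matching function might return.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the units in [start, end) of subject into dst and zero-terminate.
   Offsets and dst_units are counted in 32-bit units, not bytes.
   Returns the number of units copied, or -1 if the offsets are invalid
   or dst is too small. */
int sketch_copy_match32(const uint32_t *subject, int start, int end,
                        uint32_t *dst, size_t dst_units)
{
    size_t n;

    if (start < 0 || end < start) return -1;
    n = (size_t)(end - start);
    if (n + 1 > dst_units) return -1;

    /* The byte count is the unit count times the unit size. */
    memcpy(dst, subject + start, n * sizeof(uint32_t));
    dst[n] = 0;
    return (int)n;
}
```

       Pointer arithmetic on a 32-bit unit pointer already advances in units,
       so only the memcpy() length needs the explicit conversion to bytes.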
723
724
725NAMED SUBPATTERNS
726
727       The  name-to-number translation table that is maintained for named sub-
728       patterns uses 32-bit characters.  The  pcre32_get_stringtable_entries()
729       function returns the length of each entry in the table as the number of
730       32-bit data units.
731
732
733OPTION NAMES
734
735       There   are   two   new   general   option   names,   PCRE_UTF32    and
736       PCRE_NO_UTF32_CHECK,     which     correspond    to    PCRE_UTF8    and
737       PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
738       define  the  same bits in the options word. There is a discussion about
739       the validity of UTF-32 strings in the pcreunicode page.
740
741       For the pcre32_config() function there is an  option  PCRE_CONFIG_UTF32
742       that  returns  1  if UTF-32 support is configured, otherwise 0. If this
743       option  is  given  to  pcre_config()  or  pcre16_config(),  or  if  the
744       PCRE_CONFIG_UTF8  or  PCRE_CONFIG_UTF16  option is given to pcre32_con-
745       fig(), the result is the PCRE_ERROR_BADOPTION error.
746
747
748CHARACTER CODES
749
750       In 32-bit mode, when  PCRE_UTF32  is  not  set,  character  values  are
751       treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
752       that they can range from 0 to 0x7fffffff instead of 0 to 0xff.  Charac-
753       ter types for characters up to 0xff  can  therefore  be  influenced  by
754       the locale in the same way as before.   Characters  greater  than  0xff
755       have only one case, and no "type" (such as letter or digit).
756
757       In  UTF-32  mode,  the  character  code  is  Unicode, in the range 0 to
758       0x10ffff, with the exception of values in the range  0xd800  to  0xdfff
759       because those are "surrogate" values that are ill-formed in UTF-32.
760
761       A UTF-32 string can indicate its endianness by a special code known as a
762       byte-order mark (BOM). The PCRE functions do not handle this, expecting
763       strings   to   be  in  host  byte  order.  A  utility  function  called
764       pcre32_utf32_to_host_byte_order() is provided to help  with  this  (see
765       above).
766
767
768ERROR NAMES
769
770       The  error  PCRE_ERROR_BADUTF32  corresponds  to its 8-bit counterpart.
771       The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed
772       to  a  function that processes patterns in the other mode, for example,
773       if a pattern compiled with pcre_compile() is passed to pcre32_exec().
774
775       There are new error codes whose names  begin  with  PCRE_UTF32_ERR  for
776       invalid  UTF-32  strings,  corresponding to the PCRE_UTF8_ERR codes for
777       UTF-8 strings that are described in the section entitled "Reason  codes
778       for  invalid UTF-8 strings" in the main pcreapi page. The UTF-32 errors
779       are:
780
781         PCRE_UTF32_ERR1  Surrogate character (range from 0xd800 to 0xdfff)
782         PCRE_UTF32_ERR2  Non-character
783         PCRE_UTF32_ERR3  Character > 0x10ffff
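
       The three conditions listed above can be sketched as a classifier for a
       single code point. This is not PCRE's validity checker: the enum and
       function names are hypothetical, and the definition of "non-character"
       used here is an assumption taken from the Unicode Standard (U+FDD0 to
       U+FDEF, plus the last two code points of every plane).

```c
#include <stdint.h>

enum sketch_utf32_status {
    SKETCH_UTF32_OK = 0,
    SKETCH_UTF32_SURROGATE,     /* compare PCRE_UTF32_ERR1 */
    SKETCH_UTF32_NONCHARACTER,  /* compare PCRE_UTF32_ERR2 */
    SKETCH_UTF32_TOO_LARGE      /* compare PCRE_UTF32_ERR3 */
};

/* Classify one 32-bit unit that is already in host byte order. */
enum sketch_utf32_status sketch_check_utf32(uint32_t c)
{
    if (c > 0x10ffffu) return SKETCH_UTF32_TOO_LARGE;
    if (c >= 0xd800u && c <= 0xdfffu) return SKETCH_UTF32_SURROGATE;

    /* Assumption: Unicode non-characters are U+FDD0..U+FDEF and every
       code point whose low 16 bits are FFFE or FFFF. */
    if ((c >= 0xfdd0u && c <= 0xfdefu) || (c & 0xfffeu) == 0xfffeu)
        return SKETCH_UTF32_NONCHARACTER;

    return SKETCH_UTF32_OK;
}
```

       A whole string could be checked by applying the function to each unit
       in turn and stopping at the first result other than SKETCH_UTF32_OK.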
784
785
786ERROR TEXTS
787
788       If there is an error while compiling a pattern, the error text that  is
789       passed  back by pcre32_compile() or pcre32_compile2() is still an 8-bit
790       character string, zero-terminated.
791
792
793CALLOUTS
794
795       The subject and mark fields in the callout block that is  passed  to  a
796       callout function point to 32-bit vectors.
797
798
799TESTING
800
801       The  pcretest  program continues to operate with 8-bit input and output
802       files, but it can be used for testing the 32-bit library. If it is  run
803       with the command line option -32, patterns and subject strings are con-
804       verted from 8-bit to 32-bit before being passed to PCRE, and the 32-bit
805       library  functions  are used instead of the 8-bit ones. Returned 32-bit
806       strings are converted to 8-bit for output. If neither the 8-bit nor the
807       16-bit library was compiled, pcretest defaults to 32-bit  and  the  -32
808       option is ignored.
809
810       When PCRE is being built, the RunTest script that is  called  by  "make
811       check"  uses  the  pcretest  -C  option to discover which of the 8-bit,
812       16-bit and 32-bit libraries has been built, and runs the  tests  appro-
813       priately.
814
815
816NOT SUPPORTED IN 32-BIT MODE
817
818       Not all the features of the 8-bit library are available with the 32-bit
819       library. The C++ and POSIX wrapper functions  support  only  the  8-bit
820       library, and the pcregrep program is at present 8-bit only.
821
822
823AUTHOR
824
825       Philip Hazel
826       University Computing Service
827       Cambridge CB2 3QH, England.
828
829
830REVISION
831
832       Last updated: 08 November 2012
833       Copyright (c) 1997-2012 University of Cambridge.
834------------------------------------------------------------------------------
835
836
837PCREBUILD(3)                                                      PCREBUILD(3)
838
839
840NAME
841       PCRE - Perl-compatible regular expressions
842
843
844PCRE BUILD-TIME OPTIONS
845
846       This  document  describes  the  optional  features  of PCRE that can be
847       selected when the library is compiled. It assumes use of the  configure
848       script,  where the optional features are selected or deselected by pro-
849       viding options to configure before running the make  command.  However,
850       the  same  options  can be selected in both Unix-like and non-Unix-like
851       environments using the GUI facility of cmake-gui if you are using CMake
852       instead of configure to build PCRE.
853
854       There  is a lot more information about building PCRE without using con-
855       figure (including information about using CMake or building "by  hand")
856       in  the file called NON-AUTOTOOLS-BUILD, which is part of the PCRE dis-
857       tribution. You should consult this file as well as the README  file  if
858       you are building in a non-Unix-like environment.
859
860       The complete list of options for configure (which includes the standard
861       ones such as the  selection  of  the  installation  directory)  can  be
862       obtained by running
863
864         ./configure --help
865
866       The  following  sections  include  descriptions  of options whose names
867       begin with --enable or --disable. These settings specify changes to the
868       defaults  for  the configure command. Because of the way that configure
869       works, --enable and --disable always come in pairs, so  the  complemen-
870       tary  option always exists as well, but as it specifies the default, it
871       is not described.
872
873
874BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
875
876       By default, a library called libpcre  is  built,  containing  functions
877       that  take  string  arguments  contained in vectors of bytes, either as
878       single-byte characters, or interpreted as UTF-8 strings. You  can  also
879       build  a  separate library, called libpcre16, in which strings are con-
880       tained in vectors of 16-bit data units and interpreted either  as  sin-
881       gle-unit characters or UTF-16 strings, by adding
882
883         --enable-pcre16
884
885       to the configure command. You can also build a separate library, called
886       libpcre32, in which strings are contained in  vectors  of  32-bit  data
887       units  and  interpreted  either  as  single-unit  characters  or UTF-32
888       strings, by adding
889
890         --enable-pcre32
891
892       to the configure command. If you do not want the 8-bit library, add
893
894         --disable-pcre8
895
896       as well. At least one of the three libraries must be built.  Note  that
897       the  C++  and  POSIX  wrappers are for the 8-bit library only, and that
898       pcregrep is an 8-bit program. None of these are  built  if  you  select
899       only the 16-bit or 32-bit libraries.
900
901
902BUILDING SHARED AND STATIC LIBRARIES
903
904       The  PCRE building process uses libtool to build both shared and static
905       Unix libraries by default. You can suppress one of these by adding  one
906       of
907
908         --disable-shared
909         --disable-static
910
911       to the configure command, as required.
912
913
914C++ SUPPORT
915
916       By  default,  if the 8-bit library is being built, the configure script
917       will search for a C++ compiler and C++ header files. If it finds  them,
918       it  automatically  builds  the C++ wrapper library (which supports only
919       8-bit strings). You can disable this by adding
920
921         --disable-cpp
922
923       to the configure command.
924
925
926UTF-8, UTF-16 AND UTF-32 SUPPORT
927
928       To build PCRE with support for UTF Unicode character strings, add
929
930         --enable-utf
931
932       to the configure command. This setting applies to all three  libraries,
933       adding  support  for  UTF-8 to the 8-bit library, support for UTF-16 to
934       the  16-bit  library,  and  support  for  UTF-32  to  the  32-bit
935       library.  There  are no separate options for enabling UTF-8, UTF-16 and
936       UTF-32 independently because that would allow ridiculous settings  such
937       as  requesting UTF-16 support while building only the 8-bit library. It
938       is not possible to build one library with UTF support and another with-
939       out  in the same configuration. (For backwards compatibility, --enable-
940       utf8 is a synonym of --enable-utf.)
941
942       Of itself, this setting does not make  PCRE  treat  strings  as  UTF-8,
943       UTF-16  or UTF-32. As well as compiling PCRE with this option, you also
944       have to set the  PCRE_UTF8,  PCRE_UTF16  or  PCRE_UTF32  option  (as
945       appropriate) when you call one of the pattern compiling functions.
946
947       If  you  set --enable-utf when compiling in an EBCDIC environment, PCRE
948       expects its input to be either ASCII or UTF-8 (depending  on  the  run-
949       time option). It is not possible to support both EBCDIC and UTF-8 codes
950       in the same version of  the  library.  Consequently,  --enable-utf  and
951       --enable-ebcdic are mutually exclusive.
952
953
954UNICODE CHARACTER PROPERTY SUPPORT
955
956       UTF  support allows the libraries to process character codepoints up to
957       0x10ffff in the strings that they handle. On its own, however, it  does
958       not provide any facilities for accessing the properties of such charac-
959       ters. If you want to be able to use the pattern escapes \P, \p, and \X,
960       which refer to Unicode character properties, you must add
961
962         --enable-unicode-properties
963
964       to  the  configure  command. This implies UTF support, even if you have
965       not explicitly requested it.
966
967       Including Unicode property support adds around 30K  of  tables  to  the
968       PCRE  library.  Only  the general category properties such as Lu and Nd
969       are supported. Details are given in the pcrepattern documentation.
970
971
972JUST-IN-TIME COMPILER SUPPORT
973
974       Just-in-time compiler support is included in the build by specifying
975
976         --enable-jit
977
978       This support is available only for certain hardware  architectures.  If
979       this  option  is  set  for  an unsupported architecture, a compile time
980       error occurs.  See the pcrejit documentation for a  discussion  of  JIT
981       usage. When JIT support is enabled, pcregrep automatically makes use of
982       it, unless you add
983
984         --disable-pcregrep-jit
985
986       to the "configure" command.
987
988
989CODE VALUE OF NEWLINE
990
991       By default, PCRE interprets the linefeed (LF) character  as  indicating
992       the  end  of  a line. This is the normal newline character on Unix-like
993       systems. You can compile PCRE to use carriage return (CR)  instead,  by
994       adding
995
996         --enable-newline-is-cr
997
998       to  the  configure  command.  There  is  also  a --enable-newline-is-lf
999       option, which explicitly specifies linefeed as the newline character.
1000
1001       Alternatively, you can specify that line endings are to be indicated by
1002       the two-character sequence CRLF. If you want this, add
1003
1004         --enable-newline-is-crlf
1005
1006       to the configure command. There is a fourth option, specified by
1007
1008         --enable-newline-is-anycrlf
1009
1010       which  causes  PCRE  to recognize any of the three sequences CR, LF, or
1011       CRLF as indicating a line ending. Finally, a fifth option, specified by
1012
1013         --enable-newline-is-any
1014
1015       causes PCRE to recognize any Unicode newline sequence.
1016
1017       Whatever line ending convention is selected when PCRE is built  can  be
1018       overridden  when  the library functions are called. At build time it is
1019       conventional to use the standard for your operating system.
1020
1021
1022WHAT \R MATCHES
1023
1024       By default, the sequence \R in a pattern matches  any  Unicode  newline
1025       sequence,  whatever  has  been selected as the line ending sequence. If
1026       you specify
1027
1028         --enable-bsr-anycrlf
1029
1030       the default is changed so that \R matches only CR, LF, or  CRLF.  What-
1031       ever  is selected when PCRE is built can be overridden when the library
1032       functions are called.
1033
1034
1035POSIX MALLOC USAGE
1036
1037       When the 8-bit library is called through the POSIX interface  (see  the
1038       pcreposix  documentation),  additional  working storage is required for
1039       holding the pointers to capturing  substrings,  because  PCRE  requires
1040       three integers per substring, whereas the POSIX interface provides only
1041       two. If the number of expected substrings is small, the  wrapper  func-
1042       tion  uses  space  on the stack, because this is faster than using mal-
1043       loc() for each call. The default threshold above which the stack is  no
1044       longer used is 10; it can be changed by adding a setting such as
1045
1046         --with-posix-malloc-threshold=20
1047
1048       to the configure command.
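
       The stack-or-heap decision described above might look like the sketch
       below. It is an illustration only, not the code of the POSIX wrapper:
       sketch_with_ovector and SKETCH_THRESHOLD are hypothetical names, and
       the matching step is replaced by a placeholder loop.

```c
#include <stdlib.h>

#define SKETCH_THRESHOLD 10   /* mirrors the documented default */

/* Obtain working storage of three ints per substring, on the stack when
   nmatch is at or below the threshold and from the heap otherwise.
   Sets *used_heap to show which was chosen. Returns 0 on success, or
   -1 if heap allocation failed. */
int sketch_with_ovector(size_t nmatch, int *used_heap)
{
    int small_ovector[SKETCH_THRESHOLD * 3];
    int *ovector = small_ovector;
    size_t i;

    *used_heap = 0;
    if (nmatch > SKETCH_THRESHOLD) {
        ovector = malloc(nmatch * 3 * sizeof(int));
        if (ovector == NULL) return -1;
        *used_heap = 1;
    }

    /* Placeholder for the real match call: just initialise the vector. */
    for (i = 0; i < nmatch * 3; i++) ovector[i] = -1;

    if (*used_heap) free(ovector);
    return 0;
}
```

       Raising the threshold trades a larger stack frame on every call for
       fewer malloc() calls when many substrings are requested.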
1049
1050
1051HANDLING VERY LARGE PATTERNS
1052
1053       Within  a  compiled  pattern,  offset values are used to point from one
1054       part to another (for example, from an opening parenthesis to an  alter-
1055       nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
1056       two-byte values are used for these offsets, leading to a  maximum  size
1057       for  a compiled pattern of around 64K. This is sufficient to handle all
1058       but the most gigantic patterns.  Nevertheless, some people do  want  to
1059       process  truly  enormous patterns, so it is possible to compile PCRE to
1060       use three-byte or four-byte offsets by adding a setting such as
1061
1062         --with-link-size=3
1063
1064       to the configure command. The value given must be 2, 3, or 4.  For  the
1065       16-bit  library,  a  value of 3 is rounded up to 4. In these libraries,
1066       using longer offsets slows down the operation of PCRE because it has to
1067       load  additional  data  when  handling them. For the 32-bit library the
1068       value is always 4 and cannot be overridden; the value  of  --with-link-
1069       size is ignored.
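
       The rounding rules above can be summarised in a few lines of C. This is
       a sketch of the rules as stated in this section, not of the configure
       script; both function names are hypothetical, and rejecting an invalid
       request before looking at the library width is an assumption.

```c
/* Effective link size in bytes for a library with the given code unit
   width (8, 16 or 32 bits), or -1 for an invalid request. */
int sketch_effective_link_size(int unit_bits, int requested)
{
    if (requested < 2 || requested > 4) return -1;
    switch (unit_bits) {
    case 8:  return requested;                        /* used as given */
    case 16: return (requested == 3) ? 4 : requested; /* 3 rounds up to 4 */
    case 32: return 4;                                /* always 4 */
    default: return -1;
    }
}

/* Largest value that fits in an offset of link_size bytes; 0 if the
   size is not 2, 3 or 4. */
unsigned long sketch_max_offset(int link_size)
{
    switch (link_size) {
    case 2:  return 0xffffUL;       /* the "around 64K" default limit */
    case 3:  return 0xffffffUL;
    case 4:  return 0xffffffffUL;
    default: return 0;
    }
}
```

       For example, a request for 3 with the 16-bit library gives an effective
       size of 4, and the default size of 2 caps offsets at 65535.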
1070
1071
1072AVOIDING EXCESSIVE STACK USAGE
1073
1074       When matching with the pcre_exec() function, PCRE implements backtrack-
1075       ing by making recursive calls to an internal function  called  match().
1076       In  environments  where  the size of the stack is limited, this can se-
1077       verely limit PCRE's operation. (The Unix environment does  not  usually
1078       suffer from this problem, but it may sometimes be necessary to increase
1079       the maximum stack size.  There is a discussion in the  pcrestack  docu-
1080       mentation.)  An alternative approach to recursion that uses memory from
1081       the heap to remember data, instead of using recursive  function  calls,
1082       has  been  implemented to work round the problem of limited stack size.
1083       If you want to build a version of PCRE that works this way, add
1084
1085         --disable-stack-for-recursion
1086
1087       to the configure command. With this configuration, PCRE  will  use  the
1088       pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
1089       ment functions. By default these point to malloc() and free(), but  you
1090       can replace the pointers so that your own functions are used instead.
1091
1092       Separate  functions  are  provided  rather  than  using pcre_malloc and
1093       pcre_free because the  usage  is  very  predictable:  the  block  sizes
1094       requested  are  always  the  same,  and  the blocks are always freed in
1095       reverse order. A calling program might be able to  implement  optimized
1096       functions  that  perform  better  than  malloc()  and free(). PCRE runs
1097       noticeably more slowly when built in this way. This option affects only
1098       the pcre_exec() function; it is not relevant for pcre_dfa_exec().
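
       Because the blocks are all the same size and are freed in reverse
       order, a replacement allocator can simply keep freed blocks on a list
       and hand them out again. The sketch below shows one such design; the
       names are hypothetical, it is not thread-safe, and it never returns
       cached blocks to the system.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every block handed out. */
typedef union sketch_header {
    struct {
        size_t size;                /* usable bytes after the header */
        union sketch_header *next;  /* link used while the block is cached */
    } info;
    long double align_ld;           /* remaining members only force a */
    long long align_ll;             /* conservative alignment for the */
    void *align_ptr;                /* payload that follows the header */
} sketch_header;

static sketch_header *sketch_cache = NULL;  /* most recently freed first */

void *sketch_stack_malloc(size_t size)
{
    sketch_header *h = sketch_cache;

    if (h != NULL && h->info.size >= size) {
        sketch_cache = h->info.next;        /* reuse a cached block */
    } else {
        h = malloc(sizeof(sketch_header) + size);
        if (h == NULL) return NULL;
        h->info.size = size;
    }
    return h + 1;                           /* payload follows the header */
}

void sketch_stack_free(void *block)
{
    sketch_header *h;

    if (block == NULL) return;
    h = (sketch_header *)block - 1;
    h->info.next = sketch_cache;            /* push onto the cache */
    sketch_cache = h;
}
```

       Functions with these signatures could be assigned to the
       pcre_stack_malloc and pcre_stack_free variables before matching.
       Whether this outperforms malloc() and free() on a given system would
       need to be measured.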
1099
1100
1101LIMITING PCRE RESOURCE USAGE
1102
1103       Internally,  PCRE has a function called match(), which it calls repeat-
1104       edly  (sometimes  recursively)  when  matching  a  pattern   with   the
1105       pcre_exec()  function.  By controlling the maximum number of times this
1106       function may be called during a single matching operation, a limit  can
1107       be  placed  on  the resources used by a single call to pcre_exec(). The
1108       limit can be changed at run time, as described in the pcreapi  documen-
1109       tation.  The default is 10 million, but this can be changed by adding a
1110       setting such as
1111
1112         --with-match-limit=500000
1113
1114       to  the  configure  command.  This  setting  has  no  effect   on   the
1115       pcre_dfa_exec() matching function.
1116
1117       In  some  environments  it is desirable to limit the depth of recursive
1118       calls of match() more strictly than the total number of calls, in order
1119       to  restrict  the maximum amount of stack (or heap, if --disable-stack-
1120       for-recursion is specified) that is used. A second limit controls this;
1121       it  defaults  to  the  value  that is set for --with-match-limit, which
1122       imposes no additional constraints. However, you can set a  lower  limit
1123       by adding, for example,
1124
1125         --with-match-limit-recursion=10000
1126
1127       to  the  configure  command.  This  value can also be overridden at run
1128       time.
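
       The two limits can be illustrated with a toy backtracking matcher. This
       is not PCRE's match() function: it handles only literals, "." and a
       greedy "*", always matches from the start of the text, and every name
       and return value in it is hypothetical. It counts calls and tracks
       depth in the way the text describes.

```c
enum {
    SKETCH_NOMATCH = 0,
    SKETCH_MATCH = 1,
    SKETCH_ERR_MATCHLIMIT = -1,      /* total number of calls exceeded */
    SKETCH_ERR_RECURSIONLIMIT = -2   /* recursion depth exceeded */
};

typedef struct {
    unsigned long calls;             /* calls made so far */
    unsigned long match_limit;       /* cap on the total number of calls */
    unsigned long recursion_limit;   /* cap on the recursion depth */
} sketch_limits;

/* Toy anchored matcher supporting literals, '.', and greedy 'x*'. */
static int sketch_match_here(const char *pat, const char *text,
                             sketch_limits *lim, unsigned long depth)
{
    if (++lim->calls > lim->match_limit) return SKETCH_ERR_MATCHLIMIT;
    if (depth > lim->recursion_limit) return SKETCH_ERR_RECURSIONLIMIT;
    if (pat[0] == '\0') return SKETCH_MATCH;

    if (pat[1] == '*') {
        /* Greedy: consume as much as possible, then back off one
           character at a time until the rest of the pattern matches. */
        const char *t = text;
        while (*t != '\0' && (pat[0] == '.' || pat[0] == *t)) t++;
        for (;;) {
            int rc = sketch_match_here(pat + 2, t, lim, depth + 1);
            if (rc != SKETCH_NOMATCH) return rc;  /* match or limit error */
            if (t == text) return SKETCH_NOMATCH;
            t--;
        }
    }

    if (*text != '\0' && (pat[0] == '.' || pat[0] == *text))
        return sketch_match_here(pat + 1, text + 1, lim, depth + 1);
    return SKETCH_NOMATCH;
}

int sketch_match(const char *pat, const char *text,
                 unsigned long match_limit, unsigned long recursion_limit)
{
    sketch_limits lim;

    lim.calls = 0;
    lim.match_limit = match_limit;
    lim.recursion_limit = recursion_limit;
    return sketch_match_here(pat, text, &lim, 0);
}
```

       A pattern such as "a*a*a*b" applied to a run of "a" characters
       backtracks heavily, so a low call limit stops it early, while a low
       depth limit stops even a simple literal pattern once it recurses too
       far.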
1129
1130
1131CREATING CHARACTER TABLES AT BUILD TIME
1132
1133       PCRE uses fixed tables for processing characters whose code values  are
1134       less  than 256. By default, PCRE is built with a set of tables that are
1135       distributed in the file pcre_chartables.c.dist. These  tables  are  for
1136       ASCII codes only. If you add
1137
1138         --enable-rebuild-chartables
1139
1140       to  the  configure  command, the distributed tables are no longer used.
1141       Instead, a program called dftables is compiled and  run.  This  outputs
1142       the source for a new set of tables, created in the default locale of your
1143       C run-time system. (This method of replacing the tables does  not  work
1144       if  you are cross compiling, because dftables is run on the local host.
1145       If you need to create alternative tables when cross compiling, you will
1146       have to do so "by hand".)
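
       The idea of deriving tables from the C locale can be sketched with the
       standard <ctype.h> functions. This is an illustration only, not the
       dftables program: the function name is hypothetical, and just two of
       the tables a real build would need are produced.

```c
#include <ctype.h>

/* Fill lower[] with the lower-case form of each byte value and flip[]
   with the opposite-case form, as defined by the current C locale. */
void sketch_build_case_tables(unsigned char lower[256],
                              unsigned char flip[256])
{
    int i;

    for (i = 0; i < 256; i++) {
        lower[i] = (unsigned char)tolower(i);
        flip[i] = (unsigned char)(islower(i) ? toupper(i) : tolower(i));
    }
}
```

       Calling setlocale() before building the tables would change the
       result for byte values above 127 in locales that define letters
       there; in the default "C" locale only the ASCII letters change case.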
1147
1148
1149USING EBCDIC CODE
1150
1151       PCRE  assumes  by  default that it will run in an environment where the
1152       character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
1153       This  is  the  case for most computer operating systems. PCRE can, how-
1154       ever, be compiled to run in an EBCDIC environment by adding
1155
1156         --enable-ebcdic
1157
1158       to the configure command. This setting implies --enable-rebuild-charta-
1159       bles.  You  should  only  use  it if you know that you are in an EBCDIC
1160       environment (for example,  an  IBM  mainframe  operating  system).  The
1161       --enable-ebcdic option is incompatible with --enable-utf.
1162
1163       The EBCDIC character that corresponds to an ASCII LF is assumed to have
1164       the value 0x15 by default. However, in some EBCDIC  environments,  0x25
1165       is used. In such an environment you should use
1166
1167         --enable-ebcdic-nl25
1168
1169       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
1170       has the same value as in ASCII, namely, 0x0d.  Whichever  of  0x15  and
1171       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
1172       acter (which, in Unicode, is 0x85).
1173
1174       The options that select newline behaviour, such as --enable-newline-is-
1175       cr, and equivalent run-time options, refer to these character values in
1176       an EBCDIC environment.
1177
1178
1179PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
1180
1181       By default, pcregrep reads all files as plain text. You can build it so
1182       that it recognizes files whose names end in .gz or .bz2, and reads them
1183       with libz or libbz2, respectively, by adding one or both of
1184
1185         --enable-pcregrep-libz
1186         --enable-pcregrep-libbz2
1187
1188       to the configure command. These options naturally require that the rel-
1189       evant  libraries  are installed on your system. Configuration will fail
1190       if they are not.
1191
1192
1193PCREGREP BUFFER SIZE
1194
1195       pcregrep uses an internal buffer to hold a "window" on the file  it  is
1196       scanning, in order to be able to output "before" and "after" lines when
1197       it finds a match. The size of the buffer is controlled by  a  parameter
1198       whose default value is 20K. The buffer itself is three times this size,
1199       but because of the way it is used for holding "before" lines, the long-
1200       est  line  that  is guaranteed to be processable is the parameter size.
1201       You can change the default parameter value by adding, for example,
1202
1203         --with-pcregrep-bufsize=50K
1204
1205       to the configure command. The caller of pcregrep can, however, override
1206       this value by specifying a run-time option.
1207
1208
1209PCRETEST OPTION FOR LIBREADLINE SUPPORT
1210
1211       If you add
1212
1213         --enable-pcretest-libreadline
1214
1215       to  the  configure  command,  pcretest  is  linked with the libreadline
1216       library, and when its input is from a terminal, it reads it  using  the
1217       readline() function. This provides line-editing and history facilities.
1218       Note that libreadline is GPL-licensed, so if you distribute a binary of
1219       pcretest linked in this way, there may be licensing issues.
1220
1221       Setting  this  option  causes  the -lreadline option to be added to the
1222       pcretest build. In many operating environments with a  system-installed
1223       libreadline this is sufficient. However, in some environments (e.g.  if
1224       an unmodified distribution version of readline is in use),  some  extra
1225       configuration  may  be necessary. The INSTALL file for libreadline says
1226       this:
1227
1228         "Readline uses the termcap functions, but does not link with the
1229         termcap or curses library itself, allowing applications which link
1230         with readline the to choose an appropriate library."
1231
1232       If your environment has not been set up so that an appropriate  library
1233       is automatically included, you may need to add something like
1234
1235         LIBS="-lncurses"
1236
1237       immediately before the configure command.
1238
1239
1240DEBUGGING WITH VALGRIND SUPPORT
1241
1242       By adding the
1243
1244         --enable-valgrind
1245
1246       option to the configure command, PCRE will use  valgrind  annotations
1247       to mark certain memory regions as  unaddressable.  This  allows  it  to
1248       detect invalid memory accesses, and is mostly useful for debugging PCRE
1249       itself.
1250
1251
1252CODE COVERAGE REPORTING
1253
1254       If your C compiler is gcc, you can build a version  of  PCRE  that  can
1255       generate a code coverage report for its test suite. To enable this, you
1256       must install lcov version 1.6 or above. Then specify
1257
1258         --enable-coverage
1259
1260       to the configure command and build PCRE in the usual way.
1261
1262       Note that using ccache (a caching C compiler) is incompatible with code
1263       coverage  reporting. If you have configured ccache to run automatically
1264       on your system, you must set the environment variable
1265
1266         CCACHE_DISABLE=1
1267
1268       before running make to build PCRE, so that ccache is not used.
1269
1270       When --enable-coverage is used, the  following  additional  targets  are
1271       added to the Makefile:
1272
1273         make coverage
1274
1275       This  creates  a  fresh  coverage report for the PCRE test suite. It is
1276       equivalent to running "make coverage-reset", "make  coverage-baseline",
1277       "make check", and then "make coverage-report".
1278
1279         make coverage-reset
1280
1281       This zeroes the coverage counters, but does nothing else.
1282
1283         make coverage-baseline
1284
1285       This captures baseline coverage information.
1286
1287         make coverage-report
1288
1289       This creates the coverage report.
1290
1291         make coverage-clean-report
1292
1293       This  removes the generated coverage report without cleaning the cover-
1294       age data itself.
1295
1296         make coverage-clean-data
1297
1298       This removes the captured coverage data without removing  the  coverage
1299       files created at compile time (*.gcno).
1300
1301         make coverage-clean
1302
1303       This  cleans all coverage data including the generated coverage report.
1304       For more information about code coverage, see the gcov and  lcov  docu-
1305       mentation.
1306
1307
1308SEE ALSO
1309
1310       pcreapi(3), pcre16, pcre32, pcre_config(3).
1311
1312
1313AUTHOR
1314
1315       Philip Hazel
1316       University Computing Service
1317       Cambridge CB2 3QH, England.
1318
1319
1320REVISION
1321
1322       Last updated: 30 October 2012
1323       Copyright (c) 1997-2012 University of Cambridge.
1324------------------------------------------------------------------------------
1325
1326
1327PCREMATCHING(3)                                                PCREMATCHING(3)
1328
1329
1330NAME
1331       PCRE - Perl-compatible regular expressions
1332
1333
1334PCRE MATCHING ALGORITHMS
1335
1336       This document describes the two different algorithms that are available
1337       in PCRE for matching a compiled regular expression against a given sub-
1338       ject  string.  The  "standard"  algorithm  is  the  one provided by the
1339       pcre_exec(), pcre16_exec() and pcre32_exec() functions. These  work  in
1340       the same way as Perl's matching function, and provide a Perl-compatible
1341       matching  operation.   The  just-in-time  (JIT)  optimization  that  is
1342       described  in  the pcrejit documentation is compatible with these func-
1343       tions.
1344
1345       An  alternative  algorithm  is   provided   by   the   pcre_dfa_exec(),
1346       pcre16_dfa_exec()  and  pcre32_dfa_exec()  functions; they operate in a
1347       different way, and are not Perl-compatible. This alternative has advan-
1348       tages and disadvantages compared with the standard algorithm, and these
1349       are described below.
1350
1351       When there is only one possible way in which a given subject string can
1352       match  a pattern, the two algorithms give the same answer. A difference
1353       arises, however, when there are multiple possibilities. For example, if
1354       the pattern
1355
1356         ^<.*>
1357
1358       is matched against the string
1359
1360         <something> <something else> <something further>
1361
1362       there are three possible answers. The standard algorithm finds only one
1363       of them, whereas the alternative algorithm finds all three.
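
       The three answers can be enumerated with a few lines of C written
       specifically for the pattern ^<.*> (this is not a general matcher, and
       the function name is hypothetical). The sketch assumes the default
       behaviour in which "." does not match a newline.

```c
#include <stddef.h>
#include <string.h>

/* Store, longest first, the end offset of every match of ^<.*> at the
   start of subject. Returns the number of offsets stored, which is at
   most max_ends. */
int sketch_all_angle_matches(const char *subject, int *ends, int max_ends)
{
    /* ".*" cannot cross a newline, so only the first line is examined. */
    size_t len = strcspn(subject, "\n");
    size_t i;
    int count = 0;

    if (len < 2 || subject[0] != '<') return 0;

    /* Every '>' after the opening '<' ends one possible match. */
    for (i = len; i >= 2 && count < max_ends; i--) {
        if (subject[i - 1] == '>') ends[count++] = (int)i;
    }
    return count;
}
```

       For the subject string shown above the function stores three offsets,
       one for each ">" in the string, with the whole string as the longest
       match.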
1364
1365
1366REGULAR EXPRESSIONS AS TREES
1367
1368       The set of strings that are matched by a regular expression can be rep-
1369       resented  as  a  tree structure. An unlimited repetition in the pattern
1370       makes the tree of infinite size, but it is still a tree.  Matching  the
1371       pattern  to a given subject string (from a given starting point) can be
1372       thought of as a search of the tree.  There are two  ways  to  search  a
1373       tree:  depth-first  and  breadth-first, and these correspond to the two
1374       matching algorithms provided by PCRE.
1375
1376
1377THE STANDARD MATCHING ALGORITHM
1378
1379       In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
1380       sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
1381       depth-first search of the pattern tree. That is, it  proceeds  along  a
1382       single path through the tree, checking that the subject matches what is
1383       required. When there is a mismatch, the algorithm  tries  any  alterna-
1384       tives  at  the  current point, and if they all fail, it backs up to the
1385       previous branch point in the  tree,  and  tries  the  next  alternative
1386       branch  at  that  level.  This often involves backing up (moving to the
1387       left) in the subject string as well.  The  order  in  which  repetition
1388       branches  are  tried  is controlled by the greedy or ungreedy nature of
1389       the quantifier.
1390
1391       If a leaf node is reached, a matching string has  been  found,  and  at
1392       that  point the algorithm stops. Thus, if there is more than one possi-
1393       ble match, this algorithm returns the first one that it finds.  Whether
1394       this  is the shortest, the longest, or some intermediate length depends
1395       on the way the greedy and ungreedy repetition quantifiers are specified
1396       in the pattern.
1397
1398       Because  it  ends  up  with a single path through the tree, it is rela-
1399       tively straightforward for this algorithm to keep  track  of  the  sub-
1400       strings  that  are  matched  by portions of the pattern in parentheses.
1401       This provides support for capturing parentheses and back references.
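
       As a rough illustration of the depth-first strategy, the  sketch  below
       hard-codes  the  pattern ^<.*> from the introductory example instead of
       using PCRE. The function name match_angle and its greedy flag are hypo-
       thetical.  With  the  flag  set, the longest repetition is tried first,
       backing up one character at a time; otherwise the  shortest  repetition
       is tried first. Either way the search stops at the first leaf reached.

```c
#include <stddef.h>
#include <string.h>

/* Toy depth-first matcher for the fixed pattern ^<.*> (greedy != 0) or
 * ^<.*?> (greedy == 0). Illustration only, not PCRE's implementation.
 * Returns the length of the first match found, or -1 if there is none. */
int match_angle(const char *subject, int greedy)
{
    size_t len = strlen(subject);
    size_t n, i;

    if (len == 0 || subject[0] != '<')
        return -1;                      /* ^< must match at offset 0 */

    /* Count the characters that .* could consume; dot stops at newline. */
    n = 0;
    while (1 + n < len && subject[1 + n] != '\n')
        n++;

    if (greedy) {
        /* Longest repetition first, then back up one character at a time. */
        for (i = n + 1; i-- > 0; )
            if (subject[1 + i] == '>')
                return (int)(i + 2);    /* first leaf reached: stop */
    } else {
        /* Shortest repetition first, then extend one character at a time. */
        for (i = 0; i <= n; i++)
            if (subject[1 + i] == '>')
                return (int)(i + 2);    /* first leaf reached: stop */
    }
    return -1;
}
```

       For  the  subject "<something> <something else> <something further>",
       the greedy form returns the length of the whole string and the ungreedy
       form returns 11, the length of "<something>".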
1402
1403
1404THE ALTERNATIVE MATCHING ALGORITHM
1405
1406       This algorithm conducts a breadth-first search of  the  tree.  Starting
1407       from  the  first  matching  point  in the subject, it scans the subject
1408       string from left to right, once, character by character, and as it does
1409       this,  it remembers all the paths through the tree that represent valid
1410       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
1411       though  it is not implemented as a traditional finite state machine (it
1412       keeps multiple states active simultaneously).
1413
1414       Although the general principle of this matching algorithm  is  that  it
1415       scans  the subject string only once, without backtracking, there is one
1416       exception: when a lookaround assertion is encountered,  the  characters
1417       following  or  preceding  the  current  point  have to be independently
1418       inspected.
1419
1420       The scan continues until either the end of the subject is  reached,  or
1421       there  are  no more unterminated paths. At this point, terminated paths
1422       represent the different matching possibilities (if there are none,  the
1423       match  has  failed).   Thus,  if there is more than one possible match,
1424       this algorithm finds all of them, and in particular, it finds the long-
1425       est.  The  matches are returned in decreasing order of length. There is
1426       an option to stop the algorithm after the first match (which is  neces-
1427       sarily the shortest) is found.
1428
1429       Note that all the matches that are found start at the same point in the
1430       subject. If the pattern
1431
1432         cat(er(pillar)?)?
1433
1434       is matched against the string "the caterpillar catchment",  the  result
1435       will  be the three strings "caterpillar", "cater", and "cat" that start
1436       at the fifth character of the subject. The algorithm does not automati-
1437       cally move on to find matches that start at later positions.
1438
1439       There are a number of features of PCRE regular expressions that are not
1440       supported by the alternative matching algorithm. They are as follows:
1441
1442       1. Because the algorithm finds all  possible  matches,  the  greedy  or
1443       ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
1444       ungreedy quantifiers are treated in exactly the same way. However, pos-
1445       sessive  quantifiers can make a difference when what follows could also
1446       match what is quantified, for example in a pattern like this:
1447
1448         ^a++\w!
1449
1450       This pattern matches "aaab!" but not "aaa!", which would be matched  by
1451       a  non-possessive quantifier. Similarly, if an atomic group is present,
1452       it is matched as if it were a standalone pattern at the current  point,
1453       and  the  longest match is then "locked in" for the rest of the overall
1454       pattern.
1455
1456       2. When dealing with multiple paths through the tree simultaneously, it
1457       is  not  straightforward  to  keep track of captured substrings for the
1458       different matching possibilities, and  PCRE's  implementation  of  this
1459       algorithm does not attempt to do this. This means that no captured sub-
1460       strings are available.
1461
1462       3. Because no substrings are captured, back references within the  pat-
1463       tern are not supported, and cause errors if encountered.
1464
1465       4.  For  the same reason, conditional expressions that use a backrefer-
1466       ence as the condition or test for a specific group  recursion  are  not
1467       supported.
1468
1469       5.  Because  many  paths  through the tree may be active, the \K escape
1470       sequence, which resets the start of the match when encountered (but may
1471       be  on  some  paths  and not on others), is not supported. It causes an
1472       error if encountered.
1473
1474       6. Callouts are supported, but the value of the  capture_top  field  is
1475       always 1, and the value of the capture_last field is always -1.
1476
1477       7.  The  \C  escape  sequence, which (in the standard algorithm) always
1478       matches a single data unit, even in UTF-8, UTF-16 or UTF-32  modes,  is
1479       not  supported  in these modes, because the alternative algorithm moves
1480       through the subject string one character (not data unit) at a time, for
1481       all active paths through the tree.
1482
1483       8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
1484       are not supported. (*FAIL) is supported, and  behaves  like  a  failing
1485       negative assertion.
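
       The effect described in item 1 can be seen in a toy matcher  that  does
       not  use  PCRE  at  all. The sketch below hard-codes ^a+\w! and ^a++\w!
       under the hypothetical name match_a_word_bang; the only difference  be-
       tween  the  two forms is whether the run of "a" characters may give any
       of them back.

```c
#include <ctype.h>
#include <stddef.h>

/* \w in the default C locale: letters, digits, and underscore. */
static int is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Toy matcher for ^a+\w! (possessive == 0) or ^a++\w! (possessive != 0).
 * Returns 1 if the pattern matches at the start of s, otherwise 0. */
int match_a_word_bang(const char *s, int possessive)
{
    size_t n = 0, i;

    while (s[n] == 'a')
        n++;                    /* longest possible run for a+ */
    if (n == 0)
        return 0;

    /* Try the full run first. A possessive quantifier never gives
     * characters back, so it gets only this one attempt. */
    for (i = n; i >= 1; i--) {
        if (is_word_char(s[i]) && s[i + 1] == '!')
            return 1;
        if (possessive)
            break;
    }
    return 0;
}
```

       With  the  possessive  form, "aaab!" matches and "aaa!" does not; with
       the non-possessive form both subjects match, because the last  "a"  can
       be released to satisfy \w.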
1486
1487
1488ADVANTAGES OF THE ALTERNATIVE ALGORITHM
1489
1490       Using  the alternative matching algorithm provides the following advan-
1491       tages:
1492
1493       1. All possible matches (at a single point in the subject) are automat-
1494       ically  found,  and  in particular, the longest match is found. To find
1495       more than one match using the standard algorithm, you have to resort to
1496       awkward workarounds that use callouts.
1497
1498       2.  Because  the  alternative  algorithm  scans the subject string just
1499       once, and never needs to backtrack (except for lookbehinds), it is pos-
1500       sible  to  pass  very  long subject strings to the matching function in
1501       several pieces, checking for partial matching each time. Although it is
1502       possible  to  do multi-segment matching using the standard algorithm by
1503       retaining partially matched substrings, it  is  more  complicated.  The
1504       pcrepartial  documentation  gives  details of partial matching and dis-
1505       cusses multi-segment matching.
1506
1507
1508DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
1509
1510       The alternative algorithm suffers from a number of disadvantages:
1511
1512       1. It is substantially slower than  the  standard  algorithm.  This  is
1513       partly  because  it has to search for all possible matches, but is also
1514       because it is less susceptible to optimization.
1515
1516       2. Capturing parentheses and back references are not supported.
1517
1518       3. Although atomic groups are supported, their use does not provide the
1519       performance advantage that it does for the standard algorithm.
1520
1521
1522AUTHOR
1523
1524       Philip Hazel
1525       University Computing Service
1526       Cambridge CB2 3QH, England.
1527
1528
1529REVISION
1530
1531       Last updated: 08 January 2012
1532       Copyright (c) 1997-2012 University of Cambridge.
1533------------------------------------------------------------------------------
1534
1535
1536PCREAPI(3)                                                          PCREAPI(3)
1537
1538
1539NAME
1540       PCRE - Perl-compatible regular expressions
1541
1542       #include <pcre.h>
1543
1544
1545PCRE NATIVE API BASIC FUNCTIONS
1546
1547       pcre *pcre_compile(const char *pattern, int options,
1548            const char **errptr, int *erroffset,
1549            const unsigned char *tableptr);
1550
1551       pcre *pcre_compile2(const char *pattern, int options,
1552            int *errorcodeptr,
1553            const char **errptr, int *erroffset,
1554            const unsigned char *tableptr);
1555
1556       pcre_extra *pcre_study(const pcre *code, int options,
1557            const char **errptr);
1558
1559       void pcre_free_study(pcre_extra *extra);
1560
1561       int pcre_exec(const pcre *code, const pcre_extra *extra,
1562            const char *subject, int length, int startoffset,
1563            int options, int *ovector, int ovecsize);
1564
1565       int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
1566            const char *subject, int length, int startoffset,
1567            int options, int *ovector, int ovecsize,
1568            int *workspace, int wscount);
1569
1570
1571PCRE NATIVE API STRING EXTRACTION FUNCTIONS
1572
1573       int pcre_copy_named_substring(const pcre *code,
1574            const char *subject, int *ovector,
1575            int stringcount, const char *stringname,
1576            char *buffer, int buffersize);
1577
1578       int pcre_copy_substring(const char *subject, int *ovector,
1579            int stringcount, int stringnumber, char *buffer,
1580            int buffersize);
1581
1582       int pcre_get_named_substring(const pcre *code,
1583            const char *subject, int *ovector,
1584            int stringcount, const char *stringname,
1585            const char **stringptr);
1586
1587       int pcre_get_stringnumber(const pcre *code,
1588            const char *name);
1589
1590       int pcre_get_stringtable_entries(const pcre *code,
1591            const char *name, char **first, char **last);
1592
1593       int pcre_get_substring(const char *subject, int *ovector,
1594            int stringcount, int stringnumber,
1595            const char **stringptr);
1596
1597       int pcre_get_substring_list(const char *subject,
1598            int *ovector, int stringcount, const char ***listptr);
1599
1600       void pcre_free_substring(const char *stringptr);
1601
1602       void pcre_free_substring_list(const char **stringptr);
1603
1604
1605PCRE NATIVE API AUXILIARY FUNCTIONS
1606
1607       int pcre_jit_exec(const pcre *code, const pcre_extra *extra,
1608            const char *subject, int length, int startoffset,
1609            int options, int *ovector, int ovecsize,
1610            pcre_jit_stack *jstack);
1611
1612       pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
1613
1614       void pcre_jit_stack_free(pcre_jit_stack *stack);
1615
1616       void pcre_assign_jit_stack(pcre_extra *extra,
1617            pcre_jit_callback callback, void *data);
1618
1619       const unsigned char *pcre_maketables(void);
1620
1621       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1622            int what, void *where);
1623
1624       int pcre_refcount(pcre *code, int adjust);
1625
1626       int pcre_config(int what, void *where);
1627
1628       const char *pcre_version(void);
1629
1630       int pcre_pattern_to_host_byte_order(pcre *code,
1631            pcre_extra *extra, const unsigned char *tables);
1632
1633
1634PCRE NATIVE API INDIRECTED FUNCTIONS
1635
1636       void *(*pcre_malloc)(size_t);
1637
1638       void (*pcre_free)(void *);
1639
1640       void *(*pcre_stack_malloc)(size_t);
1641
1642       void (*pcre_stack_free)(void *);
1643
1644       int (*pcre_callout)(pcre_callout_block *);
1645
1646
1647PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
1648
1649       As  well  as  support  for  8-bit character strings, PCRE also supports
1650       16-bit strings (from release 8.30) and  32-bit  strings  (from  release
1651       8.32),  by means of two additional libraries. They can be built as well
1652       as, or instead of, the 8-bit library. To avoid too  much  complication,
1653       this  document describes the 8-bit versions of the functions, with only
1654       occasional references to the 16-bit and 32-bit libraries.
1655
1656       The 16-bit and 32-bit functions operate in the same way as their  8-bit
1657       counterparts;  they  just  use different data types for their arguments
1658       and results, and their names start with pcre16_ or pcre32_  instead  of
1659       pcre_.  For  every  option  that  has  UTF8  in  its name (for example,
1660       PCRE_UTF8), there are corresponding 16-bit and 32-bit names  with  UTF8
1661       replaced by UTF16 or UTF32, respectively. This facility is in fact just
1662       cosmetic; the 16-bit and 32-bit option names define the same  bit  val-
1663       ues.
1664
1665       References to bytes and UTF-8 in this document should be read as refer-
1666       ences to 16-bit data  quantities  and  UTF-16  when  using  the  16-bit
1667       library,  or  32-bit  data  quantities and UTF-32 when using the 32-bit
1668       library, unless specified otherwise. More details of the specific  dif-
1669       ferences  for  the  16-bit and 32-bit libraries are given in the pcre16
1670       and pcre32 pages.
1671
1672
1673PCRE API OVERVIEW
1674
1675       PCRE has its own native API, which is described in this document. There
1676       are  also some wrapper functions (for the 8-bit library only) that cor-
1677       respond to the POSIX regular expression  API,  but  they  do  not  give
1678       access  to  all  the functionality. They are described in the pcreposix
1679       documentation. Both of these APIs define a set of C function  calls.  A
1680       C++ wrapper (again for the 8-bit library only) is also distributed with
1681       PCRE. It is documented in the pcrecpp page.
1682
1683       The native API C function prototypes are defined  in  the  header  file
1684       pcre.h,  and  on Unix-like systems the (8-bit) library itself is called
1685       libpcre. It can normally be accessed by adding -lpcre  to  the  command
1686       for  linking an application that uses PCRE. The header file defines the
1687       macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
1688       numbers  for the library. Applications can use these to include support
1689       for different releases of PCRE.
1690
1691       In a Windows environment, if you want to statically link an application
1692       program  against  a  non-dll  pcre.a  file, you must define PCRE_STATIC
1693       before including pcre.h or pcrecpp.h, because otherwise  the  pcre_mal-
1694       loc()   and   pcre_free()   exported   functions   will   be   declared
1695       __declspec(dllimport), with unwanted results.
1696
1697       The  functions  pcre_compile(),  pcre_compile2(),   pcre_study(),   and
1698       pcre_exec()  are used for compiling and matching regular expressions in
1699       a Perl-compatible manner. A sample program that demonstrates  the  sim-
1700       plest  way  of  using them is provided in the file called pcredemo.c in
1701       the PCRE source distribution. A listing of this program is given in the
1702       pcredemo  documentation, and the pcresample documentation describes how
1703       to compile and run it.
1704
1705       Just-in-time compiler support is an optional feature of PCRE  that  can
1706       be built in appropriate hardware environments. It greatly speeds up the
1707       matching performance of  many  patterns.  Simple  programs  can  easily
1708       request  that  it  be  used  if available, by setting an option that is
1709       ignored when it is not relevant. More complicated programs  might  need
1710       to     make    use    of    the    functions    pcre_jit_stack_alloc(),
1711       pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to  control
1712       the JIT code's memory usage.
1713
1714       From  release  8.32 there is also a direct interface for JIT execution,
1715       which gives improved performance. The JIT-specific functions  are  dis-
1716       cussed in the pcrejit documentation.
1717
1718       A second matching function, pcre_dfa_exec(), which is not Perl-compati-
1719       ble, is also provided. This uses a different algorithm for  the  match-
1720       ing.  The  alternative algorithm finds all possible matches (at a given
1721       point in the subject), and scans the subject just  once  (unless  there
1722       are  lookbehind  assertions).  However,  this algorithm does not return
1723       captured substrings. A description of the two matching  algorithms  and
1724       their  advantages  and disadvantages is given in the pcrematching docu-
1725       mentation.
1726
1727       In addition to the main compiling and  matching  functions,  there  are
1728       convenience functions for extracting captured substrings from a subject
1729       string that is matched by pcre_exec(). They are:
1730
1731         pcre_copy_substring()
1732         pcre_copy_named_substring()
1733         pcre_get_substring()
1734         pcre_get_named_substring()
1735         pcre_get_substring_list()
1736         pcre_get_stringnumber()
1737         pcre_get_stringtable_entries()
1738
1739       pcre_free_substring() and pcre_free_substring_list() are also provided,
1740       to free the memory used for extracted strings.
1741
1742       The  function  pcre_maketables()  is  used  to build a set of character
1743       tables  in  the  current  locale   for   passing   to   pcre_compile(),
1744       pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is
1745       provided for specialist use.  Most  commonly,  no  special  tables  are
1746       passed,  in  which case internal tables that are generated when PCRE is
1747       built are used.
1748
1749       The function pcre_fullinfo() is used to find out  information  about  a
1750       compiled  pattern.  The  function pcre_version() returns a pointer to a
1751       string containing the version of PCRE and its date of release.
1752
1753       The function pcre_refcount() maintains a  reference  count  in  a  data
1754       block  containing  a compiled pattern. This is provided for the benefit
1755       of object-oriented applications.
1756
1757       The global variables pcre_malloc and pcre_free  initially  contain  the
1758       entry  points  of  the  standard malloc() and free() functions, respec-
1759       tively. PCRE calls the memory management functions via these variables,
1760       so  a  calling  program  can replace them if it wishes to intercept the
1761       calls. This should be done before calling any PCRE functions.
1762
1763       The global variables pcre_stack_malloc  and  pcre_stack_free  are  also
1764       indirections  to  memory  management functions. These special functions
1765       are used only when PCRE is compiled to use  the  heap  for  remembering
1766       data, instead of recursive function calls, when running the pcre_exec()
1767       function. See the pcrebuild documentation for  details  of  how  to  do
1768       this.  It  is  a non-standard way of building PCRE, for use in environ-
1769       ments that have limited stacks. Because of the greater  use  of  memory
1770       management,  it  runs  more  slowly. Separate functions are provided so
1771       that special-purpose external code can be  used  for  this  case.  When
1772       used,  these  functions  are always called in a stack-like manner (last
1773       obtained, first freed), and always for memory blocks of the same  size.
1774       There  is  a discussion about PCRE's stack usage in the pcrestack docu-
1775       mentation.
1776
1777       The global variable pcre_callout initially contains NULL. It can be set
1778       by  the  caller  to  a "callout" function, which PCRE will then call at
1779       specified points during a matching operation. Details are given in  the
1780       pcrecallout documentation.
1781
1782
1783NEWLINES
1784
1785       PCRE  supports five different conventions for indicating line breaks in
1786       strings: a single CR (carriage return) character, a  single  LF  (line-
1787       feed) character, the two-character sequence CRLF, any of the three pre-
1788       ceding, or any Unicode newline sequence. The Unicode newline  sequences
1789       are  the  three just mentioned, plus the single characters VT (vertical
1790       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
1791       separator, U+2028), and PS (paragraph separator, U+2029).
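
       The five conventions can be summarized in code. The following sketch is
       independent of PCRE: the enum constants and  the  function  newline_length
       are  hypothetical names, and the subject is assumed to be an array of
       already decoded code points.

```c
#include <stddef.h>

/* Hypothetical names for the five conventions; not PCRE identifiers. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Returns the number of code points (0, 1, or 2) that form a newline at
 * the start of s[0..n-1] under the given convention. */
size_t newline_length(const unsigned long *s, size_t n, enum nl_convention c)
{
    int crlf;

    if (n == 0)
        return 0;
    crlf = (n >= 2 && s[0] == 0x000D && s[1] == 0x000A);

    switch (c) {
    case NL_CR:
        return s[0] == 0x000D ? 1 : 0;
    case NL_LF:
        return s[0] == 0x000A ? 1 : 0;
    case NL_CRLF:
        return crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (crlf)
            return 2;
        return (s[0] == 0x000D || s[0] == 0x000A) ? 1 : 0;
    case NL_ANY:
        if (crlf)
            return 2;
        switch (s[0]) {
        case 0x000A: case 0x000B: case 0x000C: case 0x000D:  /* LF VT FF CR */
        case 0x0085: case 0x2028: case 0x2029:               /* NEL LS PS  */
            return 1;
        default:
            return 0;
        }
    }
    return 0;
}
```

       Note that the same input can give different answers:  the  pair  CR, LF
       counts  as  one  two-character newline under CRLF, ANYCRLF, and ANY, as
       a one-character newline under CR, and as no newline at  all  under  LF
       when examined at the CR.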
1792
1793       Each  of  the first three conventions is used by at least one operating
1794       system as its standard newline sequence. When PCRE is built, a  default
1795       can  be specified. The built-in default is LF, which is the Unix stan-
1796       dard. When PCRE is run, the default can be overridden,  either  when  a
1797       pattern is compiled, or when it is matched.
1798
1799       At compile time, the newline convention can be specified by the options
1800       argument of pcre_compile(), or it can be specified by special  text  at
1801       the start of the pattern itself; this overrides any other settings. See
1802       the pcrepattern page for details of the special character sequences.
1803
1804       In the PCRE documentation the word "newline" is used to mean "the char-
1805       acter  or pair of characters that indicates a line break". The choice of
1806       newline convention affects the handling of  the  dot,  circumflex,  and
1807       dollar metacharacters, the handling of #-comments in /x mode, and, when
1808       CRLF is a recognized line ending sequence, the match position  advance-
1809       ment for a non-anchored pattern. There is more detail about this in the
1810       section on pcre_exec() options below.
1811
1812       The choice of newline convention does not affect the interpretation  of
1813       the  \n  or  \r  escape  sequences, nor does it affect what \R matches,
1814       which is controlled in a similar way, but by separate options.
1815
1816
1817MULTITHREADING
1818
1819       The  PCRE  functions  can be used in multithreaded applications, with
1820       the  proviso  that  the  memory  management  functions  pointed  to  by
1821       pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
1822       callout function pointed to by pcre_callout, are shared by all threads.
1823
1824       The  compiled form of a regular expression is not altered during match-
1825       ing, so the same compiled pattern can safely be used by several threads
1826       at once.
1827
1828       If  the just-in-time optimization feature is being used, it needs sepa-
1829       rate memory stack areas for each thread. See the pcrejit  documentation
1830       for more details.
1831
1832
1833SAVING PRECOMPILED PATTERNS FOR LATER USE
1834
1835       The compiled form of a regular expression can be saved and re-used at a
1836       later time, possibly by a different program, and even on a  host  other
1837       than  the  one  on  which  it  was  compiled.  Details are given in the
1838       pcreprecompile documentation,  which  includes  a  description  of  the
1839       pcre_pattern_to_host_byte_order()  function. However, compiling a regu-
1840       lar expression with one version of PCRE for use with a  different  ver-
1841       sion is not guaranteed to work and may cause crashes.
1842
1843
1844CHECKING BUILD-TIME OPTIONS
1845
1846       int pcre_config(int what, void *where);
1847
1848       The  function pcre_config() makes it possible for a PCRE client to dis-
1849       cover which optional features have been compiled into the PCRE library.
1850       The  pcrebuild documentation has more details about these optional fea-
1851       tures.
1852
1853       The first argument for pcre_config() is an  integer,  specifying  which
1854       information is required; the second argument is a pointer to a variable
1855       into which the information is placed. The returned  value  is  zero  on
1856       success,  or  the negative error code PCRE_ERROR_BADOPTION if the value
1857       in the first argument is not recognized. The following  information  is
1858       available:
1859
1860         PCRE_CONFIG_UTF8
1861
1862       The  output is an integer that is set to one if UTF-8 support is avail-
1863       able; otherwise it is set to zero. This value should normally be  given
1864       to the 8-bit version of this function, pcre_config(). If it is given to
1865       the  16-bit  or  32-bit  version  of  this  function,  the  result   is
1866       PCRE_ERROR_BADOPTION.
1867
1868         PCRE_CONFIG_UTF16
1869
1870       The output is an integer that is set to one if UTF-16 support is avail-
1871       able; otherwise it is set to zero. This value should normally be  given
1872       to the 16-bit version of this function, pcre16_config(). If it is given
1873       to the 8-bit  or  32-bit  version  of  this  function,  the  result  is
1874       PCRE_ERROR_BADOPTION.
1875
1876         PCRE_CONFIG_UTF32
1877
1878       The output is an integer that is set to one if UTF-32 support is avail-
1879       able; otherwise it is set to zero. This value should normally be  given
1880       to the 32-bit version of this function, pcre32_config(). If it is given
1881       to the 8-bit  or  16-bit  version  of  this  function,  the  result  is
1882       PCRE_ERROR_BADOPTION.
1883
1884         PCRE_CONFIG_UNICODE_PROPERTIES
1885
1886       The  output  is  an  integer  that is set to one if support for Unicode
1887       character properties is available; otherwise it is set to zero.
1888
1889         PCRE_CONFIG_JIT
1890
1891       The output is an integer that is set to one if support for just-in-time
1892       compiling is available; otherwise it is set to zero.
1893
1894         PCRE_CONFIG_JITTARGET
1895
1896       The  output is a pointer to a zero-terminated "const char *" string. If
1897       JIT support is available, the string contains the name of the architec-
1898       ture  for  which the JIT compiler is configured, for example "x86 32bit
1899       (little endian + unaligned)". If JIT  support  is  not  available,  the
1900       result is NULL.
1901
1902         PCRE_CONFIG_NEWLINE
1903
1904       The  output  is  an integer whose value specifies the default character
1905       sequence that is recognized as meaning "newline". The values  that  are
1906       supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
1907       for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC  environments,  CR,
1908       ANYCRLF,  and  ANY  yield the same values. However, the value for LF is
1909       normally 21, though some EBCDIC environments use 37. The  corresponding
1910       values  for  CRLF are 3349 and 3365. The default should normally corre-
1911       spond to the standard sequence for your operating system.
1912
1913         PCRE_CONFIG_BSR
1914
1915       The output is an integer whose value indicates what character sequences
1916       the  \R  escape sequence matches by default. A value of 0 means that \R
1917       matches any Unicode line ending sequence; a value of 1  means  that  \R
1918       matches only CR, LF, or CRLF. The default can be overridden when a pat-
1919       tern is compiled or matched.
1920
1921         PCRE_CONFIG_LINK_SIZE
1922
1923       The output is an integer that contains the number  of  bytes  used  for
1924       internal  linkage  in  compiled  regular  expressions.  For  the  8-bit
1925       library,  the  value  can  be  2,  3,  or  4. For the 16-bit and 32-bit
1926       libraries, the value is either 2 or 4, and in both cases  it  is  still
1927       a number of bytes. The
1928       default value of 2 is sufficient for all but the most massive patterns,
1929       since it allows the compiled pattern to be up to 64K  in  size.  Larger
1930       values  allow larger regular expressions to be compiled, at the expense
1931       of slower matching.
1932
1933         PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
1934
1935       The output is an integer that contains the threshold  above  which  the
1936       POSIX  interface  uses malloc() for output vectors. Further details are
1937       given in the pcreposix documentation.
1938
1939         PCRE_CONFIG_MATCH_LIMIT
1940
1941       The output is a long integer that gives the default limit for the  num-
1942       ber  of  internal  matching  function calls in a pcre_exec() execution.
1943       Further details are given with pcre_exec() below.
1944
1945         PCRE_CONFIG_MATCH_LIMIT_RECURSION
1946
1947       The output is a long integer that gives the default limit for the depth
1948       of   recursion  when  calling  the  internal  matching  function  in  a
1949       pcre_exec() execution.  Further  details  are  given  with  pcre_exec()
1950       below.
1951
1952         PCRE_CONFIG_STACKRECURSE
1953
1954       The  output is an integer that is set to one if internal recursion when
1955       running pcre_exec() is implemented by recursive function calls that use
1956       the  stack  to remember their state. This is the usual way that PCRE is
1957       compiled. The output is zero if PCRE was compiled to use blocks of data
1958       on  the  heap  instead  of  recursive  function  calls.  In  this case,
1959       pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
1960       blocks on the heap, thus avoiding the use of the stack.
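
       The PCRE_CONFIG_NEWLINE values listed above are consistent with a pack-
       ing in which a two-character sequence is stored as the first  character
       code  times  256  plus  the  second (13 * 256 + 10 = 3338 for CRLF, and
       13 * 256 + 21 = 3349 for the usual EBCDIC case). Assuming that packing,
       a  value  can be split as sketched below. The function name split_new-
       line_value is hypothetical.

```c
/* Splits a PCRE_CONFIG_NEWLINE value into character codes.
 * Returns 2 for a two-character sequence, 1 for a single character, and
 * 0 for the special values -1 (ANY) and -2 (ANYCRLF), in which case
 * *first and *second are left untouched. */
int split_newline_value(int value, int *first, int *second)
{
    if (value < 0)
        return 0;               /* ANY or ANYCRLF: no fixed sequence */
    if (value > 255) {
        *first = value >> 8;    /* high part: first character code */
        *second = value & 255;  /* low part: second character code */
        return 2;
    }
    *first = value;             /* single character such as LF or CR */
    return 1;
}
```

       In  a  real program the value would come from a call such as pcre_con-
       fig(PCRE_CONFIG_NEWLINE, &value) with an int variable.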
1961
1962
1963COMPILING A PATTERN
1964
1965       pcre *pcre_compile(const char *pattern, int options,
1966            const char **errptr, int *erroffset,
1967            const unsigned char *tableptr);
1968
1969       pcre *pcre_compile2(const char *pattern, int options,
1970            int *errorcodeptr,
1971            const char **errptr, int *erroffset,
1972            const unsigned char *tableptr);
1973
1974       Either of the functions pcre_compile() or pcre_compile2() can be called
1975       to compile a pattern into an internal form. The only difference between
1976       the  two interfaces is that pcre_compile2() has an additional argument,
1977       errorcodeptr, via which a numerical error  code  can  be  returned.  To
1978       avoid  too  much repetition, we refer just to pcre_compile() below, but
1979       the information applies equally to pcre_compile2().
1980
1981       The pattern is a C string terminated by a binary zero, and is passed in
1982       the  pattern  argument.  A  pointer to a single block of memory that is
1983       obtained via pcre_malloc is returned. This contains the  compiled  code
1984       and related data. The pcre type is defined for the returned block; this
1985       is a typedef for a structure whose contents are not externally defined.
1986       It is up to the caller to free the memory (via pcre_free) when it is no
1987       longer required.
1988
1989       Although the compiled code of a PCRE regex is relocatable, that is,  it
1990       does not depend on memory location, the complete pcre data block is not
1991       fully relocatable, because it may contain a copy of the tableptr  argu-
1992       ment, which is an address (see below).
1993
1994       The options argument contains various bit settings that affect the com-
1995       pilation. It should be zero if no options are required.  The  available
1996       options  are  described  below. Some of them (in particular, those that
1997       are compatible with Perl, but some others as well) can also be set  and
1998       unset  from  within  the  pattern  (see the detailed description in the
1999       pcrepattern documentation). For those options that can be different  in
2000       different  parts  of  the pattern, the contents of the options argument
2001       specify their settings at the start of compilation and  execution.  The
2002       PCRE_ANCHORED,  PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
2003       PCRE_NO_START_OPTIMIZE options can be set at the time  of  matching  as
2004       well as at compile time.
2005
2006       If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
2007       if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
2008       sets the variable pointed to by errptr to point to a textual error mes-
2009       sage. This is a static string that is part of the library. You must not
2010       try  to  free it. Normally, the offset from the start of the pattern to
2011       the byte that was being processed when  the  error  was  discovered  is
2012       placed  in the variable pointed to by erroffset, which must not be NULL
2013       (if it is, an immediate error is given). However, for an invalid  UTF-8
2014       string, the offset is that of the first byte of the failing character.
2015
2016       Some  errors are not detected until the whole pattern has been scanned;
2017       in these cases, the offset passed back is the length  of  the  pattern.
2018       Note  that  the offset is in bytes, not characters, even in UTF-8 mode.
2019       It may sometimes point into the middle of a UTF-8 character.
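
       Because the offset counts bytes, a program that wants to report a
       character position has to convert it. The following is a minimal
       sketch of one way to do that, assuming the pattern is valid UTF-8 and
       the offset does not exceed its length. utf8_char_index() is a
       hypothetical helper written for this illustration; it is not part of
       the PCRE API.

```c
/* Return the zero-based index of the UTF-8 character that contains the
   byte at byte_offset. Assumes s is valid UTF-8 and that byte_offset is
   between 0 and strlen(s) inclusive. */
int utf8_char_index(const char *s, int byte_offset)
{
    const unsigned char *p = (const unsigned char *)s;
    int i;
    int chars = 0;

    /* If the offset lands on a continuation byte (10xxxxxx), step back
       to the first byte of that character. */
    while (byte_offset > 0 && (p[byte_offset] & 0xC0) == 0x80)
        byte_offset--;

    /* Count the characters that start before the adjusted offset. */
    for (i = 0; i < byte_offset; i++) {
        if ((p[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

       For the six-byte string "caf\xc3\xa9(", byte offsets 3 and 4 both
       fall within the two-byte character and give index 3, while byte
       offset 5 gives index 4.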
2020
2021       If pcre_compile2() is used instead of pcre_compile(),  and  the  error-
2022       codeptr  argument is not NULL, a non-zero error code number is returned
2023       via this argument in the event of an error. This is in addition to  the
2024       textual error message. Error codes and messages are listed below.
2025
2026       If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
2027       character tables that are  built  when  PCRE  is  compiled,  using  the
2028       default  C  locale.  Otherwise, tableptr must be an address that is the
2029       result of a call to pcre_maketables(). This value is  stored  with  the
2030       compiled  pattern,  and used again by pcre_exec(), unless another table
2031       pointer is passed to it. For more discussion, see the section on locale
2032       support below.
2033
2034       This  code  fragment  shows a typical straightforward call to pcre_com-
2035       pile():
2036
2037         pcre *re;
2038         const char *error;
2039         int erroffset;
2040         re = pcre_compile(
2041           "^A.*Z",          /* the pattern */
2042           0,                /* default options */
2043           &error,           /* for error message */
2044           &erroffset,       /* for error offset */
2045           NULL);            /* use default character tables */
2046
2047       The following names for option bits are defined in  the  pcre.h  header
2048       file:
2049
2050         PCRE_ANCHORED
2051
2052       If this bit is set, the pattern is forced to be "anchored", that is, it
2053       is constrained to match only at the first matching point in the  string
2054       that  is being searched (the "subject string"). This effect can also be
2055       achieved by appropriate constructs in the pattern itself, which is  the
2056       only way to do it in Perl.
2057
2058         PCRE_AUTO_CALLOUT
2059
2060       If this bit is set, pcre_compile() automatically inserts callout items,
2061       all with number 255, before each pattern item. For  discussion  of  the
2062       callout facility, see the pcrecallout documentation.
2063
2064         PCRE_BSR_ANYCRLF
2065         PCRE_BSR_UNICODE
2066
2067       These options (which are mutually exclusive) control what the \R escape
2068       sequence matches. The choice is either to match only CR, LF,  or  CRLF,
2069       or to match any Unicode newline sequence. The default is specified when
2070       PCRE is built. It can be overridden from within the pattern, or by set-
2071       ting an option when a compiled pattern is matched.
2072
2073         PCRE_CASELESS
2074
2075       If  this  bit is set, letters in the pattern match both upper and lower
2076       case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
2077       changed  within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
2078       always understands the concept of case for characters whose values  are
2079       less  than 128, so caseless matching is always possible. For characters
2080       with higher values, the concept of case is supported if  PCRE  is  com-
2081       piled  with Unicode property support, but not otherwise. If you want to
2082       use caseless matching for characters 128 and  above,  you  must  ensure
2083       that  PCRE  is  compiled  with Unicode property support as well as with
2084       UTF-8 support.
2085
2086         PCRE_DOLLAR_ENDONLY
2087
2088       If this bit is set, a dollar metacharacter in the pattern matches  only
2089       at  the  end  of the subject string. Without this option, a dollar also
2090       matches immediately before a newline at the end of the string (but  not
2091       before  any  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
2092       if PCRE_MULTILINE is set.  There is no equivalent  to  this  option  in
2093       Perl, and no way to set it within a pattern.
2094
2095         PCRE_DOTALL
2096
2097       If  this bit is set, a dot metacharacter in the pattern matches a char-
2098       acter of any value, including one that indicates a newline. However, it
2099       only  ever  matches  one character, even if newlines are coded as CRLF.
2100       Without this option, a dot does not match when the current position  is
2101       at a newline. This option is equivalent to Perl's /s option, and it can
2102       be changed within a pattern by a (?s) option setting. A negative  class
2103       such as [^a] always matches newline characters, independent of the set-
2104       ting of this option.
2105
2106         PCRE_DUPNAMES
2107
2108       If this bit is set, names used to identify capturing  subpatterns  need
2109       not be unique. This can be helpful for certain types of pattern when it
2110       is known that only one instance of the named  subpattern  can  ever  be
2111       matched.  There  are  more details of named subpatterns below; see also
2112       the pcrepattern documentation.
2113
2114         PCRE_EXTENDED
2115
2116       If this bit is set, white space data  characters  in  the  pattern  are
2117       totally  ignored except when escaped or inside a character class. White
2118       space does not include the VT character (code 11). In addition, charac-
2119       ters between an unescaped # outside a character class and the next new-
2120       line, inclusive, are also ignored. This  is  equivalent  to  Perl's  /x
2121       option,  and  it  can be changed within a pattern by a (?x) option set-
2122       ting.
2123
2124       Which characters are interpreted  as  newlines  is  controlled  by  the
2125       options  passed to pcre_compile() or by a special sequence at the start
2126       of the pattern, as described in the section entitled  "Newline  conven-
2127       tions" in the pcrepattern documentation. Note that the end of this type
2128       of comment is  a  literal  newline  sequence  in  the  pattern;  escape
2129       sequences that happen to represent a newline do not count.
2130
2131       This  option  makes  it possible to include comments inside complicated
2132       patterns.  Note, however, that this applies only  to  data  characters.
2133       White  space  characters  may  never  appear  within  special character
2134       sequences in a pattern, for example within the sequence (?( that intro-
2135       duces a conditional subpattern.
2136
2137         PCRE_EXTRA
2138
2139       This  option  was invented in order to turn on additional functionality
2140       of PCRE that is incompatible with Perl, but it  is  currently  of  very
2141       little  use. When set, any backslash in a pattern that is followed by a
2142       letter that has no special meaning  causes  an  error,  thus  reserving
2143       these  combinations  for  future  expansion.  By default, as in Perl, a
2144       backslash followed by a letter with no special meaning is treated as  a
2145       literal. (Perl can, however, be persuaded to give an error for this, by
2146       running it with the -w option.) There are at present no other  features
2147       controlled  by this option. It can also be set by a (?X) option setting
2148       within a pattern.
2149
2150         PCRE_FIRSTLINE
2151
2152       If this option is set, an  unanchored  pattern  is  required  to  match
2153       before  or  at  the  first  newline  in  the subject string, though the
2154       matched text may continue over the newline.
2155
2156         PCRE_JAVASCRIPT_COMPAT
2157
2158       If this option is set, PCRE's behaviour is changed in some ways so that
2159       it  is  compatible with JavaScript rather than Perl. The changes are as
2160       follows:
2161
2162       (1) A lone closing square bracket in a pattern  causes  a  compile-time
2163       error,  because this is illegal in JavaScript (by default it is treated
2164       as a data character). Thus, the pattern AB]CD becomes illegal when this
2165       option is set.
2166
2167       (2)  At run time, a back reference to an unset subpattern group matches
2168       an empty string (by default this causes the current  matching  alterna-
2169       tive  to  fail). A pattern such as (\1)(a) succeeds when this option is
2170       set (assuming it can find an "a" in the subject), whereas it  fails  by
2171       default, for Perl compatibility.
2172
2173       (3) \U matches an upper case "U" character; by default \U causes a com-
2174       pile time error (Perl uses \U to upper case subsequent characters).
2175
2176       (4) \u matches a lower case "u" character unless it is followed by four
2177       hexadecimal  digits,  in  which case the hexadecimal number defines the
2178       code point to match. By default, \u causes a compile time  error  (Perl
2179       uses it to upper case the following character).
2180
2181       (5)  \x matches a lower case "x" character unless it is followed by two
2182       hexadecimal digits, in which case the hexadecimal  number  defines  the
2183       code  point  to  match. By default, as in Perl, a hexadecimal number is
2184       always expected after \x, but it may have zero, one, or two digits (so,
2185       for example, \xz matches a binary zero character followed by z).
2186
2187         PCRE_MULTILINE
2188
2189       By  default,  PCRE  treats the subject string as consisting of a single
2190       line of characters (even if it actually contains newlines). The  "start
2191       of  line"  metacharacter  (^)  matches only at the start of the string,
2192       while the "end of line" metacharacter ($) matches only at  the  end  of
2193       the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
2194       is set). This is the same as Perl.
2195
       When PCRE_MULTILINE is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before internal
       newlines in the subject string, respectively, as well as at the very
       start and end. This is equivalent to Perl's /m option, and it can be
       changed within a pattern by a (?m) option setting. If there are no
       newlines in a subject string, or no occurrences of ^ or $ in a pattern,
       setting PCRE_MULTILINE has no effect.
2203
2204         PCRE_NEWLINE_CR
2205         PCRE_NEWLINE_LF
2206         PCRE_NEWLINE_CRLF
2207         PCRE_NEWLINE_ANYCRLF
2208         PCRE_NEWLINE_ANY
2209
2210       These options override the default newline definition that  was  chosen
2211       when  PCRE  was built. Setting the first or the second specifies that a
2212       newline is indicated by a single character (CR  or  LF,  respectively).
2213       Setting  PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
2214       two-character CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF  specifies
2215       that any of the three preceding sequences should be recognized. Setting
2216       PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should  be
2217       recognized.
2218
2219       In  an ASCII/Unicode environment, the Unicode newline sequences are the
2220       three just mentioned, plus the  single  characters  VT  (vertical  tab,
2221       U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep-
2222       arator, U+2028), and PS (paragraph separator, U+2029).  For  the  8-bit
2223       library, the last two are recognized only in UTF-8 mode.
2224
2225       When  PCRE is compiled to run in an EBCDIC (mainframe) environment, the
2226       code for CR is 0x0d, the same as ASCII. However, the character code for
2227       LF  is  normally 0x15, though in some EBCDIC environments 0x25 is used.
2228       Whichever of these is not LF is made to  correspond  to  Unicode's  NEL
2229       character.  EBCDIC  codes  are all less than 256. For more details, see
2230       the pcrebuild documentation.
2231
2232       The newline setting in the  options  word  uses  three  bits  that  are
2233       treated as a number, giving eight possibilities. Currently only six are
2234       used (default plus the five values above). This means that if  you  set
2235       more  than one newline option, the combination may or may not be sensi-
2236       ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
2237       PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
2238       cause an error.
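
       The arithmetic behind this can be sketched as follows. The constants
       restate the values that pcre.h is believed to use for the newline
       options in the 8.3x releases, under hypothetical NL_ names so that
       the fragment compiles on its own; treat them as assumptions and use
       the PCRE_NEWLINE_xxx names from your own header in real code.

```c
/* Assumed copies of the pcre.h values, restated for illustration only. */
enum {
    NL_CR      = 0x00100000,  /* PCRE_NEWLINE_CR */
    NL_LF      = 0x00200000,  /* PCRE_NEWLINE_LF */
    NL_CRLF    = 0x00300000,  /* PCRE_NEWLINE_CRLF */
    NL_ANY     = 0x00400000,  /* PCRE_NEWLINE_ANY */
    NL_ANYCRLF = 0x00500000,  /* PCRE_NEWLINE_ANYCRLF */
    NL_MASK    = 0x00700000   /* the three newline bits together */
};

/* The newline setting read as a three-bit number, 0 (default) to 7. */
int newline_code(int options)
{
    return (options & NL_MASK) >> 20;
}

/* Returns 1 if the number is one of the six values currently in use
   (the default plus the five named settings), otherwise 0. */
int newline_code_in_use(int options)
{
    return newline_code(options) <= 5;
}
```

       Under these assumptions NL_CR | NL_LF has the same value as NL_CRLF,
       NL_CR | NL_ANY happens to equal NL_ANYCRLF, and NL_LF | NL_ANY gives
       the unused number 6.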
2239
2240       The only time that a line break in a pattern  is  specially  recognized
2241       when  compiling is when PCRE_EXTENDED is set. CR and LF are white space
2242       characters, and so are ignored in this mode. Also, an unescaped #  out-
2243       side  a  character class indicates a comment that lasts until after the
2244       next line break sequence. In other circumstances, line break  sequences
2245       in patterns are treated as literal data.
2246
2247       The newline option that is set at compile time becomes the default that
2248       is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
2249
2250         PCRE_NO_AUTO_CAPTURE
2251
2252       If this option is set, it disables the use of numbered capturing paren-
2253       theses  in the pattern. Any opening parenthesis that is not followed by
2254       ? behaves as if it were followed by ?: but named parentheses can  still
2255       be  used  for  capturing  (and  they acquire numbers in the usual way).
2256       There is no equivalent of this option in Perl.
2257
         PCRE_NO_START_OPTIMIZE
2259
2260       This is an option that acts at matching time; that is, it is really  an
2261       option  for  pcre_exec()  or  pcre_dfa_exec().  If it is set at compile
2262       time, it is remembered with the compiled pattern and assumed at  match-
2263       ing  time.  For  details  see  the discussion of PCRE_NO_START_OPTIMIZE
2264       below.
2265
2266         PCRE_UCP
2267
2268       This option changes the way PCRE processes \B, \b, \D, \d, \S, \s,  \W,
2269       \w,  and  some  of  the POSIX character classes. By default, only ASCII
2270       characters are recognized, but if PCRE_UCP is set,  Unicode  properties
2271       are  used instead to classify characters. More details are given in the
2272       section on generic character types in the pcrepattern page. If you  set
2273       PCRE_UCP,  matching  one of the items it affects takes much longer. The
2274       option is available only if PCRE has been compiled with  Unicode  prop-
2275       erty support.
2276
2277         PCRE_UNGREEDY
2278
2279       This  option  inverts  the "greediness" of the quantifiers so that they
2280       are not greedy by default, but become greedy if followed by "?". It  is
2281       not  compatible  with Perl. It can also be set by a (?U) option setting
2282       within the pattern.
2283
2284         PCRE_UTF8
2285
2286       This option causes PCRE to regard both the pattern and the  subject  as
2287       strings of UTF-8 characters instead of single-byte strings. However, it
2288       is available only when PCRE is built to include UTF  support.  If  not,
2289       the  use  of  this option provokes an error. Details of how this option
2290       changes the behaviour of PCRE are given in the pcreunicode page.
2291
2292         PCRE_NO_UTF8_CHECK
2293
2294       When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
2295       automatically  checked.  There  is  a  discussion about the validity of
2296       UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence  is
2297       found,  pcre_compile()  returns an error. If you already know that your
2298       pattern is valid, and you want to skip this check for performance  rea-
2299       sons,  you  can set the PCRE_NO_UTF8_CHECK option.  When it is set, the
2300       effect of passing an invalid UTF-8 string as a pattern is undefined. It
2301       may  cause  your  program  to  crash. Note that this option can also be
2302       passed to pcre_exec() and pcre_dfa_exec(),  to  suppress  the  validity
2303       checking  of  subject strings only. If the same string is being matched
2304       many times, the option can be safely set for the second and  subsequent
2305       matchings to improve performance.
2306
2307
2308COMPILATION ERROR CODES
2309
       The following table lists the error codes that may be returned by
       pcre_compile2(), along with the error messages that may be returned by
       both compiling functions. Note that error messages are always 8-bit
       ASCII strings, even in 16-bit or 32-bit mode. As PCRE has developed,
       some error codes have fallen out of use. To avoid confusion, they have
       not been re-used.
2316
2317          0  no error
2318          1  \ at end of pattern
2319          2  \c at end of pattern
2320          3  unrecognized character follows \
2321          4  numbers out of order in {} quantifier
2322          5  number too big in {} quantifier
2323          6  missing terminating ] for character class
2324          7  invalid escape sequence in character class
2325          8  range out of order in character class
2326          9  nothing to repeat
2327         10  [this code is not in use]
2328         11  internal error: unexpected repeat
2329         12  unrecognized character after (? or (?-
2330         13  POSIX named classes are supported only within a class
2331         14  missing )
2332         15  reference to non-existent subpattern
2333         16  erroffset passed as NULL
2334         17  unknown option bit(s) set
2335         18  missing ) after comment
2336         19  [this code is not in use]
2337         20  regular expression is too large
2338         21  failed to get memory
2339         22  unmatched parentheses
2340         23  internal error: code overflow
2341         24  unrecognized character after (?<
2342         25  lookbehind assertion is not fixed length
2343         26  malformed number or name after (?(
2344         27  conditional group contains more than two branches
2345         28  assertion expected after (?(
2346         29  (?R or (?[+-]digits must be followed by )
2347         30  unknown POSIX class name
2348         31  POSIX collating elements are not supported
2349         32  this version of PCRE is compiled without UTF support
2350         33  [this code is not in use]
2351         34  character value in \x{...} sequence is too large
2352         35  invalid condition (?(0)
2353         36  \C not allowed in lookbehind assertion
2354         37  PCRE does not support \L, \l, \N{name}, \U, or \u
2355         38  number after (?C is > 255
2356         39  closing ) for (?C expected
2357         40  recursive call could loop indefinitely
2358         41  unrecognized character after (?P
2359         42  syntax error in subpattern name (missing terminator)
2360         43  two named subpatterns have the same name
2361         44  invalid UTF-8 string (specifically UTF-8)
2362         45  support for \P, \p, and \X has not been compiled
2363         46  malformed \P or \p sequence
2364         47  unknown property name after \P or \p
2365         48  subpattern name is too long (maximum 32 characters)
2366         49  too many named subpatterns (maximum 10000)
2367         50  [this code is not in use]
2368         51  octal value is greater than \377 in 8-bit non-UTF-8 mode
2369         52  internal error: overran compiling workspace
2370         53  internal error: previously-checked referenced subpattern
2371               not found
2372         54  DEFINE group contains more than one branch
2373         55  repeating a DEFINE group is not allowed
2374         56  inconsistent NEWLINE options
2375         57  \g is not followed by a braced, angle-bracketed, or quoted
2376               name/number or by a plain number
2377         58  a numbered reference must not be zero
2378         59  an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
2379         60  (*VERB) not recognized
2380         61  number is too big
2381         62  subpattern name expected
2382         63  digit expected after (?+
2383         64  ] is an invalid data character in JavaScript compatibility mode
2384         65  different names for subpatterns of the same number are
2385               not allowed
2386         66  (*MARK) must have an argument
2387         67  this version of PCRE is not compiled with Unicode property
2388               support
2389         68  \c must be followed by an ASCII character
2390         69  \k is not followed by a braced, angle-bracketed, or quoted name
2391         70  internal error: unknown opcode in find_fixedlength()
2392         71  \N is not supported in a class
2393         72  too many forward references
2394         73  disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
2395         74  invalid UTF-16 string (specifically UTF-16)
2396         75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
2397         76  character value in \u.... sequence is too large
2398         77  invalid UTF-32 string (specifically UTF-32)
2399
2400       The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
2401       values may be used if the limits were changed when PCRE was built.
2402
2403
2404STUDYING A PATTERN
2405
       pcre_extra *pcre_study(const pcre *code, int options,
            const char **errptr);
2408
       If a compiled pattern is going to be used several times, it is worth
       spending more time analyzing it in order to reduce the time taken for
       matching. The function pcre_study() takes a pointer to a compiled
       pattern as its first argument. If studying the pattern produces
       additional information that will help speed up matching, pcre_study()
       returns a pointer to a pcre_extra block, in which the study_data field
       points to the results of the study.
2416
2417       The  returned  value  from  pcre_study()  can  be  passed  directly  to
2418       pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block  also  con-
2419       tains  other  fields  that can be set by the caller before the block is
2420       passed; these are described below in the section on matching a pattern.
2421
2422       If studying the  pattern  does  not  produce  any  useful  information,
2423       pcre_study()  returns  NULL  by  default.  In that circumstance, if the
2424       calling program wants to pass any of the other fields to pcre_exec() or
2425       pcre_dfa_exec(),  it  must set up its own pcre_extra block. However, if
2426       pcre_study() is called  with  the  PCRE_STUDY_EXTRA_NEEDED  option,  it
2427       returns a pcre_extra block even if studying did not find any additional
2428       information. It may still return NULL, however, if an error  occurs  in
2429       pcre_study().
2430
2431       The  second  argument  of  pcre_study() contains option bits. There are
2432       three further options in addition to PCRE_STUDY_EXTRA_NEEDED:
2433
2434         PCRE_STUDY_JIT_COMPILE
2435         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
2436         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
2437
2438       If any of these are set, and the just-in-time  compiler  is  available,
2439       the  pattern  is  further compiled into machine code that executes much
2440       faster than the pcre_exec()  interpretive  matching  function.  If  the
2441       just-in-time  compiler is not available, these options are ignored. All
2442       undefined bits in the options argument must be zero.
2443
2444       JIT compilation is a heavyweight optimization. It can  take  some  time
2445       for  patterns  to  be analyzed, and for one-off matches and simple pat-
2446       terns the benefit of faster execution might be offset by a much  slower
2447       study time.  Not all patterns can be optimized by the JIT compiler. For
2448       those that cannot be handled, matching automatically falls back to  the
2449       pcre_exec()  interpreter.  For more details, see the pcrejit documenta-
2450       tion.
2451
2452       The third argument for pcre_study() is a pointer for an error  message.
2453       If  studying  succeeds  (even  if no data is returned), the variable it
2454       points to is set to NULL. Otherwise it is set to  point  to  a  textual
2455       error message. This is a static string that is part of the library. You
2456       must not try to free it. You should test the  error  pointer  for  NULL
2457       after calling pcre_study(), to be sure that it has run successfully.
2458
2459       When  you are finished with a pattern, you can free the memory used for
2460       the study data by calling pcre_free_study(). This function was added to
2461       the  API  for  release  8.20. For earlier versions, the memory could be
2462       freed with pcre_free(), just like the pattern itself. This  will  still
2463       work  in  cases where JIT optimization is not used, but it is advisable
2464       to change to the new function when convenient.
2465
2466       This is a typical way in which pcre_study() is used (except that  in  a
2467       real application there should be tests for errors):
2468
         int rc;
         int erroffset;
         int ovector[30];
         const char *error;
         pcre *re;
         pcre_extra *sd;
         re = pcre_compile("pattern", 0, &error, &erroffset, NULL);
         sd = pcre_study(
           re,             /* result of pcre_compile() */
           0,              /* no options */
           &error);        /* set to NULL or points to a message */
         rc = pcre_exec(   /* see below for details of pcre_exec() options */
           re, sd, "subject", 7, 0, 0, ovector, 30);
         ...
         pcre_free_study(sd);
         pcre_free(re);
2482
2483       Studying a pattern does two things: first, a lower bound for the length
2484       of subject string that is needed to match the pattern is computed. This
2485       does not mean that there are any strings of that length that match, but
2486       it does guarantee that no shorter strings match. The value is  used  to
2487       avoid wasting time by trying to match strings that are shorter than the
2488       lower bound. You can find out the value in a calling  program  via  the
2489       pcre_fullinfo() function.
2490
2491       Studying a pattern is also useful for non-anchored patterns that do not
2492       have a single fixed starting character. A bitmap of  possible  starting
2493       bytes  is  created. This speeds up finding a position in the subject at
2494       which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
2495       values  less  than  256.  In 32-bit mode, the bitmap is used for 32-bit
2496       values less than 256.)
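
       The idea of a starting-byte bitmap can be illustrated independently
       of PCRE. The following sketch is not PCRE's internal code; the type
       and function names are hypothetical. It keeps one bit per byte value
       and skips subject positions whose byte cannot begin a match.

```c
#include <string.h>

/* One bit for each of the 256 possible byte values. */
typedef struct {
    unsigned char bits[32];
} start_bitmap;

void bitmap_clear(start_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
}

/* Record that a match may start with byte c. */
void bitmap_add(start_bitmap *bm, unsigned char c)
{
    bm->bits[c >> 3] |= (unsigned char)(1u << (c & 7));
}

int bitmap_has(const start_bitmap *bm, unsigned char c)
{
    return (bm->bits[c >> 3] >> (c & 7)) & 1;
}

/* Return the first offset at or after start whose byte is in the bitmap,
   or length if there is no such position. */
int first_candidate(const start_bitmap *bm, const char *subject,
                    int length, int start)
{
    int i;

    for (i = start; i < length; i++) {
        if (bitmap_has(bm, (unsigned char)subject[i]))
            return i;
    }
    return length;
}
```

       For a pattern such as cat|dog the bitmap would hold 'c' and 'd', so a
       scan of "hot dog" moves straight to offset 4 without attempting a
       match at offsets 0 to 3.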
2497
2498       These two optimizations apply to both pcre_exec() and  pcre_dfa_exec(),
2499       and  the  information  is also used by the JIT compiler.  The optimiza-
2500       tions can be disabled by setting the PCRE_NO_START_OPTIMIZE option when
2501       calling pcre_exec() or pcre_dfa_exec(), but if this is done, JIT execu-
2502       tion is also disabled. You might want to do this if your  pattern  con-
2503       tains  callouts or (*MARK) and you want to make use of these facilities
2504       in   cases   where   matching   fails.   See    the    discussion    of
2505       PCRE_NO_START_OPTIMIZE below.
2506
2507
2508LOCALE SUPPORT
2509
       PCRE handles caseless matching, and determines whether characters are
       letters, digits, or whatever, by reference to a set of tables, indexed
       by character value. When running in UTF-8 mode, this applies only to
       characters with codes less than 128. By default, higher-valued codes
       never match escapes such as \w or \d, but they can be tested with \p if
       PCRE is built with Unicode character property support. Alternatively,
       the PCRE_UCP option can be set at compile time; this causes \w and
       friends to use Unicode property support instead of built-in tables. The
       use of locales with Unicode is discouraged. If you are handling
       characters with codes of 128 or greater, you should either use UTF-8 and
       Unicode, or use locales, but not try to mix the two.
2521
2522       PCRE  contains  an  internal set of tables that are used when the final
2523       argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
2524       applications.  Normally, the internal tables recognize only ASCII char-
2525       acters. However, when PCRE is built, it is possible to cause the inter-
2526       nal tables to be rebuilt in the default "C" locale of the local system,
2527       which may cause them to be different.
2528
2529       The internal tables can always be overridden by tables supplied by  the
2530       application that calls PCRE. These may be created in a different locale
2531       from the default. As more and more applications change  to  using  Uni-
2532       code, the need for this locale support is expected to die away.
2533
2534       External  tables  are  built by calling the pcre_maketables() function,
2535       which has no arguments, in the relevant locale. The result can then  be
2536       passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
2537       example, to build and use tables that are appropriate  for  the  French
2538       locale  (where  accented  characters  with  values greater than 128 are
2539       treated as letters), the following code could be used:
2540
2541         setlocale(LC_CTYPE, "fr_FR");
2542         tables = pcre_maketables();
2543         re = pcre_compile(..., tables);
2544
2545       The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
2546       if you are using Windows, the name for the French locale is "french".
2547
2548       When  pcre_maketables()  runs,  the  tables are built in memory that is
2549       obtained via pcre_malloc. It is the caller's responsibility  to  ensure
2550       that  the memory containing the tables remains available for as long as
2551       it is needed.
2552
2553       The pointer that is passed to pcre_compile() is saved with the compiled
2554       pattern,  and the same tables are used via this pointer by pcre_study()
2555       and normally also by pcre_exec(). Thus, by default, for any single pat-
2556       tern, compilation, studying and matching all happen in the same locale,
2557       but different patterns can be compiled in different locales.
2558
2559       It is possible to pass a table pointer or NULL (indicating the  use  of
2560       the  internal  tables)  to  pcre_exec(). Although not intended for this
2561       purpose, this facility could be used to match a pattern in a  different
2562       locale from the one in which it was compiled. Passing table pointers at
2563       run time is discussed below in the section on matching a pattern.
2564
2565
2566INFORMATION ABOUT A PATTERN
2567
2568       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
2569            int what, void *where);
2570
2571       The pcre_fullinfo() function returns information about a compiled  pat-
2572       tern.  It replaces the pcre_info() function, which was removed from the
2573       library at version 8.30, after more than 10 years of obsolescence.
2574
2575       The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
2576       pattern.  The second argument is the result of pcre_study(), or NULL if
2577       the pattern was not studied. The third argument specifies  which  piece
2578       of  information  is required, and the fourth argument is a pointer to a
2579       variable to receive the data. The yield of the  function  is  zero  for
2580       success, or one of the following negative numbers:
2581
2582         PCRE_ERROR_NULL           the argument code was NULL
2583                                   the argument where was NULL
2584         PCRE_ERROR_BADMAGIC       the "magic number" was not found
2585         PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
2586                                   endianness
2587         PCRE_ERROR_BADOPTION      the value of what was invalid
2588
       The "magic number" is placed at the start of each compiled pattern as
       a simple check against passing an arbitrary memory pointer. The
       endianness error can occur if a compiled pattern is saved and reloaded
       on a different host. Here is a typical call of pcre_fullinfo(), to
       obtain the length of the compiled pattern:

         int rc;
         size_t length;
         rc = pcre_fullinfo(
           re,               /* result of pcre_compile() */
           sd,               /* result of pcre_study(), or NULL */
           PCRE_INFO_SIZE,   /* what is required */
           &length);         /* where to put the data */
2602
2603       The  possible  values for the third argument are defined in pcre.h, and
2604       are as follows:
2605
2606         PCRE_INFO_BACKREFMAX
2607
2608       Return the number of the highest back reference  in  the  pattern.  The
2609       fourth  argument  should  point to an int variable. Zero is returned if
2610       there are no back references.
2611
2612         PCRE_INFO_CAPTURECOUNT
2613
2614       Return the number of capturing subpatterns in the pattern.  The  fourth
2615       argument should point to an int variable.
2616
2617         PCRE_INFO_DEFAULT_TABLES
2618
2619       Return  a pointer to the internal default character tables within PCRE.
2620       The fourth argument should point to an unsigned char *  variable.  This
2621       information call is provided for internal use by the pcre_study() func-
2622       tion. External callers can cause PCRE to use  its  internal  tables  by
2623       passing a NULL table pointer.
2624
2625         PCRE_INFO_FIRSTBYTE
2626
2627       Return information about the first data unit of any matched string, for
2628       a non-anchored pattern. (The name of this option refers  to  the  8-bit
2629       library,  where data units are bytes.) The fourth argument should point
2630       to an int variable.
2631
2632       If there is a fixed first value, for example, the  letter  "c"  from  a
2633       pattern  such  as (cat|cow|coyote), its value is returned. In the 8-bit
2634       library, the value is always less than 256. In the 16-bit  library  the
2635       value can be up to 0xffff. In the 32-bit library the value can be up to
2636       0x10ffff.
2637
2638       If there is no fixed first value, and if either
2639
2640       (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
2641       branch starts with "^", or
2642
2643       (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2644       set (if it were set, the pattern would be anchored),
2645
2646       -1 is returned, indicating that the pattern matches only at  the  start
2647       of  a  subject string or after any newline within the string. Otherwise
2648       -2 is returned. For anchored patterns, -2 is returned.
2649
       This value is deprecated because, in the 32-bit library when UTF-32
       mode is not in use, it cannot return the full 32-bit range of
       character values. The PCRE_INFO_FIRSTCHARACTERFLAGS and
       PCRE_INFO_FIRSTCHARACTER values should be used instead.
2654
2655         PCRE_INFO_FIRSTTABLE
2656
2657       If  the pattern was studied, and this resulted in the construction of a
2658       256-bit table indicating a fixed set of values for the first data  unit
2659       in  any  matching string, a pointer to the table is returned. Otherwise
2660       NULL is returned. The fourth argument should point to an unsigned  char
2661       * variable.
2662
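       As an illustration of how such a table could be consulted, the sketch
       below tests whether a byte value is one of the permitted first data
       units in the 8-bit library. It assumes, and the text above does not
       state this, that value c is represented by bit (c & 7) of byte c/8.
       The names first_table_add() and first_table_allows() are hypothetical
       helpers, not part of the PCRE API.

```c
#include <assert.h>

/*
 * ASSUMPTION: the 256-bit table is 32 bytes long and value c is
 * represented by bit (c & 7) of byte (c / 8). The documentation above
 * does not specify the bit order, so treat this layout as a guess.
 */

/* Mark value c as a permitted first data unit (used here only to build
   a table for demonstration; PCRE builds the real one when studying). */
void first_table_add(unsigned char *table, unsigned char c)
{
    assert(table != 0);
    table[c / 8] |= (unsigned char)(1u << (c & 7));
}

/* Return 1 if value c is marked in the table, otherwise 0. */
int first_table_allows(const unsigned char *table, unsigned char c)
{
    assert(table != 0);
    return (table[c / 8] >> (c & 7)) & 1;
}
```

       A caller would pass the pointer obtained with PCRE_INFO_FIRSTTABLE,
       after checking that it is not NULL, together with the first byte of a
       candidate match position.
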
2663         PCRE_INFO_HASCRORLF
2664
2665       Return  1  if  the  pattern  contains any explicit matches for CR or LF
2666       characters, otherwise 0. The fourth argument should  point  to  an  int
2667       variable.  An explicit match is either a literal CR or LF character, or
2668       \r or \n.
2669
2670         PCRE_INFO_JCHANGED
2671
2672       Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
2673       otherwise  0. The fourth argument should point to an int variable. (?J)
2674       and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
2675
2676         PCRE_INFO_JIT
2677
2678       Return 1 if the pattern was studied with one of the  JIT  options,  and
2679       just-in-time compiling was successful. The fourth argument should point
2680       to an int variable. A return value of 0 means that JIT support  is  not
2681       available  in this version of PCRE, or that the pattern was not studied
2682       with a JIT option, or that the JIT compiler could not handle this  par-
2683       ticular  pattern. See the pcrejit documentation for details of what can
2684       and cannot be handled.
2685
2686         PCRE_INFO_JITSIZE
2687
2688       If the pattern was successfully studied with a JIT option,  return  the
2689       size  of the JIT compiled code, otherwise return zero. The fourth argu-
2690       ment should point to a size_t variable.
2691
2692         PCRE_INFO_LASTLITERAL
2693
2694       Return the value of the rightmost literal data unit that must exist  in
2695       any  matched  string, other than at its start, if such a value has been
2696       recorded. The fourth argument should point to an int variable. If there
2697       is no such value, -1 is returned. For anchored patterns, a last literal
2698       value is recorded only if it follows something of variable length.  For
2699       example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
2700       /^a\dz\d/ the returned value is -1.
2701
       This value is deprecated because, in the 32-bit library when UTF-32
       mode is not in use, it cannot return the full 32-bit range of
       character values. The PCRE_INFO_REQUIREDCHARFLAGS and
       PCRE_INFO_REQUIREDCHAR values should be used instead.
2706
2707         PCRE_INFO_MAXLOOKBEHIND
2708
       Return the number of characters (NB not bytes) in the longest
       lookbehind assertion in the pattern. The fourth argument should point
       to an int variable. Note that the simple assertions \b and \B require
       a one-character lookbehind. This information is useful when doing
       multi-segment matching using the partial matching facilities.
2713
2714         PCRE_INFO_MINLENGTH
2715
2716       If the pattern was studied and a minimum length  for  matching  subject
2717       strings  was  computed,  its  value is returned. Otherwise the returned
2718       value is -1. The value is a number of characters, which in  UTF-8  mode
2719       may  be  different from the number of bytes. The fourth argument should
2720       point to an int variable. A non-negative value is a lower bound to  the
2721       length  of  any  matching  string. There may not be any strings of that
2722       length that do actually match, but every string that does match  is  at
2723       least that long.
2724
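       Because the bound is a number of characters, a caller that wants to
       compare it with a UTF-8 subject has to count characters rather than
       bytes. The following is a minimal sketch of such a comparison,
       assuming the subject is valid UTF-8. The names utf8_char_count() and
       long_enough() are hypothetical helpers, not PCRE functions.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in a UTF-8 string by counting every byte that is
   not a continuation byte (10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s, size_t nbytes)
{
    size_t i, count = 0;

    assert(s != NULL || nbytes == 0);

    for (i = 0; i < nbytes; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    return count;
}

/*
 * Return 0 if the subject is certainly too short to match, 1 otherwise.
 * minlength is the value obtained with PCRE_INFO_MINLENGTH; a negative
 * value means that no bound is available.
 */
int long_enough(int minlength, const char *subject, size_t nbytes)
{
    if (minlength < 0)
        return 1;                      /* nothing is known: cannot rule out */
    return utf8_char_count(subject, nbytes) >= (size_t)minlength;
}
```

       A result of 1 says only that the subject is not ruled out by its
       length; as noted above, there may be no string of exactly the minimum
       length that matches.
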
2725         PCRE_INFO_NAMECOUNT
2726         PCRE_INFO_NAMEENTRYSIZE
2727         PCRE_INFO_NAMETABLE
2728
2729       PCRE  supports the use of named as well as numbered capturing parenthe-
2730       ses. The names are just an additional way of identifying the  parenthe-
2731       ses, which still acquire numbers. Several convenience functions such as
2732       pcre_get_named_substring() are provided for  extracting  captured  sub-
2733       strings  by  name. It is also possible to extract the data directly, by
2734       first converting the name to a number in order to  access  the  correct
2735       pointers in the output vector (described with pcre_exec() below). To do
2736       the conversion, you need  to  use  the  name-to-number  map,  which  is
2737       described by these three values.
2738
2739       The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
2740       gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
2741       of  each  entry;  both  of  these  return  an int value. The entry size
2742       depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
2743       a pointer to the first entry of the table. This is a pointer to char in
2744       the 8-bit library, where the first two bytes of each entry are the num-
2745       ber  of  the capturing parenthesis, most significant byte first. In the
2746       16-bit library, the pointer points to 16-bit data units, the  first  of
2747       which  contains  the  parenthesis  number.   In the 32-bit library, the
2748       pointer points to 32-bit data units, the first of  which  contains  the
2749       parenthesis  number.  The  rest of the entry is the corresponding name,
2750       zero terminated.
2751
2752       The names are in alphabetical order. Duplicate names may appear if  (?|
2753       is used to create multiple groups with the same number, as described in
2754       the section on duplicate subpattern numbers in  the  pcrepattern  page.
2755       Duplicate  names  for  subpatterns with different numbers are permitted
2756       only if PCRE_DUPNAMES is set. In all cases  of  duplicate  names,  they
2757       appear  in  the table in the order in which they were found in the pat-
2758       tern. In the absence of (?| this is the  order  of  increasing  number;
2759       when (?| is used this is not necessarily the case because later subpat-
2760       terns may have lower numbers.
2761
2762       As a simple example of the name/number table,  consider  the  following
2763       pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
2764       set, so white space - including newlines - is ignored):
2765
2766         (?<date> (?<year>(\d\d)?\d\d) -
2767         (?<month>\d\d) - (?<day>\d\d) )
2768
       There are four named subpatterns, so the table has four entries, and
       each entry in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??
2778
2779       When  writing  code  to  extract  data from named subpatterns using the
2780       name-to-number map, remember that the length of the entries  is  likely
2781       to be different for each compiled pattern.
2782
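       A minimal sketch of a lookup in the 8-bit form of the table follows.
       The function find_group_number() is a hypothetical helper, not a PCRE
       function, and example_table reproduces the table shown above with the
       undefined bytes written as zero. The entry size is always taken from
       the caller rather than hard-coded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up a named subpattern in an 8-bit name table and return its
 * number, or -1 if the name is not present.
 *
 * table      - value obtained with PCRE_INFO_NAMETABLE
 * count      - value obtained with PCRE_INFO_NAMECOUNT
 * entry_size - value obtained with PCRE_INFO_NAMEENTRYSIZE
 */
int find_group_number(const unsigned char *table, int count,
                      int entry_size, const char *name)
{
    int i;

    assert(table != NULL && name != NULL && entry_size > 2);

    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * entry_size;

        /* Bytes 0-1 hold the group number, most significant byte first;
           the zero-terminated name starts at byte 2. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The four-entry table from the example above. The bytes shown there as
   ?? are undefined; they are written as zero here for convenience. */
const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

       With the example table, a count of 4 and an entry size of 8, looking
       up "month" yields 4, and a name that is not in the table yields -1.
       When duplicate names are present, this sketch returns the number from
       the first matching entry only.
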
2783         PCRE_INFO_OKPARTIAL
2784
2785       Return  1  if  the  pattern  can  be  used  for  partial  matching with
2786       pcre_exec(), otherwise 0. The fourth argument should point  to  an  int
2787       variable.  From  release  8.00,  this  always  returns  1,  because the
2788       restrictions that previously applied  to  partial  matching  have  been
2789       lifted.  The  pcrepartial documentation gives details of partial match-
2790       ing.
2791
2792         PCRE_INFO_OPTIONS
2793
2794       Return a copy of the options with which the pattern was  compiled.  The
2795       fourth  argument  should  point to an unsigned long int variable. These
2796       option bits are those specified in the call to pcre_compile(), modified
2797       by any top-level option settings at the start of the pattern itself. In
2798       other words, they are the options that will be in force  when  matching
2799       starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
2800       the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
2801       and PCRE_EXTENDED.
2802
2803       A  pattern  is  automatically  anchored by PCRE if all of its top-level
2804       alternatives begin with one of the following:
2805
2806         ^     unless PCRE_MULTILINE is set
2807         \A    always
2808         \G    always
2809         .*    if PCRE_DOTALL is set and there are no back
2810                 references to the subpattern in which .* appears
2811
2812       For such patterns, the PCRE_ANCHORED bit is set in the options returned
2813       by pcre_fullinfo().
2814
2815         PCRE_INFO_SIZE
2816
       Return the size of the compiled pattern in bytes (for all three
       libraries). The fourth argument should point to a size_t variable.
       This value does not include the size of the pcre structure that is
       returned by pcre_compile(). The value that is passed as the argument
       to pcre_malloc() when pcre_compile() is getting memory in which to
       place the compiled data is the value returned by this option plus the
       size of the pcre structure. Studying a compiled pattern, with or
       without JIT, does not alter the value returned by this option.
2825
2826         PCRE_INFO_STUDYSIZE
2827
2828       Return the size in bytes of the data block pointed to by the study_data
2829       field  in  a  pcre_extra  block.  If pcre_extra is NULL, or there is no
2830       study data, zero is returned. The fourth argument  should  point  to  a
2831       size_t  variable. The study_data field is set by pcre_study() to record
2832       information that will speed  up  matching  (see  the  section  entitled
2833       "Studying a pattern" above). The format of the study_data block is pri-
2834       vate, but its length is made available via this option so that  it  can
2835       be  saved  and  restored  (see  the  pcreprecompile  documentation  for
2836       details).
2837
2838         PCRE_INFO_FIRSTCHARACTERFLAGS
2839
2840       Return information about the first data unit of any matched string, for
2841       a  non-anchored  pattern.  The  fourth  argument should point to an int
2842       variable.
2843
2844       If there is a fixed first value, for example, the  letter  "c"  from  a
2845       pattern  such  as  (cat|cow|coyote),  1  is returned, and the character
2846       value can be retrieved using PCRE_INFO_FIRSTCHARACTER.
2847
2848       If there is no fixed first value, and if either
2849
2850       (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
2851       branch starts with "^", or
2852
2853       (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2854       set (if it were set, the pattern would be anchored),
2855
2856       2 is returned, indicating that the pattern matches only at the start of
2857       a subject string or after any newline within the string. Otherwise 0 is
2858       returned. For anchored patterns, 0 is returned.
2859
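       The three possible results can be told apart as in the sketch below,
       which turns the flags value, and the character value where it applies,
       into a one-line description. The function describe_first_unit() is a
       hypothetical helper, not a PCRE function; it assumes that the caller
       has already obtained both values with pcre_fullinfo().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * flags - value obtained with PCRE_INFO_FIRSTCHARACTERFLAGS
 * first - value obtained with PCRE_INFO_FIRSTCHARACTER; it is used only
 *         when flags is 1
 *
 * Writes a description into buf and returns buf.
 */
const char *describe_first_unit(int flags, uint32_t first,
                                char *buf, size_t bufsize)
{
    assert(buf != NULL && bufsize > 0);

    switch (flags) {
    case 1:     /* a fixed first value exists */
        snprintf(buf, bufsize, "fixed first data unit 0x%lx",
                 (unsigned long)first);
        break;
    case 2:     /* no fixed value, but the start position is constrained */
        snprintf(buf, bufsize,
                 "matches only at subject start or after a newline");
        break;
    default:    /* 0: no information, or the pattern is anchored */
        snprintf(buf, bufsize, "no first data unit information");
        break;
    }
    return buf;
}
```

       For the pattern (cat|cow|coyote) used in the example above, the flags
       value is 1 and the character value is the letter "c", so the first
       branch of the switch applies.
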
2860         PCRE_INFO_FIRSTCHARACTER
2861
       Return the fixed first character value, if
       PCRE_INFO_FIRSTCHARACTERFLAGS returned 1; otherwise return 0. The
       fourth argument should point to a uint32_t variable.
2865
2866       In the 8-bit library, the value is always less than 256. In the  16-bit
2867       library  the value can be up to 0xffff. In the 32-bit library in UTF-32
2868       mode the value can be up to 0x10ffff, and up  to  0xffffffff  when  not
2869       using UTF-32 mode.
2870
       If there is no fixed first value, 0 is returned, whatever the reason.
       Whether the pattern can match only at the start of a subject string or
       after a newline is reported by PCRE_INFO_FIRSTCHARACTERFLAGS (which
       returns 2 in that case), not by this value.
2882
2883         PCRE_INFO_REQUIREDCHARFLAGS
2884
2885       Returns  1 if there is a rightmost literal data unit that must exist in
2886       any matched string, other than at its start. The fourth argument should
2887       point  to an int variable. If there is no such value, 0 is returned. If
2888       returning  1,  the  character  value  itself  can  be  retrieved  using
2889       PCRE_INFO_REQUIREDCHAR.
2890
       For anchored patterns, a last literal value is recorded only if it
       follows something of variable length. For example, for the pattern
       /^a\d+z\d+/ the returned value is 1 (with "z" returned from
       PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.
2895
2896         PCRE_INFO_REQUIREDCHAR
2897
       Return the value of the rightmost literal data unit that must exist in
       any matched string, other than at its start, if such a value has been
       recorded. The fourth argument should point to a uint32_t variable. If
       there is no such value, 0 is returned.
2902
2903
2904REFERENCE COUNTS
2905
2906       int pcre_refcount(pcre *code, int adjust);
2907
2908       The  pcre_refcount()  function is used to maintain a reference count in
2909       the data block that contains a compiled pattern. It is provided for the
2910       benefit  of  applications  that  operate  in an object-oriented manner,
2911       where different parts of the application may be using the same compiled
2912       pattern, but you want to free the block when they are all done.
2913
2914       When a pattern is compiled, the reference count field is initialized to
2915       zero.  It is changed only by calling this function, whose action is  to
2916       add  the  adjust  value  (which may be positive or negative) to it. The
2917       yield of the function is the new value. However, the value of the count
2918       is  constrained to lie between 0 and 65535, inclusive. If the new value
2919       is outside these limits, it is forced to the appropriate limit value.
2920
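       The arithmetic just described can be modelled in a few lines. This is
       only a sketch of the documented behaviour: model_refcount() is a
       hypothetical stand-in that works on a plain integer, it is not PCRE's
       implementation, and it does not touch a compiled pattern.

```c
#include <assert.h>

/*
 * Add adjust (positive or negative) to the current count and force the
 * result into the range 0..65535, as the text above describes. The
 * return value is the new count.
 */
int model_refcount(int current, int adjust)
{
    long long value;

    assert(current >= 0 && current <= 65535);

    value = (long long)current + adjust;    /* cannot overflow long long */
    if (value < 0)
        value = 0;                          /* forced to the lower limit */
    if (value > 65535)
        value = 65535;                      /* forced to the upper limit */
    return (int)value;
}
```

       Because the count cannot go below zero, an application that pairs
       each increment with one decrement might treat a yield of zero after a
       decrement as the signal that the block can be freed.
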
2921       Except when it is zero, the reference count is not correctly  preserved
2922       if  a  pattern  is  compiled on one host and then transferred to a host
2923       whose byte-order is different. (This seems a highly unlikely scenario.)
2924
2925
2926MATCHING A PATTERN: THE TRADITIONAL FUNCTION
2927
2928       int pcre_exec(const pcre *code, const pcre_extra *extra,
2929            const char *subject, int length, int startoffset,
2930            int options, int *ovector, int ovecsize);
2931
2932       The function pcre_exec() is called to match a subject string against  a
2933       compiled  pattern, which is passed in the code argument. If the pattern
2934       was studied, the result of the study should  be  passed  in  the  extra
2935       argument.  You  can call pcre_exec() with the same code and extra argu-
2936       ments as many times as you like, in order to  match  different  subject
2937       strings with the same pattern.
2938
2939       This  function  is  the  main  matching facility of the library, and it
2940       operates in a Perl-like manner. For specialist use  there  is  also  an
2941       alternative  matching function, which is described below in the section
2942       about the pcre_dfa_exec() function.
2943
2944       In most applications, the pattern will have been compiled (and  option-
2945       ally  studied)  in the same process that calls pcre_exec(). However, it
2946       is possible to save compiled patterns and study data, and then use them
2947       later  in  different processes, possibly even on different hosts. For a
2948       discussion about this, see the pcreprecompile documentation.
2949
2950       Here is an example of a simple call to pcre_exec():
2951
         int rc;
         int ovector[30];
         rc = pcre_exec(
           re,             /* result of pcre_compile() */
           NULL,           /* we didn't study the pattern */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           ovector,        /* vector of integers for substring information */
           30);            /* number of elements (NOT size in bytes) */
2963
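       One way to avoid passing a byte count by mistake as the final argument
       is to derive the element count from the array itself. The macro
       ELEMENT_COUNT and the function ovector_count_example() below are
       hypothetical names used only for this sketch.

```c
#include <assert.h>

/* Number of elements in a true array. This does not work on a pointer,
   for example on an array that has been passed to another function. */
#define ELEMENT_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Return the value that would be passed as the ovecsize argument for a
   30-element vector such as the one in the call above. */
int ovector_count_example(void)
{
    int ovector[30];

    /* sizeof(ovector) alone would be the size in bytes, which is larger
       than the element count whenever int is wider than one byte. */
    assert(ELEMENT_COUNT(ovector) == 30);
    return (int)ELEMENT_COUNT(ovector);
}
```

       In the call shown above, the literal 30 could then be written as
       (int)ELEMENT_COUNT(ovector), so that the declaration and the argument
       cannot drift apart.
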
2964   Extra data for pcre_exec()
2965
2966       If the extra argument is not NULL, it must point to a  pcre_extra  data
2967       block.  The pcre_study() function returns such a block (when it doesn't
2968       return NULL), but you can also create one for yourself, and pass  addi-
2969       tional  information  in it. The pcre_extra block contains the following
2970       fields (not necessarily in this order):
2971
2972         unsigned long int flags;
2973         void *study_data;
2974         void *executable_jit;
2975         unsigned long int match_limit;
2976         unsigned long int match_limit_recursion;
2977         void *callout_data;
2978         const unsigned char *tables;
2979         unsigned char **mark;
2980
2981       In the 16-bit version of  this  structure,  the  mark  field  has  type
2982       "PCRE_UCHAR16 **".
2983
2984       In  the  32-bit  version  of  this  structure,  the mark field has type
2985       "PCRE_UCHAR32 **".
2986
2987       The flags field is used to specify which of the other fields  are  set.
2988       The flag bits are:
2989
2990         PCRE_EXTRA_CALLOUT_DATA
2991         PCRE_EXTRA_EXECUTABLE_JIT
2992         PCRE_EXTRA_MARK
2993         PCRE_EXTRA_MATCH_LIMIT
2994         PCRE_EXTRA_MATCH_LIMIT_RECURSION
2995         PCRE_EXTRA_STUDY_DATA
2996         PCRE_EXTRA_TABLES
2997
2998       Other  flag  bits should be set to zero. The study_data field and some-
2999       times the executable_jit field are set in the pcre_extra block that  is
3000       returned  by pcre_study(), together with the appropriate flag bits. You
3001       should not set these yourself, but you may add to the block by  setting
3002       other fields and their corresponding flag bits.
3003
3004       The match_limit field provides a means of preventing PCRE from using up
3005       a vast amount of resources when running patterns that are not going  to
3006       match,  but  which  have  a very large number of possibilities in their
3007       search trees. The classic example is a pattern that uses nested  unlim-
3008       ited repeats.
3009
3010       Internally,  pcre_exec() uses a function called match(), which it calls
3011       repeatedly (sometimes recursively). The limit  set  by  match_limit  is
3012       imposed  on the number of times this function is called during a match,
3013       which has the effect of limiting the amount of  backtracking  that  can
3014       take place. For patterns that are not anchored, the count restarts from
3015       zero for each position in the subject string.
3016
3017       When pcre_exec() is called with a pattern that was successfully studied
3018       with  a  JIT  option, the way that the matching is executed is entirely
3019       different.  However, there is still the possibility of runaway matching
3020       that goes on for a very long time, and so the match_limit value is also
3021       used in this case (but in a different way) to limit how long the match-
3022       ing can continue.
3023
       The default value for the limit can be set when PCRE is built; if no
       value is chosen at build time, the default is 10 million, which
       handles all but the most extreme cases. You can override the default
       by supplying pcre_exec() with a pcre_extra block in which match_limit
       is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the
       limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
3030
3031       The match_limit_recursion field is similar to match_limit, but  instead
3032       of limiting the total number of times that match() is called, it limits
3033       the depth of recursion. The recursion depth is a  smaller  number  than
3034       the  total number of calls, because not all calls to match() are recur-
3035       sive.  This limit is of use only if it is set smaller than match_limit.
3036
3037       Limiting the recursion depth limits the amount of  machine  stack  that
3038       can  be used, or, when PCRE has been compiled to use memory on the heap
3039       instead of the stack, the amount of heap memory that can be used.  This
3040       limit  is not relevant, and is ignored, when matching is done using JIT
3041       compiled code.
3042
       The default value for match_limit_recursion can be set when PCRE is
       built; if no value is chosen at build time, it is the same value as
       the default for match_limit. You can override the default by supplying
       pcre_exec() with a pcre_extra block in which match_limit_recursion is
       set, and PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field.
       If the limit is exceeded, pcre_exec() returns
       PCRE_ERROR_RECURSIONLIMIT.
3049
3050       The  callout_data  field is used in conjunction with the "callout" fea-
3051       ture, and is described in the pcrecallout documentation.
3052
3053       The tables field  is  used  to  pass  a  character  tables  pointer  to
3054       pcre_exec();  this overrides the value that is stored with the compiled
3055       pattern. A non-NULL value is stored with the compiled pattern  only  if
3056       custom  tables  were  supplied to pcre_compile() via its tableptr argu-
3057       ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
3058       PCRE's  internal  tables  to be used. This facility is helpful when re-
3059       using patterns that have been saved after compiling  with  an  external
3060       set  of  tables,  because  the  external tables might be at a different
3061       address when pcre_exec() is called. See the  pcreprecompile  documenta-
3062       tion for a discussion of saving compiled patterns for later use.
3063
3064       If  PCRE_EXTRA_MARK  is  set in the flags field, the mark field must be
3065       set to point to a suitable variable. If the pattern contains any  back-
3066       tracking  control verbs such as (*MARK:NAME), and the execution ends up
3067       with a name to pass back, a pointer to the  name  string  (zero  termi-
3068       nated)  is  placed  in  the  variable pointed to by the mark field. The
3069       names are within the compiled pattern; if you wish  to  retain  such  a
3070       name  you must copy it before freeing the memory of a compiled pattern.
3071       If there is no name to pass back, the variable pointed to by  the  mark
3072       field  is  set  to NULL. For details of the backtracking control verbs,
3073       see the section entitled "Backtracking control" in the pcrepattern doc-
3074       umentation.
3075
3076   Option bits for pcre_exec()
3077
       The unused bits of the options argument for pcre_exec() must be zero.
       The only bits that may be set are PCRE_ANCHORED, PCRE_BSR_ANYCRLF,
       PCRE_BSR_UNICODE, PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL,
       PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_NO_START_OPTIMIZE,
       PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT.
3083
3084       If  the  pattern  was successfully studied with one of the just-in-time
3085       (JIT) compile options, the only supported options for JIT execution are
3086       PCRE_NO_UTF8_CHECK,     PCRE_NOTBOL,     PCRE_NOTEOL,    PCRE_NOTEMPTY,
3087       PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If  an
3088       unsupported  option  is  used, JIT execution is disabled and the normal
3089       interpretive code in pcre_exec() is run.
3090
3091         PCRE_ANCHORED
3092
       The PCRE_ANCHORED option limits pcre_exec() to matching at the first
       matching position. If a pattern was compiled with PCRE_ANCHORED, or
       turned out to be anchored by virtue of its contents, it cannot be made
       unanchored at matching time.
3097
3098         PCRE_BSR_ANYCRLF
3099         PCRE_BSR_UNICODE
3100
3101       These options (which are mutually exclusive) control what the \R escape
3102       sequence matches. The choice is either to match only CR, LF,  or  CRLF,
3103       or  to  match  any Unicode newline sequence. These options override the
3104       choice that was made or defaulted when the pattern was compiled.
3105
3106         PCRE_NEWLINE_CR
3107         PCRE_NEWLINE_LF
3108         PCRE_NEWLINE_CRLF
3109         PCRE_NEWLINE_ANYCRLF
3110         PCRE_NEWLINE_ANY
3111
3112       These options override  the  newline  definition  that  was  chosen  or
3113       defaulted  when the pattern was compiled. For details, see the descrip-
3114       tion of pcre_compile()  above.  During  matching,  the  newline  choice
3115       affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
3116       ters. It may also alter the way the match position is advanced after  a
3117       match failure for an unanchored pattern.
3118
3119       When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is
3120       set, and a match attempt for an unanchored pattern fails when the  cur-
3121       rent  position  is  at  a  CRLF  sequence,  and the pattern contains no
3122       explicit matches for  CR  or  LF  characters,  the  match  position  is
3123       advanced by two characters instead of one, in other words, to after the
3124       CRLF.
3125
3126       The above rule is a compromise that makes the most common cases work as
3127       expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL
3128       option is not set), it does not match the string "\r\nA" because, after
3129       failing  at the start, it skips both the CR and the LF before retrying.
3130       However, the pattern [\r\n]A does match that string,  because  it  con-
3131       tains an explicit CR or LF reference, and so advances only by one char-
3132       acter after the first failure.
3133
       An explicit match for CR or LF is either a literal appearance of one of
       those characters, or one of the \r or \n escape sequences. Implicit
       matches such as [^X] do not count, nor does \s (which includes CR and
       LF in the characters that it matches).
3138
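       The advance rule described above can be modelled as follows. This is a
       sketch of the documented behaviour, not PCRE code: next_start_offset()
       is a hypothetical helper, and it assumes one byte per character (the
       8-bit library without UTF-8).

```c
#include <assert.h>
#include <stddef.h>

/*
 * After a failed match attempt at offset pos for an unanchored pattern,
 * return the offset at which the next attempt starts.
 *
 * crlf_is_newline - nonzero when PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF
 *                   or PCRE_NEWLINE_ANY is in force
 * has_crorlf      - nonzero when the pattern contains an explicit match
 *                   for CR or LF (the PCRE_INFO_HASCRORLF value)
 */
size_t next_start_offset(const char *subject, size_t length, size_t pos,
                         int crlf_is_newline, int has_crorlf)
{
    assert(subject != NULL && pos < length);

    if (crlf_is_newline && !has_crorlf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;     /* step over the whole CRLF */

    return pos + 1;         /* ordinary one-character advance */
}
```

       For the subject "\r\nA" and a failure at offset 0, the model gives 2
       for a pattern such as .+A, which has no explicit CR or LF, and 1 for
       [\r\n]A, which has one; this is why only the second pattern matches.
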
3139       Notwithstanding  the above, anomalous effects may still occur when CRLF
3140       is a valid newline sequence and explicit \r or \n escapes appear in the
3141       pattern.
3142
3143         PCRE_NOTBOL
3144
       This option specifies that the first character of the subject string
       is not the beginning of a line, so the circumflex metacharacter should
       not match before it. Setting this without PCRE_MULTILINE (at compile
       time) causes circumflex never to match. This option affects only the
       behaviour of the circumflex metacharacter. It does not affect \A.
3150
3151         PCRE_NOTEOL
3152
3153       This option specifies that the end of the subject string is not the end
3154       of a line, so the dollar metacharacter should not match it nor  (except
3155       in  multiline mode) a newline immediately before it. Setting this with-
3156       out PCRE_MULTILINE (at compile time) causes dollar never to match. This
3157       option  affects only the behaviour of the dollar metacharacter. It does
3158       not affect \Z or \z.
3159
3160         PCRE_NOTEMPTY
3161
3162       An empty string is not considered to be a valid match if this option is
3163       set.  If  there are alternatives in the pattern, they are tried. If all
3164       the alternatives match the empty string, the entire  match  fails.  For
3165       example, if the pattern
3166
3167         a?b?
3168
3169       is  applied  to  a  string not beginning with "a" or "b", it matches an
3170       empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
3171       match is not valid, so PCRE searches further into the string for occur-
3172       rences of "a" or "b".
3173
3174         PCRE_NOTEMPTY_ATSTART
3175
3176       This is like PCRE_NOTEMPTY, except that an empty string match  that  is
3177       not  at  the  start  of  the  subject  is  permitted. If the pattern is
3178       anchored, such a match can occur only if the pattern contains \K.
3179
3180       Perl    has    no    direct    equivalent    of    PCRE_NOTEMPTY     or
3181       PCRE_NOTEMPTY_ATSTART,  but  it  does  make a special case of a pattern
3182       match of the empty string within its split() function, and  when  using
3183       the  /g  modifier.  It  is  possible  to emulate Perl's behaviour after
3184       matching a null string by first trying the match again at the same off-
3185       set  with  PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED,  and then if that
3186       fails, by advancing the starting offset (see below) and trying an ordi-
3187       nary  match  again. There is some code that demonstrates how to do this
3188       in the pcredemo sample program. In the most general case, you  have  to
3189       check  to  see  if the newline convention recognizes CRLF as a newline,
3190       and if so, and the current character is CR followed by LF, advance  the
3191       starting offset by two characters instead of one.
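
       The offset adjustment in that last step can be sketched as follows.
       This is a minimal illustration, not code from PCRE or from pcredemo:
       next_start_offset() and its crlf_is_newline flag are hypothetical
       names, and the subject is assumed not to be UTF-8 (in UTF-8 mode the
       advance would also have to skip continuation bytes).

```c
/* Hypothetical helper: offset is where an empty match was found and the
   anchored, non-empty retry then failed. Returns the offset for the next
   ordinary match attempt, or -1 if offset is already at the end of the
   subject and there is nothing left to search. */
int next_start_offset(const char *subject, int length, int offset,
                      int crlf_is_newline)
{
    if (offset >= length)
        return -1;
    /* Step over CRLF as a unit when the convention treats it as a newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;
    return offset + 1;
}
```

       The caller would pass the returned value as startoffset in the next
       call to pcre_exec(), and stop looping when -1 is returned.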
3192
3193         PCRE_NO_START_OPTIMIZE
3194
3195       There  are a number of optimizations that pcre_exec() uses at the start
3196       of a match, in order to speed up the process. For  example,  if  it  is
3197       known that an unanchored match must start with a specific character, it
3198       searches the subject for that character, and fails  immediately  if  it
3199       cannot  find  it,  without actually running the main matching function.
3200       This means that a special item such as (*COMMIT) at the start of a pat-
3201       tern  is  not  considered until after a suitable starting point for the
3202       match has been found. When callouts or (*MARK) items are in use,  these
3203       "start-up" optimizations can cause them to be skipped if the pattern is
3204       never actually used. The start-up optimizations are in  effect  a  pre-
3205       scan of the subject that takes place before the pattern is run.
3206
3207       The  PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
3208       possibly causing performance to suffer,  but  ensuring  that  in  cases
3209       where  the  result is "no match", the callouts do occur, and that items
3210       such as (*COMMIT) and (*MARK) are considered at every possible starting
3211       position  in  the  subject  string. If PCRE_NO_START_OPTIMIZE is set at
3212       compile time,  it  cannot  be  unset  at  matching  time.  The  use  of
3213       PCRE_NO_START_OPTIMIZE disables JIT execution; when it is set, matching
3214       is always done interpretively.
3215
3216       Setting PCRE_NO_START_OPTIMIZE can change the  outcome  of  a  matching
3217       operation.  Consider the pattern
3218
3219         (*COMMIT)ABC
3220
3221       When  this  is  compiled, PCRE records the fact that a match must start
3222       with the character "A". Suppose the subject  string  is  "DEFABC".  The
3223       start-up  optimization  scans along the subject, finds "A" and runs the
3224       first match attempt from there. The (*COMMIT) item means that the  pat-
3225       tern must match at the current starting position, which in this case it
3226       does. However, if the same match  is  run  with  PCRE_NO_START_OPTIMIZE
3227       set,  the  initial  scan  along the subject string does not happen. The
3228       first match attempt is run starting  from  "D"  and  when  this  fails,
3229       (*COMMIT)  prevents  any  further  matches  being tried, so the overall
3230       result is "no match". If the pattern is studied,  more  start-up  opti-
3231       mizations  may  be  used. For example, a minimum length for the subject
3232       may be recorded. Consider the pattern
3233
3234         (*MARK:A)(X|Y)
3235
3236       The minimum length for a match is one  character.  If  the  subject  is
3237       "ABC",  there  will  be  attempts  to  match "ABC", "BC", "C", and then
3238       finally an empty string.  If the pattern is studied, the final  attempt
3239       does  not take place, because PCRE knows that the subject is too short,
3240       and so the (*MARK) is never encountered.  In this  case,  studying  the
3241       pattern  does  not  affect the overall match result, which is still "no
3242       match", but it does affect the auxiliary information that is returned.
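
       The "first character" pre-scan described above can be pictured with the
       sketch below. It only illustrates the idea and is not PCRE's code;
       prescan_first_char() is a hypothetical name. For the (*COMMIT)ABC
       example with subject "DEFABC", the scan yields offset 3, so the first
       match attempt starts at "A". With PCRE_NO_START_OPTIMIZE set, no such
       scan happens and the first attempt starts at offset 0.

```c
#include <string.h>

/* Illustration only: return the first offset at or after startoffset where
   the byte ch occurs in the subject, or -1 if it does not occur. */
int prescan_first_char(const char *subject, int length, int startoffset,
                       char ch)
{
    const char *hit;
    if (startoffset < 0 || startoffset > length)
        return -1;
    hit = memchr(subject + startoffset, ch, (size_t)(length - startoffset));
    return hit ? (int)(hit - subject) : -1;
}
```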
3243
3244         PCRE_NO_UTF8_CHECK
3245
3246       When PCRE_UTF8 is set at compile time, the validity of the subject as a
3247       UTF-8  string is automatically checked when pcre_exec() is subsequently
3248       called.  The entire string is checked before any other processing takes
3249       place.  The  value  of  startoffset  is  also checked to ensure that it
3250       points to the start of a UTF-8 character. There is a  discussion  about
3251       the  validity  of  UTF-8 strings in the pcreunicode page. If an invalid
3252       sequence  of  bytes   is   found,   pcre_exec()   returns   the   error
3253       PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
3254       truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
3255       both  cases, information about the precise nature of the error may also
3256       be returned (see the descriptions of these errors in the section  enti-
3257       tled  Error return values from pcre_exec() below).  If startoffset con-
3258       tains a value that does not point to the start of a UTF-8 character (or
3259       to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
3260
3261       If  you  already  know that your subject is valid, and you want to skip
3262       these   checks   for   performance   reasons,   you   can    set    the
3263       PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
3264       do this for the second and subsequent calls to pcre_exec() if  you  are
3265       making  repeated  calls  to  find  all  the matches in a single subject
3266       string. However, you should be  sure  that  the  value  of  startoffset
3267       points  to  the  start of a character (or the end of the subject). When
3268       PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
3269       subject  or  an invalid value of startoffset is undefined. Your program
3270       may crash.
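
       Before setting PCRE_NO_UTF8_CHECK, a caller can cheaply confirm that
       startoffset does not land inside a character. The function below is a
       sketch under that assumption; is_utf8_start() is a hypothetical name,
       and the check says nothing about whether the subject as a whole is
       valid UTF-8.

```c
/* Return 1 if offset is a plausible startoffset for a UTF-8 subject: it
   lies within [0, length] and does not point at a continuation byte
   (binary 10xxxxxx). The subject itself is not validated. */
int is_utf8_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;                      /* the end of the subject is allowed */
    return ((unsigned char)subject[offset] & 0xc0) != 0x80;
}
```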
3271
3272         PCRE_PARTIAL_HARD
3273         PCRE_PARTIAL_SOFT
3274
3275       These options turn on the partial matching feature. For backwards  com-
3276       patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
3277       match occurs if the end of the subject string is reached  successfully,
3278       but  there  are not enough subject characters to complete the match. If
3279       this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
3280       matching  continues  by  testing any remaining alternatives. Only if no
3281       complete match can be found is PCRE_ERROR_PARTIAL returned  instead  of
3282       PCRE_ERROR_NOMATCH.  In  other  words,  PCRE_PARTIAL_SOFT says that the
3283       caller is prepared to handle a partial match, but only if  no  complete
3284       match can be found.
3285
3286       If  PCRE_PARTIAL_HARD  is  set, it overrides PCRE_PARTIAL_SOFT. In this
3287       case, if a partial match  is  found,  pcre_exec()  immediately  returns
3288       PCRE_ERROR_PARTIAL,  without  considering  any  other  alternatives. In
3289       other words, when PCRE_PARTIAL_HARD is set, a partial match is  consid-
3290       ered to be more important than an alternative complete match.
3291
3292       In  both  cases,  the portion of the string that was inspected when the
3293       partial match was found is set as the first matching string. There is a
3294       more  detailed  discussion  of partial and multi-segment matching, with
3295       examples, in the pcrepartial documentation.
3296
3297   The string to be matched by pcre_exec()
3298
3299       The subject string is passed to pcre_exec() as a pointer in subject,  a
3300       length  in  bytes in length, and a starting byte offset in startoffset.
3301       If startoffset is negative or greater than the length of  the  subject,
3302       pcre_exec()  returns  PCRE_ERROR_BADOFFSET. When the starting offset is
3303       zero, the search for a match starts at the beginning  of  the  subject,
3304       and this is by far the most common case. In UTF-8 mode, the byte offset
3305       must point to the start of a UTF-8 character (or the end  of  the  sub-
3306       ject).  Unlike  the pattern string, the subject may contain binary zero
3307       bytes.
3308
3309       A non-zero starting offset is useful when searching for  another  match
3310       in  the same subject by calling pcre_exec() again after a previous suc-
3311       cess.  Setting startoffset differs from just passing over  a  shortened
3312       string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
3313       with any kind of lookbehind. For example, consider the pattern
3314
3315         \Biss\B
3316
3317       which finds occurrences of "iss" in the middle of  words.  (\B  matches
3318       only  if  the  current position in the subject is not a word boundary.)
3319       When applied to the string "Mississippi" the first call to  pcre_exec()
3320       finds  the  first  occurrence. If pcre_exec() is called again with just
3321       the remainder of the subject, namely  "issippi",  it  does  not  match,
3322       because \B is always false at the start of the subject, which is deemed
3323       to be a word boundary. However, if pcre_exec()  is  passed  the  entire
3324       string again, but with startoffset set to 4, it finds the second occur-
3325       rence of "iss" because it is able to look behind the starting point  to
3326       discover that it is preceded by a letter.
3327
3328       Finding  all  the  matches  in a subject is tricky when the pattern can
3329       match an empty string. It is possible to emulate Perl's /g behaviour by
3330       first   trying   the   match   again  at  the  same  offset,  with  the
3331       PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED  options,  and  then  if  that
3332       fails,  advancing  the  starting  offset  and  trying an ordinary match
3333       again. There is some code that demonstrates how to do this in the pcre-
3334       demo sample program. In the most general case, you have to check to see
3335       if the newline convention recognizes CRLF as a newline, and if so,  and
3336       the current character is CR followed by LF, advance the starting offset
3337       by two characters instead of one.
3338
3339       If a non-zero starting offset is passed when the pattern  is  anchored,
3340       one attempt to match at the given offset is made. This can only succeed
3341       if the pattern does not require the match to be at  the  start  of  the
3342       subject.
3343
3344   How pcre_exec() returns captured substrings
3345
3346       In  general, a pattern matches a certain portion of the subject, and in
3347       addition, further substrings from the subject  may  be  picked  out  by
3348       parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
3349       this is called "capturing" in what follows, and the  phrase  "capturing
3350       subpattern"  is  used for a fragment of a pattern that picks out a sub-
3351       string. PCRE supports several other kinds of  parenthesized  subpattern
3352       that do not cause substrings to be captured.
3353
3354       Captured substrings are returned to the caller via a vector of integers
3355       whose address is passed in ovector. The number of elements in the  vec-
3356       tor  is  passed in ovecsize, which must be a non-negative number. Note:
3357       this argument is NOT the size of ovector in bytes.
3358
3359       The first two-thirds of the vector is used to pass back  captured  sub-
3360       strings,  each  substring using a pair of integers. The remaining third
3361       of the vector is used as workspace by pcre_exec() while  matching  cap-
3362       turing  subpatterns, and is not available for passing back information.
3363       The number passed in ovecsize should always be a multiple of three.  If
3364       it is not, it is rounded down.
3365
3366       When  a  match  is successful, information about captured substrings is
3367       returned in pairs of integers, starting at the  beginning  of  ovector,
3368       and  continuing  up  to two-thirds of its length at the most. The first
3369       element of each pair is set to the byte offset of the  first  character
3370       in  a  substring, and the second is set to the byte offset of the first
3371       character after the end of a substring. Note: these values  are  always
3372       byte offsets, even in UTF-8 mode. They are not character counts.
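
       Because the offsets are byte offsets, a caller that needs a character
       count in UTF-8 mode has to derive it. A minimal sketch, assuming the
       subject is valid UTF-8 and that both offsets fall on character
       boundaries (utf8_chars_between() is a hypothetical name):

```c
/* Count the UTF-8 characters in subject[start..end), where start and end
   are byte offsets such as an ovector pair. */
int utf8_chars_between(const char *subject, int start, int end)
{
    int count = 0;
    int i;
    for (i = start; i < end; i++) {
        /* Every byte that is not a continuation byte starts a character. */
        if (((unsigned char)subject[i] & 0xc0) != 0x80)
            count++;
    }
    return count;
}
```

       For a whole match this would be called with ovector[0] and ovector[1].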
3373
3374       The  first  pair  of  integers, ovector[0] and ovector[1], identify the
3375       portion of the subject string matched by the entire pattern.  The  next
3376       pair  is  used for the first capturing subpattern, and so on. The value
3377       returned by pcre_exec() is one more than the highest numbered pair that
3378       has  been  set.  For example, if two substrings have been captured, the
3379       returned value is 3. If there are no capturing subpatterns, the  return
3380       value from a successful match is 1, indicating that just the first pair
3381       of offsets has been set.
3382
3383       If a capturing subpattern is matched repeatedly, it is the last portion
3384       of the string that it matched that is returned.
3385
3386       If  the vector is too small to hold all the captured substring offsets,
3387       it is used as far as possible (up to two-thirds of its length), and the
3388       function  returns a value of zero. If neither the actual string matched
3389       nor any captured substrings are of interest, pcre_exec() may be  called
3390       with  ovector passed as NULL and ovecsize as zero. However, if the pat-
3391       tern contains back references and the ovector  is  not  big  enough  to
3392       remember  the related substrings, PCRE has to get additional memory for
3393       use during matching. Thus it is usually advisable to supply an  ovector
3394       of reasonable size.
3395
3396       There  are  some  cases where zero is returned (indicating vector over-
3397       flow) when in fact the vector is exactly the right size for  the  final
3398       match. For example, consider the pattern
3399
3400         (a)(?:(b)c|bd)
3401
3402       If  a  vector of 6 elements (allowing for only 1 captured substring) is
3403       given with subject string "abd", pcre_exec() will try to set the second
3404       captured string, thereby recording a vector overflow, before failing to
3405       match "c" and backing up  to  try  the  second  alternative.  The  zero
3406       return,  however,  does  correctly  indicate that the maximum number of
3407       slots (namely 2) has been filled. In similar cases where there is  tem-
3408       porary  overflow,  but  the final number of used slots is actually less
3409       than the maximum, a non-zero value is returned.
3410
3411       The pcre_fullinfo() function can be used to find out how many capturing
3412       subpatterns  there  are  in  a  compiled pattern. The smallest size for
3413       ovector that will allow for n captured substrings, in addition  to  the
3414       offsets of the substring matched by the whole pattern, is (n+1)*3.
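
       The sizing arithmetic can be written down directly. The two helper
       names below are hypothetical; the formulas are the ones stated in the
       text.

```c
/* Smallest ovecsize that leaves room for n capturing subpatterns plus the
   whole-match pair: two thirds for offsets, one third for workspace. */
int ovecsize_for_captures(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs pcre_exec() can return for a given ovecsize; a
   size that is not a multiple of three is rounded down first. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

       The capture count n can be obtained by calling pcre_fullinfo() with
       PCRE_INFO_CAPTURECOUNT.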
3415
3416       It  is  possible for capturing subpattern number n+1 to match some part
3417       of the subject when subpattern n has not been used at all. For example,
3418       if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
3419       return from the function is 4, and subpatterns 1 and 3 are matched, but
3420       2  is  not.  When  this happens, both values in the offset pairs corre-
3421       sponding to unused subpatterns are set to -1.
3422
3423       Offset values that correspond to unused subpatterns at the end  of  the
3424       expression  are  also  set  to  -1. For example, if the string "abc" is
3425       matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
3426       matched.  The  return  from the function is 2, because the highest used
3427       capturing subpattern number is 1, and the  offsets  for  the  second
3428       and  third  capturing subpatterns (assuming the vector is large enough,
3429       of course) are set to -1.
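
       Both situations can be handled by one test, sketched here with the
       hypothetical name capture_is_set(). It assumes rc is a positive return
       value from pcre_exec().

```c
/* rc is the positive return value of pcre_exec(); n is a group number
   (0 for the whole match). A group is unset if it lies beyond the highest
   pair that was set, or if its offsets are -1. */
int capture_is_set(const int *ovector, int rc, int n)
{
    if (n < 0 || n >= rc)
        return 0;
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

       For "abc" matched against (a|(z))(bc), the return value is 4 and the
       offset pairs are (0,3), (0,1), (-1,-1), (1,3), so groups 1 and 3 test
       as set and group 2 as unset.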
3430
3431       Note: Elements in the first two-thirds of ovector that  do  not  corre-
3432       spond  to  capturing parentheses in the pattern are never changed. That
3433       is, if a pattern contains n capturing parentheses, no more  than  ovec-
3434       tor[0]  to ovector[2n+1] are set by pcre_exec(). The other elements (in
3435       the first two-thirds) retain whatever values they previously had.
3436
3437       Some convenience functions are provided  for  extracting  the  captured
3438       substrings as separate strings. These are described below.
3439
3440   Error return values from pcre_exec()
3441
3442       If  pcre_exec()  fails, it returns a negative number. The following are
3443       defined in the header file:
3444
3445         PCRE_ERROR_NOMATCH        (-1)
3446
3447       The subject string did not match the pattern.
3448
3449         PCRE_ERROR_NULL           (-2)
3450
3451       Either code or subject was passed as NULL,  or  ovector  was  NULL  and
3452       ovecsize was not zero.
3453
3454         PCRE_ERROR_BADOPTION      (-3)
3455
3456       An unrecognized bit was set in the options argument.
3457
3458         PCRE_ERROR_BADMAGIC       (-4)
3459
3460       PCRE  stores a 4-byte "magic number" at the start of the compiled code,
3461       to catch the case when it is passed a junk pointer and to detect when a
3462       pattern that was compiled in an environment of one endianness is run in
3463       an environment with the other endianness. This is the error  that  PCRE
3464       gives when the magic number is not present.
3465
3466         PCRE_ERROR_UNKNOWN_OPCODE (-5)
3467
3468       While running the pattern match, an unknown item was encountered in the
3469       compiled pattern. This error could be caused by a bug  in  PCRE  or  by
3470       overwriting of the compiled pattern.
3471
3472         PCRE_ERROR_NOMEMORY       (-6)
3473
3474       If  a  pattern contains back references, but the ovector that is passed
3475       to pcre_exec() is not big enough to remember the referenced substrings,
3476       PCRE  gets  a  block of memory at the start of matching to use for this
3477       purpose. If the call via pcre_malloc() fails, this error is given.  The
3478       memory is automatically freed at the end of matching.
3479
3480       This  error  is also given if pcre_stack_malloc() fails in pcre_exec().
3481       This can happen only when PCRE has been compiled with  --disable-stack-
3482       for-recursion.
3483
3484         PCRE_ERROR_NOSUBSTRING    (-7)
3485
3486       This  error is used by the pcre_copy_substring(), pcre_get_substring(),
3487       and  pcre_get_substring_list()  functions  (see  below).  It  is  never
3488       returned by pcre_exec().
3489
3490         PCRE_ERROR_MATCHLIMIT     (-8)
3491
3492       The  backtracking  limit,  as  specified  by the match_limit field in a
3493       pcre_extra structure (or defaulted) was reached.  See  the  description
3494       above.
3495
3496         PCRE_ERROR_CALLOUT        (-9)
3497
3498       This error is never generated by pcre_exec() itself. It is provided for
3499       use by callout functions that want to yield a distinctive  error  code.
3500       See the pcrecallout documentation for details.
3501
3502         PCRE_ERROR_BADUTF8        (-10)
3503
3504       A  string  that contains an invalid UTF-8 byte sequence was passed as a
3505       subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size  of
3506       the  output  vector  (ovecsize)  is  at least 2, the byte offset to the
3507       start of the invalid UTF-8 character is placed in the  first  ele-
3508       ment,  and  a  reason  code is placed in the second element. The reason
3509       codes are listed in the following section.  For backward compatibility,
3510       if  PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
3511       acter  at  the  end  of  the   subject   (reason   codes   1   to   5),
3512       PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
3513
3514         PCRE_ERROR_BADUTF8_OFFSET (-11)
3515
3516       The  UTF-8  byte  sequence that was passed as a subject was checked and
3517       found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but  the
3518       value  of startoffset did not point to the beginning of a UTF-8 charac-
3519       ter or the end of the subject.
3520
3521         PCRE_ERROR_PARTIAL        (-12)
3522
3523       The subject string did not match, but it did match partially.  See  the
3524       pcrepartial documentation for details of partial matching.
3525
3526         PCRE_ERROR_BADPARTIAL     (-13)
3527
3528       This  code  is  no  longer  in  use.  It was formerly returned when the
3529       PCRE_PARTIAL option was used with a compiled pattern  containing  items
3530       that  were  not  supported  for  partial  matching.  From  release 8.00
3531       onwards, there are no restrictions on partial matching.
3532
3533         PCRE_ERROR_INTERNAL       (-14)
3534
3535       An unexpected internal error has occurred. This error could  be  caused
3536       by a bug in PCRE or by overwriting of the compiled pattern.
3537
3538         PCRE_ERROR_BADCOUNT       (-15)
3539
3540       This error is given if the value of the ovecsize argument is negative.
3541
3542         PCRE_ERROR_RECURSIONLIMIT (-21)
3543
3544       The internal recursion limit, as specified by the match_limit_recursion
3545       field in a pcre_extra structure (or defaulted)  was  reached.  See  the
3546       description above.
3547
3548         PCRE_ERROR_BADNEWLINE     (-23)
3549
3550       An invalid combination of PCRE_NEWLINE_xxx options was given.
3551
3552         PCRE_ERROR_BADOFFSET      (-24)
3553
3554       The value of startoffset was negative or greater than the length of the
3555       subject, that is, the value in length.
3556
3557         PCRE_ERROR_SHORTUTF8      (-25)
3558
3559       This error is returned instead of PCRE_ERROR_BADUTF8 when  the  subject
3560       string  ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
3561       option is set.  Information  about  the  failure  is  returned  as  for
3562       PCRE_ERROR_BADUTF8. That information suffices to detect this case,  but
3563       this special error code for PCRE_PARTIAL_HARD precedes the  implementa-
3564       tion  of returned information; it is retained for backwards compatibil-
3565       ity.
3566
3567         PCRE_ERROR_RECURSELOOP    (-26)
3568
3569       This error is returned when pcre_exec() detects a recursion loop within
3570       the  pattern. Specifically, it means that either the whole pattern or a
3571       subpattern has been called recursively for the second time at the  same
3572       position in the subject string. Some simple patterns that might do this
3573       are detected and faulted at compile time, but more  complicated  cases,
3574       in particular mutual recursions between two different subpatterns, can-
3575       not be detected until run time.
3576
3577         PCRE_ERROR_JIT_STACKLIMIT (-27)
3578
3579       This error is returned when a pattern  that  was  successfully  studied
3580       using  a  JIT compile option is being matched, but the memory available
3581       for the just-in-time processing stack is  not  large  enough.  See  the
3582       pcrejit documentation for more details.
3583
3584         PCRE_ERROR_BADMODE        (-28)
3585
3586       This error is given if a pattern that was compiled by the 8-bit library
3587       is passed to a 16-bit or 32-bit library function, or vice versa.
3588
3589         PCRE_ERROR_BADENDIANNESS  (-29)
3590
3591       This error is given if  a  pattern  that  was  compiled  and  saved  is
3592       reloaded  on  a  host  with  different endianness. The utility function
3593       pcre_pattern_to_host_byte_order() can be used to convert such a pattern
3594       so that it runs on the new host.
3595
3596         PCRE_ERROR_JIT_BADOPTION  (-31)
3597
3598       This  error  is  returned  when a pattern that was successfully studied
3599       using a JIT compile option is being  matched,  but  the  matching  mode
3600       (partial  or complete match) does not correspond to any JIT compilation
3601       mode. When the JIT fast path function is used, this error may  be  also
3602       given  for  invalid  options.  See  the  pcrejit documentation for more
3603       details.
3604
3605         PCRE_ERROR_BADLENGTH      (-32)
3606
3607       This error is given if pcre_exec() is called with a negative value  for
3608       the length argument.
3609
3610       Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
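
       For diagnostics it can help to turn a negative return into a name. The
       sketch below covers only some of the codes, and exec_error_name() is a
       hypothetical function. The numeric values are copied from the list
       above so that the fragment stands alone; a real program would include
       pcre.h and use the PCRE_ERROR_xxx macros.

```c
/* Map some pcre_exec() error returns to their names. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -24: return "PCRE_ERROR_BADOFFSET";
    case -25: return "PCRE_ERROR_SHORTUTF8";
    default:  return rc < 0 ? "other error" : "not an error";
    }
}
```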
3611
3612   Reason codes for invalid UTF-8 strings
3613
3614       This  section  applies  only  to  the  8-bit library. The corresponding
3615       information for the 16-bit and 32-bit libraries is given in the  pcre16
3616       and pcre32 pages.
3617
3618       When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
3619       UTF8, and the size of the output vector (ovecsize) is at least  2,  the
3620       offset  of  the  start  of the invalid UTF-8 character is placed in the
3621       first output vector element (ovector[0]) and a reason code is placed in
3622       the  second  element  (ovector[1]). The reason codes are given names in
3623       the pcre.h header file:
3624
3625         PCRE_UTF8_ERR1
3626         PCRE_UTF8_ERR2
3627         PCRE_UTF8_ERR3
3628         PCRE_UTF8_ERR4
3629         PCRE_UTF8_ERR5
3630
3631       The string ends with a truncated UTF-8 character;  the  code  specifies
3632       how  many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
3633       characters to be no longer than 4 bytes, the  encoding  scheme  (origi-
3634       nally  defined  by  RFC  2279)  allows  for  up to 6 bytes, and this is
3635       checked first; hence the possibility of 4 or 5 missing bytes.
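
       The missing-byte count follows from the first byte of the character. A
       sketch of that calculation under the original 1-to-6 byte scheme, with
       hypothetical function names (this is not the library's validation
       code):

```c
/* Length implied by the first byte under the original 1-to-6 byte scheme,
   or 0 if the byte cannot start a character. */
int utf8_implied_length(unsigned char first)
{
    if (first < 0x80) return 1;
    if (first < 0xc0) return 0;   /* continuation byte */
    if (first < 0xe0) return 2;
    if (first < 0xf0) return 3;
    if (first < 0xf8) return 4;
    if (first < 0xfc) return 5;
    if (first < 0xfe) return 6;
    return 0;                     /* 0xfe and 0xff never start a character */
}

/* Number of bytes missing when only avail bytes (avail >= 1) remain at p.
   0 means the character is not truncated, or p[0] is not a valid start. */
int utf8_missing_bytes(const unsigned char *p, int avail)
{
    int need = utf8_implied_length(p[0]);
    return need > avail ? need - avail : 0;
}
```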
3636
3637         PCRE_UTF8_ERR6
3638         PCRE_UTF8_ERR7
3639         PCRE_UTF8_ERR8
3640         PCRE_UTF8_ERR9
3641         PCRE_UTF8_ERR10
3642
3643       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
3644       the  character  do  not have the binary value 0b10 (that is, either the
3645       most significant bit is 0, or the next bit is 1).
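
       The continuation-byte test is a single mask and compare. The sketch
       below uses a hypothetical function name and assumes that the five
       codes are numbered in the same order as the byte positions listed
       above (code 6 for the 2nd byte up to code 10 for the 6th).

```c
/* p points at the first byte of a character that claims nbytes bytes
   (2 to 6), all of which are present. Returns 0 if bytes 2..nbytes all
   have the form 10xxxxxx; otherwise returns 6..10 for a bad 2nd..6th
   byte. */
int bad_continuation_reason(const unsigned char *p, int nbytes)
{
    int k;
    for (k = 1; k < nbytes; k++) {
        if ((p[k] & 0xc0) != 0x80)
            return k + 5;   /* k is zero-based: the 2nd byte (k == 1) gives 6 */
    }
    return 0;
}
```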
3646
3647         PCRE_UTF8_ERR11
3648         PCRE_UTF8_ERR12
3649
3650       A character that is valid by the RFC 2279 rules is either 5 or 6  bytes
3651       long; these code points are excluded by RFC 3629.
3652
3653         PCRE_UTF8_ERR13
3654
3655       A 4-byte character has a value greater than 0x10ffff; these code points
3656       are excluded by RFC 3629.
3657
3658         PCRE_UTF8_ERR14
3659
3660       A 3-byte character has a value in the  range  0xd800  to  0xdfff;  this
3661       range of code points is reserved by RFC 3629 for use with  UTF-16,  and
3662       so is excluded from UTF-8.
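
       The two RFC 3629 exclusions just described (values above 0x10ffff and
       the surrogate range) amount to simple range checks on the decoded
       value. A sketch with a hypothetical function name:

```c
/* Return 1 if code point cp, already decoded from a well-formed byte
   sequence, is excluded by RFC 3629: either above 0x10ffff or inside the
   surrogate range 0xd800..0xdfff. */
int rfc3629_excludes(unsigned long cp)
{
    return cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff);
}
```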
3663
3664         PCRE_UTF8_ERR15
3665         PCRE_UTF8_ERR16
3666         PCRE_UTF8_ERR17
3667         PCRE_UTF8_ERR18
3668         PCRE_UTF8_ERR19
3669
3670       A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it  codes
3671       for  a  value that can be represented by fewer bytes, which is invalid.
3672       For example, the two bytes 0xc0, 0xae give the value 0x2e,  whose  cor-
3673       rect coding uses just one byte.
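
       The 0xc0, 0xae example can be checked by decoding the pair by hand.
       The sketch below handles only the 2-byte form, and its function names
       are hypothetical.

```c
/* Decode a 2-byte sequence 110xxxxx 10xxxxxx without any validation. */
unsigned int utf8_decode2(unsigned char b0, unsigned char b1)
{
    return ((unsigned int)(b0 & 0x1f) << 6) | (unsigned int)(b1 & 0x3f);
}

/* A 2-byte sequence is overlong if its value fits in one byte, that is,
   if the value is below 0x80. */
int utf8_is_overlong2(unsigned char b0, unsigned char b1)
{
    return utf8_decode2(b0, b1) < 0x80;
}
```

       Here utf8_decode2(0xc0, 0xae) gives 0x2e, which is below 0x80, so the
       pair is overlong. The pair 0xc3, 0xa9 decodes to 0xe9 and is not.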
3674
3675         PCRE_UTF8_ERR20
3676
3677       The two most significant bits of the first byte of a character have the
3678       binary value 0b10 (that is, the most significant bit is 1 and the  sec-
3679       ond  is  0). Such a byte can only validly occur as the second or subse-
3680       quent byte of a multi-byte character.
3681
3682         PCRE_UTF8_ERR21
3683
3684       The first byte of a character has the value 0xfe or 0xff. These  values
3685       can never occur in a valid UTF-8 string.
3686
3687         PCRE_UTF8_ERR22
3688
3689       Non-character. These are the last two characters in each plane (0xfffe,
3690       0xffff, 0x1fffe, 0x1ffff .. 0x10fffe,  0x10ffff),  and  the  characters
3691       0xfdd0..0xfdef.
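
       The non-character test described above reduces to one mask and one
       range check on the decoded code point. A sketch with a hypothetical
       function name:

```c
/* Return 1 if cp is a Unicode non-character: one of the last two code
   points of any plane, or in the block 0xfdd0..0xfdef. */
int is_noncharacter(unsigned long cp)
{
    if (cp > 0x10ffff)
        return 0;                         /* not a code point at all */
    if ((cp & 0xfffe) == 0xfffe)
        return 1;                         /* 0x..fffe or 0x..ffff */
    return cp >= 0xfdd0 && cp <= 0xfdef;
}
```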
3692
3693
3694EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
3695
3696       int pcre_copy_substring(const char *subject, int *ovector,
3697            int stringcount, int stringnumber, char *buffer,
3698            int buffersize);
3699
3700       int pcre_get_substring(const char *subject, int *ovector,
3701            int stringcount, int stringnumber,
3702            const char **stringptr);
3703
3704       int pcre_get_substring_list(const char *subject,
3705            int *ovector, int stringcount, const char ***listptr);
3706
3707       Captured  substrings  can  be  accessed  directly  by using the offsets
3708       returned by pcre_exec() in  ovector.  For  convenience,  the  functions
3709       pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
3710       string_list() are provided for extracting captured substrings  as  new,
3711       separate,  zero-terminated strings. These functions identify substrings
3712       by number. The next section describes functions  for  extracting  named
3713       substrings.
3714
3715       A  substring that contains a binary zero is correctly extracted and has
3716       a further zero added on the end, but the result is not, of course, a  C
3717       string.   However,  you  can  process such a string by referring to the
3718       length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-
3719       string().  Unfortunately, the interface to pcre_get_substring_list() is
3720       not adequate for handling strings containing binary zeros, because  the
3721       end of the final string is not independently indicated.
3722
3723       The  first  three  arguments  are the same for all three of these func-
3724       tions: subject is the subject string that has  just  been  successfully
3725       matched, ovector is a pointer to the vector of integer offsets that was
3726       passed to pcre_exec(), and stringcount is the number of substrings that
3727       were  captured  by  the match, including the substring that matched the
3728       entire regular expression. This is the value returned by pcre_exec() if
3729       it  is greater than zero. If pcre_exec() returned zero, indicating that
3730       it ran out of space in ovector, the value passed as stringcount  should
3731       be the number of elements in the vector divided by three.
3732
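       The rule for choosing stringcount can be written down as a small helper.
       This is only a sketch: effective_stringcount() is a  hypothetical  name,
       and  ovecsize  is  assumed  to be the number of ints in the vector that
       was passed to pcre_exec().

```c
/* Choose the stringcount argument for the extraction functions from the
   value returned by pcre_exec() (rc) and the number of ints in ovector
   (ovecsize). Returns -1 when rc is negative, i.e. nothing to extract. */
int effective_stringcount(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;            /* whole match plus rc-1 captured groups */
    if (rc == 0)
        return ovecsize / 3;  /* ovector was too small: use what fits */
    return -1;                /* pcre_exec() reported an error */
}
```
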
3733       The  functions pcre_copy_substring() and pcre_get_substring() extract a
3734       single substring, whose number is given as  stringnumber.  A  value  of
3735       zero  extracts  the  substring that matched the entire pattern, whereas
3736       higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
3737       string(),  the  string  is  placed  in buffer, whose length is given by
3738       buffersize, while for pcre_get_substring() a new  block  of  memory  is
3739       obtained  via  pcre_malloc,  and its address is returned via stringptr.
3740       The yield of the function is the length of the  string,  not  including
3741       the terminating zero, or one of these error codes:
3742
3743         PCRE_ERROR_NOMEMORY       (-6)
3744
3745       The  buffer  was too small for pcre_copy_substring(), or the attempt to
3746       get memory failed for pcre_get_substring().
3747
3748         PCRE_ERROR_NOSUBSTRING    (-7)
3749
3750       There is no substring whose number is stringnumber.
3751
3752       The pcre_get_substring_list()  function  extracts  all  available  sub-
3753       strings  and  builds  a list of pointers to them. All this is done in a
3754       single block of memory that is obtained via pcre_malloc. The address of
3755       the  memory  block  is returned via listptr, which is also the start of
3756       the list of string pointers. The end of the list is marked  by  a  NULL
3757       pointer.  The  yield  of  the function is zero if all went well, or the
3758       error code
3759
3760         PCRE_ERROR_NOMEMORY       (-6)
3761
3762       if the attempt to get the memory block failed.
3763
3764       When any of these functions encounters a substring that is unset, which
3765       can  happen  when  capturing subpattern number n+1 matches some part of
3766       the subject, but subpattern n has not been used at all, they return  an
3767       empty string. This can be distinguished from a genuine zero-length sub-
3768       string by inspecting the appropriate offset in ovector, which is  nega-
3769       tive for unset substrings.
3770
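       One way to tell the two cases apart is sketched below. The helper  name
       capture_state()  and  its return values are assumptions made for illus-
       tration, and n is assumed to be less than stringcount.

```c
/* Classify capture group n after a successful match:
   returns 0 if the group is unset, 1 if it matched an empty string,
   2 if it matched one or more characters. */
int capture_state(const int *ovector, int n)
{
    if (ovector[2 * n] < 0)
        return 0;   /* unset: the offset is negative */
    return ovector[2 * n + 1] == ovector[2 * n] ? 1 : 2;
}
```
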
3771       The  two convenience functions pcre_free_substring() and pcre_free_sub-
3772       string_list() can be used to free the memory  returned  by  a  previous
3773       call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
3774       tively. They do nothing more than  call  the  function  pointed  to  by
3775       pcre_free,  which  of course could be called directly from a C program.
3776       However, PCRE is used in some situations where it is linked via a  spe-
3777       cial   interface  to  another  programming  language  that  cannot  use
3778       pcre_free directly; it is for these cases that the functions  are  pro-
3779       vided.
3780
3781
3782EXTRACTING CAPTURED SUBSTRINGS BY NAME
3783
3784       int pcre_get_stringnumber(const pcre *code,
3785            const char *name);
3786
3787       int pcre_copy_named_substring(const pcre *code,
3788            const char *subject, int *ovector,
3789            int stringcount, const char *stringname,
3790            char *buffer, int buffersize);
3791
3792       int pcre_get_named_substring(const pcre *code,
3793            const char *subject, int *ovector,
3794            int stringcount, const char *stringname,
3795            const char **stringptr);
3796
3797       To extract a substring by name, you first have to find the  associated
3798       number. For example, for this pattern
3799
3800         (a+)b(?<xxx>\d+)...
3801
3802       the number of the subpattern called "xxx" is 2. If the name is known to
3803       be unique (PCRE_DUPNAMES was not set), you can find the number from the
3804       name by calling pcre_get_stringnumber(). The first argument is the com-
3805       piled pattern, and the second is the name. The yield of the function is
3806       the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if  there  is  no
3807       subpattern of that name.
3808
3809       Given the number, you can extract the substring directly, or use one of
3810       the functions described in the previous section. For convenience, there
3811       are also two functions that do the whole job.
3812
3813       Most    of    the    arguments   of   pcre_copy_named_substring()   and
3814       pcre_get_named_substring() are the same  as  those  for  the  similarly
3815       named  functions  that extract by number. As these are described in the
3816       previous section, they are not re-described here. There  are  just  two
3817       differences:
3818
3819       First,  instead  of a substring number, a substring name is given. Sec-
3820       ond, there is an extra argument, given at the start, which is a pointer
3821       to  the compiled pattern. This is needed in order to gain access to the
3822       name-to-number translation table.
3823
3824       These functions call pcre_get_stringnumber(), and if it succeeds,  they
3825       then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-
3826       ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate  names,  the
3827       behaviour may not be what you want (see the next section).
3828
3829       Warning: If the pattern uses the (?| feature to set up multiple subpat-
3830       terns with the same number, as described in the  section  on  duplicate
3831       subpattern  numbers  in  the  pcrepattern page, you cannot use names to
3832       distinguish the different subpatterns, because names are  not  included
3833       in  the compiled code. The matching process uses only numbers. For this
3834       reason, the use of different names for subpatterns of the  same  number
3835       causes an error at compile time.
3836
3837
3838DUPLICATE SUBPATTERN NAMES
3839
3840       int pcre_get_stringtable_entries(const pcre *code,
3841            const char *name, char **first, char **last);
3842
3843       When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
3844       subpatterns are not required to be unique. (Duplicate names are  always
3845       allowed  for subpatterns with the same number, created by using the (?|
3846       feature. Indeed, if such subpatterns are named, they  are  required  to
3847       use the same names.)
3848
3849       Normally, patterns with duplicate names are such that in any one match,
3850       only one of the named subpatterns participates. An example is shown  in
3851       the pcrepattern documentation.
3852
3853       When    duplicates   are   present,   pcre_copy_named_substring()   and
3854       pcre_get_named_substring() return the first substring corresponding  to
3855       the  given  name  that  is set. If none are set, PCRE_ERROR_NOSUBSTRING
3856       (-7) is returned; no  data  is  returned.  The  pcre_get_stringnumber()
3857       function  returns one of the numbers that are associated with the name,
3858       but it is not defined which it is.
3859
3860       If you want to get full details of all captured substrings for a  given
3861       name,  you  must  use  the pcre_get_stringtable_entries() function. The
3862       first argument is the compiled pattern, and the second is the name. The
3863       third  and  fourth  are  pointers to variables which are updated by the
3864       function. After it has run, they point to the first and last entries in
3865       the  name-to-number  table  for  the  given  name.  The function itself
3866       returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if
3867       there  are  none.  The  format of the table is described above, in the
3868       section entitled "Information about a pattern". Given all the relevant
3869       entries  for the name, you can extract each of their numbers, and
3870       hence the captured data, if any.
3871
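       The walk over the table entries might look like the  following  sketch.
       It  assumes  the  8-bit  library's entry layout (two bytes of subpattern
       number, most significant byte first, followed  by  the  zero-terminated
       name)  and takes the first and last pointers and the entry length as
       they would come back from pcre_get_stringtable_entries(). The  function
       name first_set_group() is hypothetical.

```c
/* Walk the name-table entries from first to last (inclusive), each
   entry_size bytes long, and return the number of the first group that
   is set in ovector, or -1 if none of them is set. */
int first_set_group(const char *first, const char *last, int entry_size,
                    const int *ovector, int stringcount)
{
    const char *entry = first;

    for (;;) {
        /* The group number occupies the first two bytes, high byte first. */
        int n = (((unsigned char)entry[0]) << 8) | (unsigned char)entry[1];

        if (n < stringcount && ovector[2 * n] >= 0)
            return n;
        if (entry == last)
            break;
        entry += entry_size;
    }
    return -1;
}
```

       This mirrors the documented behaviour of pcre_get_named_substring()  for
       duplicate names, which returns the first substring that is set.
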
3872
3873FINDING ALL POSSIBLE MATCHES
3874
3875       The traditional matching function uses a  similar  algorithm  to  Perl,
3876       which stops when it finds the first match, starting at a given point in
3877       the subject. If you want to find all possible matches, or  the  longest
3878       possible  match,  consider using the alternative matching function (see
3879       below) instead. If you cannot use the alternative function,  but  still
3880       need  to  find all possible matches, you can kludge it up by making use
3881       of the callout facility, which is described in the pcrecallout documen-
3882       tation.
3883
3884       What you have to do is to insert a callout right at the end of the pat-
3885       tern.  When your callout function is called, extract and save the  cur-
3886       rent  matched  substring.  Then  return  1, which forces pcre_exec() to
3887       backtrack and try other alternatives. Ultimately, when it runs  out  of
3888       matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
3889
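       A callout function of this kind could be sketched as follows. So  that
       the  example  compiles without pcre.h, it uses a stand-in structure,
       demo_callout_block, holding only the fields it  needs;  real  code
       would  receive  a  pcre_callout_block.  The match_list collector and
       the function name are likewise assumptions.

```c
#include <string.h>

/* Stand-in for the fields of pcre_callout_block used below. */
typedef struct {
    const char *subject;
    int         start_match;
    int         current_position;
    void       *callout_data;
} demo_callout_block;

/* Hypothetical collector passed through callout_data. */
typedef struct {
    char found[8][64];
    int  count;
} match_list;

/* Callout placed at the very end of the pattern: record the substring
   matched so far, then return 1 so that the matcher backtracks and
   tries the next alternative. */
int save_and_continue(demo_callout_block *cb)
{
    match_list *list = (match_list *)cb->callout_data;
    int len = cb->current_position - cb->start_match;

    if (list->count < 8 && len >= 0 && len < 64) {
        memcpy(list->found[list->count], cb->subject + cb->start_match,
               (size_t)len);
        list->found[list->count][len] = '\0';
        list->count++;
    }
    return 1;
}
```
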
3890
3891OBTAINING AN ESTIMATE OF STACK USAGE
3892
3893       Matching  certain  patterns  using pcre_exec() can use a lot of process
3894       stack, which in certain environments can be  rather  limited  in  size.
3895       Some  users  find it helpful to have an estimate of the amount of stack
3896       that is used by pcre_exec(), to help  them  set  recursion  limits,  as
3897       described  in  the pcrestack documentation. The estimate that is output
3898       by pcretest when called with the -m and -C options is obtained by call-
3899       ing  pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its
3900       first five arguments.
3901
3902       Normally, if  its  first  argument  is  NULL,  pcre_exec()  immediately
3903       returns  the negative error code PCRE_ERROR_NULL, but with this special
3904       combination of arguments, it returns instead a  negative  number  whose
3905       absolute  value  is the approximate stack frame size in bytes. (A nega-
3906       tive number is used so that it is clear that no  match  has  happened.)
3907       The  value  is  approximate  because  in some cases, recursive calls to
3908       pcre_exec() occur when there are one or two additional variables on the
3909       stack.
3910
3911       If  PCRE  has  been  compiled  to use the heap instead of the stack for
3912       recursion, the value returned  is  the  size  of  each  block  that  is
3913       obtained from the heap.
3914
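       As a rough illustration of how the estimate might be used, the  sketch
       below  converts  the  negative return value into a frame size and
       divides a stack budget by it. The function  name  and  the  idea  of
       passing the budget in bytes are assumptions; leaving headroom for the
       rest of the program is up to the caller.

```c
/* Turn the negative value returned by the special pcre_exec() call into
   a frame size, then estimate how many recursion levels fit in
   stack_bytes. Returns 0 if rc is not a negative estimate. */
long recursion_levels_for(long stack_bytes, int rc)
{
    long frame_size;

    if (rc >= 0)
        return 0;
    frame_size = -(long)rc;   /* the absolute value is the frame size */
    return stack_bytes / frame_size;
}
```
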
3915
3916MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
3917
3918       int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
3919            const char *subject, int length, int startoffset,
3920            int options, int *ovector, int ovecsize,
3921            int *workspace, int wscount);
3922
3923       The  function  pcre_dfa_exec()  is  called  to  match  a subject string
3924       against a compiled pattern, using a matching algorithm that  scans  the
3925       subject  string  just  once, and does not backtrack. This has different
3926       characteristics to the normal algorithm, and  is  not  compatible  with
3927       Perl.  Some  of the features of PCRE patterns are not supported. Never-
3928       theless, there are times when this kind of matching can be useful.  For
3929       a  discussion  of  the  two matching algorithms, and a list of features
3930       that pcre_dfa_exec() does not support, see the pcrematching  documenta-
3931       tion.
3932
3933       The  arguments  for  the  pcre_dfa_exec()  function are the same as for
3934       pcre_exec(), plus two extras. The ovector argument is used in a differ-
3935       ent  way,  and  this is described below. The other common arguments are
3936       used in the same way as for pcre_exec(), so their  description  is  not
3937       repeated here.
3938
3939       The  two  additional  arguments provide workspace for the function. The
3940       workspace vector should contain at least 20 elements. It  is  used  for
3941       keeping  track  of  multiple  paths  through  the  pattern  tree.  More
3942       workspace will be needed for patterns and subjects where  there  are  a
3943       lot of potential matches.
3944
3945       Here is an example of a simple call to pcre_dfa_exec():
3946
3947         int rc;
3948         int ovector[10];
3949         int wspace[20];
3950         rc = pcre_dfa_exec(
3951           re,             /* result of pcre_compile() */
3952           NULL,           /* we didn't study the pattern */
3953           "some string",  /* the subject string */
3954           11,             /* the length of the subject string */
3955           0,              /* start at offset 0 in the subject */
3956           0,              /* default options */
3957           ovector,        /* vector of integers for substring information */
3958           10,             /* number of elements (NOT size in bytes) */
3959           wspace,         /* working space vector */
3960           20);            /* number of elements (NOT size in bytes) */
3961
3962   Option bits for pcre_dfa_exec()
3963
3964       The  unused  bits  of  the options argument for pcre_dfa_exec() must be
3965       zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
3966       LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,
3967       PCRE_NOTEMPTY_ATSTART,      PCRE_NO_UTF8_CHECK,       PCRE_BSR_ANYCRLF,
3968       PCRE_BSR_UNICODE,  PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
3969       TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART.  All but  the  last
3970       four  of  these  are  exactly  the  same  as  for pcre_exec(), so their
3971       description is not repeated here.
3972
3973         PCRE_PARTIAL_HARD
3974         PCRE_PARTIAL_SOFT
3975
3976       These have the same general effect as they do for pcre_exec(), but  the
3977       details  are  slightly  different.  When  PCRE_PARTIAL_HARD  is set for
3978       pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of  the  sub-
3979       ject  is  reached  and there is still at least one matching possibility
3980       that requires additional characters. This happens even if some complete
3981       matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
3982       code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
3983       of  the  subject  is  reached, there have been no complete matches, but
3984       there is still at least one matching possibility. The  portion  of  the
3985       string  that  was inspected when the longest partial match was found is
3986       set as the first matching string  in  both  cases.   There  is  a  more
3987       detailed  discussion  of partial and multi-segment matching, with exam-
3988       ples, in the pcrepartial documentation.
3989
3990         PCRE_DFA_SHORTEST
3991
3992       Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
3993       stop as soon as it has found one match. Because of the way the alterna-
3994       tive algorithm works, this is necessarily the shortest  possible  match
3995       at the first possible matching point in the subject string.
3996
3997         PCRE_DFA_RESTART
3998
3999       When pcre_dfa_exec() returns a partial match, it is possible to call it
4000       again, with additional subject characters, and have  it  continue  with
4001       the  same match. The PCRE_DFA_RESTART option requests this action; when
4002       it is set, the workspace and wscount arguments must reference the  same
4003       vector  as  before  because data about the match so far is left in them
4004       after a partial match. There is more discussion of this facility in the
4005       pcrepartial documentation.
4006
4007   Successful returns from pcre_dfa_exec()
4008
4009       When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-
4010       string in the subject. Note, however, that all the matches from one run
4011       of  the  function  start  at the same point in the subject. The shorter
4012       matches are all initial substrings of the longer matches. For  example,
4013       if the pattern
4014
4015         <.*>
4016
4017       is matched against the string
4018
4019         This is <something> <something else> <something further> no more
4020
4021       the three matched strings are
4022
4023         <something>
4024         <something> <something else>
4025         <something> <something else> <something further>
4026
4027       On  success,  the  yield of the function is a number greater than zero,
4028       which is the number of matched substrings.  The  substrings  themselves
4029       are  returned  in  ovector. Each string uses two elements; the first is
4030       the offset to the start, and the second is the offset to  the  end.  In
4031       fact,  all  the  strings  have the same start offset. (Space could have
4032       been saved by giving this only once, but it was decided to retain  some
4033       compatibility  with  the  way pcre_exec() returns data, even though the
4034       meaning of the strings is different.)
4035
4036       The strings are returned in reverse order of length; that is, the long-
4037       est  matching  string is given first. If there were too many matches to
4038       fit into ovector, the yield of the function is zero, and the vector  is
4039       filled  with  the  longest matches. Unlike pcre_exec(), pcre_dfa_exec()
4040       can use the entire ovector for returning matched strings.
4041
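       The layout of the results can be read back with  helpers  such  as  the
       ones  sketched  here.  The  names  are  hypothetical,  and ovecsize is
       assumed to be the number of ints in ovector.

```c
/* Number of matches stored in ovector after pcre_dfa_exec() returned rc.
   A return of zero means ovector filled up, so every pair is in use. */
int dfa_result_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 2;
    return 0;   /* negative rc: no complete matches to report */
}

/* Length of the i-th match; index 0 is the longest. */
int dfa_match_length(const int *ovector, int i)
{
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

       For  the  <.*>  example  above,  all three pairs start at offset 8 and
       differ only in their end offsets.
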
4042   Error returns from pcre_dfa_exec()
4043
4044       The pcre_dfa_exec() function returns a negative number when  it  fails.
4045       Many  of  the  errors  are  the  same as for pcre_exec(), and these are
4046       described above.  There are in addition the following errors  that  are
4047       specific to pcre_dfa_exec():
4048
4049         PCRE_ERROR_DFA_UITEM      (-16)
4050
4051       This  return is given if pcre_dfa_exec() encounters an item in the pat-
4052       tern that it does not support, for instance, the use of \C  or  a  back
4053       reference.
4054
4055         PCRE_ERROR_DFA_UCOND      (-17)
4056
4057       This  return  is  given  if pcre_dfa_exec() encounters a condition item
4058       that uses a back reference for the condition, or a test  for  recursion
4059       in a specific group. These are not supported.
4060
4061         PCRE_ERROR_DFA_UMLIMIT    (-18)
4062
4063       This  return  is given if pcre_dfa_exec() is called with an extra block
4064       that contains a setting of  the  match_limit  or  match_limit_recursion
4065       fields.  This  is  not  supported (these fields are meaningless for DFA
4066       matching).
4067
4068         PCRE_ERROR_DFA_WSSIZE     (-19)
4069
4070       This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the
4071       workspace vector.
4072
4073         PCRE_ERROR_DFA_RECURSE    (-20)
4074
4075       When  a  recursive subpattern is processed, the matching function calls
4076       itself recursively, using private vectors for  ovector  and  workspace.
4077       This  error  is  given  if  the output vector is not large enough. This
4078       should be extremely rare, as a vector of size 1000 is used.
4079
4080         PCRE_ERROR_DFA_BADRESTART (-30)
4081
4082       When pcre_dfa_exec() is called with the PCRE_DFA_RESTART  option,  some
4083       plausibility  checks  are  made on the contents of the workspace, which
4084       should contain data about the previous partial match. If any  of  these
4085       checks fail, this error is given.
4086
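       For diagnostics, the error numbers listed above could be mapped to text
       as  in  this  sketch.  The numbers are written out literally so that it
       compiles without pcre.h; real code would use  the  PCRE_ERROR_DFA_xxx
       macros. The function name and the message wording are assumptions.

```c
/* Map the pcre_dfa_exec()-specific error numbers to short descriptions. */
const char *dfa_error_text(int rc)
{
    switch (rc) {
    case -16: return "unsupported item in pattern";       /* DFA_UITEM */
    case -17: return "unsupported condition in pattern";  /* DFA_UCOND */
    case -18: return "match limit set in extra block";    /* DFA_UMLIMIT */
    case -19: return "workspace vector too small";        /* DFA_WSSIZE */
    case -20: return "recursion output vector too small"; /* DFA_RECURSE */
    case -30: return "invalid restart data in workspace"; /* DFA_BADRESTART */
    default:  return "not specific to pcre_dfa_exec()";
    }
}
```
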
4087
4088SEE ALSO
4089
4090       pcre16(3),  pcre32(3),  pcrebuild(3),   pcrecallout(3),   pcrecpp(3),
4091       pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-
4092       sample(3), pcrestack(3).
4093
4094
4095AUTHOR
4096
4097       Philip Hazel
4098       University Computing Service
4099       Cambridge CB2 3QH, England.
4100
4101
4102REVISION
4103
4104       Last updated: 08 November 2012
4105       Copyright (c) 1997-2012 University of Cambridge.
4106------------------------------------------------------------------------------
4107
4108
4109PCRECALLOUT(3)                                                  PCRECALLOUT(3)
4110
4111
4112NAME
4113       PCRE - Perl-compatible regular expressions
4114
4115
4116SYNOPSIS
4117
4118       #include <pcre.h>
4119
4120       int (*pcre_callout)(pcre_callout_block *);
4121
4122       int (*pcre16_callout)(pcre16_callout_block *);
4123
4124       int (*pcre32_callout)(pcre32_callout_block *);
4125
4126
4127DESCRIPTION
4128
4129       PCRE provides a feature called "callout", which is a means of temporar-
4130       ily passing control to the caller of PCRE  in  the  middle  of  pattern
4131       matching.  The  caller of PCRE provides an external function by putting
4132       its entry point in the global variable pcre_callout (pcre16_callout for
4133       the 16-bit library, pcre32_callout for the 32-bit library). By default,
4134       this variable contains NULL, which disables all calling out.
4135
4136       Within a regular expression, (?C) indicates the  points  at  which  the
4137       external  function  is  to  be  called. Different callout points can be
4138       identified by putting a number less than 256 after the  letter  C.  The
4139       default  value  is  zero.   For  example,  this pattern has two callout
4140       points:
4141
4142         (?C1)abc(?C2)def
4143
4144       If the PCRE_AUTO_CALLOUT option bit is set when a pattern is  compiled,
4145       PCRE  automatically  inserts callouts, all with number 255, before each
4146       item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the
4147       pattern
4148
4149         A(\d{2}|--)
4150
4151       it is processed as if it were
4152
4153       (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4154
4155       Notice  that  there  is a callout before and after each parenthesis and
4156       alternation bar. Automatic  callouts  can  be  used  for  tracking  the
4157       progress  of  pattern matching. The pcretest command has an option that
4158       sets automatic callouts; when it is used, the output indicates how  the
4159       pattern  is  matched. This is useful information when you are trying to
4160       optimize the performance of a particular pattern.
4161
4162       The use of callouts in a pattern makes it ineligible  for  optimization
4163       by  the  just-in-time  compiler.  Studying  such  a  pattern  with  the
4164       PCRE_STUDY_JIT_COMPILE option always fails.
4165
4166
4167MISSING CALLOUTS
4168
4169       You should be aware that, because of  optimizations  in  the  way  PCRE
4170       matches  patterns  by  default,  callouts  sometimes do not happen. For
4171       example, if the pattern is
4172
4173         ab(?C4)cd
4174
4175       PCRE knows that any matching string must contain the letter "d". If the
4176       subject  string  is "abyz", the lack of "d" means that matching doesn't
4177       ever start, and the callout is never  reached.  However,  with  "abyd",
4178       though the result is still no match, the callout is obeyed.
4179
4180       If  the pattern is studied, PCRE knows the minimum length of a matching
4181       string, and will immediately give a "no match" return without  actually
4182       running  a  match if the subject is not long enough, or, for unanchored
4183       patterns, if it has been scanned far enough.
4184
4185       You can disable these optimizations by passing the  PCRE_NO_START_OPTI-
4186       MIZE  option  to the matching function, or by starting the pattern with
4187       (*NO_START_OPT). This slows down the matching process, but does  ensure
4188       that callouts such as the example above are obeyed.
4189
4190
4191THE CALLOUT INTERFACE
4192
4193       During  matching, when PCRE reaches a callout point, the external func-
4194       tion defined by pcre_callout or pcre[16|32]_callout is called (if it is
4195       set).  This  applies to both normal and DFA matching. The only argument
4196       to the  callout  function  is  a  pointer  to  a  pcre_callout_block or
4197       pcre[16|32]_callout_block. These  structures  contain  the  following
4198       fields:
4199
4200         int           version;
4201         int           callout_number;
4202         int          *offset_vector;
4203         const char   *subject;           (8-bit version)
4204         PCRE_SPTR16   subject;           (16-bit version)
4205         PCRE_SPTR32   subject;           (32-bit version)
4206         int           subject_length;
4207         int           start_match;
4208         int           current_position;
4209         int           capture_top;
4210         int           capture_last;
4211         void         *callout_data;
4212         int           pattern_position;
4213         int           next_item_length;
4214         const unsigned char *mark;       (8-bit version)
4215         const PCRE_UCHAR16  *mark;       (16-bit version)
4216         const PCRE_UCHAR32  *mark;       (32-bit version)
4217
4218       The version field is an integer containing the version  number  of  the
4219       block  format. The initial version was 0; the current version is 2. The
4220       version number will change again in future  if  additional  fields  are
4221       added, but the intention is never to remove any of the existing fields.
4222
4223       The  callout_number  field  contains the number of the callout, as com-
4224       piled into the pattern (that is, the number after ?C for  manual  call-
4225       outs, and 255 for automatically generated callouts).
4226
4227       The  offset_vector field is a pointer to the vector of offsets that was
4228       passed by the caller to the  matching  function.  When  pcre_exec()  or
4229       pcre[16|32]_exec()  is used, the contents can be inspected, in order to
4230       extract substrings that have been matched so far, in the  same  way  as
4231       for  extracting  substrings  after  a  match has completed. For the DFA
4232       matching functions, this field is not useful.
4233
4234       The subject and subject_length fields contain copies of the values that
4235       were passed to the matching function.
4236
4237       The  start_match  field normally contains the offset within the subject
4238       at which the current match attempt  started.  However,  if  the  escape
4239       sequence  \K has been encountered, this value is changed to reflect the
4240       modified starting point. If the pattern is not  anchored,  the  callout
4241       function may be called several times from the same point in the pattern
4242       for different starting points in the subject.
4243
4244       The current_position field contains the offset within  the  subject  of
4245       the current match pointer.
4246
4247       When  pcre_exec()  or  pcre[16|32]_exec()  is  used,  the   capture_top
4248       field contains one more than the number of the  highest  numbered  cap-
4249       tured  substring so far. If no substrings have been captured, the value
4250       of capture_top is one. This is always the case when the  DFA  functions
4251       are used, because they do not support captured substrings.
4252
4253       The  capture_last  field  contains the number of the most recently cap-
4254       tured substring. If no substrings have been captured, its value is  -1.
4255       This is always the case for the DFA matching functions.
4256
4257       The  callout_data  field  contains a value that is passed to a matching
4258       function specifically so that it can be passed back in callouts. It  is
4259       passed  in  the callout_data field of a pcre_extra or pcre[16|32]_extra
4260       data structure. If no such data was passed, the value  of  callout_data
4261       in  a  callout  block is NULL. There is a description of the pcre_extra
4262       structure in the pcreapi documentation.
4263
4264       The pattern_position field is present from version  1  of  the  callout
4265       structure. It contains the offset to the next item to be matched in the
4266       pattern string.
4267
4268       The next_item_length field is present from version  1  of  the  callout
4269       structure. It contains the length of the next item to be matched in the
4270       pattern string. When the callout immediately  precedes  an  alternation
4271       bar,  a  closing  parenthesis, or the end of the pattern, the length is
4272       zero. When the callout precedes an opening parenthesis, the  length  is
4273       that of the entire subpattern.
4274
4275       The  pattern_position  and next_item_length fields are intended to help
4276       in distinguishing between different automatic callouts, which all  have
4277       the same callout number. However, they are set for all callouts.
4278
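       A callout function can use these two values to recover the text of  the
       pattern  item  it is about to match, provided it has access to the pat-
       tern string. The sketch below assumes the caller supplies  the  pattern
       itself; next_pattern_item() is a hypothetical helper.

```c
#include <string.h>

/* Copy the pattern item that a callout is about to match into buf.
   position and length correspond to the pattern_position and
   next_item_length fields. Returns 0 on success, or -1 if buf is too
   small or the range lies outside the pattern. */
int next_pattern_item(const char *pattern, int position, int length,
                      char *buf, int bufsize)
{
    size_t plen = strlen(pattern);

    if (position < 0 || length < 0 ||
        (size_t)position + (size_t)length > plen || length + 1 > bufsize)
        return -1;
    memcpy(buf, pattern + position, (size_t)length);
    buf[length] = '\0';
    return 0;
}
```
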
4279       The  mark  field is present from version 2 of the callout structure. In
4280       callouts from pcre_exec() or pcre[16|32]_exec() it contains  a  pointer
4281       to  the  zero-terminated  name  of  the  most  recently passed (*MARK),
4282       (*PRUNE), or (*THEN) item in the match, or NULL if no such  items  have
4283       been  passed.  Instances  of  (*PRUNE) or (*THEN) without a name do not
4284       obliterate a previous (*MARK). In callouts from the DFA matching  func-
4285       tions this field always contains NULL.
4286
4287
4288RETURN VALUES
4289
4290       The  external callout function returns an integer to PCRE. If the value
4291       is zero, matching proceeds as normal. If  the  value  is  greater  than
4292       zero,  matching  fails  at  the current point, but the testing of other
4293       matching possibilities goes ahead, just as if a lookahead assertion had
4294       failed. If the value is less than zero, the match is abandoned, and the
4295       matching function returns the negative value.
4296
4297       Negative  values  should  normally  be   chosen   from   the   set   of
4298       PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
4299       dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is
4300       reserved  for  use  by callout functions; it will never be used by PCRE
4301       itself.
4302
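       The three kinds of return value are illustrated by the sketch below. It
       uses  a  stand-in structure, budget_callout_block, with just two of the
       fields described earlier so that it compiles without pcre.h; real  code
       would  take  a pcre_callout_block and return PCRE_ERROR_CALLOUT itself.
       The budget scheme and all names are assumptions made for illustration.

```c
#include <stddef.h>

/* Stand-in for the two pcre_callout_block fields used here. */
typedef struct {
    int   callout_number;
    void *callout_data;
} budget_callout_block;

#define DEMO_ERROR_CALLOUT (-9)   /* the value of PCRE_ERROR_CALLOUT */

/* callout_data points to an int holding the remaining callout budget.
   Callout number 1 vetoes the current path; every other callout lets
   matching proceed. Once the budget is used up, the match is abandoned. */
int budget_callout(budget_callout_block *cb)
{
    int *budget = (int *)cb->callout_data;

    if (budget != NULL && (*budget)-- <= 0)
        return DEMO_ERROR_CALLOUT;  /* < 0: abandon the match */
    if (cb->callout_number == 1)
        return 1;                   /* > 0: fail here, try alternatives */
    return 0;                       /* 0: carry on matching */
}
```
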
4303
4304AUTHOR
4305
4306       Philip Hazel
4307       University Computing Service
4308       Cambridge CB2 3QH, England.
4309
4310
4311REVISION
4312
4313       Last updated: 24 June 2012
4314       Copyright (c) 1997-2012 University of Cambridge.
4315------------------------------------------------------------------------------
4316
4317
4318PCRECOMPAT(3)                                                    PCRECOMPAT(3)
4319
4320
4321NAME
4322       PCRE - Perl-compatible regular expressions
4323
4324
4325DIFFERENCES BETWEEN PCRE AND PERL
4326
4327       This  document describes the differences in the ways that PCRE and Perl
4328       handle regular expressions. The differences  described  here  are  with
4329       respect to Perl versions 5.10 and above.
4330
4331       1. PCRE has only a subset of Perl's Unicode support. Details of what it
4332       does have are given in the pcreunicode page.
4333
4334       2. PCRE allows repeat quantifiers only on parenthesized assertions, but
4335       they  do  not mean what you might think. For example, (?!a){3} does not
4336       assert that the next three characters are not "a". It just asserts that
4337       the next character is not "a" three times (in principle: PCRE optimizes
4338       this to run the assertion just once). Perl allows repeat quantifiers on
4339       other assertions such as \b, but these do not seem to have any use.
4340
4341       3.  Capturing  subpatterns  that occur inside negative lookahead asser-
4342       tions are counted, but their entries in the offsets  vector  are  never
4343       set.  Perl sets its numerical variables from any such patterns that are
4344       matched before the assertion fails to match something (thereby succeed-
4345       ing),  but  only  if the negative lookahead assertion contains just one
4346       branch.
4347
4348       4. Though binary zero characters are supported in the  subject  string,
4349       they are not allowed in a pattern string because it is passed as a nor-
4350       mal C string, terminated by zero. The escape sequence \0 can be used in
4351       the pattern to represent a binary zero.
4352
4353       5.  The  following Perl escape sequences are not supported: \l, \u, \L,
4354       \U, and \N when followed by a character name or Unicode value.  (\N  on
4355       its own, matching a non-newline character, is supported.) In fact these
4356       are implemented by Perl's general string-handling and are not  part  of
4357       its  pattern  matching engine. If any of these are encountered by PCRE,
4358       an error is generated by default. However, if the  PCRE_JAVASCRIPT_COM-
4359       PAT  option  is set, \U and \u are interpreted as JavaScript interprets
4360       them.
4361
4362       6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
4363       is  built  with Unicode character property support. The properties that
4364       can be tested with \p and \P are limited to the general category  prop-
4365       erties  such  as  Lu and Nd, script names such as Greek or Han, and the
4366       derived properties Any and L&. PCRE does  support  the  Cs  (surrogate)
4367       property,  which  Perl  does  not; the Perl documentation says "Because
4368       Perl hides the need for the user to understand the internal representa-
4369       tion  of Unicode characters, there is no need to implement the somewhat
4370       messy concept of surrogates."
4371
4372       7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
4373       ters  in  between  are  treated as literals. This is slightly different
4374       from Perl in that $ and @ are  also  handled  as  literals  inside  the
4375       quotes.  In Perl, they cause variable interpolation (but of course PCRE
4376       does not have variables). Note the following examples:
4377
4378           Pattern            PCRE matches      Perl matches
4379
4380           \Qabc$xyz\E        abc$xyz           abc followed by the
4381                                                  contents of $xyz
4382           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
4383           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
4384
4385       The \Q...\E sequence is recognized both inside  and  outside  character
4386       classes.
4387
4388       8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
4389       constructions. However, there is support for recursive  patterns.  This
4390       is  not  available  in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
4391       "callout" feature allows an external function to be called during  pat-
4392       tern matching. See the pcrecallout documentation for details.
4393
4394       9.  Subpatterns  that  are called as subroutines (whether or not recur-
4395       sively) are always treated as atomic  groups  in  PCRE.  This  is  like
4396       Python,  but  unlike Perl.  Captured values that are set outside a sub-
4397       routine call can be referenced from inside in PCRE, but not  in  Perl.
4398       There is a discussion that explains these differences in more detail in
4399       the section on recursion differences from Perl in the pcrepattern page.
4400
4401       10. If any of the backtracking control verbs are used in  an  assertion
4402       or  in  a  subpattern  that  is  called as a subroutine (whether or not
4403       recursively), their effect is confined to that subpattern; it does  not
4404       extend to the surrounding pattern. This is not always the case in Perl.
4405       In particular, if (*THEN) is present in a group that  is  called  as  a
4406       subroutine, its action is limited to that group, even if the group does
4407       not contain any | characters. There is one exception to this: the  name
4408       from  a (*MARK), (*PRUNE), or (*THEN) that is encountered in a success-
4409       ful positive assertion is passed back when a  match  succeeds  (compare
4410       capturing  parentheses  in  assertions). Note that such subpatterns are
4411       processed as anchored at the point where they are tested.
4412
4413       11. There are some differences that are concerned with the settings  of
4414       captured  strings  when  part  of  a  pattern is repeated. For example,
4415       matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
4416       unset, but in PCRE it is set to "b".
4417
4418       12.  PCRE's handling of duplicate subpattern numbers and duplicate sub-
4419       pattern names is not as general as Perl's. This is a consequence of the
4420       fact that PCRE works internally just with numbers, using  an  external
4421       table to translate between numbers and names. In particular, a pattern
4422       such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses  have
4423       the same number but different names, is not supported,  and  causes  an
4424       error  at compile time. If it were allowed, it would not be possible to
4425       distinguish which parentheses matched, because both names map  to  cap-
4426       turing subpattern number 1. To avoid this confusing situation, an error
4427       is given at compile time.
4428
4429       13. Perl recognizes comments in some places that  PCRE  does  not,  for
4430       example,  between  the  ( and ? at the start of a subpattern. If the /x
4431       modifier is set, Perl allows white space between ( and ? but PCRE never
4432       does, even if the PCRE_EXTENDED option is set.
4433
4434       14. PCRE provides some extensions to the Perl regular expression facil-
4435       ities.  Perl 5.10 includes new features that are not  in  earlier  ver-
4436       sions  of  Perl, some of which (such as named parentheses) have been in
4437       PCRE for some time. This list is with respect to Perl 5.10:
4438
4439       (a) Although lookbehind assertions in  PCRE  must  match  fixed  length
4440       strings,  each alternative branch of a lookbehind assertion can match a
4441       different length of string. Perl requires them all  to  have  the  same
4442       length.
4443
4444       (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
4445       meta-character matches only at the very end of the string.
4446
4447       (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
4448       cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
4449       ignored.  (Perl can be made to issue a warning.)
4450
4451       (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
4452       fiers is inverted, that is, by default they are not greedy, but if fol-
4453       lowed by a question mark they are.
4454
4455       (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
4456       tried only at the first matching position in the subject string.
4457
4458       (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
4459       and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no  Perl  equiva-
4460       lents.
4461
4462       (g)  The  \R escape sequence can be restricted to match only CR, LF, or
4463       CRLF by the PCRE_BSR_ANYCRLF option.
4464
4465       (h) The callout facility is PCRE-specific.
4466
4467       (i) The partial matching facility is PCRE-specific.
4468
4469       (j) Patterns compiled by PCRE can be saved and re-used at a later time,
4470       even  on  different hosts that have the other endianness. However, this
4471       does not apply to optimized data created by the just-in-time compiler.
4472
4473       (k)    The    alternative    matching    functions    (pcre_dfa_exec(),
4474       pcre16_dfa_exec()  and  pcre32_dfa_exec()) match in a different way and
4475       are not Perl-compatible.
4476
4477       (l) PCRE recognizes some special sequences such as (*CR) at  the  start
4478       of a pattern that set overall options that cannot be changed within the
4479       pattern.
4480
4481
4482AUTHOR
4483
4484       Philip Hazel
4485       University Computing Service
4486       Cambridge CB2 3QH, England.
4487
4488
4489REVISION
4490
4491       Last updated: 25 August 2012
4492       Copyright (c) 1997-2012 University of Cambridge.
4493------------------------------------------------------------------------------
4494
4495
4496PCREPATTERN(3)                                                  PCREPATTERN(3)
4497
4498
4499NAME
4500       PCRE - Perl-compatible regular expressions
4501
4502
4503PCRE REGULAR EXPRESSION DETAILS
4504
4505       The  syntax and semantics of the regular expressions that are supported
4506       by PCRE are described in detail below. There is a quick-reference  syn-
4507       tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
4508       semantics as closely as it can. PCRE  also  supports  some  alternative
4509       regular  expression  syntax (which does not conflict with the Perl syn-
4510       tax) in order to provide some compatibility with regular expressions in
4511       Python, .NET, and Oniguruma.
4512
4513       Perl's  regular expressions are described in its own documentation, and
4514       regular expressions in general are covered in a number of  books,  some
4515       of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
4516       Expressions", published by  O'Reilly,  covers  regular  expressions  in
4517       great  detail.  This  description  of  PCRE's  regular  expressions  is
4518       intended as reference material.
4519
4520       The original operation of PCRE was on strings of  one-byte  characters.
4521       However,  there  is  now also support for UTF-8 strings in the original
4522       library, an extra library that supports  16-bit  and  UTF-16  character
4523       strings,  and a third library that supports 32-bit and UTF-32 character
4524       strings. To use these features, PCRE must be built to include appropri-
4525       ate  support. When using UTF strings you must either call the compiling
4526       function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option,  or  the
4527       pattern must start with one of these special sequences:
4528
4529         (*UTF8)
4530         (*UTF16)
4531         (*UTF32)
4532         (*UTF)
4533
4534       (*UTF)  is  a  generic  sequence  that  can  be  used  with  any of the
4535       libraries.  Starting a pattern with such a sequence  is  equivalent  to
4536       setting  the  relevant option. This feature is not Perl-compatible. How
4537       setting a UTF mode affects pattern matching  is  mentioned  in  several
4538       places  below.  There  is also a summary of features in the pcreunicode
4539       page.
4540
4541       Another special sequence that may appear at the start of a  pattern  or
4542       in combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:
4543
4544         (*UCP)
4545
4546       This  has  the  same  effect  as setting the PCRE_UCP option: it causes
4547       sequences such as \d and \w to  use  Unicode  properties  to  determine
4548       character types, instead of recognizing only characters with codes less
4549       than 128 via a lookup table.
4550
4551       If a pattern starts with (*NO_START_OPT), it has  the  same  effect  as
4552       setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
4553       time. There are also some more of these special sequences that are con-
4554       cerned with the handling of newlines; they are described below.
4555
4556       The  remainder  of  this  document discusses the patterns that are sup-
4557       ported by PCRE when one of  its  main  matching  functions,  pcre_exec()
4558       (8-bit)  or  pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has
4559       alternative      matching      functions,      pcre_dfa_exec()      and
4560       pcre[16|32]_dfa_exec(), which match using a different algorithm that is
4561       not Perl-compatible. Some of  the  features  discussed  below  are  not
4562       available  when  DFA matching is used. The advantages and disadvantages
4563       of the alternative functions, and how they differ from the normal func-
4564       tions, are discussed in the pcrematching page.
4565
4566
4567EBCDIC CHARACTER CODES
4568
4569       PCRE  can  be compiled to run in an environment that uses EBCDIC as its
4570       character code rather than ASCII or Unicode (typically a mainframe sys-
4571       tem).  In  the  sections below, character code values are ASCII or Uni-
4572       code; in an EBCDIC environment these characters may have different code
4573       values, and there are no code points greater than 255.
4574
4575
4576NEWLINE CONVENTIONS
4577
4578       PCRE  supports five different conventions for indicating line breaks in
4579       strings: a single CR (carriage return) character, a  single  LF  (line-
4580       feed) character, the two-character sequence CRLF, any of the three pre-
4581       ceding, or any Unicode newline sequence. The pcreapi page  has  further
4582       discussion  about newlines, and shows how to set the newline convention
4583       in the options arguments for the compiling and matching functions.
4584
4585       It is also possible to specify a newline convention by starting a  pat-
4586       tern string with one of the following five sequences:
4587
4588         (*CR)        carriage return
4589         (*LF)        linefeed
4590         (*CRLF)      carriage return, followed by linefeed
4591         (*ANYCRLF)   any of the three above
4592         (*ANY)       all Unicode newline sequences
4593
4594       These override the default and the options given to the compiling func-
4595       tion. For example, on a Unix system where LF  is  the  default  newline
4596       sequence, the pattern
4597
4598         (*CR)a.b
4599
4600       changes the convention to CR. That pattern matches "a\nb" because LF is
4601       no longer a newline. Note that these special settings,  which  are  not
4602       Perl-compatible,  are  recognized  only at the very start of a pattern,
4603       and that they must be in upper case.  If  more  than  one  of  them  is
4604       present, the last one is used.
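
       The  rules  just  described  (upper case only, recognized only at the
       very start, last one wins) can be modelled with a small scanner. This
       is  a sketch of the documented behaviour, not PCRE's own parser; the
       type and function names are invented here, and other start-of-pattern
       sequences such as (*UTF8) or (*UCP) are deliberately ignored.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    NL_DEFAULT, NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY
} newline_kind;

/* Scan newline-setting sequences at the very start of `pattern`.
 * Returns the last one found, or NL_DEFAULT if there is none.
 * If `rest` is not NULL, it receives the first unconsumed character. */
newline_kind leading_newline_setting(const char *pattern, const char **rest)
{
    static const struct { const char *text; newline_kind kind; } table[] = {
        { "(*CR)", NL_CR },           { "(*LF)", NL_LF },
        { "(*CRLF)", NL_CRLF },       { "(*ANYCRLF)", NL_ANYCRLF },
        { "(*ANY)", NL_ANY },
    };
    newline_kind result = NL_DEFAULT;
    int matched = 1;

    while (matched) {
        matched = 0;
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
            size_t len = strlen(table[i].text);
            /* The comparison is case-sensitive and includes the closing
             * parenthesis, so "(*CR)" cannot match inside "(*CRLF)". */
            if (strncmp(pattern, table[i].text, len) == 0) {
                result = table[i].kind;   /* a later setting overrides */
                pattern += len;
                matched = 1;
                break;
            }
        }
    }
    if (rest != NULL)
        *rest = pattern;
    return result;
}
```

       For  the  pattern  (*CR)a.b  this model reports the CR convention and
       leaves a.b as the remaining pattern text.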
4605
4606       The  newline  convention affects where the circumflex and dollar asser-
4607       tions are true. It also affects the interpretation of the dot metachar-
4608       acter when PCRE_DOTALL is not set, and the behaviour of \N. However, it
4609       does not affect what the \R escape sequence matches. By  default,  this
4610       is  any Unicode newline sequence, for Perl compatibility. However, this
4611       can be changed; see the description of \R in the section entitled "New-
4612       line  sequences"  below.  A change of \R setting can be combined with a
4613       change of newline convention.
4614
4615
4616CHARACTERS AND METACHARACTERS
4617
4618       A regular expression is a pattern that is  matched  against  a  subject
4619       string  from  left  to right. Most characters stand for themselves in a
4620       pattern, and match the corresponding characters in the  subject.  As  a
4621       trivial example, the pattern
4622
4623         The quick brown fox
4624
4625       matches a portion of a subject string that is identical to itself. When
4626       caseless matching is specified (the PCRE_CASELESS option), letters  are
4627       matched  independently  of case. In a UTF mode, PCRE always understands
4628       the concept of case for characters whose values are less than  128,  so
4629       caseless  matching  is always possible. For characters with higher val-
4630       ues, the concept of case is supported if PCRE is compiled with  Unicode
4631       property  support,  but  not  otherwise.   If  you want to use caseless
4632       matching for characters 128 and above, you must  ensure  that  PCRE  is
4633       compiled with Unicode property support as well as with UTF support.
4634
4635       The  power  of  regular  expressions  comes from the ability to include
4636       alternatives and repetitions in the pattern. These are encoded  in  the
4637       pattern by the use of metacharacters, which do not stand for themselves
4638       but instead are interpreted in some special way.
4639
4640       There are two different sets of metacharacters: those that  are  recog-
4641       nized  anywhere in the pattern except within square brackets, and those
4642       that are recognized within square brackets.  Outside  square  brackets,
4643       the metacharacters are as follows:
4644
4645         \      general escape character with several uses
4646         ^      assert start of string (or line, in multiline mode)
4647         $      assert end of string (or line, in multiline mode)
4648         .      match any character except newline (by default)
4649         [      start character class definition
4650         |      start of alternative branch
4651         (      start subpattern
4652         )      end subpattern
4653         ?      extends the meaning of (
4654                also 0 or 1 quantifier
4655                also quantifier minimizer
4656         *      0 or more quantifier
4657         +      1 or more quantifier
4658                also "possessive quantifier"
4659         {      start min/max quantifier
4660
4661       Part  of  a  pattern  that is in square brackets is called a "character
4662       class". In a character class the only metacharacters are:
4663
4664         \      general escape character
4665         ^      negate the class, but only if the first character
4666         -      indicates character range
4667         [      POSIX character class (only if followed by POSIX
4668                  syntax)
4669         ]      terminates the character class
4670
4671       The following sections describe the use of each of the metacharacters.
4672
4673
4674BACKSLASH
4675
4676       The backslash character has several uses. Firstly, if it is followed by
4677       a character that is not a number or a letter, it takes away any special
4678       meaning that character may have. This use of  backslash  as  an  escape
4679       character applies both inside and outside character classes.
4680
4681       For  example,  if  you want to match a * character, you write \* in the
4682       pattern.  This escaping action applies whether  or  not  the  following
4683       character  would  otherwise be interpreted as a metacharacter, so it is
4684       always safe to precede a non-alphanumeric  with  backslash  to  specify
4685       that  it stands for itself. In particular, if you want to match a back-
4686       slash, you write \\.
4687
4688       In a UTF mode, only ASCII numbers and letters have any special  meaning
4689       after  a  backslash.  All  other characters (in particular, those whose
4690       codepoints are greater than 127) are treated as literals.
4691
4692       If a pattern is compiled with the PCRE_EXTENDED option, white space  in
4693       the  pattern (other than in a character class) and characters between a
4694       # outside a character class and the next newline are ignored. An escap-
4695       ing  backslash  can  be used to include a white space or # character as
4696       part of the pattern.
4697
4698       If you want to remove the special meaning from a  sequence  of  charac-
4699       ters,  you can do so by putting them between \Q and \E. This is differ-
4700       ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
4701       sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
4702       tion. Note the following examples:
4703
4704         Pattern            PCRE matches   Perl matches
4705
4706         \Qabc$xyz\E        abc$xyz        abc followed by the
4707                                             contents of $xyz
4708         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
4709         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
4710
4711       The \Q...\E sequence is recognized both inside  and  outside  character
4712       classes.   An  isolated \E that is not preceded by \Q is ignored. If \Q
4713       is not followed by \E later in the pattern, the literal  interpretation
4714       continues  to  the  end  of  the pattern (that is, \E is assumed at the
4715       end). If the isolated \Q is inside a character class,  this  causes  an
4716       error, because the character class is not terminated.
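
       One  way  to  apply  these rules when a literal string comes from user
       input is to wrap it in \Q...\E before building  the  pattern.  The
       helper below is a sketch with invented names. It assumes that an
       embedded \E in the input can be handled by closing the quoted run,
       matching  a  literal  backslash  and  "E" with \\E, and reopening the
       run with \Q.

```c
#include <stddef.h>
#include <string.h>

/* Append the NUL-terminated string `s` to `out` at offset *n.
 * Returns 1 on success, 0 if it would not fit. */
static int append(char *out, size_t out_size, size_t *n, const char *s)
{
    size_t len = strlen(s);
    if (*n + len + 1 > out_size)
        return 0;
    memcpy(out + *n, s, len);
    *n += len;
    out[*n] = '\0';
    return 1;
}

/* Wrap `literal` in \Q...\E so that every character stands for itself.
 * Returns the length written (without the NUL), or 0 if `out` is too
 * small. */
size_t quote_literal(const char *literal, char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';
    if (!append(out, out_size, &n, "\\Q"))
        return 0;
    while (*literal != '\0') {
        if (literal[0] == '\\' && literal[1] == 'E') {
            /* close the run, emit \\E, reopen the run */
            if (!append(out, out_size, &n, "\\E\\\\E\\Q"))
                return 0;
            literal += 2;
        } else {
            char one[2] = { *literal, '\0' };
            if (!append(out, out_size, &n, one))
                return 0;
            literal += 1;
        }
    }
    if (!append(out, out_size, &n, "\\E"))
        return 0;
    return n;
}
```

       Given  the  input  abc$xyz, the helper produces \Qabc$xyz\E, the first
       pattern in the table above.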
4717
4718   Non-printing characters
4719
4720       A second use of backslash provides a way of encoding non-printing char-
4721       acters in patterns in a visible manner. There is no restriction on  the
4722       appearance  of non-printing characters, apart from the binary zero that
4723       terminates a pattern, but when a pattern  is  being  prepared  by  text
4724       editing,  it  is  often  easier  to  use  one  of  the following escape
4725       sequences than the binary character it represents:
4726
4727         \a        alarm, that is, the BEL character (hex 07)
4728         \cx       "control-x", where x is any ASCII character
4729         \e        escape (hex 1B)
4730         \f        form feed (hex 0C)
4731         \n        linefeed (hex 0A)
4732         \r        carriage return (hex 0D)
4733         \t        tab (hex 09)
4734         \ddd      character with octal code ddd, or back reference
4735         \xhh      character with hex code hh
4736         \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
4737         \uhhhh    character with hex code hhhh (JavaScript mode only)
4738
4739       The precise effect of \cx on ASCII characters is as follows: if x is  a
4740       lower  case  letter,  it  is converted to upper case. Then bit 6 of the
4741       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
4742       (A  is  41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
4743       hex 7B (; is 3B). If the data item (byte or 16-bit value) following  \c
4744       has  a  value greater than 127, a compile-time error occurs. This locks
4745       out non-ASCII characters in all modes.
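
       The ASCII rule can be written out in a few lines. The  function  name
       is  invented  for this sketch, and it assumes an ASCII host; it is not
       taken from the PCRE sources.

```c
/* Value produced by \cx in ASCII mode, or -1 where PCRE would report a
 * compile-time error because x is greater than 127. */
int control_escape_value(unsigned int x)
{
    if (x > 127)
        return -1;
    if (x >= 'a' && x <= 'z')
        x -= 'a' - 'A';          /* lower case is converted to upper case */
    return (int)(x ^ 0x40);      /* then bit 6 (hex 40) is inverted */
}
```

       For example, 'A' and 'a' both give hex 01, '{' gives hex 3B, and ';'
       gives hex 7B, as stated above.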
4746
4747       The \c facility was designed for use with ASCII  characters,  but  with
4748       the  extension  to  Unicode it is even less useful than it once was. It
4749       is, however, recognized when PCRE is compiled  in  EBCDIC  mode,  where
4750       data  items  are always bytes. In this mode, all values are valid after
4751       \c. If the next character is a lower case letter, it  is  converted  to
4752       upper  case.  Then  the  0xc0  bits  of the byte are inverted. Thus \cA
4753       becomes hex 01, as in ASCII (A is C1), but because the  EBCDIC  letters
4754       are  disjoint,  \cZ becomes hex 29 (Z is E9), and other characters also
4755       generate different values.
4756
4757       By default, after \x, from zero to  two  hexadecimal  digits  are  read
4758       (letters can be in upper or lower case). Any number of hexadecimal dig-
4759       its may appear between \x{ and }, but the character code is constrained
4760       as follows:
4761
4762         8-bit non-UTF mode    less than 0x100
4763         8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
4764         16-bit non-UTF mode   less than 0x10000
4765         16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
4766         32-bit non-UTF mode   less than 0x80000000
4767         32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
4768
4769       Invalid  Unicode  codepoints  are  the  range 0xd800 to 0xdfff (the so-
4770       called "surrogate" codepoints), and 0xffef.
4771
4772       If characters other than hexadecimal digits appear between \x{  and  },
4773       or if there is no terminating }, this form of escape is not recognized.
4774       Instead, the initial \x will be  interpreted  as  a  basic  hexadecimal
4775       escape,  with  no  following  digits, giving a character whose value is
4776       zero.
4777
4778       If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation  of  \x
4779       is  as  just described only when it is followed by two hexadecimal dig-
4780       its.  Otherwise, it matches a  literal  "x"  character.  In  JavaScript
4781       mode, support for code points greater than 256 is provided by \u, which
4782       must be followed by four hexadecimal digits;  otherwise  it  matches  a
4783       literal  "u"  character.  Character codes specified by \u in JavaScript
4784       mode are constrained in the same way as those specified by \x  in  non-
4785       JavaScript mode.
4786
4787       Characters whose value is less than 256 can be defined by either of the
4788       two syntaxes for \x (or by \u in JavaScript mode). There is no  differ-
4789       ence in the way they are handled. For example, \xdc is exactly the same
4790       as \x{dc} (or \u00dc in JavaScript mode).
4791
4792       After \0 up to two further octal digits are read. If  there  are  fewer
4793       than  two  digits,  just  those  that  are  present  are used. Thus the
4794       sequence \0\x\07 specifies two binary zeros followed by a BEL character
4795       (code  value 7). Make sure you supply two digits after the initial zero
4796       if the pattern character that follows is itself an octal digit.
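
       Because a pattern is an ordinary zero-terminated C string, raw  data
       that  contains  binary  zeros  has  to be escaped before it can be
       embedded in a pattern. The sketch below (names invented here)  always
       writes  a zero as \000, so a following digit cannot be absorbed, and
       puts a backslash before every other ASCII non-alphanumeric. Bytes of
       128  and  above are copied unchanged, which assumes that the pattern
       is compiled in the same mode as the data.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Escape `len` raw bytes so that they match literally in a pattern.
 * Returns the length written (without the NUL), or -1 if `out` might be
 * too small. The size check is conservative: it asks for room for the
 * longest escape (4 characters) plus the NUL before every byte. */
int escape_bytes(const unsigned char *data, size_t len,
                 char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return -1;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = data[i];
        if (n + 5 > out_size)
            return -1;
        if (c == 0) {
            memcpy(out + n, "\\000", 4);   /* \0 plus two further digits */
            n += 4;
        } else if (c < 128 && !isalnum(c)) {
            out[n++] = '\\';               /* always safe before these */
            out[n++] = (char)c;
        } else {
            out[n++] = (char)c;
        }
    }
    out[n] = '\0';
    return (int)n;
}
```

       The three bytes 'a', zero, '7' become a\0007: the  escape  \000  is
       read  as  the  zero,  and  the  final 7 is left to stand for itself.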
4797
4798       The handling of a backslash followed by a digit other than 0 is compli-
4799       cated.  Outside a character class, PCRE reads it and any following dig-
4800       its as a decimal number. If the number is less than  10,  or  if  there
4801       have been at least that many previous capturing left parentheses in the
4802       expression, the entire  sequence  is  taken  as  a  back  reference.  A
4803       description  of how this works is given later, following the discussion
4804       of parenthesized subpatterns.
4805
4806       Inside a character class, or if the decimal number is  greater  than  9
4807       and  there have not been that many capturing subpatterns, PCRE re-reads
4808       up to three octal digits following the backslash, and uses them to gen-
4809       erate a data character. Any subsequent digits stand for themselves. The
4810       value of the character is constrained in the  same  way  as  characters
4811       specified in hexadecimal.  For example:
4812
4813         \040   is another way of writing an ASCII space
4814         \40    is the same, provided there are fewer than 40
4815                   previous capturing subpatterns
4816         \7     is always a back reference
4817         \11    might be a back reference, or another way of
4818                   writing a tab
4819         \011   is always a tab
4820         \0113  is a tab followed by the character "3"
4821         \113   might be a back reference, otherwise the
4822                   character with octal code 113
4823         \377   might be a back reference, otherwise
4824                   the value 255 (decimal)
4825         \81    is either a back reference, or a binary zero
4826                   followed by the two characters "8" and "1"
4827
4828       Note  that  octal  values of 100 or greater must not be introduced by a
4829       leading zero, because no more than three octal digits are ever read.
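
       The decision procedure for a backslash followed by digits, outside  a
       character  class, can be modelled as follows. This is a sketch of the
       documented rules with invented names, not PCRE's own code, and it does
       not  apply  the  per-mode limits on the resulting character value.

```c
#include <ctype.h>
#include <stddef.h>

typedef struct {
    int is_backref;    /* 1: back reference, 0: data character */
    int value;         /* group number, or character code */
    size_t consumed;   /* digits used; any that follow are literals */
} digit_escape;

/* Classify the digits that follow a backslash outside a character class.
 * `groups_so_far` is the number of capturing left parentheses that have
 * already appeared in the pattern. */
digit_escape classify_digit_escape(const char *digits, int groups_so_far)
{
    digit_escape r = { 0, 0, 0 };
    size_t i = 0;
    int n = 0;

    /* First reading: all the digits, as a decimal number. */
    while (isdigit((unsigned char)digits[i])) {
        if (n <= 65535)                 /* avoid overflow on long runs */
            n = n * 10 + (digits[i] - '0');
        i++;
    }
    /* A leading 0 is never a back reference. */
    if (digits[0] != '0' && (n < 10 || n <= groups_so_far)) {
        r.is_backref = 1;
        r.value = n;
        r.consumed = i;
        return r;
    }
    /* Second reading: up to three octal digits give a data character. */
    i = 0;
    while (i < 3 && digits[i] >= '0' && digits[i] <= '7') {
        r.value = r.value * 8 + (digits[i] - '0');
        i++;
    }
    r.consumed = i;
    return r;
}
```

       With no previous capturing groups, the digits 40 give the data  char-
       acter  32  (space),  while 81 gives a binary zero with no digits con-
       sumed, so that "8" and "1" remain as literals.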
4830
4831       All the sequences that define a single character value can be used both
4832       inside  and  outside character classes. In addition, inside a character
4833       class, \b is interpreted as the backspace character (hex 08).
4834
4835       \N is not allowed in a character class. \B, \R, and \X are not  special
4836       inside  a  character  class.  Like other unrecognized escape sequences,
4837       they are treated as  the  literal  characters  "B",  "R",  and  "X"  by
4838       default,  but cause an error if the PCRE_EXTRA option is set. Outside a
4839       character class, these sequences have different meanings.
4840
4841   Unsupported escape sequences
4842
4843       In Perl, the sequences \l, \L, \u, and \U are recognized by its  string
4844       handler  and  used  to  modify  the  case  of  following characters. By
4845       default, PCRE does not support these escape sequences. However, if  the
4846       PCRE_JAVASCRIPT_COMPAT  option  is set, \U matches a "U" character, and
4847       \u can be used to define a character by code point, as described in the
4848       previous section.
4849
4850   Absolute and relative back references
4851
4852       The  sequence  \g followed by an unsigned or a negative number, option-
4853       ally enclosed in braces, is an absolute or relative back  reference.  A
4854       named back reference can be coded as \g{name}. Back references are dis-
4855       cussed later, following the discussion of parenthesized subpatterns.
4856
4857   Absolute and relative subroutine calls
4858
4859       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
4860       name or a number enclosed either in angle brackets or single quotes, is
4861       an alternative syntax for referencing a subpattern as  a  "subroutine".
4862       Details  are  discussed  later.   Note  that  \g{...} (Perl syntax) and
4863       \g<...> (Oniguruma syntax) are not synonymous. The  former  is  a  back
4864       reference; the latter is a subroutine call.
4865
4866   Generic character types
4867
4868       Another use of backslash is for specifying generic character types:
4869
4870         \d     any decimal digit
4871         \D     any character that is not a decimal digit
4872         \h     any horizontal white space character
4873         \H     any character that is not a horizontal white space character
4874         \s     any white space character
4875         \S     any character that is not a white space character
4876         \v     any vertical white space character
4877         \V     any character that is not a vertical white space character
4878         \w     any "word" character
4879         \W     any "non-word" character
4880
4881       There is also the single sequence \N, which matches a non-newline char-
4882       acter.  This is the same as the "." metacharacter when  PCRE_DOTALL  is
4883       not  set.  Perl also uses \N to match characters by name; PCRE does not
4884       support this.
4885
4886       Each pair of lower and upper case escape sequences partitions the  com-
4887       plete  set  of  characters  into two disjoint sets. Any given character
4888       matches one, and only one, of each pair. The sequences can appear  both
4889       inside  and outside character classes. They each match one character of
4890       the appropriate type. If the current matching point is at  the  end  of
4891       the  subject string, all of them fail, because there is no character to
4892       match.
4893
4894       For compatibility with Perl, \s does not match the VT  character  (code
4895       11).   This  makes it different from the POSIX "space" class. The \s
4896       characters are HT (9), LF (10), FF (12), CR (13), and  space  (32).  If
4897       "use locale;" is included in a Perl script, \s may match the VT charac-
4898       ter. In PCRE, it never does.
4899
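       As an illustration only, the \s set listed above can be written as
       an explicit character class. The sketch below uses Python's re
       module, which is a different engine from PCRE; PCRE_SPACE and
       is_pcre_space are hypothetical helper names, not part of any PCRE
       API.

```python
import re

# Explicit class for the \s set listed above: HT (9), LF (10), FF (12),
# CR (13) and space (32). VT (11) is deliberately left out.
PCRE_SPACE = re.compile(r"[\t\n\f\r ]")


def is_pcre_space(ch: str) -> bool:
    """Return True if the single character ch is in the documented \\s set."""
    return PCRE_SPACE.fullmatch(ch) is not None


# Python's own \s, like the POSIX "space" class, does include VT.
python_s_matches_vt = re.fullmatch(r"\s", "\x0b", re.ASCII) is not None
```

       Here is_pcre_space("\x0b") is false, while Python's built-in \s
       accepts the same character, which is the difference from the POSIX
       class that the text describes.
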
4900       A "word" character is an underscore or any character that is  a  letter
4901       or  digit.   By  default,  the definition of letters and digits is con-
4902       trolled by PCRE's low-valued character tables, and may vary if  locale-
4903       specific  matching is taking place (see "Locale support" in the pcreapi
4904       page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
4905       systems,  or "french" in Windows, some character codes greater than 127
4906       are used for accented letters, and these are then matched  by  \w.  The
4907       use of locales with Unicode is discouraged.
4908
4909       By  default,  in  a  UTF  mode, characters with values greater than 127
4910       never match \d, \s, or \w, and always  match  \D,  \S,  and  \W.  These
4911       sequences  retain  their  original meanings from before UTF support was
4912       available, mainly for efficiency reasons. However, if PCRE is  compiled
4913       with  Unicode property support, and the PCRE_UCP option is set, the be-
4914       haviour is changed so that Unicode properties  are  used  to  determine
4915       character types, as follows:
4916
4917         \d  any character that \p{Nd} matches (decimal digit)
4918         \s  any character that \p{Z} matches, plus HT, LF, FF, CR
4919         \w  any character that \p{L} or \p{N} matches, plus underscore
4920
4921       The  upper case escapes match the inverse sets of characters. Note that
4922       \d matches only decimal digits, whereas \w matches any  Unicode  digit,
4923       as  well as any Unicode letter, and underscore. Note also that PCRE_UCP
4924       affects \b and \B, because they are defined in  terms  of  \w  and  \W.
4925       Matching these sequences is noticeably slower when PCRE_UCP is set.
4926
4927       The  sequences  \h, \H, \v, and \V are features that were added to Perl
4928       at release 5.10. In contrast to the other sequences, which  match  only
4929       ASCII  characters  by  default,  these always match certain high-valued
4930       codepoints, whether or not PCRE_UCP is set. The horizontal space  char-
4931       acters are:
4932
4933         U+0009     Horizontal tab (HT)
4934         U+0020     Space
4935         U+00A0     Non-break space
4936         U+1680     Ogham space mark
4937         U+180E     Mongolian vowel separator
4938         U+2000     En quad
4939         U+2001     Em quad
4940         U+2002     En space
4941         U+2003     Em space
4942         U+2004     Three-per-em space
4943         U+2005     Four-per-em space
4944         U+2006     Six-per-em space
4945         U+2007     Figure space
4946         U+2008     Punctuation space
4947         U+2009     Thin space
4948         U+200A     Hair space
4949         U+202F     Narrow no-break space
4950         U+205F     Medium mathematical space
4951         U+3000     Ideographic space
4952
4953       The vertical space characters are:
4954
4955         U+000A     Linefeed (LF)
4956         U+000B     Vertical tab (VT)
4957         U+000C     Form feed (FF)
4958         U+000D     Carriage return (CR)
4959         U+0085     Next line (NEL)
4960         U+2028     Line separator
4961         U+2029     Paragraph separator
4962
4963       In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
4964       256 are relevant.
4965
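       For engines that lack \h and \v, equivalent classes can be built
       from the two tables above. The following is a minimal sketch in
       Python, assuming its re module; HSPACE, VSPACE and char_class are
       hypothetical names introduced for the example.

```python
import re

# Code points copied from the two tables above.
HSPACE = [0x09, 0x20, 0xA0, 0x1680, 0x180E, *range(0x2000, 0x200B),
          0x202F, 0x205F, 0x3000]
VSPACE = [0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029]


def char_class(code_points, negate=False):
    """Build a character class such as [\\u0009\\u0020...] from code points."""
    body = "".join("\\u%04x" % cp for cp in code_points)
    return "[" + ("^" if negate else "") + body + "]"


H = re.compile(char_class(HSPACE))                    # stands in for \h
NOT_H = re.compile(char_class(HSPACE, negate=True))   # stands in for \H
V = re.compile(char_class(VSPACE))                    # stands in for \v

# Split on any single horizontal space character, ASCII or not.
fields = H.split("a\u00a0b\tc\u3000d")
```

       Because the classes are plain lists of code points, they behave
       the same whether or not any Unicode property support is present,
       which mirrors the statement that \h and \v do not depend on
       PCRE_UCP.
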
4966   Newline sequences
4967
4968       Outside a character class, by default, the escape sequence  \R  matches
4969       any  Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
4970       to the following:
4971
4972         (?>\r\n|\n|\x0b|\f|\r|\x85)
4973
4974       This is an example of an "atomic group", details  of  which  are  given
4975       below.  This particular group matches either the two-character sequence
4976       CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,
4977       U+000A),  VT  (vertical  tab, U+000B), FF (form feed, U+000C), CR (car-
4978       riage return, U+000D), or NEL (next line,  U+0085).  The  two-character
4979       sequence is treated as a single unit that cannot be split.
4980
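       The effect of the atomic group can be seen by comparing it with a
       plain alternation. The sketch below is written for Python's re
       module, not PCRE; it emulates the atomic group with a lookahead
       and a backreference, which is a substitute technique chosen so
       that the example does not depend on atomic group support. PLAIN
       and ATOMIC are hypothetical names.

```python
import re

# Plain alternation: the engine may back off from CRLF and match CR alone.
PLAIN = r"(?:\r\n|\n|\x0b|\f|\r|\x85)"

# Lookahead plus backreference: the lookahead picks one alternative and is
# never re-entered, which emulates the atomic group (?>...).
ATOMIC = r"(?=(?P<nl>\r\n|\n|\x0b|\f|\r|\x85))(?P=nl)"

crlf = "\r\n"

# Ask for "a newline sequence followed by LF" on a subject that is only CRLF.
plain_splits_crlf = re.fullmatch(PLAIN + r"\n", crlf) is not None
atomic_splits_crlf = re.fullmatch(ATOMIC + r"\n", crlf) is not None
```

       The plain version succeeds by splitting CRLF into CR plus LF; the
       atomic version fails, because CRLF has been consumed as a single
       unit and nothing is left for the trailing \n.
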
4981       In  other modes, two additional characters whose codepoints are greater
4982       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
4983       rator,  U+2029).   Unicode character property support is not needed for
4984       these characters to be recognized.
4985
4986       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
4987       the  complete  set  of  Unicode  line  endings)  by  setting the option
4988       PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
4989       (BSR is an abbreviation for "backslash R".) This can be made the default
4990       when PCRE is built; if this is the case, the  other  behaviour  can  be
4991       requested  via  the  PCRE_BSR_UNICODE  option.   It is also possible to
4992       specify these settings by starting a pattern string  with  one  of  the
4993       following sequences:
4994
4995         (*BSR_ANYCRLF)   CR, LF, or CRLF only
4996         (*BSR_UNICODE)   any Unicode newline sequence
4997
4998       These override the default and the options given to the compiling func-
4999       tion, but they can themselves be  overridden  by  options  given  to  a
5000       matching  function.  Note  that  these  special settings, which are not
5001       Perl-compatible, are recognized only at the very start  of  a  pattern,
5002       and  that  they  must  be  in  upper  case. If more than one of them is
5003       present, the last one is used. They can be combined with  a  change  of
5004       newline convention; for example, a pattern can start with:
5005
5006         (*ANY)(*BSR_ANYCRLF)
5007
5008       They  can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF)
5009       or (*UCP) special sequences. Inside a character class, \R is treated as
5010       an  unrecognized  escape  sequence,  and  so  matches the letter "R" by
5011       default, but causes an error if PCRE_EXTRA is set.
5012
5013   Unicode character properties
5014
5015       When PCRE is built with Unicode character property support, three addi-
5016       tional  escape sequences that match characters with specific properties
5017       are available.  When in 8-bit non-UTF-8 mode, these  sequences  are  of
5018       course  limited  to  testing  characters whose codepoints are less than
5019       256, but they do work in this mode.  The extra escape sequences are:
5020
5021         \p{xx}   a character with the xx property
5022         \P{xx}   a character without the xx property
5023         \X       a Unicode extended grapheme cluster
5024
5025       The property names represented by xx above are limited to  the  Unicode
5026       script names, the general category properties, "Any", which matches any
5027       character  (including  newline),  and  some  special  PCRE   properties
5028       (described  in the next section).  Other Perl properties such as "InMu-
5029       sicalSymbols" are not currently supported by PCRE.  Note  that  \P{Any}
5030       does not match any characters, so always causes a match failure.
5031
5032       Sets of Unicode characters are defined as belonging to certain scripts.
5033       A character from one of these sets can be matched using a script  name.
5034       For example:
5035
5036         \p{Greek}
5037         \P{Han}
5038
5039       Those  that are not part of an identified script are lumped together as
5040       "Common". The current list of scripts is:
5041
5042       Arabic, Armenian, Avestan, Balinese, Bamum, Batak,  Bengali,  Bopomofo,
5043       Brahmi,  Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma,
5044       Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic,  Deseret,
5045       Devanagari,   Egyptian_Hieroglyphs,   Ethiopic,  Georgian,  Glagolitic,
5046       Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,  Hira-
5047       gana,   Imperial_Aramaic,  Inherited,  Inscriptional_Pahlavi,  Inscrip-
5048       tional_Parthian,  Javanese,  Kaithi,   Kannada,   Katakana,   Kayah_Li,
5049       Kharoshthi,  Khmer,  Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
5050       Lydian,    Malayalam,    Mandaic,    Meetei_Mayek,    Meroitic_Cursive,
5051       Meroitic_Hieroglyphs,   Miao,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,
5052       Ogham,   Old_Italic,   Old_Persian,   Old_South_Arabian,    Old_Turkic,
5053       Ol_Chiki,  Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari-
5054       tan, Saurashtra, Sharada, Shavian,  Sinhala,  Sora_Sompeng,  Sundanese,
5055       Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet,
5056       Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,  Ugaritic,  Vai,
5057       Yi.
5058
5059       Each character has exactly one Unicode general category property, spec-
5060       ified by a two-letter abbreviation. For compatibility with Perl,  nega-
5061       tion  can  be  specified  by including a circumflex between the opening
5062       brace and the property name.  For  example,  \p{^Lu}  is  the  same  as
5063       \P{Lu}.
5064
5065       If only one letter is specified with \p or \P, it includes all the gen-
5066       eral category properties that start with that letter. In this case,  in
5067       the  absence of negation, the curly brackets in the escape sequence are
5068       optional; these two examples have the same effect:
5069
5070         \p{L}
5071         \pL
5072
5073       The following general category property codes are supported:
5074
5075         C     Other
5076         Cc    Control
5077         Cf    Format
5078         Cn    Unassigned
5079         Co    Private use
5080         Cs    Surrogate
5081
5082         L     Letter
5083         Ll    Lower case letter
5084         Lm    Modifier letter
5085         Lo    Other letter
5086         Lt    Title case letter
5087         Lu    Upper case letter
5088
5089         M     Mark
5090         Mc    Spacing mark
5091         Me    Enclosing mark
5092         Mn    Non-spacing mark
5093
5094         N     Number
5095         Nd    Decimal number
5096         Nl    Letter number
5097         No    Other number
5098
5099         P     Punctuation
5100         Pc    Connector punctuation
5101         Pd    Dash punctuation
5102         Pe    Close punctuation
5103         Pf    Final punctuation
5104         Pi    Initial punctuation
5105         Po    Other punctuation
5106         Ps    Open punctuation
5107
5108         S     Symbol
5109         Sc    Currency symbol
5110         Sk    Modifier symbol
5111         Sm    Mathematical symbol
5112         So    Other symbol
5113
5114         Z     Separator
5115         Zl    Line separator
5116         Zp    Paragraph separator
5117         Zs    Space separator
5118
5119       The special property L& is also supported: it matches a character  that
5120       has  the  Lu,  Ll, or Lt property, in other words, a letter that is not
5121       classified as a modifier or "other".
5122
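       The rules above for general category properties (two-letter
       codes, one-letter groups, circumflex negation, "Any" and L&) can
       be modelled with Python's unicodedata module. This is a sketch
       under assumptions: has_property is a hypothetical helper, script
       names are not handled, and the Unicode tables shipped with Python
       may be a different version from those built into PCRE.

```python
import unicodedata


def has_property(ch: str, prop: str) -> bool:
    """Model \\p{prop} for general category properties only (no scripts)."""
    if prop.startswith("^"):                 # \p{^Lu} is the same as \P{Lu}
        return not has_property(ch, prop[1:])
    category = unicodedata.category(ch)      # e.g. "Lu", "Nd", "Zs"
    if prop == "Any":                        # matches every character
        return True
    if prop == "L&":                         # Lu, Ll or Lt
        return category in ("Lu", "Ll", "Lt")
    if len(prop) == 1:                       # \pL covers Ll, Lm, Lo, Lt, Lu
        return category[0] == prop
    return category == prop
```

       For example, has_property("a", "^Lu") is true, and a modifier
       letter such as U+02B0 has the L property but not L&.
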
5123       The Cs (Surrogate) property applies only to  characters  in  the  range
5124       U+D800  to U+DFFF. Such characters are not valid in Unicode strings and
5125       so cannot be tested by PCRE, unless  UTF  validity  checking  has  been
5126       turned    off    (see    the    discussion    of    PCRE_NO_UTF8_CHECK,
5127       PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page).  Perl
5128       does not support the Cs property.
5129
5130       The  long  synonyms  for  property  names  that  Perl supports (such as
5131       \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
5132       any of these properties with "Is".
5133
5134       No character that is in the Unicode table has the Cn (unassigned) prop-
5135       erty.  Instead, this property is assumed for any code point that is not
5136       in the Unicode table.
5137
5138       Specifying  caseless  matching  does not affect these escape sequences.
5139       For example, \p{Lu} always matches only upper case letters.
5140
5141       Matching characters by Unicode property is not fast, because  PCRE  has
5142       to  do  a  multistage table lookup in order to find a character's prop-
5143       erty. That is why the traditional escape sequences such as \d and \w do
5144       not use Unicode properties in PCRE by default, though you can make them
5145       do so by setting the PCRE_UCP option or by starting  the  pattern  with
5146       (*UCP).
5147
5148   Extended grapheme clusters
5149
5150       The  \X  escape  matches  any number of Unicode characters that form an
5151       "extended grapheme cluster", and treats the sequence as an atomic group
5152       (see  below).   Up  to and including release 8.31, PCRE matched an ear-
5153       lier, simpler definition that was equivalent to
5154
5155         (?>\PM\pM*)
5156
5157       That is, it matched a character without the "mark"  property,  followed
5158       by  zero  or  more characters with the "mark" property. Characters with
5159       the "mark" property are typically non-spacing accents that  affect  the
5160       preceding character.
5161
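       The earlier definition can be sketched without a regular
       expression at all. The following Python function is a
       hypothetical helper, not PCRE code; it groups each base character
       with the marks that follow it, using the general category from
       Python's unicodedata module.

```python
import unicodedata


def legacy_clusters(text: str) -> list:
    """Split text into base-plus-marks units, as in (?>\\PM\\pM*)."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch      # attach the mark to the preceding unit
        else:
            clusters.append(ch)     # start a new unit
    return clusters
```

       With the subject "e" followed by U+0301 and then "a", the result
       has two units: the accented "e" and the plain "a". Marks at the
       very start of the text have no base character; this helper keeps
       them as a unit of their own, whereas the pattern itself would not
       match at that position.
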
5162       This  simple definition was extended in Unicode to include more compli-
5163       cated kinds of composite character by giving each character a  grapheme
5164       breaking  property,  and  creating  rules  that use these properties to
5165       define the boundaries of extended grapheme  clusters.  In  releases  of
5166       PCRE later than 8.31, \X matches one of these clusters.
5167
5168       \X  always  matches  at least one character. Then it decides whether to
5169       add additional characters according to the following rules for ending a
5170       cluster:
5171
5172       1. End at the end of the subject string.
5173
5174       2.  Do not end between CR and LF; otherwise end after any control char-
5175       acter.
5176
5177       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
5178       characters  are of five types: L, V, T, LV, and LVT. An L character may
5179       be followed by an L, V, LV, or LVT character; an LV or V character  may
5180       be followed by a V or T character; an LVT or T character may be followed
5181       only by a T character.
5182
5183       4. Do not end before extending characters or spacing marks.  Characters
5184       with  the  "mark"  property  always have the "extend" grapheme breaking
5185       property.
5186
5187       5. Do not end after prepend characters.
5188
5189       6. Otherwise, end the cluster.
5190
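       Rule 3 above amounts to a small lookup table. The sketch below
       expresses it in Python. The table follows the rule as stated; the
       code point ranges in hangul_type are an assumption taken from the
       basic Unicode Hangul Jamo and Hangul Syllables blocks (the
       extended Jamo blocks are omitted), and all names are hypothetical.

```python
# Which Hangul syllable types may follow which without a break (rule 3).
HANGUL_MAY_FOLLOW = {
    "L":   {"L", "V", "LV", "LVT"},
    "LV":  {"V", "T"},
    "V":   {"V", "T"},
    "LVT": {"T"},
    "T":   {"T"},
}


def hangul_type(ch: str):
    """Classify a character as L, V, T, LV or LVT (basic ranges only)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:              # precomposed syllables
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None                             # not a Hangul character


def no_break_between(prev: str, nxt: str) -> bool:
    """True if rule 3 forbids ending the cluster between prev and nxt."""
    allowed = HANGUL_MAY_FOLLOW.get(hangul_type(prev), set())
    return hangul_type(nxt) in allowed
```

       Only rule 3 is modelled here; the other rules still decide where a
       cluster ends when no_break_between() returns false.
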
5191   PCRE's additional properties
5192
5193       As well as the standard Unicode properties described above,  PCRE  sup-
5194       ports  four  more  that  make it possible to convert traditional escape
5195       sequences such as \w and \s and POSIX character classes to use  Unicode
5196       properties.  PCRE  uses  these non-standard, non-Perl properties inter-
5197       nally when PCRE_UCP is set. They are:
5198
5199         Xan   Any alphanumeric character
5200         Xps   Any POSIX space character
5201         Xsp   Any Perl space character
5202         Xwd   Any Perl "word" character
5203
5204       Xan matches characters that have either the L (letter) or the  N  (num-
5205       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
5206       form feed, or carriage return, and any other character that has  the  Z
5207       (separator) property.  Xsp is the same as Xps, except that vertical tab
5208       is excluded. Xwd matches the same characters as Xan, plus underscore.
5209
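       The four definitions above translate directly into predicates on
       the general category. A minimal sketch in Python follows; xan,
       xps, xsp and xwd are hypothetical functions named after the
       properties, and they rely on Python's unicodedata tables rather
       than PCRE's.

```python
import unicodedata


def major(ch: str) -> str:
    """First letter of the general category, e.g. "L", "N" or "Z"."""
    return unicodedata.category(ch)[0]


def xan(ch: str) -> bool:
    """Xan: letter or number."""
    return major(ch) in ("L", "N")


def xps(ch: str) -> bool:
    """Xps: tab, linefeed, vertical tab, form feed, CR, or any separator."""
    return ch in ("\t", "\n", "\x0b", "\f", "\r") or major(ch) == "Z"


def xsp(ch: str) -> bool:
    """Xsp: the same as Xps, except that vertical tab is excluded."""
    return ch != "\x0b" and xps(ch)


def xwd(ch: str) -> bool:
    """Xwd: the same as Xan, plus underscore."""
    return ch == "_" or xan(ch)
```

       The only character on which xps() and xsp() disagree is vertical
       tab, as the text states.
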
5210   Resetting the match start
5211
5212       The escape sequence \K causes any previously matched characters not  to
5213       be included in the final matched sequence. For example, the pattern:
5214
5215         foo\Kbar
5216
5217       matches  "foobar",  but reports that it has matched "bar". This feature
5218       is similar to a lookbehind assertion (described  below).   However,  in
5219       this  case, the part of the subject before the real match does not have
5220       to be of fixed length, as lookbehind assertions do. The use of \K  does
5221       not  interfere  with  the setting of captured substrings.  For example,
5222       when the pattern
5223
5224         (foo)\Kbar
5225
5226       matches "foobar", the first substring is still set to "foo".
5227
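       Python's re module has no \K, so the following sketch emulates the
       effect with a named group and reports only that group's span. It
       also shows the fixed-length restriction on lookbehind that \K
       avoids. search_keep_out is a hypothetical helper, and the
       lookbehind check reflects the behaviour of Python's engine, not
       PCRE's.

```python
import re


def search_keep_out(prefix: str, kept: str, subject: str):
    """Find prefix followed by kept, but report only the span of kept."""
    m = re.search("(?:%s)(?P<kept>%s)" % (prefix, kept), subject)
    return None if m is None else m.span("kept")


# A variable-length prefix is fine here ...
span = search_keep_out(r"fo+", r"bar", "xfooobar")

# ... but the same prefix is rejected inside a lookbehind assertion.
try:
    re.compile(r"(?<=fo+)bar")
    lookbehind_accepts_variable_length = True
except re.error:
    lookbehind_accepts_variable_length = False
```

       The reported span covers only "bar", even though the prefix had to
       be matched first, which corresponds to foo\Kbar reporting "bar".
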
5228       Perl documents that the use  of  \K  within  assertions  is  "not  well
5229       defined".  In  PCRE,  \K  is  acted upon when it occurs inside positive
5230       assertions, but is ignored in negative assertions.
5231
5232   Simple assertions
5233
5234       The final use of backslash is for certain simple assertions. An  asser-
5235       tion  specifies a condition that has to be met at a particular point in
5236       a match, without consuming any characters from the subject string.  The
5237       use  of subpatterns for more complicated assertions is described below.
5238       The backslashed assertions are:
5239
5240         \b     matches at a word boundary
5241         \B     matches when not at a word boundary
5242         \A     matches at the start of the subject
5243         \Z     matches at the end of the subject
5244                 also matches before a newline at the end of the subject
5245         \z     matches only at the end of the subject
5246         \G     matches at the first matching position in the subject
5247
5248       Inside a character class, \b has a different meaning;  it  matches  the
5249       backspace  character.  If  any  other  of these assertions appears in a
5250       character class, by default it matches the corresponding literal  char-
5251       acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
5252       PCRE_EXTRA option is set, an "invalid escape sequence" error is  gener-
5253       ated instead.
5254
5255       A  word  boundary is a position in the subject string where the current
5256       character and the previous character do not both match \w or  \W  (i.e.
5257       one  matches  \w  and the other matches \W), or the start or end of the
5258       string if the first or last character matches \w,  respectively.  In  a
5259       UTF  mode,  the  meanings  of  \w  and \W can be changed by setting the
5260       PCRE_UCP option. When this is done, it also affects \b and \B.  Neither
5261       PCRE  nor  Perl has a separate "start of word" or "end of word" metase-
5262       quence. However, whatever follows \b normally determines which  it  is.
5263       For example, the fragment \ba matches "a" at the start of a word.
5264
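       The \ba fragment can be tried in any engine with the same \b
       semantics for ASCII text. A short sketch in Python (whose re
       module is not PCRE, but agrees on this point) follows; the
       variable names are illustrative only.

```python
import re

text = "an apple, banana and data"

# \b before "a" can only be a start-of-word boundary, because the "a" that
# follows is itself a word character.
words_starting_with_a = re.findall(r"\ba\w*", text)

# \B is the complement: an "a" that is preceded by another word character.
inner_a_count = len(re.findall(r"\Ba", text))
```

       The first search finds "an", "apple" and "and"; the second counts
       the five occurrences of "a" inside "banana" and "data".
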
5265       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
5266       and dollar (described in the next section) in that they only ever match
5267       at  the  very start and end of the subject string, whatever options are
5268       set. Thus, they are independent of multiline mode. These  three  asser-
5269       tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
5270       affect only the behaviour of the circumflex and dollar  metacharacters.
5271       However,  if the startoffset argument of pcre_exec() is non-zero, indi-
5272       cating that matching is to start at a point other than the beginning of
5273       the  subject,  \A  can never match. The difference between \Z and \z is
5274       that \Z matches before a newline at the end of the string as well as at
5275       the very end, whereas \z matches only at the end.
5276
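       The difference between \Z and \z can be reproduced in Python, with
       the caveat that the spelling differs: in Python's re module an
       unflagged "$" behaves like the \Z described above, and Python's
       "\Z" behaves like the \z described above. The sketch below is
       therefore an analogy, not PCRE syntax, and the variable names are
       hypothetical.

```python
import re

subject = "abc\n"

# Like PCRE's \Z: end of subject, or just before a final newline.
# (Spelled "$" in Python when MULTILINE is not set.)
like_pcre_Z = re.search(r"abc$", subject)

# Like PCRE's \z: only the very end of the subject.
# (Spelled "\Z" in Python.)
like_pcre_z = re.search(r"abc\Z", subject)

# \A cannot match when the search starts at a non-zero offset.
a_at_offset = re.compile(r"\Aabc").search("xabc", 1)
a_at_zero = re.compile(r"\Aabc").search("abc", 0)
```

       The first search succeeds without consuming the newline, the
       second fails because a newline follows "abc", and \A matches only
       when the search begins at offset zero.
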
5277       The  \G assertion is true only when the current matching position is at
5278       the start point of the match, as specified by the startoffset  argument
5279       of  pcre_exec().  It  differs  from \A when the value of startoffset is
5280       non-zero. By calling pcre_exec() multiple times with appropriate  argu-
5281       ments, you can mimic Perl's /g option, and it is in this kind of imple-
5282       mentation where \G can be useful.
5283
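       The loop that mimics /g can be sketched as follows. Python has no
       \G, so the example assumes that a compiled pattern's match()
       method with a start position plays the role of a \G-anchored
       pcre_exec() call with a startoffset argument; scan_from_start is a
       hypothetical helper.

```python
import re


def scan_from_start(pattern: str, subject: str) -> list:
    """Collect consecutive matches, each anchored where the last one ended."""
    rx = re.compile(pattern)
    pos, found = 0, []
    while pos <= len(subject):
        m = rx.match(subject, pos)      # must match exactly at pos, like \G
        if m is None:
            break                       # the chain of matches is broken
        found.append(m.group())
        # Step past an empty match so the loop always makes progress.
        pos = m.end() if m.end() > pos else pos + 1
    return found


tokens = scan_from_start(r"\d+,?", "12,34,x56")
```

       The scan stops at "x", so "56" is not reported, whereas an
       unanchored search for the same pattern would also find "56".
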
5284       Note, however, that PCRE's interpretation of \G, as the  start  of  the
5285       current match, is subtly different from Perl's, which defines it as the
5286       end of the previous match. In Perl, these can  be  different  when  the
5287       previously  matched  string was empty. Because PCRE does just one match
5288       at a time, it cannot reproduce this behaviour.
5289
5290       If all the alternatives of a pattern begin with \G, the  expression  is
5291       anchored to the starting match position, and the "anchored" flag is set
5292       in the compiled regular expression.
5293
5294
5295CIRCUMFLEX AND DOLLAR
5296
5297       The circumflex and dollar  metacharacters  are  zero-width  assertions.
5298       That  is,  they test for a particular condition being true without con-
5299       suming any characters from the subject string.
5300
5301       Outside a character class, in the default matching mode, the circumflex
5302       character  is  an  assertion  that is true only if the current matching
5303       point is at the start of the subject string. If the  startoffset  argu-
5304       ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
5305       PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
5306       has an entirely different meaning (see below).
5307
5308       Circumflex  need  not be the first character of the pattern if a number
5309       of alternatives are involved, but it should be the first thing in  each
5310       alternative  in  which  it appears if the pattern is ever to match that
5311       branch. If all possible alternatives start with a circumflex, that  is,
5312       if  the  pattern  is constrained to match only at the start of the sub-
5313       ject, it is said to be an "anchored" pattern.  (There  are  also  other
5314       constructs that can cause a pattern to be anchored.)
5315
5316       The  dollar  character is an assertion that is true only if the current
5317       matching point is at the end of  the  subject  string,  or  immediately
5318       before  a newline at the end of the string (by default). Note, however,
5319       that it does not actually match the newline. Dollar  need  not  be  the
5320       last character of the pattern if a number of alternatives are involved,
5321       but it should be the last item in any branch in which it appears.  Dol-
5322       lar has no special meaning in a character class.
5323
5324       The  meaning  of  dollar  can be changed so that it matches only at the
5325       very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
5326       compile time. This does not affect the \Z assertion.
5327
5328       The meanings of the circumflex and dollar characters are changed if the
5329       PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
5330       matches  immediately after internal newlines as well as at the start of
5331       the subject string. It does not match after a  newline  that  ends  the
5332       string.  A dollar matches before any newlines in the string, as well as
5333       at the very end, when PCRE_MULTILINE is set. When newline is  specified
5334       as  the  two-character  sequence CRLF, isolated CR and LF characters do
5335       not indicate newlines.
5336
5337       For example, the pattern /^abc$/ matches the subject string  "def\nabc"
5338       (where  \n  represents a newline) in multiline mode, but not otherwise.
5339       Consequently, patterns that are anchored in single  line  mode  because
5340       all  branches  start  with  ^ are not anchored in multiline mode, and a
5341       match for circumflex is  possible  when  the  startoffset  argument  of
5342       pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
5343       PCRE_MULTILINE is set.
5344
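       The /^abc$/ example above can be checked with Python's re module,
       which behaves in the same way for this basic case when the newline
       is LF (edge cases may differ between the engines). The start
       position passed to search() stands in for the startoffset
       argument; variable names are illustrative.

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)                 # default mode
multi_line = re.search(r"^abc$", subject, re.MULTILINE)    # multiline mode

# With a non-zero start offset, ^ can still match in multiline mode
# (here just after the newline), but never in the default mode.
rx_multi = re.compile(r"^abc", re.MULTILINE)
rx_single = re.compile(r"^abc")
offset_multi = rx_multi.search(subject, 4)
offset_single = rx_single.search("xabc", 1)
```

       Only the multiline searches succeed, and both report a match that
       starts at offset 4, immediately after the internal newline.
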
5345       Note that the sequences \A, \Z, and \z can be used to match  the  start
5346       and  end of the subject in both modes, and if all branches of a pattern
5347       start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
5348       set.
5349
5350
5351FULL STOP (PERIOD, DOT) AND \N
5352
5353       Outside a character class, a dot in the pattern matches any one charac-
5354       ter in the subject string except (by default) a character  that  signi-
5355       fies the end of a line.
5356
5357       When  a line ending is defined as a single character, dot never matches
5358       that character; when the two-character sequence CRLF is used, dot  does
5359       not  match  CR  if  it  is immediately followed by LF, but otherwise it
5360       matches all characters (including isolated CRs and LFs). When any  Uni-
5361       code  line endings are being recognized, dot does not match CR or LF or
5362       any of the other line ending characters.
5363
5364       The behaviour of dot with regard to newlines can  be  changed.  If  the
5365       PCRE_DOTALL  option  is  set,  a dot matches any one character, without
5366       exception. If the two-character sequence CRLF is present in the subject
5367       string, it takes two dots to match it.
5368
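       A short sketch of these rules, using Python's re module as a
       stand-in: its DOTALL flag corresponds to PCRE_DOTALL for the
       purpose of this example, but Python always treats LF alone as the
       newline, whereas PCRE's newline convention is configurable.

```python
import re

# Default: dot refuses the newline character.
default_dot = re.fullmatch(r"a.b", "a\nb")

# DOTALL: dot matches any one character, newline included.
dotall_dot = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

# CRLF is two characters, so it needs two dots even with DOTALL.
crlf_one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
crlf_two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```

       Of the four attempts, the first and third fail and the second and
       fourth succeed.
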
5369       The  handling of dot is entirely independent of the handling of circum-
5370       flex and dollar, the only relationship being  that  they  both  involve
5371       newlines. Dot has no special meaning in a character class.
5372
5373       The  escape  sequence  \N  behaves  like  a  dot, except that it is not
5374       affected by the PCRE_DOTALL option. In  other  words,  it  matches  any
5375       character  except  one that signifies the end of a line. Perl also uses
5376       \N to match characters by name; PCRE does not support this.
5377
5378
5379MATCHING A SINGLE DATA UNIT
5380
5381       Outside a character class, the escape sequence \C matches any one  data
5382       unit,  whether or not a UTF mode is set. In the 8-bit library, one data
5383       unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
5384       32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
5385       line-ending characters. The feature is provided in  Perl  in  order  to
5386       match individual bytes in UTF-8 mode, but it is unclear how it can use-
5387       fully be used. Because \C breaks up  characters  into  individual  data
5388       units,  matching  one unit with \C in a UTF mode means that the rest of
5389       the string may start with a malformed UTF character. This has undefined
5390       results, because PCRE assumes that it is dealing with valid UTF strings
5391       (and by default it checks this at the start of  processing  unless  the
5392       PCRE_NO_UTF8_CHECK,  PCRE_NO_UTF16_CHECK  or PCRE_NO_UTF32_CHECK option
5393       is used).
5394
5395       PCRE does not allow \C to appear in  lookbehind  assertions  (described
5396       below)  in  a UTF mode, because this would make it impossible to calcu-
5397       late the length of the lookbehind.
5398
5399       In general, the \C escape sequence is best avoided. However, one way of
5400       using  it that avoids the problem of malformed UTF characters is to use
5401       a lookahead to check the length of the next character, as in this  pat-
5402       tern,  which  could be used with a UTF-8 string (ignore white space and
5403       line breaks):
5404
5405         (?| (?=[\x00-\x7f])(\C) |
5406             (?=[\x80-\x{7ff}])(\C)(\C) |
5407             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
5408             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
5409
5410       A group that starts with (?| resets the capturing  parentheses  numbers
5411       in  each  alternative  (see  "Duplicate Subpattern Numbers" below). The
5412       assertions at the start of each branch check the next  UTF-8  character
5413       for  values  whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
5414       character's individual bytes are then captured by the appropriate  num-
5415       ber of groups.
5416
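       The four branches of the pattern correspond to the four UTF-8
       encoding lengths. The Python sketch below models that mapping and
       returns the individual bytes that the \C groups would capture;
       utf8_units is a hypothetical helper, not a PCRE function, and it
       does not handle surrogate code points, which cannot be encoded.

```python
def utf8_units(ch: str) -> list:
    """Return the UTF-8 bytes of ch, one per \\C that the pattern would use."""
    cp = ord(ch)
    if cp <= 0x7F:
        expected = 1        # first branch:  [\x00-\x7f]
    elif cp <= 0x7FF:
        expected = 2        # second branch: [\x80-\x{7ff}]
    elif cp <= 0xFFFF:
        expected = 3        # third branch:  [\x{800}-\x{ffff}]
    else:
        expected = 4        # fourth branch: [\x{10000}-\x{1fffff}]
    data = ch.encode("utf-8")
    # The branch chosen by code point must agree with the real encoding.
    assert len(data) == expected
    return list(data)
```

       For instance, U+20AC falls in the third range and yields three
       byte values, matching the three (\C) groups of the third branch.
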
5417
5418SQUARE BRACKETS AND CHARACTER CLASSES
5419
5420       An opening square bracket introduces a character class, terminated by a
5421       closing square bracket. A closing square bracket on its own is not spe-
5422       cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
5423       a lone closing square bracket causes a compile-time error. If a closing
5424       square  bracket  is required as a member of the class, it should be the
5425       first data character in the class  (after  an  initial  circumflex,  if
5426       present) or escaped with a backslash.
5427
5428       A  character  class matches a single character in the subject. In a UTF
5429       mode, the character may be more than one  data  unit  long.  A  matched
5430       character must be in the set of characters defined by the class, unless
5431       the first character in the class definition is a circumflex,  in  which
5432       case the subject character must not be in the set defined by the class.
5433       If a circumflex is actually required as a member of the  class,  ensure
5434       it is not the first character, or escape it with a backslash.
5435
5436       For  example, the character class [aeiou] matches any lower case vowel,
5437       while [^aeiou] matches any character that is not a  lower  case  vowel.
5438       Note that a circumflex is just a convenient notation for specifying the
5439       characters that are in the class by enumerating those that are  not.  A
5440       class  that starts with a circumflex is not an assertion; it still con-
5441       sumes a character from the subject string, and therefore  it  fails  if
5442       the current pointer is at the end of the string.
5443
5444       In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255
5445       (0xffff) can be included in a class as a literal string of data  units,
5446       or by using the \x{ escaping mechanism.
5447
5448       When  caseless  matching  is set, any letters in a class represent both
5449       their upper case and lower case versions, so for  example,  a  caseless
5450       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
5451       match "A", whereas a caseful version would. In a UTF mode, PCRE  always
5452       understands  the  concept  of case for characters whose values are less
5453       than 128, so caseless matching is always possible. For characters  with
5454       higher  values,  the  concept  of case is supported if PCRE is compiled
5455       with Unicode property support, but not otherwise.  If you want  to  use
5456       caseless  matching in a UTF mode for characters 128 and above, you must
5457       ensure that PCRE is compiled with Unicode property support as  well  as
5458       with UTF support.
5459
5460       Characters  that  might  indicate  line breaks are never treated in any
5461       special way  when  matching  character  classes,  whatever  line-ending
5462       sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
5463       PCRE_MULTILINE options is used. A class such as [^a] always matches one
5464       of these characters.
5465
5466       The  minus (hyphen) character can be used to specify a range of charac-
5467       ters in a character  class.  For  example,  [d-m]  matches  any  letter
5468       between  d  and  m,  inclusive.  If  a minus character is required in a
5469       class, it must be escaped with a backslash  or  appear  in  a  position
5470       where  it cannot be interpreted as indicating a range, typically as the
5471       first or last character in the class.
5472
5473       It is not possible to have the literal character "]" as the end charac-
5474       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
5475       two characters ("W" and "-") followed by a literal string "46]", so  it
5476       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
5477       backslash it is interpreted as the end of range, so [W-\]46] is  inter-
5478       preted  as a class containing a range followed by two other characters.
5479       The octal or hexadecimal representation of "]" can also be used to  end
5480       a range.
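
       A minimal sketch of the hyphen and "]" rules above, using Python's
       re  module  as  a stand-in for PCRE. For these particular patterns
       Python is believed to parse the class the same way; the  variable
       names are illustrative.

```python
import re

range_dm = re.compile(r"[d-m]")               # a range from d to m
literal_hyphen = re.compile(r"[a-]")          # hyphen last: "a" or "-"
class_then_literal = re.compile(r"[W-]46]")   # class {W, -} then "46]"
escaped_end = re.compile(r"[W-\]46]")         # range W..] plus "4" and "6"

print(bool(range_dm.fullmatch("h")))
print(bool(class_then_literal.fullmatch("-46]")))
print(bool(escaped_end.fullmatch("Z")))       # "Z" lies between "W" and "]"
```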
5481
5482       Ranges  operate in the collating sequence of character values. They can
5483       also  be  used  for  characters  specified  numerically,  for   example
5484       [\000-\037].  Ranges  can include any characters that are valid for the
5485       current mode.
5486
5487       If a range that includes letters is used when caseless matching is set,
5488       it matches the letters in either case. For example, [W-c] is equivalent
5489       to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
5490       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
5491       accented E characters in both cases. In UTF modes,  PCRE  supports  the
5492       concept  of  case for characters with values greater than 127 only when
5493       it is compiled with Unicode property support.
5494
5495       The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
5496       \w, and \W may appear in a character class, and add the characters that
5497       they match to the class. For example, [\dABCDEF] matches any  hexadeci-
5498       mal  digit.  In  UTF modes, the PCRE_UCP option affects the meanings of
5499       \d, \s, \w and their upper case partners, just as  it  does  when  they
5500       appear  outside a character class, as described in the section entitled
5501       "Generic character types" above. The escape sequence \b has a different
5502       meaning  inside  a character class; it matches the backspace character.
5503       The sequences \B, \N, \R, and \X are not  special  inside  a  character
5504       class.  Like  any other unrecognized escape sequences, they are treated
5505       as the literal characters "B", "N", "R", and "X" by default, but  cause
5506       an error if the PCRE_EXTRA option is set.
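
       The following sketch shows \d adding its characters to a class and
       \b meaning backspace inside a class. It uses Python's re module as
       a  stand-in  for PCRE; re.ASCII is used so that \d covers only 0-9,
       which is an assumption made to mirror non-UCP  behaviour.  Python
       treats  \B, \N, \R, and \X inside a class differently, so they are
       not shown.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)  # \d plus the letters A-F
backspace = re.compile(r"[\b]")                  # \b in a class is backspace

print(bool(hex_digit.fullmatch("7")), bool(hex_digit.fullmatch("C")))
print(bool(backspace.fullmatch("\x08")))
```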
5507
5508       A  circumflex  can  conveniently  be used with the upper case character
5509       types to specify a more restricted set of characters than the  matching
5510       lower  case  type.  For example, the class [^\W_] matches any letter or
5511       digit, but not underscore, whereas [\w] includes underscore. A positive
5512       character class should be read as "something OR something OR ..." and a
5513       negative class as "NOT something AND NOT something AND NOT ...".
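
       As a short illustration of the  "NOT  something  AND  NOT  ..."
       reading,  the  sketch below compares [^\W_] with [\w]. Python's re
       module stands in for PCRE, and re.ASCII is an assumption  used  to
       keep \w restricted to ASCII letters, digits, and underscore.

```python
import re

letter_or_digit = re.compile(r"[^\W_]", re.ASCII)  # NOT non-word AND NOT "_"
word_char = re.compile(r"[\w]", re.ASCII)          # includes underscore

print(bool(letter_or_digit.fullmatch("a")))
print(bool(letter_or_digit.fullmatch("_")))
print(bool(word_char.fullmatch("_")))
```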
5514
5515       The only metacharacters that are recognized in  character  classes  are
5516       backslash,  hyphen  (only  where  it can be interpreted as specifying a
5517       range), circumflex (only at the start), opening  square  bracket  (only
5518       when  it can be interpreted as introducing a POSIX class name - see the
5519       next section), and the terminating  closing  square  bracket.  However,
5520       escaping other non-alphanumeric characters does no harm.
5521
5522
5523POSIX CHARACTER CLASSES
5524
5525       Perl supports the POSIX notation for character classes. This uses names
5526       enclosed by [: and :] within the enclosing square brackets.  PCRE  also
5527       supports this notation. For example,
5528
5529         [01[:alpha:]%]
5530
5531       matches "0", "1", any alphabetic character, or "%". The supported class
5532       names are:
5533
5534         alnum    letters and digits
5535         alpha    letters
5536         ascii    character codes 0 - 127
5537         blank    space or tab only
5538         cntrl    control characters
5539         digit    decimal digits (same as \d)
5540         graph    printing characters, excluding space
5541         lower    lower case letters
5542         print    printing characters, including space
5543         punct    printing characters, excluding letters and digits and space
5544         space    white space (not quite the same as \s)
5545         upper    upper case letters
5546         word     "word" characters (same as \w)
5547         xdigit   hexadecimal digits
5548
5549       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),
5550       and  space  (32). Notice that this list includes the VT character (code
5551       11). This makes "space" different to \s, which does not include VT (for
5552       Perl compatibility).
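
       Python's  re  module  does not implement the [:name:] notation, so
       the sketch below models the class table as plain sets of code points
       instead  of  running patterns. The bounds used for "cntrl", "graph"
       and "print" assume the usual C-locale ASCII definitions, which  the
       table  above  does  not spell out; the names POSIX and BACKSLASH_S
       are illustrative.

```python
import string

LETTERS = set(map(ord, string.ascii_letters))
DIGITS = set(map(ord, string.digits))

# ASCII-only sets transcribed from the table above (code points 0-127).
POSIX = {
    "alnum": LETTERS | DIGITS,
    "alpha": LETTERS,
    "ascii": set(range(128)),
    "blank": {9, 32},                      # tab and space only
    "cntrl": set(range(32)) | {127},       # assumption: C-locale controls
    "digit": DIGITS,
    "graph": set(range(33, 127)),          # assumption: printing, no space
    "lower": set(map(ord, string.ascii_lowercase)),
    "print": set(range(32, 127)),          # assumption: printing, with space
    "space": {9, 10, 11, 12, 13, 32},      # HT, LF, VT, FF, CR, space
    "upper": set(map(ord, string.ascii_uppercase)),
    "word": LETTERS | DIGITS | {ord("_")},
    "xdigit": set(map(ord, string.hexdigits)),
}
POSIX["punct"] = POSIX["graph"] - POSIX["alnum"]

# \s as described above: [:space:] without VT (code 11).
BACKSLASH_S = POSIX["space"] - {11}

print(sorted(POSIX["space"] - BACKSLASH_S))  # [11]
```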
5553
5554       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
5555       from Perl 5.8. Another Perl extension is negation, which  is  indicated
5556       by a ^ character after the colon. For example,
5557
5558         [12[:^digit:]]
5559
5560       matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
5561       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
5562       these are not supported, and an error is given if they are encountered.
5563
5564       By  default,  in  UTF modes, characters with values greater than 127 do
5565       not match any of the POSIX character classes. However, if the  PCRE_UCP
5566       option  is passed to pcre_compile(), some of the classes are changed so
5567       that Unicode character properties are used. This is achieved by replac-
5568       ing the POSIX classes by other sequences, as follows:
5569
5570         [:alnum:]  becomes  \p{Xan}
5571         [:alpha:]  becomes  \p{L}
5572         [:blank:]  becomes  \h
5573         [:digit:]  becomes  \p{Nd}
5574         [:lower:]  becomes  \p{Ll}
5575         [:space:]  becomes  \p{Xps}
5576         [:upper:]  becomes  \p{Lu}
5577         [:word:]   becomes  \p{Xwd}
5578
5579       Negated  versions,  such  as [:^alpha:] use \P instead of \p. The other
5580       POSIX classes are unchanged, and match only characters with code points
5581       less than 128.
5582
5583
5584VERTICAL BAR
5585
5586       Vertical  bar characters are used to separate alternative patterns. For
5587       example, the pattern
5588
5589         gilbert|sullivan
5590
5591       matches either "gilbert" or "sullivan". Any number of alternatives  may
5592       appear,  and  an  empty  alternative  is  permitted (matching the empty
5593       string). The matching process tries each alternative in turn, from left
5594       to  right, and the first one that succeeds is used. If the alternatives
5595       are within a subpattern (defined below), "succeeds" means matching  the
5596       rest of the main pattern as well as the alternative in the subpattern.
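
       The  left-to-right  rule, and the meaning of "succeeds" inside a
       subpattern, can be sketched as follows.  Python's  re  module  is
       used as a stand-in for PCRE; the patterns other than the ones shown
       above are illustrative.

```python
import re

# Alternatives are tried left to right; the first that succeeds is used.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern, so the
# first alternative is given up when "c" cannot follow it.
needs_rest = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"gilbert|sullivan|", "") is not None

print(first_wins, needs_rest, empty_ok)  # a ab True
```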
5597
5598
5599INTERNAL OPTION SETTING
5600
5601       The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
5602       PCRE_EXTENDED options (which are Perl-compatible) can be  changed  from
5603       within  the  pattern  by  a  sequence  of  Perl option letters enclosed
5604       between "(?" and ")".  The option letters are
5605
5606         i  for PCRE_CASELESS
5607         m  for PCRE_MULTILINE
5608         s  for PCRE_DOTALL
5609         x  for PCRE_EXTENDED
5610
5611       For example, (?im) sets caseless, multiline matching. It is also possi-
5612       ble to unset these options by preceding the letter with a hyphen, and a
5613       combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
5614       LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
5615       is also permitted. If a  letter  appears  both  before  and  after  the
5616       hyphen, the option is unset.
5617
5618       The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
5619       can be changed in the same way as the Perl-compatible options by  using
5620       the characters J, U and X respectively.
5621
5622       When  one  of  these  option  changes occurs at top level (that is, not
5623       inside subpattern parentheses), the change applies to the remainder  of
5624       the pattern that follows. If the change is placed right at the start of
5625       a pattern, PCRE extracts it into the global options (and it will there-
5626       fore show up in data extracted by the pcre_fullinfo() function).
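
       A  hedged  sketch  of an option setting placed at the start of a
       pattern, using Python's re module as a stand-in for PCRE. Reading
       the  compiled  pattern's  flags attribute is only a loose analogy
       for the pcre_fullinfo() call mentioned above, and  the  subject
       string is illustrative.

```python
import re

# (?im) at the very start sets caseless, multiline matching.
with_options = re.compile(r"(?im)^abc$")
without_options = re.compile(r"^abc$")

subject = "x\nABC\ny"
found = with_options.search(subject)
not_found = without_options.search(subject)

# The leading option letters show up in the compiled pattern's flags.
is_caseless = bool(with_options.flags & re.IGNORECASE)
is_multiline = bool(with_options.flags & re.MULTILINE)

print(found.group(), not_found, is_caseless, is_multiline)
```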
5627
5628       An  option  change  within a subpattern (see below for a description of
5629       subpatterns) affects only that part of the subpattern that follows  it,
5630       so
5631
5632         (a(?i)b)c
5633
5634       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
5635       used).  By this means, options can be made to have  different  settings
5636       in  different parts of the pattern. Any changes made in one alternative
5637       do carry on into subsequent branches within the  same  subpattern.  For
5638       example,
5639
5640         (a(?i)b|c)
5641
5642       matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
5643       first branch is abandoned before the option setting.  This  is  because
5644       the  effects  of option settings happen at compile time. There would be
5645       some very weird behaviour otherwise.
5646
5647       Note: There are other PCRE-specific options that  can  be  set  by  the
5648       application  when  the  compiling  or matching functions are called. In
5649       some cases the pattern can contain special leading  sequences  such  as
5650       (*CRLF)  to  override  what  the  application  has set or what has been
5651       defaulted.  Details  are  given  in  the  section   entitled   "Newline
5652       sequences"  above. There  are also the (*UTF8), (*UTF16), (*UTF32), and
5653       (*UCP) leading sequences that can be used to set UTF and Unicode  prop-
5654       erty  modes;  they are equivalent to setting the PCRE_UTF8, PCRE_UTF16,
5655       PCRE_UTF32 and the PCRE_UCP options, respectively. The (*UTF)  sequence
5656       is a generic version that can be used with any of the libraries.
5657
5658
5659SUBPATTERNS
5660
5661       Subpatterns are delimited by parentheses (round brackets), which can be
5662       nested.  Turning part of a pattern into a subpattern does two things:
5663
5664       1. It localizes a set of alternatives. For example, the pattern
5665
5666         cat(aract|erpillar|)
5667
5668       matches "cataract", "caterpillar", or "cat". Without  the  parentheses,
5669       it would match "cataract", "erpillar" or an empty string.
5670
5671       2.  It  sets  up  the  subpattern as a capturing subpattern. This means
5672       that, when the whole pattern  matches,  that  portion  of  the  subject
5673       string that matched the subpattern is passed back to the caller via the
5674       ovector argument of the matching function. (This applies  only  to  the
5675       traditional  matching functions; the DFA matching functions do not sup-
5676       port capturing.)
5677
5678       Opening parentheses are counted from left to right (starting from 1) to
5679       obtain  numbers  for  the  capturing  subpatterns.  For example, if the
5680       string "the red king" is matched against the pattern
5681
5682         the ((red|white) (king|queen))
5683
5684       the captured substrings are "red king", "red", and "king", and are num-
5685       bered 1, 2, and 3, respectively.
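
       Both effects of parentheses can be checked with a short sketch.
       Python's re module is used as a stand-in for PCRE; its groups()
       method plays the role that the ovector plays for a PCRE caller.

```python
import re

# 1. Parentheses localize the alternatives.
grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

cat_grouped = grouped.fullmatch("cat")      # matches: empty alternative
cat_ungrouped = ungrouped.fullmatch("cat")  # no alternative is "cat"

# 2. Opening parentheses are numbered from left to right, starting at 1.
royal = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
print(royal.groups())  # ('red king', 'red', 'king')
```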
5686
5687       The  fact  that  plain  parentheses  fulfil two functions is not always
5688       helpful.  There are often times when a grouping subpattern is  required
5689       without  a capturing requirement. If an opening parenthesis is followed
5690       by a question mark and a colon, the subpattern does not do any  captur-
5691       ing,  and  is  not  counted when computing the number of any subsequent
5692       capturing subpatterns. For example, if the string "the white queen"  is
5693       matched against the pattern
5694
5695         the ((?:red|white) (king|queen))
5696
5697       the captured substrings are "white queen" and "queen", and are numbered
5698       1 and 2. The maximum number of capturing subpatterns is 65535.
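
       The effect of (?: on the numbering can be sketched in  the  same
       way, again with Python's re module standing in for PCRE.

```python
import re

# (?:red|white) groups the alternatives but is not counted as a capture.
royal = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

print(royal.groups())  # ('white queen', 'queen')
```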
5699
5700       As a convenient shorthand, if any option settings are required  at  the
5701       start  of  a  non-capturing  subpattern,  the option letters may appear
5702       between the "?" and the ":". Thus the two patterns
5703
5704         (?i:saturday|sunday)
5705         (?:(?i)saturday|sunday)
5706
5707       match exactly the same set of strings. Because alternative branches are
5708       tried  from  left  to right, and options are not reset until the end of
5709       the subpattern is reached, an option setting in one branch does  affect
5710       subsequent  branches,  so  the above patterns match "SUNDAY" as well as
5711       "Saturday".
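
       A small sketch of the (?i: shorthand, with Python's re module as
       a  stand-in for PCRE. Only the first of the two forms is shown,
       because newer Python versions reject an  option  setting  in  the
       middle  of a pattern; the leading "x" in the second pattern is an
       illustrative addition.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")

# The option covers every branch of the group.
upper_second_branch = weekend.fullmatch("SUNDAY")
mixed_first_branch = weekend.fullmatch("Saturday")

# The option does not leak outside the group: "x" stays caseful.
scoped = re.compile(r"x(?i:sunday)")
outside_is_caseful = scoped.fullmatch("Xsunday")

print(bool(upper_second_branch), bool(mixed_first_branch), outside_is_caseful)
```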
5712
5713
5714DUPLICATE SUBPATTERN NUMBERS
5715
5716       Perl 5.10 introduced a feature whereby each alternative in a subpattern
5717       uses  the same numbers for its capturing parentheses. Such a subpattern
5718       starts with (?| and is itself a non-capturing subpattern. For  example,
5719       consider this pattern:
5720
5721         (?|(Sat)ur|(Sun))day
5722
5723       Because  the two alternatives are inside a (?| group, both sets of cap-
5724       turing parentheses are numbered one. Thus, when  the  pattern  matches,
5725       you  can  look  at captured substring number one, whichever alternative
5726       matched. This construct is useful when you want to  capture  part,  but
5727       not all, of one of a number of alternatives. Inside a (?| group, paren-
5728       theses are numbered as usual, but the number is reset at the  start  of
5729       each  branch.  The numbers of any capturing parentheses that follow the
5730       subpattern start after the highest number used in any branch. The  fol-
5731       lowing example is taken from the Perl documentation. The numbers under-
5732       neath show in which buffer the captured content will be stored.
5733
5734         # before  ---------------branch-reset----------- after
5735         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
5736         # 1            2         2  3        2     3     4
5737
5738       A back reference to a numbered subpattern uses the  most  recent  value
5739       that  is  set  for that number by any subpattern. The following pattern
5740       matches "abcabc" or "defdef":
5741
5742         /(?|(abc)|(def))\1/
5743
5744       In contrast, a subroutine call to a numbered subpattern  always  refers
5745       to  the  first  one in the pattern with the given number. The following
5746       pattern matches "abcabc" or "defabc":
5747
5748         /(?|(abc)|(def))(?1)/
5749
5750       If a condition test for a subpattern's having matched refers to a  non-
5751       unique  number, the test is true if any of the subpatterns of that num-
5752       ber have matched.
5753
5754       An alternative approach to using this "branch reset" feature is to  use
5755       duplicate named subpatterns, as described in the next section.
5756
5757
5758NAMED SUBPATTERNS
5759
5760       Identifying  capturing  parentheses  by number is simple, but it can be
5761       very hard to keep track of the numbers in complicated  regular  expres-
5762       sions.  Furthermore,  if  an  expression  is  modified, the numbers may
5763       change. To help with this difficulty, PCRE supports the naming of  sub-
5764       patterns. This feature was not added to Perl until release 5.10. Python
5765       had the feature earlier, and PCRE introduced it at release  4.0,  using
5766       the  Python syntax. PCRE now supports both the Perl and the Python syn-
5767       tax. Perl allows identically numbered  subpatterns  to  have  different
5768       names, but PCRE does not.
5769
5770       In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
5771       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
5772       to  capturing parentheses from other parts of the pattern, such as back
5773       references, recursion, and conditions, can be made by name as  well  as
5774       by number.
5775
5776       Names  consist  of  up  to  32 alphanumeric characters and underscores.
5777       Named capturing parentheses are still  allocated  numbers  as  well  as
5778       names,  exactly as if the names were not present. The PCRE API provides
5779       function calls for extracting the name-to-number translation table from
5780       a compiled pattern. There is also a convenience function for extracting
5781       a captured substring by name.
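
       The relationship between names and numbers can be sketched  with
       the  Python  syntax, which PCRE also accepts. Python's re module
       is used as a stand-in; its groupindex attribute  is  only  an
       analogy for the name-to-number table that the PCRE API exposes,
       and the patterns are illustrative.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2013-04")

# Named groups are still numbered as if the names were absent.
name_table = dict(date.groupindex)
by_name = m.group("year")
by_number = m.group(1)

# A back reference can be made by name as well as by number.
doubled = re.compile(r"(?P<word>\w+) (?P=word)")
repeated = doubled.fullmatch("the the")

print(name_table, by_name, by_number, bool(repeated))
```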
5782
5783       By default, a name must be unique within a pattern, but it is  possible
5784       to relax this constraint by setting the PCRE_DUPNAMES option at compile
5785       time. (Duplicate names are also always permitted for  subpatterns  with
5786       the  same  number, set up as described in the previous section.) Dupli-
5787       cate names can be useful for patterns where only one  instance  of  the
5788       named  parentheses  can  match. Suppose you want to match the name of a
5789       weekday, either as a 3-letter abbreviation or as the full name, and  in
5790       both cases you want to extract the abbreviation. This pattern (ignoring
5791       the line breaks) does the job:
5792
5793         (?<DN>Mon|Fri|Sun)(?:day)?|
5794         (?<DN>Tue)(?:sday)?|
5795         (?<DN>Wed)(?:nesday)?|
5796         (?<DN>Thu)(?:rsday)?|
5797         (?<DN>Sat)(?:urday)?
5798
5799       There are five capturing substrings, but only one is ever set  after  a
5800       match.  (An alternative way of solving this problem is to use a "branch
5801       reset" subpattern, as described in the previous section.)
5802
5803       The convenience function for extracting the data by  name  returns  the
5804       substring  for  the first (and in this example, the only) subpattern of
5805       that name that matched. This saves searching  to  find  which  numbered
5806       subpattern it was.
5807
5808       If  you  make  a  back  reference to a non-unique named subpattern from
5809       elsewhere in the pattern, the one that corresponds to the first  occur-
5810       rence of the name is used. In the absence of duplicate numbers (see the
5811       previous section) this is the one with the lowest number. If you use  a
5812       named  reference  in a condition test (see the section about conditions
5813       below), either to check whether a subpattern has matched, or  to  check
5814       for  recursion,  all  subpatterns with the same name are tested. If the
5815       condition is true for any one of them, the overall condition  is  true.
5816       This is the same behaviour as testing by number. For further details of
5817       the interfaces for handling named subpatterns, see the pcreapi documen-
5818       tation.
5819
5820       Warning: You cannot use different names to distinguish between two sub-
5821       patterns with the same number because PCRE uses only the  numbers  when
5822       matching. For this reason, an error is given at compile time if differ-
5823       ent names are given to subpatterns with the same number.  However,  you
5824       can  give  the same name to subpatterns with the same number, even when
5825       PCRE_DUPNAMES is not set.
5826
5827
5828REPETITION
5829
5830       Repetition is specified by quantifiers, which can  follow  any  of  the
5831       following items:
5832
5833         a literal data character
5834         the dot metacharacter
5835         the \C escape sequence
5836         the \X escape sequence
5837         the \R escape sequence
5838         an escape such as \d or \pL that matches a single character
5839         a character class
5840         a back reference (see next section)
5841         a parenthesized subpattern (including assertions)
5842         a subroutine call to a subpattern (recursive or otherwise)
5843
5844       The  general repetition quantifier specifies a minimum and maximum num-
5845       ber of permitted matches, by giving the two numbers in  curly  brackets
5846       (braces),  separated  by  a comma. The numbers must be less than 65536,
5847       and the first must be less than or equal to the second. For example:
5848
5849         z{2,4}
5850
5851       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
5852       special  character.  If  the second number is omitted, but the comma is
5853       present, there is no upper limit; if the second number  and  the  comma
5854       are  both omitted, the quantifier specifies an exact number of required
5855       matches. Thus
5856
5857         [aeiou]{3,}
5858
5859       matches at least 3 successive vowels, but may match many more, while
5860
5861         \d{8}
5862
5863       matches exactly 8 digits. An opening curly bracket that  appears  in  a
5864       position  where a quantifier is not allowed, or one that does not match
5865       the syntax of a quantifier, is taken as a literal character. For  exam-
5866       ple, {,6} is not a quantifier, but a literal string of four characters.
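
       The three quantifier forms above can be tried out as follows, with
       Python's  re  module  as  a stand-in for PCRE. The {,6} case is
       deliberately left out: Python treats it as a quantifier  meaning
       {0,6},  unlike  the PCRE behaviour described above. The subject
       strings are illustrative.

```python
import re

two_to_four = re.compile(r"z{2,4}")
three_or_more = re.compile(r"[aeiou]{3,}")
exactly_eight = re.compile(r"\d{8}")

print(bool(two_to_four.fullmatch("zzz")))
print(three_or_more.search("queueing").group())  # ueuei
print(bool(exactly_eight.fullmatch("20130401")))
```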
5867
5868       In UTF modes, quantifiers apply to characters rather than to individual
5869       data units. Thus, for example, \x{100}{2} matches two characters,  each
5870       of which is represented by a two-byte sequence in a UTF-8 string. Simi-
5871       larly, \X{3} matches three Unicode extended grapheme clusters, each  of
5872       which  may  be  several  data  units long (and they may be of different
5873       lengths).
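
       The  point  that  a  quantifier counts characters rather than data
       units can be illustrated with Python strings,  which  are  already
       sequences of characters. This is an analogy only: Python's re does
       not support the \x{...} or \X escapes, so the character is written
       as \u0100 in the string literal.

```python
import re

ch = "\u0100"                        # U+0100, written \x{100} in PCRE
units_per_char = len(ch.encode("utf-8"))

# {2} asks for two characters, which occupy four bytes in UTF-8.
two_chars = re.fullmatch("\u0100{2}", ch * 2)
total_units = len((ch * 2).encode("utf-8"))

print(units_per_char, bool(two_chars), total_units)  # 2 True 4
```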
5874
5875       The quantifier {0} is permitted, causing the expression to behave as if
5876       the previous item and the quantifier were not present. This may be use-
5877       ful for subpatterns that are referenced as subroutines  from  elsewhere
5878       in the pattern (but see also the section entitled "Defining subpatterns
5879       for use by reference only" below). Items other  than  subpatterns  that
5880       have a {0} quantifier are omitted from the compiled pattern.
5881
5882       For  convenience, the three most common quantifiers have single-charac-
5883       ter abbreviations:
5884
5885         *    is equivalent to {0,}
5886         +    is equivalent to {1,}
5887         ?    is equivalent to {0,1}
5888
5889       It is possible to construct infinite loops by  following  a  subpattern
5890       that can match no characters with a quantifier that has no upper limit,
5891       for example:
5892
5893         (a?)*
5894
5895       Earlier versions of Perl and PCRE used to give an error at compile time
5896       for  such  patterns. However, because there are cases where this can be
5897       useful, such patterns are now accepted, but if any  repetition  of  the
5898       subpattern  does in fact match no characters, the loop is forcibly bro-
5899       ken.
5900
5901       By default, the quantifiers are "greedy", that is, they match  as  much
5902       as  possible  (up  to  the  maximum number of permitted times), without
5903       causing the rest of the pattern to fail. The classic example  of  where
5904       this gives problems is in trying to match comments in C programs. These
5905       appear between /* and */ and within the comment,  individual  *  and  /
5906       characters  may  appear. An attempt to match C comments by applying the
5907       pattern
5908
5909         /\*.*\*/
5910
5911       to the string
5912
5913         /* first comment */  not comment  /* second comment */
5914
5915       fails, because it matches the entire string owing to the greediness  of
5916       the .*  item.
5917
5918       However,  if  a quantifier is followed by a question mark, it ceases to
5919       be greedy, and instead matches the minimum number of times possible, so
5920       the pattern
5921
5922         /\*.*?\*/
5923
5924       does  the  right  thing with the C comments. The meaning of the various
5925       quantifiers is not otherwise changed,  just  the  preferred  number  of
5926       matches.   Do  not  confuse this use of question mark with its use as a
5927       quantifier in its own right. Because it has two uses, it can  sometimes
5928       appear doubled, as in
5929
5930         \d??\d
5931
5932       which matches one digit by preference, but can match two if that is the
5933       only way the rest of the pattern matches.
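
       The C comment example and the doubled question mark can be checked
       with  the  sketch  below. Python's re module is used as a stand-in
       for PCRE; greedy and  minimizing  quantifiers  behave  the  same
       way in both for these patterns.

```python
import re

text = "/* first comment */  not comment  /* second comment */"

greedy = re.findall(r"/\*.*\*/", text)   # .* runs on to the last */
lazy = re.findall(r"/\*.*?\*/", text)    # .*? stops at the first */

print(len(greedy), lazy)

# \d??\d prefers one digit, but takes two when the rest requires it.
prefers_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()
print(prefers_one, forced_two)  # 1 12
```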
5934
5935       If the PCRE_UNGREEDY option is set (an option that is not available  in
5936       Perl),  the  quantifiers are not greedy by default, but individual ones
5937       can be made greedy by following them with a  question  mark.  In  other
5938       words, it inverts the default behaviour.
5939
5940       When  a  parenthesized  subpattern  is quantified with a minimum repeat
5941       count that is greater than 1 or with a limited maximum, more memory  is
5942       required  for  the  compiled  pattern, in proportion to the size of the
5943       minimum or maximum.
5944
5945       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
5946       alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
5947       the pattern is implicitly anchored, because whatever  follows  will  be
5948       tried  against every character position in the subject string, so there
5949       is no point in retrying the overall match at  any  position  after  the
5950       first.  PCRE  normally treats such a pattern as though it were preceded
5951       by \A.
5952
5953       In cases where it is known that the subject  string  contains  no  new-
5954       lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-
5955       mization, or alternatively using ^ to indicate anchoring explicitly.
5956
5957       However, there are some cases where the optimization  cannot  be  used.
5958       When .*  is inside capturing parentheses that are the subject of a back
5959       reference elsewhere in the pattern, a match at the start may fail where
5960       a later one succeeds. Consider, for example:
5961
5962         (.*)abc\1
5963
5964       If  the subject is "xyz123abc123" the match point is the fourth charac-
5965       ter. For this reason, such a pattern is not implicitly anchored.
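
       The match point in this example can be confirmed with  a  short
       sketch.  Python's  re module is a stand-in here and says nothing
       about PCRE's internal anchoring optimization; it only  shows  why
       a match at the first position is not enough.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# The match starts at offset 3, that is, at the fourth character.
print(m.start(), m.group(1), m.group())  # 3 123 123abc123
```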
5966
5967       Another case where implicit anchoring is not applied is when the  lead-
5968       ing  .* is inside an atomic group. Once again, a match at the start may
5969       fail where a later one succeeds. Consider this pattern:
5970
5971         (?>.*?a)b
5972
5973       It matches "ab" in the subject "aab". The use of the backtracking  con-
5974       trol verbs (*PRUNE) and (*SKIP) also disables this optimization.
5975
5976       When a capturing subpattern is repeated, the value captured is the sub-
5977       string that matched the final iteration. For example, after
5978
5979         (tweedle[dume]{3}\s*)+
5980
5981       has matched "tweedledum tweedledee" the value of the captured substring
5982       is  "tweedledee".  However,  if there are nested capturing subpatterns,
5983       the corresponding captured values may have been set in previous  itera-
5984       tions. For example, after
5985
5986         /(a|(b))+/
5987
5988       matches "aba" the value of the second captured substring is "b".
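
       Both  capture  rules  above  can  be observed in a short sketch.
       Python's re module is used as a stand-in for PCRE; it keeps the
       value  of  a  nested  group  from an earlier iteration in the same
       way.

```python
import re

# The outer group holds what the final iteration matched.
twins = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(twins.group(1))  # tweedledee

# The nested group keeps the value set in an earlier iteration.
nested = re.fullmatch(r"(a|(b))+", "aba")
print(nested.groups())  # ('a', 'b')
```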
5989
5990
5991ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
5992
5993       With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
5994       repetition, failure of what follows normally causes the  repeated  item
5995       to  be  re-evaluated to see if a different number of repeats allows the
5996       rest of the pattern to match. Sometimes it is useful to  prevent  this,
5997       either to change the nature of the match, or to cause it to fail earlier
5998       than it otherwise might, when the author of the pattern knows there  is
5999       no point in carrying on.
6000
6001       Consider,  for  example, the pattern \d+foo when applied to the subject
6002       line
6003
6004         123456bar
6005
6006       After matching all 6 digits and then failing to match "foo", the normal
6007       action  of  the matcher is to try again with only 5 digits matching the
6008       \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
6009       "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
6010       the means for specifying that once a subpattern has matched, it is  not
6011       to be re-evaluated in this way.
6012
6013       If  we  use atomic grouping for the previous example, the matcher gives
6014       up immediately on failing to match "foo" the first time.  The  notation
6015       is a kind of special parenthesis, starting with (?> as in this example:
6016
6017         (?>\d+)foo
6018
6019       This  kind  of  parenthesis "locks up" the  part of the pattern it con-
6020       tains once it has matched, and a failure further into  the  pattern  is
6021       prevented  from  backtracking into it. Backtracking past it to previous
6022       items, however, works as normal.
6023
6024       An alternative description is that a subpattern of  this  type  matches
6025       the  string  of  characters  that an identical standalone pattern would
6026       match, if anchored at the current point in the subject string.
6027
6028       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
6029       such as the above example can be thought of as a maximizing repeat that
6030       must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
6031       pared  to  adjust  the number of digits they match in order to make the
6032       rest of the pattern match, (?>\d+) can only match an entire sequence of
6033       digits.
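
       The "swallow everything" behaviour can be observed with the sketch
       below.  It  uses Python's re module as a stand-in for PCRE and, so
       that it runs on Python versions without (?>...), it  emulates  the
       atomic group with a lookahead followed by a back reference. That
       emulation is an assumption of this sketch, not PCRE  syntax,  and
       unlike a real atomic group it uses up a capturing group.

```python
import re

# A lookahead is not re-entered once it has succeeded, so (?=(X))\1
# behaves like (?>X) for the purposes of this sketch.
plain = re.compile(r"\d+\d")
atomic_like = re.compile(r"(?=(\d+))\1\d")

plain_match = plain.search("123")          # \d+ gives back one digit
atomic_match = atomic_like.search("123")   # nothing is given back: no match

atomic_foo = re.compile(r"(?=(\d+))\1foo")
foo_ok = atomic_foo.search("123456foo")
bar_fails = atomic_foo.search("123456bar")

print(plain_match.group(), atomic_match, foo_ok.group(), bar_fails)
```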
6034
6035       Atomic  groups in general can of course contain arbitrarily complicated
6036       subpatterns, and can be nested. However, when  the  subpattern  for  an
6037       atomic group is just a single repeated item, as in the example above, a
6038       simpler notation, called a "possessive quantifier" can  be  used.  This
6039       consists  of  an  additional  + character following a quantifier. Using
6040       this notation, the previous example can be rewritten as
6041
6042         \d++foo
6043
6044       Note that a possessive quantifier can be used with an entire group, for
6045       example:
6046
6047         (abc|xyz){2,3}+
6048
6049       Possessive   quantifiers   are   always  greedy;  the  setting  of  the
6050       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
6051       simpler  forms  of atomic group. However, there is no difference in the
6052       meaning of a possessive quantifier and  the  equivalent  atomic  group,
6053       though  there  may  be a performance difference; possessive quantifiers
6054       should be slightly faster.
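
       The equivalence of the two notations can be spot-checked as below.
       Python's  re  module  is a stand-in for PCRE and accepts possessive
       quantifiers and (?>...) only from Python 3.11, so the  checks  are
       guarded  by  a  version test; the helper name agree() is illus-
       trative.

```python
import re
import sys

# Python's re accepts possessive quantifiers and (?>...) from 3.11 onwards.
HAS_POSSESSIVE = sys.version_info >= (3, 11)

def agree(possessive, atomic, subject):
    """Return True when both spellings give the same result on subject."""
    a = re.search(possessive, subject)
    b = re.search(atomic, subject)
    return (a.group() if a else None) == (b.group() if b else None)

if HAS_POSSESSIVE:
    print(agree(r"\d++foo", r"(?>\d+)foo", "123456foo"))
    print(agree(r"\d++\d", r"(?>\d+)\d", "123456"))
```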
6055
6056       The possessive quantifier syntax is an extension to the Perl  5.8  syn-
6057       tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
6058       edition of his book. Mike McCloskey liked it, so implemented it when he
6059       built  Sun's Java package, and PCRE copied it from there. It ultimately
6060       found its way into Perl at release 5.10.
6061
6062       PCRE has an optimization that automatically "possessifies" certain sim-
6063       ple  pattern  constructs.  For  example, the sequence A+B is treated as
6064       A++B because there is no point in backtracking into a sequence  of  A's
6065       when B must follow.
6066
6067       When  a  pattern  contains an unlimited repeat inside a subpattern that
6068       can itself be repeated an unlimited number of  times,  the  use  of  an
6069       atomic  group  is  the  only way to avoid some failing matches taking a
6070       very long time indeed. The pattern
6071
6072         (\D+|<\d+>)*[!?]
6073
6074       matches an unlimited number of substrings that either consist  of  non-
6075       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
6076       matches, it runs quickly. However, if it is applied to
6077
6078         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
6079
6080       it takes a long time before reporting  failure.  This  is  because  the
6081       string  can be divided between the internal \D+ repeat and the external
6082       * repeat in a large number of ways, and all  have  to  be  tried.  (The
6083       example  uses  [!?]  rather than a single character at the end, because
6084       both PCRE and Perl have an optimization that allows  for  fast  failure
6085       when  a single character is used. They remember the last single charac-
6086       ter that is required for a match, and fail early if it is  not  present
6087       in  the  string.)  If  the pattern is changed so that it uses an atomic
6088       group, like this:
6089
6090         ((?>\D+)|<\d+>)*[!?]
6091
6092       sequences of non-digits cannot be broken, and failure happens quickly.
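
       The size of the search space in the failing case can be estimated with
       a little arithmetic. The following Python sketch is not part of PCRE,
       and the function name is made up for illustration. It counts the ways
       in which a run of n non-digits can be cut into consecutive non-empty
       pieces, each piece being one iteration of \D+ under the outer *. The
       count doubles with every extra character. This only illustrates the
       growth; the exact amount of work done by a real matcher may differ.

```python
def count_splits(n):
    """Count ways to cut n characters into consecutive non-empty pieces."""
    ways = [1] + [0] * n          # ways[0] = 1: the empty prefix
    for end in range(1, n + 1):
        # The last piece may start at any earlier cut point.
        ways[end] = sum(ways[start] for start in range(end))
    return ways[n]

for n in (1, 2, 3, 10, 20):
    print(n, count_splits(n))     # count_splits(n) equals 2 ** (n - 1)
```

       With the atomic group, each run of non-digits can be taken in only one
       way, so this multiplication of alternatives does not arise.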
6093
6094
6095BACK REFERENCES
6096
6097       Outside a character class, a backslash followed by a digit greater than
6098       0 (and possibly further digits) is a back reference to a capturing sub-
6099       pattern earlier (that is, to its left) in the pattern,  provided  there
6100       have been that many previous capturing left parentheses.
6101
6102       However, if the decimal number following the backslash is less than 10,
6103       it is always taken as a back reference, and causes  an  error  only  if
6104       there  are  not that many capturing left parentheses in the entire pat-
6105       tern. In other words, the parentheses that are referenced need  not  be
6106       to  the left of the reference for numbers less than 10. A "forward back
6107       reference" of this type can make sense when a  repetition  is  involved
6108       and  the  subpattern to the right has participated in an earlier itera-
6109       tion.
6110
6111       It is not possible to have a numerical "forward back  reference"  to  a
6112       subpattern  whose  number  is  10  or  more using this syntax because a
6113       sequence such as \50 is interpreted as a character  defined  in  octal.
6114       See the subsection entitled "Non-printing characters" above for further
6115       details of the handling of digits following a backslash.  There  is  no
6116       such  problem  when named parentheses are used. A back reference to any
6117       subpattern is possible using named parentheses (see below).
6118
6119       Another way of avoiding the ambiguity inherent in  the  use  of  digits
6120       following  a  backslash  is  to use the \g escape sequence. This escape
6121       must be followed by an unsigned number or a negative number, optionally
6122       enclosed in braces. These examples are all identical:
6123
6124         (ring), \1
6125         (ring), \g1
6126         (ring), \g{1}
6127
6128       An  unsigned number specifies an absolute reference without the ambigu-
6129       ity that is present in the older syntax. It is also useful when literal
6130       digits follow the reference. A negative number is a relative reference.
6131       Consider this example:
6132
6133         (abc(def)ghi)\g{-1}
6134
6135       The sequence \g{-1} is a reference to the most recently started
6136       capturing subpattern before \g, that is, it is equivalent to \2 in this
6137       example. Similarly, \g{-2} would be equivalent to \1. The use of relative
6138       references  can  be helpful in long patterns, and also in patterns that
6139       are created by  joining  together  fragments  that  contain  references
6140       within themselves.
6141
6142       A  back  reference matches whatever actually matched the capturing sub-
6143       pattern in the current subject string, rather  than  anything  matching
6144       the subpattern itself (see "Subpatterns as subroutines" below for a way
6145       of doing that). So the pattern
6146
6147         (sens|respons)e and \1ibility
6148
6149       matches "sense and sensibility" and "response and responsibility",  but
6150       not  "sense and responsibility". If caseful matching is in force at the
6151       time of the back reference, the case of letters is relevant. For  exam-
6152       ple,
6153
6154         ((?i)rah)\s+\1
6155
6156       matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
6157       original capturing subpattern is matched caselessly.
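
       The (sens|respons) example above can be tried in any engine that has
       Perl-style back references. The sketch below uses Python's re module
       as a stand-in for PCRE, an assumption made only so that the example is
       self-contained; the helper name is hypothetical. For this pattern the
       two engines should behave alike.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

def matches(subject):
    """True if the whole subject matches the pattern."""
    return pattern.fullmatch(subject) is not None

print(matches("sense and sensibility"))        # \1 repeats the captured "sens"
print(matches("response and responsibility"))  # \1 repeats the captured "respons"
print(matches("sense and responsibility"))     # \1 is "sens", so this fails
```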
6158
6159       There are several different ways of writing back  references  to  named
6160       subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
6161       \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
6162       unified back reference syntax, in which \g can be used for both numeric
6163       and named references, is also supported. We  could  rewrite  the  above
6164       example in any of the following ways:
6165
6166         (?<p1>(?i)rah)\s+\k<p1>
6167         (?'p1'(?i)rah)\s+\k{p1}
6168         (?P<p1>(?i)rah)\s+(?P=p1)
6169         (?<p1>(?i)rah)\s+\g{p1}
6170
6171       A  subpattern  that  is  referenced  by  name may appear in the pattern
6172       before or after the reference.
6173
6174       There may be more than one back reference to the same subpattern. If  a
6175       subpattern  has  not actually been used in a particular match, any back
6176       references to it always fail by default. For example, the pattern
6177
6178         (a|(bc))\2
6179
6180       always fails if it starts to match "a" rather than  "bc".  However,  if
6181       the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
6182       ence to an unset value matches an empty string.
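
       The default behaviour can be sketched as follows, again using Python's
       re module as an assumed stand-in for PCRE. Python has no counterpart
       of PCRE_JAVASCRIPT_COMPAT, so only the default case is shown.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is set to "bc" on the second branch, so \2 can match "bc" again.
print(pattern.search("bcbc"))
# On the "a" branch group 2 is unset, so the back reference fails.
print(pattern.search("aa"))
```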
6183
6184       Because there may be many capturing parentheses in a pattern, all  dig-
6185       its  following a backslash are taken as part of a potential back refer-
6186       ence number.  If the pattern continues with  a  digit  character,  some
6187       delimiter  must  be  used  to  terminate  the  back  reference.  If the
6188       PCRE_EXTENDED option is set, this can be white  space.  Otherwise,  the
6189       \g{ syntax or an empty comment (see "Comments" below) can be used.
6190
6191   Recursive back references
6192
6193       A  back reference that occurs inside the parentheses to which it refers
6194       fails when the subpattern is first used, so, for example,  (a\1)  never
6195       matches.   However,  such references can be useful inside repeated sub-
6196       patterns. For example, the pattern
6197
6198         (a|b\1)+
6199
6200       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
6201       ation  of  the  subpattern,  the  back  reference matches the character
6202       string corresponding to the previous iteration. In order  for  this  to
6203       work,  the  pattern must be such that the first iteration does not need
6204       to match the back reference. This can be done using alternation, as  in
6205       the example above, or by a quantifier with a minimum of zero.
6206
6207       Back  references of this type cause the group that they reference to be
6208       treated as an atomic group.  Once the whole group has been  matched,  a
6209       subsequent  matching  failure cannot cause backtracking into the middle
6210       of the group.
6211
6212
6213ASSERTIONS
6214
6215       An assertion is a test on the characters  following  or  preceding  the
6216       current  matching  point that does not actually consume any characters.
6217       The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
6218       described above.
6219
6220       More  complicated  assertions  are  coded as subpatterns. There are two
6221       kinds: those that look ahead of the current  position  in  the  subject
6222       string,  and  those  that  look  behind  it. An assertion subpattern is
6223       matched in the normal way, except that it does not  cause  the  current
6224       matching position to be changed.
6225
6226       Assertion  subpatterns are not capturing subpatterns. If such an asser-
6227       tion contains capturing subpatterns within it, these  are  counted  for
6228       the  purposes  of numbering the capturing subpatterns in the whole pat-
6229       tern. However, substring capturing is carried  out  only  for  positive
6230       assertions, because it does not make sense for negative assertions.
6231
6232       For  compatibility  with  Perl,  assertion subpatterns may be repeated;
6233       though it makes no sense to assert the same thing  several  times,  the
6234       side  effect  of  capturing  parentheses may occasionally be useful. In
6235       practice, there are only three cases:
6236
6237       (1) If the quantifier is {0}, the  assertion  is  never  obeyed  during
6238       matching.   However,  it  may  contain internal capturing parenthesized
6239       groups that are called from elsewhere via the subroutine mechanism.
6240
6241       (2) If the quantifier is {0,n} where n is greater than zero, it is treated
6242       as  if  it  were  {0,1}.  At run time, the rest of the pattern match is
6243       tried with and without the assertion, the order depending on the greed-
6244       iness of the quantifier.
6245
6246       (3)  If  the minimum repetition is greater than zero, the quantifier is
6247       ignored.  The assertion is obeyed just  once  when  encountered  during
6248       matching.
6249
6250   Lookahead assertions
6251
6252       Lookahead assertions start with (?= for positive assertions and (?! for
6253       negative assertions. For example,
6254
6255         \w+(?=;)
6256
6257       matches a word followed by a semicolon, but does not include the  semi-
6258       colon in the match, and
6259
6260         foo(?!bar)
6261
6262       matches  any  occurrence  of  "foo" that is not followed by "bar". Note
6263       that the apparently similar pattern
6264
6265         (?!foo)bar
6266
6267       does not find an occurrence of "bar"  that  is  preceded  by  something
6268       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
6269       the assertion (?!foo) is always true when the next three characters are
6270       "bar". A lookbehind assertion is needed to achieve the other effect.
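
       The three lookahead patterns above can be compared side by side. This
       is a minimal sketch that uses Python's re module as a stand-in for
       PCRE (an assumption; the variable names are illustrative only).

```python
import re

# \w+(?=;) : a word followed by a semicolon; the semicolon is not consumed.
word = re.search(r"\w+(?=;)", "int x;").group(0)

# foo(?!bar) : skips the "foo" of "foobar" and finds the "foo" of "foobaz".
foo_at = re.search(r"foo(?!bar)", "foobar foobaz").start()

# (?!foo)bar : still finds "bar" inside "foobar", because the assertion is
# tested at the "b", where the next three characters are "bar", not "foo".
bar_at = re.search(r"(?!foo)bar", "foobar").start()

print(word, foo_at, bar_at)
```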
6271
6272       If you want to force a matching failure at some point in a pattern, the
6273       most convenient way to do it is  with  (?!)  because  an  empty  string
6274       always  matches, so an assertion that requires there not to be an empty
6275       string must always fail.  The backtracking control verb (*FAIL) or (*F)
6276       is a synonym for (?!).
6277
6278   Lookbehind assertions
6279
6280       Lookbehind  assertions start with (?<= for positive assertions and (?<!
6281       for negative assertions. For example,
6282
6283         (?<!foo)bar
6284
6285       does find an occurrence of "bar" that is not  preceded  by  "foo".  The
6286       contents  of  a  lookbehind  assertion are restricted such that all the
6287       strings it matches must have a fixed length. However, if there are sev-
6288       eral  top-level  alternatives,  they  do  not all have to have the same
6289       fixed length. Thus
6290
6291         (?<=bullock|donkey)
6292
6293       is permitted, but
6294
6295         (?<!dogs?|cats?)
6296
6297       causes an error at compile time. Branches that match  different  length
6298       strings  are permitted only at the top level of a lookbehind assertion.
6299       This is an extension compared with Perl, which requires all branches to
6300       match the same length of string. An assertion such as
6301
6302         (?<=ab(c|de))
6303
6304       is  not  permitted,  because  its single top-level branch can match two
6305       different lengths, but it is acceptable to PCRE if rewritten to use two
6306       top-level branches:
6307
6308         (?<=abc|abde)
6309
6310       In  some  cases, the escape sequence \K (see above) can be used instead
6311       of a lookbehind assertion to get round the fixed-length restriction.
6312
6313       The implementation of lookbehind assertions is, for  each  alternative,
6314       to  temporarily  move the current position back by the fixed length and
6315       then try to match. If there are insufficient characters before the cur-
6316       rent position, the assertion fails.
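
       The effect of stepping back by a fixed length can be seen in a short
       sketch. Python's re module is used here as an assumed stand-in for
       PCRE. Note that Python is stricter than PCRE: it requires every
       alternative in a lookbehind to have the same length, so only
       single-length lookbehinds are shown.

```python
import re

not_after_foo = re.compile(r"(?<!foo)bar")
print(not_after_foo.search("foobar"))           # the only "bar" follows "foo"
print(not_after_foo.search("bazbar").start())   # "bar" at offset 3 qualifies

# Only two characters precede "d" here, so the three-character positive
# lookbehind cannot be satisfied.
after_abc = re.compile(r"(?<=abc)d")
print(after_abc.search("bcd"))
print(after_abc.search("abcd").start())
```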
6317
6318       In  a UTF mode, PCRE does not allow the \C escape (which matches a sin-
6319       gle data unit even in a UTF mode) to appear in  lookbehind  assertions,
6320       because  it  makes it impossible to calculate the length of the lookbe-
6321       hind. The \X and \R escapes, which can match different numbers of  data
6322       units, are also not permitted.
6323
6324       "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in
6325       lookbehinds, as long as the subpattern matches a  fixed-length  string.
6326       Recursion, however, is not supported.
6327
6328       Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
6329       assertions to specify efficient matching of fixed-length strings at the
6330       end of subject strings. Consider a simple pattern such as
6331
6332         abcd$
6333
6334       when  applied  to  a  long string that does not match. Because matching
6335       proceeds from left to right, PCRE will look for each "a" in the subject
6336       and  then  see  if what follows matches the rest of the pattern. If the
6337       pattern is specified as
6338
6339         ^.*abcd$
6340
6341       the initial .* matches the entire string at first, but when this  fails
6342       (because there is no following "a"), it backtracks to match all but the
6343       last character, then all but the last two characters, and so  on.  Once
6344       again  the search for "a" covers the entire string, from right to left,
6345       so we are no better off. However, if the pattern is written as
6346
6347         ^.*+(?<=abcd)
6348
6349       there can be no backtracking for the .*+ item; it can  match  only  the
6350       entire  string.  The subsequent lookbehind assertion does a single test
6351       on the last four characters. If it fails, the match fails  immediately.
6352       For  long  strings, this approach makes a significant difference to the
6353       processing time.
6354
6355   Using multiple assertions
6356
6357       Several assertions (of any sort) may occur in succession. For example,
6358
6359         (?<=\d{3})(?<!999)foo
6360
6361       matches "foo" preceded by three digits that are not "999". Notice  that
6362       each  of  the  assertions is applied independently at the same point in
6363       the subject string. First there is a  check  that  the  previous  three
6364       characters  are  all  digits,  and  then there is a check that the same
6365       three characters are not "999". This pattern does not match "foo"
6366       preceded by six characters, the first three of which are digits and the
6367       last three of which are not "999". For example, it does not match
6368       "123abcfoo". A pattern to do that is
6369
6370         (?<=\d{3}...)(?<!999)foo
6371
6372       This  time  the  first assertion looks at the preceding six characters,
6373       checking that the first three are digits, and then the second assertion
6374       checks that the preceding three characters are not "999".
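
       The difference between the two patterns shows up when both are run
       against the same subjects. The sketch below uses Python's re module as
       an assumed stand-in for PCRE; the variable names are illustrative.

```python
import re

three_digits = re.compile(r"(?<=\d{3})(?<!999)foo")
six_chars = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          three_digits.search(subject) is not None,
          six_chars.search(subject) is not None)
```

       Only "123foo" satisfies the first pattern, and only "123abcfoo"
       satisfies the second.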
6375
6376       Assertions can be nested in any combination. For example,
6377
6378         (?<=(?<!foo)bar)baz
6379
6380       matches  an occurrence of "baz" that is preceded by "bar" which in turn
6381       is not preceded by "foo", while
6382
6383         (?<=\d{3}(?!999)...)foo
6384
6385       is another pattern that matches "foo" preceded by three digits and  any
6386       three characters that are not "999".
6387
6388
6389CONDITIONAL SUBPATTERNS
6390
6391       It  is possible to cause the matching process to obey a subpattern con-
6392       ditionally or to choose between two alternative subpatterns,  depending
6393       on  the result of an assertion, or whether a specific capturing subpat-
6394       tern has already been matched. The two possible  forms  of  conditional
6395       subpattern are:
6396
6397         (?(condition)yes-pattern)
6398         (?(condition)yes-pattern|no-pattern)
6399
6400       If  the  condition is satisfied, the yes-pattern is used; otherwise the
6401       no-pattern (if present) is used. If there are more  than  two  alterna-
6402       tives  in  the subpattern, a compile-time error occurs. Each of the two
6403       alternatives may itself contain nested subpatterns of any form, includ-
6404       ing  conditional  subpatterns;  the  restriction  to  two  alternatives
6405       applies only at the level of the condition. This pattern fragment is an
6406       example where the alternatives are complex:
6407
6408         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
6409
6410
6411       There  are  four  kinds of condition: references to subpatterns, refer-
6412       ences to recursion, a pseudo-condition called DEFINE, and assertions.
6413
6414   Checking for a used subpattern by number
6415
6416       If the text between the parentheses consists of a sequence  of  digits,
6417       the condition is true if a capturing subpattern of that number has pre-
6418       viously matched. If there is more than one  capturing  subpattern  with
6419       the  same  number  (see  the earlier section about duplicate subpattern
6420       numbers), the condition is true if any of them have matched. An  alter-
6421       native  notation is to precede the digits with a plus or minus sign. In
6422       this case, the subpattern number is relative rather than absolute.  The
6423       most  recently opened parentheses can be referenced by (?(-1), the next
6424       most recent by (?(-2), and so on. Inside loops it can also  make  sense
6425       to refer to subsequent groups. The next parentheses to be opened can be
6426       referenced as (?(+1), and so on. (The value zero in any of these  forms
6427       is not used; it provokes a compile-time error.)
6428
6429       Consider  the  following  pattern, which contains non-significant white
6430       space to make it more readable (assume the PCRE_EXTENDED option) and to
6431       divide it into three parts for ease of discussion:
6432
6433         ( \( )?    [^()]+    (?(1) \) )
6434
6435       The  first  part  matches  an optional opening parenthesis, and if that
6436       character is present, sets it as the first captured substring. The sec-
6437       ond  part  matches one or more characters that are not parentheses. The
6438       third part is a conditional subpattern that tests whether  or  not  the
6439       first set of parentheses matched. If they did, that is, if the subject
6440       started with an opening parenthesis, the condition is true, and so  the
6441       yes-pattern  is  executed and a closing parenthesis is required. Other-
6442       wise, since no-pattern is not present, the subpattern matches  nothing.
6443       In  other  words,  this  pattern matches a sequence of non-parentheses,
6444       optionally enclosed in parentheses.
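
       The same pattern can be exercised as follows. This is a sketch that
       uses Python's re module as an assumed stand-in for PCRE, with
       re.VERBOSE playing the role of PCRE_EXTENDED.

```python
import re

# White space in the pattern is ignored because of re.VERBOSE.
optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, optional_parens.fullmatch(subject) is not None)
```

       The first two subjects match as a whole; the last two do not, because
       a closing parenthesis is required exactly when an opening one was
       captured.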
6445
6446       If you were embedding this pattern in a larger one,  you  could  use  a
6447       relative reference:
6448
6449         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
6450
6451       This  makes  the  fragment independent of the parentheses in the larger
6452       pattern.
6453
6454   Checking for a used subpattern by name
6455
6456       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
6457       used  subpattern  by  name.  For compatibility with earlier versions of
6458       PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
6459       also  recognized. However, there is a possible ambiguity with this syn-
6460       tax, because subpattern names may  consist  entirely  of  digits.  PCRE
6461       looks  first for a named subpattern; if it cannot find one and the name
6462       consists entirely of digits, PCRE looks for a subpattern of  that  num-
6463       ber,  which must be greater than zero. Using subpattern names that con-
6464       sist entirely of digits is not recommended.
6465
6466       Rewriting the above example to use a named subpattern gives this:
6467
6468         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
6469
6470       If the name used in a condition of this kind is a duplicate,  the  test
6471       is  applied to all subpatterns of the same name, and is true if any one
6472       of them has matched.
6473
6474   Checking for pattern recursion
6475
6476       If the condition is the string (R), and there is no subpattern with the
6477       name  R, the condition is true if a recursive call to the whole pattern
6478       or any subpattern has been made. If digits or a name preceded by amper-
6479       sand follow the letter R, for example:
6480
6481         (?(R3)...) or (?(R&name)...)
6482
6483       the condition is true if the most recent recursion is into a subpattern
6484       whose number or name is given. This condition does not check the entire
6485       recursion  stack.  If  the  name  used in a condition of this kind is a
6486       duplicate, the test is applied to all subpatterns of the same name, and
6487       is true if any one of them is the most recent recursion.
6488
6489       At  "top  level",  all  these recursion test conditions are false.  The
6490       syntax for recursive patterns is described below.
6491
6492   Defining subpatterns for use by reference only
6493
6494       If the condition is the string (DEFINE), and  there  is  no  subpattern
6495       with  the  name  DEFINE,  the  condition is always false. In this case,
6496       there may be only one alternative  in  the  subpattern.  It  is  always
6497       skipped  if  control  reaches  this  point  in the pattern; the idea of
6498       DEFINE is that it can be used to define subroutines that can be  refer-
6499       enced  from elsewhere. (The use of subroutines is described below.) For
6500       example, a pattern to match an IPv4 address  such  as  "192.168.23.245"
6501       could be written like this (ignore white space and line breaks):
6502
6503         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
6504         \b (?&byte) (\.(?&byte)){3} \b
6505
6506       The first part of the pattern is a DEFINE group inside which another
6507       group named "byte" is defined. This matches an individual component  of
6508       an  IPv4  address  (a number less than 256). When matching takes place,
6509       this part of the pattern is skipped because DEFINE acts  like  a  false
6510       condition.  The  rest of the pattern uses references to the named group
6511       to match the four dot-separated components of an IPv4 address,  insist-
6512       ing on a word boundary at each end.
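
       Python's re module has no DEFINE condition and no (?&name) calls, so
       the sketch below swaps in a different technique: the "byte" fragment
       is kept in a Python string and spliced into the pattern wherever the
       PCRE version calls (?&byte). The names BYTE, ipv4 and find_ipv4 are
       hypothetical. The define-once, use-several-times idea is the same, but
       this is not how PCRE processes the pattern.

```python
import re

# Stand-in for (?(DEFINE)(?<byte>...)): the fragment lives in a string.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Equivalent in spirit to: \b (?&byte) (\.(?&byte)){3} \b
ipv4 = re.compile(rf"\b{BYTE}(\.{BYTE}){{3}}\b")

def find_ipv4(text):
    """Return the first dotted-quad found in text, or None."""
    m = ipv4.search(text)
    return m.group(0) if m else None

print(find_ipv4("host 192.168.23.245 is up"))
print(find_ipv4("256.1.1.1"))   # 256 is not a valid component
```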
6513
6514   Assertion conditions
6515
6516       If  the  condition  is  not  in any of the above formats, it must be an
6517       assertion.  This may be a positive or negative lookahead or  lookbehind
6518       assertion.  Consider  this  pattern,  again  containing non-significant
6519       white space, and with the two alternatives on the second line:
6520
6521         (?(?=[^a-z]*[a-z])
6522         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
6523
6524       The condition  is  a  positive  lookahead  assertion  that  matches  an
6525       optional  sequence of non-letters followed by a letter. In other words,
6526       it tests for the presence of at least one letter in the subject.  If  a
6527       letter  is found, the subject is matched against the first alternative;
6528       otherwise it is  matched  against  the  second.  This  pattern  matches
6529       strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
6530       letters and dd are digits.
6531
6532
6533COMMENTS
6534
6535       There are two ways of including comments in patterns that are processed
6536       by PCRE. In both cases, the start of the comment must not be in a char-
6537       acter class, nor in the middle of any other sequence of related charac-
6538       ters  such  as  (?: or a subpattern name or number. The characters that
6539       make up a comment play no part in the pattern matching.
6540
6541       The sequence (?# marks the start of a comment that continues up to  the
6542       next  closing parenthesis. Nested parentheses are not permitted. If the
6543       PCRE_EXTENDED option is set, an unescaped # character also introduces a
6544       comment,  which  in  this  case continues to immediately after the next
6545       newline character or character sequence in the pattern.  Which  charac-
6546       ters are interpreted as newlines is controlled by the options passed to
6547       a compiling function or by a special sequence at the start of the  pat-
6548       tern, as described in the section entitled "Newline conventions" above.
6549       Note that the end of this type of comment is a literal newline sequence
6550       in  the pattern; escape sequences that happen to represent a newline do
6551       not count. For example, consider this  pattern  when  PCRE_EXTENDED  is
6552       set, and the default newline convention is in force:
6553
6554         abc #comment \n still comment
6555
6556       On  encountering  the  # character, pcre_compile() skips along, looking
6557       for a newline in the pattern. The sequence \n is still literal at  this
6558       stage,  so  it does not terminate the comment. Only an actual character
6559       with the code value 0x0a (the default newline) does so.
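
       The distinction between an escape sequence and a literal newline can
       be reproduced in a short sketch. Python's re module is used as an
       assumed stand-in for PCRE, with re.VERBOSE in place of PCRE_EXTENDED;
       for this case the two should behave in the same way.

```python
import re

# Raw string: the pattern holds a backslash and the letter n, not a newline,
# so everything after # is comment and the pattern is just "abc".
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: \n is a real newline (0x0a) in the pattern, which ends
# the comment, so "def" is part of the pattern.
literal = re.compile("abc #comment \n def", re.VERBOSE)

# The (?#...) form works without the extended option.
inline = re.compile(r"ab(?#note)c")

print(escaped.fullmatch("abc") is not None)
print(literal.fullmatch("abcdef") is not None)
print(inline.fullmatch("abc") is not None)
```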
6560
6561
6562RECURSIVE PATTERNS
6563
6564       Consider the problem of matching a string in parentheses, allowing  for
6565       unlimited  nested  parentheses.  Without the use of recursion, the best
6566       that can be done is to use a pattern that  matches  up  to  some  fixed
6567       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
6568       depth.
6569
6570       For some time, Perl has provided a facility that allows regular expres-
6571       sions  to recurse (amongst other things). It does this by interpolating
6572       Perl code in the expression at run time, and the code can refer to  the
6573       expression itself. A Perl pattern using code interpolation to solve the
6574       parentheses problem can be created like this:
6575
6576         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
6577
6578       The (?p{...}) item interpolates Perl code at run time, and in this case
6579       refers recursively to the pattern in which it appears.
6580
6581       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
6582       it supports special syntax for recursion of  the  entire  pattern,  and
6583       also  for  individual  subpattern  recursion. After its introduction in
6584       PCRE and Python, this kind of  recursion  was  subsequently  introduced
6585       into Perl at release 5.10.
6586
6587       A  special  item  that consists of (? followed by a number greater than
6588       zero and a closing parenthesis is a recursive subroutine  call  of  the
6589       subpattern  of  the  given  number, provided that it occurs inside that
6590       subpattern. (If not, it is a non-recursive subroutine  call,  which  is
6591       described  in  the  next  section.)  The special item (?R) or (?0) is a
6592       recursive call of the entire regular expression.
6593
6594       This PCRE pattern solves the nested  parentheses  problem  (assume  the
6595       PCRE_EXTENDED option is set so that white space is ignored):
6596
6597         \( ( [^()]++ | (?R) )* \)
6598
6599       First  it matches an opening parenthesis. Then it matches any number of
6600       substrings which can either be a  sequence  of  non-parentheses,  or  a
6601       recursive  match  of the pattern itself (that is, a correctly parenthe-
6602       sized substring).  Finally there is a closing parenthesis. Note the use
6603       of a possessive quantifier to avoid backtracking into sequences of non-
6604       parentheses.
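
       What the recursive pattern accepts can be spelled out as an ordinary
       recursive function. Python's re module has no recursion syntax, so the
       sketch below is a hand-written matcher and not a PCRE call; the name
       match_parens is hypothetical. It mirrors the three parts of the
       pattern: the opening parenthesis, the repeated choice between a run of
       non-parentheses and a recursive match, and the closing parenthesis.

```python
def match_parens(s, pos=0):
    """Return the index just past a balanced group starting at pos, or None."""
    if pos >= len(s) or s[pos] != "(":
        return None                      # \( must match first
    i = pos + 1
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            end = match_parens(s, i)     # plays the role of (?R)
            if end is None:
                return None
            i = end
        else:
            i += 1                       # one character of a [^()]++ run
    if i < len(s):                       # s[i] is the closing \)
        return i + 1
    return None

print(match_parens("(ab(cd)ef)"))
print(match_parens("(a(b)"))
```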
6605
6606       If this were part of a larger pattern, you would not  want  to  recurse
6607       the entire pattern, so instead you could use this:
6608
6609         ( \( ( [^()]++ | (?1) )* \) )
6610
6611       We  have  put the pattern into parentheses, and caused the recursion to
6612       refer to them instead of the whole pattern.
6613
6614       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
6615       tricky.  This is made easier by the use of relative references. Instead
6616       of (?1) in the pattern above you can write (?-2) to refer to the second
6617       most  recently  opened  parentheses  preceding  the recursion. In other
6618       words, a negative number counts capturing  parentheses  leftwards  from
6619       the point at which it is encountered.
6620
6621       It  is  also  possible  to refer to subsequently opened parentheses, by
6622       writing references such as (?+2). However, these  cannot  be  recursive
6623       because  the  reference  is  not inside the parentheses that are refer-
6624       enced. They are always non-recursive subroutine calls, as described  in
6625       the next section.
6626
6627       An  alternative  approach is to use named parentheses instead. The Perl
6628       syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
6629       supported. We could rewrite the above example as follows:
6630
6631         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
6632
6633       If  there  is more than one subpattern with the same name, the earliest
6634       one is used.
6635
6636       This particular example pattern that we have been looking  at  contains
6637       nested unlimited repeats, and so the use of a possessive quantifier for
6638       matching strings of non-parentheses is important when applying the pat-
6639       tern  to  strings  that do not match. For example, when this pattern is
6640       applied to
6641
6642         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
6643
6644       it yields "no match" quickly. However, if a  possessive  quantifier  is
6645       not  used, the match runs for a very long time indeed because there are
6646       so many different ways the + and * repeats can carve  up  the  subject,
6647       and all have to be tested before failure can be reported.
6648
6649       At  the  end  of a match, the values of capturing parentheses are those
6650       from the outermost level. If you want to obtain intermediate values,  a
6651       callout  function can be used (see below and the pcrecallout documenta-
6652       tion). If the pattern above is matched against
6653
6654         (ab(cd)ef)
6655
6656       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
6657       which  is the last value taken on at the top level. If a capturing sub-
6658       pattern is not matched at the top level, its final  captured  value  is
6659       unset,  even  if  it was (temporarily) set at a deeper level during the
6660       matching process.
6661
6662       If there are more than 15 capturing parentheses in a pattern, PCRE  has
6663       to  obtain extra memory to store data during a recursion, which it does
6664       by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
6665       can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
6666
6667       Do  not  confuse  the (?R) item with the condition (R), which tests for
6668       recursion.  Consider this pattern, which matches text in  angle  brack-
6669       ets,  allowing for arbitrary nesting. Only digits are allowed in nested
6670       brackets (that is, when recursing), whereas any characters are  permit-
6671       ted at the outer level.
6672
6673         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
6674
6675       In  this  pattern, (?(R) is the start of a conditional subpattern, with
6676       two different alternatives for the recursive and  non-recursive  cases.
6677       The (?R) item is the actual recursive call.
6678
6679   Differences in recursion processing between PCRE and Perl
6680
6681       Recursion  processing  in PCRE differs from Perl in two important ways.
6682       In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
6683       always treated as an atomic group. That is, once it has matched some of
6684       the subject string, it is never re-entered, even if it contains untried
6685       alternatives  and  there  is a subsequent matching failure. This can be
6686       illustrated by the following pattern, which purports to match a  palin-
6687       dromic  string  that contains an odd number of characters (for example,
6688       "a", "aba", "abcba", "abcdcba"):
6689
6690         ^(.|(.)(?1)\2)$
6691
6692       The idea is that it either matches a single character, or two identical
6693       characters  surrounding  a sub-palindrome. In Perl, this pattern works;
6694       in PCRE it does not if the subject is  longer  than  three  characters.
6695       Consider the subject string "abcba":
6696
6697       At  the  top level, the first character is matched, but as it is not at
6698       the end of the string, the first alternative fails; the second alterna-
6699       tive is taken and the recursion kicks in. The recursive call to subpat-
6700       tern 1 successfully matches the next character ("b").  (Note  that  the
6701       beginning and end of line tests are not part of the recursion).
6702
6703       Back  at  the top level, the next character ("c") is compared with what
6704       subpattern 2 matched, which was "a". This fails. Because the  recursion
6705       is  treated  as  an atomic group, there are now no backtracking points,
6706       and so the entire match fails. (Perl is able, at  this  point,  to  re-
6707       enter  the  recursion  and try the second alternative.) However, if the
6708       pattern is written with the alternatives in the other order, things are
6709       different:
6710
6711         ^((.)(?1)\2|.)$
6712
6713       This  time,  the recursing alternative is tried first, and continues to
6714       recurse until it runs out of characters, at which point  the  recursion
6715       fails.  But  this  time  we  do  have another alternative to try at the
6716       higher level. That is the big difference:  in  the  previous  case  the
6717       remaining alternative is at a deeper recursion level, which PCRE cannot
6718       use.
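
       The difference between an atomic and a re-enterable recursive call
       can be reproduced with a toy model. The Python sketch below does not
       use PCRE or Perl; it enumerates the end positions that group 1 can
       reach, and, when atomic is true, lets a recursive call expose only
       its first successful result. The names group1_ends and matches, and
       the two boolean parameters, are hypothetical.

```python
from typing import Iterator


def group1_ends(s: str, i: int, atomic: bool, recurse_first: bool) -> Iterator[int]:
    r"""Yield, in the order a backtracking matcher would try them, the end
    positions at which group 1 can match when it starts at index i.

    recurse_first=False models  (.|(.)(?1)\2)
    recurse_first=True  models  ((.)(?1)\2|.)
    """
    if i >= len(s):
        return  # both alternatives need at least one character

    def recursing_alternative() -> Iterator[int]:
        ch = s[i]                                   # (.) sets \2
        inner = group1_ends(s, i + 1, atomic, recurse_first)  # (?1)
        if atomic:
            # An atomic call exposes only its first successful result.
            first_end = next(inner, None)
            inner = iter(() if first_end is None else (first_end,))
        for j in inner:
            if j < len(s) and s[j] == ch:           # \2
                yield j + 1

    if recurse_first:
        yield from recursing_alternative()
        yield i + 1                                 # the "." alternative
    else:
        yield i + 1                                 # the "." alternative
        yield from recursing_alternative()


def matches(s: str, atomic: bool, recurse_first: bool) -> bool:
    """Model of ^( ... )$ : the top level may try every end position."""
    return any(end == len(s) for end in group1_ends(s, 0, atomic, recurse_first))


if __name__ == "__main__":
    for atomic in (True, False):
        for recurse_first in (False, True):
            print(atomic, recurse_first, matches("abcba", atomic, recurse_first))
```

       With atomic calls, the model rejects "abcba" when the single-character
       alternative comes first and accepts it when the recursing alternative
       comes first. With re-enterable calls it accepts "abcba" in either
       order.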
6719
6720       To change the pattern so that it matches all palindromic  strings,  not
6721       just  those  with an odd number of characters, it is tempting to change
6722       the pattern to this:
6723
6724         ^((.)(?1)\2|.?)$
6725
6726       Again, this works in Perl, but not in PCRE, and for  the  same  reason.
6727       When  a  deeper  recursion has matched a single character, it cannot be
6728       entered again in order to match an empty string.  The  solution  is  to
6729       separate  the two cases, and write out the odd and even cases as alter-
6730       natives at the higher level:
6731
6732         ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
6733
6734       If you want to match typical palindromic phrases, the  pattern  has  to
6735       ignore all non-word characters, which can be done like this:
6736
6737         ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
6738
6739       If run with the PCRE_CASELESS option, this pattern matches phrases such
6740       as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
6741       Perl.  Note the use of the possessive quantifier *+ to avoid backtrack-
6742       ing into sequences of non-word characters. Without this, PCRE  takes  a
6743       great  deal  longer  (ten  times or more) to match typical phrases, and
6744       Perl takes so long that you think it has gone into a loop.
6745
6746       WARNING: The palindrome-matching patterns above work only if  the  sub-
6747       ject  string  does not start with a palindrome that is shorter than the
6748       entire string.  For example, although "abcba" is correctly matched,  if
6749       the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,
6750       then fails at top level because the end of the string does not  follow.
6751       Once  again, it cannot jump back into the recursion to try other alter-
6752       natives, so the entire match fails.
6753
6754       The second way in which PCRE and Perl differ in  their  recursion  pro-
6755       cessing  is in the handling of captured values. In Perl, when a subpat-
6756       tern is called recursively or as a subroutine (see the  next  section),
6757       it  has  no  access to any values that were captured outside the recur-
6758       sion, whereas in PCRE these values can  be  referenced.  Consider  this
6759       pattern:
6760
6761         ^(.)(\1|a(?2))
6762
6763       In  PCRE,  this  pattern matches "bab". The first capturing parentheses
6764       match "b", then in the second group, when the back reference  \1  fails
6765       to  match "b", the second alternative matches "a" and then recurses. In
6766       the recursion, \1 does now match "b" and so the whole  match  succeeds.
6767       In  Perl,  the pattern fails to match because inside the recursive call
6768       \1 cannot access the externally set value.
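
       The two behaviours described above can be contrasted in a toy model.
       The Python sketch below is neither PCRE nor Perl code; the flag
       outer_visible is a hypothetical switch that decides whether the value
       of group 1 remains readable inside the recursive call, and an unset
       back reference is assumed simply to fail.

```python
from typing import Optional


def group2_end(s: str, i: int, cap1: Optional[str], outer_visible: bool) -> Optional[int]:
    r"""Toy model of group 2 in ^(.)(\1|a(?2)) starting at index i."""
    if cap1 is not None and s.startswith(cap1, i):
        return i + len(cap1)                  # first alternative: \1
    if s.startswith("a", i):                  # second alternative: a(?2)
        # Inside the call, group 1 is readable only if outer_visible is set.
        inner_cap1 = cap1 if outer_visible else None
        return group2_end(s, i + 1, inner_cap1, outer_visible)
    return None


def matches(s: str, outer_visible: bool) -> bool:
    """Model of the whole pattern: (.) takes s[0], then group 2 follows."""
    if not s:
        return False
    return group2_end(s, 1, s[0], outer_visible) is not None


if __name__ == "__main__":
    print(matches("bab", outer_visible=True))
    print(matches("bab", outer_visible=False))
```

       With outer_visible set, the model accepts "bab", as the text says
       PCRE does; with it cleared, the inner \1 has nothing to compare
       against and the model rejects the string.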
6769
6770
6771SUBPATTERNS AS SUBROUTINES
6772
6773       If the syntax for a recursive subpattern call (either by number  or  by
6774       name)  is  used outside the parentheses to which it refers, it operates
6775       like a subroutine in a programming language. The called subpattern  may
6776       be  defined  before or after the reference. A numbered reference can be
6777       absolute or relative, as in these examples:
6778
6779         (...(absolute)...)...(?2)...
6780         (...(relative)...)...(?-1)...
6781         (...(?+1)...(relative)...
6782
6783       An earlier example pointed out that the pattern
6784
6785         (sens|respons)e and \1ibility
6786
6787       matches "sense and sensibility" and "response and responsibility",  but
6788       not "sense and responsibility". If instead the pattern
6789
6790         (sens|respons)e and (?1)ibility
6791
6792       is  used, it does match "sense and responsibility" as well as the other
6793       two strings. Another example is  given  in  the  discussion  of  DEFINE
6794       above.
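
       The distinction between a back reference and a subroutine call can be
       tried out in any regex engine that supports back references. The
       sketch below uses Python's standard re module, which has no (?1)
       syntax; the subroutine call is therefore emulated by writing the
       group's pattern out again, which is an assumption of this example
       rather than PCRE behaviour. The names BACKREF, SUBROUTINE, PHRASES
       and report are hypothetical.

```python
import re

# A back reference must repeat the text that group 1 actually matched.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Emulation of (?1): the group's pattern is matched afresh, so either
# alternative is acceptable the second time.
SUBROUTINE = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

PHRASES = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]


def report():
    """Map each phrase to (back reference matches, subroutine matches)."""
    return {
        p: (bool(BACKREF.fullmatch(p)), bool(SUBROUTINE.fullmatch(p)))
        for p in PHRASES
    }


if __name__ == "__main__":
    for phrase, result in report().items():
        print(phrase, result)
```

       Only the third phrase distinguishes the two forms: the back reference
       rejects it, while the emulated subroutine call accepts it.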
6795
6796       All  subroutine  calls, whether recursive or not, are always treated as
6797       atomic groups. That is, once a subroutine has matched some of the  sub-
6798       ject string, it is never re-entered, even if it contains untried alter-
6799       natives and there is  a  subsequent  matching  failure.  Any  capturing
6800       parentheses  that  are  set  during the subroutine call revert to their
6801       previous values afterwards.
6802
6803       Processing options such as case-independence are fixed when  a  subpat-
6804       tern  is defined, so if it is used as a subroutine, such options cannot
6805       be changed for different calls. For example, consider this pattern:
6806
6807         (abc)(?i:(?-1))
6808
6809       It matches "abcabc". It does not match "abcABC" because the  change  of
6810       processing option does not affect the called subpattern.
6811
6812
6813ONIGURUMA SUBROUTINE SYNTAX
6814
6815       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
6816       name or a number enclosed either in angle brackets or single quotes, is
6817       an  alternative  syntax  for  referencing a subpattern as a subroutine,
6818       possibly recursively. Here are two of the examples used above,  rewrit-
6819       ten using this syntax:
6820
6821         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
6822         (sens|respons)e and \g'1'ibility
6823
6824       PCRE  supports  an extension to Oniguruma: if a number is preceded by a
6825       plus or a minus sign it is taken as a relative reference. For example:
6826
6827         (abc)(?i:\g<-1>)
6828
6829       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
6830       synonymous.  The former is a back reference; the latter is a subroutine
6831       call.
6832
6833
6834CALLOUTS
6835
6836       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
6837       Perl  code to be obeyed in the middle of matching a regular expression.
6838       This makes it possible, amongst other things, to extract different sub-
6839       strings that match the same pair of parentheses when there is a repeti-
6840       tion.
6841
6842       PCRE provides a similar feature, but of course it cannot obey arbitrary
6843       Perl code. The feature is called "callout". The caller of PCRE provides
6844       an external function by putting its entry point in the global  variable
6845       pcre_callout  (8-bit  library) or pcre[16|32]_callout (16-bit or 32-bit
6846       library).  By default, this variable contains NULL, which disables  all
6847       calling out.
6848
6849       Within  a  regular  expression,  (?C) indicates the points at which the
6850       external function is to be called. If you want  to  identify  different
6851       callout  points, you can put a number less than 256 after the letter C.
6852       The default value is zero.  For example, this pattern has  two  callout
6853       points:
6854
6855         (?C1)abc(?C2)def
6856
6857       If  the PCRE_AUTO_CALLOUT flag is passed to a compiling function, call-
6858       outs are automatically installed before each item in the pattern.  They
6859       are all numbered 255.
6860
6861       During  matching, when PCRE reaches a callout point, the external func-
6862       tion is called. It is provided with the  number  of  the  callout,  the
6863       position  in  the pattern, and, optionally, one item of data originally
6864       supplied by the caller of the matching function. The  callout  function
6865       may  cause  matching to proceed, to backtrack, or to fail altogether. A
6866       complete description of the interface to the callout function is  given
6867       in the pcrecallout documentation.
6868
6869
6870BACKTRACKING CONTROL
6871
6872       Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
6873       which are described in the Perl documentation as "experimental and sub-
6874       ject  to  change or removal in a future version of Perl". It goes on to
6875       say: "Their usage in production code should be noted to avoid  problems
6876       during upgrades." The same remarks apply to the PCRE features described
6877       in this section.
6878
6879       Since these verbs are specifically related  to  backtracking,  most  of
6880       them  can  be  used only when the pattern is to be matched using one of
6881       the traditional matching functions, which use a backtracking algorithm.
6882       With  the  exception  of (*FAIL), which behaves like a failing negative
6883       assertion, they cause an error if encountered by a DFA  matching  func-
6884       tion.
6885
6886       If  any of these verbs are used in an assertion or in a subpattern that
6887       is called as a subroutine (whether or not recursively), their effect is
6888       confined to that subpattern; it does not extend to the surrounding pat-
6889       tern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN)
6890       that  is  encountered in a successful positive assertion is passed back
6891       when a match succeeds (compare capturing  parentheses  in  assertions).
6892       Note that such subpatterns are processed as anchored at the point where
6893       they are tested. Note also that Perl's  treatment  of  subroutines  and
6894       assertions is different in some cases.
6895
6896       The  new verbs make use of what was previously invalid syntax: an open-
6897       ing parenthesis followed by an asterisk. They are generally of the form
6898       (*VERB)  or (*VERB:NAME). Some may take either form, with differing be-
6899       haviour, depending on whether or not an argument is present. A name  is
6900       any sequence of characters that does not include a closing parenthesis.
6901       The maximum length of name is 255 in the 8-bit library and 65535 in the
6902       16-bit and 32-bit libraries. If the name is empty, that is, if the clos-
6903       ing parenthesis immediately follows the colon, the effect is as if  the
6904       colon were not there. Any number of these verbs may occur in a pattern.
6905
6906   Optimizations that affect backtracking verbs
6907
6908       PCRE  contains some optimizations that are used to speed up matching by
6909       running some checks at the start of each match attempt. For example, it
6910       may  know  the minimum length of matching subject, or that a particular
6911       character must be present. When one of these  optimizations  suppresses
6912       the  running  of  a match, any included backtracking verbs will not, of
6913       course, be processed. You can suppress the start-of-match optimizations
6914       by  setting  the  PCRE_NO_START_OPTIMIZE  option when calling pcre_com-
6915       pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
6916       There is more discussion of this option in the section entitled "Option
6917       bits for pcre_exec()" in the pcreapi documentation.
6918
6919       Experiments with Perl suggest that it too  has  similar  optimizations,
6920       sometimes leading to anomalous results.
6921
6922   Verbs that act immediately
6923
6924       The  following  verbs act as soon as they are encountered. They may not
6925       be followed by a name.
6926
6927          (*ACCEPT)
6928
6929       This verb causes the match to end successfully, skipping the  remainder
6930       of  the pattern. However, when it is inside a subpattern that is called
6931       as a subroutine, only that subpattern is ended  successfully.  Matching
6932       then  continues  at  the  outer level. If (*ACCEPT) is inside capturing
6933       parentheses, the data so far is captured. For example:
6934
6935         A((?:A|B(*ACCEPT)|C)D)
6936
6937       This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
6938       tured by the outer parentheses.
6939
6940         (*FAIL) or (*F)
6941
6942       This  verb causes a matching failure, forcing backtracking to occur. It
6943       is equivalent to (?!) but easier to read. The Perl documentation  notes
6944       that  it  is  probably  useful only when combined with (?{}) or (??{}).
6945       Those are, of course, Perl features that are not present in  PCRE.  The
6946       nearest  equivalent is the callout feature, as for example in this pat-
6947       tern:
6948
6949         a+(?C)(*FAIL)
6950
6951       A match with the string "aaaa" always fails, but the callout  is  taken
6952       before each backtrack happens (in this example, 10 times).
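
       The figure of 10 can be checked by counting. The Python sketch below
       is a model of the backtracking, not a PCRE callout function; it
       assumes that a match attempt is made at every starting position and
       that no start-of-match optimization intervenes. The function name
       count_callouts is hypothetical.

```python
def count_callouts(subject: str) -> int:
    """Count how often (?C) is reached when a+(?C)(*FAIL) is applied,
    unanchored, to subject."""
    callouts = 0
    for start in range(len(subject)):
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one "a" at a time;
        # each length from run down to 1 reaches (?C) before (*FAIL).
        for _length in range(run, 0, -1):
            callouts += 1
    return callouts


if __name__ == "__main__":
    print(count_callouts("aaaa"))
```

       For "aaaa" the four starting positions contribute 4, 3, 2 and 1
       callouts respectively, giving 10 in total.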
6953
6954   Recording which path was taken
6955
6956       There  is  one  verb  whose  main  purpose  is to track how a match was
6957       arrived at, though it also has a  secondary  use  in  conjunction  with
6958       advancing the match starting point (see (*SKIP) below).
6959
6960         (*MARK:NAME) or (*:NAME)
6961
6962       A  name  is  always  required  with  this  verb.  There  may be as many
6963       instances of (*MARK) as you like in a pattern, and their names  do  not
6964       have to be unique.
6965
6966       When  a match succeeds, the name of the last-encountered (*MARK) on the
6967       matching path is passed back to the caller as described in the  section
6968       entitled  "Extra  data  for  pcre_exec()" in the pcreapi documentation.
6969       Here is an example of pcretest output, where the /K  modifier  requests
6970       the retrieval and outputting of (*MARK) data:
6971
6972           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
6973         data> XY
6974          0: XY
6975         MK: A
6976         XZ
6977          0: XZ
6978         MK: B
6979
6980       The (*MARK) name is tagged with "MK:" in this output, and in this exam-
6981       ple it indicates which of the two alternatives matched. This is a  more
6982       efficient  way of obtaining this information than putting each alterna-
6983       tive in its own capturing parentheses.
6984
6985       If (*MARK) is encountered in a positive assertion, its name is recorded
6986       and passed back if it is the last-encountered. This does not happen for
6987       negative assertions.
6988
6989       After a partial match or a failed match, the name of the  last  encoun-
6990       tered (*MARK) in the entire match process is returned. For example:
6991
6992           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
6993         data> XP
6994         No match, mark = B
6995
6996       Note  that  in  this  unanchored  example the mark is retained from the
6997       match attempt that started at the letter "X" in the subject. Subsequent
6998       match attempts starting at "P" and then with an empty string do not get
6999       as far as the (*MARK) item, but nevertheless do not reset it.
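
       The pcretest results above can be reproduced with a toy model. The
       Python sketch below is not PCRE code and handles only this one
       pattern; it overwrites a single mark variable whenever a (*MARK) is
       passed, which is adequate here because both alternatives contain a
       (*MARK). The names ALTERNATIVES and match_with_mark are hypothetical.

```python
from typing import List, Optional, Tuple

# Each alternative of X(*MARK:A)Y|X(*MARK:B)Z as (first char, mark, second char).
ALTERNATIVES: List[Tuple[str, str, str]] = [("X", "A", "Y"), ("X", "B", "Z")]


def match_with_mark(subject: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (matched text or None, last (*MARK) name passed or None)."""
    mark = None
    for start in range(len(subject) + 1):          # unanchored "bumpalong"
        for first, name, second in ALTERNATIVES:
            if not subject.startswith(first, start):
                continue                            # (*MARK) not reached
            mark = name                             # (*MARK:name) is passed
            if subject.startswith(second, start + 1):
                return subject[start:start + 2], mark
    return None, mark                               # failure keeps last mark


if __name__ == "__main__":
    for data in ("XY", "XZ", "XP"):
        print(data, match_with_mark(data))
```

       For "XP" the model passes (*MARK:A) and then (*MARK:B) during the
       attempt that starts at "X"; the later attempts never reach a (*MARK),
       so "B" is what remains when the overall failure is reported.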
7000
7001       If you are interested in  (*MARK)  values  after  failed  matches,  you
7002       should  probably  set  the PCRE_NO_START_OPTIMIZE option (see above) to
7003       ensure that the match is always attempted.
7004
7005   Verbs that act after backtracking
7006
7007       The following verbs do nothing when they are encountered. Matching con-
7008       tinues  with what follows, but if there is no subsequent match, causing
7009       a backtrack to the verb, a failure is  forced.  That  is,  backtracking
7010       cannot  pass  to the left of the verb. However, when one of these verbs
7011       appears inside an atomic group, its effect is confined to  that  group,
7012       because  once the group has been matched, there is never any backtrack-
7013       ing into it. In this situation, backtracking can  "jump  back"  to  the
7014       left  of the entire atomic group. (Remember also, as stated above, that
7015       this localization also applies in subroutine calls and assertions.)
7016
7017       These verbs differ in exactly what kind of failure  occurs  when  back-
7018       tracking reaches them.
7019
7020         (*COMMIT)
7021
7022       This  verb, which may not be followed by a name, causes the whole match
7023       to fail outright if the rest of the pattern does not match. Even if the
7024       pattern is unanchored, no further attempts to find a match by advancing
7025       the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,
7026       pcre_exec()  is  committed  to  finding a match at the current starting
7027       point, or not at all. For example:
7028
7029         a+(*COMMIT)b
7030
7031       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
7032       of dynamic anchor, or "I've started, so I must finish." The name of the
7033       most recently passed (*MARK) in the path is passed back when  (*COMMIT)
7034       forces a match failure.
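
       The "dynamic anchor" effect can be seen in a toy model of this one
       pattern. The Python sketch below is not PCRE code; it assumes that
       (*COMMIT) takes effect only once a+ has matched at the current
       starting position, and the function name match_commit is
       hypothetical.

```python
from typing import Optional, Tuple


def match_commit(subject: str) -> Optional[Tuple[int, int]]:
    """Toy model of the unanchored pattern a+(*COMMIT)b.

    Returns the (start, end) span of the match, or None.
    """
    for start in range(len(subject)):
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            continue            # a+ failed before (*COMMIT): try next start
        # (*COMMIT) has now been passed at position `end`.
        if subject.startswith("b", end):
            return start, end + 1
        return None             # backtracking onto (*COMMIT): fail outright
    return None


if __name__ == "__main__":
    print(match_commit("xxaab"))
    print(match_commit("aacaab"))
```

       In "xxaab" the first two attempts fail before reaching (*COMMIT), so
       the starting point can still advance. In "aacaab" the first attempt
       passes (*COMMIT) and then fails at "c", so the later "aab" is never
       tried.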
7035
7036       Note  that  (*COMMIT)  at  the start of a pattern is not the same as an
7037       anchor, unless PCRE's start-of-match optimizations are turned  off,  as
7038       shown in this pcretest example:
7039
7040           re> /(*COMMIT)abc/
7041         data> xyzabc
7042          0: abc
7043         xyzabc\Y
7044         No match
7045
7046       PCRE  knows  that  any  match  must start with "a", so the optimization
7047       skips along the subject to "a" before running the first match  attempt,
7048       which  succeeds.  When the optimization is disabled by the \Y escape in
7049       the second subject, the match starts at "x" and so the (*COMMIT) causes
7050       it to fail without trying any other starting points.
7051
7052         (*PRUNE) or (*PRUNE:NAME)
7053
7054       This  verb causes the match to fail at the current starting position in
7055       the subject if the rest of the pattern does not match. If  the  pattern
7056       is  unanchored,  the  normal  "bumpalong"  advance to the next starting
7057       character then happens. Backtracking can occur as usual to the left  of
7058       (*PRUNE),  before  it  is  reached,  or  when  matching to the right of
7059       (*PRUNE), but if there is no match to the  right,  backtracking  cannot
7060       cross  (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-
7061       native to an atomic group or possessive quantifier, but there are  some
7062       uses of (*PRUNE) that cannot be expressed in any other way.  The behav-
7063       iour of (*PRUNE:NAME)  is  the  same  as  (*MARK:NAME)(*PRUNE).  In  an
7064       anchored pattern (*PRUNE) has the same effect as (*COMMIT).
7065
7066         (*SKIP)
7067
7068       This  verb, when given without a name, is like (*PRUNE), except that if
7069       the pattern is unanchored, the "bumpalong" advance is not to  the  next
7070       character, but to the position in the subject where (*SKIP) was encoun-
7071       tered. (*SKIP) signifies that whatever text was matched leading  up  to
7072       it cannot be part of a successful match. Consider:
7073
7074         a+(*SKIP)b
7075
7076       If  the  subject  is  "aaaac...",  after  the first match attempt fails
7077       (starting at the first character in the  string),  the  starting  point
7078       skips on to start the next attempt at "c". Note that a possessive quan-
7079       tifier does not have the same effect as this example; although it would
7080       suppress  backtracking  during  the  first  match  attempt,  the second
7081       attempt would start at the second character instead of skipping  on  to
7082       "c".
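
       The difference in "bumpalong" behaviour can be made visible by
       recording the starting positions that are tried. The Python sketch
       below is a toy model of a+(*VERB)b, not PCRE code, and it ignores
       start-of-match optimizations. For this pattern the "PRUNE" setting
       also describes the starting positions tried with a possessive
       quantifier. The function name run_attempts is hypothetical.

```python
from typing import List, Optional, Tuple


def run_attempts(subject: str, verb: str) -> Tuple[List[int], Optional[Tuple[int, int]]]:
    """Toy model of a+(*VERB)b for verb in {"PRUNE", "SKIP"}.

    Returns (start positions tried, match span or None).
    """
    if verb not in ("PRUNE", "SKIP"):
        raise ValueError(verb)
    tried: List[int] = []
    start = 0
    while start < len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > start and subject.startswith("b", end):
            return tried, (start, end + 1)
        if end > start and verb == "SKIP":
            start = end         # resume where (*SKIP) was encountered
        else:
            start += 1          # normal bumpalong to the next character
    return tried, None


if __name__ == "__main__":
    for verb in ("PRUNE", "SKIP"):
        print(verb, run_attempts("aaaacaab", verb))
```

       On the subject "aaaacaab" the model tries positions 0, 1, 2, 3, 4 and
       5 with "PRUNE", but only 0, 4 and 5 with "SKIP"; both find the same
       match, "aab".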
7083
7084         (*SKIP:NAME)
7085
7086       When  (*SKIP) has an associated name, its behaviour is modified. If the
7087       following pattern fails to match, the previous path through the pattern
7088       is  searched for the most recent (*MARK) that has the same name. If one
7089       is found, the "bumpalong" advance is to the subject position that  cor-
7090       responds  to  that (*MARK) instead of to where (*SKIP) was encountered.
7091       If no (*MARK) with a matching name is found, the (*SKIP) is ignored.
7092
7093         (*THEN) or (*THEN:NAME)
7094
7095       This verb causes a skip to the next innermost alternative if  the  rest
7096       of  the  pattern does not match. That is, it cancels pending backtrack-
7097       ing, but only within the current alternative. Its name comes  from  the
7098       observation that it can be used for a pattern-based if-then-else block:
7099
7100         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
7101
7102       If  the COND1 pattern matches, FOO is tried (and possibly further items
7103       after the end of the group if FOO succeeds); on  failure,  the  matcher
7104       skips  to  the second alternative and tries COND2, without backtracking
7105       into COND1. The behaviour  of  (*THEN:NAME)  is  exactly  the  same  as
7106       (*MARK:NAME)(*THEN).   If (*THEN) is not inside an alternation, it acts
7107       like (*PRUNE).
7108
7109       Note that a subpattern that does not contain a | character  is  just  a
7110       part  of the enclosing alternative; it is not a nested alternation with
7111       only one alternative. The effect of (*THEN) extends beyond such a  sub-
7112       pattern  to  the enclosing alternative. Consider this pattern, where A,
7113       B, etc. are complex pattern fragments that do not contain any | charac-
7114       ters at this level:
7115
7116         A (B(*THEN)C) | D
7117
7118       If  A and B are matched, but there is a failure in C, matching does not
7119       backtrack into A; instead it moves to the next alternative, that is, D.
7120       However,  if the subpattern containing (*THEN) is given an alternative,
7121       it behaves differently:
7122
7123         A (B(*THEN)C | (*FAIL)) | D
7124
7125       The effect of (*THEN) is now confined to the inner subpattern. After  a
7126       failure in C, matching moves to (*FAIL), which causes the whole subpat-
7127       tern to fail because there are no more alternatives  to  try.  In  this
7128       case, matching does now backtrack into A.
7129
7130       Note also that a conditional subpattern is not considered as having two
7131       alternatives, because only one is ever used.  In  other  words,  the  |
7132       character in a conditional subpattern has a different meaning. Ignoring
7133       white space, consider:
7134
7135         ^.*? (?(?=a) a | b(*THEN)c )
7136
7137       If the subject is "ba", this pattern does not  match.  Because  .*?  is
7138       ungreedy,  it  initially  matches  zero characters. The condition (?=a)
7139       then fails, the character "b" is matched,  but  "c"  is  not.  At  this
7140       point,  matching does not backtrack to .*? as might perhaps be expected
7141       from the presence of the | character.  The  conditional  subpattern  is
7142       part of the single alternative that comprises the whole pattern, and so
7143       the match fails. (If there was a backtrack into  .*?,  allowing  it  to
7144       match "b", the match would succeed.)
7145
7146       The  verbs just described provide four different "strengths" of control
7147       when subsequent matching fails. (*THEN) is the weakest, carrying on the
7148       match  at  the next alternative. (*PRUNE) comes next, failing the match
7149       at the current starting position, but allowing an advance to  the  next
7150       character  (for an unanchored pattern). (*SKIP) is similar, except that
7151       the advance may be more than one character. (*COMMIT) is the strongest,
7152       causing the entire match to fail.
7153
7154       If more than one such verb is present in a pattern, the "strongest" one
7155       wins.  For example, consider this pattern, where A, B, etc. are complex
7156       pattern fragments:
7157
7158         (A(*COMMIT)B(*THEN)C|D)
7159
7160       Once  A  has  matched,  PCRE is committed to this match, at the current
7161       starting position. If subsequently B matches, but C does not, the  nor-
7162       mal (*THEN) action of trying the next alternative (that is, D) does not
7163       happen because (*COMMIT) overrides.
7164
7165
7166SEE ALSO
7167
7168       pcreapi(3), pcrecallout(3),  pcrematching(3),  pcresyntax(3),  pcre(3),
7169       pcre16(3), pcre32(3).
7170
7171
7172AUTHOR
7173
7174       Philip Hazel
7175       University Computing Service
7176       Cambridge CB2 3QH, England.
7177
7178
7179REVISION
7180
7181       Last updated: 11 November 2012
7182       Copyright (c) 1997-2012 University of Cambridge.
7183------------------------------------------------------------------------------
7184
7185
7186PCRESYNTAX(3)                                                    PCRESYNTAX(3)
7187
7188
7189NAME
7190       PCRE - Perl-compatible regular expressions
7191
7192
7193PCRE REGULAR EXPRESSION SYNTAX SUMMARY
7194
7195       The  full syntax and semantics of the regular expressions that are sup-
7196       ported by PCRE are described in  the  pcrepattern  documentation.  This
7197       document contains a quick-reference summary of the syntax.
7198
7199
7200QUOTING
7201
7202         \x         where x is non-alphanumeric is a literal x
7203         \Q...\E    treat enclosed characters as literal
7204
7205
7206CHARACTERS
7207
7208         \a         alarm, that is, the BEL character (hex 07)
7209         \cx        "control-x", where x is any ASCII character
7210         \e         escape (hex 1B)
7211         \f         form feed (hex 0C)
7212         \n         newline (hex 0A)
7213         \r         carriage return (hex 0D)
7214         \t         tab (hex 09)
7215         \ddd       character with octal code ddd, or backreference
7216         \xhh       character with hex code hh
7217         \x{hhh..}  character with hex code hhh..
7218
7219
7220CHARACTER TYPES
7221
7222         .          any character except newline;
7223                      in dotall mode, any character whatsoever
7224         \C         one data unit, even in UTF mode (best avoided)
7225         \d         a decimal digit
7226         \D         a character that is not a decimal digit
7227         \h         a horizontal white space character
7228         \H         a character that is not a horizontal white space character
7229         \N         a character that is not a newline
7230         \p{xx}     a character with the xx property
7231         \P{xx}     a character without the xx property
7232         \R         a newline sequence
7233         \s         a white space character
7234         \S         a character that is not a white space character
7235         \v         a vertical white space character
7236         \V         a character that is not a vertical white space character
7237         \w         a "word" character
7238         \W         a "non-word" character
7239         \X         a Unicode extended grapheme cluster
7240
7241       In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
7242       characters, even in a UTF mode. However, this can be changed by setting
7243       the PCRE_UCP option.
7244
7245
7246GENERAL CATEGORY PROPERTIES FOR \p and \P
7247
7248         C          Other
7249         Cc         Control
7250         Cf         Format
7251         Cn         Unassigned
7252         Co         Private use
7253         Cs         Surrogate
7254
7255         L          Letter
7256         Ll         Lower case letter
7257         Lm         Modifier letter
7258         Lo         Other letter
7259         Lt         Title case letter
7260         Lu         Upper case letter
7261         L&         Ll, Lu, or Lt
7262
7263         M          Mark
7264         Mc         Spacing mark
7265         Me         Enclosing mark
7266         Mn         Non-spacing mark
7267
7268         N          Number
7269         Nd         Decimal number
7270         Nl         Letter number
7271         No         Other number
7272
7273         P          Punctuation
7274         Pc         Connector punctuation
7275         Pd         Dash punctuation
7276         Pe         Close punctuation
7277         Pf         Final punctuation
7278         Pi         Initial punctuation
7279         Po         Other punctuation
7280         Ps         Open punctuation
7281
7282         S          Symbol
7283         Sc         Currency symbol
7284         Sk         Modifier symbol
7285         Sm         Mathematical symbol
7286         So         Other symbol
7287
7288         Z          Separator
7289         Zl         Line separator
7290         Zp         Paragraph separator
7291         Zs         Space separator
7292
7293
7294PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P
7295
7296         Xan        Alphanumeric: union of properties L and N
7297         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
7298         Xsp        Perl space: property Z or tab, NL, FF, CR
7299         Xwd        Perl word: property Xan or underscore
7300
7301
7302SCRIPT NAMES FOR \p AND \P
7303
7304       Arabic,  Armenian,  Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
7305       Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Chakma,
7306       Cham,  Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
7307       Devanagari,  Egyptian_Hieroglyphs,  Ethiopic,   Georgian,   Glagolitic,
7308       Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
7309       gana,  Imperial_Aramaic,  Inherited,  Inscriptional_Pahlavi,   Inscrip-
7310       tional_Parthian,   Javanese,   Kaithi,   Kannada,  Katakana,  Kayah_Li,
7311       Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B,  Lisu,  Lycian,
7312       Lydian,    Malayalam,    Mandaic,    Meetei_Mayek,    Meroitic_Cursive,
7313       Meroitic_Hieroglyphs,  Miao,  Mongolian,  Myanmar,  New_Tai_Lue,   Nko,
7314       Ogham,    Old_Italic,   Old_Persian,   Old_South_Arabian,   Old_Turkic,
7315       Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,  Samari-
7316       tan,  Saurashtra,  Sharada,  Shavian, Sinhala, Sora_Sompeng, Sundanese,
7317       Syloti_Nagri, Syriac, Tagalog, Tagbanwa,  Tai_Le,  Tai_Tham,  Tai_Viet,
7318       Takri,  Tamil,  Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,
7319       Yi.
7320
7321
7322CHARACTER CLASSES
7323
7324         [...]       positive character class
7325         [^...]      negative character class
7326         [x-y]       range (can be used for hex characters)
7327         [[:xxx:]]   positive POSIX named set
7328         [[:^xxx:]]  negative POSIX named set
7329
7330         alnum       alphanumeric
7331         alpha       alphabetic
7332         ascii       0-127
7333         blank       space or tab
7334         cntrl       control character
7335         digit       decimal digit
7336         graph       printing, excluding space
7337         lower       lower case letter
7338         print       printing, including space
7339         punct       printing, excluding alphanumeric
7340         space       white space
7341         upper       upper case letter
7342         word        same as \w
7343         xdigit      hexadecimal digit
7344
7345       In PCRE, POSIX character set names recognize only ASCII  characters  by
7346       default,  but  some  of them use Unicode properties if PCRE_UCP is set.
7347       You can use \Q...\E inside a character class.
7348
7349
7350QUANTIFIERS
7351
7352         ?           0 or 1, greedy
7353         ?+          0 or 1, possessive
7354         ??          0 or 1, lazy
7355         *           0 or more, greedy
7356         *+          0 or more, possessive
7357         *?          0 or more, lazy
7358         +           1 or more, greedy
7359         ++          1 or more, possessive
7360         +?          1 or more, lazy
7361         {n}         exactly n
7362         {n,m}       at least n, no more than m, greedy
7363         {n,m}+      at least n, no more than m, possessive
7364         {n,m}?      at least n, no more than m, lazy
7365         {n,}        n or more, greedy
7366         {n,}+       n or more, possessive
7367         {n,}?       n or more, lazy
7368
7369
7370ANCHORS AND SIMPLE ASSERTIONS
7371
7372         \b          word boundary
7373         \B          not a word boundary
7374         ^           start of subject
7375                      also after internal newline in multiline mode
7376         \A          start of subject
7377         $           end of subject
7378                      also before newline at end of subject
7379                      also before internal newline in multiline mode
7380         \Z          end of subject
7381                      also before newline at end of subject
7382         \z          end of subject
7383         \G          first matching position in subject
7384
7385
7386MATCH POINT RESET
7387
7388         \K          reset start of match
7389
7390
7391ALTERNATION
7392
7393         expr|expr|expr...
7394
7395
7396CAPTURING
7397
7398         (...)           capturing group
7399         (?<name>...)    named capturing group (Perl)
7400         (?'name'...)    named capturing group (Perl)
7401         (?P<name>...)   named capturing group (Python)
7402         (?:...)         non-capturing group
7403         (?|...)         non-capturing group; reset group numbers for
7404                          capturing groups in each alternative
7405
7406
7407ATOMIC GROUPS
7408
7409         (?>...)         atomic, non-capturing group
7410
7411
7412COMMENT
7413
7414         (?#....)        comment (not nestable)
7415
7416
7417OPTION SETTING
7418
7419         (?i)            caseless
7420         (?J)            allow duplicate names
7421         (?m)            multiline
7422         (?s)            single line (dotall)
7423         (?U)            default ungreedy (lazy)
7424         (?x)            extended (ignore white space)
7425         (?-...)         unset option(s)
7426
7427       The following are recognized only at the start of a  pattern  or  after
7428       one of the newline-setting options with similar syntax:
7429
7430         (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
7431         (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
7432         (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
7433         (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
7434         (*UTF)          set appropriate UTF mode for the library in use
7435         (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
7436
7437
7438LOOKAHEAD AND LOOKBEHIND ASSERTIONS
7439
7440         (?=...)         positive look ahead
7441         (?!...)         negative look ahead
7442         (?<=...)        positive look behind
7443         (?<!...)        negative look behind
7444
7445       Each top-level branch of a look behind must be of a fixed length.
7446
7447
7448BACKREFERENCES
7449
7450         \n              reference by number (can be ambiguous)
7451         \gn             reference by number
7452         \g{n}           reference by number
7453         \g{-n}          relative reference by number
7454         \k<name>        reference by name (Perl)
7455         \k'name'        reference by name (Perl)
7456         \g{name}        reference by name (Perl)
7457         \k{name}        reference by name (.NET)
7458         (?P=name)       reference by name (Python)
7459
7460
7461SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
7462
7463         (?R)            recurse whole pattern
7464         (?n)            call subpattern by absolute number
7465         (?+n)           call subpattern by relative number
7466         (?-n)           call subpattern by relative number
7467         (?&name)        call subpattern by name (Perl)
7468         (?P>name)       call subpattern by name (Python)
7469         \g<name>        call subpattern by name (Oniguruma)
7470         \g'name'        call subpattern by name (Oniguruma)
7471         \g<n>           call subpattern by absolute number (Oniguruma)
7472         \g'n'           call subpattern by absolute number (Oniguruma)
7473         \g<+n>          call subpattern by relative number (PCRE extension)
7474         \g'+n'          call subpattern by relative number (PCRE extension)
7475         \g<-n>          call subpattern by relative number (PCRE extension)
7476         \g'-n'          call subpattern by relative number (PCRE extension)
7477
7478
7479CONDITIONAL PATTERNS
7480
7481         (?(condition)yes-pattern)
7482         (?(condition)yes-pattern|no-pattern)
7483
7484         (?(n)...        absolute reference condition
7485         (?(+n)...       relative reference condition
7486         (?(-n)...       relative reference condition
7487         (?(<name>)...   named reference condition (Perl)
7488         (?('name')...   named reference condition (Perl)
7489         (?(name)...     named reference condition (PCRE)
7490         (?(R)...        overall recursion condition
7491         (?(Rn)...       specific group recursion condition
7492         (?(R&name)...   specific recursion condition
7493         (?(DEFINE)...   define subpattern for reference
7494         (?(assert)...   assertion condition
7495
7496
7497BACKTRACKING CONTROL
7498
       The following act as soon as they are reached:
7500
7501         (*ACCEPT)       force successful match
7502         (*FAIL)         force backtrack; synonym (*F)
7503         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
7504
7505       The  following  act only when a subsequent match failure causes a back-
7506       track to reach them. They all force a match failure, but they differ in
7507       what happens afterwards. Those that advance the start-of-match point do
7508       so only if the pattern is not anchored.
7509
7510         (*COMMIT)       overall failure, no advance of starting point
7511         (*PRUNE)        advance to next starting character
7512         (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
7513         (*SKIP)         advance to current matching position
7514         (*SKIP:NAME)    advance to position corresponding to an earlier
7515                         (*MARK:NAME); if not found, the (*SKIP) is ignored
7516         (*THEN)         local failure, backtrack to next alternation
7517         (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
7518
7519
7520NEWLINE CONVENTIONS
7521
7522       These are recognized only at the very start of the pattern or  after  a
7523       (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
7524
7525         (*CR)           carriage return only
7526         (*LF)           linefeed only
7527         (*CRLF)         carriage return followed by linefeed
7528         (*ANYCRLF)      all three of the above
7529         (*ANY)          any Unicode newline sequence
7530
7531
7532WHAT \R MATCHES
7533
7534       These  are  recognized only at the very start of the pattern or after a
7535       (*...) option that sets the newline convention or a UTF or UCP mode.
7536
7537         (*BSR_ANYCRLF)  CR, LF, or CRLF
7538         (*BSR_UNICODE)  any Unicode newline sequence
7539
7540
7541CALLOUTS
7542
7543         (?C)      callout
7544         (?Cn)     callout with data n
7545
7546
7547SEE ALSO
7548
7549       pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
7550
7551
7552AUTHOR
7553
7554       Philip Hazel
7555       University Computing Service
7556       Cambridge CB2 3QH, England.
7557
7558
7559REVISION
7560
7561       Last updated: 11 November 2012
7562       Copyright (c) 1997-2012 University of Cambridge.
7563------------------------------------------------------------------------------
7564
7565
7566PCREUNICODE(3)                                                  PCREUNICODE(3)
7567
7568
7569NAME
7570       PCRE - Perl-compatible regular expressions
7571
7572
7573UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT
7574
7575       As well as UTF-8 support, PCRE also supports UTF-16 (from release 8.30)
7576       and UTF-32 (from release 8.32), by means of two  additional  libraries.
7577       They can be built as well as, or instead of, the 8-bit library.
7578
7579
7580UTF-8 SUPPORT
7581
       In order to process UTF-8 strings, you must build PCRE's 8-bit library
       with UTF support, and, in addition, you must call pcre_compile() with
       the PCRE_UTF8 option flag, or the pattern must start with the sequence
       (*UTF8) or (*UTF). When either of these is the case, both the pattern
       and any subject strings that are matched against it are treated as
       UTF-8 strings instead of strings of individual 1-byte characters.
7588
7589
7590UTF-16 AND UTF-32 SUPPORT
7591
       In order to process UTF-16 or UTF-32 strings, you must build PCRE's
       16-bit or 32-bit library with UTF support, and, in addition, you must
       call pcre16_compile() or pcre32_compile() with the PCRE_UTF16 or
       PCRE_UTF32 option flag, as appropriate. Alternatively, the pattern must
       start with the sequence (*UTF16) or (*UTF32), as appropriate, or
       (*UTF), which can be used with either library. When UTF mode is set,
       both the pattern and any subject strings that are matched against it
       are treated as UTF-16 or UTF-32 strings instead of strings of
       individual 16-bit or 32-bit characters.
7601
7602
7603UTF SUPPORT OVERHEAD
7604
7605       If you compile PCRE with UTF support, but do not use it  at  run  time,
7606       the  library will be a bit bigger, but the additional run time overhead
7607       is limited to  testing  the  PCRE_UTF[8|16|32]  flag  occasionally,  so
7608       should not be very big.
7609
7610
7611UNICODE PROPERTY SUPPORT
7612
       If PCRE is built with Unicode character property support (which implies
       UTF support), the escape sequences \p{..}, \P{..}, and \X can be used.
       The available properties that can be tested are limited to the general
       category properties such as Lu for an upper case letter or Nd for a
       decimal number, the Unicode script names such as Arabic or Han, and the
       derived properties Any and L&. Full lists are given in the pcrepattern
       and pcresyntax documentation. Only the short names for properties are
       supported. For example, \p{L} matches a letter. Its Perl synonym,
       \p{Letter}, is not supported. Furthermore, in Perl, many properties
       may optionally be prefixed by "Is", for compatibility with Perl 5.6.
       PCRE does not support this.
7624
7625   Validity of UTF-8 strings
7626
       When you set the PCRE_UTF8 flag, the byte strings passed as patterns
       and subjects are (by default) checked for validity on entry to the
       relevant functions. The entire string is checked before any other
       processing takes place. From release 7.3 of PCRE, the check is
       according to the rules of RFC 3629, which are themselves derived from
       the Unicode specification. Earlier releases of PCRE followed the rules
       of RFC 2279, which allows the full range of 31-bit values (0 to
       0x7FFFFFFF). The current check allows only values in the range U+0 to
       U+10FFFF, excluding the surrogate area and the non-characters.
7636
7637       Characters  in  the "Surrogate Area" of Unicode are reserved for use by
7638       UTF-16, where they are used in pairs to encode codepoints  with  values
7639       greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
7640       are available independently in the  UTF-8  and  UTF-32  encodings.  (In
7641       other  words,  the  whole  surrogate  thing is a fudge for UTF-16 which
7642       unfortunately messes up UTF-8 and UTF-32.)
7643
7644       Also excluded are the "Non-Character" code points, which are U+FDD0  to
7645       U+FDEF  and  the  last  two  code  points  in  each plane, U+??FFFE and
7646       U+??FFFF.
7647
7648       If an invalid UTF-8 string is passed to PCRE, an error return is given.
7649       At  compile  time, the only additional information is the offset to the
7650       first byte of the failing character. The run-time functions pcre_exec()
7651       and  pcre_dfa_exec() also pass back this information, as well as a more
7652       detailed reason code if the caller has provided memory in which  to  do
7653       this.
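
       The checks described above can be sketched in C. This is only an
       illustration of the stated rules (RFC 3629 form, no surrogates, no
       non-characters), not PCRE's own validation code; the names
       cp_is_allowed() and utf8_first_invalid() are invented for this example.
       Like the compile-time error, the sketch reports the offset of the first
       byte of the failing character.

```c
#include <stddef.h>
#include <stdint.h>

/* Code points rejected by the rules stated in the text: anything beyond
   U+10FFFF, the surrogate area, and the non-characters. */
static int cp_is_allowed(uint32_t cp)
{
    if (cp > 0x10FFFF) return 0;                 /* outside Unicode */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate area */
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return 0;  /* non-characters */
    if ((cp & 0xFFFE) == 0xFFFE) return 0;       /* U+??FFFE and U+??FFFF */
    return 1;
}

/* Returns -1 if s[0..len) passes the check, otherwise the offset of the
   first byte of the first failing character. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t extra;      /* continuation bytes expected after the lead byte */
        uint32_t cp, min;  /* decoded value; smallest value that is not overlong */

        if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
        else return (long)i;  /* stray continuation byte or invalid lead byte */

        if (extra > len - i - 1) return (long)i;  /* character is truncated */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (cp < min || !cp_is_allowed(cp)) return (long)i;
        i += extra + 1;
    }
    return -1;
}
```

       For the three bytes 0x61 0xC0 0x80 the function returns 1, because
       0xC0 0x80 is an overlong encoding of U+0000.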
7654
7655       In  some  situations, you may already know that your strings are valid,
7656       and therefore want to skip these checks in  order  to  improve  perfor-
7657       mance,  for  example in the case of a long subject string that is being
7658       scanned repeatedly.  If you set the PCRE_NO_UTF8_CHECK flag at  compile
7659       time  or  at  run  time, PCRE assumes that the pattern or subject it is
7660       given (respectively) contains only valid UTF-8 codes. In this case,  it
7661       does not diagnose an invalid UTF-8 string.
7662
7663       Note  that  passing  PCRE_NO_UTF8_CHECK to pcre_compile() just disables
7664       the check for the pattern; it does not also apply to  subject  strings.
7665       If  you  want  to  disable the check for a subject string you must pass
7666       this option to pcre_exec() or pcre_dfa_exec().
7667
7668       If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, the
7669       result is undefined and your program may crash.
7670
7671   Validity of UTF-16 strings
7672
7673       When you set the PCRE_UTF16 flag, the strings of 16-bit data units that
7674       are passed as patterns and subjects are (by default) checked for valid-
7675       ity  on entry to the relevant functions. Values other than those in the
7676       surrogate range U+D800 to U+DFFF are independent code points. Values in
7677       the surrogate range must be used in pairs in the correct manner.
7678
7679       Excluded  are  the  "Non-Character"  code  points,  which are U+FDD0 to
7680       U+FDEF and the last  two  code  points  in  each  plane,  U+??FFFE  and
7681       U+??FFFF.
7682
7683       If  an  invalid  UTF-16  string  is  passed to PCRE, an error return is
7684       given. At compile time, the only additional information is  the  offset
7685       to the first data unit of the failing character. The run-time functions
7686       pcre16_exec() and pcre16_dfa_exec() also pass back this information, as
7687       well  as  a more detailed reason code if the caller has provided memory
7688       in which to do this.
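
       The pairing rule for surrogates can be sketched as follows. This is an
       illustration of the rules stated above, not PCRE's own code, and
       utf16_first_invalid() is a name invented for this example. It returns
       the offset, in 16-bit data units, of the first unit of the failing
       character.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns -1 if s[0..len) passes the check, otherwise the offset (in 16-bit
   data units) of the first unit of the first failing character. */
long utf16_first_invalid(const uint16_t *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint32_t cp = s[i];
        size_t units = 1;

        if (cp >= 0xDC00 && cp <= 0xDFFF)
            return (long)i;                  /* low surrogate with no high one before it */

        if (cp >= 0xD800 && cp <= 0xDBFF) {  /* high surrogate: needs a partner */
            if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return (long)i;
            /* Combine the pair into a code point above U+FFFF. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(s[i + 1] - 0xDC00);
            units = 2;
        }

        if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
            return (long)i;                  /* non-character */

        i += units;
    }
    return -1;
}
```

       A lone high surrogate at the end of the string, such as the two units
       0x0041 0xD83D, yields offset 1.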
7689
7690       In some situations, you may already know that your strings  are  valid,
7691       and  therefore  want  to  skip these checks in order to improve perfor-
7692       mance. If you set the PCRE_NO_UTF16_CHECK flag at compile  time  or  at
7693       run time, PCRE assumes that the pattern or subject it is given (respec-
7694       tively) contains only valid UTF-16 sequences. In this case, it does not
7695       diagnose  an  invalid  UTF-16 string.  However, if an invalid string is
7696       passed, the result is undefined.
7697
7698   Validity of UTF-32 strings
7699
       When you set the PCRE_UTF32 flag, the strings of 32-bit data units that
       are passed as patterns and subjects are (by default) checked for
       validity on entry to the relevant functions. This check allows only
       values in the range U+0 to U+10FFFF, excluding the surrogate area
       U+D800 to U+DFFF, and the "Non-Character" code points, which are U+FDD0
       to U+FDEF and the last two code points in each plane, U+??FFFE and
       U+??FFFF.
7706
7707       If  an  invalid  UTF-32  string  is  passed to PCRE, an error return is
7708       given. At compile time, the only additional information is  the  offset
7709       to the first data unit of the failing character. The run-time functions
7710       pcre32_exec() and pcre32_dfa_exec() also pass back this information, as
7711       well  as  a more detailed reason code if the caller has provided memory
7712       in which to do this.
7713
7714       In some situations, you may already know that your strings  are  valid,
7715       and  therefore  want  to  skip these checks in order to improve perfor-
7716       mance. If you set the PCRE_NO_UTF32_CHECK flag at compile  time  or  at
7717       run time, PCRE assumes that the pattern or subject it is given (respec-
7718       tively) contains only valid UTF-32 sequences. In this case, it does not
7719       diagnose  an  invalid  UTF-32 string.  However, if an invalid string is
7720       passed, the result is undefined.
7721
7722   General comments about UTF modes
7723
7724       1. Codepoints less than 256 can be  specified  in  patterns  by  either
7725       braced or unbraced hexadecimal escape sequences (for example, \x{b3} or
7726       \xb3). Larger values have to use braced sequences.
7727
7728       2. Octal numbers up to \777 are recognized,  and  in  UTF-8  mode  they
7729       match two-byte characters for values greater than \177.
7730
7731       3. Repeat quantifiers apply to complete UTF characters, not to individ-
7732       ual data units, for example: \x{100}{3}.
7733
7734       4. The dot metacharacter matches one UTF character instead of a  single
7735       data unit.
7736
7737       5.  The  escape sequence \C can be used to match a single byte in UTF-8
7738       mode, or a single 16-bit data unit in UTF-16 mode, or a  single  32-bit
7739       data  unit in UTF-32 mode, but its use can lead to some strange effects
7740       because it breaks up multi-unit characters (see the description  of  \C
7741       in  the  pcrepattern  documentation). The use of \C is not supported in
7742       the alternative matching function  pcre[16|32]_dfa_exec(),  nor  is  it
7743       supported in UTF mode by the JIT optimization of pcre[16|32]_exec(). If
7744       JIT optimization is requested for a UTF pattern that  contains  \C,  it
7745       will not succeed, and so the matching will be carried out by the normal
7746       interpretive function.
7747
7748       6. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
7749       test characters of any code value, but, by default, the characters that
7750       PCRE recognizes as digits, spaces, or word characters remain  the  same
7751       set  as  in  non-UTF  mode, all with values less than 256. This remains
7752       true even when PCRE is  built  to  include  Unicode  property  support,
7753       because to do otherwise would slow down PCRE in many common cases. Note
7754       in particular that this applies to \b and \B, because they are  defined
7755       in terms of \w and \W. If you really want to test for a wider sense of,
7756       say, "digit", you can use  explicit  Unicode  property  tests  such  as
7757       \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the
7758       character escapes work is changed so that Unicode properties  are  used
7759       to determine which characters match. There are more details in the sec-
7760       tion on generic character types in the pcrepattern documentation.
7761
7762       7. Similarly, characters that match the POSIX named  character  classes
7763       are all low-valued characters, unless the PCRE_UCP option is set.
7764
7765       8.  However,  the  horizontal and vertical white space matching escapes
7766       (\h, \H, \v, and \V) do match all the appropriate  Unicode  characters,
7767       whether or not PCRE_UCP is set.
7768
7769       9.  Case-insensitive  matching  applies only to characters whose values
7770       are less than 128, unless PCRE is built with Unicode property  support.
7771       A  few  Unicode characters such as Greek sigma have more than two code-
7772       points that are case-equivalent. Up to and including PCRE release 8.31,
7773       only  one-to-one case mappings were supported, but later releases (with
7774       Unicode property support) do treat as case-equivalent all  versions  of
7775       characters such as Greek sigma.
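
       Several of the points above turn on the difference between a character
       and the data units that encode it. As an illustration for UTF-8 mode,
       the following sketch shows how many bytes a code point occupies. The
       name utf8_encode() is invented for this example; it is not a PCRE
       function.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes cp as UTF-8 into out and returns the number of bytes written,
   or 0 if cp is beyond U+10FFFF. No surrogate or non-character check is
   made here; this sketch is only about encoded length. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                 /* up to \177: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* \200 to \3777: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {              /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {            /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

       Under this encoding \177 is one byte, while \200, \777 (U+01FF) and
       \x{100} are each two bytes, so \x{100}{3} corresponds to six bytes of
       subject data.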
7776
7777
7778AUTHOR
7779
7780       Philip Hazel
7781       University Computing Service
7782       Cambridge CB2 3QH, England.
7783
7784
7785REVISION
7786
7787       Last updated: 11 November 2012
7788       Copyright (c) 1997-2012 University of Cambridge.
7789------------------------------------------------------------------------------
7790
7791
7792PCREJIT(3)                                                          PCREJIT(3)
7793
7794
7795NAME
7796       PCRE - Perl-compatible regular expressions
7797
7798
7799PCRE JUST-IN-TIME COMPILER SUPPORT
7800
7801       Just-in-time  compiling  is a heavyweight optimization that can greatly
7802       speed up pattern matching. However, it comes at the cost of extra  pro-
7803       cessing before the match is performed. Therefore, it is of most benefit
7804       when the same pattern is going to be matched many times. This does  not
7805       necessarily  mean  many calls of a matching function; if the pattern is
7806       not anchored, matching attempts may take place many  times  at  various
7807       positions  in  the  subject, even for a single call.  Therefore, if the
7808       subject string is very long, it may still pay to use  JIT  for  one-off
7809       matches.
7810
7811       JIT  support  applies  only to the traditional Perl-compatible matching
7812       function.  It does not apply when the DFA matching  function  is  being
7813       used. The code for this support was written by Zoltan Herczeg.
7814
7815
78168-BIT, 16-BIT AND 32-BIT SUPPORT
7817
7818       JIT  support  is available for all of the 8-bit, 16-bit and 32-bit PCRE
7819       libraries. To keep this documentation simple, only the 8-bit  interface
7820       is described in what follows. If you are using the 16-bit library, sub-
7821       stitute the  16-bit  functions  and  16-bit  structures  (for  example,
7822       pcre16_jit_stack  instead  of  pcre_jit_stack).  If  you  are using the
7823       32-bit library, substitute the 32-bit functions and  32-bit  structures
7824       (for example, pcre32_jit_stack instead of pcre_jit_stack).
7825
7826
7827AVAILABILITY OF JIT SUPPORT
7828
7829       JIT  support  is  an  optional  feature of PCRE. The "configure" option
7830       --enable-jit (or equivalent CMake option) must  be  set  when  PCRE  is
7831       built  if  you want to use JIT. The support is limited to the following
7832       hardware platforms:
7833
7834         ARM v5, v7, and Thumb2
7835         Intel x86 32-bit and 64-bit
7836         MIPS 32-bit
7837         Power PC 32-bit and 64-bit
7838         SPARC 32-bit (experimental)
7839
7840       If --enable-jit is set on an unsupported platform, compilation fails.
7841
7842       A program that is linked with PCRE 8.20 or later can tell if  JIT  sup-
7843       port  is  available  by  calling pcre_config() with the PCRE_CONFIG_JIT
7844       option. The result is 1 when JIT is available, and  0  otherwise.  How-
7845       ever, a simple program does not need to check this in order to use JIT.
7846       The normal API is implemented in a way that falls back to the interpre-
7847       tive code if JIT is not available. For programs that need the best pos-
7848       sible performance, there is also a "fast path"  API  that  is  JIT-spe-
7849       cific.
7850
7851       If  your program may sometimes be linked with versions of PCRE that are
7852       older than 8.20, but you want to use JIT when it is available, you  can
7853       test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
7854       macro such as PCRE_CONFIG_JIT, for compile-time control of your code.
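
       A compile-time test of the version macros might look like the following
       sketch. The APP_ names and app_jit_api_compiled_in() are hypothetical,
       and the fallback definitions of PCRE_MAJOR and PCRE_MINOR are
       placeholders that exist only so that the fragment compiles without
       pcre.h; in a real program the values come from the header.

```c
/* Placeholder values, used only when pcre.h has not been included. */
#ifndef PCRE_MAJOR
#define PCRE_MAJOR 8
#define PCRE_MINOR 32
#endif

/* Nonzero when the header describes release major.minor or later. */
#define APP_PCRE_AT_LEAST(major, minor) \
    (PCRE_MAJOR > (major) || (PCRE_MAJOR == (major) && PCRE_MINOR >= (minor)))

/* JIT-related names were introduced in release 8.20. */
#if APP_PCRE_AT_LEAST(8, 20)
#define APP_HAVE_JIT_API 1
#else
#define APP_HAVE_JIT_API 0
#endif

/* Lets ordinary code ask whether the JIT-aware branch was compiled in. */
int app_jit_api_compiled_in(void)
{
    return APP_HAVE_JIT_API;
}
```

       These macros describe the header that the program was compiled against.
       Whether JIT is actually available in the library in use is still a
       run-time question for pcre_config().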
7855
7856
7857SIMPLE USE OF JIT
7858
7859       You have to do two things to make use of the JIT support  in  the  sim-
7860       plest way:
7861
7862         (1) Call pcre_study() with the PCRE_STUDY_JIT_COMPILE option for
7863             each compiled pattern, and pass the resulting pcre_extra block to
7864             pcre_exec().
7865
         (2) Use pcre_free_study() to free the pcre_extra block when it is
             no longer needed, instead of just freeing it yourself. This
             ensures that any JIT data is also freed.
7870
7871       For a program that may be linked with pre-8.20 versions  of  PCRE,  you
7872       can insert
7873
         /* Older headers lack this option; 0 requests no JIT compilation. */
         #ifndef PCRE_STUDY_JIT_COMPILE
         #define PCRE_STUDY_JIT_COMPILE 0
         #endif
7877
7878       so  that  no  option  is passed to pcre_study(), and then use something
7879       like this to free the study data:
7880
         #ifdef PCRE_CONFIG_JIT
             pcre_free_study(study_ptr);   /* also frees any JIT data */
         #else
             pcre_free(study_ptr);         /* older PCRE: no JIT data exists */
         #endif
7886
7887       PCRE_STUDY_JIT_COMPILE requests the JIT compiler to generate  code  for
7888       complete  matches.  If  you  want  to  run  partial  matches  using the
7889       PCRE_PARTIAL_HARD or  PCRE_PARTIAL_SOFT  options  of  pcre_exec(),  you
7890       should  set  one  or  both  of the following options in addition to, or
7891       instead of, PCRE_STUDY_JIT_COMPILE when you call pcre_study():
7892
7893         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
7894         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
7895
7896       The JIT compiler generates different optimized code  for  each  of  the
7897       three  modes  (normal, soft partial, hard partial). When pcre_exec() is
7898       called, the appropriate code is run if it is available. Otherwise,  the
7899       pattern is matched using interpretive code.
7900
7901       In  some circumstances you may need to call additional functions. These
7902       are described in the  section  entitled  "Controlling  the  JIT  stack"
7903       below.
7904
7905       If  JIT  support  is  not  available,  PCRE_STUDY_JIT_COMPILE  etc. are
7906       ignored, and no JIT data is created. Otherwise, the compiled pattern is
7907       passed  to the JIT compiler, which turns it into machine code that exe-
7908       cutes much faster than the normal interpretive code.  When  pcre_exec()
7909       is  passed  a  pcre_extra block containing a pointer to JIT code of the
7910       appropriate mode (normal or hard/soft  partial),  it  obeys  that  code
7911       instead  of  running  the interpreter. The result is identical, but the
7912       compiled JIT code runs much faster.
7913
7914       There are some pcre_exec() options that are not supported for JIT  exe-
7915       cution.  There  are  also  some  pattern  items that JIT cannot handle.
7916       Details are given below. In both cases, execution  automatically  falls
7917       back  to  the  interpretive  code.  If you want to know whether JIT was
7918       actually used for a particular match, you  should  arrange  for  a  JIT
7919       callback  function  to  be  set up as described in the section entitled
7920       "Controlling the JIT stack" below, even if you do not need to supply  a
7921       non-default  JIT stack. Such a callback function is called whenever JIT
7922       code is about to be obeyed. If the execution options are not right  for
7923       JIT execution, the callback function is not obeyed.
7924
7925       If  the  JIT  compiler finds an unsupported item, no JIT data is gener-
7926       ated. You can find out if JIT execution is available after  studying  a
7927       pattern  by  calling  pcre_fullinfo()  with the PCRE_INFO_JIT option. A
7928       result of 1 means that JIT compilation was successful. A  result  of  0
7929       means that JIT support is not available, or the pattern was not studied
7930       with PCRE_STUDY_JIT_COMPILE etc., or the JIT compiler was not  able  to
7931       handle the pattern.
7932
7933       Once a pattern has been studied, with or without JIT, it can be used as
7934       many times as you like for matching different subject strings.
7935
7936
7937UNSUPPORTED OPTIONS AND PATTERN ITEMS
7938
7939       The only pcre_exec() options that are supported for JIT  execution  are
7940       PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK, PCRE_NO_UTF32_CHECK, PCRE_NOT-
7941       BOL,  PCRE_NOTEOL,  PCRE_NOTEMPTY,   PCRE_NOTEMPTY_ATSTART,   PCRE_PAR-
7942       TIAL_HARD, and PCRE_PARTIAL_SOFT.
7943
7944       The unsupported pattern items are:
7945
7946         \C             match a single byte; not supported in UTF-8 mode
7947         (?Cn)          callouts
7948         (*PRUNE)       )
7949         (*SKIP)        ) backtracking control verbs
7950         (*THEN)        )
7951
7952       Support for some of these may be added in future.
7953
7954
7955RETURN VALUES FROM JIT EXECUTION
7956
7957       When  a  pattern  is matched using JIT execution, the return values are
7958       the same as those given by the interpretive pcre_exec() code, with  the
7959       addition  of  one new error code: PCRE_ERROR_JIT_STACKLIMIT. This means
7960       that the memory used for the JIT stack was insufficient. See  "Control-
7961       ling the JIT stack" below for a discussion of JIT stack usage. For com-
7962       patibility with the interpretive pcre_exec() code, no  more  than  two-
7963       thirds  of  the ovector argument is used for passing back captured sub-
7964       strings.
7965
7966       The error code PCRE_ERROR_MATCHLIMIT is returned by  the  JIT  code  if
7967       searching  a  very large pattern tree goes on for too long, as it is in
7968       the same circumstance when JIT is not used, but the details of  exactly
7969       what  is  counted are not the same. The PCRE_ERROR_RECURSIONLIMIT error
7970       code is never returned by JIT execution.
7971
7972
7973SAVING AND RESTORING COMPILED PATTERNS
7974
7975       The code that is generated by the  JIT  compiler  is  architecture-spe-
7976       cific,  and  is also position dependent. For those reasons it cannot be
7977       saved (in a file or database) and restored later like the bytecode  and
7978       other  data  of  a compiled pattern. Saving and restoring compiled pat-
7979       terns is not something many people do. More detail about this  facility
7980       is  given in the pcreprecompile documentation. It should be possible to
7981       run pcre_study() on a saved and restored pattern, and thereby  recreate
7982       the  JIT  data, but because JIT compilation uses significant resources,
7983       it is probably not worth doing this; you might as  well  recompile  the
7984       original pattern.
7985
7986
7987CONTROLLING THE JIT STACK
7988
7989       When the compiled JIT code runs, it needs a block of memory to use as a
7990       stack.  By default, it uses 32K on the  machine  stack.  However,  some
7991       large   or   complicated  patterns  need  more  than  this.  The  error
7992       PCRE_ERROR_JIT_STACKLIMIT is given when  there  is  not  enough  stack.
7993       Three  functions  are provided for managing blocks of memory for use as
7994       JIT stacks. There is further discussion about the use of JIT stacks  in
7995       the section entitled "JIT stack FAQ" below.
7996
7997       The  pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
7998       are a starting size and a maximum size, and it returns a pointer to  an
7999       opaque  structure of type pcre_jit_stack, or NULL if there is an error.
8000       The pcre_jit_stack_free() function can be used to free a stack that  is
8001       no  longer  needed.  (For  the technically minded: the address space is
8002       allocated by mmap or VirtualAlloc.)
8003
8004       JIT uses far less memory for recursion than the interpretive code,  and
8005       a  maximum  stack size of 512K to 1M should be more than enough for any
8006       pattern.
8007
8008       The pcre_assign_jit_stack() function specifies  which  stack  JIT  code
8009       should use. Its arguments are as follows:
8010
8011         pcre_extra         *extra
8012         pcre_jit_callback  callback
8013         void               *data
8014
       The extra argument must be the result of studying a pattern with
       PCRE_STUDY_JIT_COMPILE etc. There are three cases for the values of the
       other two arguments:
8018
8019         (1) If callback is NULL and data is NULL, an internal 32K block
8020             on the machine stack is used.
8021
8022         (2) If callback is NULL and data is not NULL, data must be
8023             a valid JIT stack, the result of calling pcre_jit_stack_alloc().
8024
8025         (3) If callback is not NULL, it must point to a function that is
8026             called with data as an argument at the start of matching, in
8027             order to set up a JIT stack. If the return from the callback
8028             function is NULL, the internal 32K stack is used; otherwise the
8029             return value must be a valid JIT stack, the result of calling
8030             pcre_jit_stack_alloc().
8031
8032       A  callback function is obeyed whenever JIT code is about to be run; it
8033       is not obeyed when pcre_exec() is called with options that  are  incom-
8034       patible for JIT execution. A callback function can therefore be used to
8035       determine whether a match operation was  executed  by  JIT  or  by  the
8036       interpreter.
8037
8038       You may safely use the same JIT stack for more than one pattern (either
8039       by assigning directly or by callback), as long as the patterns are  all
8040       matched  sequentially in the same thread. In a multithread application,
8041       if you do not specify a JIT stack, or if you assign or pass  back  NULL
8042       from  a  callback, that is thread-safe, because each thread has its own
8043       machine stack. However, if you assign  or  pass  back  a  non-NULL  JIT
8044       stack,  this  must  be  a  different  stack for each thread so that the
8045       application is thread-safe.
8046
8047       Strictly speaking, even more is allowed. You can assign the  same  non-
8048       NULL  stack  to any number of patterns as long as they are not used for
8049       matching by multiple threads at the same time.  For  example,  you  can
8050       assign  the same stack to all compiled patterns, and use a global mutex
8051       in the callback to wait until the stack is available for use.  However,
8052       this is an inefficient solution, and not recommended.
8053
8054       This  is a suggestion for how a multithreaded program that needs to set
8055       up non-default JIT stacks might operate:
8056
         During thread initialization
           thread_local_var = pcre_jit_stack_alloc(...)

         During thread exit
           pcre_jit_stack_free(thread_local_var)

         Use a one-line callback function
           return thread_local_var
8065
8066       All the functions described in this section do nothing if  JIT  is  not
8067       available,  and  pcre_assign_jit_stack()  does nothing unless the extra
8068       argument is non-NULL and points to  a  pcre_extra  block  that  is  the
8069       result of a successful study with PCRE_STUDY_JIT_COMPILE etc.
8070
8071
8072JIT STACK FAQ
8073
8074       (1) Why do we need JIT stacks?
8075
       PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack
       where the local data of the current node is pushed before checking its
       child nodes. Allocating real machine stack on some platforms is
       difficult. For example, on PowerPC the stack chain needs to be updated
       every time the stack is extended. Although this is possible, the time
       overhead of the updates decreases performance. So we do the recursion
       in memory.
8082
8083       (2) Why don't we simply allocate blocks of memory with malloc()?
8084
8085       Modern operating systems have a  nice  feature:  they  can  reserve  an
8086       address space instead of allocating memory. We can safely allocate mem-
8087       ory pages inside this address space, so the stack  could  grow  without
8088       moving memory data (this is important because of pointers). Thus we can
8089       allocate 1M address space, and use only a single memory  page  (usually
8090       4K)  if  that is enough. However, we can still grow up to 1M anytime if
8091       needed.
8092
8093       (3) Who "owns" a JIT stack?
8094
       The owner of the stack is the user program, not the JIT studied pattern
       or anything else. The user program must ensure that while a stack is
       in use by pcre_exec() (that is, while it is assigned to the pattern
       currently running), it is not used by any other thread (to avoid
       overwriting the same memory area). The best practice for multithreaded
       programs is to allocate a stack for each thread, and return this stack
       through the JIT callback function.
8102
8103       (4) When should a JIT stack be freed?
8104
       You can free a JIT stack at any time, as long as it will not be used by
       pcre_exec() again. When you assign the stack to a pattern, only a
       pointer is set. There is no reference counting or any other magic. You
       can free the patterns and stacks in any order, at any time. Just do not
       call pcre_exec() with a pattern pointing to an already freed stack, as
       that will cause a segmentation fault. (Also, do not free a stack that
       is currently in use by pcre_exec() in another thread.) You can also
       replace the stack for a pattern at any time. You can even free the
       previous stack before assigning a replacement.
8114
8115       (5) Should I allocate/free a  stack  every  time  before/after  calling
8116       pcre_exec()?
8117
       No, because this is too costly in terms of resources. However, you
       could implement a scheme that releases the stack if it has not been
       used for, say, two minutes. The JIT callback can help to achieve this
       without keeping a list of the currently JIT studied patterns.
8122
8123       (6) OK, the stack is for long term memory allocation. But what  happens
8124       if  a pattern causes stack overflow with a stack of 1M? Is that 1M kept
8125       until the stack is freed?
8126
       Especially on embedded systems, it might be a good idea to release
       memory sometimes without freeing the stack. There is no API for this at
       the moment. Probably a function call that returns the currently
       allocated memory for any stack, and another that allows releasing
       memory (shrinking the stack), would be a good idea if someone needs
       this.
8132
8133       (7) This is too much of a headache. Isn't there any better solution for
8134       JIT stack handling?
8135
8136       No,  thanks to Windows. If POSIX threads were used everywhere, we could
8137       throw out this complicated API.
8138
8139
8140EXAMPLE CODE
8141
8142       This is a single-threaded example that specifies a  JIT  stack  without
8143       using a callback.
8144
         /* pattern, subject, and length are assumed to be set up by the
            caller. */
         int rc;
         int erroffset;
         int ovector[30];
         const char *error;
         pcre *re;
         pcre_extra *extra;
         pcre_jit_stack *jit_stack;

         re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
         /* Check for errors (re == NULL) */
         extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &error);
         /* Check for errors (error != NULL) */
         jit_stack = pcre_jit_stack_alloc(32*1024, 512*1024);
         /* Check for error (NULL) */
         pcre_assign_jit_stack(extra, NULL, jit_stack);
         rc = pcre_exec(re, extra, subject, length, 0, 0, ovector, 30);
         /* Check results */
         pcre_free(re);
         pcre_free_study(extra);
         pcre_jit_stack_free(jit_stack);
8162
8163
8164JIT FAST PATH API
8165
8166       Because  the  API  described  above falls back to interpreted execution
8167       when JIT is not available, it is convenient for programs that are writ-
8168       ten  for  general  use  in  many environments. However, calling JIT via
8169       pcre_exec() does have a performance impact. Programs that  are  written
8170       for  use  where  JIT  is known to be available, and which need the best
8171       possible performance, can instead use a "fast path"  API  to  call  JIT
8172       execution  directly  instead of calling pcre_exec() (obviously only for
8173       patterns that have been successfully studied by JIT).
8174
8175       The fast path function is called pcre_jit_exec(), and it takes  exactly
8176       the  same  arguments  as pcre_exec(), plus one additional argument that
8177       must point to a JIT stack. The JIT stack arrangements  described  above
8178       do not apply. The return values are the same as for pcre_exec().
8179
       When you call pcre_exec(), as well as testing for invalid options, a
       number of other sanity checks are performed on the arguments. For
       example, if the subject pointer is NULL, or its length is negative, an
       immediate error is given. Also, unless PCRE_NO_UTF[8|16|32]_CHECK is
       set, a UTF subject string is tested for validity. In the interests of
       speed, these checks do not happen on the JIT fast path, and if invalid
       data is passed, the result is undefined.
8187
8188       Bypassing  the  sanity  checks  and  the  pcre_exec() wrapping can give
8189       speedups of more than 10%.
8190
8191
8192SEE ALSO
8193
8194       pcreapi(3)
8195
8196
8197AUTHOR
8198
8199       Philip Hazel (FAQ by Zoltan Herczeg)
8200       University Computing Service
8201       Cambridge CB2 3QH, England.
8202
8203
8204REVISION
8205
8206       Last updated: 31 October 2012
8207       Copyright (c) 1997-2012 University of Cambridge.
8208------------------------------------------------------------------------------
8209
8210
8211PCREPARTIAL(3)                                                  PCREPARTIAL(3)
8212
8213
8214NAME
8215       PCRE - Perl-compatible regular expressions
8216
8217
8218PARTIAL MATCHING IN PCRE
8219
8220       In normal use of PCRE, if the subject string that is passed to a match-
8221       ing function matches as far as it goes, but is too short to  match  the
8222       entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances
8223       where it might be helpful to distinguish this case from other cases  in
8224       which there is no match.
8225
8226       Consider, for example, an application where a human is required to type
8227       in data for a field with specific formatting requirements.  An  example
8228       might be a date in the form ddmmmyy, defined by this pattern:
8229
8230         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
8231
8232       If the application sees the user's keystrokes one by one, and can check
8233       that what has been typed so far is potentially valid,  it  is  able  to
8234       raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
8235       reflecting the character that has been typed, for example. This immedi-
8236       ate  feedback is likely to be a better user interface than a check that
8237       is delayed until the entire string has been entered.  Partial  matching
8238       can  also be useful when the subject string is very long and is not all
8239       available at once.
8240
8241       PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and
8242       PCRE_PARTIAL_HARD  options,  which  can  be set when calling any of the
8243       matching functions. For backwards compatibility, PCRE_PARTIAL is a syn-
8244       onym  for  PCRE_PARTIAL_SOFT.  The essential difference between the two
8245       options is whether or not a partial match is preferred to  an  alterna-
8246       tive complete match, though the details differ between the two types of
8247       matching function. If both options  are  set,  PCRE_PARTIAL_HARD  takes
8248       precedence.
8249
8250       If  you  want to use partial matching with just-in-time optimized code,
8251       you must call pcre_study(), pcre16_study() or  pcre32_study() with  one
8252       or both of these options:
8253
8254         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
8255         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
8256
8257       PCRE_STUDY_JIT_COMPILE  should also be set if you are going to run non-
8258       partial matches on the same pattern. If the appropriate JIT study  mode
8259       has not been set for a match, the interpretive matching code is used.
8260
8261       Setting a partial matching option disables two of PCRE's standard opti-
8262       mizations. PCRE remembers the last literal data unit in a pattern,  and
8263       abandons  matching  immediately  if  it  is  not present in the subject
8264       string. This optimization cannot be used  for  a  subject  string  that
8265       might  match only partially. If the pattern was studied, PCRE knows the
8266       minimum length of a matching string, and does not  bother  to  run  the
8267       matching  function  on  shorter strings. This optimization is also dis-
8268       abled for partial matching.
8269
8270
8271PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()
8272
8273       A  partial   match   occurs   during   a   call   to   pcre_exec()   or
8274       pcre[16|32]_exec()  when  the end of the subject string is reached suc-
8275       cessfully, but matching cannot continue  because  more  characters  are
8276       needed.  However,  at least one character in the subject must have been
8277       inspected. This character need not  form  part  of  the  final  matched
8278       string;  lookbehind  assertions and the \K escape sequence provide ways
8279       of inspecting characters before the start of a matched  substring.  The
8280       requirement  for  inspecting  at  least one character exists because an
8281       empty string can always be matched; without such  a  restriction  there
8282       would  always  be  a partial match of an empty string at the end of the
8283       subject.
8284
8285       If there are at least two slots in the offsets vector  when  a  partial
8286       match  is returned, the first slot is set to the offset of the earliest
8287       character that was inspected. For convenience, the second offset points
8288       to the end of the subject so that a substring can easily be identified.
8289
8290       For  the majority of patterns, the first offset identifies the start of
8291       the partially matched string. However, for patterns that contain  look-
8292       behind  assertions,  or  \K, or begin with \b or \B, earlier characters
8293       have been inspected while carrying out the match. For example:
8294
8295         /(?<=abc)123/
8296
8297       This pattern matches "123", but only if it is preceded by "abc". If the
8298       subject string is "xyzabc12", the offsets after a partial match are for
8299       the substring "abc12", because  all  these  characters  are  needed  if
8300       another match is tried with extra characters added to the subject.
8301
       What happens when a partial match is identified depends on which of the
       two partial matching options is set.
8304
8305   PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre[16|32]_exec()
8306
8307       If PCRE_PARTIAL_SOFT is  set  when  pcre_exec()  or  pcre[16|32]_exec()
8308       identifies a partial match, the partial match is remembered, but match-
8309       ing continues as normal, and other  alternatives  in  the  pattern  are
8310       tried.  If  no  complete  match  can  be  found,  PCRE_ERROR_PARTIAL is
8311       returned instead of PCRE_ERROR_NOMATCH.
8312
8313       This option is "soft" because it prefers a complete match over  a  par-
8314       tial  match.   All the various matching items in a pattern behave as if
8315       the subject string is potentially complete. For example, \z, \Z, and  $
8316       match  at  the end of the subject, as normal, and for \b and \B the end
8317       of the subject is treated as a non-alphanumeric.
8318
8319       If there is more than one partial match, the first one that  was  found
8320       provides the data that is returned. Consider this pattern:
8321
8322         /123\w+X|dogY/
8323
8324       If  this is matched against the subject string "abc123dog", both alter-
8325       natives fail to match, but the end of the  subject  is  reached  during
8326       matching,  so  PCRE_ERROR_PARTIAL is returned. The offsets are set to 3
8327       and 9, identifying "123dog" as the first partial match that was  found.
8328       (In  this  example, there are two partial matches, because "dog" on its
8329       own partially matches the second alternative.)
8330
8331   PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre[16|32]_exec()
8332
8333       If PCRE_PARTIAL_HARD is  set  for  pcre_exec()  or  pcre[16|32]_exec(),
8334       PCRE_ERROR_PARTIAL  is  returned  as  soon as a partial match is found,
8335       without continuing to search for possible complete matches. This option
8336       is "hard" because it prefers an earlier partial match over a later com-
8337       plete match. For this reason, the assumption is made that  the  end  of
8338       the  supplied  subject  string may not be the true end of the available
8339       data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of the
8340       subject,  the  result is PCRE_ERROR_PARTIAL, provided that at least one
8341       character in the subject has been inspected.
8342
8343       Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16 subject
8344       strings  are checked for validity. Normally, an invalid sequence causes
8345       the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16.  However,  in  the
8346       special  case  of  a  truncated  character  at  the end of the subject,
8347       PCRE_ERROR_SHORTUTF8  or   PCRE_ERROR_SHORTUTF16   is   returned   when
8348       PCRE_PARTIAL_HARD is set.
8349
8350   Comparing hard and soft partial matching
8351
8352       The  difference  between the two partial matching options can be illus-
8353       trated by a pattern such as:
8354
8355         /dog(sbody)?/
8356
8357       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
8358       the  longer  string  if  possible). If it is matched against the string
8359       "dog" with PCRE_PARTIAL_SOFT, it yields a  complete  match  for  "dog".
8360       However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
8361       On the other hand, if the pattern is made ungreedy the result  is  dif-
8362       ferent:
8363
8364         /dog(sbody)??/
8365
8366       In  this  case  the  result  is always a complete match because that is
8367       found first, and matching never  continues  after  finding  a  complete
8368       match. It might be easier to follow this explanation by thinking of the
8369       two patterns like this:
8370
8371         /dog(sbody)?/    is the same as  /dogsbody|dog/
8372         /dog(sbody)??/   is the same as  /dog|dogsbody/
8373
8374       The second pattern will never match "dogsbody", because it will  always
8375       find the shorter match first.
8376
8377
8378PARTIAL MATCHING USING pcre_dfa_exec() OR pcre[16|32]_dfa_exec()
8379
8380       The DFA functions move along the subject string character by character,
8381       without backtracking, searching for  all  possible  matches  simultane-
8382       ously.  If the end of the subject is reached before the end of the pat-
8383       tern, there is the possibility of a partial match, again provided  that
8384       at least one character has been inspected.
8385
8386       When  PCRE_PARTIAL_SOFT  is set, PCRE_ERROR_PARTIAL is returned only if
8387       there have been no complete matches. Otherwise,  the  complete  matches
8388       are  returned.   However,  if PCRE_PARTIAL_HARD is set, a partial match
8389       takes precedence over any complete matches. The portion of  the  string
8390       that  was  inspected when the longest partial match was found is set as
8391       the first matching string, provided there are at least two slots in the
8392       offsets vector.
8393
       Because the DFA functions always search for all possible matches, and
       there is no difference between greedy and ungreedy repetition, their
       behaviour is different from the standard functions when
       PCRE_PARTIAL_HARD is set. Consider the string "dog" matched against the
       ungreedy pattern shown above:
8399
8400         /dog(sbody)??/
8401
8402       Whereas  the  standard functions stop as soon as they find the complete
8403       match for "dog", the DFA functions also  find  the  partial  match  for
8404       "dogsbody", and so return that when PCRE_PARTIAL_HARD is set.
8405
8406
8407PARTIAL MATCHING AND WORD BOUNDARIES
8408
       If a pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
       counter-intuitive results. Consider this pattern:
8412
8413         /\bcat\b/
8414
8415       This matches "cat", provided there is a word boundary at either end. If
8416       the subject string is "the cat", the comparison of the final "t" with a
8417       following  character  cannot  take  place, so a partial match is found.
8418       However, normal matching carries on, and \b matches at the end  of  the
8419       subject  when  the  last  character is a letter, so a complete match is
8420       found.  The  result,  therefore,  is  not   PCRE_ERROR_PARTIAL.   Using
8421       PCRE_PARTIAL_HARD  in  this case does yield PCRE_ERROR_PARTIAL, because
8422       then the partial match takes precedence.
8423
8424
8425FORMERLY RESTRICTED PATTERNS
8426
       For releases of PCRE prior to 8.00, because of the way certain internal
       optimizations were implemented in the pcre_exec() function, the
       PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
       used with all patterns. From release 8.00 onwards, the restrictions no
       longer apply, and partial matching can be requested for any pattern.
8433
8434       Items that were formerly restricted were repeated single characters and
8435       repeated metasequences. If PCRE_PARTIAL was set for a pattern that  did
8436       not  conform  to  the restrictions, pcre_exec() returned the error code
8437       PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in  use.  The
8438       PCRE_INFO_OKPARTIAL  call  to pcre_fullinfo() to find out if a compiled
8439       pattern can be used for partial matching now always returns 1.
8440
8441
8442EXAMPLE OF PARTIAL MATCHING USING PCRETEST
8443
8444       If the escape sequence \P is present  in  a  pcretest  data  line,  the
8445       PCRE_PARTIAL_SOFT  option  is  used  for  the  match.  Here is a run of
8446       pcretest that uses the date example quoted above:
8447
           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\P
          0: 25jun04
          1: jun
         data> 25dec3\P
         Partial match: 25dec3
         data> 3ju\P
         Partial match: 3ju
         data> 3juj\P
         No match
         data> j\P
         No match
8460
8461       The first data string is matched  completely,  so  pcretest  shows  the
8462       matched  substrings.  The  remaining four strings do not match the com-
8463       plete pattern, but the first two are partial matches. Similar output is
8464       obtained if DFA matching is used.
8465
8466       If  the escape sequence \P is present more than once in a pcretest data
8467       line, the PCRE_PARTIAL_HARD option is set for the match.
8468
8469
8470MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16|32]_dfa_exec()
8471
8472       When a partial match has been found using a DFA matching  function,  it
8473       is  possible to continue the match by providing additional subject data
8474       and calling the function again with the same compiled  regular  expres-
8475       sion,  this time setting the PCRE_DFA_RESTART option. You must pass the
8476       same working space as before, because this is where details of the pre-
8477       vious  partial  match  are  stored.  Here is an example using pcretest,
8478       using the \R escape sequence to set  the  PCRE_DFA_RESTART  option  (\D
8479       specifies the use of the DFA matching function):
8480
8481           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
8482         data> 23ja\P\D
8483         Partial match: 23ja
8484         data> n05\R\D
8485          0: n05
8486
8487       The  first  call has "23ja" as the subject, and requests partial match-
8488       ing; the second call  has  "n05"  as  the  subject  for  the  continued
8489       (restarted)  match.   Notice  that when the match is complete, only the
8490       last part is shown; PCRE does  not  retain  the  previously  partially-
8491       matched  string. It is up to the calling program to do that if it needs
8492       to.
8493
8494       You can set the PCRE_PARTIAL_SOFT  or  PCRE_PARTIAL_HARD  options  with
8495       PCRE_DFA_RESTART  to  continue partial matching over multiple segments.
8496       This facility can be used to pass very long subject strings to the  DFA
8497       matching functions.
8498
8499
8500MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre[16|32]_exec()
8501
8502       From  release 8.00, the standard matching functions can also be used to
8503       do multi-segment matching. Unlike the DFA functions, it is not possible
8504       to  restart the previous match with a new segment of data. Instead, new
8505       data must be added to the previous subject string, and the entire match
8506       re-run,  starting from the point where the partial match occurred. Ear-
8507       lier data can be discarded.
8508
8509       It is best to use PCRE_PARTIAL_HARD in this situation, because it  does
8510       not  treat the end of a segment as the end of the subject when matching
8511       \z, \Z, \b, \B, and $. Consider  an  unanchored  pattern  that  matches
8512       dates:
8513
8514           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
8515         data> The date is 23ja\P\P
8516         Partial match: 23ja
8517
       At this stage, an application could discard the text preceding "23ja",
       add on text from the next segment, and call the matching function
       again. Unlike the DFA matching functions, the entire matching string
       must always be available, and the complete matching process occurs for
       each call, so more memory and more processing time are needed.
8523
8524       Note:  If  the pattern contains lookbehind assertions, or \K, or starts
8525       with \b or \B, the string that is returned for a partial match includes
8526       characters  that  precede  the partially matched string itself, because
8527       these must be retained when adding on more characters for a  subsequent
8528       matching  attempt.   However, in some cases you may need to retain even
8529       earlier characters, as discussed in the next section.
8530
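       The buffer handling described in this section can be sketched in
       plain C. The helper below is hypothetical and does not call PCRE
       itself: it takes the first offset returned with PCRE_ERROR_PARTIAL
       (ovector[0]) and trims the subject buffer so that the next segment
       can be appended. The extra_keep parameter is an assumption of this
       sketch, for callers that want to retain additional earlier bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: after a partial match, discards everything in
   buf that precedes offset partial_start, except for up to extra_keep
   additional bytes of earlier context. The retained tail is moved to
   the front of buf and its length is returned, so the caller can append
   the next segment at buf + (returned length). */
size_t keep_partial_tail(char *buf, size_t len, int partial_start,
                         size_t extra_keep)
{
    size_t start = (partial_start < 0) ? 0 : (size_t)partial_start;

    if (start > len)
        start = len;
    /* Near the start of the subject fewer bytes exist: keep them all. */
    start = (extra_keep >= start) ? 0 : start - extra_keep;

    memmove(buf, buf + start, len - start);
    return len - start;
}
```

       For the subject "The date is 23ja" with a partial match starting at
       offset 12, calling the helper with extra_keep set to 0 leaves "23ja"
       at the front of the buffer and returns 4. The helper counts bytes,
       so in a UTF mode any retained context must be converted from
       characters to bytes by the caller.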
8531
8532ISSUES WITH MULTI-SEGMENT MATCHING
8533
8534       Certain types of pattern may give problems with multi-segment matching,
8535       whichever matching function is used.
8536
       1. If the pattern contains a test for the beginning of a line, you need
       to pass the PCRE_NOTBOL option when the subject string for any call
       does not start at the beginning of a line. There is also a PCRE_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.
8542
       2. Lookbehind assertions that have already been obeyed are catered for
       in the offsets that are returned for a partial match. However, a
       lookbehind assertion later in the pattern could require even earlier
       characters to be inspected. You can handle this case by using the
       PCRE_INFO_MAXLOOKBEHIND option of the pcre_fullinfo() or
       pcre[16|32]_fullinfo() functions to obtain the length of the largest
       lookbehind in the pattern. This length is given in characters, not
       bytes. If you always retain at least that many characters before the
       partially matched string, all should be well. (Of course, near the
       start of the subject, fewer characters may be present; in that case all
       characters should be retained.)
8554
8555       3.  Because a partial match must always contain at least one character,
8556       what might be considered a partial match of an  empty  string  actually
8557       gives a "no match" result. For example:
8558
8559           re> /c(?<=abc)x/
8560         data> ab\P
8561         No match
8562
8563       If the next segment begins "cx", a match should be found, but this will
8564       only happen if characters from the previous segment are  retained.  For
8565       this  reason,  a  "no  match"  result should be interpreted as "partial
8566       match of an empty string" when the pattern contains lookbehinds.
8567
8568       4. Matching a subject string that is split into multiple  segments  may
8569       not  always produce exactly the same result as matching over one single
8570       long string, especially when PCRE_PARTIAL_SOFT  is  used.  The  section
8571       "Partial  Matching  and  Word Boundaries" above describes an issue that
8572       arises if the pattern ends with \b or \B. Another  kind  of  difference
8573       may  occur when there are multiple matching possibilities, because (for
8574       PCRE_PARTIAL_SOFT) a partial match result is given only when there  are
8575       no completed matches. This means that as soon as the shortest match has
8576       been found, continuation to a new subject segment is no  longer  possi-
8577       ble. Consider again this pcretest example:
8578
8579           re> /dog(sbody)?/
8580         data> dogsb\P
8581          0: dog
8582         data> do\P\D
8583         Partial match: do
8584         data> gsb\R\P\D
8585          0: g
8586         data> dogsbody\D
8587          0: dogsbody
8588          1: dog
8589
8590       The  first  data  line passes the string "dogsb" to a standard matching
8591       function, setting the PCRE_PARTIAL_SOFT option. Although the string  is
8592       a  partial  match for "dogsbody", the result is not PCRE_ERROR_PARTIAL,
8593       because the shorter string "dog" is a complete match.  Similarly,  when
8594       the  subject  is  presented to a DFA matching function in several parts
8595       ("do" and "gsb" being the first two) the match  stops  when  "dog"  has
8596       been  found, and it is not possible to continue.  On the other hand, if
8597       "dogsbody" is presented as a single string,  a  DFA  matching  function
8598       finds both matches.
8599
8600       Because  of  these  problems,  it is best to use PCRE_PARTIAL_HARD when
8601       matching multi-segment data. The example  above  then  behaves  differ-
8602       ently:
8603
8604           re> /dog(sbody)?/
8605         data> dogsb\P\P
8606         Partial match: dogsb
8607         data> do\P\D
8608         Partial match: do
8609         data> gsb\R\P\P\D
8610         Partial match: gsb
8611
8612       5. Patterns that contain alternatives at the top level which do not all
8613       start with the  same  pattern  item  may  not  work  as  expected  when
8614       PCRE_DFA_RESTART is used. For example, consider this pattern:
8615
8616         1234|3789
8617
8618       If  the  first  part of the subject is "ABC123", a partial match of the
8619       first alternative is found at offset 3. There is no partial  match  for
8620       the second alternative, because such a match does not start at the same
8621       point in the subject string. Attempting to  continue  with  the  string
8622       "7890"  does  not  yield  a  match because only those alternatives that
8623       match at one point in the subject are remembered.  The  problem  arises
8624       because  the  start  of the second alternative matches within the first
8625       alternative. There is no problem with  anchored  patterns  or  patterns
8626       such as:
8627
8628         1234|ABCD
8629
8630       where  no  string can be a partial match for both alternatives. This is
8631       not a problem if a standard matching  function  is  used,  because  the
8632       entire match has to be rerun each time:
8633
8634           re> /1234|3789/
8635         data> ABC123\P\P
8636         Partial match: 123
8637         data> 1237890
8638          0: 3789
8639
8640       Of course, instead of using PCRE_DFA_RESTART, the same technique of re-
8641       running the entire match can also be used with the DFA  matching  func-
8642       tions.  Another  possibility  is to work with two buffers. If a partial
8643       match at offset n in the first buffer is followed by  "no  match"  when
8644       PCRE_DFA_RESTART  is  used on the second buffer, you can then try a new
8645       match starting at offset n+1 in the first buffer.
8646
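       For item 2 above, the retention point has to be counted in characters,
       not bytes. A minimal sketch for the 8-bit library in UTF-8 mode
       follows. The helper retain_from_utf8() is hypothetical, max_lookbehind
       is assumed to be the value obtained via PCRE_INFO_MAXLOOKBEHIND, and
       the buffer is assumed to hold valid UTF-8 with partial_start on a
       character boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the byte offset from which text must be kept
   so that max_lookbehind characters precede the partially matched string. */
size_t retain_from_utf8(const unsigned char *buf, size_t partial_start,
                        size_t max_lookbehind)
{
    assert(buf != NULL);
    size_t pos = partial_start;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;                                  /* step onto the previous byte */
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            pos--;                              /* skip UTF-8 continuation bytes */
        max_lookbehind--;
    }
    return pos;                                 /* 0 means "retain everything" */
}
```

       In non-UTF mode every character is a single data unit, so a plain
       subtraction clamped at zero gives the same result.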
8647
8648AUTHOR
8649
8650       Philip Hazel
8651       University Computing Service
8652       Cambridge CB2 3QH, England.
8653
8654
8655REVISION
8656
8657       Last updated: 24 June 2012
8658       Copyright (c) 1997-2012 University of Cambridge.
8659------------------------------------------------------------------------------
8660
8661
8662PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
8663
8664
8665NAME
8666       PCRE - Perl-compatible regular expressions
8667
8668
8669SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
8670
8671       If  you  are running an application that uses a large number of regular
8672       expression patterns, it may be useful to store them  in  a  precompiled
8673       form  instead  of  having to compile them every time the application is
8674       run.  If you are not  using  any  private  character  tables  (see  the
8675       pcre_maketables()  documentation),  this is relatively straightforward.
8676       If you are using private tables, it is a little bit  more  complicated.
8677       However,  if you are using the just-in-time optimization feature, it is
8678       not possible to save and reload the JIT data.
8679
       If you save compiled patterns to a file, you can copy them to a
       different host and run them there. If the two hosts have different
       endianness (byte order), you should run the
       pcre[16|32]_pattern_to_host_byte_order() function on the new host
       before trying to match the pattern. The matching functions return
       PCRE_ERROR_BADENDIANNESS if they detect a pattern with the wrong
       endianness.
8686
8687       Compiling  regular  expressions with one version of PCRE for use with a
8688       different version is not guaranteed to work and may cause crashes,  and
8689       saving  and  restoring  a  compiled  pattern loses any JIT optimization
8690       data.
8691
8692
8693SAVING A COMPILED PATTERN
8694
8695       The value returned by pcre[16|32]_compile() points to a single block of
8696       memory  that  holds  the  compiled pattern and associated data. You can
8697       find   the   length   of   this   block    in    bytes    by    calling
8698       pcre[16|32]_fullinfo() with an argument of PCRE_INFO_SIZE. You can then
       save the data in any appropriate manner. Here is sample code for the
       8-bit library that compiles a pattern and writes it to a file. It
       assumes that the variable fd is a FILE pointer for a file that is open
       for output:
8702
         size_t size, written;
         int erroroffset, rc;
         const char *error;
         pcre *re;

         re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
         if (re == NULL) { ... handle errors ... }
         /* PCRE_INFO_SIZE stores its result in a size_t variable */
         rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
         if (rc < 0) { ... handle errors ... }
         /* fwrite() returns the number of items (here, bytes) written */
         written = fwrite(re, 1, size, fd);
         if (written != size) { ... handle errors ... }
8713
8714       In this example, the bytes  that  comprise  the  compiled  pattern  are
8715       copied  exactly.  Note that this is binary data that may contain any of
8716       the 256 possible byte  values.  On  systems  that  make  a  distinction
8717       between binary and non-binary data, be sure that the file is opened for
8718       binary output.
8719
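       The point about binary output can be illustrated with plain stdio. This
       is a sketch only: save_blob() and load_blob() are hypothetical helpers,
       not PCRE functions, and the data passed in is assumed to be the block
       whose size was obtained with PCRE_INFO_SIZE.

```c
#include <assert.h>
#include <stdio.h>

/* Write size bytes to path in binary mode; returns 0 on success, -1 on error. */
int save_blob(const char *path, const void *data, size_t size)
{
    assert(path != NULL && data != NULL);
    FILE *f = fopen(path, "wb");            /* "b" matters on some systems */
    if (f == NULL)
        return -1;
    size_t written = fwrite(data, 1, size, f);
    int close_failed = (fclose(f) != 0);    /* buffered write errors surface here */
    return (written == size && !close_failed) ? 0 : -1;
}

/* Read up to cap bytes back; returns the byte count, or -1 on error. */
long load_blob(const char *path, void *data, size_t cap)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(data, 1, cap, f);
    int failed = ferror(f);
    fclose(f);
    return failed ? -1 : (long)got;
}
```
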
8720       If you want to write more than one pattern to a file, you will have  to
8721       devise  a  way of separating them. For binary data, preceding each pat-
8722       tern with its length is probably  the  most  straightforward  approach.
8723       Another  possibility is to write out the data in hexadecimal instead of
8724       binary, one pattern to a line.
8725
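       The length-prefix approach might look like the sketch below. The names
       write_record() and read_record() are hypothetical, and the 4-byte
       prefix in host byte order is an assumption chosen for brevity; a
       compiled pattern is itself byte-order dependent, so the prefix needs
       the same care when a file is moved between hosts.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write one record: a 4-byte length in host byte order, then the bytes.
   Returns 0 on success, -1 on error. */
int write_record(FILE *f, const void *data, uint32_t size)
{
    assert(f != NULL);
    if (fwrite(&size, sizeof size, 1, f) != 1)
        return -1;
    return fwrite(data, 1, size, f) == size ? 0 : -1;
}

/* Read the next record into buf (capacity cap).
   Returns its length, or -1 on end of file, error, or oversize record. */
long read_record(FILE *f, void *buf, uint32_t cap)
{
    uint32_t size;
    assert(f != NULL);
    if (fread(&size, sizeof size, 1, f) != 1)
        return -1;
    if (size > cap)
        return -1;
    return fread(buf, 1, size, f) == size ? (long)size : -1;
}
```
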
8726       Saving compiled patterns in a file is only one possible way of  storing
8727       them  for later use. They could equally well be saved in a database, or
8728       in the memory of some daemon process that passes them  via  sockets  to
8729       the processes that want them.
8730
       If the pattern has been studied, it is also possible to save the normal
       study data in a similar way to the compiled pattern itself. However, if
       the PCRE_STUDY_JIT_COMPILE option was used, the just-in-time data that
       is created cannot be saved because it is too dependent on the current
       environment. When studying generates additional information,
       pcre[16|32]_study() returns a pointer to a pcre[16|32]_extra data
       block. Its format is defined in the section on matching a pattern in
       the pcreapi documentation. The study_data field points to the binary
       study data, and this is what you must save (not the pcre[16|32]_extra
       block itself). The length of the study data can be obtained by calling
       pcre[16|32]_fullinfo() with an argument of PCRE_INFO_STUDYSIZE.
       Remember to check that pcre[16|32]_study() did return a non-NULL value
       before trying to save the study data.
8744
8745
8746RE-USING A PRECOMPILED PATTERN
8747
8748       Re-using  a  precompiled pattern is straightforward. Having reloaded it
8749       into main memory,  called  pcre[16|32]_pattern_to_host_byte_order()  if
8750       necessary,    you   pass   its   pointer   to   pcre[16|32]_exec()   or
8751       pcre[16|32]_dfa_exec() in the usual way.
8752
       However, if you passed a pointer to custom character tables when the
       pattern was compiled (the tableptr argument of pcre[16|32]_compile()),
       you must now pass a similar pointer to pcre[16|32]_exec() or
       pcre[16|32]_dfa_exec(), because the pointer value saved with the
       compiled pattern is no longer valid. A field in a pcre[16|32]_extra
       block is used to pass this data, as described in the section on
       matching a pattern in the pcreapi documentation.
8760
8761       If you did not provide custom character tables  when  the  pattern  was
8762       compiled, the pointer in the compiled pattern is NULL, which causes the
8763       matching functions to use PCRE's internal tables. Thus, you do not need
8764       to take any special action at run time in this case.
8765
8766       If  you  saved study data with the compiled pattern, you need to create
8767       your own pcre[16|32]_extra data block and set the study_data  field  to
8768       point   to   the   reloaded   study   data.   You  must  also  set  the
8769       PCRE_EXTRA_STUDY_DATA bit in the flags field  to  indicate  that  study
8770       data  is present. Then pass the pcre[16|32]_extra block to the matching
8771       function in the usual way. If the pattern was studied for  just-in-time
8772       optimization,  that  data  cannot  be  saved,  and  so  is  lost  by  a
8773       save/restore cycle.
8774
8775
8776COMPATIBILITY WITH DIFFERENT PCRE RELEASES
8777
8778       In general, it is safest to  recompile  all  saved  patterns  when  you
8779       update  to  a new PCRE release, though not all updates actually require
8780       this.
8781
8782
8783AUTHOR
8784
8785       Philip Hazel
8786       University Computing Service
8787       Cambridge CB2 3QH, England.
8788
8789
8790REVISION
8791
8792       Last updated: 24 June 2012
8793       Copyright (c) 1997-2012 University of Cambridge.
8794------------------------------------------------------------------------------
8795
8796
8797PCREPERFORM(3)                                                  PCREPERFORM(3)
8798
8799
8800NAME
8801       PCRE - Perl-compatible regular expressions
8802
8803
8804PCRE PERFORMANCE
8805
8806       Two  aspects  of performance are discussed below: memory usage and pro-
8807       cessing time. The way you express your pattern as a regular  expression
8808       can affect both of them.
8809
8810
8811COMPILED PATTERN MEMORY USAGE
8812
8813       Patterns  are compiled by PCRE into a reasonably efficient interpretive
8814       code, so that most simple patterns do not  use  much  memory.  However,
8815       there  is  one case where the memory usage of a compiled pattern can be
8816       unexpectedly large. If a parenthesized subpattern has a quantifier with
8817       a minimum greater than 1 and/or a limited maximum, the whole subpattern
8818       is repeated in the compiled code. For example, the pattern
8819
8820         (abc|def){2,4}
8821
8822       is compiled as if it were
8823
8824         (abc|def)(abc|def)((abc|def)(abc|def)?)?
8825
8826       (Technical aside: It is done this way so that backtrack  points  within
8827       each of the repetitions can be independently maintained.)
8828
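       The replication described above can be reproduced as text to see how
       quickly it grows. The sketch below is an illustration only, not PCRE's
       compiler: expand_repeat() is a hypothetical helper that spells out
       group{min,max} for a limited maximum, in the same form as the example.

```c
#include <assert.h>
#include <string.h>

/* Append src to out if it fits; returns 0 on success, -1 on overflow. */
static int append(char *out, size_t cap, const char *src)
{
    size_t used = strlen(out), add = strlen(src);
    if (add >= cap - used)
        return -1;
    memcpy(out + used, src, add + 1);       /* copy the terminating NUL too */
    return 0;
}

/* Spell out group{min,max} as text, e.g. "(abc|def)", 2, 4. */
int expand_repeat(const char *group, int min, int max, char *out, size_t cap)
{
    assert(cap > 0 && min >= 0 && max >= min);
    int optional = max - min;
    out[0] = '\0';
    for (int i = 0; i < min; i++)                   /* mandatory copies */
        if (append(out, cap, group) < 0) return -1;
    for (int i = 0; i + 1 < optional; i++)          /* open nested optional groups */
        if (append(out, cap, "(") < 0 || append(out, cap, group) < 0) return -1;
    if (optional > 0)                               /* innermost optional copy */
        if (append(out, cap, group) < 0 || append(out, cap, "?") < 0) return -1;
    for (int i = 0; i + 1 < optional; i++)          /* close the nested groups */
        if (append(out, cap, ")?") < 0) return -1;
    return 0;
}
```
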
8829       For  regular expressions whose quantifiers use only small numbers, this
8830       is not usually a problem. However, if the numbers are large,  and  par-
8831       ticularly  if  such repetitions are nested, the memory usage can become
8832       an embarrassment. For example, the very simple pattern
8833
8834         ((ab){1,1000}c){1,3}
8835
8836       uses 51K bytes when compiled using the 8-bit library. When PCRE is com-
8837       piled  with  its  default  internal pointer size of two bytes, the size
8838       limit on a compiled pattern is 64K data units, and this is reached with
8839       the  above  pattern  if  the outer repetition is increased from 3 to 4.
8840       PCRE can be compiled to use larger internal pointers  and  thus  handle
8841       larger  compiled patterns, but it is better to try to rewrite your pat-
8842       tern to use less memory if you can.
8843
8844       One way of reducing the memory usage for such patterns is to  make  use
8845       of PCRE's "subroutine" facility. Re-writing the above pattern as
8846
8847         ((ab)(?2){0,999}c)(?1){0,2}
8848
8849       reduces the memory requirements to 18K, and indeed it remains under 20K
8850       even with the outer repetition increased to 100. However, this  pattern
8851       is  not  exactly equivalent, because the "subroutine" calls are treated
8852       as atomic groups into which there can be no backtracking if there is  a
8853       subsequent  matching  failure.  Therefore,  PCRE cannot do this kind of
8854       rewriting automatically.  Furthermore, there is a  noticeable  loss  of
8855       speed  when executing the modified pattern. Nevertheless, if the atomic
8856       grouping is not a problem and the loss of  speed  is  acceptable,  this
8857       kind  of  rewriting will allow you to process patterns that PCRE cannot
8858       otherwise handle.
8859
8860
8861STACK USAGE AT RUN TIME
8862
8863       When pcre_exec() or pcre[16|32]_exec() is used  for  matching,  certain
8864       kinds  of  pattern  can  cause  it  to use large amounts of the process
8865       stack. In some environments the default process stack is  quite  small,
8866       and  if it runs out the result is often SIGSEGV. This issue is probably
8867       the most frequently raised problem with PCRE.  Rewriting  your  pattern
8868       can  often  help.  The  pcrestack documentation discusses this issue in
8869       detail.
8870
8871
8872PROCESSING TIME
8873
8874       Certain items in regular expression patterns are processed  more  effi-
8875       ciently than others. It is more efficient to use a character class like
8876       [aeiou]  than  a  set  of   single-character   alternatives   such   as
8877       (a|e|i|o|u).  In  general,  the simplest construction that provides the
8878       required behaviour is usually the most efficient. Jeffrey Friedl's book
8879       contains  a  lot  of useful general discussion about optimizing regular
8880       expressions for efficient performance. This  document  contains  a  few
8881       observations about PCRE.
8882
8883       Using  Unicode  character  properties  (the  \p, \P, and \X escapes) is
8884       slow, because PCRE has to use a multi-stage table  lookup  whenever  it
8885       needs  a  character's  property. If you can find an alternative pattern
8886       that does not use character properties, it will probably be faster.
8887
8888       By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
8889       character  classes  such  as  [:alpha:]  do not use Unicode properties,
8890       partly for backwards compatibility, and partly for performance reasons.
8891       However,  you can set PCRE_UCP if you want Unicode character properties
8892       to be used. This can double the matching time for  items  such  as  \d,
8893       when matched with a traditional matching function; the performance loss
8894       is less with a DFA matching function, and in both cases  there  is  not
8895       much difference for \b.
8896
8897       When  a  pattern  begins  with .* not in parentheses, or in parentheses
8898       that are not the subject of a backreference, and the PCRE_DOTALL option
8899       is  set, the pattern is implicitly anchored by PCRE, since it can match
8900       only at the start of a subject string. However, if PCRE_DOTALL  is  not
8901       set,  PCRE  cannot  make this optimization, because the . metacharacter
8902       does not then match a newline, and if the subject string contains  new-
8903       lines,  the  pattern may match from the character immediately following
8904       one of them instead of from the very start. For example, the pattern
8905
8906         .*second
8907
8908       matches the subject "first\nand second" (where \n stands for a  newline
8909       character),  with the match starting at the seventh character. In order
8910       to do this, PCRE has to retry the match starting after every newline in
8911       the subject.
8912
8913       If  you  are using such a pattern with subject strings that do not con-
8914       tain newlines, the best performance is obtained by setting PCRE_DOTALL,
8915       or  starting  the pattern with ^.* or ^.*? to indicate explicit anchor-
8916       ing. That saves PCRE from having to scan along the subject looking  for
8917       a newline to restart at.
8918
8919       Beware  of  patterns  that contain nested indefinite repeats. These can
8920       take a long time to run when applied to a string that does  not  match.
8921       Consider the pattern fragment
8922
8923         ^(a+)*
8924
8925       This  can  match "aaaa" in 16 different ways, and this number increases
8926       very rapidly as the string gets longer. (The * repeat can match  0,  1,
8927       2,  3, or 4 times, and for each of those cases other than 0 or 4, the +
8928       repeats can match different numbers of times.) When  the  remainder  of
8929       the pattern is such that the entire match is going to fail, PCRE has in
8930       principle to try  every  possible  variation,  and  this  can  take  an
8931       extremely long time, even for relatively short strings.
8932
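       The count of 16 can be checked with a short recursion. This is a sketch
       of the counting argument only, not of how PCRE matches; count_ways() is
       a hypothetical name. At each point the * repeat either stops, or the
       next a+ iteration takes between one and all of the remaining
       characters.

```c
#include <assert.h>

/* Number of distinct ways ^(a+)* can match at the start of n "a" characters. */
unsigned long count_ways(unsigned n)
{
    assert(n <= 24);                        /* the recursion itself is exponential */
    unsigned long ways = 1;                 /* the * repeat stops here */
    for (unsigned take = 1; take <= n; take++)
        ways += count_ways(n - take);       /* a+ consumes `take` characters */
    return ways;
}
```
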
8933       An optimization catches some of the more simple cases such as
8934
8935         (a+)*b
8936
8937       where  a  literal  character  follows. Before embarking on the standard
8938       matching procedure, PCRE checks that there is a "b" later in  the  sub-
8939       ject  string, and if there is not, it fails the match immediately. How-
8940       ever, when there is no following literal this  optimization  cannot  be
8941       used. You can see the difference by comparing the behaviour of
8942
8943         (a+)*\d
8944
       with the pattern above. The pattern above, which ends with a literal
       "b", fails almost instantly when applied to a whole line of "a"
       characters, whereas the \d version takes an appreciable time with
       strings longer than about 20 characters.
8948
8949       In many cases, the solution to this kind of performance issue is to use
8950       an atomic group or a possessive quantifier.
8951
8952
8953AUTHOR
8954
8955       Philip Hazel
8956       University Computing Service
8957       Cambridge CB2 3QH, England.
8958
8959
8960REVISION
8961
8962       Last updated: 25 August 2012
8963       Copyright (c) 1997-2012 University of Cambridge.
8964------------------------------------------------------------------------------
8965
8966
8967PCREPOSIX(3)                                                      PCREPOSIX(3)
8968
8969
8970NAME
8971       PCRE - Perl-compatible regular expressions.
8972
8973
8974SYNOPSIS OF POSIX API
8975
8976       #include <pcreposix.h>
8977
8978       int regcomp(regex_t *preg, const char *pattern,
8979            int cflags);
8980
8981       int regexec(regex_t *preg, const char *string,
8982            size_t nmatch, regmatch_t pmatch[], int eflags);
8983
8984       size_t regerror(int errcode, const regex_t *preg,
8985            char *errbuf, size_t errbuf_size);
8986
8987       void regfree(regex_t *preg);
8988
8989
8990DESCRIPTION
8991
       This set of functions provides a POSIX-style API for the PCRE regular
       expression 8-bit library. See the pcreapi documentation for a
       description of PCRE's native API, which contains much additional
       functionality. There is no POSIX-style wrapper for PCRE's 16-bit and
       32-bit libraries.
8997
8998       The functions described here are just wrapper functions that ultimately
8999       call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the
9000       pcreposix.h  header  file,  and  on  Unix systems the library itself is
9001       called pcreposix.a, so can be accessed by  adding  -lpcreposix  to  the
9002       command  for  linking  an application that uses them. Because the POSIX
9003       functions call the native ones, it is also necessary to add -lpcre.
9004
9005       I have implemented only those POSIX option bits that can be  reasonably
9006       mapped  to PCRE native options. In addition, the option REG_EXTENDED is
9007       defined with the value zero. This has no  effect,  but  since  programs
9008       that  are  written  to  the POSIX interface often use it, this makes it
9009       easier to slot in PCRE as a replacement library.  Other  POSIX  options
9010       are not even defined.
9011
9012       There  are also some other options that are not defined by POSIX. These
9013       have been added at the request of users who want to make use of certain
9014       PCRE-specific features via the POSIX calling interface.
9015
9016       When  PCRE  is  called  via these functions, it is only the API that is
9017       POSIX-like in style. The syntax and semantics of  the  regular  expres-
9018       sions  themselves  are  still  those of Perl, subject to the setting of
9019       various PCRE options, as described below. "POSIX-like in  style"  means
9020       that  the  API  approximates  to  the POSIX definition; it is not fully
9021       POSIX-compatible, and in multi-byte encoding  domains  it  is  probably
9022       even less compatible.
9023
9024       The  header for these functions is supplied as pcreposix.h to avoid any
9025       potential clash with other POSIX  libraries.  It  can,  of  course,  be
9026       renamed or aliased as regex.h, which is the "correct" name. It provides
9027       two structure types, regex_t for  compiled  internal  forms,  and  reg-
9028       match_t  for  returning  captured substrings. It also defines some con-
9029       stants whose names start  with  "REG_";  these  are  used  for  setting
9030       options and identifying error codes.
9031
9032
9033COMPILING A PATTERN
9034
9035       The  function regcomp() is called to compile a pattern into an internal
9036       form. The pattern is a C string terminated by a  binary  zero,  and  is
9037       passed  in  the  argument  pattern. The preg argument is a pointer to a
9038       regex_t structure that is used as a base for storing information  about
9039       the compiled regular expression.
9040
9041       The argument cflags is either zero, or contains one or more of the bits
9042       defined by the following macros:
9043
9044         REG_DOTALL
9045
9046       The PCRE_DOTALL option is set when the regular expression is passed for
9047       compilation to the native function. Note that REG_DOTALL is not part of
9048       the POSIX standard.
9049
9050         REG_ICASE
9051
9052       The PCRE_CASELESS option is set when the regular expression  is  passed
9053       for compilation to the native function.
9054
9055         REG_NEWLINE
9056
9057       The  PCRE_MULTILINE option is set when the regular expression is passed
9058       for compilation to the native function. Note that this does  not  mimic
9059       the  defined  POSIX  behaviour  for REG_NEWLINE (see the following sec-
9060       tion).
9061
9062         REG_NOSUB
9063
9064       The PCRE_NO_AUTO_CAPTURE option is set when the regular  expression  is
9065       passed for compilation to the native function. In addition, when a pat-
9066       tern that is compiled with this flag is passed to regexec() for  match-
9067       ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured
9068       strings are returned.
9069
9070         REG_UCP
9071
       The PCRE_UCP option is set when the regular expression is passed for
       compilation to the native function. This causes PCRE to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
       ASCII values. Note that REG_UCP is not part of the POSIX standard.
9076
9077         REG_UNGREEDY
9078
9079       The  PCRE_UNGREEDY  option is set when the regular expression is passed
9080       for compilation to the native function. Note that REG_UNGREEDY  is  not
9081       part of the POSIX standard.
9082
9083         REG_UTF8
9084
9085       The  PCRE_UTF8  option is set when the regular expression is passed for
9086       compilation to the native function. This causes the pattern itself  and
9087       all  data  strings used for matching it to be treated as UTF-8 strings.
9088       Note that REG_UTF8 is not part of the POSIX standard.
9089
       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by . (they are not) or
       by a negative class such as [^a] (they are).
9097
9098       The yield of regcomp() is zero on success, and non-zero otherwise.  The
9099       preg structure is filled in on success, and one member of the structure
9100       is public: re_nsub contains the number of capturing subpatterns in  the
9101       regular expression. Various error codes are defined in the header file.
9102
9103       NOTE:  If  the  yield of regcomp() is non-zero, you must not attempt to
9104       use the contents of the preg structure. If, for example, you pass it to
9105       regexec(), the result is undefined and your program is likely to crash.
9106
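       The rule in the NOTE above can be sketched as follows. The helper
       count_groups() is hypothetical. The sketch uses only the standard POSIX
       names and includes <regex.h>, on the assumption that pcreposix.h has
       been installed or aliased under that name as described earlier;
       otherwise include pcreposix.h instead. REG_EXTENDED is passed because
       portable POSIX programs usually do so, and it has no effect with PCRE.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Compile pattern and report its capture count through *nsub.
   Returns regcomp()'s yield; *nsub is written only on success. */
int count_groups(const char *pattern, size_t *nsub)
{
    regex_t preg;
    assert(pattern != NULL && nsub != NULL);
    int rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;              /* preg must not be used after a failure */
    *nsub = preg.re_nsub;       /* the one public member */
    regfree(&preg);
    return 0;
}
```
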
9107
9108MATCHING NEWLINE CHARACTERS
9109
9110       This area is not simple, because POSIX and Perl take different views of
9111       things.  It is not possible to get PCRE to obey  POSIX  semantics,  but
9112       then  PCRE was never intended to be a POSIX engine. The following table
9113       lists the different possibilities for matching  newline  characters  in
9114       PCRE:
9115
                                 Default   Change with

         . matches newline          no     PCRE_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY
         $ matches \n in middle     no     PCRE_MULTILINE
         ^ matches \n in middle     no     PCRE_MULTILINE
9123
9124       This is the equivalent table for POSIX:
9125
9126                                 Default   Change with
9127
9128         . matches newline          yes    REG_NEWLINE
9129         newline matches [^a]       yes    REG_NEWLINE
9130         $ matches \n at end        no     REG_NEWLINE
9131         $ matches \n in middle     no     REG_NEWLINE
9132         ^ matches \n in middle     no     REG_NEWLINE
9133
9134       PCRE's behaviour is the same as Perl's, except that there is no equiva-
9135       lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl,  there  is
9136       no way to stop newline from matching [^a].
9137
9138       The   default  POSIX  newline  handling  can  be  obtained  by  setting
9139       PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to  make  PCRE
9140       behave exactly as for the REG_NEWLINE action.
9141
9142
9143MATCHING A PATTERN
9144
9145       The  function  regexec()  is  called  to  match a compiled pattern preg
9146       against a given string, which is by default terminated by a  zero  byte
9147       (but  see  REG_STARTEND below), subject to the options in eflags. These
9148       can be:
9149
9150         REG_NOTBOL
9151
9152       The PCRE_NOTBOL option is set when calling the underlying PCRE matching
9153       function.
9154
9155         REG_NOTEMPTY
9156
9157       The PCRE_NOTEMPTY option is set when calling the underlying PCRE match-
9158       ing function. Note that REG_NOTEMPTY is not part of the POSIX standard.
9159       However, setting this option can give more POSIX-like behaviour in some
9160       situations.
9161
9162         REG_NOTEOL
9163
9164       The PCRE_NOTEOL option is set when calling the underlying PCRE matching
9165       function.
9166
9167         REG_STARTEND
9168
9169       The  string  is  considered to start at string + pmatch[0].rm_so and to
9170       have a terminating NUL located at string + pmatch[0].rm_eo (there  need
9171       not  actually  be  a  NUL at that location), regardless of the value of
9172       nmatch. This is a BSD extension, compatible with but not  specified  by
9173       IEEE  Standard  1003.2  (POSIX.2),  and  should be used with caution in
9174       software intended to be portable to other systems. Note that a non-zero
9175       rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
9176       of the string, not how it is matched.
9177
9178       If the pattern was compiled with the REG_NOSUB flag, no data about  any
9179       matched  strings  is  returned.  The  nmatch  and  pmatch  arguments of
9180       regexec() are ignored.
9181
       If the value of nmatch is zero, or if the value of pmatch is NULL, no
       data about any matched strings is returned.

       Otherwise, the portion of the string that was matched, and also any
       captured substrings, are returned via the pmatch argument, which points
       to an array of nmatch structures of type regmatch_t, containing the
       members rm_so and rm_eo. These contain the offset to the first
       character of each substring and the offset to the first character after
       the end of each substring, respectively. The 0th element of the vector
       relates to the entire portion of string that was matched; subsequent
       elements relate to the capturing subpatterns of the regular expression.
       Unused entries in the array have both structure members set to -1.
9194
9195       A  successful  match  yields  a  zero  return;  various error codes are
9196       defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
9197       failure code.
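
       A minimal sketch of reading the pmatch vector follows. It assumes the
       system <regex.h> (hence REG_EXTENDED and POSIX bracket syntax); the
       same calls are declared by pcreposix.h. The helper match_groups() is
       a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern`, match it against `subject`,
 * and fill pmatch[0..nmatch-1]. Returns 0 on a match, REG_NOMATCH if
 * there is none, or another non-zero error code. */
int match_groups(const char *pattern, const char *subject,
                 regmatch_t *pmatch, size_t nmatch)
{
    regex_t preg;
    int rc;

    assert(pattern != NULL && subject != NULL);

    rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    /* pmatch[0] describes the whole match, pmatch[1..] the captures. */
    rc = regexec(&preg, subject, nmatch, pmatch, 0);
    regfree(&preg);
    return rc;
}
```

       Matching "([a-z]+):([0-9]+)" against "ruby:1234" with nmatch set to 4
       fills element 0 with offsets 0 and 9, element 1 with 0 and 4, element
       2 with 5 and 9, and leaves both members of the unused element 3 at -1.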
9198
9199
9200ERROR MESSAGES
9201
9202       The regerror() function maps a non-zero errorcode from either regcomp()
9203       or regexec() to a printable message. If preg is  not  NULL,  the  error
9204       should have arisen from the use of that structure. A message terminated
9205       by a binary zero is placed  in  errbuf.  The  length  of  the  message,
9206       including  the  zero, is limited to errbuf_size. The yield of the func-
9207       tion is the size of buffer needed to hold the whole message.
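
       Because the yield is the size needed for the whole message, one
       possible way to avoid truncation is to call regerror() twice: once
       with a zero errbuf_size to learn the size, then again with a buffer
       of that size. The helper error_message() below is a hypothetical
       sketch, again written against the system <regex.h>.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc()ed copy of the message for
 * `errcode`, or NULL if allocation fails. The caller frees it. */
char *error_message(int errcode, const regex_t *preg)
{
    size_t needed;
    char *buf;

    assert(errcode != 0);

    /* With errbuf_size 0 nothing is written; the call only reports the
     * size needed, including the terminating zero. */
    needed = regerror(errcode, preg, NULL, 0);

    buf = malloc(needed);
    if (buf != NULL)
        regerror(errcode, preg, buf, needed);
    return buf;
}
```

       A caller would typically pass the non-zero value returned by regcomp()
       or regexec() together with the preg structure that produced it.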
9208
9209
9210MEMORY USAGE
9211
9212       Compiling a regular expression causes memory to be allocated and  asso-
9213       ciated  with  the preg structure. The function regfree() frees all such
9214       memory, after which preg may no longer be used as  a  compiled  expres-
9215       sion.
9216
9217
9218AUTHOR
9219
9220       Philip Hazel
9221       University Computing Service
9222       Cambridge CB2 3QH, England.
9223
9224
9225REVISION
9226
9227       Last updated: 09 January 2012
9228       Copyright (c) 1997-2012 University of Cambridge.
9229------------------------------------------------------------------------------
9230
9231
9232PCRECPP(3)                                                          PCRECPP(3)
9233
9234
9235NAME
9236       PCRE - Perl-compatible regular expressions.
9237
9238
9239SYNOPSIS OF C++ WRAPPER
9240
9241       #include <pcrecpp.h>
9242
9243
9244DESCRIPTION
9245
9246       The  C++  wrapper  for PCRE was provided by Google Inc. Some additional
9247       functionality was added by Giuseppe Maxia. This brief man page was con-
9248       structed  from  the  notes  in the pcrecpp.h file, which should be con-
9249       sulted for further details. Note that the C++ wrapper supports only the
9250       original  8-bit  PCRE  library. There is no 16-bit or 32-bit support at
9251       present.
9252
9253
9254MATCHING INTERFACE
9255
9256       The "FullMatch" operation checks that the supplied  text  matches  the
9257       supplied pattern exactly. If pointer arguments are supplied, it copies
9258       the sub-strings matched by the sub-patterns into them.
9259
9260         Example: successful match
9261            pcrecpp::RE re("h.*o");
9262            re.FullMatch("hello");
9263
9264         Example: unsuccessful match (requires full match):
9265            pcrecpp::RE re("e");
9266            !re.FullMatch("hello");
9267
9268         Example: creating a temporary RE object:
9269            pcrecpp::RE("h.*o").FullMatch("hello");
9270
9271       You can pass in a "const char*" or a "string" for "text". The  examples
9272       below  tend to use a const char*. You can, as in the different examples
9273       above, store the RE object explicitly in a variable or use a  temporary
9274       RE  object.  The  examples below use one mode or the other arbitrarily.
9275       Either could correctly be used for any of these examples.
9276
9277       You must supply extra pointer arguments to extract matched subpieces.
9278
9279         Example: extracts "ruby" into "s" and 1234 into "i"
9280            int i;
9281            string s;
9282            pcrecpp::RE re("(\\w+):(\\d+)");
9283            re.FullMatch("ruby:1234", &s, &i);
9284
9285         Example: does not try to extract any extra sub-patterns
9286            re.FullMatch("ruby:1234", &s);
9287
9288         Example: does not try to extract into NULL
9289            re.FullMatch("ruby:1234", NULL, &i);
9290
9291         Example: integer overflow causes failure
9292            !re.FullMatch("ruby:1234567891234", NULL, &i);
9293
9294         Example: fails because there aren't enough sub-patterns:
9295            !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
9296
9297         Example: fails because string cannot be stored in integer
9298            !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
9299
9300       The provided pointer arguments can be pointers to  any  scalar  numeric
9301       type, or one of:
9302
9303          string        (matched piece is copied to string)
9304          StringPiece   (StringPiece is mutated to point to matched piece)
9305          T             (where "bool T::ParseFrom(const char*, int)" exists)
9306          NULL          (the corresponding matched sub-pattern is not copied)
9307
9308       The  function returns true iff all of the following conditions are sat-
9309       isfied:
9310
9311         a. "text" matches "pattern" exactly;
9312
9313         b. The number of matched sub-patterns is >= number of supplied
9314            pointers;
9315
9316         c. The "i"th argument has a suitable type for holding the
9317            string captured as the "i"th sub-pattern. If you pass in
9318            void * NULL for the "i"th argument, or a non-void * NULL
9319            of the correct type, or pass fewer arguments than the
9320            number of sub-patterns, the "i"th captured sub-pattern
9321            is ignored.
9322
9323       CAVEAT: An optional sub-pattern that does  not  exist  in  the  matched
9324       string  is  assigned  the  empty  string. Therefore, the following will
9325       return false (because the empty string is not a valid number):
9326
9327          int number;
9328          pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
9329
9330       The matching interface supports at most 16 arguments per call.  If  you
9331       need    more,    consider    using    the    more   general   interface
9332       pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
9333
9334       NOTE: Do not use no_arg, which is used internally to mark the end of  a
9335       list  of optional arguments, as a placeholder for missing arguments, as
9336       this can lead to segfaults.
9337
9338
9339QUOTING METACHARACTERS
9340
9341       You can use the "QuoteMeta" operation to insert backslashes before  all
9342       potentially  meaningful  characters  in  a string. The returned string,
9343       used as a regular expression, will exactly match the original string.
9344
9345         Example:
9346            string quoted = RE::QuoteMeta(unquoted);
9347
9348       Note that it's legal to escape a character even if it  has  no  special
9349       meaning  in  a  regular expression -- so this function does that. (This
9350       also makes it identical to the perl function  of  the  same  name;  see
9351       "perldoc    -f    quotemeta".)    For   example,   "1.5-2.0?"   becomes
9352       "1\.5\-2\.0\?".
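
       The quoting rule can be sketched in plain C as follows. This is not
       the pcrecpp implementation: quote_meta() is a hypothetical function,
       and leaving bytes with the high bit set unescaped is an assumption
       made here so that UTF-8 sequences stay intact.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Sketch only: copy `in` to `out`, putting a backslash before every
 * ASCII byte that is not a letter, a digit, or an underscore. Bytes
 * with the high bit set are copied unchanged (an assumption).
 * `out` must have room for 2 * strlen(in) + 1 bytes.
 * Returns the length of the quoted string. */
size_t quote_meta(const char *in, char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;

        if (c < 128 && !isalnum(c) && c != '_')
            out[n++] = '\\';
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

       Applied to "1.5-2.0?" this writes the 12-character string shown
       above, with a backslash before each of the four punctuation marks.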
9353
9354
9355PARTIAL MATCHES
9356
9357       You can use the "PartialMatch" operation when you want the  pattern  to
9358       match any substring of the text.
9359
9360         Example: simple search for a string:
9361            pcrecpp::RE("ell").PartialMatch("hello");
9362
9363         Example: find first number in a string:
9364            int number;
9365            pcrecpp::RE re("(\\d+)");
9366            re.PartialMatch("x*100 + 20", &number);
9367            assert(number == 100);
9368
9369
9370UTF-8 AND THE MATCHING INTERFACE
9371
9372       By  default,  pattern  and text are plain text, one byte per character.
9373       The UTF8 flag, passed to  the  constructor,  causes  both  pattern  and
9374       string to be treated as UTF-8 text, still a byte stream but potentially
9375       multiple bytes per character. In practice, the text is likelier  to  be
9376       UTF-8  than  the pattern, but the match returned may depend on the UTF8
9377       flag, so always use it when matching UTF8 text. For example,  "."  will
9378       match  one  byte normally but with UTF8 set may match up to four  bytes
9379       of a multi-byte character.
9380
9381         Example:
9382            pcrecpp::RE_Options options;
9383            options.set_utf8();
9384            pcrecpp::RE re(utf8_pattern, options);
9385            re.FullMatch(utf8_string);
9386
9387         Example: using the convenience function UTF8():
9388            pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
9389            re.FullMatch(utf8_string);
9390
9391       NOTE: The UTF8 flag is ignored if pcre was not configured with the
9392             --enable-utf8 flag.
9393
9394
9395PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE
9396
9397       PCRE defines some modifiers to  change  the  behavior  of  the  regular
9398       expression   engine.  The  C++  wrapper  defines  an  auxiliary  class,
9399       RE_Options, as a vehicle to pass such modifiers to  a  RE  class.  Cur-
9400       rently, the following modifiers are supported:
9401
9402          modifier              description               Perl equivalent
9403
9404          PCRE_CASELESS         case insensitive match      /i
9405          PCRE_MULTILINE        multiple lines match        /m
9406          PCRE_DOTALL           dot matches newlines        /s
9407          PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
9408          PCRE_EXTRA            strict escape parsing       N/A
9409          PCRE_EXTENDED         ignore white spaces         /x
9410          PCRE_UTF8             handles UTF8 chars          built-in
9411          PCRE_UNGREEDY         reverses * and *?           N/A
9412          PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
9413
9414       (*) Both Perl and PCRE allow non-capturing parentheses by means of  the
9415       "?:" modifier within the pattern itself. For example, (?:ab|cd) does not
9416       capture, while (ab|cd) does.
9417
9418       For  a  full  account of how each modifier works, please check the PCRE
9419       API reference page.
9420
9421       For each modifier, there are two member functions whose  name  is  made
9422       out  of  the  modifier  in  lowercase,  without the "PCRE_" prefix. For
9423       instance, PCRE_CASELESS is handled by
9424
9425         bool caseless()
9426
9427       which returns true if the modifier is set, and
9428
9429         RE_Options & set_caseless(bool)
9430
9431       which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
9432       be  accessed  through  the  set_match_limit()  and match_limit() member
9433       functions. Setting match_limit to a non-zero value will limit the  exe-
9434       cution  of pcre to keep it from doing bad things like blowing the stack
9435       or taking an eternity to return a result.  A  value  of  5000  is  good
9436       enough  to stop stack blowup in a 2MB thread stack. Setting match_limit
9437       to  zero  disables  match  limiting.  Alternatively,   you   can   call
9438       match_limit_recursion()  which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to
9439       limit how much PCRE recurses. match_limit() limits the number of  times
9440       PCRE calls its internal match() function; match_limit_recursion() limits
9441       the depth of internal recursion, and therefore the amount of stack used.
9442
9443       Normally, to pass one or more modifiers to a RE class,  you  declare  a
9444       RE_Options object, set the appropriate options, and pass this object to
9445       a RE constructor. Example:
9446
9447          RE_Options opt;
9448          opt.set_caseless(true);
9449          if (RE("HELLO", opt).PartialMatch("hello world")) ...
9450
9451       RE_Options has two constructors. The default constructor takes no argu-
9452       ments  and  creates a set of flags that are off by default. The other
9453       takes an option_flags argument, to facilitate transfer of legacy  code
9454       from C programs. This lets you do
9455
9456          RE(pattern,
9457            RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
9458
9459       However, new code is better off doing
9460
9461          RE(pattern,
9462            RE_Options().set_caseless(true).set_multiline(true))
9463              .PartialMatch(str);
9464
9465       If you are going to pass one of the most used modifiers, there are some
9466       convenience functions that return a RE_Options class with the appropri-
9467       ate  modifier  already  set: CASELESS(), UTF8(), MULTILINE(), DOTALL(),
9468       and EXTENDED().
9469
9470       If you need to set several options at once, and you don't  want  to  go
9471       through  the pains of declaring a RE_Options object and setting several
9472       options, there is a parallel method that gives you such ability on  the
9473       fly.  You  can  concatenate several set_xxxxx() member functions, since
9474       each of them returns a reference to its class object. For  example,  to
9475       pass  PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
9476       statement, you may write:
9477
9478          RE(" ^ xyz \\s+ .* blah$",
9479            RE_Options()
9480              .set_caseless(true)
9481              .set_extended(true)
9482              .set_multiline(true)).PartialMatch(sometext);
9483
9484
9485SCANNING TEXT INCREMENTALLY
9486
9487       The "Consume" operation may be useful if you want to  repeatedly  match
9488       regular expressions at the front of a string and skip over them as they
9489       match. This requires use of the "StringPiece" type, which represents  a
9490       sub-range  of  a  real  string.  Like RE, StringPiece is defined in the
9491       pcrecpp namespace.
9492
9493         Example: read lines of the form "var = value" from a string.
9494            string contents = ...;                 // Fill string somehow
9495            pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
9496
9497            string var;
9498            int value;
9499            pcrecpp::RE re("(\\w+) = (\\d+)\n");
9500            while (re.Consume(&input, &var, &value)) {
9501              ...;
9502            }
9503
9504       Each successful call to "Consume" will set "var" and "value", and  also
9505       advance "input" so it points past the matched text.
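
       The same consume-and-advance idea can be sketched in plain C with the
       POSIX-style functions. This is an analogy rather than the wrapper's
       own code: consume_assignments() is a hypothetical helper, it uses the
       system <regex.h> with bracket expressions in place of \w and \d, and
       a leading ^ stands in for the anchoring that "Consume" performs.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: consume "name = digits\n" records from the
 * front of *input, adding each value to *sum and advancing *input past
 * the matched text. Returns the number of records consumed, or -1 if
 * the pattern fails to compile. */
int consume_assignments(const char **input, long *sum)
{
    regex_t preg;
    regmatch_t m[3];
    int count = 0;

    assert(input != NULL && *input != NULL && sum != NULL);

    /* The leading ^ anchors each match at the current front of input. */
    if (regcomp(&preg, "^([A-Za-z0-9_]+) = ([0-9]+)\n", REG_EXTENDED) != 0)
        return -1;

    *sum = 0;
    while (regexec(&preg, *input, 3, m, 0) == 0) {
        /* strtol() stops at the newline that follows capture group 2. */
        *sum += strtol(*input + m[2].rm_so, NULL, 10);
        *input += m[0].rm_eo;   /* skip over the consumed record */
        count++;
    }
    regfree(&preg);
    return count;
}
```

       Scanning stops at the first position where the pattern no longer
       matches at the front, and the input pointer is left at that position.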
9506
9507       The  "FindAndConsume"  operation  is  similar to "Consume" but does not
9508       anchor your match at the beginning of  the  string.  For  example,  you
9509       could extract all words from a string by repeatedly calling
9510
9511         pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
9512
9513
9514PARSING HEX/OCTAL/C-RADIX NUMBERS
9515
9516       By default, if you pass a pointer to a numeric value, the corresponding
9517       text is interpreted as a base-10  number.  You  can  instead  wrap  the
9518       pointer with a call to one of the operators Hex(), Octal(), or CRadix()
9519       to interpret the text in another base. The CRadix  operator  interprets
9520       C-style  "0"  (base-8)  and  "0x"  (base-16)  prefixes, but defaults to
9521       base-10.
9522
9523         Example:
9524           int a, b, c, d;
9525           pcrecpp::RE re("(.*) (.*) (.*) (.*)");
9526           re.FullMatch("100 40 0100 0x40",
9527                        pcrecpp::Octal(&a), pcrecpp::Hex(&b),
9528                        pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
9529
9530       will leave 64 in a, b, c, and d.
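
       The base rules can be tried out in plain C with strtol(), whose base
       argument of 0 applies the same C-style prefix detection. This is an
       analogy only, not a statement about how the wrapper is implemented;
       parse_radix() is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse all of `text` as an integer in `base`
 * (8, 10, 16, or 0 for C-style prefix detection). Returns 1 and
 * stores the value in *out on success, 0 otherwise. */
int parse_radix(const char *text, int base, long *out)
{
    char *end;
    long value;

    assert(text != NULL && out != NULL);

    errno = 0;
    value = strtol(text, &end, base);
    if (end == text || *end != '\0' || errno == ERANGE)
        return 0;               /* empty, trailing junk, or overflow */
    *out = value;
    return 1;
}
```

       With this helper, "100" in base 8, "40" in base 16, and both "0100"
       and "0x40" in base 0 all yield 64, while an empty string is rejected.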
9531
9532
9533REPLACING PARTS OF STRINGS
9534
9535       You can replace the first match of "pattern" in "str"  with  "rewrite".
9536       Within  "rewrite",  backslash-escaped  digits (\1 to \9) can be used to
9537       insert text matching corresponding parenthesized group  from  the  pat-
9538       tern. \0 in "rewrite" refers to the entire matching text. For example:
9539
9540         string s = "yabba dabba doo";
9541         pcrecpp::RE("b+").Replace("d", &s);
9542
9543       will  leave  "s" containing "yada dabba doo". The result is true if the
9544       pattern matches and a replacement occurs, false otherwise.
9545
9546       GlobalReplace is like Replace except that it replaces  all  occurrences
9547       of  the  pattern  in  the string with the rewrite. Replacements are not
9548       subject to re-matching. For example:
9549
9550         string s = "yabba dabba doo";
9551         pcrecpp::RE("b+").GlobalReplace("d", &s);
9552
9553       will leave "s" containing "yada dada doo". It  returns  the  number  of
9554       replacements made.
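
       To show what "not subject to re-matching" means, here is a simplified
       sketch of a global replace in plain C. It is not the pcrecpp code:
       global_replace() is a hypothetical function built on the system
       <regex.h>, it inserts the rewrite text literally (no \1 to \9
       substitutions), and it stops at an empty match instead of handling it.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Sketch only: replace every non-overlapping match of `pattern` in `in`
 * with the literal text `rewrite`, writing the result to `out`.
 * Returns the number of replacements, or -1 on a compile error or if
 * `out` (of size `outsize`) is too small. */
int global_replace(const char *pattern, const char *rewrite,
                   const char *in, char *out, size_t outsize)
{
    regex_t preg;
    regmatch_t m[1];
    size_t used = 0, rlen, tail;
    int count = 0, eflags = 0;

    assert(pattern != NULL && rewrite != NULL && in != NULL && out != NULL);

    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;
    rlen = strlen(rewrite);

    while (regexec(&preg, in, 1, m, eflags) == 0) {
        size_t before = (size_t)m[0].rm_so;   /* text ahead of the match */

        if (m[0].rm_eo == m[0].rm_so)
            break;              /* sketch: empty matches are not handled */
        if (used + before + rlen >= outsize) {
            regfree(&preg);
            return -1;
        }
        memcpy(out + used, in, before);
        used += before;
        memcpy(out + used, rewrite, rlen);
        used += rlen;

        /* Resume after the match: replaced text is never rescanned. */
        in += m[0].rm_eo;
        eflags = REG_NOTBOL;    /* later chunks do not start the subject */
        count++;
    }
    regfree(&preg);

    tail = strlen(in);
    if (used + tail >= outsize)
        return -1;
    memcpy(out + used, in, tail + 1);   /* copy the rest and the zero */
    return count;
}
```

       With the pattern "b+", the rewrite "d", and the input "yabba dabba
       doo", this sketch produces "yada dada doo" and returns 2.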
9555
9556       Extract  is like Replace, except that if the pattern matches, "rewrite"
9557       is copied into "out" (an additional argument) with substitutions.   The
9558       non-matching  portions  of "text" are ignored. Returns true iff a match
9559       occurred and the extraction happened successfully;  if no match occurs,
9560       the string is left unaffected.
9561
9562
9563AUTHOR
9564
9565       The C++ wrapper was contributed by Google Inc.
9566       Copyright (c) 2007 Google Inc.
9567
9568
9569REVISION
9570
9571       Last updated: 08 January 2012
9572------------------------------------------------------------------------------
9573
9574
9575PCRESAMPLE(3)                                                    PCRESAMPLE(3)
9576
9577
9578NAME
9579       PCRE - Perl-compatible regular expressions
9580
9581
9582PCRE SAMPLE PROGRAM
9583
9584       A simple, complete demonstration program, to get you started with using
9585       PCRE, is supplied in the file pcredemo.c in the  PCRE  distribution.  A
9586       listing  of this program is given in the pcredemo documentation. If you
9587       do not have a copy of the PCRE distribution, you can save this  listing
9588       to re-create pcredemo.c.
9589
9590       The  demonstration program, which uses the original PCRE 8-bit library,
9591       compiles the regular expression that is its first argument, and matches
9592       it  against  the subject string in its second argument. No PCRE options
9593       are set, and default character tables are used. If  matching  succeeds,
9594       the  program  outputs the portion of the subject that matched, together
9595       with the contents of any captured substrings.
9596
9597       If the -g option is given on the command line, the program then goes on
9598       to check for further matches of the same regular expression in the same
9599       subject string. The logic is a little bit tricky because of the  possi-
9600       bility  of  matching an empty string. Comments in the code explain what
9601       is going on.
9602
9603       If PCRE is installed in the standard include  and  library  directories
9604       for your operating system, you should be able to compile the demonstra-
9605       tion program using this command:
9606
9607         gcc -o pcredemo pcredemo.c -lpcre
9608
9609       If PCRE is installed elsewhere, you may need to add additional  options
9610       to  the  command line. For example, on a Unix-like system that has PCRE
9611       installed in /usr/local, you  can  compile  the  demonstration  program
9612       using a command like this:
9613
9614         gcc -o pcredemo -I/usr/local/include pcredemo.c \
9615             -L/usr/local/lib -lpcre
9616
9617       In  a  Windows  environment, if you want to statically link the program
9618       against a non-dll pcre.a file, you must uncomment the line that defines
9619       PCRE_STATIC  before  including  pcre.h, because otherwise the pcre_mal-
9620       loc()   and   pcre_free()   exported   functions   will   be   declared
9621       __declspec(dllimport), with unwanted results.
9622
9623       Once  you  have  compiled and linked the demonstration program, you can
9624       run simple tests like this:
9625
9626         ./pcredemo 'cat|dog' 'the cat sat on the mat'
9627         ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
9628
9629       Note that there is a  much  more  comprehensive  test  program,  called
9630       pcretest,  which  supports  many  more  facilities  for testing regular
9631       expressions and all the PCRE libraries. The pcredemo program is provided
9632       as a simple coding example.
9633
9634       If  you  try to run pcredemo when PCRE is not installed in the standard
9635       library directory, you may get an error like  this  on  some  operating
9636       systems (e.g. Solaris):
9637
9638         ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
9639           directory
9640
9641       This is caused by the way shared library support works  on  those  sys-
9642       tems. You need to add
9643
9644         -R/usr/local/lib
9645
9646       (for example) to the compile command to get round this problem.
9647
9648
9649AUTHOR
9650
9651       Philip Hazel
9652       University Computing Service
9653       Cambridge CB2 3QH, England.
9654
9655
9656REVISION
9657
9658       Last updated: 10 January 2012
9659       Copyright (c) 1997-2012 University of Cambridge.
9660------------------------------------------------------------------------------
9661PCRELIMITS(3)                                                    PCRELIMITS(3)
9662
9663
9664NAME
9665       PCRE - Perl-compatible regular expressions
9666
9667
9668SIZE AND OTHER LIMITATIONS
9669
9670       There  are some size limitations in PCRE but it is hoped that they will
9671       never in practice be relevant.
9672
9673       The maximum length of a compiled  pattern  is  approximately  64K  data
9674       units  (bytes  for  the  8-bit  library,  16-bit  units  for the 16-bit
9675       library, and 32-bit units for the 32-bit library) if PCRE  is  compiled
9676       with  the  default  internal  linkage  size  of 2 bytes. If you want to
9677       process regular expressions that are truly enormous,  you  can  compile
9678       PCRE  with an internal linkage size of 3 or 4 (when building the 16-bit
9679       or 32-bit library, 3 is rounded up to 4). See the README  file  in  the
9680       source  distribution  and  the  pcrebuild documentation for details. In
9681       these cases the limit is substantially larger.  However, the  speed  of
9682       execution is slower.
9683
9684       All values in repeating quantifiers must be less than 65536.
9685
9686       There is no limit to the number of parenthesized subpatterns, but there
9687       can be no more than 65535 capturing subpatterns.
9688
9689       There is a limit to the number of forward references to subsequent sub-
9690       patterns  of  around  200,000.  Repeated  forward references with fixed
9691       upper limits, for example, (?2){0,100} when subpattern number 2  is  to
9692       the  right,  are included in the count. There is no limit to the number
9693       of backward references.
9694
9695       The maximum length of name for a named subpattern is 32 characters, and
9696       the maximum number of named subpatterns is 10000.
9697
9698       The  maximum  length  of  a  name  in  a (*MARK), (*PRUNE), (*SKIP), or
9699       (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit  and
9700       32-bit libraries.
9701
9702       The  maximum  length of a subject string is the largest positive number
9703       that an integer variable can hold. However, when using the  traditional
9704       matching function, PCRE uses recursion to handle subpatterns and indef-
9705       inite repetition.  This means that the available stack space may  limit
9706       the size of a subject string that can be processed by certain patterns.
9707       For a discussion of stack issues, see the pcrestack documentation.
9708
9709
9710AUTHOR
9711
9712       Philip Hazel
9713       University Computing Service
9714       Cambridge CB2 3QH, England.
9715
9716
9717REVISION
9718
9719       Last updated: 04 May 2012
9720       Copyright (c) 1997-2012 University of Cambridge.
9721------------------------------------------------------------------------------
9722
9723
9724PCRESTACK(3)                                                      PCRESTACK(3)
9725
9726
9727NAME
9728       PCRE - Perl-compatible regular expressions
9729
9730
9731PCRE DISCUSSION OF STACK USAGE
9732
9733       When  you call pcre[16|32]_exec(), it makes use of an internal function
9734       called match(). This calls itself recursively at branch points  in  the
9735       pattern,  in  order  to  remember the state of the match so that it can
9736       back up and try a different alternative if  the  first  one  fails.  As
9737       matching proceeds deeper and deeper into the tree of possibilities, the
9738       recursion depth increases. The match() function is also called in other
9739       circumstances,  for  example,  whenever  a parenthesized sub-pattern is
9740       entered, and in certain cases of repetition.
9741
9742       Not all calls of match() increase the recursion depth; for an item such
9743       as  a* it may be called several times at the same level, after matching
9744       different numbers of a's. Furthermore, in a number of cases  where  the
9745       result  of  the  recursive call would immediately be passed back as the
9746       result of the current call (a "tail recursion"), the function  is  just
9747       restarted instead.
9748
9749       The  above  comments apply when pcre[16|32]_exec() is run in its normal
9750       interpretive  manner.   If   the   pattern   was   studied   with   the
9751       PCRE_STUDY_JIT_COMPILE  option, and just-in-time compiling was success-
9752       ful, and the options passed to pcre[16|32]_exec() were  not  incompati-
9753       ble,  the  matching  process  uses the JIT-compiled code instead of the
9754       match() function. In this case, the  memory  requirements  are  handled
9755       entirely differently. See the pcrejit documentation for details.
9756
9757       The  pcre[16|32]_dfa_exec()  function operates in an entirely different
9758       way, and uses recursion only when there is a regular expression  recur-
9759       sion or subroutine call in the pattern. This includes the processing of
9760       assertion and "once-only" subpatterns, which are handled  like  subrou-
9761       tine  calls.  Normally, these are never very deep, and the limit on the
9762       complexity of pcre[16|32]_dfa_exec() is controlled  by  the  amount  of
9763       workspace  it is given.  However, it is possible to write patterns with
9764       runaway    infinite    recursions;    such    patterns    will    cause
9765       pcre[16|32]_dfa_exec()  to  run  out  of stack. At present, there is no
9766       protection against this.
9767
9768       The comments that follow do NOT apply to  pcre[16|32]_dfa_exec();  they
9769       are relevant only for pcre[16|32]_exec() without the JIT optimization.
9770
9771   Reducing pcre[16|32]_exec()'s stack usage
9772
9773       Each  time  that match() is actually called recursively, it uses memory
9774       from the process stack. For certain kinds of  pattern  and  data,  very
9775       large  amounts of stack may be needed, despite the recognition of "tail
9776       recursion".  You can often reduce the amount of recursion,  and  there-
9777       fore  the  amount of stack used, by modifying the pattern that is being
9778       matched. Consider, for example, this pattern:
9779
9780         ([^<]|<(?!inet))+
9781
9782       It matches from wherever it starts until it encounters "<inet"  or  the
9783       end  of  the  data,  and is the kind of pattern that might be used when
9784       processing an XML file. Each iteration of the outer parentheses matches
9785       either  one  character that is not "<" or a "<" that is not followed by
9786       "inet". However, each time a  parenthesis  is  processed,  a  recursion
9787       occurs, so this formulation uses a stack frame for each matched charac-
9788       ter. For a long string, a lot of stack is required. Consider  now  this
9789       rewritten pattern, which matches exactly the same strings:
9790
9791         ([^<]++|<(?!inet))+
9792
9793       This  uses very much less stack, because runs of characters that do not
9794       contain "<" are "swallowed" in one item inside the parentheses.  Recur-
9795       sion  happens  only when a "<" character that is not followed by "inet"
9796       is encountered (and we assume this is relatively  rare).  A  possessive
9797       quantifier  is  used  to stop any backtracking into the runs of non-"<"
9798       characters, but that is not related to stack usage.
9799
9800       This example shows that one way of avoiding stack problems when  match-
9801       ing long subject strings is to write repeated parenthesized subpatterns
9802       to match more than one character whenever possible.
9803
9804   Compiling PCRE to use heap instead of stack for pcre[16|32]_exec()
9805
9806       In environments where stack memory is constrained, you  might  want  to
9807       compile  PCRE to use heap memory instead of stack for remembering back-
9808       up points when pcre[16|32]_exec() is running. This makes it run  a  lot
9809       more slowly, however.  Details of how to do this are given in the pcre-
9810       build documentation. When built in  this  way,  instead  of  using  the
9811       stack,  PCRE obtains and frees memory by calling the functions that are
9812       pointed to by the pcre[16|32]_stack_malloc  and  pcre[16|32]_stack_free
9813       variables.  By default, these point to malloc() and free(), but you can
9814       replace the pointers to cause PCRE to use your own functions. Since the
9815       block sizes are always the same, and are always freed in reverse order,
9816       it may be possible to implement customized  memory  handlers  that  are
9817       more efficient than the standard functions.
9818
9819   Limiting pcre[16|32]_exec()'s stack usage
9820
9821       You  can set limits on the number of times that match() is called, both
9822       in total and recursively. If a limit  is  exceeded,  pcre[16|32]_exec()
9823       returns  an  error code. Setting suitable limits should prevent it from
9824       running out of stack. The default values of the limits are very  large,
9825       and  unlikely  ever to operate. They can be changed when PCRE is built,
9826       and they can also be set when pcre[16|32]_exec() is called. For details
9827       of these interfaces, see the pcrebuild documentation and the section on
9828       extra data for pcre[16|32]_exec() in the pcreapi documentation.
9829
9830       As a very rough rule of thumb, you should reckon on about 500 bytes per
9831       recursion.  Thus,  if  you  want  to limit your stack usage to 8Mb, you
9832       should set the limit at 16000 recursions. A 64Mb stack,  on  the  other
9833       hand, can support around 128000 recursions.
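
       The arithmetic behind this rule of thumb is a simple division, as in
       the sketch below. The function name recursion_limit() is hypothetical,
       and the frame size is a parameter because the real figure depends on
       how PCRE was built.

```c
#include <assert.h>

/* Hypothetical helper: how many match() recursions fit in `stack_bytes`
 * if each one needs about `frame_bytes` of stack. */
unsigned long recursion_limit(unsigned long stack_bytes,
                              unsigned long frame_bytes)
{
    assert(frame_bytes > 0);
    return stack_bytes / frame_bytes;
}
```

       With 500 bytes per recursion, a stack of 8*1024*1024 bytes gives
       16777 and one of 64*1024*1024 bytes gives 134217, which the figures
       above round down. A frame size of 640 bytes would reduce the first
       value to 13107. The result is a candidate for the recursion limit
       passed in the extra data for pcre[16|32]_exec(); leaving some margin
       for the rest of the program's stack use is advisable.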
9834
9835       In Unix-like environments, the pcretest test program has a command line
9836       option (-S) that can be used to increase the size of its stack. As long
9837       as  the  stack is large enough, another option (-M) can be used to find
9838       the smallest limits that allow a particular pattern to  match  a  given
9839       subject  string.  This is done by calling pcre[16|32]_exec() repeatedly
9840       with different limits.

   Obtaining an estimate of stack usage

       The actual amount of stack used per recursion can vary quite a lot,
       depending on the compiler that was used to build PCRE and the
       optimization or debugging options that were set for it. The rule of
       thumb value of 500 bytes mentioned above may be larger or smaller than
       what is actually needed. A better approximation can be obtained by
       running this command:

         pcretest -m -C

       The -C option causes pcretest to output information about the options
       with which PCRE was compiled. When -m is also given (before -C),
       information about stack use is given in a line like this:

         Match recursion uses stack: approximate frame size = 640 bytes

       The value is approximate because some recursions need a bit more (up to
       perhaps 16 more bytes).

       If the above command is given when PCRE is compiled to use the heap
       instead of the stack for recursion, the value that is output is the
       size of each block that is obtained from the heap.
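
       If the reported frame size is to be used by a program, for example in
       a build or test script, it can be extracted from the line shown above.
       The sketch below assumes the line has exactly the "frame size = N
       bytes" wording of that sample; the function name is illustrative and
       not part of PCRE.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the frame size from a line of "pcretest -m -C" output such as
 *   Match recursion uses stack: approximate frame size = 640 bytes
 * Returns the size in bytes, or 0 if the line does not contain it.
 */
unsigned long parse_frame_size(const char *line)
{
    static const char key[] = "frame size = ";
    const char *p = strstr(line, key);
    unsigned long size = 0;

    /* Skip past the key text and read the decimal number after it. */
    if (p == NULL || sscanf(p + sizeof(key) - 1, "%lu", &size) != 1)
        return 0;
    return size;
}
```

       Allowing for the extra 16 bytes that some recursions may need, a
       640-byte frame and a budget of 8000000 bytes give 8000000 / 656, or
       roughly 12000 recursions, noticeably fewer than the rule of thumb
       suggests.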

   Changing stack size in Unix-like systems

       In Unix-like environments, the stack is rarely a problem unless very
       long strings are involved, though the default limit on stack size
       varies from system to system. Values from 8MB to 64MB are common. You
       can find your default limit by running the command:

         ulimit -s

       Unfortunately, the effect of running out of stack is often SIGSEGV,
       though sometimes a more explicit error message is given. You can
       normally increase the limit on stack size by code such as this:

         #include <sys/resource.h>

         struct rlimit rlim;
         getrlimit(RLIMIT_STACK, &rlim);   /* read soft and hard limits */
         rlim.rlim_cur = 100*1024*1024;    /* request a 100MB soft limit */
         setrlimit(RLIMIT_STACK, &rlim);   /* fails if above the hard limit */

       This reads the current limits (soft and hard) using getrlimit(), then
       attempts to increase the soft limit to 100MB using setrlimit(). You
       must do this before calling pcre[16|32]_exec().
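
       The fragment above ignores the return values of getrlimit() and
       setrlimit(). A slightly more defensive sketch follows. It never lowers
       the soft limit and clamps the request to the hard limit; the function
       name ensure_stack_limit is illustrative only, and whether clamping is
       the right policy depends on the application.

```c
#include <sys/resource.h>

/*
 * Ensure the soft stack limit is at least wanted_bytes, without ever
 * lowering it. Returns 0 if the limit was left alone or raised (possibly
 * only as far as the hard limit), and -1 if getrlimit() or setrlimit()
 * failed, in which case errno is set by the failing call.
 */
int ensure_stack_limit(rlim_t wanted_bytes)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;

    /* Nothing to do if the soft limit is unlimited or already enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted_bytes)
        return 0;

    /* An unprivileged process cannot exceed the hard limit, so clamp. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted_bytes > rlim.rlim_max)
        wanted_bytes = rlim.rlim_max;

    rlim.rlim_cur = wanted_bytes;
    return setrlimit(RLIMIT_STACK, &rlim);
}
```

       For the 100MB request used above, the call would be
       ensure_stack_limit(100*1024*1024), made before the first call to
       pcre[16|32]_exec().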

   Changing stack size in Mac OS X

       Using setrlimit(), as described above, should also work on Mac OS X. It
       is also possible to set a stack size when linking a program. There is a
       discussion about stack sizes in Mac OS X at this web site:
       http://developer.apple.com/qa/qa2005/qa1419.html.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 24 June 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------

