1-----------------------------------------------------------------------------
2This file contains a concatenation of the PCRE man pages, converted to plain
3text format for ease of searching with a text editor, or for use on systems
4that do not have a man page processor. The small individual files that give
5synopses of each function in the library have not been included. Neither has
6the pcredemo program. There are separate text files for the pcregrep and
7pcretest commands.
8-----------------------------------------------------------------------------
9
10
11PCRE(3)                    Library Functions Manual                    PCRE(3)
12
13
14
15NAME
16       PCRE - Perl-compatible regular expressions
17
18INTRODUCTION
19
20       The  PCRE  library is a set of functions that implement regular expres-
21       sion pattern matching using the same syntax and semantics as Perl, with
22       just  a few differences. Some features that appeared in Python and PCRE
23       before they appeared in Perl are also available using the  Python  syn-
24       tax,  there  is  some  support for one or two .NET and Oniguruma syntax
25       items, and there is an option for requesting some  minor  changes  that
26       give better JavaScript compatibility.
27
28       Starting with release 8.30, it is possible to compile two separate PCRE
29       libraries:  the  original,  which  supports  8-bit  character   strings
30       (including  UTF-8  strings),  and a second library that supports 16-bit
31       character strings (including UTF-16 strings). The build process  allows
32       either  one  or both to be built. The majority of the work to make this
33       possible was done by Zoltan Herczeg.
34
35       Starting with release 8.32 it is possible to compile a  third  separate
36       PCRE  library  that supports 32-bit character strings (including UTF-32
37       strings). The build process allows any combination of the 8-,  16-  and
38       32-bit  libraries. The work to make this possible was done by Christian
39       Persch.
40
41       The three libraries contain identical sets of  functions,  except  that
42       the  names  in  the 16-bit library start with pcre16_ instead of pcre_,
43       and the names in the 32-bit  library  start  with  pcre32_  instead  of
44       pcre_.  To avoid over-complication and reduce the documentation mainte-
45       nance load, most of the documentation describes the 8-bit library, with
46       the  differences  for  the  16-bit and 32-bit libraries described sepa-
47       rately in the pcre16 and  pcre32  pages.  References  to  functions  or
48       structures  of  the  form  pcre[16|32]_xxx  should  be  read as meaning
49       "pcre_xxx when using the  8-bit  library,  pcre16_xxx  when  using  the
50       16-bit library, or pcre32_xxx when using the 32-bit library".
51
52       The  current implementation of PCRE corresponds approximately with Perl
53       5.12, including support for UTF-8/16/32  encoded  strings  and  Unicode
54       general  category  properties. However, UTF-8/16/32 and Unicode support
55       has to be explicitly enabled; it is not the default. The Unicode tables
56       correspond to Unicode release 6.3.0.
57
58       In  addition to the Perl-compatible matching function, PCRE contains an
59       alternative function that matches the same compiled patterns in a  dif-
60       ferent way. In certain circumstances, the alternative function has some
61       advantages.  For a discussion of the two matching algorithms,  see  the
62       pcrematching page.
63
64       PCRE  is  written  in C and released as a C library. A number of people
65       have written wrappers and interfaces of various kinds.  In  particular,
66       Google  Inc.   have  provided a comprehensive C++ wrapper for the 8-bit
67       library. This is now included as part of  the  PCRE  distribution.  The
68       pcrecpp  page  has  details of this interface. Other people's contribu-
69       tions can be found in the Contrib directory at the  primary  FTP  site,
70       which is:
71
72       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
73
74       Details  of  exactly which Perl regular expression features are and are
75       not supported by PCRE are given in separate documents. See the pcrepat-
76       tern  and pcrecompat pages. There is a syntax summary in the pcresyntax
77       page.
78
79       Some features of PCRE can be included, excluded, or  changed  when  the
80       library  is  built.  The pcre_config() function makes it possible for a
81       client to discover which features are  available.  The  features  them-
82       selves  are described in the pcrebuild page. Documentation about build-
83       ing PCRE for various operating systems can be found in the  README  and
84       NON-AUTOTOOLS_BUILD files in the source distribution.
85
       The libraries contain a number of undocumented internal functions and
87       data tables that are used by more than one  of  the  exported  external
88       functions,  but  which  are  not  intended for use by external callers.
89       Their names all begin with "_pcre_" or "_pcre16_" or "_pcre32_",  which
90       hopefully  will  not provoke any name clashes. In some environments, it
91       is possible to control which  external  symbols  are  exported  when  a
92       shared  library  is  built, and in these cases the undocumented symbols
93       are not exported.
94
95
96SECURITY CONSIDERATIONS
97
98       If you are using PCRE in a non-UTF application that  permits  users  to
99       supply  arbitrary  patterns  for  compilation, you should be aware of a
100       feature that allows users to turn on UTF support from within a pattern,
101       provided  that  PCRE  was built with UTF support. For example, an 8-bit
102       pattern that begins with "(*UTF8)" or "(*UTF)"  turns  on  UTF-8  mode,
103       which  interprets  patterns and subjects as strings of UTF-8 characters
104       instead of individual 8-bit characters.  This causes both  the  pattern
       and any data against which it is matched to be checked for UTF-8
       validity. If the data string is very long, such a check might use
       enough resources to degrade your application's performance.
109
110       One  way  of  guarding  against  this  possibility  is   to   use   the
111       pcre_fullinfo()  function  to  check the compiled pattern's options for
112       UTF.  Alternatively, from release 8.33, you can set the  PCRE_NEVER_UTF
       option at compile time. This causes a compile-time error if a pattern
114       contains a UTF-setting sequence.
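
       As a rough illustration of the risk described above, an application
       could also screen patterns before compiling them. The sketch below is
       not part of PCRE: pattern_requests_utf() is a hypothetical helper that
       walks the leading "(*NAME)" items of an 8-bit pattern and reports
       whether one of them is "(*UTF8)", "(*UTF16)", "(*UTF32)" or "(*UTF)".
       It assumes that these four names are the only UTF-setting items of
       interest, and it is deliberately conservative. Checking the compiled
       pattern's options with pcre_fullinfo() remains the method described in
       this section.

```c
#include <string.h>

/* Hypothetical helper, not a PCRE function. Returns 1 if the pattern starts
   with a run of "(*NAME)" items that includes a UTF-setting item, else 0. */
static int pattern_requests_utf(const char *pattern)
{
    /* Assumed list of UTF-setting item names, each with its closing ')'. */
    static const char *const utf_items[] = { "UTF8)", "UTF16)", "UTF32)", "UTF)" };
    const char *p = pattern;

    while (p[0] == '(' && p[1] == '*') {
        const char *name = p + 2;
        const char *close = strchr(name, ')');
        size_t i;

        if (close == NULL)
            return 0;                 /* unterminated item: leave it to PCRE */

        for (i = 0; i < sizeof utf_items / sizeof utf_items[0]; i++) {
            size_t n = strlen(utf_items[i]);
            if ((size_t)(close - name + 1) == n &&
                strncmp(name, utf_items[i], n) == 0)
                return 1;
        }
        p = close + 1;                /* continue with the next leading item */
    }
    return 0;
}
```

       A caller might reject a user-supplied pattern when this function
       returns 1, and still verify the compiled pattern's options afterwards.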
115
116       If your application is one that supports UTF, be  aware  that  validity
117       checking  can  take time. If the same data string is to be matched many
118       times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second
119       and subsequent matches to save redundant checks.
120
121       Another  way  that  performance can be hit is by running a pattern that
122       has a very large search tree against a string that  will  never  match.
123       Nested  unlimited  repeats in a pattern are a common example. PCRE pro-
124       vides some protection against this: see the PCRE_EXTRA_MATCH_LIMIT fea-
125       ture in the pcreapi page.
126
127
128USER DOCUMENTATION
129
130       The  user  documentation  for PCRE comprises a number of different sec-
131       tions. In the "man" format, each of these is a separate "man page".  In
132       the  HTML  format, each is a separate page, linked from the index page.
133       In the plain text format, the descriptions of the pcregrep and pcretest
134       programs  are  in  files  called pcregrep.txt and pcretest.txt, respec-
135       tively. The remaining sections, except for the pcredemo section  (which
136       is  a  program  listing),  are  concatenated  in  pcre.txt, for ease of
137       searching. The sections are as follows:
138
139         pcre              this document
140         pcre-config       show PCRE installation configuration information
141         pcre16            details of the 16-bit library
142         pcre32            details of the 32-bit library
143         pcreapi           details of PCRE's native C API
144         pcrebuild         building PCRE
145         pcrecallout       details of the callout feature
146         pcrecompat        discussion of Perl compatibility
147         pcrecpp           details of the C++ wrapper for the 8-bit library
148         pcredemo          a demonstration C program that uses PCRE
149         pcregrep          description of the pcregrep command (8-bit only)
150         pcrejit           discussion of the just-in-time optimization support
151         pcrelimits        details of size and other limits
152         pcrematching      discussion of the two matching algorithms
153         pcrepartial       details of the partial matching facility
154         pcrepattern       syntax and semantics of supported
155                             regular expressions
156         pcreperform       discussion of performance issues
157         pcreposix         the POSIX-compatible C API for the 8-bit library
158         pcreprecompile    details of saving and re-using precompiled patterns
159         pcresample        discussion of the pcredemo program
160         pcrestack         discussion of stack usage
161         pcresyntax        quick syntax reference
162         pcretest          description of the pcretest testing command
163         pcreunicode       discussion of Unicode and UTF-8/16/32 support
164
165       In the "man" and HTML formats, there is also a short page  for  each  C
166       library function, listing its arguments and results.
167
168
169AUTHOR
170
171       Philip Hazel
172       University Computing Service
173       Cambridge CB2 3QH, England.
174
175       Putting  an actual email address here seems to have been a spam magnet,
176       so I've taken it away. If you want to email me, use  my  two  initials,
177       followed by the two digits 10, at the domain cam.ac.uk.
178
179
180REVISION
181
182       Last updated: 08 January 2014
183       Copyright (c) 1997-2014 University of Cambridge.
184------------------------------------------------------------------------------
185
186
187PCRE(3)                    Library Functions Manual                    PCRE(3)
188
189
190
191NAME
192       PCRE - Perl-compatible regular expressions
193
194       #include <pcre.h>
195
196
197PCRE 16-BIT API BASIC FUNCTIONS
198
199       pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
200            const char **errptr, int *erroffset,
201            const unsigned char *tableptr);
202
203       pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
204            int *errorcodeptr,
205            const char **errptr, int *erroffset,
206            const unsigned char *tableptr);
207
208       pcre16_extra *pcre16_study(const pcre16 *code, int options,
209            const char **errptr);
210
211       void pcre16_free_study(pcre16_extra *extra);
212
213       int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
214            PCRE_SPTR16 subject, int length, int startoffset,
215            int options, int *ovector, int ovecsize);
216
217       int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
218            PCRE_SPTR16 subject, int length, int startoffset,
219            int options, int *ovector, int ovecsize,
220            int *workspace, int wscount);
221
222
223PCRE 16-BIT API STRING EXTRACTION FUNCTIONS
224
225       int pcre16_copy_named_substring(const pcre16 *code,
226            PCRE_SPTR16 subject, int *ovector,
227            int stringcount, PCRE_SPTR16 stringname,
228            PCRE_UCHAR16 *buffer, int buffersize);
229
230       int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
231            int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
232            int buffersize);
233
234       int pcre16_get_named_substring(const pcre16 *code,
235            PCRE_SPTR16 subject, int *ovector,
236            int stringcount, PCRE_SPTR16 stringname,
237            PCRE_SPTR16 *stringptr);
238
239       int pcre16_get_stringnumber(const pcre16 *code,
240            PCRE_SPTR16 name);
241
242       int pcre16_get_stringtable_entries(const pcre16 *code,
243            PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
244
245       int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
246            int stringcount, int stringnumber,
247            PCRE_SPTR16 *stringptr);
248
249       int pcre16_get_substring_list(PCRE_SPTR16 subject,
250            int *ovector, int stringcount, PCRE_SPTR16 **listptr);
251
252       void pcre16_free_substring(PCRE_SPTR16 stringptr);
253
254       void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);
255
256
257PCRE 16-BIT API AUXILIARY FUNCTIONS
258
259       pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);
260
261       void pcre16_jit_stack_free(pcre16_jit_stack *stack);
262
263       void pcre16_assign_jit_stack(pcre16_extra *extra,
264            pcre16_jit_callback callback, void *data);
265
266       const unsigned char *pcre16_maketables(void);
267
268       int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
269            int what, void *where);
270
271       int pcre16_refcount(pcre16 *code, int adjust);
272
273       int pcre16_config(int what, void *where);
274
275       const char *pcre16_version(void);
276
277       int pcre16_pattern_to_host_byte_order(pcre16 *code,
278            pcre16_extra *extra, const unsigned char *tables);
279
280
281PCRE 16-BIT API INDIRECTED FUNCTIONS
282
283       void *(*pcre16_malloc)(size_t);
284
285       void (*pcre16_free)(void *);
286
287       void *(*pcre16_stack_malloc)(size_t);
288
289       void (*pcre16_stack_free)(void *);
290
291       int (*pcre16_callout)(pcre16_callout_block *);
292
293
294PCRE 16-BIT API 16-BIT-ONLY FUNCTION
295
296       int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
297            PCRE_SPTR16 input, int length, int *byte_order,
298            int keep_boms);
299
300
301THE PCRE 16-BIT LIBRARY
302
303       Starting  with  release  8.30, it is possible to compile a PCRE library
304       that supports 16-bit character strings, including  UTF-16  strings,  as
305       well  as  or instead of the original 8-bit library. The majority of the
306       work to make  this  possible  was  done  by  Zoltan  Herczeg.  The  two
307       libraries contain identical sets of functions, used in exactly the same
308       way. Only the names of the functions and the data types of their  argu-
309       ments  and results are different. To avoid over-complication and reduce
310       the documentation maintenance load,  most  of  the  PCRE  documentation
311       describes  the  8-bit  library,  with only occasional references to the
312       16-bit library. This page describes what is different when you use  the
313       16-bit library.
314
315       WARNING:  A  single  application can be linked with both libraries, but
316       you must take care when processing any particular pattern to use  func-
317       tions  from  just one library. For example, if you want to study a pat-
318       tern that was compiled with  pcre16_compile(),  you  must  do  so  with
319       pcre16_study(), not pcre_study(), and you must free the study data with
320       pcre16_free_study().
321
322
323THE HEADER FILE
324
325       There is only one header file, pcre.h. It contains prototypes  for  all
326       the functions in all libraries, as well as definitions of flags, struc-
327       tures, error codes, etc.
328
329
330THE LIBRARY NAME
331
332       In Unix-like systems, the 16-bit library is called libpcre16,  and  can
       normally be accessed by adding -lpcre16 to the command for linking an
334       application that uses PCRE.
335
336
337STRING TYPES
338
339       In the 8-bit library, strings are passed to PCRE library  functions  as
340       vectors  of  bytes  with  the  C  type "char *". In the 16-bit library,
341       strings are passed as vectors of unsigned 16-bit quantities. The  macro
342       PCRE_UCHAR16  specifies  an  appropriate  data type, and PCRE_SPTR16 is
343       defined as "const PCRE_UCHAR16 *". In very  many  environments,  "short
344       int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
345       as "unsigned short int", but checks that it really  is  a  16-bit  data
346       type.  If  it is not, the build fails with an error message telling the
347       maintainer to modify the definition appropriately.
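
       The kind of width check described above can be sketched in portable C.
       The names my_uchar16 and my_strlen16() are hypothetical stand-ins used
       only for this illustration; a real application should use PCRE_UCHAR16
       and PCRE_SPTR16 from pcre.h. The sketch is not PCRE's own build check.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for PCRE_UCHAR16. */
typedef unsigned short int my_uchar16;

/* Compile-time check: the array size becomes -1, and compilation fails,
   if my_uchar16 is not exactly 16 bits wide. */
typedef char my_uchar16_is_16_bits[(sizeof(my_uchar16) * CHAR_BIT == 16) ? 1 : -1];

/* Counts the 16-bit units that precede the terminating zero unit. */
static size_t my_strlen16(const my_uchar16 *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

       The length is counted in 16-bit units, not bytes, which is the unit
       that the 16-bit functions expect for lengths and offsets.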
348
349
350STRUCTURE TYPES
351
352       The types of the opaque structures that are used  for  compiled  16-bit
353       patterns  and  JIT stacks are pcre16 and pcre16_jit_stack respectively.
354       The  type  of  the  user-accessible  structure  that  is  returned   by
355       pcre16_study()  is  pcre16_extra, and the type of the structure that is
356       used for passing data to a callout  function  is  pcre16_callout_block.
357       These structures contain the same fields, with the same names, as their
358       8-bit counterparts. The only difference is that pointers  to  character
359       strings are 16-bit instead of 8-bit types.
360
361
16-BIT FUNCTIONS
363
364       For  every function in the 8-bit library there is a corresponding func-
365       tion in the 16-bit library with a name that starts with pcre16_ instead
366       of  pcre_.  The  prototypes are listed above. In addition, there is one
367       extra function, pcre16_utf16_to_host_byte_order(). This  is  a  utility
368       function  that converts a UTF-16 character string to host byte order if
369       necessary. The other 16-bit  functions  expect  the  strings  they  are
370       passed to be in host byte order.
371
372       The input and output arguments of pcre16_utf16_to_host_byte_order() may
373       point to the same address, that is, conversion in place  is  supported.
374       The output buffer must be at least as long as the input.
375
376       The  length  argument  specifies the number of 16-bit data units in the
377       input string; a negative value specifies a zero-terminated string.
378
379       If byte_order is NULL, it is assumed that the string starts off in host
380       byte  order. This may be changed by byte-order marks (BOMs) anywhere in
381       the string (commonly as the first character).
382
383       If byte_order is not NULL, a non-zero value of the integer to which  it
384       points  means  that  the input starts off in host byte order, otherwise
385       the opposite order is assumed. Again, BOMs in  the  string  can  change
386       this. The final byte order is passed back at the end of processing.
387
388       If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
389       copied into the output string. Otherwise they are discarded.
390
391       The result of the function is the number of 16-bit  units  placed  into
392       the  output  buffer,  including  the  zero terminator if the string was
393       zero-terminated.
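
       The behaviour described above can be modelled in a few lines of C. The
       sketch below is an illustration written for this text, not the
       library's implementation: unit16 and to_host_order16() are
       hypothetical names, and unit16 is assumed to be exactly 16 bits wide.
       A unit that reads as 0xfeff is taken to mean "host order from here
       on", and a unit that reads as 0xfffe (a byte-swapped BOM) is taken to
       mean "opposite order from here on".

```c
#include <stddef.h>

typedef unsigned short int unit16;  /* hypothetical stand-in for PCRE_UCHAR16 */

/* Model of the documented behaviour: copies input to output in host byte
   order and returns the number of 16-bit units written. */
static int to_host_order16(unit16 *output, const unit16 *input, int length,
                           int *byte_order, int keep_boms)
{
    /* A NULL byte_order means the input is assumed to start in host order. */
    int host_order = (byte_order != NULL) ? (*byte_order != 0) : 1;
    unit16 *out = output;
    int i;

    if (length < 0) {               /* zero-terminated input */
        length = 0;
        while (input[length] != 0)
            length++;
        length++;                   /* the terminator is copied as well */
    }

    for (i = 0; i < length; i++) {
        unit16 c = input[i];
        if (c == 0xfeff || c == 0xfffe) {
            host_order = (c == 0xfeff);   /* a BOM can switch the order */
            if (keep_boms)
                *out++ = 0xfeff;
        } else {
            *out++ = host_order ? c
                                : (unit16)(((c >> 8) | (c << 8)) & 0xffff);
        }
    }

    if (byte_order != NULL)
        *byte_order = host_order;   /* report the final byte order */
    return (int)(out - output);
}
```

       Because the write position never overtakes the read position, the
       same buffer can be passed as both input and output.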
394
395
396SUBJECT STRING OFFSETS
397
398       The lengths and starting offsets of subject strings must  be  specified
399       in  16-bit  data units, and the offsets within subject strings that are
       returned by the matching functions are also in 16-bit units rather than
401       bytes.
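
       In UTF-16 mode a character outside the basic plane occupies two 16-bit
       units, so a character index and a unit offset can differ. The sketch
       below shows one way an application might turn a character index into
       the unit offset that would be passed as a starting offset. The names
       unit16 and unit_offset_of_char() are hypothetical, and the input is
       assumed to be valid UTF-16 in host byte order.

```c
typedef unsigned short int unit16;  /* hypothetical stand-in for PCRE_UCHAR16 */

/* Returns the offset, in 16-bit units, of the character with the given
   index, or -1 if the string holds fewer characters than that. An index
   equal to the character count yields the end-of-string offset. */
static int unit_offset_of_char(const unit16 *s, int length, int char_index)
{
    int off = 0;

    while (off < length && char_index > 0) {
        /* A high surrogate starts a two-unit character. */
        off += (s[off] >= 0xd800 && s[off] <= 0xdbff) ? 2 : 1;
        char_index--;
    }
    return (char_index == 0 && off <= length) ? off : -1;
}
```

       For the four-unit string 0x0041 0xd83d 0xde00 0x0042, the third
       character starts at unit offset 3, not 2.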
402
403
404NAMED SUBPATTERNS
405
406       The  name-to-number translation table that is maintained for named sub-
407       patterns uses 16-bit characters.  The  pcre16_get_stringtable_entries()
408       function returns the length of each entry in the table as the number of
409       16-bit data units.
410
411
412OPTION NAMES
413
414       There   are   two   new   general   option   names,   PCRE_UTF16    and
415       PCRE_NO_UTF16_CHECK,     which     correspond    to    PCRE_UTF8    and
416       PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
417       define  the  same bits in the options word. There is a discussion about
418       the validity of UTF-16 strings in the pcreunicode page.
419
420       For the pcre16_config() function there is an  option  PCRE_CONFIG_UTF16
421       that  returns  1  if UTF-16 support is configured, otherwise 0. If this
422       option  is  given  to  pcre_config()  or  pcre32_config(),  or  if  the
423       PCRE_CONFIG_UTF8  or  PCRE_CONFIG_UTF32  option is given to pcre16_con-
424       fig(), the result is the PCRE_ERROR_BADOPTION error.
425
426
427CHARACTER CODES
428
429       In 16-bit mode, when  PCRE_UTF16  is  not  set,  character  values  are
430       treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
431       that they can range from 0 to 0xffff instead of 0  to  0xff.  Character
432       types  for characters less than 0xff can therefore be influenced by the
433       locale in the same way as before.  Characters greater  than  0xff  have
434       only one case, and no "type" (such as letter or digit).
435
436       In  UTF-16  mode,  the  character  code  is  Unicode, in the range 0 to
437       0x10ffff, with the exception of values in the range  0xd800  to  0xdfff
438       because  those  are "surrogate" values that are used in pairs to encode
439       values greater than 0xffff.
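
       The surrogate mechanism can be shown with a short encoder. This is a
       sketch of the standard UTF-16 encoding rule rather than code taken
       from PCRE; unit16 and encode_utf16() are hypothetical names, and
       unit16 is assumed to be 16 bits wide.

```c
typedef unsigned short int unit16;  /* hypothetical stand-in for PCRE_UCHAR16 */

/* Writes the UTF-16 form of code point cp into out, which must have room
   for two units. Returns the number of units written, or 0 if cp is a
   surrogate value or lies above 0x10ffff. */
static int encode_utf16(unsigned long cp, unit16 *out)
{
    if (cp > 0x10fffful || (cp >= 0xd800ul && cp <= 0xdffful))
        return 0;

    if (cp <= 0xfffful) {           /* fits in a single unit */
        out[0] = (unit16)cp;
        return 1;
    }

    cp -= 0x10000ul;                /* 20 bits remain to be split */
    out[0] = (unit16)(0xd800ul + (cp >> 10));      /* high surrogate */
    out[1] = (unit16)(0xdc00ul + (cp & 0x3fful));  /* low surrogate */
    return 2;
}
```

       For example, code point 0x1f600 is encoded as the pair 0xd83d 0xde00.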
440
441       A UTF-16 string can indicate its endianness by special code knows as  a
442       byte-order mark (BOM). The PCRE functions do not handle this, expecting
443       strings  to  be  in  host  byte  order.  A  utility   function   called
444       pcre16_utf16_to_host_byte_order()  is  provided  to help with this (see
445       above).
446
447
448ERROR NAMES
449
450       The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16  corre-
451       spond  to  their  8-bit  counterparts.  The error PCRE_ERROR_BADMODE is
452       given when a compiled pattern is passed to a  function  that  processes
453       patterns  in  the  other  mode, for example, if a pattern compiled with
454       pcre_compile() is passed to pcre16_exec().
455
456       There are new error codes whose names  begin  with  PCRE_UTF16_ERR  for
457       invalid  UTF-16  strings,  corresponding to the PCRE_UTF8_ERR codes for
458       UTF-8 strings that are described in the section entitled "Reason  codes
459       for  invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
460       are:
461
462         PCRE_UTF16_ERR1  Missing low surrogate at end of string
463         PCRE_UTF16_ERR2  Invalid low surrogate follows high surrogate
464         PCRE_UTF16_ERR3  Isolated low surrogate
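
       The first three conditions in this list can be illustrated with a
       small checker. It is a sketch written for this text, not PCRE's
       validity check: check_utf16() and the UTF16_* constants are
       hypothetical names whose values merely mirror the numeric suffixes
       above. The non-character case is left out because this page does not
       say which values it covers.

```c
typedef unsigned short int unit16;  /* hypothetical stand-in for PCRE_UCHAR16 */

/* Hypothetical result codes; the numbers mirror the ERR1..ERR3 suffixes. */
enum {
    UTF16_OK           = 0,
    UTF16_MISSING_LOW  = 1,  /* high surrogate is the last unit */
    UTF16_BAD_LOW      = 2,  /* high surrogate not followed by a low one */
    UTF16_ISOLATED_LOW = 3   /* low surrogate without a preceding high one */
};

/* Scans length units of s. On failure, stores the offset of the offending
   unit in *erroroffset and returns one of the codes above. */
static int check_utf16(const unit16 *s, int length, int *erroroffset)
{
    int i;

    for (i = 0; i < length; i++) {
        unit16 c = s[i];

        if (c >= 0xdc00 && c <= 0xdfff) {
            *erroroffset = i;
            return UTF16_ISOLATED_LOW;
        }
        if (c >= 0xd800 && c <= 0xdbff) {
            if (i + 1 >= length) {
                *erroroffset = i;
                return UTF16_MISSING_LOW;
            }
            if (s[i + 1] < 0xdc00 || s[i + 1] > 0xdfff) {
                *erroroffset = i;
                return UTF16_BAD_LOW;
            }
            i++;                    /* skip the low half of a valid pair */
        }
    }
    return UTF16_OK;
}
```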
465         PCRE_UTF16_ERR4  Non-character
466
467
468ERROR TEXTS
469
470       If there is an error while compiling a pattern, the error text that  is
471       passed  back by pcre16_compile() or pcre16_compile2() is still an 8-bit
472       character string, zero-terminated.
473
474
475CALLOUTS
476
477       The subject and mark fields in the callout block that is  passed  to  a
478       callout function point to 16-bit vectors.
479
480
481TESTING
482
483       The  pcretest  program continues to operate with 8-bit input and output
484       files, but it can be used for testing the 16-bit library. If it is  run
485       with the command line option -16, patterns and subject strings are con-
486       verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
487       library  functions  are used instead of the 8-bit ones. Returned 16-bit
       strings are converted to 8-bit for output. If neither the 8-bit nor the
       32-bit library was compiled, pcretest defaults to 16-bit and the
490       -16 option is ignored.
491
492       When PCRE is being built, the RunTest script that is  called  by  "make
493       check"  uses  the  pcretest  -C  option to discover which of the 8-bit,
494       16-bit and 32-bit libraries has been built, and runs the  tests  appro-
495       priately.
496
497
498NOT SUPPORTED IN 16-BIT MODE
499
500       Not all the features of the 8-bit library are available with the 16-bit
501       library. The C++ and POSIX wrapper functions  support  only  the  8-bit
502       library, and the pcregrep program is at present 8-bit only.
503
504
505AUTHOR
506
507       Philip Hazel
508       University Computing Service
509       Cambridge CB2 3QH, England.
510
511
512REVISION
513
514       Last updated: 12 May 2013
515       Copyright (c) 1997-2013 University of Cambridge.
516------------------------------------------------------------------------------
517
518
519PCRE(3)                    Library Functions Manual                    PCRE(3)
520
521
522
523NAME
524       PCRE - Perl-compatible regular expressions
525
526       #include <pcre.h>
527
528
529PCRE 32-BIT API BASIC FUNCTIONS
530
531       pcre32 *pcre32_compile(PCRE_SPTR32 pattern, int options,
532            const char **errptr, int *erroffset,
533            const unsigned char *tableptr);
534
       pcre32 *pcre32_compile2(PCRE_SPTR32 pattern, int options,
            int *errorcodeptr,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);
538
539       pcre32_extra *pcre32_study(const pcre32 *code, int options,
540            const char **errptr);
541
542       void pcre32_free_study(pcre32_extra *extra);
543
544       int pcre32_exec(const pcre32 *code, const pcre32_extra *extra,
545            PCRE_SPTR32 subject, int length, int startoffset,
546            int options, int *ovector, int ovecsize);
547
548       int pcre32_dfa_exec(const pcre32 *code, const pcre32_extra *extra,
549            PCRE_SPTR32 subject, int length, int startoffset,
550            int options, int *ovector, int ovecsize,
551            int *workspace, int wscount);
552
553
554PCRE 32-BIT API STRING EXTRACTION FUNCTIONS
555
556       int pcre32_copy_named_substring(const pcre32 *code,
557            PCRE_SPTR32 subject, int *ovector,
558            int stringcount, PCRE_SPTR32 stringname,
559            PCRE_UCHAR32 *buffer, int buffersize);
560
561       int pcre32_copy_substring(PCRE_SPTR32 subject, int *ovector,
562            int stringcount, int stringnumber, PCRE_UCHAR32 *buffer,
563            int buffersize);
564
565       int pcre32_get_named_substring(const pcre32 *code,
566            PCRE_SPTR32 subject, int *ovector,
567            int stringcount, PCRE_SPTR32 stringname,
568            PCRE_SPTR32 *stringptr);
569
570       int pcre32_get_stringnumber(const pcre32 *code,
571            PCRE_SPTR32 name);
572
573       int pcre32_get_stringtable_entries(const pcre32 *code,
574            PCRE_SPTR32 name, PCRE_UCHAR32 **first, PCRE_UCHAR32 **last);
575
576       int pcre32_get_substring(PCRE_SPTR32 subject, int *ovector,
577            int stringcount, int stringnumber,
578            PCRE_SPTR32 *stringptr);
579
580       int pcre32_get_substring_list(PCRE_SPTR32 subject,
581            int *ovector, int stringcount, PCRE_SPTR32 **listptr);
582
583       void pcre32_free_substring(PCRE_SPTR32 stringptr);
584
585       void pcre32_free_substring_list(PCRE_SPTR32 *stringptr);
586
587
588PCRE 32-BIT API AUXILIARY FUNCTIONS
589
590       pcre32_jit_stack *pcre32_jit_stack_alloc(int startsize, int maxsize);
591
592       void pcre32_jit_stack_free(pcre32_jit_stack *stack);
593
594       void pcre32_assign_jit_stack(pcre32_extra *extra,
595            pcre32_jit_callback callback, void *data);
596
597       const unsigned char *pcre32_maketables(void);
598
599       int pcre32_fullinfo(const pcre32 *code, const pcre32_extra *extra,
600            int what, void *where);
601
602       int pcre32_refcount(pcre32 *code, int adjust);
603
604       int pcre32_config(int what, void *where);
605
606       const char *pcre32_version(void);
607
608       int pcre32_pattern_to_host_byte_order(pcre32 *code,
609            pcre32_extra *extra, const unsigned char *tables);
610
611
612PCRE 32-BIT API INDIRECTED FUNCTIONS
613
614       void *(*pcre32_malloc)(size_t);
615
616       void (*pcre32_free)(void *);
617
618       void *(*pcre32_stack_malloc)(size_t);
619
620       void (*pcre32_stack_free)(void *);
621
622       int (*pcre32_callout)(pcre32_callout_block *);
623
624
625PCRE 32-BIT API 32-BIT-ONLY FUNCTION
626
627       int pcre32_utf32_to_host_byte_order(PCRE_UCHAR32 *output,
628            PCRE_SPTR32 input, int length, int *byte_order,
629            int keep_boms);
630
631
632THE PCRE 32-BIT LIBRARY
633
634       Starting  with  release  8.32, it is possible to compile a PCRE library
635       that supports 32-bit character strings, including  UTF-32  strings,  as
636       well as or instead of the original 8-bit library. This work was done by
637       Christian Persch, based on the work done  by  Zoltan  Herczeg  for  the
638       16-bit  library.  All  three  libraries contain identical sets of func-
639       tions, used in exactly the same way.  Only the names of  the  functions
640       and  the  data  types  of their arguments and results are different. To
641       avoid over-complication and reduce the documentation maintenance  load,
642       most  of  the PCRE documentation describes the 8-bit library, with only
643       occasional references to the 16-bit and  32-bit  libraries.  This  page
644       describes what is different when you use the 32-bit library.
645
646       WARNING:  A  single  application  can  be linked with all or any of the
647       three libraries, but you must take care when processing any  particular
648       pattern  to  use  functions  from just one library. For example, if you
649       want to study a pattern that was compiled  with  pcre32_compile(),  you
650       must do so with pcre32_study(), not pcre_study(), and you must free the
651       study data with pcre32_free_study().
652
653
654THE HEADER FILE
655
656       There is only one header file, pcre.h. It contains prototypes  for  all
657       the functions in all libraries, as well as definitions of flags, struc-
658       tures, error codes, etc.
659
660
661THE LIBRARY NAME
662
663       In Unix-like systems, the 32-bit library is called libpcre32,  and  can
       normally be accessed by adding -lpcre32 to the command for linking an
665       application that uses PCRE.
666
667
668STRING TYPES
669
670       In the 8-bit library, strings are passed to PCRE library  functions  as
671       vectors  of  bytes  with  the  C  type "char *". In the 32-bit library,
672       strings are passed as vectors of unsigned 32-bit quantities. The  macro
673       PCRE_UCHAR32  specifies  an  appropriate  data type, and PCRE_SPTR32 is
674       defined as "const PCRE_UCHAR32 *". In very many environments, "unsigned
675       int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32
676       as "unsigned int", but checks that it really is a 32-bit data type.  If
677       it is not, the build fails with an error message telling the maintainer
678       to modify the definition appropriately.
679
680
681STRUCTURE TYPES
682
683       The types of the opaque structures that are used  for  compiled  32-bit
684       patterns  and  JIT stacks are pcre32 and pcre32_jit_stack respectively.
685       The  type  of  the  user-accessible  structure  that  is  returned   by
686       pcre32_study()  is  pcre32_extra, and the type of the structure that is
687       used for passing data to a callout  function  is  pcre32_callout_block.
688       These structures contain the same fields, with the same names, as their
689       8-bit counterparts. The only difference is that pointers  to  character
690       strings are 32-bit instead of 8-bit types.
691
692
69332-BIT FUNCTIONS
694
695       For  every function in the 8-bit library there is a corresponding func-
696       tion in the 32-bit library with a name that starts with pcre32_ instead
697       of  pcre_.  The  prototypes are listed above. In addition, there is one
698       extra function, pcre32_utf32_to_host_byte_order(). This  is  a  utility
699       function  that converts a UTF-32 character string to host byte order if
700       necessary. The other 32-bit  functions  expect  the  strings  they  are
701       passed to be in host byte order.
702
703       The input and output arguments of pcre32_utf32_to_host_byte_order() may
704       point to the same address, that is, conversion in place  is  supported.
705       The output buffer must be at least as long as the input.
706
707       The  length  argument  specifies the number of 32-bit data units in the
708       input string; a negative value specifies a zero-terminated string.
709
710       If byte_order is NULL, it is assumed that the string starts off in host
711       byte  order. This may be changed by byte-order marks (BOMs) anywhere in
712       the string (commonly as the first character).
713
714       If byte_order is not NULL, a non-zero value of the integer to which  it
715       points  means  that  the input starts off in host byte order, otherwise
716       the opposite order is assumed. Again, BOMs in  the  string  can  change
717       this. The final byte order is passed back at the end of processing.
718
719       If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
720       copied into the output string. Otherwise they are discarded.
721
722       The result of the function is the number of 32-bit  units  placed  into
723       the  output  buffer,  including  the  zero terminator if the string was
724       zero-terminated.
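
       The behaviour described above can be sketched in standalone C as
       follows. This is not the PCRE source: utf32_to_host_sketch() and
       swap32() are hypothetical names, and the sketch assumes that a BOM
       seen in the opposite byte order appears as 0xfffe0000. It is meant
       only to make the handling of length, byte_order and keep_boms
       concrete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the four bytes of one 32-bit unit. */
static uint32_t swap32(uint32_t u)
{
    return (u >> 24) | ((u >> 8) & 0x0000ff00u) |
           ((u << 8) & 0x00ff0000u) | (u << 24);
}

/* Hypothetical stand-in for the conversion described in the text.
   Returns the number of units written to output. */
static int utf32_to_host_sketch(uint32_t *output, const uint32_t *input,
                                int length, int *byte_order, int keep_boms)
{
    /* A NULL byte_order means "assume host order to begin with". */
    int host = (byte_order != NULL) ? (*byte_order != 0) : 1;
    int zero_terminated = 0;
    int i, n = 0;

    assert(output != NULL && input != NULL);

    if (length < 0) {               /* negative: zero-terminated string */
        length = 0;
        while (input[length] != 0) length++;
        zero_terminated = 1;
    }

    for (i = 0; i < length; i++) {
        uint32_t u = input[i];
        if (u == 0x0000feffu || u == 0xfffe0000u) {
            host = (u == 0x0000feffu);      /* a BOM resets the order */
            if (keep_boms) output[n++] = 0x0000feffu;
            continue;
        }
        output[n++] = host ? u : swap32(u);
    }

    if (zero_terminated) output[n++] = 0;   /* terminator is counted */
    if (byte_order != NULL) *byte_order = host;
    return n;
}
```

       Because the write index never overtakes the read index, the sketch
       also works when output and input point to the same buffer.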
725
726
727SUBJECT STRING OFFSETS
728
729       The lengths and starting offsets of subject strings must  be  specified
730       in  32-bit  data units, and the offsets within subject strings that are
731       returned by the matching functions are also in 32-bit units rather than
732       bytes.
733
734
735NAMED SUBPATTERNS
736
737       The  name-to-number translation table that is maintained for named sub-
738       patterns uses 32-bit characters.  The  pcre32_get_stringtable_entries()
739       function returns the length of each entry in the table as the number of
740       32-bit data units.
741
742
743OPTION NAMES
744
745       There   are   two   new   general   option   names,   PCRE_UTF32    and
746       PCRE_NO_UTF32_CHECK,     which     correspond    to    PCRE_UTF8    and
747       PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
748       define  the  same bits in the options word. There is a discussion about
749       the validity of UTF-32 strings in the pcreunicode page.
750
751       For the pcre32_config() function there is an  option  PCRE_CONFIG_UTF32
752       that  returns  1  if UTF-32 support is configured, otherwise 0. If this
753       option  is  given  to  pcre_config()  or  pcre16_config(),  or  if  the
754       PCRE_CONFIG_UTF8  or  PCRE_CONFIG_UTF16  option is given to pcre32_con-
755       fig(), the result is the PCRE_ERROR_BADOPTION error.
756
757
758CHARACTER CODES
759
760       In 32-bit mode, when  PCRE_UTF32  is  not  set,  character  values  are
761       treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
762       that they can range from 0 to 0x7fffffff instead of 0 to 0xff.  Charac-
763       ter  types for characters less than 0xff can therefore be influenced by
764       the locale in the same way as before.   Characters  greater  than  0xff
765       have only one case, and no "type" (such as letter or digit).
766
767       In  UTF-32  mode,  the  character  code  is  Unicode, in the range 0 to
768       0x10ffff, with the exception of values in the range  0xd800  to  0xdfff
769       because those are "surrogate" values that are ill-formed in UTF-32.
770
771       A UTF-32 string can indicate its endianness by a special code known as a
772       byte-order mark (BOM). The PCRE functions do not handle this, expecting
773       strings   to   be  in  host  byte  order.  A  utility  function  called
774       pcre32_utf32_to_host_byte_order() is provided to help  with  this  (see
775       above).
776
777
778ERROR NAMES
779
780       The  error  PCRE_ERROR_BADUTF32  corresponds  to its 8-bit counterpart.
781       The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed
782       to  a  function that processes patterns in the other mode, for example,
783       if a pattern compiled with pcre_compile() is passed to pcre32_exec().
784
785       There are new error codes whose names  begin  with  PCRE_UTF32_ERR  for
786       invalid  UTF-32  strings,  corresponding to the PCRE_UTF8_ERR codes for
787       UTF-8 strings that are described in the section entitled "Reason  codes
788       for  invalid UTF-8 strings" in the main pcreapi page. The UTF-32 errors
789       are:
790
791         PCRE_UTF32_ERR1  Surrogate character (range from 0xd800 to 0xdfff)
792         PCRE_UTF32_ERR2  Non-character
793         PCRE_UTF32_ERR3  Character > 0x10ffff
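
       The three reasons listed above can be illustrated with a small
       standalone checker. The enum and function names below are
       hypothetical, and the non-character test uses the Unicode definition
       (0xfdd0 to 0xfdef, plus any code point whose low 16 bits are 0xfffe
       or 0xffff); whether PCRE applies exactly this test is an assumption
       of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reason codes, numbered to echo ERR1..ERR3 above. */
enum sketch_utf32_reason {
    SKETCH_UTF32_OK = 0,
    SKETCH_UTF32_SURROGATE = 1,
    SKETCH_UTF32_NONCHAR = 2,
    SKETCH_UTF32_TOO_BIG = 3
};

/* Classify a single 32-bit unit. */
static enum sketch_utf32_reason classify_unit(uint32_t c)
{
    if (c > 0x10ffffu) return SKETCH_UTF32_TOO_BIG;
    if (c >= 0xd800u && c <= 0xdfffu) return SKETCH_UTF32_SURROGATE;
    if ((c >= 0xfdd0u && c <= 0xfdefu) || (c & 0xfffeu) == 0xfffeu)
        return SKETCH_UTF32_NONCHAR;
    return SKETCH_UTF32_OK;
}

/* Return the index of the first invalid unit, or n if all n units are
   valid. The reason for the result is stored in *reason. */
static size_t first_invalid_unit(const uint32_t *s, size_t n,
                                 enum sketch_utf32_reason *reason)
{
    size_t i;
    assert(s != NULL || n == 0);
    assert(reason != NULL);
    for (i = 0; i < n; i++) {
        *reason = classify_unit(s[i]);
        if (*reason != SKETCH_UTF32_OK) return i;
    }
    *reason = SKETCH_UTF32_OK;
    return n;
}
```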
794
795
796ERROR TEXTS
797
798       If there is an error while compiling a pattern, the error text that  is
799       passed  back by pcre32_compile() or pcre32_compile2() is still an 8-bit
800       character string, zero-terminated.
801
802
803CALLOUTS
804
805       The subject and mark fields in the callout block that is  passed  to  a
806       callout function point to 32-bit vectors.
807
808
809TESTING
810
811       The  pcretest  program continues to operate with 8-bit input and output
812       files, but it can be used for testing the 32-bit library. If it is  run
813       with the command line option -32, patterns and subject strings are con-
814       verted from 8-bit to 32-bit before being passed to PCRE, and the 32-bit
815       library  functions  are used instead of the 8-bit ones. Returned 32-bit
816       strings are converted to 8-bit for output. If neither the 8-bit nor the
817       16-bit library was compiled, pcretest defaults to 32-bit  and  the  -32
818       option is ignored.
819
820       When PCRE is being built, the RunTest script that is  called  by  "make
821       check"  uses  the  pcretest  -C  option to discover which of the 8-bit,
822       16-bit and 32-bit libraries has been built, and runs the  tests  appro-
823       priately.
824
825
826NOT SUPPORTED IN 32-BIT MODE
827
828       Not all the features of the 8-bit library are available with the 32-bit
829       library. The C++ and POSIX wrapper functions  support  only  the  8-bit
830       library, and the pcregrep program is at present 8-bit only.
831
832
833AUTHOR
834
835       Philip Hazel
836       University Computing Service
837       Cambridge CB2 3QH, England.
838
839
840REVISION
841
842       Last updated: 12 May 2013
843       Copyright (c) 1997-2013 University of Cambridge.
844------------------------------------------------------------------------------
845
846
847PCREBUILD(3)               Library Functions Manual               PCREBUILD(3)
848
849
850
851NAME
852       PCRE - Perl-compatible regular expressions
853
854BUILDING PCRE
855
856       PCRE  is  distributed with a configure script that can be used to build
857       the library in Unix-like environments using the applications  known  as
858       Autotools.   Also  in  the  distribution  are files to support building
859       using CMake instead of configure. The text file README contains general
860       information  about  building  with Autotools (some of which is repeated
861       below), and also has some comments about building on various  operating
862       systems.  There  is  a lot more information about building PCRE without
863       using Autotools (including information about using CMake  and  building
864       "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
865       consult this file as well as the README file if you are building  in  a
866       non-Unix-like environment.
867
868
869PCRE BUILD-TIME OPTIONS
870
871       The  rest of this document describes the optional features of PCRE that
872       can be selected when the library is compiled. It  assumes  use  of  the
873       configure  script,  where  the  optional features are selected or dese-
874       lected by providing options to configure before running the  make  com-
875       mand.  However,  the same options can be selected in both Unix-like and
876       non-Unix-like environments using the GUI facility of cmake-gui  if  you
877       are using CMake instead of configure to build PCRE.
878
879       If  you  are not using Autotools or CMake, option selection can be done
880       by editing the config.h file, or by passing parameter settings  to  the
881       compiler, as described in NON-AUTOTOOLS-BUILD.
882
883       The complete list of options for configure (which includes the standard
884       ones such as the  selection  of  the  installation  directory)  can  be
885       obtained by running
886
887         ./configure --help
888
889       The  following  sections  include  descriptions  of options whose names
890       begin with --enable or --disable. These settings specify changes to the
891       defaults  for  the configure command. Because of the way that configure
892       works, --enable and --disable always come in pairs, so  the  complemen-
893       tary  option always exists as well, but as it specifies the default, it
894       is not described.
895
896
897BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
898
899       By default, a library called libpcre  is  built,  containing  functions
900       that  take  string  arguments  contained in vectors of bytes, either as
901       single-byte characters, or interpreted as UTF-8 strings. You  can  also
902       build  a  separate library, called libpcre16, in which strings are con-
903       tained in vectors of 16-bit data units and interpreted either  as  sin-
904       gle-unit characters or UTF-16 strings, by adding
905
906         --enable-pcre16
907
908       to  the  configure  command.  You  can  also build yet another separate
909       library, called libpcre32, in which strings are contained in vectors of
910       32-bit  data  units and interpreted either as single-unit characters or
911       UTF-32 strings, by adding
912
913         --enable-pcre32
914
915       to the configure command. If you do not want the 8-bit library, add
916
917         --disable-pcre8
918
919       as well. At least one of the three libraries must be built.  Note  that
920       the  C++  and  POSIX  wrappers are for the 8-bit library only, and that
921       pcregrep is an 8-bit program. None of these are  built  if  you  select
922       only the 16-bit or 32-bit libraries.
923
924
925BUILDING SHARED AND STATIC LIBRARIES
926
927       The  Autotools  PCRE building process uses libtool to build both shared
928       and static libraries by default. You  can  suppress  one  of  these  by
929       adding one of
930
931         --disable-shared
932         --disable-static
933
934       to the configure command, as required.
935
936
937C++ SUPPORT
938
939       By  default,  if the 8-bit library is being built, the configure script
940       will search for a C++ compiler and C++ header files. If it finds  them,
941       it  automatically  builds  the C++ wrapper library (which supports only
942       8-bit strings). You can disable this by adding
943
944         --disable-cpp
945
946       to the configure command.
947
948
949UTF-8, UTF-16 AND UTF-32 SUPPORT
950
951       To build PCRE with support for UTF Unicode character strings, add
952
953         --enable-utf
954
955       to the configure command. This setting applies to all three  libraries,
956       adding  support  for  UTF-8 to the 8-bit library, support for UTF-16 to
957       the  16-bit  library,  and  support  for  UTF-32  to  the  32-bit
958       library.  There  are no separate options for enabling UTF-8, UTF-16 and
959       UTF-32 independently because that would allow ridiculous settings  such
960       as  requesting UTF-16 support while building only the 8-bit library. It
961       is not possible to build one library with UTF support and another with-
962       out  in the same configuration. (For backwards compatibility, --enable-
963       utf8 is a synonym of --enable-utf.)
964
965       Of itself, this setting does not make  PCRE  treat  strings  as  UTF-8,
966       UTF-16  or UTF-32. As well as compiling PCRE with this option, you also
967       have to set the  PCRE_UTF8,  PCRE_UTF16  or  PCRE_UTF32  option  (as
968       appropriate) when you call one of the pattern compiling functions.
969
970       If  you  set --enable-utf when compiling in an EBCDIC environment, PCRE
971       expects its input to be either ASCII or UTF-8 (depending  on  the  run-
972       time option). It is not possible to support both EBCDIC and UTF-8 codes
973       in the same version of  the  library.  Consequently,  --enable-utf  and
974       --enable-ebcdic are mutually exclusive.
975
976
977UNICODE CHARACTER PROPERTY SUPPORT
978
979       UTF  support allows the libraries to process character codepoints up to
980       0x10ffff in the strings that they handle. On its own, however, it  does
981       not provide any facilities for accessing the properties of such charac-
982       ters. If you want to be able to use the pattern escapes \P, \p, and \X,
983       which refer to Unicode character properties, you must add
984
985         --enable-unicode-properties
986
987       to  the  configure  command. This implies UTF support, even if you have
988       not explicitly requested it.
989
990       Including Unicode property support adds around 30K  of  tables  to  the
991       PCRE  library.  Only  the general category properties such as Lu and Nd
992       are supported. Details are given in the pcrepattern documentation.
993
994
995JUST-IN-TIME COMPILER SUPPORT
996
997       Just-in-time compiler support is included in the build by specifying
998
999         --enable-jit
1000
1001       This support is available only for certain hardware  architectures.  If
1002       this  option  is  set  for  an unsupported architecture, a compile time
1003       error occurs.  See the pcrejit documentation for a  discussion  of  JIT
1004       usage. When JIT support is enabled, pcregrep automatically makes use of
1005       it, unless you add
1006
1007         --disable-pcregrep-jit
1008
1009       to the "configure" command.
1010
1011
1012CODE VALUE OF NEWLINE
1013
1014       By default, PCRE interprets the linefeed (LF) character  as  indicating
1015       the  end  of  a line. This is the normal newline character on Unix-like
1016       systems. You can compile PCRE to use carriage return (CR)  instead,  by
1017       adding
1018
1019         --enable-newline-is-cr
1020
1021       to  the  configure  command.  There  is  also  a --enable-newline-is-lf
1022       option, which explicitly specifies linefeed as the newline character.
1023
1024       Alternatively, you can specify that line endings are to be indicated by
1025       the two character sequence CRLF. If you want this, add
1026
1027         --enable-newline-is-crlf
1028
1029       to the configure command. There is a fourth option, specified by
1030
1031         --enable-newline-is-anycrlf
1032
1033       which  causes  PCRE  to recognize any of the three sequences CR, LF, or
1034       CRLF as indicating a line ending. Finally, a fifth option, specified by
1035
1036         --enable-newline-is-any
1037
1038       causes PCRE to recognize any Unicode newline sequence.
1039
1040       Whatever line ending convention is selected when PCRE is built  can  be
1041       overridden  when  the library functions are called. At build time it is
1042       conventional to use the standard for your operating system.
1043
1044
1045WHAT \R MATCHES
1046
1047       By default, the sequence \R in a pattern matches  any  Unicode  newline
1048       sequence,  whatever  has  been selected as the line ending sequence. If
1049       you specify
1050
1051         --enable-bsr-anycrlf
1052
1053       the default is changed so that \R matches only CR, LF, or  CRLF.  What-
1054       ever  is selected when PCRE is built can be overridden when the library
1055       functions are called.
1056
1057
1058POSIX MALLOC USAGE
1059
1060       When the 8-bit library is called through the POSIX interface  (see  the
1061       pcreposix  documentation),  additional  working storage is required for
1062       holding the pointers to capturing  substrings,  because  PCRE  requires
1063       three integers per substring, whereas the POSIX interface provides only
1064       two. If the number of expected substrings is small, the  wrapper  func-
1065       tion  uses  space  on the stack, because this is faster than using mal-
1066       loc() for each call. The default threshold above which the stack is  no
1067       longer used is 10; it can be changed by adding a setting such as
1068
1069         --with-posix-malloc-threshold=20
1070
1071       to the configure command.
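
       The stack-or-heap decision described above might be sketched as
       follows. This is an illustration only: SKETCH_MALLOC_THRESHOLD and
       run_with_ovector() are hypothetical names, the loop is a stand-in
       for the real matching call, and the actual wrapper code may differ
       in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed to mirror the default threshold of 10 mentioned in the text. */
#define SKETCH_MALLOC_THRESHOLD 10

/* Obtain working storage of three integers per substring, on the stack
   when nmatch is small and from malloc() otherwise. Returns 0 on
   success, -1 if malloc() fails. *used_heap reports the choice made. */
static int run_with_ovector(size_t nmatch, int *used_heap)
{
    int small[SKETCH_MALLOC_THRESHOLD * 3];
    int *ovector = small;
    size_t i, count = nmatch * 3;

    assert(used_heap != NULL);
    *used_heap = 0;

    if (nmatch > SKETCH_MALLOC_THRESHOLD) {
        ovector = malloc(count * sizeof *ovector);
        if (ovector == NULL) return -1;
        *used_heap = 1;
    }

    /* Stand-in for the matching call that would fill the vector. */
    for (i = 0; i < count; i++) ovector[i] = -1;

    if (*used_heap) free(ovector);
    return 0;
}
```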
1072
1073
1074HANDLING VERY LARGE PATTERNS
1075
1076       Within  a  compiled  pattern,  offset values are used to point from one
1077       part to another (for example, from an opening parenthesis to an  alter-
1078       nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
1079       two-byte values are used for these offsets, leading to a  maximum  size
1080       for  a compiled pattern of around 64K. This is sufficient to handle all
1081       but the most gigantic patterns.  Nevertheless, some people do  want  to
1082       process  truly  enormous patterns, so it is possible to compile PCRE to
1083       use three-byte or four-byte offsets by adding a setting such as
1084
1085         --with-link-size=3
1086
1087       to the configure command. The value given must be 2, 3, or 4.  For  the
1088       16-bit  library,  a  value of 3 is rounded up to 4. In these libraries,
1089       using longer offsets slows down the operation of PCRE because it has to
1090       load  additional  data  when  handling them. For the 32-bit library the
1091       value is always 4 and cannot be overridden; the value  of  --with-link-
1092       size is ignored.
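
       The rules in the preceding paragraph reduce to a little arithmetic,
       sketched here in C. Both function names are hypothetical, and the
       sketch only restates what the text says: a 16-bit library rounds 3
       up to 4, a 32-bit library always uses 4, and an offset of n bytes
       can address at most 2^(8n) - 1.

```c
#include <assert.h>

/* Link size actually used, given the library's data unit width in bits
   (8, 16 or 32) and the value requested with --with-link-size. */
static int effective_link_size(int unit_bits, int requested)
{
    assert(unit_bits == 8 || unit_bits == 16 || unit_bits == 32);
    assert(requested >= 2 && requested <= 4);
    if (unit_bits == 32) return 4;                  /* request ignored */
    if (unit_bits == 16 && requested == 3) return 4; /* rounded up */
    return requested;
}

/* Largest offset that fits in an offset field of link_bytes bytes. */
static unsigned long long max_link_offset(int link_bytes)
{
    assert(link_bytes >= 1 && link_bytes <= 4);
    return (1ULL << (8 * link_bytes)) - 1ULL;
}
```

       For the default two-byte offsets this gives 65535, which is the
       "around 64K" limit mentioned above.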
1093
1094
1095AVOIDING EXCESSIVE STACK USAGE
1096
1097       When matching with the pcre_exec() function, PCRE implements backtrack-
1098       ing by making recursive calls to an internal function  called  match().
1099       In  environments  where  the size of the stack is limited, this can se-
1100       verely limit PCRE's operation. (The Unix environment does  not  usually
1101       suffer from this problem, but it may sometimes be necessary to increase
1102       the maximum stack size.  There is a discussion in the  pcrestack  docu-
1103       mentation.)  An alternative approach to recursion that uses memory from
1104       the heap to remember data, instead of using recursive  function  calls,
1105       has  been  implemented to work round the problem of limited stack size.
1106       If you want to build a version of PCRE that works this way, add
1107
1108         --disable-stack-for-recursion
1109
1110       to the configure command. With this configuration, PCRE  will  use  the
1111       pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
1112       ment functions. By default these point to malloc() and free(), but  you
1113       can replace the pointers so that your own functions are used instead.
1114
1115       Separate  functions  are  provided  rather  than  using pcre_malloc and
1116       pcre_free because the  usage  is  very  predictable:  the  block  sizes
1117       requested  are  always  the  same,  and  the blocks are always freed in
1118       reverse order. A calling program might be able to  implement  optimized
1119       functions  that  perform  better  than  malloc()  and free(). PCRE runs
1120       noticeably more slowly when built in this way. This option affects only
1121       the pcre_exec() function; it is not relevant for pcre_dfa_exec().
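
       As an illustration of how a calling program might exploit this
       predictable usage, the following sketch keeps freed blocks on a
       last-in, first-out list and hands them out again instead of calling
       malloc() each time. The names lifo_malloc() and lifo_free() are
       hypothetical. The sketch is not thread-safe and never returns
       memory to the system, so it should be read as a starting point
       rather than a recommended implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every block. The extra union members are
   there only to keep the user data suitably aligned. */
union frame_hdr {
    struct {
        union frame_hdr *next;  /* next block on the free list */
        size_t size;            /* usable size of this block */
    } link;
    long double align_ld;
    long long align_ll;
};

static union frame_hdr *free_list = NULL;

/* Reuse the most recently freed block if it is large enough. */
static void *lifo_malloc(size_t size)
{
    union frame_hdr *h = free_list;
    if (h != NULL && h->link.size >= size) {
        free_list = h->link.next;
    } else {
        h = malloc(sizeof *h + size);
        if (h == NULL) return NULL;
        h->link.size = size;
    }
    return h + 1;
}

/* Push the block onto the free list instead of releasing it. */
static void lifo_free(void *block)
{
    union frame_hdr *h;
    if (block == NULL) return;
    h = (union frame_hdr *)block - 1;
    h->link.next = free_list;
    free_list = h;
}
```

       Functions with these signatures could be assigned to the
       pcre_stack_malloc and pcre_stack_free variables before matching.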
1122
1123
1124LIMITING PCRE RESOURCE USAGE
1125
1126       Internally,  PCRE has a function called match(), which it calls repeat-
1127       edly  (sometimes  recursively)  when  matching  a  pattern   with   the
1128       pcre_exec()  function.  By controlling the maximum number of times this
1129       function may be called during a single matching operation, a limit  can
1130       be  placed  on  the resources used by a single call to pcre_exec(). The
1131       limit can be changed at run time, as described in the pcreapi  documen-
1132       tation.  The default is 10 million, but this can be changed by adding a
1133       setting such as
1134
1135         --with-match-limit=500000
1136
1137       to  the  configure  command.  This  setting  has  no  effect   on   the
1138       pcre_dfa_exec() matching function.
1139
1140       In  some  environments  it is desirable to limit the depth of recursive
1141       calls of match() more strictly than the total number of calls, in order
1142       to  restrict  the maximum amount of stack (or heap, if --disable-stack-
1143       for-recursion is specified) that is used. A second limit controls this;
1144       it  defaults  to  the  value  that is set for --with-match-limit, which
1145       imposes no additional constraints. However, you can set a  lower  limit
1146       by adding, for example,
1147
1148         --with-match-limit-recursion=10000
1149
1150       to  the  configure  command.  This  value can also be overridden at run
1151       time.
1152
1153
1154CREATING CHARACTER TABLES AT BUILD TIME
1155
1156       PCRE uses fixed tables for processing characters whose code values  are
1157       less  than 256. By default, PCRE is built with a set of tables that are
1158       distributed in the file pcre_chartables.c.dist. These  tables  are  for
1159       ASCII codes only. If you add
1160
1161         --enable-rebuild-chartables
1162
1163       to  the  configure  command, the distributed tables are no longer used.
1164       Instead, a program called dftables is compiled and  run.  This  outputs
1165       the source for a new set of tables, created in the default locale of your
1166       C run-time system. (This method of replacing the tables does  not  work
1167       if  you are cross compiling, because dftables is run on the local host.
1168       If you need to create alternative tables when cross compiling, you will
1169       have to do so "by hand".)
1170
1171
1172USING EBCDIC CODE
1173
1174       PCRE  assumes  by  default that it will run in an environment where the
1175       character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
1176       This  is  the  case for most computer operating systems. PCRE can, how-
1177       ever, be compiled to run in an EBCDIC environment by adding
1178
1179         --enable-ebcdic
1180
1181       to the configure command. This setting implies --enable-rebuild-charta-
1182       bles.  You  should  only  use  it if you know that you are in an EBCDIC
1183       environment (for example,  an  IBM  mainframe  operating  system).  The
1184       --enable-ebcdic option is incompatible with --enable-utf.
1185
1186       The EBCDIC character that corresponds to an ASCII LF is assumed to have
1187       the value 0x15 by default. However, in some EBCDIC  environments,  0x25
1188       is used. In such an environment you should use
1189
1190         --enable-ebcdic-nl25
1191
1192       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
1193       has the same value as in ASCII, namely, 0x0d.  Whichever  of  0x15  and
1194       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
1195       acter (which, in Unicode, is 0x85).
1196
1197       The options that select newline behaviour, such as --enable-newline-is-
1198       cr, and equivalent run-time options, refer to these character values in
1199       an EBCDIC environment.
1200
1201
1202PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
1203
1204       By default, pcregrep reads all files as plain text. You can build it so
1205       that it recognizes files whose names end in .gz or .bz2, and reads them
1206       with libz or libbz2, respectively, by adding one or both of
1207
1208         --enable-pcregrep-libz
1209         --enable-pcregrep-libbz2
1210
1211       to the configure command. These options naturally require that the rel-
1212       evant  libraries  are installed on your system. Configuration will fail
1213       if they are not.
1214
1215
1216PCREGREP BUFFER SIZE
1217
1218       pcregrep uses an internal buffer to hold a "window" on the file  it  is
1219       scanning, in order to be able to output "before" and "after" lines when
1220       it finds a match. The size of the buffer is controlled by  a  parameter
1221       whose default value is 20K. The buffer itself is three times this size,
1222       but because of the way it is used for holding "before" lines, the long-
1223       est  line  that  is guaranteed to be processable is the parameter size.
1224       You can change the default parameter value by adding, for example,
1225
1226         --with-pcregrep-bufsize=50K
1227
1228       to the configure command. The caller of pcregrep can, however, override
1229       this value by specifying a run-time option.
1230
1231
1232PCRETEST OPTION FOR LIBREADLINE SUPPORT
1233
1234       If you add
1235
1236         --enable-pcretest-libreadline
1237
1238       to  the  configure  command,  pcretest  is  linked with the libreadline
1239       library, and when its input is from a terminal, it reads it  using  the
1240       readline() function. This provides line-editing and history facilities.
1241       Note that libreadline is GPL-licensed, so if you distribute a binary of
1242       pcretest linked in this way, there may be licensing issues.
1243
1244       Setting  this  option  causes  the -lreadline option to be added to the
1245       pcretest build. In many operating environments with a  system-installed
1246       libreadline this is sufficient. However, in some environments (e.g.  if
1247       an unmodified distribution version of readline is in use),  some  extra
1248       configuration  may  be necessary. The INSTALL file for libreadline says
1249       this:
1250
1251         "Readline uses the termcap functions, but does not link with the
1252         termcap or curses library itself, allowing applications which link
1253         with readline the to choose an appropriate library."
1254
1255       If your environment has not been set up so that an appropriate  library
1256       is automatically included, you may need to add something like
1257
1258         LIBS="-lncurses"
1259
1260       immediately before the configure command.
1261
1262
1263DEBUGGING WITH VALGRIND SUPPORT
1264
1265       By adding the
1266
1267         --enable-valgrind
1268
1269       option  to  the  configure command, PCRE will use valgrind annotations
1270       to mark certain memory regions as  unaddressable.  This  allows  it  to
1271       detect invalid memory accesses, and is mostly useful for debugging PCRE
1272       itself.
1273
1274
1275CODE COVERAGE REPORTING
1276
1277       If your C compiler is gcc, you can build a version  of  PCRE  that  can
1278       generate a code coverage report for its test suite. To enable this, you
1279       must install lcov version 1.6 or above. Then specify
1280
1281         --enable-coverage
1282
1283       to the configure command and build PCRE in the usual way.
1284
1285       Note that using ccache (a caching C compiler) is incompatible with code
1286       coverage  reporting. If you have configured ccache to run automatically
1287       on your system, you must set the environment variable
1288
1289         CCACHE_DISABLE=1
1290
1291       before running make to build PCRE, so that ccache is not used.
1292
1293       When --enable-coverage is used, the  following  additional  targets  are
1294       added to the Makefile:
1295
1296         make coverage
1297
1298       This  creates  a  fresh  coverage report for the PCRE test suite. It is
1299       equivalent to running "make coverage-reset", "make  coverage-baseline",
1300       "make check", and then "make coverage-report".
1301
1302         make coverage-reset
1303
1304       This zeroes the coverage counters, but does nothing else.
1305
1306         make coverage-baseline
1307
1308       This captures baseline coverage information.
1309
1310         make coverage-report
1311
1312       This creates the coverage report.
1313
1314         make coverage-clean-report
1315
1316       This  removes the generated coverage report without cleaning the cover-
1317       age data itself.
1318
1319         make coverage-clean-data
1320
1321       This removes the captured coverage data without removing  the  coverage
1322       files created at compile time (*.gcno).
1323
1324         make coverage-clean
1325
1326       This  cleans all coverage data including the generated coverage report.
1327       For more information about code coverage, see the gcov and  lcov  docu-
1328       mentation.
1329
1330
1331SEE ALSO
1332
1333       pcreapi(3), pcre16, pcre32, pcre_config(3).
1334
1335
1336AUTHOR
1337
1338       Philip Hazel
1339       University Computing Service
1340       Cambridge CB2 3QH, England.
1341
1342
1343REVISION
1344
1345       Last updated: 12 May 2013
1346       Copyright (c) 1997-2013 University of Cambridge.
1347------------------------------------------------------------------------------
1348
1349
1350PCREMATCHING(3)            Library Functions Manual            PCREMATCHING(3)
1351
1352
1353
1354NAME
1355       PCRE - Perl-compatible regular expressions
1356
1357PCRE MATCHING ALGORITHMS
1358
1359       This document describes the two different algorithms that are available
1360       in PCRE for matching a compiled regular expression against a given sub-
1361       ject  string.  The  "standard"  algorithm  is  the  one provided by the
1362       pcre_exec(), pcre16_exec() and pcre32_exec() functions. These  work  in
1363       the same way as Perl's matching function, and provide a Perl-compatible
1364       matching  operation.   The  just-in-time  (JIT)  optimization  that  is
1365       described  in  the pcrejit documentation is compatible with these func-
1366       tions.
1367
1368       An  alternative  algorithm  is   provided   by   the   pcre_dfa_exec(),
1369       pcre16_dfa_exec()  and  pcre32_dfa_exec()  functions; they operate in a
1370       different way, and are not Perl-compatible. This alternative has advan-
1371       tages and disadvantages compared with the standard algorithm, and these
1372       are described below.
1373
1374       When there is only one possible way in which a given subject string can
1375       match  a pattern, the two algorithms give the same answer. A difference
1376       arises, however, when there are multiple possibilities. For example, if
1377       the pattern
1378
1379         ^<.*>
1380
1381       is matched against the string
1382
1383         <something> <something else> <something further>
1384
1385       there are three possible answers. The standard algorithm finds only one
1386       of them, whereas the alternative algorithm finds all three.
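
       The three answers can be enumerated by hand for this particular
       pattern. The sketch below hard-codes ^<.*> rather than calling
       PCRE, and the function name is hypothetical. It assumes that "."
       does not match a newline, and it lists the possible match lengths
       longest first, so the first entry is the one a greedy depth-first
       search would be expected to report.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Store the end offset of every possible match of ^<.*> in ends[],
   longest first, and return how many were stored (at most max_ends). */
static int angle_match_ends(const char *subject, int *ends, int max_ends)
{
    int count = 0;
    size_t i, limit;

    assert(subject != NULL && ends != NULL);
    if (subject[0] != '<') return 0;

    /* "." is assumed not to match a newline, so stop at the first. */
    limit = strcspn(subject, "\n");

    /* Every '>' after the opening '<' ends one possible match. */
    for (i = limit; i > 1 && count < max_ends; i--) {
        if (subject[i - 1] == '>') ends[count++] = (int)i;
    }
    return count;
}
```

       For the subject string shown above this yields three end offsets,
       one for each closing angle bracket.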
1387
1388
1389REGULAR EXPRESSIONS AS TREES
1390
1391       The set of strings that are matched by a regular expression can be rep-
1392       resented  as  a  tree structure. An unlimited repetition in the pattern
1393       makes the tree of infinite size, but it is still a tree.  Matching  the
1394       pattern  to a given subject string (from a given starting point) can be
1395       thought of as a search of the tree.  There are two  ways  to  search  a
1396       tree:  depth-first  and  breadth-first, and these correspond to the two
1397       matching algorithms provided by PCRE.
1398
1399
1400THE STANDARD MATCHING ALGORITHM
1401
1402       In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
1403       sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
1404       depth-first search of the pattern tree. That is, it  proceeds  along  a
1405       single path through the tree, checking that the subject matches what is
1406       required. When there is a mismatch, the algorithm  tries  any  alterna-
1407       tives  at  the  current point, and if they all fail, it backs up to the
1408       previous branch point in the  tree,  and  tries  the  next  alternative
1409       branch  at  that  level.  This often involves backing up (moving to the
1410       left) in the subject string as well.  The  order  in  which  repetition
1411       branches  are  tried  is controlled by the greedy or ungreedy nature of
1412       the quantifier.
1413
1414       If a leaf node is reached, a matching string has  been  found,  and  at
1415       that  point the algorithm stops. Thus, if there is more than one possi-
1416       ble match, this algorithm returns the first one that it finds.  Whether
1417       this  is the shortest, the longest, or some intermediate length depends
1418       on the way the greedy and ungreedy repetition quantifiers are specified
1419       in the pattern.
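
       The effect of greediness on which match is found first can be sketched
       with a toy matcher. This is a minimal illustration and not PCRE code: it
       hard-wires the pattern ^<.*> from the introduction, treats dot as
       matching any character, and the function name
       first_match_depth_first() is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Toy depth-first matcher for the single pattern ^<.*> (not PCRE code).
   Returns the length of the first match found, or -1 if there is none.
   greedy != 0 models <.*>, greedy == 0 models <.*?>. */
long first_match_depth_first(const char *subject, int greedy)
{
    size_t len = strlen(subject);

    if (len < 2 || subject[0] != '<')
        return -1;

    if (greedy) {
        /* .* first takes everything, then backs up one character at a
           time until the '>' that follows it can match. */
        for (size_t end = len; end >= 2; end--)
            if (subject[end - 1] == '>')
                return (long)end;
    } else {
        /* .*? first takes nothing, then extends one character at a time. */
        for (size_t end = 2; end <= len; end++)
            if (subject[end - 1] == '>')
                return (long)end;
    }
    return -1;
}
```

       For the subject "<something> <something else> <something further>" the
       greedy variant stops at the longest of the three candidates and the
       ungreedy variant at the shortest; neither reports the other two.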
1420
1421       Because  it  ends  up  with a single path through the tree, it is rela-
1422       tively straightforward for this algorithm to keep  track  of  the  sub-
1423       strings  that  are  matched  by portions of the pattern in parentheses.
1424       This provides support for capturing parentheses and back references.
1425
1426
1427THE ALTERNATIVE MATCHING ALGORITHM
1428
1429       This algorithm conducts a breadth-first search of  the  tree.  Starting
1430       from  the  first  matching  point  in the subject, it scans the subject
1431       string from left to right, once, character by character, and as it does
1432       this,  it remembers all the paths through the tree that represent valid
1433       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
1434       though  it is not implemented as a traditional finite state machine (it
1435       keeps multiple states active simultaneously).
1436
1437       Although the general principle of this matching algorithm  is  that  it
1438       scans  the subject string only once, without backtracking, there is one
1439       exception: when a lookaround assertion is encountered,  the  characters
1440       following  or  preceding  the  current  point  have to be independently
1441       inspected.
1442
1443       The scan continues until either the end of the subject is  reached,  or
1444       there  are  no more unterminated paths. At this point, terminated paths
1445       represent the different matching possibilities (if there are none,  the
1446       match  has  failed).   Thus,  if there is more than one possible match,
1447       this algorithm finds all of them, and in particular, it finds the long-
1448       est.  The  matches are returned in decreasing order of length. There is
1449       an option to stop the algorithm after the first match (which is  neces-
1450       sarily the shortest) is found.
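
       The single left-to-right scan can be sketched in the same toy setting.
       The following is an illustration under stated assumptions, not the
       implementation used by pcre_dfa_exec(): it handles only the pattern
       ^<.*>, treats dot as matching any character, and the name
       all_matches_single_pass() is hypothetical.

```c
#include <stddef.h>

/* Toy breadth-first matcher for the single pattern ^<.*> (not PCRE code).
   Scans the subject once, without backing up, and records every offset at
   which a path terminates. Up to max match lengths are stored in
   lengths[], longest first; the return value is the number of matches
   found, which may exceed max. */
size_t all_matches_single_pass(const char *subject, size_t *lengths,
                               size_t max)
{
    size_t count = 0;

    if (subject[0] != '<')
        return 0;

    /* The ".*" path stays active on every character; each '>' also
       terminates one path, giving a match that ends at that character. */
    for (size_t i = 1; subject[i] != '\0'; i++) {
        if (subject[i] != '>')
            continue;
        if (max > 0) {
            size_t stored = count < max ? count : max;
            /* When the list is full, the shortest entry is dropped. */
            size_t keep = stored < max ? stored : max - 1;
            for (size_t k = keep; k > 0; k--)
                lengths[k] = lengths[k - 1];
            lengths[0] = i + 1;     /* later matches are longer */
        }
        count++;
    }
    return count;
}
```

       Each new match is inserted at the front of the list because a match
       found later in the scan is necessarily longer, so the list stays in
       decreasing order of length.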
1451
1452       Note that all the matches that are found start at the same point in the
1453       subject. If the pattern
1454
1455         cat(er(pillar)?)?
1456
1457       is matched against the string "the caterpillar catchment",  the  result
1458       will  be the three strings "caterpillar", "cater", and "cat" that start
1459       at the fifth character of the subject. The algorithm does not automati-
1460       cally move on to find matches that start at later positions.
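
       A caller that wants matches at later positions therefore has to advance
       the starting offset itself. The sketch below shows that outer loop
       around a toy stand-in for the matcher. It is not PCRE code: the pattern
       cat(er(pillar)?)? is hard-wired, and matches_at() and
       find_match_starts() are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for matching cat(er(pillar)?)? at one start offset.
   start must not exceed strlen(subject). Stores the match lengths in
   lengths[], longest first, and returns how many there are (0 to 3). */
size_t matches_at(const char *subject, size_t start, size_t lengths[3])
{
    static const char full[] = "caterpillar";
    static const size_t stops[] = { 3, 5, 11 };  /* cat, cater, caterpillar */
    size_t found[3];
    size_t count = 0;

    /* One pass over the subject; a path terminates at each stop length. */
    for (size_t i = 0; i < sizeof full - 1; i++) {
        if (subject[start + i] != full[i])
            break;                  /* also stops at the terminating zero */
        if (i + 1 == stops[count])
            found[count++] = i + 1;
    }
    for (size_t k = 0; k < count; k++)
        lengths[k] = found[count - 1 - k];
    return count;
}

/* Tries every start offset in turn. Stores up to max offsets that give at
   least one match in starts[] and returns the number stored. */
size_t find_match_starts(const char *subject, size_t *starts, size_t max)
{
    size_t len = strlen(subject);
    size_t lengths[3];
    size_t n = 0;

    for (size_t start = 0; start < len; start++)
        if (n < max && matches_at(subject, start, lengths) > 0)
            starts[n++] = start;
    return n;
}
```

       For "the caterpillar catchment" the loop reports offsets 4 and 16. This
       toy advances one character at a time; an application would more
       commonly resume after the end of the match it chose.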
1461
1462       PCRE's  "auto-possessification" optimization usually applies to charac-
1463       ter repeats at the end of a pattern (as well as internally). For  exam-
1464       ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
1465       is no point even considering the possibility of backtracking  into  the
1466       repeated  digits.  For  DFA matching, this means that only one possible
1467       match is found. If you really do want multiple matches in  such  cases,
1468       either use an ungreedy repeat ("a\d+?") or set the PCRE_NO_AUTO_POSSESS
1469       option when compiling.
1470
       There are a number of features of PCRE regular expressions that are not
       supported, or that behave differently, when the alternative matching
       algorithm is used. They are as follows:
1473
1474       1.  Because  the  algorithm  finds  all possible matches, the greedy or
1475       ungreedy nature of repetition quantifiers is not relevant.  Greedy  and
1476       ungreedy quantifiers are treated in exactly the same way. However, pos-
1477       sessive quantifiers can make a difference when what follows could  also
1478       match what is quantified, for example in a pattern like this:
1479
1480         ^a++\w!
1481
1482       This  pattern matches "aaab!" but not "aaa!", which would be matched by
1483       a non-possessive quantifier. Similarly, if an atomic group is  present,
1484       it  is matched as if it were a standalone pattern at the current point,
1485       and the longest match is then "locked in" for the rest of  the  overall
1486       pattern.
1487
1488       2. When dealing with multiple paths through the tree simultaneously, it
1489       is not straightforward to keep track of  captured  substrings  for  the
1490       different  matching  possibilities,  and  PCRE's implementation of this
1491       algorithm does not attempt to do this. This means that no captured sub-
1492       strings are available.
1493
1494       3.  Because no substrings are captured, back references within the pat-
1495       tern are not supported, and cause errors if encountered.
1496
1497       4. For the same reason, conditional expressions that use  a  backrefer-
1498       ence  as  the  condition or test for a specific group recursion are not
1499       supported.
1500
1501       5. Because many paths through the tree may be  active,  the  \K  escape
1502       sequence, which resets the start of the match when encountered (but may
1503       be on some paths and not on others), is not  supported.  It  causes  an
1504       error if encountered.
1505
1506       6.  Callouts  are  supported, but the value of the capture_top field is
1507       always 1, and the value of the capture_last field is always -1.
1508
1509       7. The \C escape sequence, which (in  the  standard  algorithm)  always
1510       matches  a  single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is
1511       not supported in these modes, because the alternative  algorithm  moves
1512       through the subject string one character (not data unit) at a time, for
1513       all active paths through the tree.
1514
1515       8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
1516       are  not  supported.  (*FAIL)  is supported, and behaves like a failing
1517       negative assertion.
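
       The possessive-quantifier example in item 1 can be checked with a small
       hand-written matcher. This is a sketch of the semantics only, not PCRE
       code; both function names are hypothetical, and \w is approximated by
       ASCII letters, digits, and underscore.

```c
#include <ctype.h>
#include <stddef.h>

static int is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Models ^a++\w! : the run of 'a' is taken whole and never given back. */
int match_possessive(const char *s)
{
    size_t i = 0;

    while (s[i] == 'a')
        i++;
    if (i == 0)
        return 0;
    return is_word_char(s[i]) && s[i + 1] == '!';
}

/* Models ^a+\w! : the run of 'a' may be shortened so that \w can match
   one of the 'a' characters. */
int match_backtracking(const char *s)
{
    size_t run = 0;

    while (s[run] == 'a')
        run++;
    for (size_t n = run; n >= 1; n--)
        if (is_word_char(s[n]) && s[n + 1] == '!')
            return 1;
    return 0;
}
```

       With these definitions, "aaab!" is accepted by both functions, while
       "aaa!" is accepted only by match_backtracking(), because the possessive
       run leaves no 'a' for \w to match.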
1518
1519
1520ADVANTAGES OF THE ALTERNATIVE ALGORITHM
1521
1522       Using the alternative matching algorithm provides the following  advan-
1523       tages:
1524
       1. All possible matches (at a single point in the subject) are
       automatically found, and in particular, the longest match is found. To
       find more than one match using the standard algorithm, you have to
       resort to workarounds involving callouts.
1529
1530       2. Because the alternative algorithm  scans  the  subject  string  just
1531       once, and never needs to backtrack (except for lookbehinds), it is pos-
1532       sible to pass very long subject strings to  the  matching  function  in
1533       several pieces, checking for partial matching each time. Although it is
1534       possible to do multi-segment matching using the standard  algorithm  by
1535       retaining  partially  matched  substrings,  it is more complicated. The
1536       pcrepartial documentation gives details of partial  matching  and  dis-
1537       cusses multi-segment matching.
1538
1539
1540DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
1541
1542       The alternative algorithm suffers from a number of disadvantages:
1543
1544       1.  It  is  substantially  slower  than the standard algorithm. This is
1545       partly because it has to search for all possible matches, but  is  also
1546       because it is less susceptible to optimization.
1547
1548       2. Capturing parentheses and back references are not supported.
1549
1550       3. Although atomic groups are supported, their use does not provide the
1551       performance advantage that it does for the standard algorithm.
1552
1553
1554AUTHOR
1555
1556       Philip Hazel
1557       University Computing Service
1558       Cambridge CB2 3QH, England.
1559
1560
1561REVISION
1562
1563       Last updated: 12 November 2013
1564       Copyright (c) 1997-2012 University of Cambridge.
1565------------------------------------------------------------------------------
1566
1567
1568PCREAPI(3)                 Library Functions Manual                 PCREAPI(3)
1569
1570
1571
1572NAME
1573       PCRE - Perl-compatible regular expressions
1574
1575       #include <pcre.h>
1576
1577
1578PCRE NATIVE API BASIC FUNCTIONS
1579
1580       pcre *pcre_compile(const char *pattern, int options,
1581            const char **errptr, int *erroffset,
1582            const unsigned char *tableptr);
1583
1584       pcre *pcre_compile2(const char *pattern, int options,
1585            int *errorcodeptr,
1586            const char **errptr, int *erroffset,
1587            const unsigned char *tableptr);
1588
1589       pcre_extra *pcre_study(const pcre *code, int options,
1590            const char **errptr);
1591
1592       void pcre_free_study(pcre_extra *extra);
1593
1594       int pcre_exec(const pcre *code, const pcre_extra *extra,
1595            const char *subject, int length, int startoffset,
1596            int options, int *ovector, int ovecsize);
1597
1598       int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
1599            const char *subject, int length, int startoffset,
1600            int options, int *ovector, int ovecsize,
1601            int *workspace, int wscount);
1602
1603
1604PCRE NATIVE API STRING EXTRACTION FUNCTIONS
1605
1606       int pcre_copy_named_substring(const pcre *code,
1607            const char *subject, int *ovector,
1608            int stringcount, const char *stringname,
1609            char *buffer, int buffersize);
1610
1611       int pcre_copy_substring(const char *subject, int *ovector,
1612            int stringcount, int stringnumber, char *buffer,
1613            int buffersize);
1614
1615       int pcre_get_named_substring(const pcre *code,
1616            const char *subject, int *ovector,
1617            int stringcount, const char *stringname,
1618            const char **stringptr);
1619
1620       int pcre_get_stringnumber(const pcre *code,
1621            const char *name);
1622
1623       int pcre_get_stringtable_entries(const pcre *code,
1624            const char *name, char **first, char **last);
1625
1626       int pcre_get_substring(const char *subject, int *ovector,
1627            int stringcount, int stringnumber,
1628            const char **stringptr);
1629
1630       int pcre_get_substring_list(const char *subject,
1631            int *ovector, int stringcount, const char ***listptr);
1632
1633       void pcre_free_substring(const char *stringptr);
1634
1635       void pcre_free_substring_list(const char **stringptr);
1636
1637
1638PCRE NATIVE API AUXILIARY FUNCTIONS
1639
1640       int pcre_jit_exec(const pcre *code, const pcre_extra *extra,
1641            const char *subject, int length, int startoffset,
1642            int options, int *ovector, int ovecsize,
1643            pcre_jit_stack *jstack);
1644
1645       pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
1646
1647       void pcre_jit_stack_free(pcre_jit_stack *stack);
1648
1649       void pcre_assign_jit_stack(pcre_extra *extra,
1650            pcre_jit_callback callback, void *data);
1651
1652       const unsigned char *pcre_maketables(void);
1653
1654       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1655            int what, void *where);
1656
1657       int pcre_refcount(pcre *code, int adjust);
1658
1659       int pcre_config(int what, void *where);
1660
1661       const char *pcre_version(void);
1662
1663       int pcre_pattern_to_host_byte_order(pcre *code,
1664            pcre_extra *extra, const unsigned char *tables);
1665
1666
1667PCRE NATIVE API INDIRECTED FUNCTIONS
1668
1669       void *(*pcre_malloc)(size_t);
1670
1671       void (*pcre_free)(void *);
1672
1673       void *(*pcre_stack_malloc)(size_t);
1674
1675       void (*pcre_stack_free)(void *);
1676
1677       int (*pcre_callout)(pcre_callout_block *);
1678
1679       int (*pcre_stack_guard)(void);
1680
1681
1682PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
1683
1684       As  well  as  support  for  8-bit character strings, PCRE also supports
1685       16-bit strings (from release 8.30) and  32-bit  strings  (from  release
1686       8.32),  by means of two additional libraries. They can be built as well
1687       as, or instead of, the 8-bit library. To avoid too  much  complication,
1688       this  document describes the 8-bit versions of the functions, with only
1689       occasional references to the 16-bit and 32-bit libraries.
1690
1691       The 16-bit and 32-bit functions operate in the same way as their  8-bit
1692       counterparts;  they  just  use different data types for their arguments
1693       and results, and their names start with pcre16_ or pcre32_  instead  of
1694       pcre_.  For  every  option  that  has  UTF8  in  its name (for example,
1695       PCRE_UTF8), there are corresponding 16-bit and 32-bit names  with  UTF8
1696       replaced by UTF16 or UTF32, respectively. This facility is in fact just
1697       cosmetic; the 16-bit and 32-bit option names define the same  bit  val-
1698       ues.
1699
1700       References to bytes and UTF-8 in this document should be read as refer-
1701       ences to 16-bit data units and UTF-16 when using the 16-bit library, or
1702       32-bit  data  units  and  UTF-32  when using the 32-bit library, unless
1703       specified otherwise.  More details of the specific differences for  the
1704       16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.
1705
1706
1707PCRE API OVERVIEW
1708
1709       PCRE has its own native API, which is described in this document. There
1710       are also some wrapper functions (for the 8-bit library only) that  cor-
1711       respond  to  the  POSIX  regular  expression  API, but they do not give
1712       access to all the functionality. They are described  in  the  pcreposix
1713       documentation.  Both  of these APIs define a set of C function calls. A
1714       C++ wrapper (again for the 8-bit library only) is also distributed with
1715       PCRE. It is documented in the pcrecpp page.
1716
1717       The  native  API  C  function prototypes are defined in the header file
1718       pcre.h, and on Unix-like systems the (8-bit) library itself  is  called
1719       libpcre.  It  can  normally be accessed by adding -lpcre to the command
1720       for linking an application that uses PCRE. The header file defines  the
1721       macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
1722       numbers for the library. Applications can use these to include  support
1723       for different releases of PCRE.
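
       A compile-time check on these macros might look like the following
       sketch. The APP_ names are hypothetical application macros, and the
       stand-in definitions of PCRE_MAJOR and PCRE_MINOR exist only so that
       the fragment compiles without pcre.h; in a real program they come from
       the header.

```c
/* Stand-in values so that this sketch compiles on its own. In a real
   program, remove them and use #include <pcre.h> instead. */
#ifndef PCRE_MAJOR
#define PCRE_MAJOR 8
#define PCRE_MINOR 32
#endif

/* Hypothetical helper: true when the header describes release
   major.minor or later. */
#define APP_PCRE_AT_LEAST(major, minor) \
    (PCRE_MAJOR > (major) || \
     (PCRE_MAJOR == (major) && PCRE_MINOR >= (minor)))

/* The direct JIT execution interface is available from release 8.32. */
#if APP_PCRE_AT_LEAST(8, 32)
#define APP_HAVE_PCRE_JIT_EXEC 1
#else
#define APP_HAVE_PCRE_JIT_EXEC 0
#endif
```

       The check describes the header that was used for compilation. Whether
       an optional feature was actually built into the library is a separate
       question, which pcre_config() answers at run time.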
1724
       In a Windows environment, if you want to statically link an application
       program against a non-dll pcre.a file, you must define PCRE_STATIC
       before including pcre.h or pcrecpp.h, because otherwise the
       pcre_malloc() and pcre_free() exported functions will be declared
       __declspec(dllimport), with unwanted results.
1730
1731       The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
1732       pcre_exec() are used for compiling and matching regular expressions  in
1733       a  Perl-compatible  manner. A sample program that demonstrates the sim-
1734       plest way of using them is provided in the file  called  pcredemo.c  in
1735       the PCRE source distribution. A listing of this program is given in the
1736       pcredemo documentation, and the pcresample documentation describes  how
1737       to compile and run it.
1738
1739       Just-in-time  compiler  support is an optional feature of PCRE that can
1740       be built in appropriate hardware environments. It greatly speeds up the
1741       matching  performance  of  many  patterns.  Simple  programs can easily
1742       request that it be used if available, by  setting  an  option  that  is
1743       ignored  when  it is not relevant. More complicated programs might need
1744       to    make    use    of    the    functions     pcre_jit_stack_alloc(),
1745       pcre_jit_stack_free(),  and pcre_assign_jit_stack() in order to control
1746       the JIT code's memory usage.
1747
1748       From release 8.32 there is also a direct interface for  JIT  execution,
1749       which  gives  improved performance. The JIT-specific functions are dis-
1750       cussed in the pcrejit documentation.
1751
1752       A second matching function, pcre_dfa_exec(), which is not Perl-compati-
1753       ble,  is  also provided. This uses a different algorithm for the match-
1754       ing. The alternative algorithm finds all possible matches (at  a  given
1755       point  in  the  subject), and scans the subject just once (unless there
1756       are lookbehind assertions). However, this  algorithm  does  not  return
1757       captured  substrings.  A description of the two matching algorithms and
1758       their advantages and disadvantages is given in the  pcrematching  docu-
1759       mentation.
1760
1761       In  addition  to  the  main compiling and matching functions, there are
1762       convenience functions for extracting captured substrings from a subject
1763       string that is matched by pcre_exec(). They are:
1764
1765         pcre_copy_substring()
1766         pcre_copy_named_substring()
1767         pcre_get_substring()
1768         pcre_get_named_substring()
1769         pcre_get_substring_list()
1770         pcre_get_stringnumber()
1771         pcre_get_stringtable_entries()
1772
1773       pcre_free_substring() and pcre_free_substring_list() are also provided,
1774       to free the memory used for extracted strings.
1775
1776       The function pcre_maketables() is used to  build  a  set  of  character
1777       tables   in   the   current   locale  for  passing  to  pcre_compile(),
1778       pcre_exec(), or pcre_dfa_exec(). This is an optional facility  that  is
1779       provided  for  specialist  use.  Most  commonly,  no special tables are
1780       passed, in which case internal tables that are generated when  PCRE  is
1781       built are used.
1782
1783       The  function  pcre_fullinfo()  is used to find out information about a
1784       compiled pattern. The function pcre_version() returns a  pointer  to  a
1785       string containing the version of PCRE and its date of release.
1786
1787       The  function  pcre_refcount()  maintains  a  reference count in a data
1788       block containing a compiled pattern. This is provided for  the  benefit
1789       of object-oriented applications.
1790
1791       The  global  variables  pcre_malloc and pcre_free initially contain the
1792       entry points of the standard malloc()  and  free()  functions,  respec-
1793       tively. PCRE calls the memory management functions via these variables,
1794       so a calling program can replace them if it  wishes  to  intercept  the
1795       calls. This should be done before calling any PCRE functions.
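
       As an illustration, a program could intercept the calls in order to
       count outstanding blocks. This is a minimal sketch: the counting_
       functions are hypothetical, the counter is not thread-safe, and the
       assignments to pcre_malloc and pcre_free are shown only in a comment so
       that the fragment compiles without pcre.h.

```c
#include <stdlib.h>

static size_t live_blocks;  /* blocks obtained and not yet freed */

/* Same signature as malloc(), so it can be stored in pcre_malloc. */
void *counting_malloc(size_t size)
{
    void *block = malloc(size);

    if (block != NULL)
        live_blocks++;
    return block;
}

/* Same signature as free(), so it can be stored in pcre_free. */
void counting_free(void *block)
{
    if (block != NULL)
        live_blocks--;
    free(block);
}

size_t counting_live_blocks(void)
{
    return live_blocks;
}

/* With <pcre.h> included, install the wrappers before calling any PCRE
   function:

       pcre_malloc = counting_malloc;
       pcre_free   = counting_free;
*/
```

       A non-zero result from counting_live_blocks() at exit would then point
       to a compiled pattern or study block that was never freed.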
1796
1797       The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
1798       indirections to memory management functions.  These  special  functions
1799       are  used  only  when  PCRE is compiled to use the heap for remembering
1800       data, instead of recursive function calls, when running the pcre_exec()
1801       function.  See  the  pcrebuild  documentation  for details of how to do
1802       this. It is a non-standard way of building PCRE, for  use  in  environ-
1803       ments  that  have  limited stacks. Because of the greater use of memory
1804       management, it runs more slowly. Separate  functions  are  provided  so
1805       that  special-purpose  external  code  can  be used for this case. When
1806       used, these functions are always called in a  stack-like  manner  (last
1807       obtained,  first freed), and always for memory blocks of the same size.
1808       There is a discussion about PCRE's stack usage in the  pcrestack  docu-
1809       mentation.
1810
1811       The global variable pcre_callout initially contains NULL. It can be set
1812       by the caller to a "callout" function, which PCRE  will  then  call  at
1813       specified  points during a matching operation. Details are given in the
1814       pcrecallout documentation.
1815
1816       The global variable pcre_stack_guard initially contains NULL. It can be
1817       set  by  the  caller  to  a function that is called by PCRE whenever it
1818       starts to compile a parenthesized part of a pattern.  When  parentheses
1819       are nested, PCRE uses recursive function calls, which use up the system
1820       stack. This function is provided so that applications  with  restricted
1821       stacks  can  force a compilation error if the stack runs out. The func-
1822       tion should return zero if all is well, or non-zero to force an error.
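
       One way to write such a function is to compare the address of a local
       variable with an address recorded near the start of the program. The
       sketch below assumes a conventional contiguous stack and a platform
       that provides uintptr_t; the function names and the byte limit are
       hypothetical, and the assignment to pcre_stack_guard is shown only in a
       comment so that the fragment compiles without pcre.h.

```c
#include <stddef.h>
#include <stdint.h>

static uintptr_t guard_base;    /* address recorded by the application */
static size_t guard_limit;      /* permitted distance from it, in bytes */

/* Call once, passing the address of a local variable in main() (or in
   the thread's entry function) and the stack budget in bytes. */
void stack_guard_init(const void *base, size_t limit_bytes)
{
    guard_base = (uintptr_t)base;
    guard_limit = limit_bytes;
}

/* Matches the pcre_stack_guard signature: returns zero if all is well,
   or non-zero to force a compilation error. */
int app_stack_guard(void)
{
    char marker;
    uintptr_t here = (uintptr_t)&marker;
    uintptr_t used = here > guard_base ? here - guard_base
                                       : guard_base - here;

    return used > guard_limit;
}

/* With <pcre.h> included:

       pcre_stack_guard = app_stack_guard;
*/
```

       The distance is taken as an absolute value so that the check does not
       depend on whether the stack grows towards lower or higher addresses.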
1823
1824
1825NEWLINES
1826
1827       PCRE supports five different conventions for indicating line breaks  in
1828       strings:  a  single  CR (carriage return) character, a single LF (line-
1829       feed) character, the two-character sequence CRLF, any of the three pre-
1830       ceding,  or any Unicode newline sequence. The Unicode newline sequences
1831       are the three just mentioned, plus the single characters  VT  (vertical
1832       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
1833       separator, U+2028), and PS (paragraph separator, U+2029).
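
       The five conventions can be modelled by a function that reports how
       long a newline is at a given position. This is a toy model and not PCRE
       code: it works on an array of already-decoded code points in an
       ASCII/Unicode environment, and the NL_ names are hypothetical rather
       than PCRE option names.

```c
#include <stddef.h>

enum newline_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Returns the length in characters (0, 1, or 2) of the newline that
   starts at cp[0] under the given convention; n is the number of code
   points available at cp. */
size_t newline_length(const unsigned long *cp, size_t n,
                      enum newline_convention conv)
{
    if (n == 0)
        return 0;

    int crlf = n >= 2 && cp[0] == 0x0D && cp[1] == 0x0A;

    switch (conv) {
    case NL_CR:
        return cp[0] == 0x0D ? 1 : 0;
    case NL_LF:
        return cp[0] == 0x0A ? 1 : 0;
    case NL_CRLF:
        return crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (crlf)
            return 2;
        return (cp[0] == 0x0D || cp[0] == 0x0A) ? 1 : 0;
    case NL_ANY:
        if (crlf)
            return 2;
        switch (cp[0]) {
        case 0x0A: case 0x0B: case 0x0C: case 0x0D:     /* LF VT FF CR */
        case 0x85: case 0x2028: case 0x2029:            /* NEL LS PS */
            return 1;
        default:
            return 0;
        }
    }
    return 0;
}
```

       Under NL_CRLF a lone CR or LF is not a newline, whereas under
       NL_ANYCRLF and NL_ANY the pair CRLF counts as one two-character
       newline rather than two separate ones.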
1834
       Each of the first three conventions is used by at least one operating
       system as its standard newline sequence. When PCRE is built, a default
       can be specified. If none is given, the built-in default is LF, which is
       the Unix standard. When PCRE is run, the default can be overridden,
       either when a pattern is compiled, or when it is matched.
1840
1841       At compile time, the newline convention can be specified by the options
1842       argument  of  pcre_compile(), or it can be specified by special text at
1843       the start of the pattern itself; this overrides any other settings. See
1844       the pcrepattern page for details of the special character sequences.
1845
1846       In the PCRE documentation the word "newline" is used to mean "the char-
1847       acter or pair of characters that indicate a line break". The choice  of
1848       newline  convention  affects  the  handling of the dot, circumflex, and
1849       dollar metacharacters, the handling of #-comments in /x mode, and, when
1850       CRLF  is a recognized line ending sequence, the match position advance-
1851       ment for a non-anchored pattern. There is more detail about this in the
1852       section on pcre_exec() options below.
1853
1854       The  choice of newline convention does not affect the interpretation of
1855       the \n or \r escape sequences, nor does  it  affect  what  \R  matches,
1856       which is controlled in a similar way, but by separate options.
1857
1858
1859MULTITHREADING
1860
       The PCRE functions can be used in multithreaded applications, with the
       proviso that the memory management functions pointed to by pcre_malloc,
       pcre_free, pcre_stack_malloc, and pcre_stack_free, and the callout and
       stack-checking functions pointed to by pcre_callout and
       pcre_stack_guard, are shared by all threads.
1866
1867       The  compiled form of a regular expression is not altered during match-
1868       ing, so the same compiled pattern can safely be used by several threads
1869       at once.
1870
1871       If  the just-in-time optimization feature is being used, it needs sepa-
1872       rate memory stack areas for each thread. See the pcrejit  documentation
1873       for more details.
1874
1875
1876SAVING PRECOMPILED PATTERNS FOR LATER USE
1877
1878       The compiled form of a regular expression can be saved and re-used at a
1879       later time, possibly by a different program, and even on a  host  other
1880       than  the  one  on  which  it  was  compiled.  Details are given in the
1881       pcreprecompile documentation,  which  includes  a  description  of  the
1882       pcre_pattern_to_host_byte_order()  function. However, compiling a regu-
1883       lar expression with one version of PCRE for use with a  different  ver-
1884       sion is not guaranteed to work and may cause crashes.
1885
1886
1887CHECKING BUILD-TIME OPTIONS
1888
1889       int pcre_config(int what, void *where);
1890
1891       The  function pcre_config() makes it possible for a PCRE client to dis-
1892       cover which optional features have been compiled into the PCRE library.
1893       The  pcrebuild documentation has more details about these optional fea-
1894       tures.
1895
1896       The first argument for pcre_config() is an  integer,  specifying  which
1897       information is required; the second argument is a pointer to a variable
1898       into which the information is placed. The returned  value  is  zero  on
1899       success,  or  the negative error code PCRE_ERROR_BADOPTION if the value
1900       in the first argument is not recognized. The following  information  is
1901       available:
1902
1903         PCRE_CONFIG_UTF8
1904
1905       The  output is an integer that is set to one if UTF-8 support is avail-
1906       able; otherwise it is set to zero. This value should normally be  given
1907       to the 8-bit version of this function, pcre_config(). If it is given to
1908       the  16-bit  or  32-bit  version  of  this  function,  the  result   is
1909       PCRE_ERROR_BADOPTION.
1910
1911         PCRE_CONFIG_UTF16
1912
1913       The output is an integer that is set to one if UTF-16 support is avail-
1914       able; otherwise it is set to zero. This value should normally be  given
1915       to the 16-bit version of this function, pcre16_config(). If it is given
1916       to the 8-bit  or  32-bit  version  of  this  function,  the  result  is
1917       PCRE_ERROR_BADOPTION.
1918
1919         PCRE_CONFIG_UTF32
1920
1921       The output is an integer that is set to one if UTF-32 support is avail-
1922       able; otherwise it is set to zero. This value should normally be  given
1923       to the 32-bit version of this function, pcre32_config(). If it is given
1924       to the 8-bit  or  16-bit  version  of  this  function,  the  result  is
1925       PCRE_ERROR_BADOPTION.
1926
1927         PCRE_CONFIG_UNICODE_PROPERTIES
1928
1929       The  output  is  an  integer  that is set to one if support for Unicode
1930       character properties is available; otherwise it is set to zero.
1931
1932         PCRE_CONFIG_JIT
1933
1934       The output is an integer that is set to one if support for just-in-time
1935       compiling is available; otherwise it is set to zero.
1936
1937         PCRE_CONFIG_JITTARGET
1938
1939       The  output is a pointer to a zero-terminated "const char *" string. If
1940       JIT support is available, the string contains the name of the architec-
1941       ture  for  which the JIT compiler is configured, for example "x86 32bit
1942       (little endian + unaligned)". If JIT  support  is  not  available,  the
1943       result is NULL.
1944
1945         PCRE_CONFIG_NEWLINE
1946
1947       The  output  is  an integer whose value specifies the default character
1948       sequence that is recognized as meaning "newline". The values  that  are
1949       supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
1950       for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC  environments,  CR,
1951       ANYCRLF,  and  ANY  yield the same values. However, the value for LF is
1952       normally 21, though some EBCDIC environments use 37. The  corresponding
1953       values  for  CRLF are 3349 and 3365. The default should normally corre-
1954       spond to the standard sequence for your operating system.
1955
1956         PCRE_CONFIG_BSR
1957
1958       The output is an integer whose value indicates what character sequences
1959       the  \R  escape sequence matches by default. A value of 0 means that \R
1960       matches any Unicode line ending sequence; a value of 1  means  that  \R
1961       matches only CR, LF, or CRLF. The default can be overridden when a pat-
1962       tern is compiled or matched.
1963
1964         PCRE_CONFIG_LINK_SIZE
1965
1966       The output is an integer that contains the number  of  bytes  used  for
1967       internal  linkage  in  compiled  regular  expressions.  For  the  8-bit
       library, the value can be 2, 3, or 4. For the 16-bit and 32-bit
       libraries, the value is either 2 or 4 and is still a number of bytes. The
1971       default value of 2 is sufficient for all but the most massive patterns,
1972       since it allows the compiled pattern to be up to 64K  in  size.  Larger
1973       values  allow larger regular expressions to be compiled, at the expense
1974       of slower matching.
1975
1976         PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
1977
1978       The output is an integer that contains the threshold  above  which  the
1979       POSIX  interface  uses malloc() for output vectors. Further details are
1980       given in the pcreposix documentation.
1981
1982         PCRE_CONFIG_PARENS_LIMIT
1983
1984       The output is a long integer that gives the maximum depth of nesting of
1985       parentheses  (of  any  kind) in a pattern. This limit is imposed to cap
1986       the amount of system stack used when a pattern is compiled. It is spec-
1987       ified  when PCRE is built; the default is 250. This limit does not take
1988       into account the stack that may already be used by the calling applica-
1989       tion.  For  finer  control  over compilation stack usage, you can set a
1990       pointer to an external checking function in pcre_stack_guard.
1991
1992         PCRE_CONFIG_MATCH_LIMIT
1993
1994       The output is a long integer that gives the default limit for the  num-
1995       ber  of  internal  matching  function calls in a pcre_exec() execution.
1996       Further details are given with pcre_exec() below.
1997
1998         PCRE_CONFIG_MATCH_LIMIT_RECURSION
1999
2000       The output is a long integer that gives the default limit for the depth
2001       of   recursion  when  calling  the  internal  matching  function  in  a
2002       pcre_exec() execution.  Further  details  are  given  with  pcre_exec()
2003       below.
2004
2005         PCRE_CONFIG_STACKRECURSE
2006
2007       The  output is an integer that is set to one if internal recursion when
2008       running pcre_exec() is implemented by recursive function calls that use
2009       the  stack  to remember their state. This is the usual way that PCRE is
2010       compiled. The output is zero if PCRE was compiled to use blocks of data
2011       on  the  heap  instead  of  recursive  function  calls.  In  this case,
2012       pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
2013       blocks on the heap, thus avoiding the use of the stack.
2014
2015
COMPILING A PATTERN
2017
2018       pcre *pcre_compile(const char *pattern, int options,
2019            const char **errptr, int *erroffset,
2020            const unsigned char *tableptr);
2021
2022       pcre *pcre_compile2(const char *pattern, int options,
2023            int *errorcodeptr,
2024            const char **errptr, int *erroffset,
2025            const unsigned char *tableptr);
2026
2027       Either of the functions pcre_compile() or pcre_compile2() can be called
2028       to compile a pattern into an internal form. The only difference between
2029       the  two interfaces is that pcre_compile2() has an additional argument,
2030       errorcodeptr, via which a numerical error  code  can  be  returned.  To
2031       avoid  too  much repetition, we refer just to pcre_compile() below, but
2032       the information applies equally to pcre_compile2().
2033
2034       The pattern is a C string terminated by a binary zero, and is passed in
2035       the  pattern  argument.  A  pointer to a single block of memory that is
2036       obtained via pcre_malloc is returned. This contains the  compiled  code
2037       and related data. The pcre type is defined for the returned block; this
2038       is a typedef for a structure whose contents are not externally defined.
2039       It is up to the caller to free the memory (via pcre_free) when it is no
2040       longer required.
2041
2042       Although the compiled code of a PCRE regex is relocatable, that is,  it
2043       does not depend on memory location, the complete pcre data block is not
2044       fully relocatable, because it may contain a copy of the tableptr  argu-
2045       ment, which is an address (see below).
2046
2047       The options argument contains various bit settings that affect the com-
2048       pilation. It should be zero if no options are required.  The  available
2049       options  are  described  below. Some of them (in particular, those that
2050       are compatible with Perl, but some others as well) can also be set  and
2051       unset  from  within  the  pattern  (see the detailed description in the
2052       pcrepattern documentation). For those options that can be different  in
       different parts of the pattern, the contents of the options argument
       specify their settings at the start of compilation and execution. The
2055       PCRE_ANCHORED,  PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
2056       PCRE_NO_START_OPTIMIZE options can be set at the time  of  matching  as
2057       well as at compile time.
2058
2059       If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
2060       if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
2061       sets the variable pointed to by errptr to point to a textual error mes-
2062       sage. This is a static string that is part of the library. You must not
2063       try  to  free it. Normally, the offset from the start of the pattern to
2064       the data unit that was being processed when the error was discovered is
2065       placed  in the variable pointed to by erroffset, which must not be NULL
2066       (if it is, an immediate error is given). However, for an invalid  UTF-8
2067       or  UTF-16  string,  the  offset  is that of the first data unit of the
2068       failing character.
2069
2070       Some errors are not detected until the whole pattern has been  scanned;
2071       in  these  cases,  the offset passed back is the length of the pattern.
2072       Note that the offset is in data units, not characters, even  in  a  UTF
2073       mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
2074       acter.
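
       Because the offset counts data units, a caller that reports errors
       for UTF-8 patterns may want to check whether it falls on a character
       boundary. The sketch below shows one way to do that for the 8-bit
       library. The function names are hypothetical, and the test relies
       only on the UTF-8 rule that continuation bytes have the form
       10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* True if byte `offset` starts a UTF-8 character or is the end of the
   string; false if it points at a continuation byte. */
int utf8_offset_is_boundary(const char *s, size_t length, size_t offset)
{
  assert(s != NULL && offset <= length);
  if (offset == length) return 1;
  return ((unsigned char)s[offset] & 0xC0) != 0x80;
}

/* Moves an offset back to the first byte of the character containing it. */
size_t utf8_align_offset(const char *s, size_t length, size_t offset)
{
  while (offset > 0 && !utf8_offset_is_boundary(s, length, offset))
    offset--;
  return offset;
}
```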
2075
2076       If pcre_compile2() is used instead of pcre_compile(),  and  the  error-
2077       codeptr  argument is not NULL, a non-zero error code number is returned
2078       via this argument in the event of an error. This is in addition to  the
2079       textual error message. Error codes and messages are listed below.
2080
2081       If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
2082       character tables that are  built  when  PCRE  is  compiled,  using  the
2083       default  C  locale.  Otherwise, tableptr must be an address that is the
2084       result of a call to pcre_maketables(). This value is  stored  with  the
2085       compiled  pattern,  and  used  again by pcre_exec() and pcre_dfa_exec()
2086       when the pattern is matched. For more discussion, see  the  section  on
2087       locale support below.
2088
2089       This  code  fragment  shows a typical straightforward call to pcre_com-
2090       pile():
2091
2092         pcre *re;
2093         const char *error;
2094         int erroffset;
2095         re = pcre_compile(
2096           "^A.*Z",          /* the pattern */
2097           0,                /* default options */
2098           &error,           /* for error message */
2099           &erroffset,       /* for error offset */
2100           NULL);            /* use default character tables */
2101
2102       The following names for option bits are defined in  the  pcre.h  header
2103       file:
2104
2105         PCRE_ANCHORED
2106
2107       If this bit is set, the pattern is forced to be "anchored", that is, it
2108       is constrained to match only at the first matching point in the  string
2109       that  is being searched (the "subject string"). This effect can also be
2110       achieved by appropriate constructs in the pattern itself, which is  the
2111       only way to do it in Perl.
2112
2113         PCRE_AUTO_CALLOUT
2114
2115       If this bit is set, pcre_compile() automatically inserts callout items,
2116       all with number 255, before each pattern item. For  discussion  of  the
2117       callout facility, see the pcrecallout documentation.
2118
2119         PCRE_BSR_ANYCRLF
2120         PCRE_BSR_UNICODE
2121
2122       These options (which are mutually exclusive) control what the \R escape
2123       sequence matches. The choice is either to match only CR, LF,  or  CRLF,
2124       or to match any Unicode newline sequence. The default is specified when
2125       PCRE is built. It can be overridden from within the pattern, or by set-
2126       ting an option when a compiled pattern is matched.
2127
2128         PCRE_CASELESS
2129
2130       If  this  bit is set, letters in the pattern match both upper and lower
2131       case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
2132       changed  within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
2133       always understands the concept of case for characters whose values  are
2134       less  than 128, so caseless matching is always possible. For characters
2135       with higher values, the concept of case is supported if  PCRE  is  com-
2136       piled  with Unicode property support, but not otherwise. If you want to
2137       use caseless matching for characters 128 and  above,  you  must  ensure
2138       that  PCRE  is  compiled  with Unicode property support as well as with
2139       UTF-8 support.
2140
2141         PCRE_DOLLAR_ENDONLY
2142
2143       If this bit is set, a dollar metacharacter in the pattern matches  only
2144       at  the  end  of the subject string. Without this option, a dollar also
2145       matches immediately before a newline at the end of the string (but  not
2146       before  any  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
2147       if PCRE_MULTILINE is set.  There is no equivalent  to  this  option  in
2148       Perl, and no way to set it within a pattern.
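
       The rule can be modelled in a few lines. This is only a sketch of
       the behaviour described above, not PCRE code: dollar_can_match() is
       a hypothetical name, PCRE_MULTILINE is assumed to be unset, and a
       newline is assumed to be a single LF character.

```c
#include <assert.h>
#include <stddef.h>

/* Can "$" match at position pos of subject[0..len)? Non-multiline model
   with LF as the only newline. */
int dollar_can_match(const char *subject, size_t len, size_t pos,
                     int dollar_endonly)
{
  assert(subject != NULL && pos <= len);
  if (pos == len) return 1;            /* the very end always qualifies */
  if (dollar_endonly) return 0;        /* no other position is allowed */
  return pos == len - 1 && subject[pos] == '\n';  /* before final newline */
}
```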
2149
2150         PCRE_DOTALL
2151
2152       If  this bit is set, a dot metacharacter in the pattern matches a char-
2153       acter of any value, including one that indicates a newline. However, it
2154       only  ever  matches  one character, even if newlines are coded as CRLF.
2155       Without this option, a dot does not match when the current position  is
2156       at a newline. This option is equivalent to Perl's /s option, and it can
2157       be changed within a pattern by a (?s) option setting. A negative  class
2158       such as [^a] always matches newline characters, independent of the set-
2159       ting of this option.
2160
2161         PCRE_DUPNAMES
2162
2163       If this bit is set, names used to identify capturing  subpatterns  need
2164       not be unique. This can be helpful for certain types of pattern when it
2165       is known that only one instance of the named  subpattern  can  ever  be
2166       matched.  There  are  more details of named subpatterns below; see also
2167       the pcrepattern documentation.
2168
2169         PCRE_EXTENDED
2170
       If this bit is set, most white space characters in the pattern are
       totally ignored except when escaped or inside a character class. White
       space is not allowed, however, within sequences such as (?> that
       introduce various parenthesized subpatterns, nor within a numerical
       quantifier such as {1,3}. Ignorable white space is permitted between
       an item and a following quantifier and between a quantifier and a
       following + that indicates possessiveness.
2178
       White space formerly did not include the VT character (code 11),
       because Perl did not treat this character as white space. However,
       Perl changed at release 5.18, so PCRE followed at release 8.34, and VT
       is now treated as white space.
2183
2184       PCRE_EXTENDED  also  causes characters between an unescaped # outside a
2185       character class  and  the  next  newline,  inclusive,  to  be  ignored.
2186       PCRE_EXTENDED  is equivalent to Perl's /x option, and it can be changed
2187       within a pattern by a (?x) option setting.
2188
2189       Which characters are interpreted  as  newlines  is  controlled  by  the
2190       options  passed to pcre_compile() or by a special sequence at the start
2191       of the pattern, as described in the section entitled  "Newline  conven-
2192       tions" in the pcrepattern documentation. Note that the end of this type
2193       of comment is  a  literal  newline  sequence  in  the  pattern;  escape
2194       sequences that happen to represent a newline do not count.
2195
2196       This  option  makes  it possible to include comments inside complicated
2197       patterns.  Note, however, that this applies only  to  data  characters.
2198       White  space  characters  may  never  appear  within  special character
2199       sequences in a pattern, for example within the sequence (?( that intro-
2200       duces a conditional subpattern.
2201
2202         PCRE_EXTRA
2203
2204       This  option  was invented in order to turn on additional functionality
2205       of PCRE that is incompatible with Perl, but it  is  currently  of  very
2206       little  use. When set, any backslash in a pattern that is followed by a
2207       letter that has no special meaning  causes  an  error,  thus  reserving
2208       these  combinations  for  future  expansion.  By default, as in Perl, a
2209       backslash followed by a letter with no special meaning is treated as  a
2210       literal. (Perl can, however, be persuaded to give an error for this, by
2211       running it with the -w option.) There are at present no other  features
2212       controlled  by this option. It can also be set by a (?X) option setting
2213       within a pattern.
2214
2215         PCRE_FIRSTLINE
2216
2217       If this option is set, an  unanchored  pattern  is  required  to  match
2218       before  or  at  the  first  newline  in  the subject string, though the
2219       matched text may continue over the newline.
2220
2221         PCRE_JAVASCRIPT_COMPAT
2222
2223       If this option is set, PCRE's behaviour is changed in some ways so that
2224       it  is  compatible with JavaScript rather than Perl. The changes are as
2225       follows:
2226
2227       (1) A lone closing square bracket in a pattern  causes  a  compile-time
2228       error,  because this is illegal in JavaScript (by default it is treated
2229       as a data character). Thus, the pattern AB]CD becomes illegal when this
2230       option is set.
2231
2232       (2)  At run time, a back reference to an unset subpattern group matches
2233       an empty string (by default this causes the current  matching  alterna-
2234       tive  to  fail). A pattern such as (\1)(a) succeeds when this option is
2235       set (assuming it can find an "a" in the subject), whereas it  fails  by
2236       default, for Perl compatibility.
2237
2238       (3) \U matches an upper case "U" character; by default \U causes a com-
2239       pile time error (Perl uses \U to upper case subsequent characters).
2240
2241       (4) \u matches a lower case "u" character unless it is followed by four
2242       hexadecimal  digits,  in  which case the hexadecimal number defines the
2243       code point to match. By default, \u causes a compile time  error  (Perl
2244       uses it to upper case the following character).
2245
2246       (5)  \x matches a lower case "x" character unless it is followed by two
2247       hexadecimal digits, in which case the hexadecimal  number  defines  the
2248       code  point  to  match. By default, as in Perl, a hexadecimal number is
2249       always expected after \x, but it may have zero, one, or two digits (so,
2250       for example, \xz matches a binary zero character followed by z).
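
       The difference in item (5) can be illustrated with a small parser.
       This is a sketch under stated assumptions rather than PCRE's
       implementation: parse_backslash_x() and the x_escape type are
       hypothetical, the \x{...} form is not handled, and a value of -1 is
       used here to mean "treat \x as a literal x".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef struct { int value; size_t used; } x_escape;

static int hex_digit_value(int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  return tolower(c) - 'a' + 10;
}

/* `rest` points just after "\x". Returns the code point and the number
   of hex digits consumed. */
x_escape parse_backslash_x(const char *rest, int javascript_compat)
{
  x_escape r = { 0, 0 };
  assert(rest != NULL);
  while (r.used < 2 && isxdigit((unsigned char)rest[r.used]))
    {
    r.value = r.value * 16 + hex_digit_value((unsigned char)rest[r.used]);
    r.used++;
    }
  if (javascript_compat && r.used < 2)
    {
    r.value = -1;     /* fewer than two digits: literal "x" */
    r.used = 0;
    }
  return r;
}
```

       In the default mode, the input "z" yields the value 0 with no digits
       consumed, which corresponds to the \xz example above.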
2251
2252         PCRE_MULTILINE
2253
2254       By  default,  for  the purposes of matching "start of line" and "end of
2255       line", PCRE treats the subject string as consisting of a single line of
2256       characters,  even if it actually contains newlines. The "start of line"
2257       metacharacter (^) matches only at the start of the string, and the "end
2258       of  line"  metacharacter  ($) matches only at the end of the string, or
2259       before a terminating newline (except when PCRE_DOLLAR_ENDONLY is  set).
2260       Note,  however,  that  unless  PCRE_DOTALL  is set, the "any character"
2261       metacharacter (.) does not match at a newline. This behaviour  (for  ^,
2262       $, and dot) is the same as Perl.
2263
       When PCRE_MULTILINE is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before internal
       newlines in the subject string, respectively, as well as at the very
       start and end. This is equivalent to Perl's /m option, and it can be
       changed within a pattern by a (?m) option setting. If there are no
       newlines in a subject string, or no occurrences of ^ or $ in a pattern,
       setting PCRE_MULTILINE has no effect.
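
       The positions at which a circumflex can match may be modelled as
       follows. This is a sketch of the behaviour described above, not
       PCRE code: caret_can_match() is a hypothetical name and a newline
       is assumed to be a single LF character. Note that a newline that
       ends the subject is not an internal newline, so the position after
       it does not qualify.

```c
#include <assert.h>
#include <stddef.h>

/* Can "^" match at position pos of subject[0..len)? LF-only model. */
int caret_can_match(const char *subject, size_t len, size_t pos,
                    int multiline)
{
  assert(subject != NULL && pos <= len);
  if (pos == 0) return 1;              /* start of the subject */
  if (!multiline) return 0;
  /* Just after an internal newline, but not at the very end. */
  return pos < len && subject[pos - 1] == '\n';
}
```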
2271
2272         PCRE_NEVER_UTF
2273
2274       This option locks out interpretation of the pattern as UTF-8 (or UTF-16
2275       or UTF-32 in the 16-bit and 32-bit libraries). In particular,  it  pre-
2276       vents  the  creator of the pattern from switching to UTF interpretation
2277       by starting the pattern with (*UTF). This may be useful in applications
2278       that  process  patterns  from  external  sources.  The  combination  of
2279       PCRE_UTF8 and PCRE_NEVER_UTF also causes an error.
2280
2281         PCRE_NEWLINE_CR
2282         PCRE_NEWLINE_LF
2283         PCRE_NEWLINE_CRLF
2284         PCRE_NEWLINE_ANYCRLF
2285         PCRE_NEWLINE_ANY
2286
2287       These options override the default newline definition that  was  chosen
2288       when  PCRE  was built. Setting the first or the second specifies that a
2289       newline is indicated by a single character (CR  or  LF,  respectively).
2290       Setting  PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
2291       two-character CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF  specifies
2292       that any of the three preceding sequences should be recognized. Setting
2293       PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should  be
2294       recognized.
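
       The first four conventions can be expressed as a function that
       reports the length of the newline sequence at a given position.
       This is an illustrative sketch only: the enum and the function
       newline_length_at() are hypothetical, and PCRE_NEWLINE_ANY is left
       out because it needs the Unicode newline list.

```c
#include <assert.h>
#include <stddef.h>

enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Length (0, 1, or 2) of the newline sequence starting at subject[pos]
   under the given convention; 0 means "not at a newline". */
size_t newline_length_at(const char *subject, size_t len, size_t pos,
                         enum nl_convention conv)
{
  int cr, lf, crlf;
  assert(subject != NULL && pos <= len);
  if (pos == len) return 0;
  cr = subject[pos] == '\r';
  lf = subject[pos] == '\n';
  crlf = cr && pos + 1 < len && subject[pos + 1] == '\n';
  switch (conv)
    {
    case NL_CR:      return cr ? 1 : 0;
    case NL_LF:      return lf ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: return crlf ? 2 : ((cr || lf) ? 1 : 0);
    }
  return 0;
}
```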
2295
2296       In  an ASCII/Unicode environment, the Unicode newline sequences are the
2297       three just mentioned, plus the  single  characters  VT  (vertical  tab,
2298       U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep-
2299       arator, U+2028), and PS (paragraph separator, U+2029).  For  the  8-bit
2300       library, the last two are recognized only in UTF-8 mode.
2301
2302       When  PCRE is compiled to run in an EBCDIC (mainframe) environment, the
2303       code for CR is 0x0d, the same as ASCII. However, the character code for
2304       LF  is  normally 0x15, though in some EBCDIC environments 0x25 is used.
2305       Whichever of these is not LF is made to  correspond  to  Unicode's  NEL
2306       character.  EBCDIC  codes  are all less than 256. For more details, see
2307       the pcrebuild documentation.
2308
2309       The newline setting in the  options  word  uses  three  bits  that  are
2310       treated as a number, giving eight possibilities. Currently only six are
2311       used (default plus the five values above). This means that if  you  set
2312       more  than one newline option, the combination may or may not be sensi-
2313       ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
2314       PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
2315       cause an error.
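
       The arithmetic behind this can be sketched as follows. The NL_BITS_
       names are local stand-ins, and their values are an assumption based
       on the usual pcre.h definitions (a three-bit field starting at bit
       20, with the five settings numbered 1 to 5); check the header of
       your own build before relying on them.

```c
#include <assert.h>

/* Assumed values mirroring pcre.h; not taken from this document. */
#define NL_BITS_CR       0x00100000u   /* field value 1 */
#define NL_BITS_LF       0x00200000u   /* field value 2 */
#define NL_BITS_CRLF     0x00300000u   /* field value 3 */
#define NL_BITS_ANY      0x00400000u   /* field value 4 */
#define NL_BITS_ANYCRLF  0x00500000u   /* field value 5 */
#define NL_BITS_MASK     0x00700000u

/* Extracts the three-bit newline field as a number from 0 to 7. */
unsigned int newline_field(unsigned int options)
{
  return (options & NL_BITS_MASK) >> 20;
}

/* 0 means "use the build default"; 1 to 5 are defined; 6 and 7 unused. */
int newline_bits_valid(unsigned int options)
{
  return newline_field(options) <= 5;
}

/* Checks the documented equivalence of CR combined with LF. */
void check_cr_lf_alias(void)
{
  assert((NL_BITS_CR | NL_BITS_LF) == NL_BITS_CRLF);
}
```

       Under these assumed values, combining the LF and ANY bits gives the
       unused field value 6, which is the kind of combination that is
       rejected.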
2316
2317       The only time that a line break in a pattern  is  specially  recognized
2318       when  compiling is when PCRE_EXTENDED is set. CR and LF are white space
2319       characters, and so are ignored in this mode. Also, an unescaped #  out-
2320       side  a  character class indicates a comment that lasts until after the
2321       next line break sequence. In other circumstances, line break  sequences
2322       in patterns are treated as literal data.
2323
2324       The newline option that is set at compile time becomes the default that
2325       is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
2326
2327         PCRE_NO_AUTO_CAPTURE
2328
2329       If this option is set, it disables the use of numbered capturing paren-
2330       theses  in the pattern. Any opening parenthesis that is not followed by
2331       ? behaves as if it were followed by ?: but named parentheses can  still
2332       be  used  for  capturing  (and  they acquire numbers in the usual way).
2333       There is no equivalent of this option in Perl.
2334
2335         PCRE_NO_AUTO_POSSESS
2336
2337       If this option is set, it disables "auto-possessification". This is  an
2338       optimization  that,  for example, turns a+b into a++b in order to avoid
2339       backtracks into a+ that can never be successful. However,  if  callouts
2340       are  in  use,  auto-possessification  means that some of them are never
2341       taken. You can set this option if you want the matching functions to do
2342       a  full  unoptimized  search and run all the callouts, but it is mainly
2343       provided for testing purposes.
2344
2345         PCRE_NO_START_OPTIMIZE
2346
2347       This is an option that acts at matching time; that is, it is really  an
2348       option  for  pcre_exec()  or  pcre_dfa_exec().  If it is set at compile
2349       time, it is remembered with the compiled pattern and assumed at  match-
2350       ing  time.  This is necessary if you want to use JIT execution, because
2351       the JIT compiler needs to know whether or not this option is  set.  For
2352       details see the discussion of PCRE_NO_START_OPTIMIZE below.
2353
2354         PCRE_UCP
2355
2356       This  option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
2357       \w, and some of the POSIX character classes.  By  default,  only  ASCII
2358       characters  are  recognized, but if PCRE_UCP is set, Unicode properties
2359       are used instead to classify characters. More details are given in  the
2360       section  on generic character types in the pcrepattern page. If you set
2361       PCRE_UCP, matching one of the items it affects takes much  longer.  The
2362       option  is  available only if PCRE has been compiled with Unicode prop-
2363       erty support.
2364
2365         PCRE_UNGREEDY
2366
2367       This option inverts the "greediness" of the quantifiers  so  that  they
2368       are  not greedy by default, but become greedy if followed by "?". It is
2369       not compatible with Perl. It can also be set by a (?U)  option  setting
2370       within the pattern.
2371
2372         PCRE_UTF8
2373
2374       This  option  causes PCRE to regard both the pattern and the subject as
2375       strings of UTF-8 characters instead of single-byte strings. However, it
2376       is  available  only  when PCRE is built to include UTF support. If not,
2377       the use of this option provokes an error. Details of  how  this  option
2378       changes the behaviour of PCRE are given in the pcreunicode page.
2379
2380         PCRE_NO_UTF8_CHECK
2381
2382       When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
2383       automatically checked. There is a  discussion  about  the  validity  of
2384       UTF-8  strings in the pcreunicode page. If an invalid UTF-8 sequence is
2385       found, pcre_compile() returns an error. If you already know  that  your
2386       pattern  is valid, and you want to skip this check for performance rea-
2387       sons, you can set the PCRE_NO_UTF8_CHECK option.  When it is  set,  the
2388       effect of passing an invalid UTF-8 string as a pattern is undefined. It
2389       may cause your program to crash or loop. Note that this option can also
2390       be  passed to pcre_exec() and pcre_dfa_exec(), to suppress the validity
2391       checking of subject strings only. If the same string is  being  matched
2392       many  times, the option can be safely set for the second and subsequent
2393       matchings to improve performance.
2394
2395
COMPILATION ERROR CODES
2397
       The following table lists the error codes that may be returned by
       pcre_compile2(), along with the error messages that may be returned by
       both compiling functions. Note that error messages are always 8-bit
       ASCII strings, even in 16-bit or 32-bit mode. As PCRE has developed,
       some error codes have fallen out of use. To avoid confusion, they have
       not been re-used.
2404
2405          0  no error
2406          1  \ at end of pattern
2407          2  \c at end of pattern
2408          3  unrecognized character follows \
2409          4  numbers out of order in {} quantifier
2410          5  number too big in {} quantifier
2411          6  missing terminating ] for character class
2412          7  invalid escape sequence in character class
2413          8  range out of order in character class
2414          9  nothing to repeat
2415         10  [this code is not in use]
2416         11  internal error: unexpected repeat
2417         12  unrecognized character after (? or (?-
2418         13  POSIX named classes are supported only within a class
2419         14  missing )
2420         15  reference to non-existent subpattern
2421         16  erroffset passed as NULL
2422         17  unknown option bit(s) set
2423         18  missing ) after comment
2424         19  [this code is not in use]
2425         20  regular expression is too large
2426         21  failed to get memory
2427         22  unmatched parentheses
2428         23  internal error: code overflow
2429         24  unrecognized character after (?<
2430         25  lookbehind assertion is not fixed length
2431         26  malformed number or name after (?(
2432         27  conditional group contains more than two branches
2433         28  assertion expected after (?(
2434         29  (?R or (?[+-]digits must be followed by )
2435         30  unknown POSIX class name
2436         31  POSIX collating elements are not supported
2437         32  this version of PCRE is compiled without UTF support
2438         33  [this code is not in use]
2439         34  character value in \x{} or \o{} is too large
2440         35  invalid condition (?(0)
2441         36  \C not allowed in lookbehind assertion
2442         37  PCRE does not support \L, \l, \N{name}, \U, or \u
2443         38  number after (?C is > 255
2444         39  closing ) for (?C expected
2445         40  recursive call could loop indefinitely
2446         41  unrecognized character after (?P
2447         42  syntax error in subpattern name (missing terminator)
2448         43  two named subpatterns have the same name
2449         44  invalid UTF-8 string (specifically UTF-8)
2450         45  support for \P, \p, and \X has not been compiled
2451         46  malformed \P or \p sequence
2452         47  unknown property name after \P or \p
2453         48  subpattern name is too long (maximum 32 characters)
2454         49  too many named subpatterns (maximum 10000)
2455         50  [this code is not in use]
2456         51  octal value is greater than \377 in 8-bit non-UTF-8 mode
2457         52  internal error: overran compiling workspace
2458         53  internal error: previously-checked referenced subpattern
2459               not found
2460         54  DEFINE group contains more than one branch
2461         55  repeating a DEFINE group is not allowed
2462         56  inconsistent NEWLINE options
2463         57  \g is not followed by a braced, angle-bracketed, or quoted
2464               name/number or by a plain number
2465         58  a numbered reference must not be zero
2466         59  an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
2467         60  (*VERB) not recognized or malformed
2468         61  number is too big
2469         62  subpattern name expected
2470         63  digit expected after (?+
2471         64  ] is an invalid data character in JavaScript compatibility mode
2472         65  different names for subpatterns of the same number are
2473               not allowed
2474         66  (*MARK) must have an argument
2475         67  this version of PCRE is not compiled with Unicode property
2476               support
2477         68  \c must be followed by an ASCII character
2478         69  \k is not followed by a braced, angle-bracketed, or quoted name
2479         70  internal error: unknown opcode in find_fixedlength()
2480         71  \N is not supported in a class
2481         72  too many forward references
2482         73  disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
2483         74  invalid UTF-16 string (specifically UTF-16)
2484         75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
2485         76  character value in \u.... sequence is too large
2486         77  invalid UTF-32 string (specifically UTF-32)
2487         78  setting UTF is disabled by the application
2488         79  non-hex character in \x{} (closing brace missing?)
2489         80  non-octal character in \o{} (closing brace missing?)
2490         81  missing opening brace after \o
2491         82  parentheses are too deeply nested
2492         83  invalid range in character class
2493         84  group name must start with a non-digit
2494         85  parentheses are too deeply nested (stack check)
2495
2496       The  numbers  32  and 10000 in errors 48 and 49 are defaults; different
2497       values may be used if the limits were changed when PCRE was built.
2498
2499
STUDYING A PATTERN
2501
2502       pcre_extra *pcre_study(const pcre *code, int options,
2503            const char **errptr);
2504
2505       If a compiled pattern is going to be used several times,  it  is  worth
2506       spending more time analyzing it in order to speed up the time taken for
2507       matching. The function pcre_study() takes a pointer to a compiled  pat-
2508       tern as its first argument. If studying the pattern produces additional
2509       information that will help speed up matching,  pcre_study()  returns  a
2510       pointer  to a pcre_extra block, in which the study_data field points to
2511       the results of the study.
2512
2513       The  returned  value  from  pcre_study()  can  be  passed  directly  to
2514       pcre_exec()  or  pcre_dfa_exec(). However, a pcre_extra block also con-
2515       tains other fields that can be set by the caller before  the  block  is
2516       passed; these are described below in the section on matching a pattern.
2517
2518       If  studying  the  pattern  does  not  produce  any useful information,
2519       pcre_study() returns NULL by default.  In  that  circumstance,  if  the
2520       calling program wants to pass any of the other fields to pcre_exec() or
2521       pcre_dfa_exec(), it must set up its own pcre_extra block.  However,  if
2522       pcre_study()  is  called  with  the  PCRE_STUDY_EXTRA_NEEDED option, it
2523       returns a pcre_extra block even if studying did not find any additional
2524       information.  It  may still return NULL, however, if an error occurs in
2525       pcre_study().
2526
2527       The second argument of pcre_study() contains  option  bits.  There  are
2528       three further options in addition to PCRE_STUDY_EXTRA_NEEDED:
2529
2530         PCRE_STUDY_JIT_COMPILE
2531         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
2532         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
2533
2534       If  any  of  these are set, and the just-in-time compiler is available,
2535       the pattern is further compiled into machine code  that  executes  much
2536       faster  than  the  pcre_exec()  interpretive  matching function. If the
2537       just-in-time compiler is not available, these options are ignored.  All
2538       undefined bits in the options argument must be zero.
2539
2540       JIT  compilation  is  a heavyweight optimization. It can take some time
2541       for patterns to be analyzed, and for one-off matches  and  simple  pat-
2542       terns  the benefit of faster execution might be offset by a much slower
2543       study time.  Not all patterns can be optimized by the JIT compiler. For
2544       those  that cannot be handled, matching automatically falls back to the
2545       pcre_exec() interpreter. For more details, see the  pcrejit  documenta-
2546       tion.
2547
2548       The  third argument for pcre_study() is a pointer for an error message.
2549       If studying succeeds (even if no data is  returned),  the  variable  it
2550       points  to  is  set  to NULL. Otherwise it is set to point to a textual
2551       error message. This is a static string that is part of the library. You
2552       must  not  try  to  free it. You should test the error pointer for NULL
2553       after calling pcre_study(), to be sure that it has run successfully.
2554
2555       When you are finished with a pattern, you can free the memory used  for
2556       the study data by calling pcre_free_study(). This function was added to
2557       the API for release 8.20. For earlier versions,  the  memory  could  be
2558       freed  with  pcre_free(), just like the pattern itself. This will still
2559       work in cases where JIT optimization is not used, but it  is  advisable
2560       to change to the new function when convenient.
2561
2562       This  is  a typical way in which pcre_study() is used (except that in a
2563       real application there should be tests for errors):
2564
         int rc;
         pcre *re;
         pcre_extra *sd;
         const char *error;   /* receives an error message, or NULL */
         int erroroffset;     /* receives the offset of a compile error */
         int ovector[30];     /* output vector for substring offsets */
         re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
         sd = pcre_study(
           re,             /* result of pcre_compile() */
           0,              /* no options */
           &error);        /* set to NULL or points to a message */
         rc = pcre_exec(   /* see below for details of pcre_exec() options */
           re, sd, "subject", 7, 0, 0, ovector, 30);
         ...
         pcre_free_study(sd);
         pcre_free(re);
2578
2579       Studying a pattern does two things: first, a lower bound for the length
2580       of subject string that is needed to match the pattern is computed. This
2581       does not mean that there are any strings of that length that match, but
2582       it  does  guarantee that no shorter strings match. The value is used to
2583       avoid wasting time by trying to match strings that are shorter than the
2584       lower  bound.  You  can find out the value in a calling program via the
2585       pcre_fullinfo() function.
2586
2587       Studying a pattern is also useful for non-anchored patterns that do not
2588       have  a  single fixed starting character. A bitmap of possible starting
2589       bytes is created. This speeds up finding a position in the  subject  at
2590       which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
2591       values less than 256.  In 32-bit mode, the bitmap is  used  for  32-bit
2592       values less than 256.)
2593
2594       These  two optimizations apply to both pcre_exec() and pcre_dfa_exec(),
2595       and the information is also used by the JIT  compiler.   The  optimiza-
2596       tions  can  be  disabled  by setting the PCRE_NO_START_OPTIMIZE option.
2597       You might want to do this if your pattern contains callouts or  (*MARK)
2598       and  you  want  to make use of these facilities in cases where matching
2599       fails.
2600
       PCRE_NO_START_OPTIMIZE can be specified at either compile time or
       execution time. However, if PCRE_NO_START_OPTIMIZE is passed to
       pcre_exec() (that is, after any JIT compilation has happened), JIT
       execution is disabled. For JIT execution to work with
       PCRE_NO_START_OPTIMIZE, the option must be set at compile time.
2606
2607       There is a longer discussion of PCRE_NO_START_OPTIMIZE below.
2608
2609
2610LOCALE SUPPORT
2611
2612       PCRE handles caseless matching, and determines whether  characters  are
2613       letters,  digits, or whatever, by reference to a set of tables, indexed
2614       by character code point. When running in UTF-8 mode, or in the  16-  or
2615       32-bit libraries, this applies only to characters with code points less
2616       than 256. By default, higher-valued code  points  never  match  escapes
2617       such  as \w or \d. However, if PCRE is built with Unicode property sup-
2618       port, all characters can be tested with \p and \P,  or,  alternatively,
2619       the  PCRE_UCP option can be set when a pattern is compiled; this causes
2620       \w and friends to use Unicode property support instead of the  built-in
2621       tables.
2622
       The use of locales with Unicode is discouraged. If you are handling
       characters with code points greater than 127, you should either use
       Unicode support, or use locales, but not try to mix the two.
2626
2627       PCRE  contains  an  internal set of tables that are used when the final
2628       argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
2629       applications.  Normally, the internal tables recognize only ASCII char-
2630       acters. However, when PCRE is built, it is possible to cause the inter-
2631       nal tables to be rebuilt in the default "C" locale of the local system,
2632       which may cause them to be different.
2633
2634       The internal tables can always be overridden by tables supplied by  the
2635       application that calls PCRE. These may be created in a different locale
2636       from the default. As more and more applications change  to  using  Uni-
2637       code, the need for this locale support is expected to die away.
2638
2639       External  tables  are  built by calling the pcre_maketables() function,
2640       which has no arguments, in the relevant locale. The result can then  be
2641       passed  to  pcre_compile() as often as necessary. For example, to build
2642       and use tables that  are  appropriate  for  the  French  locale  (where
2643       accented  characters  with  values greater than 128 are treated as let-
2644       ters), the following code could be used:
2645
2646         setlocale(LC_CTYPE, "fr_FR");
2647         tables = pcre_maketables();
2648         re = pcre_compile(..., tables);
2649
2650       The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
2651       if you are using Windows, the name for the French locale is "french".
2652
2653       When  pcre_maketables()  runs,  the  tables are built in memory that is
2654       obtained via pcre_malloc. It is the caller's responsibility  to  ensure
2655       that  the memory containing the tables remains available for as long as
2656       it is needed.
2657
2658       The pointer that is passed to pcre_compile() is saved with the compiled
2659       pattern,  and the same tables are used via this pointer by pcre_study()
2660       and also by pcre_exec() and pcre_dfa_exec(). Thus, for any single  pat-
2661       tern, compilation, studying and matching all happen in the same locale,
2662       but different patterns can be processed in different locales.
2663
2664       It is possible to pass a table pointer or NULL (indicating the  use  of
2665       the internal tables) to pcre_exec() or pcre_dfa_exec() (see the discus-
2666       sion below in the section on matching a pattern). This facility is pro-
2667       vided  for  use  with  pre-compiled  patterns  that have been saved and
2668       reloaded.  Character tables are not saved with patterns, so if  a  non-
2669       standard table was used at compile time, it must be provided again when
2670       the reloaded pattern is matched. Attempting to  use  this  facility  to
2671       match a pattern in a different locale from the one in which it was com-
2672       piled is likely to lead to anomalous (usually incorrect) results.
2673
2674
2675INFORMATION ABOUT A PATTERN
2676
2677       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
2678            int what, void *where);
2679
2680       The pcre_fullinfo() function returns information about a compiled  pat-
2681       tern.  It replaces the pcre_info() function, which was removed from the
2682       library at version 8.30, after more than 10 years of obsolescence.
2683
2684       The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
2685       pattern.  The second argument is the result of pcre_study(), or NULL if
2686       the pattern was not studied. The third argument specifies  which  piece
2687       of  information  is required, and the fourth argument is a pointer to a
2688       variable to receive the data. The yield of the  function  is  zero  for
2689       success, or one of the following negative numbers:
2690
2691         PCRE_ERROR_NULL           the argument code was NULL
2692                                   the argument where was NULL
2693         PCRE_ERROR_BADMAGIC       the "magic number" was not found
2694         PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
2695                                   endianness
2696         PCRE_ERROR_BADOPTION      the value of what was invalid
2697         PCRE_ERROR_UNSET          the requested field is not set
2698
       The "magic number" is placed at the start of each compiled pattern as
       a simple check against passing an arbitrary memory pointer. The
       endianness error can occur if a compiled pattern is saved and reloaded
       on a different host. Here is a typical call of pcre_fullinfo(), to
       obtain the length of the compiled pattern:
2704
2705         int rc;
2706         size_t length;
2707         rc = pcre_fullinfo(
2708           re,               /* result of pcre_compile() */
2709           sd,               /* result of pcre_study(), or NULL */
2710           PCRE_INFO_SIZE,   /* what is required */
2711           &length);         /* where to put the data */
2712
2713       The  possible  values for the third argument are defined in pcre.h, and
2714       are as follows:
2715
2716         PCRE_INFO_BACKREFMAX
2717
2718       Return the number of the highest back reference  in  the  pattern.  The
2719       fourth  argument  should  point to an int variable. Zero is returned if
2720       there are no back references.
2721
2722         PCRE_INFO_CAPTURECOUNT
2723
2724       Return the number of capturing subpatterns in the pattern.  The  fourth
2725       argument should point to an int variable.
2726
2727         PCRE_INFO_DEFAULT_TABLES
2728
2729       Return  a pointer to the internal default character tables within PCRE.
2730       The fourth argument should point to an unsigned char *  variable.  This
2731       information call is provided for internal use by the pcre_study() func-
2732       tion. External callers can cause PCRE to use  its  internal  tables  by
2733       passing a NULL table pointer.
2734
2735         PCRE_INFO_FIRSTBYTE (deprecated)
2736
2737       Return information about the first data unit of any matched string, for
2738       a non-anchored pattern. The name of this option  refers  to  the  8-bit
2739       library,  where  data units are bytes. The fourth argument should point
2740       to an int variable. Negative values are used for  special  cases.  How-
2741       ever,  this  means  that when the 32-bit library is in non-UTF-32 mode,
2742       the full 32-bit range of characters cannot be returned. For  this  rea-
2743       son,  this  value  is deprecated; use PCRE_INFO_FIRSTCHARACTERFLAGS and
2744       PCRE_INFO_FIRSTCHARACTER instead.
2745
2746       If there is a fixed first value, for example, the  letter  "c"  from  a
2747       pattern  such  as (cat|cow|coyote), its value is returned. In the 8-bit
2748       library, the value is always less than 256. In the 16-bit  library  the
2749       value can be up to 0xffff. In the 32-bit library the value can be up to
2750       0x10ffff.
2751
2752       If there is no fixed first value, and if either
2753
2754       (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
2755       branch starts with "^", or
2756
2757       (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2758       set (if it were set, the pattern would be anchored),
2759
2760       -1 is returned, indicating that the pattern matches only at  the  start
2761       of  a  subject string or after any newline within the string. Otherwise
2762       -2 is returned. For anchored patterns, -2 is returned.
2763
2764         PCRE_INFO_FIRSTCHARACTER
2765
       Return the value of the first data unit (non-UTF character) of any
       matched string in the situation where PCRE_INFO_FIRSTCHARACTERFLAGS
       returns 1; otherwise return 0. The fourth argument should point to a
       uint32_t variable.
2770
2771       In  the 8-bit library, the value is always less than 256. In the 16-bit
2772       library the value can be up to 0xffff. In the 32-bit library in  UTF-32
2773       mode  the  value  can  be up to 0x10ffff, and up to 0xffffffff when not
2774       using UTF-32 mode.
2775
2776         PCRE_INFO_FIRSTCHARACTERFLAGS
2777
2778       Return information about the first data unit of any matched string, for
2779       a  non-anchored  pattern.  The  fourth  argument should point to an int
2780       variable.
2781
2782       If there is a fixed first value, for example, the  letter  "c"  from  a
2783       pattern  such  as  (cat|cow|coyote),  1  is returned, and the character
2784       value can be retrieved using PCRE_INFO_FIRSTCHARACTER. If there  is  no
2785       fixed first value, and if either
2786
2787       (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
2788       branch starts with "^", or
2789
2790       (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2791       set (if it were set, the pattern would be anchored),
2792
2793       2 is returned, indicating that the pattern matches only at the start of
2794       a subject string or after any newline within the string. Otherwise 0 is
2795       returned. For anchored patterns, 0 is returned.
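
       The correspondence between the deprecated PCRE_INFO_FIRSTBYTE result
       and the two newer options can be sketched as follows. This is an
       illustration only: legacy_firstbyte() is a hypothetical helper, not
       part of PCRE, and it assumes that flags and ch hold the values obtained
       from PCRE_INFO_FIRSTCHARACTERFLAGS and PCRE_INFO_FIRSTCHARACTER.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API. Combines the values
   reported by PCRE_INFO_FIRSTCHARACTERFLAGS (flags) and
   PCRE_INFO_FIRSTCHARACTER (ch) into the single int that the deprecated
   PCRE_INFO_FIRSTBYTE option is documented to yield. */
int legacy_firstbyte(int flags, uint32_t ch)
{
    if (flags == 1)
        return (int)ch;   /* fixed first data unit; values above INT_MAX
                             do not fit, hence the deprecation */
    if (flags == 2)
        return -1;        /* matches only at the start or after a newline */
    return -2;            /* no information, or an anchored pattern */
}
```

       The cast in the first branch shows why the single-value option cannot
       cover the full 32-bit range: negative results are already reserved for
       the special cases.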
2796
2797         PCRE_INFO_FIRSTTABLE
2798
2799       If  the pattern was studied, and this resulted in the construction of a
2800       256-bit table indicating a fixed set of values for the first data  unit
2801       in  any  matching string, a pointer to the table is returned. Otherwise
2802       NULL is returned. The fourth argument should point to an unsigned  char
2803       * variable.
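
       As a rough sketch of how such a table can speed up the search for a
       starting position, the code below scans a subject for the first byte
       whose bit is set. The bit layout used here (bit c & 7 of byte c >> 3)
       is an assumption made for illustration and is not specified in this
       document; the function names are hypothetical.

```c
#include <stddef.h>

/* Assumed layout (not stated in this document): 32 bytes, where the bit
   for value c is bit (c & 7) of byte (c >> 3). */
int first_table_has(const unsigned char *table, unsigned char c)
{
    return (table[c >> 3] >> (c & 7)) & 1;
}

/* Return the offset of the first byte at or after start that could begin
   a match, or length if there is none. */
size_t next_start_candidate(const unsigned char *table,
                            const unsigned char *subject,
                            size_t length, size_t start)
{
    size_t i;
    for (i = start; i < length; i++) {
        if (first_table_has(table, subject[i]))
            return i;
    }
    return length;
}
```

       Positions whose byte is not in the table can be skipped without
       running the matcher at all, which is the saving that studying provides
       for patterns without a single fixed starting character.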
2804
2805         PCRE_INFO_HASCRORLF
2806
2807       Return  1  if  the  pattern  contains any explicit matches for CR or LF
2808       characters, otherwise 0. The fourth argument should  point  to  an  int
2809       variable.  An explicit match is either a literal CR or LF character, or
2810       \r or \n.
2811
2812         PCRE_INFO_JCHANGED
2813
2814       Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
2815       otherwise  0. The fourth argument should point to an int variable. (?J)
2816       and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
2817
2818         PCRE_INFO_JIT
2819
2820       Return 1 if the pattern was studied with one of the  JIT  options,  and
2821       just-in-time compiling was successful. The fourth argument should point
2822       to an int variable. A return value of 0 means that JIT support  is  not
2823       available  in this version of PCRE, or that the pattern was not studied
2824       with a JIT option, or that the JIT compiler could not handle this  par-
2825       ticular  pattern. See the pcrejit documentation for details of what can
2826       and cannot be handled.
2827
2828         PCRE_INFO_JITSIZE
2829
2830       If the pattern was successfully studied with a JIT option,  return  the
2831       size  of the JIT compiled code, otherwise return zero. The fourth argu-
2832       ment should point to a size_t variable.
2833
2834         PCRE_INFO_LASTLITERAL
2835
2836       Return the value of the rightmost literal data unit that must exist  in
2837       any  matched  string, other than at its start, if such a value has been
2838       recorded. The fourth argument should point to an int variable. If there
2839       is no such value, -1 is returned. For anchored patterns, a last literal
2840       value is recorded only if it follows something of variable length.  For
2841       example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
2842       /^a\dz\d/ the returned value is -1.
2843
       This value is deprecated because, when the 32-bit library is used in
       non-UTF-32 mode, it cannot return the full 32-bit range of characters.
       Use PCRE_INFO_REQUIREDCHARFLAGS and PCRE_INFO_REQUIREDCHAR instead.
2848
2849         PCRE_INFO_MATCH_EMPTY
2850
2851       Return  1  if  the  pattern can match an empty string, otherwise 0. The
2852       fourth argument should point to an int variable.
2853
2854         PCRE_INFO_MATCHLIMIT
2855
2856       If the pattern set a match limit by  including  an  item  of  the  form
2857       (*LIMIT_MATCH=nnnn)  at  the  start,  the value is returned. The fourth
2858       argument should point to an unsigned 32-bit integer. If no  such  value
2859       has   been   set,   the  call  to  pcre_fullinfo()  returns  the  error
2860       PCRE_ERROR_UNSET.
2861
2862         PCRE_INFO_MAXLOOKBEHIND
2863
2864       Return the number of characters (NB not  data  units)  in  the  longest
2865       lookbehind  assertion  in  the pattern. This information is useful when
2866       doing multi-segment matching using  the  partial  matching  facilities.
2867       Note that the simple assertions \b and \B require a one-character look-
2868       behind. \A also registers a one-character lookbehind,  though  it  does
2869       not  actually inspect the previous character. This is to ensure that at
2870       least one character from the old segment is retained when a new segment
2871       is processed. Otherwise, if there are no lookbehinds in the pattern, \A
2872       might match incorrectly at the start of a new segment.
2873
2874         PCRE_INFO_MINLENGTH
2875
2876       If the pattern was studied and a minimum length  for  matching  subject
2877       strings  was  computed,  its  value is returned. Otherwise the returned
2878       value is -1. The value is a number of characters, which in UTF mode may
2879       be  different from the number of data units. The fourth argument should
2880       point to an int variable. A non-negative value is a lower bound to  the
2881       length  of  any  matching  string. There may not be any strings of that
2882       length that do actually match, but every string that does match  is  at
2883       least that long.
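
       Because the minimum length is counted in characters, any comparison
       against a UTF-8 subject has to count characters rather than bytes. The
       following minimal sketch illustrates what the value means; it assumes
       that the subject is valid UTF-8, and both function names are
       hypothetical and not part of PCRE.

```c
#include <stddef.h>

/* Count characters in a UTF-8 string, assuming it is valid UTF-8:
   every byte that is not a continuation byte (10xxxxxx) starts one. */
size_t utf8_char_count(const char *s, size_t length)
{
    size_t i, count = 0;
    for (i = 0; i < length; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Return 1 if the subject cannot match because it has fewer characters
   than minlength, the value obtained from PCRE_INFO_MINLENGTH.
   A value of -1 means no minimum was computed, so nothing is ruled out. */
int too_short_to_match(int minlength, const char *subject, size_t length)
{
    if (minlength < 0)
        return 0;
    return utf8_char_count(subject, length) < (size_t)minlength;
}
```

       A result of 0 does not mean that the subject matches, only that the
       lower bound does not exclude it.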
2884
2885         PCRE_INFO_NAMECOUNT
2886         PCRE_INFO_NAMEENTRYSIZE
2887         PCRE_INFO_NAMETABLE
2888
2889       PCRE  supports the use of named as well as numbered capturing parenthe-
2890       ses. The names are just an additional way of identifying the  parenthe-
2891       ses, which still acquire numbers. Several convenience functions such as
2892       pcre_get_named_substring() are provided for  extracting  captured  sub-
2893       strings  by  name. It is also possible to extract the data directly, by
2894       first converting the name to a number in order to  access  the  correct
2895       pointers in the output vector (described with pcre_exec() below). To do
2896       the conversion, you need  to  use  the  name-to-number  map,  which  is
2897       described by these three values.
2898
2899       The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
2900       gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
2901       of  each  entry;  both  of  these  return  an int value. The entry size
2902       depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
2903       a pointer to the first entry of the table. This is a pointer to char in
2904       the 8-bit library, where the first two bytes of each entry are the num-
2905       ber  of  the capturing parenthesis, most significant byte first. In the
2906       16-bit library, the pointer points to 16-bit data units, the  first  of
2907       which  contains  the  parenthesis  number.  In  the 32-bit library, the
2908       pointer points to 32-bit data units, the first of  which  contains  the
2909       parenthesis  number.  The  rest of the entry is the corresponding name,
2910       zero terminated.
2911
2912       The names are in alphabetical order. If (?| is used to create  multiple
2913       groups  with  the same number, as described in the section on duplicate
2914       subpattern numbers in the pcrepattern page, the groups may be given the
2915       same  name,  but  there is only one entry in the table. Different names
2916       for groups of the same number are not permitted.  Duplicate  names  for
2917       subpatterns with different numbers are permitted, but only if PCRE_DUP-
2918       NAMES is set. They appear in the table in the order in which they  were
2919       found  in  the  pattern.  In  the  absence  of (?| this is the order of
2920       increasing number; when (?| is used this is not  necessarily  the  case
2921       because later subpatterns may have lower numbers.
2922
2923       As  a  simple  example of the name/number table, consider the following
2924       pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
2925       set, so white space - including newlines - is ignored):
2926
2927         (?<date> (?<year>(\d\d)?\d\d) -
2928         (?<month>\d\d) - (?<day>\d\d) )
2929
       There are four named subpatterns, so the table has four entries, and
       each entry in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:
2934
2935         00 01 d  a  t  e  00 ??
2936         00 05 d  a  y  00 ?? ??
2937         00 04 m  o  n  t  h  00
2938         00 02 y  e  a  r  00 ??
2939
2940       When writing code to extract data  from  named  subpatterns  using  the
2941       name-to-number  map,  remember that the length of the entries is likely
2942       to be different for each compiled pattern.
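
       The name-to-number lookup described above can be sketched as follows
       for the 8-bit library. This is an illustration, not PCRE's own code:
       find_group_number() is a hypothetical name, and the table, entry count
       and entry size are assumed to have been obtained beforehand with
       PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <string.h>

/* Hypothetical helper: look up a group name in an 8-bit name table.
   Returns the group number, or -1 if the name is not present. */
int find_group_number(const unsigned char *table, int count,
                      int entry_size, const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        /* Entries are fixed-size, so entry i starts at i * entry_size. */
        const unsigned char *entry = table + (size_t)i * entry_size;
        /* First two bytes: group number, most significant byte first. */
        int number = (entry[0] << 8) | entry[1];
        /* Remainder of the entry: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return -1;
}
```

       With the example table above (four entries of eight bytes), looking up
       "month" yields 4. The sketch uses a linear scan and returns the first
       entry found, so it does not distinguish between duplicate names.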
2943
2944         PCRE_INFO_OKPARTIAL
2945
2946       Return 1  if  the  pattern  can  be  used  for  partial  matching  with
2947       pcre_exec(),  otherwise  0.  The fourth argument should point to an int
2948       variable. From  release  8.00,  this  always  returns  1,  because  the
2949       restrictions  that  previously  applied  to  partial matching have been
2950       lifted. The pcrepartial documentation gives details of  partial  match-
2951       ing.
2952
2953         PCRE_INFO_OPTIONS
2954
2955       Return  a  copy of the options with which the pattern was compiled. The
2956       fourth argument should point to an unsigned long  int  variable.  These
2957       option bits are those specified in the call to pcre_compile(), modified
2958       by any top-level option settings at the start of the pattern itself. In
2959       other  words,  they are the options that will be in force when matching
2960       starts. For example, if the pattern /(?im)abc(?-i)d/ is  compiled  with
2961       the  PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
2962       and PCRE_EXTENDED.
2963
2964       A pattern is automatically anchored by PCRE if  all  of  its  top-level
2965       alternatives begin with one of the following:
2966
2967         ^     unless PCRE_MULTILINE is set
2968         \A    always
2969         \G    always
2970         .*    if PCRE_DOTALL is set and there are no back
2971                 references to the subpattern in which .* appears
2972
2973       For such patterns, the PCRE_ANCHORED bit is set in the options returned
2974       by pcre_fullinfo().
2975
2976         PCRE_INFO_RECURSIONLIMIT
2977
2978       If the pattern set a recursion limit by including an item of  the  form
2979       (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth
2980       argument should point to an unsigned 32-bit integer. If no  such  value
2981       has   been   set,   the  call  to  pcre_fullinfo()  returns  the  error
2982       PCRE_ERROR_UNSET.
2983
2984         PCRE_INFO_SIZE
2985
2986       Return the size of  the  compiled  pattern  in  bytes  (for  all  three
2987       libraries). The fourth argument should point to a size_t variable. This
2988       value does not include the size of the pcre structure that is  returned
2989       by  pcre_compile().  The  value  that  is  passed  as  the  argument to
2990       pcre_malloc() when pcre_compile() is getting memory in which  to  place
2991       the compiled data is the value returned by this option plus the size of
2992       the pcre structure. Studying a compiled pattern, with or  without  JIT,
2993       does not alter the value returned by this option.
2994
2995         PCRE_INFO_STUDYSIZE
2996
2997       Return  the  size  in bytes (for all three libraries) of the data block
2998       pointed to by the study_data field in a pcre_extra block. If pcre_extra
2999       is  NULL, or there is no study data, zero is returned. The fourth argu-
3000       ment should point to a size_t variable. The study_data field is set  by
3001       pcre_study() to record information that will speed up matching (see the
3002       section entitled  "Studying  a  pattern"  above).  The  format  of  the
3003       study_data  block is private, but its length is made available via this
3004       option so that it can be saved and  restored  (see  the  pcreprecompile
3005       documentation for details).
3006
3007         PCRE_INFO_REQUIREDCHARFLAGS
3008
       Return 1 if there is a rightmost literal data unit that must exist in
       any matched string, other than at its start. The fourth argument should
       point to an int variable. If there is no such value, 0 is returned. If
       1 is returned, the character value itself can be retrieved using
       PCRE_INFO_REQUIREDCHAR.
3014
       For anchored patterns, a last literal value is recorded only if it
       follows something of variable length. For example, for the pattern
       /^a\d+z\d+/ the returned value is 1 (with "z" returned from
       PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.
3019
3020         PCRE_INFO_REQUIREDCHAR
3021
       Return the value of the rightmost literal data unit that must exist in
       any matched string, other than at its start, if such a value has been
       recorded. The fourth argument should point to a uint32_t variable. If
       there is no such value, 0 is returned.
3026
3027
3028REFERENCE COUNTS
3029
3030       int pcre_refcount(pcre *code, int adjust);
3031
3032       The  pcre_refcount()  function is used to maintain a reference count in
3033       the data block that contains a compiled pattern. It is provided for the
3034       benefit  of  applications  that  operate  in an object-oriented manner,
3035       where different parts of the application may be using the same compiled
3036       pattern, but you want to free the block when they are all done.
3037
3038       When a pattern is compiled, the reference count field is initialized to
3039       zero.  It is changed only by calling this function, whose action is  to
3040       add  the  adjust  value  (which may be positive or negative) to it. The
3041       yield of the function is the new value. However, the value of the count
3042       is  constrained to lie between 0 and 65535, inclusive. If the new value
3043       is outside these limits, it is forced to the appropriate limit value.
3044
3045       Except when it is zero, the reference count is not correctly  preserved
3046       if  a  pattern  is  compiled on one host and then transferred to a host
3047       whose byte-order is different. (This seems a highly unlikely scenario.)
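
       The clamping rule can be modelled as below. This is a sketch of the
       documented behaviour only, not the library's implementation, and
       clamped_refcount() is a hypothetical name.

```c
/* Model of the documented rule: add adjust to the current count and
   force the result into the range 0..65535. */
int clamped_refcount(int current, int adjust)
{
    long long value = (long long)current + adjust;
    if (value < 0)
        value = 0;
    if (value > 65535)
        value = 65535;
    return (int)value;
}
```

       One possible convention is for each user of a shared pattern to call
       pcre_refcount() with an adjustment of 1 when it starts using the
       pattern and -1 when it is done, and to free the block when the yield
       is zero.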
3048
3049
3050MATCHING A PATTERN: THE TRADITIONAL FUNCTION
3051
3052       int pcre_exec(const pcre *code, const pcre_extra *extra,
3053            const char *subject, int length, int startoffset,
3054            int options, int *ovector, int ovecsize);
3055
3056       The function pcre_exec() is called to match a subject string against  a
3057       compiled  pattern, which is passed in the code argument. If the pattern
3058       was studied, the result of the study should  be  passed  in  the  extra
3059       argument.  You  can call pcre_exec() with the same code and extra argu-
3060       ments as many times as you like, in order to  match  different  subject
3061       strings with the same pattern.
3062
3063       This  function  is  the  main  matching facility of the library, and it
3064       operates in a Perl-like manner. For specialist use  there  is  also  an
3065       alternative  matching function, which is described below in the section
3066       about the pcre_dfa_exec() function.
3067
3068       In most applications, the pattern will have been compiled (and  option-
3069       ally  studied)  in the same process that calls pcre_exec(). However, it
3070       is possible to save compiled patterns and study data, and then use them
3071       later  in  different processes, possibly even on different hosts. For a
3072       discussion about this, see the pcreprecompile documentation.
3073
3074       Here is an example of a simple call to pcre_exec():
3075
3076         int rc;
3077         int ovector[30];
3078         rc = pcre_exec(
3079           re,             /* result of pcre_compile() */
3080           NULL,           /* we didn't study the pattern */
3081           "some string",  /* the subject string */
3082           11,             /* the length of the subject string */
3083           0,              /* start at offset 0 in the subject */
3084           0,              /* default options */
3085           ovector,        /* vector of integers for substring information */
3086           30);            /* number of elements (NOT size in bytes) */
3087
3088   Extra data for pcre_exec()
3089
3090       If the extra argument is not NULL, it must point to a  pcre_extra  data
3091       block.  The pcre_study() function returns such a block (when it doesn't
3092       return NULL), but you can also create one for yourself, and pass  addi-
3093       tional  information  in it. The pcre_extra block contains the following
3094       fields (not necessarily in this order):
3095
3096         unsigned long int flags;
3097         void *study_data;
3098         void *executable_jit;
3099         unsigned long int match_limit;
3100         unsigned long int match_limit_recursion;
3101         void *callout_data;
3102         const unsigned char *tables;
3103         unsigned char **mark;
3104
3105       In the 16-bit version of  this  structure,  the  mark  field  has  type
3106       "PCRE_UCHAR16 **".
3107
3108       In  the  32-bit  version  of  this  structure,  the mark field has type
3109       "PCRE_UCHAR32 **".
3110
3111       The flags field is used to specify which of the other fields  are  set.
3112       The flag bits are:
3113
3114         PCRE_EXTRA_CALLOUT_DATA
3115         PCRE_EXTRA_EXECUTABLE_JIT
3116         PCRE_EXTRA_MARK
3117         PCRE_EXTRA_MATCH_LIMIT
3118         PCRE_EXTRA_MATCH_LIMIT_RECURSION
3119         PCRE_EXTRA_STUDY_DATA
3120         PCRE_EXTRA_TABLES
3121
3122       Other  flag  bits should be set to zero. The study_data field and some-
3123       times the executable_jit field are set in the pcre_extra block that  is
3124       returned  by pcre_study(), together with the appropriate flag bits. You
3125       should not set these yourself, but you may add to the block by  setting
3126       other fields and their corresponding flag bits.
3127
3128       The match_limit field provides a means of preventing PCRE from using up
3129       a vast amount of resources when running patterns that are not going  to
3130       match,  but  which  have  a very large number of possibilities in their
3131       search trees. The classic example is a pattern that uses nested  unlim-
3132       ited repeats.
3133
3134       Internally,  pcre_exec() uses a function called match(), which it calls
3135       repeatedly (sometimes recursively). The limit  set  by  match_limit  is
3136       imposed  on the number of times this function is called during a match,
3137       which has the effect of limiting the amount of  backtracking  that  can
3138       take place. For patterns that are not anchored, the count restarts from
3139       zero for each position in the subject string.
3140
3141       When pcre_exec() is called with a pattern that was successfully studied
3142       with  a  JIT  option, the way that the matching is executed is entirely
3143       different.  However, there is still the possibility of runaway matching
3144       that goes on for a very long time, and so the match_limit value is also
3145       used in this case (but in a different way) to limit how long the match-
3146       ing can continue.
3147
3148       The  default  value  for  the  limit can be set when PCRE is built; the
3149       built-in default is 10 million, which handles all but the most  extreme
3150       cases.  You  can  override the default by supplying pcre_exec() with a
3151       pcre_extra    block    in    which    match_limit    is    set,     and
3152       PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
3153       exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
3154
3155       A value for the match limit may also be supplied  by  an  item  at  the
3156       start of a pattern of the form
3157
3158         (*LIMIT_MATCH=d)
3159
3160       where  d is a decimal number. However, such a setting is ignored unless
3161       d is less than the limit set by the caller of  pcre_exec()  or,  if  no
3162       such limit is set, less than the default.
3163
3164       The  match_limit_recursion field is similar to match_limit, but instead
3165       of limiting the total number of times that match() is called, it limits
3166       the  depth  of  recursion. The recursion depth is a smaller number than
3167       the total number of calls, because not all calls to match() are  recur-
3168       sive.  This limit is of use only if it is set smaller than match_limit.
3169
3170       Limiting  the  recursion  depth limits the amount of machine stack that
3171       can be used, or, when PCRE has been compiled to use memory on the  heap
3172       instead  of the stack, the amount of heap memory that can be used. This
3173       limit is not relevant, and is ignored, when matching is done using  JIT
3174       compiled code.
3175
3176       The  default  value  for  match_limit_recursion can be set when PCRE is
3177       built; the built-in default is  the  same  value  as  the  default  for
3178       match_limit. You can override the default by supplying pcre_exec() with
3179       a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and
3180       PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the
3181       limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
3182
3183       A value for the recursion limit may also be supplied by an item at  the
3184       start of a pattern of the form
3185
3186         (*LIMIT_RECURSION=d)
3187
3188       where  d is a decimal number. However, such a setting is ignored unless
3189       d is less than the limit set by the caller of  pcre_exec()  or,  if  no
3190       such limit is set, less than the default.
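
       The rule shared by (*LIMIT_MATCH=d) and (*LIMIT_RECURSION=d) is that  a
       pattern  item  can only lower the limit in force, never raise it. As an
       illustration only, the comparison can be sketched as follows. The func-
       tion  and  parameter  names  are hypothetical and are not part of PCRE;
       in_force stands for the caller's limit if one was set in the pcre_extra
       block, and otherwise for the build-time default.

```c
/* Illustrative sketch only: not a PCRE function.
 * in_force          - caller's limit if set, else the build-time default
 * pattern_has_limit - non-zero if the pattern starts with (*LIMIT_...=d)
 * d                 - the decimal value given in the pattern item
 * Returns the limit that applies to the match. */
unsigned long effective_limit(unsigned long in_force,
                              int pattern_has_limit,
                              unsigned long d)
{
    /* The pattern's value is honoured only when it is smaller. */
    if (pattern_has_limit && d < in_force)
        return d;
    return in_force;
}
```

       With  the  10 million default, a pattern value of 5000 would take effect,
       whereas a pattern value of 20 million would be ignored.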
3191
3192       The  callout_data  field is used in conjunction with the "callout" fea-
3193       ture, and is described in the pcrecallout documentation.
3194
3195       The tables field is provided for use with patterns that have been  pre-
3196       compiled using custom character tables, saved to disc or elsewhere, and
3197       then reloaded, because the tables that were used to compile  a  pattern
3198       are  not saved with it. See the pcreprecompile documentation for a dis-
3199       cussion of saving compiled patterns for later use. If  NULL  is  passed
3200       using this mechanism, it forces PCRE's internal tables to be used.
3201
3202       Warning:  The  tables  that  pcre_exec() uses must be the same as those
3203       that were used when the pattern was compiled. If this is not the  case,
3204       the behaviour of pcre_exec() is undefined. Therefore, when a pattern is
3205       compiled and matched in the same process, this field  should  never  be
3206       set. In this (the most common) case, the correct table pointer is auto-
3207       matically passed with  the  compiled  pattern  from  pcre_compile()  to
3208       pcre_exec().
3209
3210       If  PCRE_EXTRA_MARK  is  set in the flags field, the mark field must be
3211       set to point to a suitable variable. If the pattern contains any  back-
3212       tracking  control verbs such as (*MARK:NAME), and the execution ends up
3213       with a name to pass back, a pointer to the  name  string  (zero  termi-
3214       nated)  is  placed  in  the  variable pointed to by the mark field. The
3215       names are within the compiled pattern; if you wish  to  retain  such  a
3216       name  you must copy it before freeing the memory of a compiled pattern.
3217       If there is no name to pass back, the variable pointed to by  the  mark
3218       field  is  set  to NULL. For details of the backtracking control verbs,
3219       see the section entitled "Backtracking control" in the pcrepattern doc-
3220       umentation.
3221
3222   Option bits for pcre_exec()
3223
3224       The  unused  bits of the options argument for pcre_exec() must be zero.
3225       The only bits that may be set are PCRE_ANCHORED, PCRE_BSR_ANYCRLF,
3226       PCRE_BSR_UNICODE, PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL,
3227       PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_NO_START_OPTIMIZE,
3228       PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT.
3229
3230       If  the  pattern  was successfully studied with one of the just-in-time
3231       (JIT) compile options, the only supported options for JIT execution are
3232       PCRE_NO_UTF8_CHECK,     PCRE_NOTBOL,     PCRE_NOTEOL,    PCRE_NOTEMPTY,
3233       PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If  an
3234       unsupported  option  is  used, JIT execution is disabled and the normal
3235       interpretive code in pcre_exec() is run.
3236
3237         PCRE_ANCHORED
3238
3239       The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
3240       matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
3241       turned out to be anchored by virtue of its contents, it cannot be  made
3242       unanchored at matching time.
3243
3244         PCRE_BSR_ANYCRLF
3245         PCRE_BSR_UNICODE
3246
3247       These options (which are mutually exclusive) control what the \R escape
3248       sequence matches. The choice is either to match only CR, LF,  or  CRLF,
3249       or  to  match  any Unicode newline sequence. These options override the
3250       choice that was made or defaulted when the pattern was compiled.
3251
3252         PCRE_NEWLINE_CR
3253         PCRE_NEWLINE_LF
3254         PCRE_NEWLINE_CRLF
3255         PCRE_NEWLINE_ANYCRLF
3256         PCRE_NEWLINE_ANY
3257
3258       These options override  the  newline  definition  that  was  chosen  or
3259       defaulted  when the pattern was compiled. For details, see the descrip-
3260       tion of pcre_compile()  above.  During  matching,  the  newline  choice
3261       affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
3262       ters. It may also alter the way the match position is advanced after  a
3263       match failure for an unanchored pattern.
3264
3265       When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is
3266       set, and a match attempt for an unanchored pattern fails when the  cur-
3267       rent  position  is  at  a  CRLF  sequence,  and the pattern contains no
3268       explicit matches for  CR  or  LF  characters,  the  match  position  is
3269       advanced by two characters instead of one, in other words, to after the
3270       CRLF.
3271
3272       The above rule is a compromise that makes the most common cases work as
3273       expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL
3274       option is not set), it does not match the string "\r\nA" because, after
3275       failing  at the start, it skips both the CR and the LF before retrying.
3276       However, the pattern [\r\n]A does match that string,  because  it  con-
3277       tains an explicit CR or LF reference, and so advances only by one char-
3278       acter after the first failure.
3279
3280       An explicit match for CR or LF is either a literal appearance of one of
3281       those  characters,  or  one  of the \r or \n escape sequences. Implicit
3282       matches such as [^X] do not count, nor does \s (which includes  CR  and
3283       LF in the characters that it matches).
3284
3285       Notwithstanding  the above, anomalous effects may still occur when CRLF
3286       is a valid newline sequence and explicit \r or \n escapes appear in the
3287       pattern.
3288
3289         PCRE_NOTBOL
3290
3291       This option specifies that the first character of the subject string is
3292       not the beginning of a line, so the circumflex metacharacter should not
3293       match  before it. Setting this without PCRE_MULTILINE (at compile time)
3294       causes circumflex never to match. This option affects only  the  behav-
3295       iour of the circumflex metacharacter. It does not affect \A.
3296
3297         PCRE_NOTEOL
3298
3299       This option specifies that the end of the subject string is not the end
3300       of a line, so the dollar metacharacter should not match it nor  (except
3301       in  multiline mode) a newline immediately before it. Setting this with-
3302       out PCRE_MULTILINE (at compile time) causes dollar never to match. This
3303       option  affects only the behaviour of the dollar metacharacter. It does
3304       not affect \Z or \z.
3305
3306         PCRE_NOTEMPTY
3307
3308       An empty string is not considered to be a valid match if this option is
3309       set.  If  there are alternatives in the pattern, they are tried. If all
3310       the alternatives match the empty string, the entire  match  fails.  For
3311       example, if the pattern
3312
3313         a?b?
3314
3315       is  applied  to  a  string not beginning with "a" or "b", it matches an
3316       empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
3317       match is not valid, so PCRE searches further into the string for occur-
3318       rences of "a" or "b".
3319
3320         PCRE_NOTEMPTY_ATSTART
3321
3322       This is like PCRE_NOTEMPTY, except that an empty string match  that  is
3323       not  at  the  start  of  the  subject  is  permitted. If the pattern is
3324       anchored, such a match can occur only if the pattern contains \K.
3325
3326       Perl    has    no    direct    equivalent    of    PCRE_NOTEMPTY     or
3327       PCRE_NOTEMPTY_ATSTART,  but  it  does  make a special case of a pattern
3328       match of the empty string within its split() function, and  when  using
3329       the  /g  modifier.  It  is  possible  to emulate Perl's behaviour after
3330       matching a null string by first trying the match again at the same off-
3331       set  with  PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED,  and then if that
3332       fails, by advancing the starting offset (see below) and trying an ordi-
3333       nary  match  again. There is some code that demonstrates how to do this
3334       in the pcredemo sample program. In the most general case, you  have  to
3335       check  to  see  if the newline convention recognizes CRLF as a newline,
3336       and if so, and the current character is CR followed by LF, advance  the
3337       starting offset by two characters instead of one.
3338
3339         PCRE_NO_START_OPTIMIZE
3340
3341       There  are a number of optimizations that pcre_exec() uses at the start
3342       of a match, in order to speed up the process. For  example,  if  it  is
3343       known that an unanchored match must start with a specific character, it
3344       searches the subject for that character, and fails  immediately  if  it
3345       cannot  find  it,  without actually running the main matching function.
3346       This means that a special item such as (*COMMIT) at the start of a pat-
3347       tern  is  not  considered until after a suitable starting point for the
3348       match has been found. Also, when callouts or (*MARK) items are in  use,
3349       these "start-up" optimizations can cause them to be skipped if the pat-
3350       tern is never actually used. The start-up optimizations are in effect a
3351       pre-scan of the subject that takes place before the pattern is run.
3352
3353       The  PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
3354       possibly causing performance to suffer,  but  ensuring  that  in  cases
3355       where  the  result is "no match", the callouts do occur, and that items
3356       such as (*COMMIT) and (*MARK) are considered at every possible starting
3357       position  in  the  subject  string. If PCRE_NO_START_OPTIMIZE is set at
3358       compile time,  it  cannot  be  unset  at  matching  time.  The  use  of
3359       PCRE_NO_START_OPTIMIZE  at  matching  time  (that  is,  passing  it  to
3360       pcre_exec()) disables JIT execution; in  this  situation,  matching  is
3361       always done interpretively.
3362
3363       Setting  PCRE_NO_START_OPTIMIZE  can  change  the outcome of a matching
3364       operation.  Consider the pattern
3365
3366         (*COMMIT)ABC
3367
3368       When this is compiled, PCRE records the fact that a  match  must  start
3369       with  the  character  "A".  Suppose the subject string is "DEFABC". The
3370       start-up optimization scans along the subject, finds "A" and  runs  the
3371       first  match  attempt  from  there. The (*COMMIT) item means that the
3372       pattern must match at the current starting position, which in this case
3373       it does. However, if the same match is run with  PCRE_NO_START_OPTIMIZE
3374       set, the initial scan along the subject string  does  not  happen.  The
3375       first  match  attempt  is  run  starting  from "D" and when this fails,
3376       (*COMMIT) prevents any further matches  being  tried,  so  the  overall
3377       result  is  "no  match". If the pattern is studied, more start-up opti-
3378       mizations may be used. For example, a minimum length  for  the  subject
3379       may be recorded. Consider the pattern
3380
3381         (*MARK:A)(X|Y)
3382
3383       The  minimum  length  for  a  match is one character. If the subject is
3384       "ABC", there will be attempts to  match  "ABC",  "BC",  "C",  and  then
3385       finally  an empty string.  If the pattern is studied, the final attempt
3386       does not take place, because PCRE knows that the subject is too  short,
3387       and  so  the  (*MARK) is never encountered.  In this case, studying the
3388       pattern does not affect the overall match result, which  is  still  "no
3389       match", but it does affect the auxiliary information that is returned.
3390
3391         PCRE_NO_UTF8_CHECK
3392
3393       When PCRE_UTF8 is set at compile time, the validity of the subject as a
3394       UTF-8 string is automatically checked when pcre_exec() is  subsequently
3395       called.  The entire string is checked before any other processing takes
3396       place. The value of startoffset is  also  checked  to  ensure  that  it
3397       points  to  the start of a UTF-8 character. There is a discussion about
3398       the validity of UTF-8 strings in the pcreunicode page.  If  an  invalid
3399       sequence   of   bytes   is   found,   pcre_exec()   returns  the  error
3400       PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
3401       truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
3402       both cases, information about the precise nature of the error may  also
3403       be  returned (see the descriptions of these errors in the section enti-
3404       tled Error return values from pcre_exec() below).  If startoffset  con-
3405       tains a value that does not point to the start of a UTF-8 character (or
3406       to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
3407
3408       If you already know that your subject is valid, and you  want  to  skip
3409       these    checks    for   performance   reasons,   you   can   set   the
3410       PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to
3411       do  this  for the second and subsequent calls to pcre_exec() if you are
3412       making repeated calls to find all  the  matches  in  a  single  subject
3413       string.  However,  you  should  be  sure  that the value of startoffset
3414       points to the start of a character (or the end of  the  subject).  When
3415       PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
3416       subject or an invalid value of startoffset is undefined.  Your  program
3417       may crash or loop.
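
       Before setting PCRE_NO_UTF8_CHECK, a caller may want to  confirm  that
       startoffset is on a character boundary. The following is a minimal
       sketch, assuming the 8-bit library and a subject already known to  be
       valid UTF-8; the function name is hypothetical. It relies only on the
       fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```c
/* Illustrative helper, not part of PCRE. Returns 1 if offset points to
 * the start of a UTF-8 character in subject, or to the end of the
 * subject; returns 0 otherwise. It inspects a single byte and does not
 * validate the string as a whole. */
int utf8_offset_is_char_start(const unsigned char *subject,
                              int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;                 /* outside the subject */
    if (offset == length)
        return 1;                 /* end of subject is acceptable */
    /* Continuation bytes look like 10xxxxxx; anything else starts a
     * character in a valid UTF-8 string. */
    return (subject[offset] & 0xC0) != 0x80;
}
```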
3418
3419         PCRE_PARTIAL_HARD
3420         PCRE_PARTIAL_SOFT
3421
3422       These  options turn on the partial matching feature. For backwards com-
3423       patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A  partial
3424       match  occurs if the end of the subject string is reached successfully,
3425       but there are not enough subject characters to complete the  match.  If
3426       this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
3427       matching continues by testing any remaining alternatives.  Only  if  no
3428       complete  match  can be found is PCRE_ERROR_PARTIAL returned instead of
3429       PCRE_ERROR_NOMATCH. In other words,  PCRE_PARTIAL_SOFT  says  that  the
3430       caller  is  prepared to handle a partial match, but only if no complete
3431       match can be found.
3432
3433       If PCRE_PARTIAL_HARD is set, it overrides  PCRE_PARTIAL_SOFT.  In  this
3434       case,  if  a  partial  match  is found, pcre_exec() immediately returns
3435       PCRE_ERROR_PARTIAL, without  considering  any  other  alternatives.  In
3436       other words, when PCRE_PARTIAL_HARD is set,  a  partial  match  is
3437       considered to be more important than an alternative complete match.
3438
3439       In both cases, the portion of the string that was  inspected  when  the
3440       partial match was found is set as the first matching string. There is a
3441       more detailed discussion of partial and  multi-segment  matching,  with
3442       examples, in the pcrepartial documentation.
3443
3444   The string to be matched by pcre_exec()
3445
3446       The  subject string is passed to pcre_exec() as a pointer in subject, a
3447       length in length, and a starting offset in startoffset. The  units  for
3448       length  and  startoffset  are  bytes for the 8-bit library, 16-bit data
3449       items for the 16-bit library, and 32-bit  data  items  for  the  32-bit
3450       library.
3451
3452       If  startoffset  is negative or greater than the length of the subject,
3453       pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting  offset  is
3454       zero,  the  search  for a match starts at the beginning of the subject,
3455       and this is by far the most common case. In UTF-8 or UTF-16  mode,  the
3456       offset  must  point to the start of a character, or the end of the sub-
3457       ject (in UTF-32 mode, one data unit equals one character, so  all  off-
3458       sets  are  valid).  Unlike  the pattern string, the subject may contain
3459       binary zeroes.
3460
3461       A non-zero starting offset is useful when searching for  another  match
3462       in  the same subject by calling pcre_exec() again after a previous suc-
3463       cess.  Setting startoffset differs from just passing over  a  shortened
3464       string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
3465       with any kind of lookbehind. For example, consider the pattern
3466
3467         \Biss\B
3468
3469       which finds occurrences of "iss" in the middle of  words.  (\B  matches
3470       only  if  the  current position in the subject is not a word boundary.)
3471       When applied to the string "Mississippi" the first call to  pcre_exec()
3472       finds  the  first  occurrence. If pcre_exec() is called again with just
3473       the remainder of the subject, namely  "issippi",  it  does  not  match,
3474       because \B is always false at the start of the subject, which is deemed
3475       to be a word boundary. However, if pcre_exec()  is  passed  the  entire
3476       string again, but with startoffset set to 4, it finds the second occur-
3477       rence of "iss" because it is able to look behind the starting point  to
3478       discover that it is preceded by a letter.
3479
3480       Finding  all  the  matches  in a subject is tricky when the pattern can
3481       match an empty string. It is possible to emulate Perl's /g behaviour by
3482       first   trying   the   match   again  at  the  same  offset,  with  the
3483       PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED  options,  and  then  if  that
3484       fails,  advancing  the  starting  offset  and  trying an ordinary match
3485       again. There is some code that demonstrates how to do this in the pcre-
3486       demo sample program. In the most general case, you have to check to see
3487       if the newline convention recognizes CRLF as a newline, and if so,  and
3488       the current character is CR followed by LF, advance the starting offset
3489       by two characters instead of one.
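
       The  advance step described above can be sketched as follows. This is
       an illustration under stated assumptions, not code taken from  pcre-
       demo:  the  function  name  is hypothetical, the subject is an 8-bit,
       non-UTF string, and the caller has already determined whether the new-
       line convention in force recognizes CRLF (passed as crlf_is_newline).

```c
/* Illustrative helper, not part of PCRE. Called after the retry with
 * PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED has failed at offset.
 * Returns the offset at which the next ordinary match should start. */
int advance_start_offset(const char *subject, int length,
                         int offset, int crlf_is_newline)
{
    /* Step over a whole CRLF pair when CRLF counts as a newline. */
    if (crlf_is_newline &&
        offset < length - 1 &&
        subject[offset] == '\r' &&
        subject[offset + 1] == '\n')
        return offset + 2;

    /* Otherwise move on by a single data unit. */
    return offset + 1;
}
```

       The  caller  is  expected to stop when offset has already reached the
       end of the subject, because no further match is possible there.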
3490
3491       If a non-zero starting offset is passed when the pattern  is  anchored,
3492       one attempt to match at the given offset is made. This can only succeed
3493       if the pattern does not require the match to be at  the  start  of  the
3494       subject.
3495
3496   How pcre_exec() returns captured substrings
3497
3498       In  general, a pattern matches a certain portion of the subject, and in
3499       addition, further substrings from the subject  may  be  picked  out  by
3500       parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
3501       this is called "capturing" in what follows, and the  phrase  "capturing
3502       subpattern"  is  used for a fragment of a pattern that picks out a sub-
3503       string. PCRE supports several other kinds of  parenthesized  subpattern
3504       that do not cause substrings to be captured.
3505
3506       Captured substrings are returned to the caller via a vector of integers
3507       whose address is passed in ovector. The number of elements in the  vec-
3508       tor  is  passed in ovecsize, which must be a non-negative number. Note:
3509       this argument is NOT the size of ovector in bytes.
3510
3511       The first two-thirds of the vector is used to pass back  captured  sub-
3512       strings,  each  substring using a pair of integers. The remaining third
3513       of the vector is used as workspace by pcre_exec() while  matching  cap-
3514       turing  subpatterns, and is not available for passing back information.
3515       The number passed in ovecsize should always be a multiple of three.  If
3516       it is not, it is rounded down.
3517
3518       When  a  match  is successful, information about captured substrings is
3519       returned in pairs of integers, starting at the  beginning  of  ovector,
3520       and  continuing  up  to two-thirds of its length at the most. The first
3521       element of each pair is set to the offset of the first character  in  a
3522       substring,  and  the second is set to the offset of the first character
3523       after the end of a substring. These values are always  data  unit  off-
3524       sets,  even  in  UTF  mode. They are byte offsets in the 8-bit library,
3525       16-bit data item offsets in the 16-bit library, and  32-bit  data  item
3526       offsets in the 32-bit library. Note: they are not character counts.
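
       Because the offsets are data unit offsets, a caller that needs a char-
       acter  count  in  UTF-8 mode has to derive it. A minimal sketch for the
       8-bit library follows; the function name is hypothetical, and the  code
       assumes  a valid UTF-8 subject with both offsets on character bound-
       aries, as is the case for offsets returned by a successful match.

```c
/* Illustrative helper, not part of PCRE. Counts the UTF-8 characters
 * in the byte range subject[start] .. subject[end - 1] by counting the
 * bytes that are not continuation bytes (10xxxxxx). */
int utf8_chars_between(const unsigned char *subject, int start, int end)
{
    int count = 0;
    int i;

    for (i = start; i < end; i++) {
        if ((subject[i] & 0xC0) != 0x80)
            count++;              /* first byte of a character */
    }
    return count;
}
```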
3527
3528       The  first  pair  of  integers, ovector[0] and ovector[1], identify the
3529       portion of the subject string matched by the entire pattern.  The  next
3530       pair  is  used for the first capturing subpattern, and so on. The value
3531       returned by pcre_exec() is one more than the highest numbered pair that
3532       has  been  set.  For example, if two substrings have been captured, the
3533       returned value is 3. If there are no capturing subpatterns, the  return
3534       value from a successful match is 1, indicating that just the first pair
3535       of offsets has been set.
3536
3537       If a capturing subpattern is matched repeatedly, it is the last portion
3538       of the string that it matched that is returned.
3539
3540       If  the vector is too small to hold all the captured substring offsets,
3541       it is used as far as possible (up to two-thirds of its length), and the
3542       function  returns a value of zero. If neither the actual string matched
3543       nor any captured substrings are of interest, pcre_exec() may be  called
3544       with  ovector passed as NULL and ovecsize as zero. However, if the pat-
3545       tern contains back references and the ovector  is  not  big  enough  to
3546       remember  the related substrings, PCRE has to get additional memory for
3547       use during matching. Thus it is usually advisable to supply an  ovector
3548       of reasonable size.
3549
3550       There  are  some  cases where zero is returned (indicating vector over-
3551       flow) when in fact the vector is exactly the right size for  the  final
3552       match. For example, consider the pattern
3553
3554         (a)(?:(b)c|bd)
3555
3556       If  a  vector of 6 elements (allowing for only 1 captured substring) is
3557       given with subject string "abd", pcre_exec() will try to set the second
3558       captured string, thereby recording a vector overflow, before failing to
3559       match "c" and backing up  to  try  the  second  alternative.  The  zero
3560       return,  however,  does  correctly  indicate that the maximum number of
3561       slots (namely 2) have been filled. In similar cases where there is tem-
3562       porary  overflow,  but  the final number of used slots is actually less
3563       than the maximum, a non-zero value is returned.
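
       The  meaning  of  the  return  value can be summarized in a small helper.
       This is a sketch with a hypothetical name; it covers  only  the  zero
       and  positive  cases  described above, and treats every negative value
       as "no pairs to read", which deliberately leaves partial matches out.

```c
/* Illustrative helper, not part of PCRE. Given the value returned by
 * pcre_exec() and the ovecsize that was passed to it, returns how many
 * offset pairs at the start of ovector hold results. */
int pairs_filled(int rc, int ovecsize)
{
    if (rc < 0)
        return 0;                 /* error codes: not handled here */
    if (rc == 0)
        return ovecsize / 3;      /* overflow: every available pair used */
    return rc;                    /* highest numbered pair set, plus one */
}
```

       For the 6-element vector in the example above, a zero return corresponds
       to 2 filled pairs.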
3564
3565       The pcre_fullinfo() function can be used to find out how many capturing
3566       subpatterns  there  are  in  a  compiled pattern. The smallest size for
3567       ovector that will allow for n captured substrings, in addition  to  the
3568       offsets of the substring matched by the whole pattern, is (n+1)*3.
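
       As  an  illustration of the (n+1)*3 rule, the sketch below allocates a
       vector of the smallest sufficient size. The function name is hypothet-
       ical,  and  capture_count  is  assumed  to  have been obtained else-
       where, for example from pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT.

```c
#include <stdlib.h>

/* Illustrative helper, not part of PCRE. Allocates an ovector large
 * enough for the whole match plus capture_count captured substrings.
 * On success, stores the element count to pass as ovecsize and returns
 * the vector, which the caller must free(). Returns NULL on failure. */
int *alloc_ovector(int capture_count, int *ovecsize)
{
    int size;
    int *vec;

    if (capture_count < 0)
        return NULL;

    size = (capture_count + 1) * 3;   /* two-thirds results, one-third workspace */
    vec = malloc((size_t)size * sizeof *vec);
    if (vec != NULL)
        *ovecsize = size;
    return vec;
}
```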
3569
3570       It  is  possible for capturing subpattern number n+1 to match some part
3571       of the subject when subpattern n has not been used at all. For example,
3572       if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
3573       return from the function is 4, and subpatterns 1 and 3 are matched, but
3574       2  is  not.  When  this happens, both values in the offset pairs corre-
3575       sponding to unused subpatterns are set to -1.
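
       The following sketch shows the offset arithmetic for reading one  cap-
       tured  substring,  including  the  check for -1. It is an illustration
       for the 8-bit library with a hypothetical name and  hypothetical  return
       codes; in real code the convenience functions  such  as  pcre_copy_sub-
       string() serve this purpose.

```c
#include <string.h>

/* Illustrative helper, not part of PCRE. Copies captured substring
 * number i into buf as a zero-terminated string.
 * rc is the positive value returned by pcre_exec().
 * Returns the length copied, -1 if the substring is unset or i is out
 * of range, or -2 if buf is too small. */
int copy_group(const char *subject, const int *ovector, int rc,
               int i, char *buf, int bufsize)
{
    int start, end, len;

    if (i < 0 || i >= rc)
        return -1;                    /* pair was not set by this match */

    start = ovector[2 * i];           /* offset of first data unit */
    end = ovector[2 * i + 1];         /* offset just past the last one */
    if (start < 0 || end < start)
        return -1;                    /* unused subpattern: both are -1 */

    len = end - start;
    if (len + 1 > bufsize)
        return -2;

    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

       For  the  example  above, with "abc" matched against (a|(z))(bc), sub-
       string 3 is "bc" and substring 2 is reported as unset.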
3576
3577       Offset values that correspond to unused subpatterns at the end  of  the
3578       expression  are  also  set  to  -1. For example, if the string "abc" is
3579       matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
3580       matched.  The  return  from the function is 2, because the highest used
3581       capturing subpattern number is 1, and  the  offsets  for  the  second
3582       and  third  capturing subpatterns (assuming the vector is large enough,
3583       of course) are set to -1.
3584
3585       Note: Elements in the first two-thirds of ovector that  do  not  corre-
3586       spond  to  capturing parentheses in the pattern are never changed. That
3587       is, if a pattern contains n capturing parentheses, no more  than  ovec-
3588       tor[0]  to ovector[2n+1] are set by pcre_exec(). The other elements (in
3589       the first two-thirds) retain whatever values they previously had.
3590
3591       Some convenience functions are provided  for  extracting  the  captured
3592       substrings as separate strings. These are described below.
3593
3594   Error return values from pcre_exec()
3595
3596       If  pcre_exec()  fails, it returns a negative number. The following are
3597       defined in the header file:
3598
3599         PCRE_ERROR_NOMATCH        (-1)
3600
3601       The subject string did not match the pattern.
3602
3603         PCRE_ERROR_NULL           (-2)
3604
3605       Either code or subject was passed as NULL,  or  ovector  was  NULL  and
3606       ovecsize was not zero.
3607
3608         PCRE_ERROR_BADOPTION      (-3)
3609
3610       An unrecognized bit was set in the options argument.
3611
3612         PCRE_ERROR_BADMAGIC       (-4)
3613
3614       PCRE  stores a 4-byte "magic number" at the start of the compiled code,
3615       to catch the case when it is passed a junk pointer and to detect when a
3616       pattern that was compiled in an environment of one endianness is run in
3617       an environment with the other endianness. This is the error  that  PCRE
3618       gives when the magic number is not present.
3619
3620         PCRE_ERROR_UNKNOWN_OPCODE (-5)
3621
3622       While running the pattern match, an unknown item was encountered in the
3623       compiled pattern. This error could be caused by a bug  in  PCRE  or  by
3624       overwriting of the compiled pattern.
3625
3626         PCRE_ERROR_NOMEMORY       (-6)
3627
3628       If  a  pattern contains back references, but the ovector that is passed
3629       to pcre_exec() is not big enough to remember the referenced substrings,
3630       PCRE  gets  a  block of memory at the start of matching to use for this
3631       purpose. If the call via pcre_malloc() fails, this error is given.  The
3632       memory is automatically freed at the end of matching.
3633
3634       This  error  is also given if pcre_stack_malloc() fails in pcre_exec().
3635       This can happen only when PCRE has been compiled with  --disable-stack-
3636       for-recursion.
3637
3638         PCRE_ERROR_NOSUBSTRING    (-7)
3639
3640       This  error is used by the pcre_copy_substring(), pcre_get_substring(),
3641       and  pcre_get_substring_list()  functions  (see  below).  It  is  never
3642       returned by pcre_exec().
3643
3644         PCRE_ERROR_MATCHLIMIT     (-8)
3645
3646       The  backtracking  limit,  as  specified  by the match_limit field in a
3647       pcre_extra structure (or defaulted) was reached.  See  the  description
3648       above.
3649
3650         PCRE_ERROR_CALLOUT        (-9)
3651
3652       This error is never generated by pcre_exec() itself. It is provided for
3653       use by callout functions that want to yield a distinctive  error  code.
3654       See the pcrecallout documentation for details.
3655
3656         PCRE_ERROR_BADUTF8        (-10)
3657
3658       A  string  that contains an invalid UTF-8 byte sequence was passed as a
3659       subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size  of
3660       the  output  vector  (ovecsize)  is  at least 2, the byte offset to the
3661       start of the invalid UTF-8 character is placed in  the  first  element,
3662       and  a  reason  code  is  placed  in the second element. The reason
3663       codes are listed in the following section.  For backward compatibility,
3664       if  PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
3665       acter  at  the  end  of  the   subject   (reason   codes   1   to   5),
3666       PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
3667
3668         PCRE_ERROR_BADUTF8_OFFSET (-11)
3669
3670       The  UTF-8  byte  sequence that was passed as a subject was checked and
3671       found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but  the
3672       value  of startoffset did not point to the beginning of a UTF-8 charac-
3673       ter or the end of the subject.
3674
3675         PCRE_ERROR_PARTIAL        (-12)
3676
3677       The subject string did not match, but it did match partially.  See  the
3678       pcrepartial documentation for details of partial matching.
3679
3680         PCRE_ERROR_BADPARTIAL     (-13)
3681
3682       This  code  is  no  longer  in  use.  It was formerly returned when the
3683       PCRE_PARTIAL option was used with a compiled pattern  containing  items
3684       that  were  not  supported  for  partial  matching.  From  release 8.00
3685       onwards, there are no restrictions on partial matching.
3686
3687         PCRE_ERROR_INTERNAL       (-14)
3688
3689       An unexpected internal error has occurred. This error could  be  caused
3690       by a bug in PCRE or by overwriting of the compiled pattern.
3691
3692         PCRE_ERROR_BADCOUNT       (-15)
3693
3694       This error is given if the value of the ovecsize argument is negative.
3695
3696         PCRE_ERROR_RECURSIONLIMIT (-21)
3697
3698       The internal recursion limit, as specified by the match_limit_recursion
3699       field in a pcre_extra structure (or defaulted)  was  reached.  See  the
3700       description above.
3701
3702         PCRE_ERROR_BADNEWLINE     (-23)
3703
3704       An invalid combination of PCRE_NEWLINE_xxx options was given.
3705
3706         PCRE_ERROR_BADOFFSET      (-24)
3707
3708       The value of startoffset was negative or greater than the length of the
3709       subject, that is, the value in length.
3710
3711         PCRE_ERROR_SHORTUTF8      (-25)
3712
3713       This error is returned instead of PCRE_ERROR_BADUTF8 when  the  subject
3714       string  ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
3715       option is set. Information about the failure is returned as for
3716       PCRE_ERROR_BADUTF8. That information is in fact sufficient to detect
3717       this case, but this special error code for PCRE_PARTIAL_HARD predates
3718       the implementation of returned information; it is retained for
3719       backwards compatibility.
3720
3721         PCRE_ERROR_RECURSELOOP    (-26)
3722
3723       This error is returned when pcre_exec() detects a recursion loop within
3724       the  pattern. Specifically, it means that either the whole pattern or a
3725       subpattern has been called recursively for the second time at the  same
3726       position in the subject string. Some simple patterns that might do this
3727       are detected and faulted at compile time, but more  complicated  cases,
3728       in particular mutual recursions between two different subpatterns, can-
3729       not be detected until run time.
3730
3731         PCRE_ERROR_JIT_STACKLIMIT (-27)
3732
3733       This error is returned when a pattern  that  was  successfully  studied
3734       using  a  JIT compile option is being matched, but the memory available
3735       for the just-in-time processing stack is  not  large  enough.  See  the
3736       pcrejit documentation for more details.
3737
3738         PCRE_ERROR_BADMODE        (-28)
3739
3740       This error is given if a pattern that was compiled by the 8-bit library
3741       is passed to a 16-bit or 32-bit library function, or vice versa.
3742
3743         PCRE_ERROR_BADENDIANNESS  (-29)
3744
3745       This error is given if  a  pattern  that  was  compiled  and  saved  is
3746       reloaded  on  a  host  with  different endianness. The utility function
3747       pcre_pattern_to_host_byte_order() can be used to convert such a pattern
3748       so that it runs on the new host.
3749
3750         PCRE_ERROR_JIT_BADOPTION  (-31)
3751
3752       This  error  is  returned  when a pattern that was successfully studied
3753       using a JIT compile option is being  matched,  but  the  matching  mode
3754       (partial  or complete match) does not correspond to any JIT compilation
3755       mode. When the JIT fast path function is used, this error may  be  also
3756       given  for  invalid  options.  See  the  pcrejit documentation for more
3757       details.
3758
3759         PCRE_ERROR_BADLENGTH      (-32)
3760
3761       This error is given if pcre_exec() is called with a negative value  for
3762       the length argument.
3763
3764       Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
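
       The PCRE_ERROR_BADOFFSET and PCRE_ERROR_BADUTF8_OFFSET conditions
       described above can be checked by the caller before pcre_exec() is
       called. The following is a minimal sketch, not PCRE's own code; the
       function name is hypothetical, and the subject is assumed to be
       valid UTF-8 already.

```c
/* Returns 1 if startoffset is within the subject and is either the end of
   the subject or the first byte of a UTF-8 character; returns 0 otherwise.
   Continuation bytes have the bit pattern 10xxxxxx, so they are the only
   bytes that cannot begin a character. (Hypothetical helper.) */
int start_offset_is_usable(const char *subject, int length, int startoffset)
{
    if (startoffset < 0 || startoffset > length)
        return 0;                   /* would give PCRE_ERROR_BADOFFSET */
    if (startoffset == length)
        return 1;                   /* the end of the subject is allowed */
    return ((unsigned char)subject[startoffset] & 0xc0) != 0x80;
}
```

       For the four-byte subject "a", 0xc3, 0xa9, "z", offsets 0, 1, 3 and 4
       are usable, whereas offset 2 lands in the middle of a character.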
3765
3766   Reason codes for invalid UTF-8 strings
3767
3768       This  section  applies  only  to  the  8-bit library. The corresponding
3769       information for the 16-bit and 32-bit libraries is given in the  pcre16
3770       and pcre32 pages.
3771
3772       When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
3773       UTF8, and the size of the output vector (ovecsize) is at least  2,  the
3774       offset  of  the  start  of the invalid UTF-8 character is placed in the
3775       first output vector element (ovector[0]) and a reason code is placed in
3776       the  second  element  (ovector[1]). The reason codes are given names in
3777       the pcre.h header file:
3778
3779         PCRE_UTF8_ERR1
3780         PCRE_UTF8_ERR2
3781         PCRE_UTF8_ERR3
3782         PCRE_UTF8_ERR4
3783         PCRE_UTF8_ERR5
3784
3785       The string ends with a truncated UTF-8 character;  the  code  specifies
3786       how  many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
3787       characters to be no longer than 4 bytes, the  encoding  scheme  (origi-
3788       nally  defined  by  RFC  2279)  allows  for  up to 6 bytes, and this is
3789       checked first; hence the possibility of 4 or 5 missing bytes.
3790
3791         PCRE_UTF8_ERR6
3792         PCRE_UTF8_ERR7
3793         PCRE_UTF8_ERR8
3794         PCRE_UTF8_ERR9
3795         PCRE_UTF8_ERR10
3796
3797       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
3798       the  character  do  not have the binary value 0b10 (that is, either the
3799       most significant bit is 0, or the next bit is 1).
3800
3801         PCRE_UTF8_ERR11
3802         PCRE_UTF8_ERR12
3803
3804       A character that is valid by the RFC 2279 rules is either 5 or 6  bytes
3805       long; these code points are excluded by RFC 3629.
3806
3807         PCRE_UTF8_ERR13
3808
3809       A 4-byte character has a value greater than 0x10ffff; these code points
3810       are excluded by RFC 3629.
3811
3812         PCRE_UTF8_ERR14
3813
3814       A 3-byte character has a value in the  range  0xd800  to  0xdfff;  this
3815       range  of  code points is reserved by RFC 3629 for use with UTF-16, and
3816       so is excluded from UTF-8.
3817
3818         PCRE_UTF8_ERR15
3819         PCRE_UTF8_ERR16
3820         PCRE_UTF8_ERR17
3821         PCRE_UTF8_ERR18
3822         PCRE_UTF8_ERR19
3823
3824       A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it  codes
3825       for  a  value that can be represented by fewer bytes, which is invalid.
3826       For example, the two bytes 0xc0, 0xae give the value 0x2e,  whose  cor-
3827       rect coding uses just one byte.
3828
3829         PCRE_UTF8_ERR20
3830
3831       The two most significant bits of the first byte of a character have the
3832       binary value 0b10 (that is, the most significant bit is 1 and the  sec-
3833       ond  is  0). Such a byte can only validly occur as the second or subse-
3834       quent byte of a multi-byte character.
3835
3836         PCRE_UTF8_ERR21
3837
3838       The first byte of a character has the value 0xfe or 0xff. These  values
3839       can never occur in a valid UTF-8 string.
3840
3841         PCRE_UTF8_ERR22
3842
3843       This  error  code  was  formerly  used when the presence of a so-called
3844       "non-character" caused an error. Unicode corrigendum #9 makes it  clear
3845       that  such  characters should not cause a string to be rejected, and so
3846       this code is no longer in use and is never returned.
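
       The groups of reason codes above can be folded into a small lookup
       for diagnostic messages. This sketch assumes that PCRE_UTF8_ERRn has
       the numeric value n (the text implies this for codes 1 to 5); the
       function names are hypothetical and the wording of the messages is
       only a summary of the descriptions above.

```c
/* Summarises a UTF-8 reason code taken from ovector[1].
   Assumption: PCRE_UTF8_ERRn has the numeric value n. */
const char *utf8_reason_text(int reason)
{
    if (reason >= 1 && reason <= 5)
        return "truncated character at end of subject";
    if (reason >= 6 && reason <= 10)
        return "continuation byte does not start with bits 10";
    if (reason == 11 || reason == 12)
        return "5- or 6-byte character, excluded by RFC 3629";
    if (reason == 13)
        return "4-byte character greater than 0x10ffff";
    if (reason == 14)
        return "3-byte character in the range 0xd800 to 0xdfff";
    if (reason >= 15 && reason <= 19)
        return "overlong encoding";
    if (reason == 20)
        return "continuation byte used as first byte";
    if (reason == 21)
        return "first byte is 0xfe or 0xff";
    if (reason == 22)
        return "non-character (code no longer returned)";
    return "unknown reason code";
}

/* For codes 1 to 5 the code itself is the number of missing bytes;
   for any other code this returns 0. */
int utf8_missing_bytes(int reason)
{
    return (reason >= 1 && reason <= 5) ? reason : 0;
}
```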
3847
3848
3849EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
3850
3851       int pcre_copy_substring(const char *subject, int *ovector,
3852            int stringcount, int stringnumber, char *buffer,
3853            int buffersize);
3854
3855       int pcre_get_substring(const char *subject, int *ovector,
3856            int stringcount, int stringnumber,
3857            const char **stringptr);
3858
3859       int pcre_get_substring_list(const char *subject,
3860            int *ovector, int stringcount, const char ***listptr);
3861
3862       Captured substrings can be  accessed  directly  by  using  the  offsets
3863       returned  by  pcre_exec()  in  ovector.  For convenience, the functions
3864       pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
3865       string_list()  are  provided for extracting captured substrings as new,
3866       separate, zero-terminated strings. These functions identify  substrings
3867       by  number.  The  next section describes functions for extracting named
3868       substrings.
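
       Direct access through the offsets can be sketched as follows. The
       helper name is hypothetical; it relies only on ovector holding one
       (start, end) pair of offsets per substring, and on negative offsets
       marking a substring that was not set.

```c
#include <stddef.h>

/* Points *start at captured substring n inside the subject (n == 0 is the
   whole match) and returns its length, without copying anything. Returns
   -1 and sets *start to NULL if the substring is unset. The caller must
   ensure that n is less than the number of captured substrings.
   (Hypothetical helper.) */
int capture_span(const char *subject, const int *ovector, int n,
                 const char **start)
{
    if (ovector[2 * n] < 0) {
        *start = NULL;
        return -1;
    }
    *start = subject + ovector[2 * n];
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

       The result is not zero-terminated, so it is printed with an explicit
       length, for example printf("%.*s", length, start).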
3869
3870       A substring that contains a binary zero is correctly extracted and  has
3871       a  further zero added on the end, but the result is not, of course, a C
3872       string.  However, you can process such a string  by  referring  to  the
3873       length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
3874       string().  Unfortunately, the interface to pcre_get_substring_list() is
3875       not  adequate for handling strings containing binary zeros, because the
3876       end of the final string is not independently indicated.
3877
3878       The first three arguments are the same for all  three  of  these  func-
3879       tions:  subject  is  the subject string that has just been successfully
3880       matched, ovector is a pointer to the vector of integer offsets that was
3881       passed to pcre_exec(), and stringcount is the number of substrings that
3882       were captured by the match, including the substring  that  matched  the
3883       entire regular expression. This is the value returned by pcre_exec() if
3884       it is greater than zero. If pcre_exec() returned zero, indicating  that
3885       it  ran out of space in ovector, the value passed as stringcount should
3886       be the number of elements in the vector divided by three.
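
       The rule for choosing stringcount can be written as a small helper.
       This is a sketch; the function name is hypothetical, and returning 0
       for a negative result is a choice made here, not something PCRE
       defines.

```c
/* Derives the stringcount argument from the pcre_exec() return value rc
   and the number of elements in ovector. (Hypothetical helper.) */
int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* number of captured substrings */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was too small: use what fits */
    return 0;                   /* no match or an error: nothing to extract */
}
```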
3887
3888       The functions pcre_copy_substring() and pcre_get_substring() extract  a
3889       single  substring,  whose  number  is given as stringnumber. A value of
3890       zero extracts the substring that matched the  entire  pattern,  whereas
3891       higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
3892       string(), the string is placed in buffer,  whose  length  is  given  by
3893       buffersize,  while  for  pcre_get_substring()  a new block of memory is
3894       obtained via pcre_malloc, and its address is  returned  via  stringptr.
3895       The  yield  of  the function is the length of the string, not including
3896       the terminating zero, or one of these error codes:
3897
3898         PCRE_ERROR_NOMEMORY       (-6)
3899
3900       The buffer was too small for pcre_copy_substring(), or the  attempt  to
3901       get memory failed for pcre_get_substring().
3902
3903         PCRE_ERROR_NOSUBSTRING    (-7)
3904
3905       There is no substring whose number is stringnumber.
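
       The behaviour described for pcre_copy_substring() can be illustrated
       with a stand-alone sketch. It is not the library's source; the
       function and macro names are hypothetical, and the two error values
       are copied from the list above.

```c
#include <string.h>

#define SKETCH_ERROR_NOMEMORY    (-6)   /* same value as PCRE_ERROR_NOMEMORY */
#define SKETCH_ERROR_NOSUBSTRING (-7)   /* same value as PCRE_ERROR_NOSUBSTRING */

/* Copies substring stringnumber into buffer as a zero-terminated string and
   returns its length, or one of the two error values above. */
int copy_substring_sketch(const char *subject, const int *ovector,
                          int stringcount, int stringnumber,
                          char *buffer, int buffersize)
{
    int length;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    length = ovector[2 * stringnumber + 1] - ovector[2 * stringnumber];
    if (buffersize < length + 1)        /* room for the terminating zero */
        return SKETCH_ERROR_NOMEMORY;

    if (length > 0)
        memcpy(buffer, subject + ovector[2 * stringnumber], (size_t)length);
    buffer[length] = 0;
    return length;
}
```

       An unset substring, whose two offsets are both -1, has a length of
       zero here and so yields an empty string.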
3906
3907       The  pcre_get_substring_list()  function  extracts  all  available sub-
3908       strings and builds a list of pointers to them. All this is  done  in  a
3909       single block of memory that is obtained via pcre_malloc. The address of
3910       the memory block is returned via listptr, which is also  the  start  of
3911       the  list  of  string pointers. The end of the list is marked by a NULL
3912       pointer. The yield of the function is zero if all  went  well,  or  the
3913       error code
3914
3915         PCRE_ERROR_NOMEMORY       (-6)
3916
3917       if the attempt to get the memory block failed.
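
       Because the list ends with a NULL pointer, it can be walked without
       knowing stringcount. A minimal sketch, with a hypothetical function
       name:

```c
#include <stddef.h>

/* Counts the entries in a NULL-terminated list of string pointers, such as
   the one returned via listptr. (Hypothetical helper.) */
int count_substring_list(const char **list)
{
    int n = 0;
    while (list[n] != NULL)
        n++;
    return n;
}
```

       The same loop shape can be used to print or copy each entry before
       the block is released.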
3918
3919       When any of these functions encounters a substring that is unset, which
3920       can happen when capturing subpattern number n+1 matches  some  part  of
3921       the  subject, but subpattern n has not been used at all, they return an
3922       empty string. This can be distinguished from a genuine zero-length sub-
3923       string  by inspecting the appropriate offset in ovector, which is nega-
3924       tive for unset substrings.
3925
3926       The two convenience functions pcre_free_substring() and  pcre_free_sub-
3927       string_list()  can  be  used  to free the memory returned by a previous
3928       call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
3929       tively.  They  do  nothing  more  than  call the function pointed to by
3930       pcre_free, which of course could be called directly from a  C  program.
3931       However,  PCRE is used in some situations where it is linked via a spe-
3932       cial  interface  to  another  programming  language  that  cannot   use
3933       pcre_free  directly;  it is for these cases that the functions are pro-
3934       vided.
3935
3936
3937EXTRACTING CAPTURED SUBSTRINGS BY NAME
3938
3939       int pcre_get_stringnumber(const pcre *code,
3940            const char *name);
3941
3942       int pcre_copy_named_substring(const pcre *code,
3943            const char *subject, int *ovector,
3944            int stringcount, const char *stringname,
3945            char *buffer, int buffersize);
3946
3947       int pcre_get_named_substring(const pcre *code,
3948            const char *subject, int *ovector,
3949            int stringcount, const char *stringname,
3950            const char **stringptr);
3951
3952       To extract a substring by name, you first have to find the associated
3953       number. For example, for this pattern
3954
3955         (a+)b(?<xxx>\d+)...
3956
3957       the number of the subpattern called "xxx" is 2. If the name is known to
3958       be unique (PCRE_DUPNAMES was not set), you can find the number from the
3959       name by calling pcre_get_stringnumber(). The first argument is the com-
3960       piled pattern, and the second is the name. The yield of the function is
3961       the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
3962       subpattern of that name.
3963
3964       Given the number, you can extract the substring directly, or use one of
3965       the functions described in the previous section. For convenience, there
3966       are also two functions that do the whole job.
3967
3968       Most   of   the   arguments    of    pcre_copy_named_substring()    and
3969       pcre_get_named_substring()  are  the  same  as  those for the similarly
3970       named functions that extract by number. As these are described  in  the
3971       previous  section,  they  are not re-described here. There are just two
3972       differences:
3973
3974       First, instead of a substring number, a substring name is  given.  Sec-
3975       ond, there is an extra argument, given at the start, which is a pointer
3976       to the compiled pattern. This is needed in order to gain access to  the
3977       name-to-number translation table.
3978
3979       These  functions call pcre_get_stringnumber(), and if it succeeds, they
3980       then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
3981       ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
3982       behaviour may not be what you want (see the next section).
3983
3984       Warning: If the pattern uses the (?| feature to set up multiple subpat-
3985       terns  with  the  same number, as described in the section on duplicate
3986       subpattern numbers in the pcrepattern page, you  cannot  use  names  to
3987       distinguish  the  different subpatterns, because names are not included
3988       in the compiled code. The matching process uses only numbers. For  this
3989       reason,  the  use of different names for subpatterns of the same number
3990       causes an error at compile time.
3991
3992
3993DUPLICATE SUBPATTERN NAMES
3994
3995       int pcre_get_stringtable_entries(const pcre *code,
3996            const char *name, char **first, char **last);
3997
3998       When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for
3999       subpatterns  are not required to be unique. (Duplicate names are always
4000       allowed for subpatterns with the same number, created by using the  (?|
4001       feature.  Indeed,  if  such subpatterns are named, they are required to
4002       use the same names.)
4003
4004       Normally, patterns with duplicate names are such that in any one match,
4005       only  one of the named subpatterns participates. An example is shown in
4006       the pcrepattern documentation.
4007
4008       When   duplicates   are   present,   pcre_copy_named_substring()    and
4009       pcre_get_named_substring()  return the first substring corresponding to
4010       the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING
4011       (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()
4012       function returns one of the numbers that are associated with the  name,
4013       but it is not defined which it is.
4014
4015       If  you want to get full details of all captured substrings for a given
4016       name, you must use  the  pcre_get_stringtable_entries()  function.  The
4017       first argument is the compiled pattern, and the second is the name. The
4018       third and fourth are pointers to variables which  are  updated  by  the
4019       function. After it has run, they point to the first and last entries in
4020       the name-to-number table  for  the  given  name.  The  function  itself
4021       returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
4022       there are none. The format of the table is described above in the
4023       section entitled "Information about a pattern". Given all the relevant
4024       entries for the name, you can extract each of their numbers, and
4025       hence the captured data, if any.
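
       Extracting the numbers from the entries between first and last can be
       sketched as below. The sketch assumes the 8-bit library's table
       layout, in which each entry starts with a two-byte group number, most
       significant byte first, followed by the zero-terminated name. The
       function name is hypothetical.

```c
/* Collects the group numbers from the name-table entries first..last.
   entry_size is the value returned by pcre_get_stringtable_entries().
   At most max_numbers values are stored; the count stored is returned.
   (Hypothetical helper; assumes the 8-bit table layout.) */
int group_numbers(const char *first, const char *last, int entry_size,
                  int *numbers, int max_numbers)
{
    int count = 0;
    const char *entry;

    for (entry = first; entry <= last && count < max_numbers;
         entry += entry_size) {
        numbers[count++] = ((unsigned char)entry[0] << 8) |
                           (unsigned char)entry[1];
    }
    return count;
}
```

       Each number can then be used to index ovector, or passed as
       stringnumber to the functions that extract substrings by number.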
4026
4027
4028FINDING ALL POSSIBLE MATCHES
4029
4030       The  traditional  matching  function  uses a similar algorithm to Perl,
4031       which stops when it finds the first match, starting at a given point in
4032       the  subject.  If you want to find all possible matches, or the longest
4033       possible match, consider using the alternative matching  function  (see
4034       below)  instead.  If you cannot use the alternative function, but still
4035       need to find all possible matches, you can kludge it up by  making  use
4036       of the callout facility, which is described in the pcrecallout documen-
4037       tation.
4038
4039       What you have to do is to insert a callout right at the end of the pat-
4040       tern.   When your callout function is called, extract and save the cur-
4041       rent matched substring. Then return  1,  which  forces  pcre_exec()  to
4042       backtrack  and  try other alternatives. Ultimately, when it runs out of
4043       matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
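
       A callout of this kind can be sketched as follows. So that the sketch
       stands alone, it uses a cut-down, hypothetical stand-in for
       pcre_callout_block with only the start_match and current_position
       fields; a real callout receives the full structure declared in
       pcre.h. The fixed-size storage and all the names are assumptions.

```c
/* Hypothetical stand-in for pcre_callout_block: only two of its fields. */
typedef struct {
    int start_match;        /* offset where this match attempt started */
    int current_position;   /* offset reached when the callout was hit */
} demo_callout_block;

#define MAX_SAVED 16

int saved_start[MAX_SAVED];
int saved_end[MAX_SAVED];
int saved_count = 0;

/* Records the span matched so far, then returns 1. A return value greater
   than zero makes matching fail at this point, so the matcher backtracks
   and tries other alternatives. */
int save_match_callout(demo_callout_block *block)
{
    if (saved_count < MAX_SAVED) {
        saved_start[saved_count] = block->start_match;
        saved_end[saved_count] = block->current_position;
        saved_count++;
    }
    return 1;
}
```

       When the match finally fails with PCRE_ERROR_NOMATCH, the saved
       offset pairs describe every match that was reached.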
4044
4045
4046OBTAINING AN ESTIMATE OF STACK USAGE
4047
4048       Matching certain patterns using pcre_exec() can use a  lot  of  process
4049       stack,  which  in  certain  environments can be rather limited in size.
4050       Some users find it helpful to have an estimate of the amount  of  stack
4051       that  is  used  by  pcre_exec(),  to help them set recursion limits, as
4052       described in the pcrestack documentation. The estimate that  is  output
4053       by pcretest when called with the -m and -C options is obtained by call-
4054       ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for  its
4055       first five arguments.
4056
4057       Normally,  if  its  first  argument  is  NULL,  pcre_exec() immediately
4058       returns the negative error code PCRE_ERROR_NULL, but with this  special
4059       combination  of  arguments,  it returns instead a negative number whose
4060       absolute value is the approximate stack frame size in bytes.  (A  nega-
4061       tive  number  is  used so that it is clear that no match has happened.)
4062       The value is approximate because in  some  cases,  recursive  calls  to
4063       pcre_exec() occur when there are one or two additional variables on the
4064       stack.
4065
4066       If PCRE has been compiled to use the heap  instead  of  the  stack  for
4067       recursion,  the  value  returned  is  the  size  of  each block that is
4068       obtained from the heap.
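
       One way to turn the estimate into a recursion limit is to divide a
       stack budget by the frame size. This is a rough sketch under that
       assumption; the function name and the idea of a fixed budget are not
       part of PCRE, and the result should leave headroom for the rest of
       the program.

```c
/* estimate is the negative value returned by the special pcre_exec() call;
   its absolute value approximates the bytes used per recursion. Returns a
   candidate recursion limit, or 0 if the estimate is not usable.
   (Hypothetical helper.) */
long recursion_limit_for(long stack_budget_bytes, int estimate)
{
    long frame = -(long)estimate;

    if (frame <= 0 || stack_budget_bytes <= 0)
        return 0;
    return stack_budget_bytes / frame;
}
```

       For example, a budget of 8000000 bytes and an estimate of -500 give a
       limit of 16000.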
4069
4070
4071MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
4072
4073       int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
4074            const char *subject, int length, int startoffset,
4075            int options, int *ovector, int ovecsize,
4076            int *workspace, int wscount);
4077
4078       The function pcre_dfa_exec()  is  called  to  match  a  subject  string
4079       against  a  compiled pattern, using a matching algorithm that scans the
4080       subject string just once, and does not backtrack.  This  has  different
4081       characteristics  to  the  normal  algorithm, and is not compatible with
4082       Perl. Some of the features of PCRE patterns are not  supported.  Never-
4083       theless,  there are times when this kind of matching can be useful. For
4084       a discussion of the two matching algorithms, and  a  list  of  features
4085       that  pcre_dfa_exec() does not support, see the pcrematching documenta-
4086       tion.
4087
4088       The arguments for the pcre_dfa_exec() function  are  the  same  as  for
4089       pcre_exec(), plus two extras. The ovector argument is used in a differ-
4090       ent way, and this is described below. The other  common  arguments  are
4091       used  in  the  same way as for pcre_exec(), so their description is not
4092       repeated here.
4093
4094       The two additional arguments provide workspace for  the  function.  The
4095       workspace  vector  should  contain at least 20 elements. It is used for
4096       keeping  track  of  multiple  paths  through  the  pattern  tree.  More
4097       workspace  will  be  needed for patterns and subjects where there are a
4098       lot of potential matches.
4099
4100       Here is an example of a simple call to pcre_dfa_exec():
4101
4102         int rc;
4103         int ovector[10];
4104         int wspace[20];
4105         rc = pcre_dfa_exec(
4106           re,             /* result of pcre_compile() */
4107           NULL,           /* we didn't study the pattern */
4108           "some string",  /* the subject string */
4109           11,             /* the length of the subject string */
4110           0,              /* start at offset 0 in the subject */
4111           0,              /* default options */
4112           ovector,        /* vector of integers for substring information */
4113           10,             /* number of elements (NOT size in bytes) */
4114           wspace,         /* working space vector */
4115           20);            /* number of elements (NOT size in bytes) */
4116
4117   Option bits for pcre_dfa_exec()
4118
4119       The unused bits of the options argument  for  pcre_dfa_exec()  must  be
4120       zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-
4121       LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,
4122       PCRE_NOTEMPTY_ATSTART,       PCRE_NO_UTF8_CHECK,      PCRE_BSR_ANYCRLF,
4123       PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD,  PCRE_PAR-
4124       TIAL_SOFT,  PCRE_DFA_SHORTEST,  and PCRE_DFA_RESTART.  All but the last
4125       four of these are  exactly  the  same  as  for  pcre_exec(),  so  their
4126       description is not repeated here.
4127
4128         PCRE_PARTIAL_HARD
4129         PCRE_PARTIAL_SOFT
4130
4131       These  have the same general effect as they do for pcre_exec(), but the
4132       details are slightly  different.  When  PCRE_PARTIAL_HARD  is  set  for
4133       pcre_dfa_exec(),  it  returns PCRE_ERROR_PARTIAL if the end of the sub-
4134       ject is reached and there is still at least  one  matching  possibility
4135       that requires additional characters. This happens even if some complete
4136       matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
4137       code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
4138       of the subject is reached, there have been  no  complete  matches,  but
4139       there  is  still  at least one matching possibility. The portion of the
4140       string that was inspected when the longest partial match was  found  is
4141       set  as  the  first  matching  string  in  both cases.  There is a more
4142       detailed discussion of partial and multi-segment matching,  with  exam-
4143       ples, in the pcrepartial documentation.
4144
4145         PCRE_DFA_SHORTEST
4146
4147       Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to
4148       stop as soon as it has found one match. Because of the way the alterna-
4149       tive  algorithm  works, this is necessarily the shortest possible match
4150       at the first possible matching point in the subject string.
4151
4152         PCRE_DFA_RESTART
4153
4154       When pcre_dfa_exec() returns a partial match, it is possible to call it
4155       again,  with  additional  subject characters, and have it continue with
4156       the same match. The PCRE_DFA_RESTART option requests this action;  when
4157       it is set, the workspace and wscount arguments must reference the  same
4158       vector as before because data about the match so far is  left  in  them
4159       after a partial match. There is more discussion of this facility in the
4160       pcrepartial documentation.
4161
4162   Successful returns from pcre_dfa_exec()
4163
4164       When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-
4165       string in the subject. Note, however, that all the matches from one run
4166       of the function start at the same point in  the  subject.  The  shorter
4167       matches  are all initial substrings of the longer matches. For example,
4168       if the pattern
4169
4170         <.*>
4171
4172       is matched against the string
4173
4174         This is <something> <something else> <something further> no more
4175
4176       the three matched strings are
4177
4178         <something>
4179         <something> <something else>
4180         <something> <something else> <something further>
4181
4182       On success, the yield of the function is a number  greater  than  zero,
4183       which  is  the  number of matched substrings. The substrings themselves
4184       are returned in ovector. Each string uses two elements;  the  first  is
4185       the  offset  to  the start, and the second is the offset to the end. In
4186       fact, all the strings have the same start  offset.  (Space  could  have
4187       been  saved by giving this only once, but it was decided to retain some
4188       compatibility with the way pcre_exec() returns data,  even  though  the
4189       meaning of the strings is different.)
4190
4191       The strings are returned in reverse order of length; that is, the long-
4192       est matching string is given first. If there were too many  matches  to
4193       fit  into ovector, the yield of the function is zero, and the vector is
4194       filled with the longest matches.  Unlike  pcre_exec(),  pcre_dfa_exec()
4195       can use the entire ovector for returning matched strings.
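
       Reading the matches back out of ovector can be sketched as follows.
       The function names are hypothetical. Treating a zero return as
       "ovecsize divided by two matches are present" is an inference from
       the statement that the entire vector is used.

```c
#include <string.h>

/* Number of (start, end) pairs available after pcre_dfa_exec() returned rc.
   (Hypothetical helper.) */
int dfa_match_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 2;    /* vector filled with the longest matches */
    return 0;
}

/* Copies match i (0 is the longest) into buffer as a zero-terminated
   string. Returns the length, or -1 if the buffer is too small. */
int dfa_copy_match(const char *subject, const int *ovector, int i,
                   char *buffer, int buffersize)
{
    int length = ovector[2 * i + 1] - ovector[2 * i];

    if (length < 0 || buffersize < length + 1)
        return -1;
    memcpy(buffer, subject + ovector[2 * i], (size_t)length);
    buffer[length] = 0;
    return length;
}
```

       For the <.*> example above, the three pairs would be (8, 56),
       (8, 36) and (8, 19), so match 2 is "<something>".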
4196
4197       NOTE:  PCRE's  "auto-possessification"  optimization usually applies to
4198       character repeats at the end of a pattern (as well as internally).  For
4199       example,  the  pattern "a\d+" is compiled as if it were "a\d++" because
4200       there is no point even considering the possibility of backtracking into
4201       the  repeated digits. For DFA matching, this means that only one possi-
4202       ble match is found. If you really do  want  multiple  matches  in  such
4203       cases,   either   use   an   ungreedy   repeat  ("a\d+?")  or  set  the
4204       PCRE_NO_AUTO_POSSESS option when compiling.
4205
4206   Error returns from pcre_dfa_exec()
4207
4208       The pcre_dfa_exec() function returns a negative number when  it  fails.
4209       Many  of  the  errors  are  the  same as for pcre_exec(), and these are
4210       described above.  There are in addition the following errors  that  are
4211       specific to pcre_dfa_exec():
4212
4213         PCRE_ERROR_DFA_UITEM      (-16)
4214
4215       This  return is given if pcre_dfa_exec() encounters an item in the pat-
4216       tern that it does not support, for instance, the use of \C  or  a  back
4217       reference.
4218
4219         PCRE_ERROR_DFA_UCOND      (-17)
4220
4221       This  return  is  given  if pcre_dfa_exec() encounters a condition item
4222       that uses a back reference for the condition, or a test  for  recursion
4223       in a specific group. These are not supported.
4224
4225         PCRE_ERROR_DFA_UMLIMIT    (-18)
4226
4227       This  return  is given if pcre_dfa_exec() is called with an extra block
4228       that contains a setting of  the  match_limit  or  match_limit_recursion
4229       fields.  This  is  not  supported (these fields are meaningless for DFA
4230       matching).
4231
4232         PCRE_ERROR_DFA_WSSIZE     (-19)
4233
4234       This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the
4235       workspace vector.
4236
4237         PCRE_ERROR_DFA_RECURSE    (-20)
4238
4239       When  a  recursive subpattern is processed, the matching function calls
4240       itself recursively, using private vectors for  ovector  and  workspace.
4241       This  error  is  given  if  the output vector is not large enough. This
4242       should be extremely rare, as a vector of size 1000 is used.
4243
4244         PCRE_ERROR_DFA_BADRESTART (-30)
4245
4246       When pcre_dfa_exec() is called with the PCRE_DFA_RESTART  option,  some
4247       plausibility  checks  are  made on the contents of the workspace, which
4248       should contain data about the previous partial match. If any  of  these
4249       checks fail, this error is given.
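
       A caller that shares error handling between pcre_exec() and
       pcre_dfa_exec() may want to recognise the DFA-specific codes. A
       minimal sketch, using the numeric values listed above and a
       hypothetical function name:

```c
/* Returns 1 if rc is one of the error codes listed above as specific to
   pcre_dfa_exec(), 0 otherwise. (Hypothetical helper.) */
int is_dfa_only_error(int rc)
{
    switch (rc) {
    case -16:   /* PCRE_ERROR_DFA_UITEM */
    case -17:   /* PCRE_ERROR_DFA_UCOND */
    case -18:   /* PCRE_ERROR_DFA_UMLIMIT */
    case -19:   /* PCRE_ERROR_DFA_WSSIZE */
    case -20:   /* PCRE_ERROR_DFA_RECURSE */
    case -30:   /* PCRE_ERROR_DFA_BADRESTART */
        return 1;
    default:
        return 0;
    }
}
```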
4250
4251
4252SEE ALSO
4253
4254       pcre16(3),   pcre32(3),   pcrebuild(3),   pcrecallout(3),   pcrecpp(3),
4255       pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-
4256       sample(3), pcrestack(3).
4257
4258
4259AUTHOR
4260
4261       Philip Hazel
4262       University Computing Service
4263       Cambridge CB2 3QH, England.
4264
4265
4266REVISION
4267
4268       Last updated: 09 February 2014
4269       Copyright (c) 1997-2014 University of Cambridge.
4270------------------------------------------------------------------------------
4271
4272
4273PCRECALLOUT(3)             Library Functions Manual             PCRECALLOUT(3)
4274
4275
4276
4277NAME
4278       PCRE - Perl-compatible regular expressions
4279
4280SYNOPSIS
4281
4282       #include <pcre.h>
4283
4284       int (*pcre_callout)(pcre_callout_block *);
4285
4286       int (*pcre16_callout)(pcre16_callout_block *);
4287
4288       int (*pcre32_callout)(pcre32_callout_block *);
4289
4290
4291DESCRIPTION
4292
4293       PCRE provides a feature called "callout", which is a means of temporar-
4294       ily passing control to the caller of PCRE  in  the  middle  of  pattern
4295       matching.  The  caller of PCRE provides an external function by putting
4296       its entry point in the global variable pcre_callout (pcre16_callout for
4297       the 16-bit library, pcre32_callout for the 32-bit library). By default,
4298       this variable contains NULL, which disables all calling out.
4299
4300       Within a regular expression, (?C) indicates the  points  at  which  the
4301       external  function  is  to  be  called. Different callout points can be
4302       identified by putting a number less than 256 after the  letter  C.  The
4303       default  value  is  zero.   For  example,  this pattern has two callout
4304       points:
4305
4306         (?C1)abc(?C2)def
4307
4308       If the PCRE_AUTO_CALLOUT option bit is set when a pattern is  compiled,
4309       PCRE  automatically  inserts callouts, all with number 255, before each
4310       item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the
4311       pattern
4312
4313         A(\d{2}|--)
4314
4315       it is processed as if it were
4316
4317       (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4318
4319       Notice  that  there  is a callout before and after each parenthesis and
4320       alternation bar. If the pattern contains a conditional group whose con-
4321       dition  is  an  assertion, an automatic callout is inserted immediately
4322       before the condition. Such a callout may also be  inserted  explicitly,
4323       for example:
4324
4325         (?(?C9)(?=a)ab|de)
4326
4327       This  applies only to assertion conditions (because they are themselves
4328       independent groups).
4329
4330       Automatic callouts can be used for tracking  the  progress  of  pattern
4331       matching.   The pcretest program has a pattern qualifier (/C) that sets
4332       automatic callouts; when it is used, the output indicates how the  pat-
4333       tern  is  being matched. This is useful information when you are trying
4334       to optimize the performance of a particular pattern.
4335
4336
4337MISSING CALLOUTS
4338
4339       You should be aware that, because of optimizations in the way PCRE com-
4340       piles and matches patterns, callouts sometimes do not happen exactly as
4341       you might expect.
4342
4343       At compile time, PCRE "auto-possessifies" repeated items when it  knows
4344       that  what follows cannot be part of the repeat. For example, a+[bc] is
4345       compiled as if it were a++[bc]. The pcretest output when  this  pattern
4346       is  anchored  and  then  applied  with automatic callouts to the string
4347       "aaaa" is:
4348
4349         --->aaaa
4350          +0 ^        ^
4351          +1 ^        a+
4352          +3 ^   ^    [bc]
4353         No match
4354
4355       This indicates that when matching [bc] fails, there is no  backtracking
4356       into  a+  and  therefore the callouts that would be taken for the back-
4357       tracks do not occur.  You can disable the  auto-possessify  feature  by
4358       passing PCRE_NO_AUTO_POSSESS to pcre_compile(), or starting the pattern
4359       with (*NO_AUTO_POSSESS). If this is done  in  pcretest  (using  the  /O
4360       qualifier), the output changes to this:
4361
4362         --->aaaa
4363          +0 ^        ^
4364          +1 ^        a+
4365          +3 ^   ^    [bc]
4366          +3 ^  ^     [bc]
4367          +3 ^ ^      [bc]
4368          +3 ^^       [bc]
4369         No match
4370
4371       This time, when matching [bc] fails, the matcher backtracks into a+ and
4372       tries again, repeatedly, until a+ itself fails.
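
       The difference between the two traces can be reproduced with  a  toy
       model.  The sketch below is not PCRE's matcher: count_class_attempts()
       is a hypothetical helper that hard-codes the anchored pattern  a+[bc]
       and  counts how often the [bc] item is tried, which corresponds to the
       number of +3 callout lines in the pcretest output.

```c
#include <stddef.h>

/* Toy model (an assumption for illustration, not PCRE's code) of the
 * anchored pattern a+[bc]. Returns how many times the [bc] item is tried
 * against subject. possessive != 0 models a++[bc]; possessive == 0 models
 * a+[bc] with auto-possessification disabled. */
int count_class_attempts(const char *subject, int possessive)
{
    size_t run = 0;      /* number of 'a' characters currently held by a+ */
    int attempts = 0;    /* how often [bc] has been tried */

    /* a+ is greedy: take every leading 'a'. */
    while (subject[run] == 'a')
        run++;
    if (run == 0)
        return 0;        /* a+ needs at least one 'a'; [bc] is never reached */

    for (;;) {
        char next = subject[run];
        attempts++;      /* a callout placed before [bc] would fire here */
        if (next == 'b' || next == 'c')
            return attempts;   /* overall match */
        if (possessive || run == 1)
            return attempts;   /* nothing can be given back: no match */
        run--;           /* backtrack: a+ gives up one 'a' and [bc] is retried */
    }
}
```

       For  the  subject  "aaaa"  the possessive variant tries [bc] once and
       the non-possessive variant tries it four times, in line with the  two
       traces above.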
4373
4374       Other optimizations that provide fast "no match"  results  also  affect
4375       callouts.  For example, if the pattern is
4376
4377         ab(?C4)cd
4378
4379       PCRE knows that any matching string must contain the letter "d". If the
4380       subject string is "abyz", the lack of "d" means that  matching  doesn't
4381       ever  start,  and  the  callout is never reached. However, with "abyd",
4382       though the result is still no match, the callout is obeyed.
4383
4384       If the pattern is studied, PCRE knows the minimum length of a  matching
4385       string,  and will immediately give a "no match" return without actually
4386       running a match if the subject is not long enough, or,  for  unanchored
4387       patterns, if it has been scanned far enough.
4388
4389       You  can disable these optimizations by passing the PCRE_NO_START_OPTI-
4390       MIZE option to the matching function, or by starting the  pattern  with
4391       (*NO_START_OPT).  This slows down the matching process, but does ensure
4392       that callouts such as the example above are obeyed.
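
       The  two  start-up checks described in this section can be pictured as
       a cheap test made before any match attempt. The following is a simpli-
       fied  sketch  of  the idea only, not PCRE's implementation; the helper
       name match_attempt_needed() and its parameters are assumptions,  and
       the minimum length is known to PCRE only when the pattern has been
       studied.

```c
#include <string.h>

/* Illustration only: a simplified stand-in for the two start-up checks.
 * Returns 1 if a match attempt would have to be run, 0 if "no match" can
 * be reported without running one (so that no callout is reached). */
int match_attempt_needed(const char *subject, size_t subject_length,
                         char required_char, size_t min_length)
{
    if (subject_length < min_length)
        return 0;   /* subject is shorter than any possible match */
    if (memchr(subject, required_char, subject_length) == NULL)
        return 0;   /* the required character is absent */
    return 1;       /* a real match attempt (with callouts) is needed */
}
```

       For the pattern ab(?C4)cd, with required character "d" and  a  minimum
       length  of  4,  the  subject "abyz" is rejected by this test, whereas
       "abyd" passes it and goes on to a real match attempt.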
4393
4394
4395THE CALLOUT INTERFACE
4396
4397       During matching, when PCRE reaches a callout point, the external  func-
4398       tion defined by pcre_callout or pcre[16|32]_callout is called (if it is
4399       set). This applies to both normal and DFA matching. The  only  argument
4400       to   the   callout   function   is  a  pointer  to  a  pcre_callout  or
4401       pcre[16|32]_callout  block.  These  structures  contain  the  following
4402       fields:
4403
4404         int           version;
4405         int           callout_number;
4406         int          *offset_vector;
4407         const char   *subject;           (8-bit version)
4408         PCRE_SPTR16   subject;           (16-bit version)
4409         PCRE_SPTR32   subject;           (32-bit version)
4410         int           subject_length;
4411         int           start_match;
4412         int           current_position;
4413         int           capture_top;
4414         int           capture_last;
4415         void         *callout_data;
4416         int           pattern_position;
4417         int           next_item_length;
4418         const unsigned char *mark;       (8-bit version)
4419         const PCRE_UCHAR16  *mark;       (16-bit version)
4420         const PCRE_UCHAR32  *mark;       (32-bit version)
4421
4422       The  version  field  is an integer containing the version number of the
4423       block format. The initial version was 0; the current version is 2.  The
4424       version  number  will  change  again in future if additional fields are
4425       added, but the intention is never to remove any of the existing fields.
4426
4427       The callout_number field contains the number of the  callout,  as  com-
4428       piled  into  the pattern (that is, the number after ?C for manual call-
4429       outs, and 255 for automatically generated callouts).
4430
4431       The offset_vector field is a pointer to the vector of offsets that  was
4432       passed  by  the  caller  to  the matching function. When pcre_exec() or
4433       pcre[16|32]_exec() is used, the contents can be inspected, in order  to
4434       extract  substrings  that  have been matched so far, in the same way as
4435       for extracting substrings after a match  has  completed.  For  the  DFA
4436       matching functions, this field is not useful.
4437
4438       The subject and subject_length fields contain copies of the values that
4439       were passed to the matching function.
4440
4441       The start_match field normally contains the offset within  the  subject
4442       at  which  the  current  match  attempt started. However, if the escape
4443       sequence \K has been encountered, this value is changed to reflect  the
4444       modified  starting  point.  If the pattern is not anchored, the callout
4445       function may be called several times from the same point in the pattern
4446       for different starting points in the subject.
4447
4448       The  current_position  field  contains the offset within the subject of
4449       the current match pointer.
4450
4451       When  pcre_exec()  or  pcre[16|32]_exec()  is  used,  the  capture_top
4452       field  contains  one  more than the number of the highest numbered cap-
4453       tured substring so far. If no substrings have been captured, the  value
4454       of  capture_top  is one. This is always the case when the DFA functions
4455       are used, because they do not support captured substrings.
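
       As a sketch of how offset_vector and capture_top can be used  together
       in  a  callout  from pcre_exec() in the 8-bit library, the hypothetical
       helper below copies one substring captured so far. It takes  the  rel-
       evant  fields  as  plain  arguments so that it compiles without pcre.h.
       Treating a negative offset as "unset" is a defensive assumption.

```c
#include <string.h>

/* Copies captured substring number n into buf as a C string.
 * Offsets for group n are at offset_vector[2*n] and offset_vector[2*n+1].
 * Returns the substring length, or -1 if group n has not been captured
 * so far, is unset, or does not fit into buf. */
int copy_capture_so_far(const char *subject, const int *offset_vector,
                        int capture_top, int n, char *buf, size_t bufsize)
{
    int start, end;
    size_t length;

    if (n < 1 || n >= capture_top)
        return -1;                 /* group n is beyond the highest captured */
    start = offset_vector[2 * n];
    end = offset_vector[2 * n + 1];
    if (start < 0 || end < start)
        return -1;                 /* defensive: treat as unset */
    length = (size_t)(end - start);
    if (length + 1 > bufsize)
        return -1;                 /* no room for the text plus terminator */
    memcpy(buf, subject + start, length);
    buf[length] = '\0';
    return (int)length;
}
```

       In a real callout function the arguments would come from  the  subject,
       offset_vector, and capture_top fields of the callout block.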
4456
4457       The capture_last field contains the number of the  most  recently  cap-
4458       tured  substring. However, when a recursion exits, the value reverts to
4459       what it was outside the recursion, as do the  values  of  all  captured
4460       substrings.  If  no  substrings  have  been captured, the value of cap-
4461       ture_last is -1. This is always the case for  the  DFA  matching  func-
4462       tions.
4463
4464       The  callout_data  field  contains a value that is passed to a matching
4465       function specifically so that it can be passed back in callouts. It  is
4466       passed  in  the callout_data field of a pcre_extra or pcre[16|32]_extra
4467       data structure. If no such data was passed, the value  of  callout_data
4468       in  a  callout  block is NULL. There is a description of the pcre_extra
4469       structure in the pcreapi documentation.
4470
4471       The pattern_position field is present from version  1  of  the  callout
4472       structure. It contains the offset to the next item to be matched in the
4473       pattern string.
4474
4475       The next_item_length field is present from version  1  of  the  callout
4476       structure. It contains the length of the next item to be matched in the
4477       pattern string. When the callout immediately  precedes  an  alternation
4478       bar,  a  closing  parenthesis, or the end of the pattern, the length is
4479       zero. When the callout precedes an opening parenthesis, the  length  is
4480       that of the entire subpattern.
4481
4482       The  pattern_position  and next_item_length fields are intended to help
4483       in distinguishing between different automatic callouts, which all  have
4484       the same callout number. However, they are set for all callouts.
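
       One  way  to tell automatic callouts apart is to recover the text of
       the next pattern item from these two fields. The helper below is  a
       minimal  sketch  for  the 8-bit library; copy_next_item() is a hypo-
       thetical name, and the caller is assumed to have kept the  original
       pattern string, because the callout block does not contain it.

```c
#include <string.h>

/* Copies the next pattern item, as described by pattern_position and
 * next_item_length, into buf as a C string. Returns the item length, or
 * -1 if the values do not fit the pattern or the buffer. A length of zero
 * (callout before "|", ")", or the end of the pattern) gives "". */
int copy_next_item(const char *pattern, int pattern_position,
                   int next_item_length, char *buf, size_t bufsize)
{
    size_t pattern_length = strlen(pattern);

    if (pattern_position < 0 || next_item_length < 0)
        return -1;
    if ((size_t)pattern_position + (size_t)next_item_length > pattern_length)
        return -1;                 /* item would run past the pattern */
    if ((size_t)next_item_length + 1 > bufsize)
        return -1;                 /* no room for the text plus terminator */
    memcpy(buf, pattern + pattern_position, (size_t)next_item_length);
    buf[next_item_length] = '\0';
    return next_item_length;
}
```

       With the pattern A(\d{2}|--), a position of 1 and a length of 10 select
       the  whole  parenthesized  subpattern, which is the case described above
       for a callout that precedes an opening parenthesis.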
4485
4486       The  mark  field is present from version 2 of the callout structure. In
4487       callouts from pcre_exec() or pcre[16|32]_exec() it contains  a  pointer
4488       to  the  zero-terminated  name  of  the  most  recently passed (*MARK),
4489       (*PRUNE), or (*THEN) item in the match, or NULL if no such  items  have
4490       been  passed.  Instances  of  (*PRUNE) or (*THEN) without a name do not
4491       obliterate a previous (*MARK). In callouts from the DFA matching  func-
4492       tions this field always contains NULL.
4493
4494
4495RETURN VALUES
4496
4497       The  external callout function returns an integer to PCRE. If the value
4498       is zero, matching proceeds as normal. If  the  value  is  greater  than
4499       zero,  matching  fails  at  the current point, but the testing of other
4500       matching possibilities goes ahead, just as if a lookahead assertion had
4501       failed.  If the value is less than zero, the match is  abandoned,  and
4502       the matching function returns the negative value.
4503
4504       Negative  values  should  normally  be   chosen   from   the   set   of
4505       PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
4506       dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is
4507       reserved  for  use  by callout functions; it will never be used by PCRE
4508       itself.
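
       The  three kinds of return value can be sketched as follows. To keep
       the example compilable without pcre.h, it uses a cut-down  stand-in
       for  the callout block and a local copy of the PCRE_ERROR_CALLOUT value
       (-9 in pcre.h); the budget structure and all  sketch_  names  are
       assumptions made for this illustration.

```c
#include <stddef.h>

#define SKETCH_ERROR_CALLOUT (-9)  /* same value as PCRE_ERROR_CALLOUT */

struct sketch_callout_block {      /* stand-in for pcre_callout_block */
    int   callout_number;
    int   current_position;
    void *callout_data;
};

struct sketch_budget {             /* hypothetical data passed by the caller */
    int calls_made;                /* callouts seen so far */
    int max_calls;                 /* abandon the match after this many */
    int max_position;              /* veto matching beyond this offset */
};

int sketch_callout(struct sketch_callout_block *block)
{
    struct sketch_budget *budget =
        (struct sketch_budget *)block->callout_data;

    if (budget == NULL)
        return 0;                      /* no data passed: do not interfere */
    if (++budget->calls_made > budget->max_calls)
        return SKETCH_ERROR_CALLOUT;   /* < 0: abandon the whole match */
    if (block->current_position > budget->max_position)
        return 1;                      /* > 0: fail here, allow backtracking */
    return 0;                          /* 0: matching proceeds as normal */
}
```

       In real code the function would take a pcre_callout_block  pointer,  be
       assigned  to  pcre_callout,  and  receive  its  data  through the call-
       out_data field of a pcre_extra block.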
4509
4510
4511AUTHOR
4512
4513       Philip Hazel
4514       University Computing Service
4515       Cambridge CB2 3QH, England.
4516
4517
4518REVISION
4519
4520       Last updated: 12 November 2013
4521       Copyright (c) 1997-2013 University of Cambridge.
4522------------------------------------------------------------------------------
4523
4524
4525PCRECOMPAT(3)              Library Functions Manual              PCRECOMPAT(3)
4526
4527
4528
4529NAME
4530       PCRE - Perl-compatible regular expressions
4531
4532DIFFERENCES BETWEEN PCRE AND PERL
4533
4534       This  document describes the differences in the ways that PCRE and Perl
4535       handle regular expressions. The differences  described  here  are  with
4536       respect to Perl versions 5.10 and above.
4537
4538       1. PCRE has only a subset of Perl's Unicode support. Details of what it
4539       does have are given in the pcreunicode page.
4540
4541       2. PCRE allows repeat quantifiers only on parenthesized assertions, but
4542       they  do  not mean what you might think. For example, (?!a){3} does not
4543       assert that the next three characters are not "a". It just asserts that
4544       the next character is not "a" three times (in principle: PCRE optimizes
4545       this to run the assertion just once). Perl allows repeat quantifiers on
4546       other assertions such as \b, but these do not seem to have any use.
4547
4548       3.  Capturing  subpatterns  that occur inside negative lookahead asser-
4549       tions are counted, but their entries in the offsets  vector  are  never
4550       set.  Perl sometimes (but not always) sets its numerical variables from
4551       inside negative assertions.
4552
4553       4. Though binary zero characters are supported in the  subject  string,
4554       they are not allowed in a pattern string because it is passed as a nor-
4555       mal C string, terminated by zero. The escape sequence \0 can be used in
4556       the pattern to represent a binary zero.
4557
4558       5.  The  following Perl escape sequences are not supported: \l, \u, \L,
4559       \U, and \N when followed by a character name or Unicode value.  (\N  on
4560       its own, matching a non-newline character, is supported.) In fact these
4561       are implemented by Perl's general string-handling and are not  part  of
4562       its  pattern  matching engine. If any of these are encountered by PCRE,
4563       an error is generated by default. However, if the  PCRE_JAVASCRIPT_COM-
4564       PAT  option  is set, \U and \u are interpreted as JavaScript interprets
4565       them.
4566
4567       6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
4568       is  built  with Unicode character property support. The properties that
4569       can be tested with \p and \P are limited to the general category  prop-
4570       erties  such  as  Lu and Nd, script names such as Greek or Han, and the
4571       derived properties Any and L&. PCRE does  support  the  Cs  (surrogate)
4572       property,  which  Perl  does  not; the Perl documentation says "Because
4573       Perl hides the need for the user to understand the internal representa-
4574       tion  of Unicode characters, there is no need to implement the somewhat
4575       messy concept of surrogates."
4576
4577       7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
4578       ters  in  between  are  treated as literals. This is slightly different
4579       from Perl in that $ and @ are  also  handled  as  literals  inside  the
4580       quotes.  In Perl, they cause variable interpolation (but of course PCRE
4581       does not have variables). Note the following examples:
4582
4583           Pattern            PCRE matches      Perl matches
4584
4585           \Qabc$xyz\E        abc$xyz           abc followed by the
4586                                                  contents of $xyz
4587           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
4588           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
4589
4590       The \Q...\E sequence is recognized both inside  and  outside  character
4591       classes.
4592
4593       8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
4594       constructions. However, there is support for recursive  patterns.  This
4595       is  not  available  in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
4596       "callout" feature allows an external function to be called during  pat-
4597       tern matching. See the pcrecallout documentation for details.
4598
4599       9.  Subpatterns  that  are called as subroutines (whether or not recur-
4600       sively) are always treated as atomic  groups  in  PCRE.  This  is  like
4601       Python,  but  unlike Perl.  Captured values that are set outside a sub-
4602       routine call can be referenced from inside in PCRE, but  not  in  Perl.
4603       There is a discussion that explains these differences in more detail in
4604       the section on recursion differences from Perl in the pcrepattern page.
4605
4606       10. If any of the backtracking control verbs are used in  a  subpattern
4607       that  is  called  as  a  subroutine (whether or not recursively), their
4608       effect is confined to that subpattern; it does not extend to  the  sur-
4609       rounding  pattern.  This is not always the case in Perl. In particular,
4610       if (*THEN) is present in a group that is called as  a  subroutine,  its
4611       action is limited to that group, even if the group does not contain any
4612       | characters. Note that such subpatterns are processed as  anchored  at
4613       the point where they are tested.
4614
4615       11.  If a pattern contains more than one backtracking control verb, the
4616       first one that is backtracked onto acts. For example,  in  the  pattern
4617       A(*COMMIT)B(*PRUNE)C  a  failure in B triggers (*COMMIT), but a failure
4618       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4619       it is the same as PCRE, but there are examples where it differs.
4620
4621       12.  Most  backtracking  verbs in assertions have their normal actions.
4622       They are not confined to the assertion.
4623
4624       13. There are some differences that are concerned with the settings  of
4625       captured  strings  when  part  of  a  pattern is repeated. For example,
4626       matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
4627       unset, but in PCRE it is set to "b".
4628
4629       14.  PCRE's handling of duplicate subpattern numbers and duplicate sub-
4630       pattern names is not as general as Perl's. This is a consequence of the
4631       fact that PCRE works internally just with numbers,  using  an  external
4632       table  to translate between numbers and names. In particular, a pattern
4633       such as (?|(?<a>A)|(?<b>B)), where the two  capturing  parentheses  have
4634       the same number but different names, is not supported,  and  causes  an
4635       error  at compile time. If it were allowed, it would not be possible to
4636       distinguish which parentheses matched, because both names map  to  cap-
4637       turing subpattern number 1. To avoid this confusing situation, an error
4638       is given at compile time.
4639
4640       15. Perl recognizes comments in some places that  PCRE  does  not,  for
4641       example,  between  the  ( and ? at the start of a subpattern. If the /x
4642       modifier is set, Perl allows white space between ( and ?  (though  cur-
4643       rent  Perls  warn that this is deprecated) but PCRE never does, even if
4644       the PCRE_EXTENDED option is set.
4645
4646       16. Perl, when in warning mode, gives warnings  for  character  classes
4647       such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
4648       als. PCRE has no warning features, so it gives an error in these  cases
4649       because they are almost certainly user mistakes.
4650
4651       17.  In  PCRE,  the upper/lower case character properties Lu and Ll are
4652       not affected when case-independent matching is specified. For  example,
4653       \p{Lu} always matches an upper case letter. I think Perl has changed in
4654       this respect; in the release at the time of writing (5.16), \p{Lu}  and
4655       \p{Ll} match all letters, regardless of case, when case independence is
4656       specified.
4657
4658       18. PCRE provides some extensions to the Perl regular expression facil-
4659       ities.   Perl  5.10  includes new features that are not in earlier ver-
4660       sions of Perl, some of which (such as named parentheses) have  been  in
4661       PCRE for some time. This list is with respect to Perl 5.10:
4662
4663       (a)  Although  lookbehind  assertions  in  PCRE must match fixed length
4664       strings, each alternative branch of a lookbehind assertion can match  a
4665       different  length  of  string.  Perl requires them all to have the same
4666       length.
4667
4668       (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $
4669       meta-character matches only at the very end of the string.
4670
4671       (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
4672       cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
4673       ignored.  (Perl can be made to issue a warning.)
4674
4675       (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
4676       fiers is inverted, that is, by default they are not greedy, but if fol-
4677       lowed by a question mark they are.
4678
4679       (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
4680       tried only at the first matching position in the subject string.
4681
4682       (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
4683       and  PCRE_NO_AUTO_CAPTURE  options for pcre_exec() have no Perl equiva-
4684       lents.
4685
4686       (g) The \R escape sequence can be restricted to match only CR,  LF,  or
4687       CRLF by the PCRE_BSR_ANYCRLF option.
4688
4689       (h) The callout facility is PCRE-specific.
4690
4691       (i) The partial matching facility is PCRE-specific.
4692
4693       (j) Patterns compiled by PCRE can be saved and re-used at a later time,
4694       even on different hosts that have the other endianness.  However,  this
4695       does not apply to optimized data created by the just-in-time compiler.
4696
4697       (k)    The    alternative    matching    functions    (pcre_dfa_exec(),
4698       pcre16_dfa_exec() and pcre32_dfa_exec()) match in a  different  way  and
4699       are not Perl-compatible.
4700
4701       (l)  PCRE  recognizes some special sequences such as (*CR) at the start
4702       of a pattern that set overall options that cannot be changed within the
4703       pattern.
4704
4705
4706AUTHOR
4707
4708       Philip Hazel
4709       University Computing Service
4710       Cambridge CB2 3QH, England.
4711
4712
4713REVISION
4714
4715       Last updated: 10 November 2013
4716       Copyright (c) 1997-2013 University of Cambridge.
4717------------------------------------------------------------------------------
4718
4719
4720PCREPATTERN(3)             Library Functions Manual             PCREPATTERN(3)
4721
4722
4723
4724NAME
4725       PCRE - Perl-compatible regular expressions
4726
4727PCRE REGULAR EXPRESSION DETAILS
4728
4729       The  syntax and semantics of the regular expressions that are supported
4730       by PCRE are described in detail below. There is a quick-reference  syn-
4731       tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
4732       semantics as closely as it can. PCRE  also  supports  some  alternative
4733       regular  expression  syntax (which does not conflict with the Perl syn-
4734       tax) in order to provide some compatibility with regular expressions in
4735       Python, .NET, and Oniguruma.
4736
4737       Perl's  regular expressions are described in its own documentation, and
4738       regular expressions in general are covered in a number of  books,  some
4739       of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
4740       Expressions", published by  O'Reilly,  covers  regular  expressions  in
4741       great  detail.  This  description  of  PCRE's  regular  expressions  is
4742       intended as reference material.
4743
4744       This document discusses the patterns that are supported  by  PCRE  when
4745       one   of   its   main   matching   functions,  pcre_exec()  (8-bit)  or
4746       pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has  alternative
4747       matching  functions,  pcre_dfa_exec() and pcre[16|32]_dfa_exec(), which
4748       match using a different algorithm that is not Perl-compatible. Some  of
4749       the  features  discussed  below  are not available when DFA matching is
4750       used. The advantages and disadvantages of  the  alternative  functions,
4751       and  how  they  differ  from the normal functions, are discussed in the
4752       pcrematching page.
4753
4754
4755SPECIAL START-OF-PATTERN ITEMS
4756
4757       A number of options that can be passed to pcre_compile()  can  also  be
4758       set by special items at the start of a pattern. These are not Perl-com-
4759       patible, but are provided to make these options accessible  to  pattern
4760       writers  who are not able to change the program that processes the pat-
4761       tern. Any number of these items  may  appear,  but  they  must  all  be
4762       together right at the start of the pattern string, and the letters must
4763       be in upper case.
4764
4765   UTF support
4766
4767       The original operation of PCRE was on strings of  one-byte  characters.
4768       However,  there  is  now also support for UTF-8 strings in the original
4769       library, an extra library that supports  16-bit  and  UTF-16  character
4770       strings,  and a third library that supports 32-bit and UTF-32 character
4771       strings. To use these features, PCRE must be built to include appropri-
4772       ate  support. When using UTF strings you must either call the compiling
4773       function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option,  or  the
4774       pattern must start with one of these special sequences:
4775
4776         (*UTF8)
4777         (*UTF16)
4778         (*UTF32)
4779         (*UTF)
4780
4781       (*UTF)  is  a  generic  sequence  that  can  be  used  with  any of the
4782       libraries.  Starting a pattern with such a sequence  is  equivalent  to
4783       setting  the  relevant  option.  How setting a UTF mode affects pattern
4784       matching is mentioned in several places below. There is also a  summary
4785       of features in the pcreunicode page.
4786
4787       Some applications that allow their users to supply patterns may wish to
4788       restrict  them  to  non-UTF  data  for   security   reasons.   If   the
4789       PCRE_NEVER_UTF  option  is  set  at  compile  time, (*UTF) etc. are not
4790       allowed, and their appearance causes an error.
4791
4792   Unicode property support
4793
4794       Another special sequence that may appear at the start of a  pattern  is
4795       (*UCP).   This  has  the same effect as setting the PCRE_UCP option: it
4796       causes sequences such as \d and \w to use Unicode properties to  deter-
4797       mine character types, instead of recognizing only characters with codes
4798       less than 128 via a lookup table.
4799
4800   Disabling auto-possessification
4801
4802       If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect  as
4803       setting  the  PCRE_NO_AUTO_POSSESS  option  at compile time. This stops
4804       PCRE from making quantifiers possessive when what follows cannot  match
4805       the  repeated item. For example, by default a+b is treated as a++b. For
4806       more details, see the pcreapi documentation.
4807
4808   Disabling start-up optimizations
4809
4810       If a pattern starts with (*NO_START_OPT), it has  the  same  effect  as
4811       setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
4812       time. This disables several  optimizations  for  quickly  reaching  "no
4813       match" results. For more details, see the pcreapi documentation.
4814
4815   Newline conventions
4816
4817       PCRE  supports five different conventions for indicating line breaks in
4818       strings: a single CR (carriage return) character, a  single  LF  (line-
4819       feed) character, the two-character sequence CRLF, any of the three pre-
4820       ceding, or any Unicode newline sequence. The pcreapi page  has  further
4821       discussion  about newlines, and shows how to set the newline convention
4822       in the options arguments for the compiling and matching functions.
4823
4824       It is also possible to specify a newline convention by starting a  pat-
4825       tern string with one of the following five sequences:
4826
4827         (*CR)        carriage return
4828         (*LF)        linefeed
4829         (*CRLF)      carriage return, followed by linefeed
4830         (*ANYCRLF)   any of the three above
4831         (*ANY)       all Unicode newline sequences
4832
4833       These override the default and the options given to the compiling func-
4834       tion. For example, on a Unix system where LF  is  the  default  newline
4835       sequence, the pattern
4836
4837         (*CR)a.b
4838
4839       changes the convention to CR. That pattern matches "a\nb" because LF is
4840       no longer a newline. If more than one of these settings is present, the
4841       last one is used.
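
       The "last one is used" rule can be sketched with a small scanner.  This
       is  an illustration rather than PCRE's parser: leading_newline_item()
       is a hypothetical helper, and as a simplification it recognizes only
       the  five newline items, so it stops at any other start-of-pattern item
       such as (*UTF8).

```c
#include <string.h>

/* Returns the last newline item found in the run of newline items at the
 * start of pattern, or NULL if the pattern does not start with one.
 * Simplification: other start-of-pattern items end the scan. */
const char *leading_newline_item(const char *pattern)
{
    static const char *const newline_items[] = {
        "(*CR)", "(*LF)", "(*CRLF)", "(*ANYCRLF)", "(*ANY)"
    };
    const size_t item_count = sizeof newline_items / sizeof newline_items[0];
    const char *selected = NULL;

    for (;;) {
        size_t i;
        int found = 0;

        for (i = 0; i < item_count; i++) {
            size_t len = strlen(newline_items[i]);
            /* The closing ")" is part of each item, so "(*CR)" cannot be
             * mistaken for the start of "(*CRLF)". */
            if (strncmp(pattern, newline_items[i], len) == 0) {
                selected = newline_items[i];   /* later items override */
                pattern += len;
                found = 1;
                break;
            }
        }
        if (!found)
            return selected;
    }
}
```

       For the pattern (*CR)(*LF)a.b the helper returns "(*LF)", the last set-
       ting present.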
4842
4843       The  newline  convention affects where the circumflex and dollar asser-
4844       tions are true. It also affects the interpretation of the dot metachar-
4845       acter when PCRE_DOTALL is not set, and the behaviour of \N. However, it
4846       does not affect what the \R escape sequence matches. By  default,  this
4847       is  any Unicode newline sequence, for Perl compatibility. However, this
4848       can be changed; see the description of \R in the section entitled "New-
4849       line  sequences"  below.  A change of \R setting can be combined with a
4850       change of newline convention.
4851
4852   Setting match and recursion limits
4853
4854       The caller of pcre_exec() can set a limit on the number  of  times  the
4855       internal  match() function is called and on the maximum depth of recur-
4856       sive calls. These facilities are provided to catch runaway matches that
4857       are provoked by patterns with huge matching trees (a typical example is
4858       a pattern with nested unlimited repeats) and to avoid  running  out  of
4859       system  stack  by  too  much  recursion.  When  one  of these limits is
4860       reached, pcre_exec() gives an error return. The limits can also be  set
4861       by items at the start of the pattern of the form
4862
4863         (*LIMIT_MATCH=d)
4864         (*LIMIT_RECURSION=d)
4865
4866       where d is any number of decimal digits. However, the value of the set-
4867       ting must be less than the value set (or defaulted) by  the  caller  of
4868       pcre_exec()  for  it  to  have  any effect. In other words, the pattern
4869       writer can lower the limits set by the programmer, but not raise  them.
4870       If  there  is  more  than one setting of one of these limits, the lower
4871       value is used.
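
       The  rule for combining the caller's limit with settings in the pattern
       amounts to taking a minimum. A minimal sketch,  in  which  the  helper
       name effective_limit() and its parameters are assumptions:

```c
#include <stddef.h>

/* Returns the limit that takes effect: the caller's limit, lowered by any
 * pattern setting that is smaller. Pattern settings that are not smaller
 * than the caller's limit have no effect. */
unsigned long effective_limit(unsigned long caller_limit,
                              const unsigned long *pattern_settings,
                              size_t count)
{
    unsigned long limit = caller_limit;
    size_t i;

    for (i = 0; i < count; i++) {
        if (pattern_settings[i] < limit)
            limit = pattern_settings[i];   /* the pattern may only lower it */
    }
    return limit;
}
```

       With  a  caller limit of 1000, pattern settings of 500 and 2000 give an
       effective limit of 500, and a single setting of 5000 leaves  the  limit
       at 1000.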
4872
4873
4874EBCDIC CHARACTER CODES
4875
4876       PCRE can be compiled to run in an environment that uses EBCDIC  as  its
4877       character code rather than ASCII or Unicode (typically a mainframe sys-
4878       tem). In the sections below, character code values are  ASCII  or  Uni-
4879       code; in an EBCDIC environment these characters may have different code
4880       values, and there are no code points greater than 255.
4881
4882
4883CHARACTERS AND METACHARACTERS
4884
4885       A regular expression is a pattern that is  matched  against  a  subject
4886       string  from  left  to right. Most characters stand for themselves in a
4887       pattern, and match the corresponding characters in the  subject.  As  a
4888       trivial example, the pattern
4889
4890         The quick brown fox
4891
       matches a portion of a subject string that is identical to the pattern.
       When caseless matching is specified (the PCRE_CASELESS option), letters
       are matched independently of case. In a UTF mode, PCRE always
       understands the concept of case for characters whose values are less
       than 128, so caseless matching is always possible for them. For
       characters with higher values, the concept of case is supported only
       if PCRE is compiled with Unicode property support. If you want to use
       caseless matching for characters 128 and above, you must ensure that
       PCRE is compiled with Unicode property support as well as with UTF
       support.
4901
4902       The  power  of  regular  expressions  comes from the ability to include
4903       alternatives and repetitions in the pattern. These are encoded  in  the
4904       pattern by the use of metacharacters, which do not stand for themselves
4905       but instead are interpreted in some special way.
4906
4907       There are two different sets of metacharacters: those that  are  recog-
4908       nized  anywhere in the pattern except within square brackets, and those
4909       that are recognized within square brackets.  Outside  square  brackets,
4910       the metacharacters are as follows:
4911
4912         \      general escape character with several uses
4913         ^      assert start of string (or line, in multiline mode)
4914         $      assert end of string (or line, in multiline mode)
4915         .      match any character except newline (by default)
4916         [      start character class definition
4917         |      start of alternative branch
4918         (      start subpattern
4919         )      end subpattern
4920         ?      extends the meaning of (
4921                also 0 or 1 quantifier
4922                also quantifier minimizer
4923         *      0 or more quantifier
4924         +      1 or more quantifier
4925                also "possessive quantifier"
4926         {      start min/max quantifier
4927
4928       Part  of  a  pattern  that is in square brackets is called a "character
4929       class". In a character class the only metacharacters are:
4930
4931         \      general escape character
4932         ^      negate the class, but only if the first character
4933         -      indicates character range
4934         [      POSIX character class (only if followed by POSIX
4935                  syntax)
4936         ]      terminates the character class
4937
4938       The following sections describe the use of each of the metacharacters.
4939
4940
BACKSLASH
4942
4943       The backslash character has several uses. Firstly, if it is followed by
4944       a character that is not a number or a letter, it takes away any special
4945       meaning that character may have. This use of  backslash  as  an  escape
4946       character applies both inside and outside character classes.
4947
4948       For  example,  if  you want to match a * character, you write \* in the
4949       pattern.  This escaping action applies whether  or  not  the  following
4950       character  would  otherwise be interpreted as a metacharacter, so it is
4951       always safe to precede a non-alphanumeric  with  backslash  to  specify
4952       that  it stands for itself. In particular, if you want to match a back-
4953       slash, you write \\.
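
       Because escaping a non-alphanumeric is always safe, a literal string
       can be turned into a pattern mechanically. The helper below is a
       minimal sketch of that idea; escape_literal is a hypothetical name, not
       a PCRE function, and it only builds the pattern text.

```python
def escape_literal(text):
    """Backslash-escape every ASCII character that is not a letter or digit."""
    escaped = []
    for ch in text:
        if ch.isascii() and not ch.isalnum():
            # Safe whether or not ch would otherwise be a metacharacter.
            escaped.append("\\" + ch)
        else:
            # ASCII letters and digits must NOT be escaped: after a
            # backslash they would take on a special meaning.
            escaped.append(ch)
    return "".join(escaped)
```

       For example, escape_literal("a*b") produces a\*b, and a single
       backslash becomes \\.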
4954
4955       In a UTF mode, only ASCII numbers and letters have any special  meaning
4956       after  a  backslash.  All  other characters (in particular, those whose
4957       codepoints are greater than 127) are treated as literals.
4958
4959       If a pattern is compiled with  the  PCRE_EXTENDED  option,  most  white
4960       space  in the pattern (other than in a character class), and characters
4961       between a # outside a character class and the next newline,  inclusive,
4962       are ignored. An escaping backslash can be used to include a white space
4963       or # character as part of the pattern.
4964
4965       If you want to remove the special meaning from a  sequence  of  charac-
4966       ters,  you can do so by putting them between \Q and \E. This is differ-
4967       ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
4968       sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
4969       tion. Note the following examples:
4970
4971         Pattern            PCRE matches   Perl matches
4972
4973         \Qabc$xyz\E        abc$xyz        abc followed by the
4974                                             contents of $xyz
4975         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
4976         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
4977
4978       The \Q...\E sequence is recognized both inside  and  outside  character
4979       classes.   An  isolated \E that is not preceded by \Q is ignored. If \Q
4980       is not followed by \E later in the pattern, the literal  interpretation
4981       continues  to  the  end  of  the pattern (that is, \E is assumed at the
4982       end). If the isolated \Q is inside a character class,  this  causes  an
4983       error, because the character class is not terminated.
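
       The \Q...\E rules outside a character class can be sketched as a
       rewrite that turns a quoted run into individually escaped characters.
       This models the description above and is not PCRE's implementation;
       expand_quoted is an illustrative name, and the character-class case is
       deliberately left out.

```python
def expand_quoted(pattern):
    """Rewrite \\Q...\\E runs as individually escaped literal characters."""
    out = []
    i = 0
    in_quote = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if in_quote:
            if two == "\\E":
                in_quote = False
                i += 2
            else:
                ch = pattern[i]
                literal = ch.isascii() and not ch.isalnum()
                out.append("\\" + ch if literal else ch)
                i += 1
        elif two == "\\Q":
            in_quote = True
            i += 2
        elif two == "\\E":
            i += 2                      # an isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(two)             # keep any other escape intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    # Reaching the end while still quoting behaves as if \E were present.
    return "".join(out)
```

       Applied to the three table rows above, the result is abc\$xyz,
       abc\\\$xyz, and abc\$xyz respectively, which match the strings shown
       in the "PCRE matches" column.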
4984
4985   Non-printing characters
4986
4987       A second use of backslash provides a way of encoding non-printing char-
4988       acters in patterns in a visible manner. There is no restriction on  the
4989       appearance  of non-printing characters, apart from the binary zero that
4990       terminates a pattern, but when a pattern  is  being  prepared  by  text
4991       editing,  it  is  often  easier  to  use  one  of  the following escape
4992       sequences than the binary character it represents:
4993
4994         \a        alarm, that is, the BEL character (hex 07)
4995         \cx       "control-x", where x is any ASCII character
4996         \e        escape (hex 1B)
4997         \f        form feed (hex 0C)
4998         \n        linefeed (hex 0A)
4999         \r        carriage return (hex 0D)
5000         \t        tab (hex 09)
5001         \0dd      character with octal code 0dd
5002         \ddd      character with octal code ddd, or back reference
5003         \o{ddd..} character with octal code ddd..
5004         \xhh      character with hex code hh
5005         \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
5006         \uhhhh    character with hex code hhhh (JavaScript mode only)
5007
5008       The precise effect of \cx on ASCII characters is as follows: if x is  a
5009       lower  case  letter,  it  is converted to upper case. Then bit 6 of the
5010       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
5011       (A  is  41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
5012       hex 7B (; is 3B). If the data item (byte or 16-bit value) following  \c
5013       has  a  value greater than 127, a compile-time error occurs. This locks
5014       out non-ASCII characters in all modes.
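
       The ASCII rule for \cx is small enough to write out directly. The
       function below is a sketch of the arithmetic described above;
       control_code is an illustrative name.

```python
def control_code(x):
    """Code produced by \\cx for a single ASCII character x."""
    code = ord(x)
    if code > 127:
        # PCRE reports a compile-time error for non-ASCII data here.
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= x <= "z":
        code -= 0x20        # convert lower case to upper case
    return code ^ 0x40      # invert bit 6 (hex 40)
```

       This reproduces the values in the text: control_code("A") is 0x01,
       control_code("z") is 0x1A, control_code("{") is 0x3B, and
       control_code(";") is 0x7B.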
5015
5016       The \c facility was designed for use with ASCII  characters,  but  with
5017       the  extension  to  Unicode it is even less useful than it once was. It
5018       is, however, recognized when PCRE is compiled  in  EBCDIC  mode,  where
5019       data  items  are always bytes. In this mode, all values are valid after
5020       \c. If the next character is a lower case letter, it  is  converted  to
5021       upper  case.  Then  the  0xc0  bits  of the byte are inverted. Thus \cA
5022       becomes hex 01, as in ASCII (A is C1), but because the  EBCDIC  letters
5023       are  disjoint,  \cZ becomes hex 29 (Z is E9), and other characters also
5024       generate different values.
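
       The EBCDIC variant differs only in the case conversion and in the bits
       that are inverted. The sketch below works on byte values. The lower
       case letter ranges are an assumption based on the usual EBCDIC layout
       (the text above gives only A = C1 and Z = E9), and ebcdic_control_code
       is an illustrative name.

```python
# Usual EBCDIC lower case letter ranges (assumed): a-i, j-r, s-z.
_EBCDIC_LOWER = [(0x81, 0x89), (0x91, 0x99), (0xA2, 0xA9)]


def ebcdic_control_code(byte):
    """Byte produced by \\c followed by the given EBCDIC byte value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("EBCDIC data items are always bytes")
    if any(low <= byte <= high for low, high in _EBCDIC_LOWER):
        byte += 0x40        # lower case to upper case, e.g. 81 -> C1
    return byte ^ 0xC0      # invert the 0xc0 bits
```

       With these assumptions, 0xC1 (A) gives 0x01 and 0xE9 (Z) gives 0x29,
       as stated above.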
5025
5026       After \0 up to two further octal digits are read. If  there  are  fewer
5027       than  two  digits,  just  those  that  are  present  are used. Thus the
5028       sequence \0\x\07 specifies two binary zeros followed by a BEL character
5029       (code  value 7). Make sure you supply two digits after the initial zero
5030       if the pattern character that follows is itself an octal digit.
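
       The digit-reading rule for \0 can be sketched as follows. This is a
       model of the description above, with read_zero_escape as an
       illustrative name; it returns the character code and the offset of the
       first character after the escape.

```python
def read_zero_escape(text, pos):
    """Decode backslash-zero at text[pos]; return (code, next_pos)."""
    if text[pos:pos + 2] != "\\0":
        raise ValueError("expected backslash-zero at pos")
    end = pos + 2
    # Read at most two further octal digits; use whatever is present.
    while end < len(text) and end < pos + 4 and text[end] in "01234567":
        end += 1
    return int(text[pos + 1:end], 8), end
```

       This shows why two digits should be supplied when an octal digit
       follows: \07 is read as the single code 7, whereas \0007 is read as a
       binary zero followed by the character "7".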
5031
       The escape \o must be followed by a sequence of octal digits, enclosed
       in braces. An error occurs if this is not the case. This escape is a
       recent addition to Perl; it provides a way of specifying character code
       points as octal numbers greater than 0777, and it also allows octal
       numbers and back references to be unambiguously specified.
5037
5038       For greater clarity and unambiguity, it is best to avoid following \ by
5039       a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
5040       ter numbers, and \g{} to specify back references. The  following  para-
5041       graphs describe the old, ambiguous syntax.
5042
5043       The handling of a backslash followed by a digit other than 0 is compli-
5044       cated, and Perl has changed in recent releases, causing  PCRE  also  to
5045       change. Outside a character class, PCRE reads the digit and any follow-
5046       ing digits as a decimal number. If the number is less  than  8,  or  if
5047       there  have been at least that many previous capturing left parentheses
5048       in the expression, the entire sequence is taken as a back reference.  A
5049       description  of how this works is given later, following the discussion
5050       of parenthesized subpatterns.
5051
5052       Inside a character class, or if  the  decimal  number  following  \  is
5053       greater than 7 and there have not been that many capturing subpatterns,
5054       PCRE handles \8 and \9 as the literal characters "8" and "9", and  oth-
5055       erwise re-reads up to three octal digits following the backslash, using
5056       them to generate a data character.  Any  subsequent  digits  stand  for
5057       themselves. For example:
5058
5059         \040   is another way of writing an ASCII space
5060         \40    is the same, provided there are fewer than 40
5061                   previous capturing subpatterns
5062         \7     is always a back reference
5063         \11    might be a back reference, or another way of
5064                   writing a tab
5065         \011   is always a tab
5066         \0113  is a tab followed by the character "3"
5067         \113   might be a back reference, otherwise the
5068                   character with octal code 113
5069         \377   might be a back reference, otherwise
5070                   the value 255 (decimal)
5071         \81    is either a back reference, or the two
5072                   characters "8" and "1"
5073
5074       Note  that octal values of 100 or greater that are specified using this
5075       syntax must not be introduced by a leading zero, because no  more  than
5076       three octal digits are ever read.
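
       The decision procedure for a backslash followed by a non-zero digit
       can be sketched as below. It follows the rules as described in the
       preceding paragraphs and is not PCRE's code; the function name and the
       result tuples are illustrative assumptions, and limits on the
       resulting character value are ignored.

```python
def interpret_backslash_digits(digits, previous_groups, in_class=False):
    """Classify the run of decimal digits after a backslash.

    digits is the digit run (first digit 1-9); previous_groups is the
    number of capturing left parentheses seen so far. Returns a tuple
    (kind, value, remaining_digits).
    """
    number = int(digits)
    if not in_class and (number < 8 or previous_groups >= number):
        return ("backref", number, "")
    if digits[0] in "89":
        # \8 and \9 are the literal characters; later digits stand alone.
        return ("literal", digits[0], digits[1:])
    end = 0
    while end < 3 and end < len(digits) and digits[end] in "01234567":
        end += 1                    # re-read up to three octal digits
    return ("char", int(digits[:end], 8), digits[end:])
```

       With three previous capturing groups,
       interpret_backslash_digits("40", 3) yields ("char", 32, ""), the ASCII
       space from the table above, while "81" yields the literal "8" with
       "1" left over.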
5077
5078       By  default, after \x that is not followed by {, from zero to two hexa-
5079       decimal digits are read (letters can be in upper or  lower  case).  Any
5080       number of hexadecimal digits may appear between \x{ and }. If a charac-
5081       ter other than a hexadecimal digit appears between \x{  and  },  or  if
5082       there is no terminating }, an error occurs.
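
       The default (non-JavaScript) reading of \x might be modelled like
       this. It is a sketch of the rules above, with read_hex_escape as an
       illustrative name. Rejecting empty braces is an assumption, since the
       text does not cover that case.

```python
_HEX = set("0123456789abcdefABCDEF")


def read_hex_escape(text, pos):
    """Decode backslash-x at text[pos]; return (code, next_pos)."""
    if text[pos:pos + 2] != "\\x":
        raise ValueError("expected backslash-x at pos")
    start = pos + 2
    if text[start:start + 1] == "{":
        close = text.find("}", start)
        if close == -1:
            raise ValueError("no terminating }")
        body = text[start + 1:close]
        # Rejecting empty braces is an assumption (see the lead-in).
        if not body or any(ch not in _HEX for ch in body):
            raise ValueError("non-hexadecimal content between \\x{ and }")
        return int(body, 16), close + 1
    end = start
    while end < len(text) and end < start + 2 and text[end] in _HEX:
        end += 1
    # With no digits at all, the escape stands for code 0.
    return (int(text[start:end], 16) if end > start else 0), end
```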
5083
       If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
       is as just described only when it is followed by two hexadecimal
       digits. Otherwise, it matches a literal "x" character. In JavaScript
       mode, support for code points greater than 255 is provided by \u, which
       must be followed by four hexadecimal digits; otherwise it matches a
       literal "u" character.
5090
5091       Characters whose value is less than 256 can be defined by either of the
5092       two  syntaxes for \x (or by \u in JavaScript mode). There is no differ-
5093       ence in the way they are handled. For example, \xdc is exactly the same
5094       as \x{dc} (or \u00dc in JavaScript mode).
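
       In JavaScript mode the two escapes follow the same all-or-nothing
       pattern, which the sketch below makes explicit. read_js_escape is an
       illustrative name; the function models the description above rather
       than PCRE's code.

```python
_HEX_DIGITS = set("0123456789abcdefABCDEF")


def read_js_escape(text, pos):
    """Decode backslash-x or backslash-u at text[pos] in JavaScript mode.

    Returns (code, next_pos). If the required number of hexadecimal
    digits is not present, the escape is the literal letter itself.
    """
    letter = text[pos + 1]
    width = {"x": 2, "u": 4}[letter]
    digits = text[pos + 2:pos + 2 + width]
    if len(digits) == width and all(ch in _HEX_DIGITS for ch in digits):
        return int(digits, 16), pos + 2 + width
    return ord(letter), pos + 2
```

       Both read_js_escape(r"\xdc", 0) and read_js_escape(r"\u00dc", 0)
       return the code 0xDC, while \u followed by fewer than four
       hexadecimal digits is just the letter "u".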
5095
5096   Constraints on character values
5097
5098       Characters  that  are  specified using octal or hexadecimal numbers are
5099       limited to certain values, as follows:
5100
         8-bit non-UTF mode    less than 0x100
         8-bit UTF-8 mode      no greater than 0x10ffff and a valid codepoint
         16-bit non-UTF mode   less than 0x10000
         16-bit UTF-16 mode    no greater than 0x10ffff and a valid codepoint
         32-bit non-UTF mode   less than 0x100000000
         32-bit UTF-32 mode    no greater than 0x10ffff and a valid codepoint
5107
5108       Invalid Unicode codepoints are the range  0xd800  to  0xdfff  (the  so-
5109       called "surrogate" codepoints), and 0xffef.
5110
5111   Escape sequences in character classes
5112
5113       All the sequences that define a single character value can be used both
5114       inside and outside character classes. In addition, inside  a  character
5115       class, \b is interpreted as the backspace character (hex 08).
5116
5117       \N  is not allowed in a character class. \B, \R, and \X are not special
5118       inside a character class. Like  other  unrecognized  escape  sequences,
5119       they  are  treated  as  the  literal  characters  "B",  "R", and "X" by
5120       default, but cause an error if the PCRE_EXTRA option is set. Outside  a
5121       character class, these sequences have different meanings.
5122
5123   Unsupported escape sequences
5124
5125       In  Perl, the sequences \l, \L, \u, and \U are recognized by its string
5126       handler and used  to  modify  the  case  of  following  characters.  By
5127       default,  PCRE does not support these escape sequences. However, if the
5128       PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U"  character,  and
5129       \u can be used to define a character by code point, as described in the
5130       previous section.
5131
5132   Absolute and relative back references
5133
5134       The sequence \g followed by an unsigned or a negative  number,  option-
5135       ally  enclosed  in braces, is an absolute or relative back reference. A
5136       named back reference can be coded as \g{name}. Back references are dis-
5137       cussed later, following the discussion of parenthesized subpatterns.
5138
5139   Absolute and relative subroutine calls
5140
5141       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
5142       name or a number enclosed either in angle brackets or single quotes, is
5143       an  alternative  syntax for referencing a subpattern as a "subroutine".
5144       Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
5145       \g<...>  (Oniguruma  syntax)  are  not synonymous. The former is a back
5146       reference; the latter is a subroutine call.
5147
5148   Generic character types
5149
5150       Another use of backslash is for specifying generic character types:
5151
5152         \d     any decimal digit
5153         \D     any character that is not a decimal digit
5154         \h     any horizontal white space character
5155         \H     any character that is not a horizontal white space character
5156         \s     any white space character
5157         \S     any character that is not a white space character
5158         \v     any vertical white space character
5159         \V     any character that is not a vertical white space character
5160         \w     any "word" character
5161         \W     any "non-word" character
5162
5163       There is also the single sequence \N, which matches a non-newline char-
5164       acter.   This  is the same as the "." metacharacter when PCRE_DOTALL is
5165       not set. Perl also uses \N to match characters by name; PCRE  does  not
5166       support this.
5167
5168       Each  pair of lower and upper case escape sequences partitions the com-
5169       plete set of characters into two disjoint  sets.  Any  given  character
5170       matches  one, and only one, of each pair. The sequences can appear both
5171       inside and outside character classes. They each match one character  of
5172       the  appropriate  type.  If the current matching point is at the end of
5173       the subject string, all of them fail, because there is no character  to
5174       match.
5175
       For compatibility with Perl, \s did not formerly match the VT character
       (code 11), which made it different from the POSIX "space" class.
       However, Perl added VT at release 5.18, and PCRE followed suit at
       release 8.34. The default \s characters are now HT (9), LF (10), VT
       (11), FF (12), CR (13), and space (32), which are defined as white
       space in the "C" locale. This list may vary if locale-specific matching
       is taking place. For example, in some locales the "non-breaking space"
       character (\xA0) is recognized as white space, and in others the VT
       character is not.
5185
5186       A  "word"  character is an underscore or any character that is a letter
5187       or digit.  By default, the definition of letters  and  digits  is  con-
5188       trolled  by PCRE's low-valued character tables, and may vary if locale-
5189       specific matching is taking place (see "Locale support" in the  pcreapi
5190       page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
5191       systems, or "french" in Windows, some character codes greater than  127
5192       are  used  for  accented letters, and these are then matched by \w. The
5193       use of locales with Unicode is discouraged.
5194
5195       By default, characters whose code points are  greater  than  127  never
5196       match \d, \s, or \w, and always match \D, \S, and \W, although this may
5197       vary for characters in the range 128-255 when locale-specific  matching
5198       is  happening.   These  escape sequences retain their original meanings
5199       from before Unicode support was available, mainly for  efficiency  rea-
5200       sons.  If  PCRE  is  compiled  with  Unicode  property support, and the
5201       PCRE_UCP option is set, the behaviour is changed so that Unicode  prop-
5202       erties are used to determine character types, as follows:
5203
5204         \d  any character that matches \p{Nd} (decimal digit)
5205         \s  any character that matches \p{Z} or \h or \v
5206         \w  any character that matches \p{L} or \p{N}, plus underscore
5207
       The upper case escapes match the inverse sets of characters. Note that
       \d matches only decimal digits, whereas \w matches any Unicode digit,
       as well as any Unicode letter, and underscore. Note also that PCRE_UCP
       affects \b and \B, because they are defined in terms of \w and \W.
       Matching these sequences is noticeably slower when PCRE_UCP is set.
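
       The difference between \d and \w under PCRE_UCP can be illustrated
       with Python's unicodedata module, which exposes the same general
       category codes. This is only a model: ucp_digit and ucp_word are
       illustrative names, and Python's Unicode tables may be a different
       version from those built into a given PCRE release.

```python
import unicodedata


def ucp_digit(ch):
    """Model of \\d under PCRE_UCP: any character with the Nd property."""
    return unicodedata.category(ch) == "Nd"


def ucp_word(ch):
    """Model of \\w under PCRE_UCP: any letter or number, plus underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"
```

       For example, U+00BD (vulgar fraction one half) has the No property,
       so in this model it is matched by \w but not by \d, whereas U+0663
       (Arabic-Indic digit three) is matched by both.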
5213
5214       The  sequences  \h, \H, \v, and \V are features that were added to Perl
5215       at release 5.10. In contrast to the other sequences, which  match  only
5216       ASCII  characters  by  default,  these always match certain high-valued
5217       code points, whether or not PCRE_UCP is set. The horizontal space char-
5218       acters are:
5219
5220         U+0009     Horizontal tab (HT)
5221         U+0020     Space
5222         U+00A0     Non-break space
5223         U+1680     Ogham space mark
5224         U+180E     Mongolian vowel separator
5225         U+2000     En quad
5226         U+2001     Em quad
5227         U+2002     En space
5228         U+2003     Em space
5229         U+2004     Three-per-em space
5230         U+2005     Four-per-em space
5231         U+2006     Six-per-em space
5232         U+2007     Figure space
5233         U+2008     Punctuation space
5234         U+2009     Thin space
5235         U+200A     Hair space
5236         U+202F     Narrow no-break space
5237         U+205F     Medium mathematical space
5238         U+3000     Ideographic space
5239
5240       The vertical space characters are:
5241
5242         U+000A     Linefeed (LF)
5243         U+000B     Vertical tab (VT)
5244         U+000C     Form feed (FF)
5245         U+000D     Carriage return (CR)
5246         U+0085     Next line (NEL)
5247         U+2028     Line separator
5248         U+2029     Paragraph separator
5249
5250       In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
5251       256 are relevant.
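
       The two lists translate directly into lookup sets. The sketch below
       also shows which members remain relevant in 8-bit non-UTF-8 mode; the
       names HSPACE, VSPACE, and relevant are illustrative.

```python
# Code points copied from the two lists above.
HSPACE = {0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
          0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006,
          0x2007, 0x2008, 0x2009, 0x200A, 0x202F, 0x205F, 0x3000}
VSPACE = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}


def relevant(code_points, eight_bit_non_utf=False):
    """Sorted members of a set that can occur in the given mode."""
    limit = 0x100 if eight_bit_non_utf else 0x110000
    return sorted(cp for cp in code_points if cp < limit)
```

       In 8-bit non-UTF-8 mode this leaves HT, space, and non-break space
       for \h, and LF, VT, FF, CR, and NEL for \v.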
5252
5253   Newline sequences
5254
5255       Outside a character class, by default, the escape sequence  \R  matches
5256       any  Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
5257       to the following:
5258
5259         (?>\r\n|\n|\x0b|\f|\r|\x85)
5260
5261       This is an example of an "atomic group", details  of  which  are  given
5262       below.  This particular group matches either the two-character sequence
5263       CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,
5264       U+000A),  VT  (vertical  tab, U+000B), FF (form feed, U+000C), CR (car-
5265       riage return, U+000D), or NEL (next line,  U+0085).  The  two-character
5266       sequence is treated as a single unit that cannot be split.
5267
5268       In  other modes, two additional characters whose codepoints are greater
5269       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
5270       rator,  U+2029).   Unicode character property support is not needed for
5271       these characters to be recognized.
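
       The default behaviour of \R can be modelled without a regular
       expression engine. The function below is a sketch of the description
       above, with newline_sequence_length as an illustrative name; it
       reports how many characters \R would consume at a given position.

```python
_SINGLE_8BIT = {"\n", "\x0b", "\f", "\r", "\x85"}
_SINGLE_WIDE = _SINGLE_8BIT | {"\u2028", "\u2029"}


def newline_sequence_length(subject, pos, eight_bit_non_utf=False):
    """Length of the newline sequence matched at pos, or 0 if none."""
    if subject.startswith("\r\n", pos):
        # CRLF is tried first and, being in an atomic group, is never
        # split into a lone CR on backtracking.
        return 2
    singles = _SINGLE_8BIT if eight_bit_non_utf else _SINGLE_WIDE
    return 1 if subject[pos:pos + 1] in singles else 0
```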
5272
       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set of Unicode line endings) by setting the option
       PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
       (BSR is an abbreviation for "backslash R".) This can be made the default
       when PCRE is built; if this is the case, the other behaviour can be
       requested via the PCRE_BSR_UNICODE option. It is also possible to
       specify these settings by starting a pattern string with one of the
       following sequences:
5281
5282         (*BSR_ANYCRLF)   CR, LF, or CRLF only
5283         (*BSR_UNICODE)   any Unicode newline sequence
5284
5285       These override the default and the options given to the compiling func-
5286       tion, but they can themselves be  overridden  by  options  given  to  a
5287       matching  function.  Note  that  these  special settings, which are not
5288       Perl-compatible, are recognized only at the very start  of  a  pattern,
5289       and  that  they  must  be  in  upper  case. If more than one of them is
5290       present, the last one is used. They can be combined with  a  change  of
5291       newline convention; for example, a pattern can start with:
5292
5293         (*ANY)(*BSR_ANYCRLF)
5294
5295       They  can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF)
5296       or (*UCP) special sequences. Inside a character class, \R is treated as
5297       an  unrecognized  escape  sequence,  and  so  matches the letter "R" by
5298       default, but causes an error if PCRE_EXTRA is set.
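
       The order of precedence among the build default, the compile-time
       option, the start-of-pattern items, and the match-time option can be
       sketched as below. This is a model of the description above only:
       resolve_bsr and its string values "unicode" and "anycrlf" are
       illustrative, and the scan accepts any upper case (*NAME) item at the
       start of the pattern, which is a simplification.

```python
import re

# A leading special item such as (*ANY), (*UTF8) or (*BSR_ANYCRLF).
_LEADING_ITEM = re.compile(r"\(\*([A-Z0-9_]+)(?:=[0-9]+)?\)")


def resolve_bsr(pattern, build_default="unicode",
                compile_option=None, match_option=None):
    """Return "unicode" or "anycrlf" for the \\R behaviour in effect."""
    setting = compile_option or build_default
    pos = 0
    while True:
        item = _LEADING_ITEM.match(pattern, pos)
        if item is None:
            break                       # only the very start is scanned
        if item.group(1) == "BSR_ANYCRLF":
            setting = "anycrlf"         # the last such item wins
        elif item.group(1) == "BSR_UNICODE":
            setting = "unicode"
        pos = item.end()
    # An option given to the matching function overrides everything else.
    return match_option or setting
```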
5299
5300   Unicode character properties
5301
5302       When PCRE is built with Unicode character property support, three addi-
5303       tional  escape sequences that match characters with specific properties
5304       are available.  When in 8-bit non-UTF-8 mode, these  sequences  are  of
5305       course  limited  to  testing  characters whose codepoints are less than
5306       256, but they do work in this mode.  The extra escape sequences are:
5307
5308         \p{xx}   a character with the xx property
5309         \P{xx}   a character without the xx property
5310         \X       a Unicode extended grapheme cluster
5311
5312       The property names represented by xx above are limited to  the  Unicode
5313       script names, the general category properties, "Any", which matches any
5314       character  (including  newline),  and  some  special  PCRE   properties
5315       (described  in the next section).  Other Perl properties such as "InMu-
5316       sicalSymbols" are not currently supported by PCRE.  Note  that  \P{Any}
5317       does not match any characters, so always causes a match failure.
5318
5319       Sets of Unicode characters are defined as belonging to certain scripts.
5320       A character from one of these sets can be matched using a script  name.
5321       For example:
5322
5323         \p{Greek}
5324         \P{Han}
5325
5326       Those  that are not part of an identified script are lumped together as
5327       "Common". The current list of scripts is:
5328
5329       Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak,  Bengali,
5330       Bopomofo,  Brahmi,  Braille, Buginese, Buhid, Canadian_Aboriginal, Car-
5331       ian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cunei-
5332       form, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hiero-
5333       glyphs,  Elbasan,  Ethiopic,  Georgian,  Glagolitic,  Gothic,  Grantha,
5334       Greek,  Gujarati,  Gurmukhi,  Han,  Hangul,  Hanunoo, Hebrew, Hiragana,
5335       Imperial_Aramaic,    Inherited,     Inscriptional_Pahlavi,     Inscrip-
5336       tional_Parthian,   Javanese,   Kaithi,   Kannada,  Katakana,  Kayah_Li,
5337       Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha,  Limbu,  Lin-
5338       ear_A,  Linear_B,  Lisu,  Lycian, Lydian, Mahajani, Malayalam, Mandaic,
5339       Manichaean,     Meetei_Mayek,     Mende_Kikakui,      Meroitic_Cursive,
5340       Meroitic_Hieroglyphs,  Miao,  Modi, Mongolian, Mro, Myanmar, Nabataean,
5341       New_Tai_Lue,  Nko,  Ogham,  Ol_Chiki,  Old_Italic,   Old_North_Arabian,
5342       Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya,
5343       Pahawh_Hmong,    Palmyrene,    Pau_Cin_Hau,    Phags_Pa,    Phoenician,
5344       Psalter_Pahlavi,  Rejang,  Runic,  Samaritan, Saurashtra, Sharada, Sha-
5345       vian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri,  Syriac,
5346       Tagalog,  Tagbanwa,  Tai_Le,  Tai_Tham, Tai_Viet, Takri, Tamil, Telugu,
5347       Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic,  Vai,  Warang_Citi,
5348       Yi.
5349
5350       Each character has exactly one Unicode general category property, spec-
5351       ified by a two-letter abbreviation. For compatibility with Perl,  nega-
5352       tion  can  be  specified  by including a circumflex between the opening
5353       brace and the property name.  For  example,  \p{^Lu}  is  the  same  as
5354       \P{Lu}.
5355
5356       If only one letter is specified with \p or \P, it includes all the gen-
5357       eral category properties that start with that letter. In this case,  in
5358       the  absence of negation, the curly brackets in the escape sequence are
5359       optional; these two examples have the same effect:
5360
5361         \p{L}
5362         \pL
5363
5364       The following general category property codes are supported:
5365
5366         C     Other
5367         Cc    Control
5368         Cf    Format
5369         Cn    Unassigned
5370         Co    Private use
5371         Cs    Surrogate
5372
5373         L     Letter
5374         Ll    Lower case letter
5375         Lm    Modifier letter
5376         Lo    Other letter
5377         Lt    Title case letter
5378         Lu    Upper case letter
5379
5380         M     Mark
5381         Mc    Spacing mark
5382         Me    Enclosing mark
5383         Mn    Non-spacing mark
5384
5385         N     Number
5386         Nd    Decimal number
5387         Nl    Letter number
5388         No    Other number
5389
5390         P     Punctuation
5391         Pc    Connector punctuation
5392         Pd    Dash punctuation
5393         Pe    Close punctuation
5394         Pf    Final punctuation
5395         Pi    Initial punctuation
5396         Po    Other punctuation
5397         Ps    Open punctuation
5398
5399         S     Symbol
5400         Sc    Currency symbol
5401         Sk    Modifier symbol
5402         Sm    Mathematical symbol
5403         So    Other symbol
5404
5405         Z     Separator
5406         Zl    Line separator
5407         Zp    Paragraph separator
5408         Zs    Space separator
5409
5410       The special property L& is also supported: it matches a character  that
5411       has  the  Lu,  Ll, or Lt property, in other words, a letter that is not
5412       classified as a modifier or "other".
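
       The general category rules (two-letter codes, one-letter groups,
       circumflex negation, "Any", and L&) can be modelled with Python's
       unicodedata module. This is a sketch under the assumption that
       Python's category data agrees with PCRE's for the characters tested;
       has_general_category is an illustrative name, and script names are
       not handled.

```python
import unicodedata


def has_general_category(ch, prop):
    """Model of \\p{prop} for general category properties only."""
    if not prop:
        raise ValueError("empty property name")
    if prop.startswith("^"):
        # \p{^Lu} is the same as \P{Lu}.
        return not has_general_category(ch, prop[1:])
    if prop == "Any":
        return True
    category = unicodedata.category(ch)
    if prop == "L&":
        return category in ("Lu", "Ll", "Lt")
    # A one-letter code covers every category that starts with it.
    return category.startswith(prop)
```

       For example, U+01C5 (a title case letter, Lt) satisfies L& in this
       model, while U+02B0 (a modifier letter, Lm) satisfies L but not L&.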
5413
5414       The Cs (Surrogate) property applies only to  characters  in  the  range
5415       U+D800  to U+DFFF. Such characters are not valid in Unicode strings and
5416       so cannot be tested by PCRE, unless  UTF  validity  checking  has  been
5417       turned    off    (see    the    discussion    of    PCRE_NO_UTF8_CHECK,
5418       PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page).  Perl
5419       does not support the Cs property.
5420
5421       The  long  synonyms  for  property  names  that  Perl supports (such as
5422       \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
5423       any of these properties with "Is".
5424
5425       No character that is in the Unicode table has the Cn (unassigned) prop-
5426       erty.  Instead, this property is assumed for any code point that is not
5427       in the Unicode table.
5428
5429       Specifying  caseless  matching  does not affect these escape sequences.
5430       For example, \p{Lu} always matches only upper  case  letters.  This  is
5431       different from the behaviour of current versions of Perl.
5432
5433       Matching  characters  by Unicode property is not fast, because PCRE has
5434       to do a multistage table lookup in order to find  a  character's  prop-
5435       erty. That is why the traditional escape sequences such as \d and \w do
5436       not use Unicode properties in PCRE by default, though you can make them
5437       do  so  by  setting the PCRE_UCP option or by starting the pattern with
5438       (*UCP).
5439
5440   Extended grapheme clusters
5441
5442       The \X escape matches any number of Unicode  characters  that  form  an
5443       "extended grapheme cluster", and treats the sequence as an atomic group
5444       (see below).  Up to and including release 8.31, PCRE  matched  an  ear-
5445       lier, simpler definition that was equivalent to
5446
5447         (?>\PM\pM*)
5448
5449       That  is,  it matched a character without the "mark" property, followed
5450       by zero or more characters with the "mark"  property.  Characters  with
5451       the  "mark"  property are typically non-spacing accents that affect the
5452       preceding character.
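
       The earlier definition can be sketched without a regular expression
       engine, using Python's unicodedata module to test for the "mark"
       property. legacy_cluster_end is an illustrative name, and the code
       models only the older (?>\PM\pM*) behaviour.

```python
import unicodedata


def legacy_cluster_end(subject, pos):
    """End offset of a match of the older \\X definition at pos, or None."""
    def is_mark(ch):
        return unicodedata.category(ch).startswith("M")

    if pos >= len(subject) or is_mark(subject[pos]):
        return None         # \PM needs one character without "mark"
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1            # \pM* absorbs all following marks
    return end
```

       For the subject "e" followed by U+0301 (combining acute accent) and
       then "a", the first match covers the first two characters and the
       second covers "a" alone.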
5453
5454       This simple definition was extended in Unicode to include more  compli-
5455       cated  kinds of composite character by giving each character a grapheme
5456       breaking property, and creating rules  that  use  these  properties  to
5457       define  the  boundaries  of  extended grapheme clusters. In releases of
5458       PCRE later than 8.31, \X matches one of these clusters.
5459
5460       \X always matches at least one character. Then it  decides  whether  to
5461       add additional characters according to the following rules for ending a
5462       cluster:
5463
5464       1. End at the end of the subject string.
5465
5466       2. Do not end between CR and LF; otherwise end after any control  char-
5467       acter.
5468
       3. Do not break Hangul (a Korean script) syllable sequences. Hangul
       characters are of five types: L, V, T, LV, and LVT. An L character may
       be followed by an L, V, LV, or LVT character; an LV or V character may
       be followed by a V or T character; an LVT or T character may be followed
       only by a T character.
5474
5475       4.  Do not end before extending characters or spacing marks. Characters
5476       with the "mark" property always have  the  "extend"  grapheme  breaking
5477       property.
5478
5479       5. Do not end after prepend characters.
5480
5481       6. Otherwise, end the cluster.
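
           Rule 3 amounts to a small lookup table. The Python sketch below
           transcribes it directly from the text. The names HANGUL_MAY_FOLLOW
           and hangul_keeps_cluster are invented for illustration; PCRE itself
           is written in C and may organize this differently.

```python
# For each Hangul syllable type, the types that may follow it without
# ending the cluster (transcribed from rule 3; names are illustrative).
HANGUL_MAY_FOLLOW = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def hangul_keeps_cluster(prev_type, next_type):
    """Return True if rule 3 forbids a break between the two types."""
    return next_type in HANGUL_MAY_FOLLOW.get(prev_type, set())
```

           For example, hangul_keeps_cluster("L", "LV") is true, whereas
           hangul_keeps_cluster("T", "L") is false, so a cluster may end
           between a T and a following L.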
5482
5483   PCRE's additional properties
5484
5485       As  well  as the standard Unicode properties described above, PCRE sup-
5486       ports four more that make it possible  to  convert  traditional  escape
5487       sequences  such as \w and \s to use Unicode properties. PCRE uses these
5488       non-standard, non-Perl properties internally when PCRE_UCP is set. How-
5489       ever, they may also be used explicitly. These properties are:
5490
5491         Xan   Any alphanumeric character
5492         Xps   Any POSIX space character
5493         Xsp   Any Perl space character
5494         Xwd   Any Perl "word" character
5495
5496       Xan  matches  characters that have either the L (letter) or the N (num-
5497       ber) property. Xps matches the characters tab, linefeed, vertical  tab,
5498       form  feed,  or carriage return, and any other character that has the Z
5499       (separator) property.  Xsp is the same as Xps; it used to exclude  ver-
5500       tical  tab,  for Perl compatibility, but Perl changed, and so PCRE fol-
5501       lowed at release 8.34. Xwd matches the same  characters  as  Xan,  plus
5502       underscore.
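
           The definitions of Xan, Xps, and Xwd can be approximated with
           Python's unicodedata module, as in the sketch below. The function
           names are invented for illustration, and Python's Unicode tables
           are not necessarily the same version as those built into a given
           PCRE release, so results may differ for recently added characters.

```python
import unicodedata


def is_xan(ch):
    # Xan: general category L (letter) or N (number).
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xps(ch):
    # Xps: tab, linefeed, vertical tab, form feed, carriage return,
    # or any character with the Z (separator) property.
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"


def is_xwd(ch):
    # Xwd: everything in Xan, plus underscore.
    return ch == "_" or is_xan(ch)
```

           Under these definitions an underscore satisfies is_xwd but not
           is_xan, and a no-break space (U+00A0) satisfies is_xps because it
           has the Zs property.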
5503
5504       There  is another non-standard property, Xuc, which matches any charac-
5505       ter that can be represented by a Universal Character Name  in  C++  and
5506       other  programming  languages.  These are the characters $, @, ` (grave
5507       accent), and all characters with Unicode code points  greater  than  or
5508       equal  to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
5509       most base (ASCII) characters are excluded. (Universal  Character  Names
5510       are  of  the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
5511       Note that the Xuc property does not match these sequences but the char-
5512       acters that they represent.)
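
           The Xuc rule is simple enough to state as a predicate on code
           points. This Python sketch follows the description above; the name
           is_xuc is invented, and the upper bound of U+10FFFF is an
           assumption based on the Unicode code space rather than something
           stated in the text.

```python
def is_xuc(code_point):
    """Approximate the Xuc property for an integer code point."""
    # The three ASCII characters that qualify: $, @ and grave accent.
    if code_point in (0x24, 0x40, 0x60):
        return True
    # Surrogates are excluded.
    if 0xD800 <= code_point <= 0xDFFF:
        return False
    # Everything else from U+00A0 upwards (assumed limit: U+10FFFF).
    return 0xA0 <= code_point <= 0x10FFFF
```

           With this definition is_xuc(ord("$")) is true and is_xuc(ord("A"))
           is false, which illustrates that most ASCII characters are
           excluded.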
5513
5514   Resetting the match start
5515
5516       The  escape sequence \K causes any previously matched characters not to
5517       be included in the final matched sequence. For example, the pattern:
5518
5519         foo\Kbar
5520
5521       matches "foobar", but reports that it has matched "bar".  This  feature
5522       is  similar  to  a lookbehind assertion (described below).  However, in
5523       this case, the part of the subject before the real match does not  have
5524       to  be of fixed length, as lookbehind assertions do. The use of \K does
5525       not interfere with the setting of captured  substrings.   For  example,
5526       when the pattern
5527
5528         (foo)\Kbar
5529
5530       matches "foobar", the first substring is still set to "foo".
5531
5532       Perl  documents  that  the  use  of  \K  within assertions is "not well
5533       defined". In PCRE, \K is acted upon  when  it  occurs  inside  positive
5534       assertions,  but  is  ignored  in negative assertions. Note that when a
5535       pattern such as (?=ab\K) matches, the reported start of the  match  can
5536       be greater than the end of the match.
5537
5538   Simple assertions
5539
5540       The  final use of backslash is for certain simple assertions. An asser-
5541       tion specifies a condition that has to be met at a particular point  in
5542       a  match, without consuming any characters from the subject string. The
5543       use of subpatterns for more complicated assertions is described  below.
5544       The backslashed assertions are:
5545
5546         \b     matches at a word boundary
5547         \B     matches when not at a word boundary
5548         \A     matches at the start of the subject
5549         \Z     matches at the end of the subject
5550                 also matches before a newline at the end of the subject
5551         \z     matches only at the end of the subject
5552         \G     matches at the first matching position in the subject
5553
5554       Inside  a  character  class, \b has a different meaning; it matches the
5555       backspace character. If any other of  these  assertions  appears  in  a
5556       character  class, by default it matches the corresponding literal char-
5557       acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
5558       PCRE_EXTRA  option is set, an "invalid escape sequence" error is gener-
5559       ated instead.
5560
5561       A word boundary is a position in the subject string where  the  current
5562       character  and  the previous character do not both match \w or \W (i.e.
5563       one matches \w and the other matches \W), or the start or  end  of  the
5564       string  if  the  first or last character matches \w, respectively. In a
5565       UTF mode, the meanings of \w and \W  can  be  changed  by  setting  the
5566       PCRE_UCP  option. When this is done, it also affects \b and \B. Neither
5567       PCRE nor Perl has a separate "start of word" or "end of  word"  metase-
5568       quence.  However,  whatever follows \b normally determines which it is.
5569       For example, the fragment \ba matches "a" at the start of a word.
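
           The way the following or preceding item selects "start of word" or
           "end of word" can be tried out with Python's re module. Python's
           engine is not PCRE and is used here only as a stand-in; for ASCII
           subjects such as this one its \b follows the definition above.

```python
import re

subject = "an apple banana"

# \ba can only match an "a" that begins a word.
starts = re.findall(r"\ba\w*", subject)

# a\b can only match an "a" that ends a word.
ends = [m.start() for m in re.finditer(r"a\b", subject)]
```

           Here starts is ['an', 'apple'] and ends is [14], the offset of the
           final "a" of "banana".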
5570
5571       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
5572       and dollar (described in the next section) in that they only ever match
5573       at the very start and end of the subject string, whatever  options  are
5574       set.  Thus,  they are independent of multiline mode. These three asser-
5575       tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
5576       affect  only the behaviour of the circumflex and dollar metacharacters.
5577       However, if the startoffset argument of pcre_exec() is non-zero,  indi-
5578       cating that matching is to start at a point other than the beginning of
5579       the subject, \A can never match. The difference between \Z  and  \z  is
5580       that \Z matches before a newline at the end of the string as well as at
5581       the very end, whereas \z matches only at the end.
5582
5583       The \G assertion is true only when the current matching position is  at
5584       the  start point of the match, as specified by the startoffset argument
5585       of pcre_exec(). It differs from \A when the  value  of  startoffset  is
5586       non-zero.  By calling pcre_exec() multiple times with appropriate argu-
5587       ments, you can mimic Perl's /g option, and it is in this kind of imple-
5588       mentation where \G can be useful.
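
           The shape of such a loop is sketched below in Python rather than
           with pcre_exec(). This is an analogy, not PCRE code: the match()
           method of a compiled Python pattern takes a starting offset and
           only matches at exactly that offset, which is the role \G plays
           when startoffset is advanced between calls. The function name
           tokens and the token pattern are invented for the example.

```python
import re


def tokens(subject):
    """Collect consecutive tokens, each anchored where the last one ended."""
    token = re.compile(r"\s*(\d+|[a-z]+)")
    found = []
    pos = 0
    while pos < len(subject):
        # Anchored at pos, as a pattern starting with \G would be.
        m = token.match(subject, pos)
        if m is None:
            break
        found.append(m.group(1))
        pos = m.end()
    return found
```

           tokens("12 ab 7") returns ['12', 'ab', '7'], while tokens("12 ?? 7")
           returns ['12'] because the second attempt is anchored at offset 2
           and cannot skip ahead to the "7".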
5589
5590       Note,  however,  that  PCRE's interpretation of \G, as the start of the
5591       current match, is subtly different from Perl's, which defines it as the
5592       end  of  the  previous  match. In Perl, these can be different when the
5593       previously matched string was empty. Because PCRE does just  one  match
5594       at a time, it cannot reproduce this behaviour.
5595
5596       If  all  the alternatives of a pattern begin with \G, the expression is
5597       anchored to the starting match position, and the "anchored" flag is set
5598       in the compiled regular expression.
5599
5600
5601CIRCUMFLEX AND DOLLAR
5602
5603       The  circumflex  and  dollar  metacharacters are zero-width assertions.
5604       That is, they test for a particular condition being true  without  con-
5605       suming any characters from the subject string.
5606
5607       Outside a character class, in the default matching mode, the circumflex
5608       character is an assertion that is true only  if  the  current  matching
5609       point  is  at the start of the subject string. If the startoffset argu-
5610       ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
5611       PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
5612       has an entirely different meaning (see below).
5613
5614       Circumflex need not be the first character of the pattern if  a  number
5615       of  alternatives are involved, but it should be the first thing in each
5616       alternative in which it appears if the pattern is ever  to  match  that
5617       branch.  If all possible alternatives start with a circumflex, that is,
5618       if the pattern is constrained to match only at the start  of  the  sub-
5619       ject,  it  is  said  to be an "anchored" pattern. (There are also other
5620       constructs that can cause a pattern to be anchored.)
5621
5622       The dollar character is an assertion that is true only if  the  current
5623       matching  point  is  at  the  end of the subject string, or immediately
5624       before a newline at the end of the string (by default). Note,  however,
5625       that  it  does  not  actually match the newline. Dollar need not be the
5626       last character of the pattern if a number of alternatives are involved,
5627       but  it should be the last item in any branch in which it appears. Dol-
5628       lar has no special meaning in a character class.
5629
5630       The meaning of dollar can be changed so that it  matches  only  at  the
5631       very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
5632       compile time. This does not affect the \Z assertion.
5633
5634       The meanings of the circumflex and dollar characters are changed if the
5635       PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex
5636       matches immediately after internal newlines as well as at the start  of
5637       the  subject  string.  It  does not match after a newline that ends the
5638       string. A dollar matches before any newlines in the string, as well  as
5639       at  the very end, when PCRE_MULTILINE is set. When newline is specified
5640       as the two-character sequence CRLF, isolated CR and  LF  characters  do
5641       not indicate newlines.
5642
5643       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
5644       (where \n represents a newline) in multiline mode, but  not  otherwise.
5645       Consequently,  patterns  that  are anchored in single line mode because
5646       all branches start with ^ are not anchored in  multiline  mode,  and  a
5647       match  for  circumflex  is  possible  when  the startoffset argument of
5648       pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if
5649       PCRE_MULTILINE is set.
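
           The /^abc$/ example can be reproduced with Python's re module,
           whose MULTILINE flag corresponds to PCRE_MULTILINE for the cases
           shown here. Python is only a stand-in and differs from PCRE in
           other corner cases, so this sketch should not be read as a test of
           PCRE itself.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# In default mode $ still matches before a newline that ends the
# subject, but the newline is not part of the match.
trailing = re.search(r"abc$", "abc\n")
```

           Here single is None, multi matches "abc" at offset 4, and trailing
           spans offsets 0 to 3, leaving the final newline unmatched.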
5650
5651       Note  that  the sequences \A, \Z, and \z can be used to match the start
5652       and end of the subject in both modes, and if all branches of a  pattern
5653       start  with  \A it is always anchored, whether or not PCRE_MULTILINE is
5654       set.
5655
5656
5657FULL STOP (PERIOD, DOT) AND \N
5658
5659       Outside a character class, a dot in the pattern matches any one charac-
5660       ter  in  the subject string except (by default) a character that signi-
5661       fies the end of a line.
5662
5663       When a line ending is defined as a single character, dot never  matches
5664       that  character; when the two-character sequence CRLF is used, dot does
5665       not match CR if it is immediately followed  by  LF,  but  otherwise  it
5666       matches  all characters (including isolated CRs and LFs). When any Uni-
5667       code line endings are being recognized, dot does not match CR or LF  or
5668       any of the other line ending characters.
5669
5670       The  behaviour  of  dot  with regard to newlines can be changed. If the
5671       PCRE_DOTALL option is set, a dot matches  any  one  character,  without
5672       exception. If the two-character sequence CRLF is present in the subject
5673       string, it takes two dots to match it.
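
           A short Python sketch of these rules follows. Python's re module
           recognizes only LF as a line ending, so it behaves like PCRE with
           the newline convention set to LF; its DOTALL flag plays the part
           of PCRE_DOTALL.

```python
import re

# By default the dot stops at a newline.
default = re.match(r"a.*", "ab\ncd").group()

# With DOTALL it matches the newline as well.
dotall = re.match(r"a.*", "ab\ncd", re.DOTALL).group()

# CRLF is two characters, so it takes two dots to match it.
one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```

           default is "ab", dotall is the whole subject, one_dot is None, and
           two_dots is a successful match.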
5674
5675       The handling of dot is entirely independent of the handling of  circum-
5676       flex  and  dollar,  the  only relationship being that they both involve
5677       newlines. Dot has no special meaning in a character class.
5678
5679       The escape sequence \N behaves like  a  dot,  except  that  it  is  not
5680       affected  by  the  PCRE_DOTALL  option.  In other words, it matches any
5681       character except one that signifies the end of a line. Perl  also  uses
5682       \N to match characters by name; PCRE does not support this.
5683
5684
5685MATCHING A SINGLE DATA UNIT
5686
5687       Outside  a character class, the escape sequence \C matches any one data
5688       unit, whether or not a UTF mode is set. In the 8-bit library, one  data
5689       unit  is  one  byte;  in the 16-bit library it is a 16-bit unit; in the
5690       32-bit library it is a 32-bit unit. Unlike a  dot,  \C  always  matches
5691       line-ending  characters.  The  feature  is provided in Perl in order to
5692       match individual bytes in UTF-8 mode, but it is unclear how it can use-
5693       fully  be  used.  Because  \C breaks up characters into individual data
5694       units, matching one unit with \C in a UTF mode means that the  rest  of
5695       the string may start with a malformed UTF character. This has undefined
5696       results, because PCRE assumes that it is dealing with valid UTF strings
5697       (and  by  default  it checks this at the start of processing unless the
5698       PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or  PCRE_NO_UTF32_CHECK  option
5699       is used).
5700
5701       PCRE  does  not  allow \C to appear in lookbehind assertions (described
5702       below) in a UTF mode, because this would make it impossible  to  calcu-
5703       late the length of the lookbehind.
5704
5705       In general, the \C escape sequence is best avoided. However, one way of
5706       using it that avoids the problem of malformed UTF characters is to  use
5707       a  lookahead to check the length of the next character, as in this pat-
5708       tern, which could be used with a UTF-8 string (ignore white  space  and
5709       line breaks):
5710
5711         (?| (?=[\x00-\x7f])(\C) |
5712             (?=[\x80-\x{7ff}])(\C)(\C) |
5713             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
5714             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
5715
5716       A  group  that starts with (?| resets the capturing parentheses numbers
5717       in each alternative (see "Duplicate  Subpattern  Numbers"  below).  The
5718       assertions  at  the start of each branch check the next UTF-8 character
5719       for values whose encoding uses 1, 2, 3, or 4 bytes,  respectively.  The
5720       character's  individual bytes are then captured by the appropriate num-
5721       ber of groups.
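
           The four ranges tested by the lookaheads correspond to the UTF-8
           encoded lengths of the characters. The Python sketch below uses
           the same boundaries and compares them with Python's own encoder;
           the function names are invented, and the code illustrates the
           arithmetic only, not the \C escape itself.

```python
def utf8_unit_count(code_point):
    """Number of bytes in the UTF-8 encoding of a code point."""
    # Same boundaries as the four lookaheads in the pattern.
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def utf8_units(ch):
    """The bytes that the \\C groups of the matching branch would capture."""
    return list(ch.encode("utf-8"))
```

           For instance, utf8_units("\u20ac") is [0xE2, 0x82, 0xAC], three
           bytes, in agreement with utf8_unit_count(0x20AC).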
5722
5723
5724SQUARE BRACKETS AND CHARACTER CLASSES
5725
5726       An opening square bracket introduces a character class, terminated by a
5727       closing square bracket. A closing square bracket on its own is not spe-
5728       cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
5729       a lone closing square bracket causes a compile-time error. If a closing
5730       square bracket is required as a member of the class, it should  be  the
5731       first  data  character  in  the  class (after an initial circumflex, if
5732       present) or escaped with a backslash.
5733
5734       A character class matches a single character in the subject. In  a  UTF
5735       mode,  the  character  may  be  more than one data unit long. A matched
5736       character must be in the set of characters defined by the class, unless
5737       the  first  character in the class definition is a circumflex, in which
5738       case the subject character must not be in the set defined by the class.
5739       If  a  circumflex is actually required as a member of the class, ensure
5740       it is not the first character, or escape it with a backslash.
5741
5742       For example, the character class [aeiou] matches any lower case  vowel,
5743       while  [^aeiou]  matches  any character that is not a lower case vowel.
5744       Note that a circumflex is just a convenient notation for specifying the
5745       characters  that  are in the class by enumerating those that are not. A
5746       class that starts with a circumflex is not an assertion; it still  con-
5747       sumes  a  character  from the subject string, and therefore it fails if
5748       the current pointer is at the end of the string.
5749
5750       In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255
5751       (0xffff)  can be included in a class as a literal string of data units,
5752       or by using the \x{ escaping mechanism.
5753
5754       When caseless matching is set, any letters in a  class  represent  both
5755       their  upper  case  and lower case versions, so for example, a caseless
5756       [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
5757       match  "A", whereas a caseful version would. In a UTF mode, PCRE always
5758       understands the concept of case for characters whose  values  are  less
5759       than  128, so caseless matching is always possible. For characters with
5760       higher values, the concept of case is supported  if  PCRE  is  compiled
5761       with  Unicode  property support, but not otherwise.  If you want to use
5762       caseless matching in a UTF mode for characters 128 and above, you  must
5763       ensure  that  PCRE is compiled with Unicode property support as well as
5764       with UTF support.
5765
5766       Characters that might indicate line breaks are  never  treated  in  any
5767       special  way  when  matching  character  classes,  whatever line-ending
5768       sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and
5769       PCRE_MULTILINE options is used. A class such as [^a] always matches one
5770       of these characters.
5771
5772       The minus (hyphen) character can be used to specify a range of  charac-
5773       ters  in  a  character  class.  For  example,  [d-m] matches any letter
5774       between d and m, inclusive. If a  minus  character  is  required  in  a
5775       class,  it  must  be  escaped  with a backslash or appear in a position
5776       where it cannot be interpreted as indicating a range, typically as  the
5777       first or last character in the class, or immediately after a range. For
5778       example, [b-d-z] matches letters in the range b to d, a hyphen  charac-
5779       ter, or z.
5780
5781       It is not possible to have the literal character "]" as the end charac-
5782       ter of a range. A pattern such as [W-]46] is interpreted as a class  of
5783       two  characters ("W" and "-") followed by a literal string "46]", so it
5784       would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
5785       backslash  it is interpreted as the end of range, so [W-\]46] is inter-
5786       preted as a class containing a range followed by two other  characters.
5787       The  octal or hexadecimal representation of "]" can also be used to end
5788       a range.
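
           The hyphen and closing bracket cases above can be checked with
           Python's re module, which happens to parse these particular
           classes in the same way. It is a stand-in for PCRE, not a
           substitute, and the helper name matches is invented.

```python
import re


def matches(pattern, subject):
    """True if the whole subject is matched by the pattern."""
    return re.fullmatch(pattern, subject) is not None


# [b-d-z]: the range b-d, a literal hyphen, and z.
hyphen_is_literal = matches(r"[b-d-z]", "-")
m_is_excluded = not matches(r"[b-d-z]", "m")

# [W-]46]: a two-character class followed by the literal string "46]".
unescaped = matches(r"[W-]46]", "W46]") and matches(r"[W-]46]", "-46]")

# [W-\]46]: the range W to ], plus "4" and "6", all in one class.
escaped = matches(r"[W-\]46]", "Z") and matches(r"[W-\]46]", "6")
```

           All four results are true. In particular "m" is rejected by
           [b-d-z], which confirms that the second hyphen does not create a
           range from d to z.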
5789
5790       An error is generated if a POSIX character  class  (see  below)  or  an
5791       escape  sequence other than one that defines a single character appears
5792       at a point where a range ending character  is  expected.  For  example,
5793       [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
5794
5795       Ranges  operate in the collating sequence of character values. They can
5796       also  be  used  for  characters  specified  numerically,  for   example
5797       [\000-\037].  Ranges  can include any characters that are valid for the
5798       current mode.
5799
5800       If a range that includes letters is used when caseless matching is set,
5801       it matches the letters in either case. For example, [W-c] is equivalent
5802       to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
5803       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
5804       accented E characters in both cases. In UTF modes,  PCRE  supports  the
5805       concept  of  case for characters with values greater than 127 only when
5806       it is compiled with Unicode property support.
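
           The [W-c] equivalence can be checked character by character. The
           sketch uses Python's re module with its IGNORECASE flag as a
           stand-in for PCRE_CASELESS; for ASCII characters the two agree on
           this class.

```python
import re

caseless = re.compile(r"[W-c]", re.IGNORECASE)

# The characters listed in the equivalent class, plus the other case
# of each letter.
listed = "][\\^_`wxyzabc"
all_listed_match = all(caseless.fullmatch(ch) for ch in listed + "WXYZABC")

# Letters just outside the range, in both cases.
outside = [ch for ch in "dDvV" if caseless.fullmatch(ch)]
```

           all_listed_match is True and outside is an empty list.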
5807
5808       The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
5809       \w, and \W may appear in a character class, and add the characters that
5810       they match to the class. For example, [\dABCDEF] matches any  hexadeci-
5811       mal  digit.  In  UTF modes, the PCRE_UCP option affects the meanings of
5812       \d, \s, \w and their upper case partners, just as  it  does  when  they
5813       appear  outside a character class, as described in the section entitled
5814       "Generic character types" above. The escape sequence \b has a different
5815       meaning  inside  a character class; it matches the backspace character.
5816       The sequences \B, \N, \R, and \X are not  special  inside  a  character
5817       class.  Like  any other unrecognized escape sequences, they are treated
5818       as the literal characters "B", "N", "R", and "X" by default, but  cause
5819       an error if the PCRE_EXTRA option is set.
5820
5821       A  circumflex  can  conveniently  be used with the upper case character
5822       types to specify a more restricted set of characters than the  matching
5823       lower  case  type.  For example, the class [^\W_] matches any letter or
5824       digit, but not underscore, whereas [\w] includes underscore. A positive
5825       character class should be read as "something OR something OR ..." and a
5826       negative class as "NOT something AND NOT something AND NOT ...".
5827
5828       The only metacharacters that are recognized in  character  classes  are
5829       backslash,  hyphen  (only  where  it can be interpreted as specifying a
5830       range), circumflex (only at the start), opening  square  bracket  (only
5831       when  it can be interpreted as introducing a POSIX class name, or for a
5832       special compatibility feature - see the next  two  sections),  and  the
5833       terminating  closing  square  bracket.  However,  escaping  other  non-
5834       alphanumeric characters does no harm.
5835
5836
5837POSIX CHARACTER CLASSES
5838
5839       Perl supports the POSIX notation for character classes. This uses names
5840       enclosed  by  [: and :] within the enclosing square brackets. PCRE also
5841       supports this notation. For example,
5842
5843         [01[:alpha:]%]
5844
5845       matches "0", "1", any alphabetic character, or "%". The supported class
5846       names are:
5847
5848         alnum    letters and digits
5849         alpha    letters
5850         ascii    character codes 0 - 127
5851         blank    space or tab only
5852         cntrl    control characters
5853         digit    decimal digits (same as \d)
5854         graph    printing characters, excluding space
5855         lower    lower case letters
5856         print    printing characters, including space
5857         punct    printing characters, excluding letters and digits and space
5858         space    white space (the same as \s from PCRE 8.34)
5859         upper    upper case letters
5860         word     "word" characters (same as \w)
5861         xdigit   hexadecimal digits
5862
5863       The  default  "space" characters are HT (9), LF (10), VT (11), FF (12),
5864       CR (13), and space (32). If locale-specific matching is  taking  place,
5865       the  list  of  space characters may be different; there may be fewer or
5866       more of them. "Space" used to be different to \s, which did not include
5867       VT, for Perl compatibility.  However, Perl changed at release 5.18, and
5868       PCRE followed at release 8.34.  "Space" and \s now match the  same  set
5869       of characters.
5870
5871       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
5872       from Perl 5.8. Another Perl extension is negation, which  is  indicated
5873       by a ^ character after the colon. For example,
5874
5875         [12[:^digit:]]
5876
5877       matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
5878       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
5879       these are not supported, and an error is given if they are encountered.
5880
5881       By default, characters with values greater than 127 do not match any of
5882       the POSIX character classes. However, if the PCRE_UCP option is  passed
5883       to  pcre_compile(),  some  of  the  classes are changed so that Unicode
5884       character properties are used. This is achieved  by  replacing  certain
5885       POSIX classes by other sequences, as follows:
5886
5887         [:alnum:]  becomes  \p{Xan}
5888         [:alpha:]  becomes  \p{L}
5889         [:blank:]  becomes  \h
5890         [:digit:]  becomes  \p{Nd}
5891         [:lower:]  becomes  \p{Ll}
5892         [:space:]  becomes  \p{Xps}
5893         [:upper:]  becomes  \p{Lu}
5894         [:word:]   becomes  \p{Xwd}
5895
5896       Negated  versions, such as [:^alpha:] use \P instead of \p. Three other
5897       POSIX classes are handled specially in UCP mode:
5898
5899       [:graph:] This matches characters that have glyphs that mark  the  page
5900                 when printed. In Unicode property terms, it matches all char-
5901                 acters with the L, M, N, P, S, or Cf properties, except for:
5902
5903                   U+061C           Arabic Letter Mark
5904                   U+180E           Mongolian Vowel Separator
5905                   U+2066 - U+2069  Various "isolate"s
5906
5907
5908       [:print:] This matches the same  characters  as  [:graph:]  plus  space
5909                 characters  that  are  not controls, that is, characters with
5910                 the Zs property.
5911
5912       [:punct:] This matches all characters that have the Unicode P (punctua-
5913                 tion)  property,  plus those characters whose code points are
5914                 less than 128 that have the S (Symbol) property.
5915
5916       The other POSIX classes are unchanged, and match only  characters  with
5917       code points less than 128.
5918
5919
5920COMPATIBILITY FEATURE FOR WORD BOUNDARIES
5921
5922       In  the POSIX.2 compliant library that was included in 4.4BSD Unix, the
5923       ugly syntax [[:<:]] and [[:>:]] is used for matching  "start  of  word"
5924       and "end of word". PCRE treats these items as follows:
5925
5926         [[:<:]]  is converted to  \b(?=\w)
5927         [[:>:]]  is converted to  \b(?<=\w)
5928
5929       Only these exact character sequences are recognized. A sequence such as
5930       [a[:<:]b] provokes an error for an unrecognized POSIX class name.  This
5931       support  is not compatible with Perl. It is provided to help migrations
5932       from other environments, and is best not used in any new patterns. Note
5933       that  \b matches at the start and the end of a word (see "Simple asser-
5934       tions" above), and in a Perl-style pattern the preceding  or  following
5935       character  normally  shows  which  is  wanted, without the need for the
5936       assertions that are used above in order to give exactly the  POSIX  be-
5937       haviour.
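
           The two replacement sequences are ordinary assertions and can be
           tried in any engine that supports lookaround. The Python sketch
           below lists the offsets at which each one matches; Python's re
           module does not accept the [[:<:]] syntax itself, so only the
           converted forms are shown.

```python
import re

subject = "hi there"

# \b(?=\w): a word boundary with a word character ahead (start of word).
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]

# \b(?<=\w): a word boundary with a word character behind (end of word).
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]
```

           word_starts is [0, 3] and word_ends is [2, 8].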
5938
5939
5940VERTICAL BAR
5941
5942       Vertical  bar characters are used to separate alternative patterns. For
5943       example, the pattern
5944
5945         gilbert|sullivan
5946
5947       matches either "gilbert" or "sullivan". Any number of alternatives  may
5948       appear,  and  an  empty  alternative  is  permitted (matching the empty
5949       string). The matching process tries each alternative in turn, from left
5950       to  right, and the first one that succeeds is used. If the alternatives
5951       are within a subpattern (defined below), "succeeds" means matching  the
5952       rest of the main pattern as well as the alternative in the subpattern.
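
           Both points, left-to-right ordering and the meaning of "succeeds"
           inside a subpattern, show up in a two-line experiment. Python's re
           module is used as a stand-in here; it is a backtracking matcher
           that follows the same ordering rule.

```python
import re

# Alternatives are tried left to right; the first that succeeds is used,
# even though a later one would match more of the subject.
first_wins = re.match(r"gil|gilbert", "gilbert").group()

# Inside a subpattern an alternative only "succeeds" if the rest of the
# pattern also matches, so here the second alternative is the one used.
needs_rest = re.match(r"(gil|gilbert)s", "gilberts").group(1)
```

           first_wins is "gil", whereas needs_rest is "gilbert".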
5953
5954
5955INTERNAL OPTION SETTING
5956
5957       The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
5958       PCRE_EXTENDED options (which are Perl-compatible) can be  changed  from
5959       within  the  pattern  by  a  sequence  of  Perl option letters enclosed
5960       between "(?" and ")".  The option letters are
5961
5962         i  for PCRE_CASELESS
5963         m  for PCRE_MULTILINE
5964         s  for PCRE_DOTALL
5965         x  for PCRE_EXTENDED
5966
5967       For example, (?im) sets caseless, multiline matching. It is also possi-
5968       ble to unset these options by preceding the letter with a hyphen, and a
5969       combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
5970       LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
5971       is also permitted. If a  letter  appears  both  before  and  after  the
5972       hyphen, the option is unset.
5973
5974       The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
5975       can be changed in the same way as the Perl-compatible options by  using
5976       the characters J, U and X respectively.
5977
5978       When  one  of  these  option  changes occurs at top level (that is, not
5979       inside subpattern parentheses), the change applies to the remainder  of
5980       the pattern that follows. If the change is placed right at the start of
5981       a pattern, PCRE extracts it into the global options (and it will there-
5982       fore show up in data extracted by the pcre_fullinfo() function).
5983
5984       An  option  change  within a subpattern (see below for a description of
5985       subpatterns) affects only that part of the subpattern that follows  it,
5986       so
5987
5988         (a(?i)b)c
5989
5990       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
5991       used).  By this means, options can be made to have  different  settings
5992       in  different parts of the pattern. Any changes made in one alternative
5993       do carry on into subsequent branches within the  same  subpattern.  For
5994       example,
5995
5996         (a(?i)b|c)
5997
5998       matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
5999       first branch is abandoned before the option setting.  This  is  because
6000       the  effects  of option settings happen at compile time. There would be
6001       some very weird behaviour otherwise.
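
           The sets of strings matched by these two patterns can be
           illustrated in Python, with one caveat: recent versions of
           Python's re module reject an unscoped (?i) in the middle of a
           pattern. The sketch therefore rewrites each pattern with the
           scoped form (?i:...), covering the same parts of the pattern that
           the PCRE option setting affects. The rewritten patterns and the
           helper name accepted are not from PCRE.

```python
import re

# Stand-in for (a(?i)b)c: only the "b" is matched caselessly.
first = re.compile(r"(a(?i:b))c")

# Stand-in for (a(?i)b|c): the setting also covers the second branch.
second = re.compile(r"(a(?i:b)|(?i:c))")


def accepted(pattern, candidates):
    """The candidates that the pattern matches in full."""
    return [s for s in candidates if pattern.fullmatch(s)]


first_hits = accepted(first, ["abc", "aBc", "Abc", "abC", "ABC"])
second_hits = accepted(second, ["ab", "aB", "c", "C", "Ab", "AB"])
```

           first_hits is ['abc', 'aBc'] and second_hits is
           ['ab', 'aB', 'c', 'C'].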
6002
6003       Note: There are other PCRE-specific options that  can  be  set  by  the
6004       application  when  the  compiling  or matching functions are called. In
6005       some cases the pattern can contain special leading  sequences  such  as
6006       (*CRLF)  to  override  what  the  application  has set or what has been
6007       defaulted.  Details  are  given  in  the  section   entitled   "Newline
6008       sequences" above. There are also the (*UTF8), (*UTF16), (*UTF32), and
6009       (*UCP) leading sequences that can be used to set UTF and Unicode  prop-
6010       erty  modes;  they are equivalent to setting the PCRE_UTF8, PCRE_UTF16,
6011       PCRE_UTF32 and the PCRE_UCP options, respectively. The (*UTF)  sequence
6012       is  a  generic version that can be used with any of the libraries. How-
6013       ever, the application can set the PCRE_NEVER_UTF  option,  which  locks
6014       out the use of the (*UTF) sequences.
6015
6016
6017SUBPATTERNS
6018
6019       Subpatterns are delimited by parentheses (round brackets), which can be
6020       nested.  Turning part of a pattern into a subpattern does two things:
6021
6022       1. It localizes a set of alternatives. For example, the pattern
6023
6024         cat(aract|erpillar|)
6025
6026       matches "cataract", "caterpillar", or "cat". Without  the  parentheses,
6027       it would match "cataract", "erpillar" or an empty string.
6028
6029       2.  It  sets  up  the  subpattern as a capturing subpattern. This means
6030       that, when the whole pattern  matches,  that  portion  of  the  subject
6031       string that matched the subpattern is passed back to the caller via the
6032       ovector argument of the matching function. (This applies  only  to  the
6033       traditional  matching functions; the DFA matching functions do not sup-
6034       port capturing.)
6035
6036       Opening parentheses are counted from left to right (starting from 1) to
6037       obtain  numbers  for  the  capturing  subpatterns.  For example, if the
6038       string "the red king" is matched against the pattern
6039
6040         the ((red|white) (king|queen))
6041
6042       the captured substrings are "red king", "red", and "king", and are num-
6043       bered 1, 2, and 3, respectively.
6044
6045       The  fact  that  plain  parentheses  fulfil two functions is not always
6046       helpful.  There are often times when a grouping subpattern is  required
6047       without  a capturing requirement. If an opening parenthesis is followed
6048       by a question mark and a colon, the subpattern does not do any  captur-
6049       ing,  and  is  not  counted when computing the number of any subsequent
6050       capturing subpatterns. For example, if the string "the white queen"  is
6051       matched against the pattern
6052
6053         the ((?:red|white) (king|queen))
6054
6055       the captured substrings are "white queen" and "queen", and are numbered
6056       1 and 2. The maximum number of capturing subpatterns is 65535.
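
       The numbering rules above can be checked with a short sketch. It uses
       Python's re module rather than PCRE itself; the assumption is that
       both engines number capturing and non-capturing groups identically
       for these two patterns.

```python
import re

# Plain parentheses both group and capture. Groups are numbered by the
# position of their opening parenthesis, counting from 1.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
capturing_groups = capturing.groups()

# (?: ... ) groups without capturing, so it is skipped when numbering.
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")
non_capturing_groups = non_capturing.groups()

print(capturing_groups)      # ('red king', 'red', 'king')
print(non_capturing_groups)  # ('white queen', 'queen')
```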
6057
6058       As a convenient shorthand, if any option settings are required  at  the
6059       start  of  a  non-capturing  subpattern,  the option letters may appear
6060       between the "?" and the ":". Thus the two patterns
6061
6062         (?i:saturday|sunday)
6063         (?:(?i)saturday|sunday)
6064
6065       match exactly the same set of strings. Because alternative branches are
6066       tried  from  left  to right, and options are not reset until the end of
6067       the subpattern is reached, an option setting in one branch does  affect
6068       subsequent  branches,  so  the above patterns match "SUNDAY" as well as
6069       "Saturday".
6070
6071
6072DUPLICATE SUBPATTERN NUMBERS
6073
6074       Perl 5.10 introduced a feature whereby each alternative in a subpattern
6075       uses  the same numbers for its capturing parentheses. Such a subpattern
6076       starts with (?| and is itself a non-capturing subpattern. For  example,
6077       consider this pattern:
6078
6079         (?|(Sat)ur|(Sun))day
6080
6081       Because  the two alternatives are inside a (?| group, both sets of cap-
6082       turing parentheses are numbered one. Thus, when  the  pattern  matches,
6083       you  can  look  at captured substring number one, whichever alternative
6084       matched. This construct is useful when you want to  capture  part,  but
6085       not all, of one of a number of alternatives. Inside a (?| group, paren-
6086       theses are numbered as usual, but the number is reset at the  start  of
6087       each  branch.  The numbers of any capturing parentheses that follow the
6088       subpattern start after the highest number used in any branch. The  fol-
6089       lowing example is taken from the Perl documentation. The numbers under-
6090       neath show in which buffer the captured content will be stored.
6091
6092         # before  ---------------branch-reset----------- after
6093         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
6094         # 1            2         2  3        2     3     4
6095
6096       A back reference to a numbered subpattern uses the  most  recent  value
6097       that  is  set  for that number by any subpattern. The following pattern
6098       matches "abcabc" or "defdef":
6099
6100         /(?|(abc)|(def))\1/
6101
6102       In contrast, a subroutine call to a numbered subpattern  always  refers
6103       to  the  first  one in the pattern with the given number. The following
6104       pattern matches "abcabc" or "defabc":
6105
6106         /(?|(abc)|(def))(?1)/
6107
6108       If a condition test for a subpattern's having matched refers to a  non-
6109       unique  number, the test is true if any of the subpatterns of that num-
6110       ber have matched.
6111
6112       An alternative approach to using this "branch reset" feature is to  use
6113       duplicate named subpatterns, as described in the next section.
6114
6115
6116NAMED SUBPATTERNS
6117
6118       Identifying  capturing  parentheses  by number is simple, but it can be
6119       very hard to keep track of the numbers in complicated  regular  expres-
6120       sions.  Furthermore,  if  an  expression  is  modified, the numbers may
6121       change. To help with this difficulty, PCRE supports the naming of  sub-
6122       patterns. This feature was not added to Perl until release 5.10. Python
6123       had the feature earlier, and PCRE introduced it at release  4.0,  using
6124       the  Python syntax. PCRE now supports both the Perl and the Python syn-
6125       tax. Perl allows identically numbered  subpatterns  to  have  different
6126       names, but PCRE does not.
6127
6128       In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
6129       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
6130       to  capturing parentheses from other parts of the pattern, such as back
6131       references, recursion, and conditions, can be made by name as  well  as
6132       by number.
6133
6134       Names  consist of up to 32 alphanumeric characters and underscores, but
6135       must start with a non-digit.  Named  capturing  parentheses  are  still
6136       allocated  numbers  as  well as names, exactly as if the names were not
6137       present. The PCRE API provides function calls for extracting the  name-
6138       to-number  translation  table  from a compiled pattern. There is also a
6139       convenience function for extracting a captured substring by name.
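
       As a rough illustration of names coexisting with numbers, the sketch
       below uses the Python syntax (?P<name>...) in Python's re module. The
       groupindex attribute is Python's own analogue of a name-to-number
       table; it is not the PCRE API, and the pattern and names are made up
       for the example.

```python
import re

# (?P<name>...) is the Python-style naming syntax that PCRE also accepts.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups still receive ordinary numbers, in order of appearance.
name_to_number = dict(date.groupindex)

m = date.match("2013-05")
by_name = m.group("year")   # look the substring up by name
by_number = m.group(1)      # the same substring, looked up by number

print(name_to_number)       # {'year': 1, 'month': 2}
print(by_name, by_number)   # 2013 2013
```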
6140
6141       By default, a name must be unique within a pattern, but it is  possible
6142       to relax this constraint by setting the PCRE_DUPNAMES option at compile
6143       time. (Duplicate names are also always permitted for  subpatterns  with
6144       the  same  number, set up as described in the previous section.) Dupli-
6145       cate names can be useful for patterns where only one  instance  of  the
6146       named  parentheses  can  match. Suppose you want to match the name of a
6147       weekday, either as a 3-letter abbreviation or as the full name, and  in
6148       both cases you want to extract the abbreviation. This pattern (ignoring
6149       the line breaks) does the job:
6150
6151         (?<DN>Mon|Fri|Sun)(?:day)?|
6152         (?<DN>Tue)(?:sday)?|
6153         (?<DN>Wed)(?:nesday)?|
6154         (?<DN>Thu)(?:rsday)?|
6155         (?<DN>Sat)(?:urday)?
6156
6157       There are five capturing substrings, but only one is ever set  after  a
6158       match.  (An alternative way of solving this problem is to use a "branch
6159       reset" subpattern, as described in the previous section.)
6160
6161       The convenience function for extracting the data by  name  returns  the
6162       substring  for  the first (and in this example, the only) subpattern of
6163       that name that matched. This saves searching  to  find  which  numbered
6164       subpattern it was.
6165
6166       If  you  make  a  back  reference to a non-unique named subpattern from
6167       elsewhere in the pattern, the subpatterns to which the name refers  are
6168       checked  in  the order in which they appear in the overall pattern. The
6169       first one that is set is used for the reference. For example, this pat-
6170       tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
6171
6172         (?:(?<n>foo)|(?<n>bar))\k<n>
6173
6174
6175       If you make a subroutine call to a non-unique named subpattern, the one
6176       that corresponds to the first occurrence of the name is  used.  In  the
6177       absence of duplicate numbers (see the previous section) this is the one
6178       with the lowest number.
6179
6180       If you use a named reference in a condition test (see the section about
6181       conditions below), either to check whether a subpattern has matched, or
6182       to check for recursion, all subpatterns with the same name are  tested.
6183       If  the condition is true for any one of them, the overall condition is
6184       true. This is the same behaviour as  testing  by  number.  For  further
6185       details  of  the  interfaces  for  handling  named subpatterns, see the
6186       pcreapi documentation.
6187
6188       Warning: You cannot use different names to distinguish between two sub-
6189       patterns  with  the same number because PCRE uses only the numbers when
6190       matching. For this reason, an error is given at compile time if differ-
6191       ent  names  are given to subpatterns with the same number. However, you
6192       can always give the same name to subpatterns with the same number, even
6193       when PCRE_DUPNAMES is not set.
6194
6195
6196REPETITION
6197
6198       Repetition  is  specified  by  quantifiers, which can follow any of the
6199       following items:
6200
6201         a literal data character
6202         the dot metacharacter
6203         the \C escape sequence
6204         the \X escape sequence
6205         the \R escape sequence
6206         an escape such as \d or \pL that matches a single character
6207         a character class
6208         a back reference (see next section)
6209         a parenthesized subpattern (including assertions)
6210         a subroutine call to a subpattern (recursive or otherwise)
6211
6212       The general repetition quantifier specifies a minimum and maximum  num-
6213       ber  of  permitted matches, by giving the two numbers in curly brackets
6214       (braces), separated by a comma. The numbers must be  less  than  65536,
6215       and the first must be less than or equal to the second. For example:
6216
6217         z{2,4}
6218
6219       matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
6220       special character. If the second number is omitted, but  the  comma  is
6221       present,  there  is  no upper limit; if the second number and the comma
6222       are both omitted, the quantifier specifies an exact number of  required
6223       matches. Thus
6224
6225         [aeiou]{3,}
6226
6227       matches at least 3 successive vowels, but may match many more, while
6228
6229         \d{8}
6230
6231       matches  exactly  8  digits. An opening curly bracket that appears in a
6232       position where a quantifier is not allowed, or one that does not  match
6233       the  syntax of a quantifier, is taken as a literal character. For exam-
6234       ple, {,6} is not a quantifier, but a literal string of four characters.
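
       The three forms of the general quantifier can be tried out as
       follows. This is a sketch in Python's re module, assumed to agree
       with PCRE for these patterns. One known difference: Python reads
       {,6} as {0,6}, unlike the PCRE behaviour described above, so that
       case is not shown. The helper name fully_matches is hypothetical.

```python
import re

def fully_matches(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions, tested on "z" through "zzzzz".
z_results = [fully_matches(r"z{2,4}", "z" * n) for n in range(1, 6)]

# {3,}: at least three, no upper limit. Being greedy, it takes all it can.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight.
eight_ok = fully_matches(r"\d{8}", "20130501")
seven_ok = fully_matches(r"\d{8}", "2013050")

print(z_results)           # [False, True, True, True, False]
print(vowel_run)           # ueuei
print(eight_ok, seven_ok)  # True False
```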
6235
6236       In UTF modes, quantifiers apply to characters rather than to individual
6237       data  units. Thus, for example, \x{100}{2} matches two characters, each
6238       of which is represented by a two-byte sequence in a UTF-8 string. Simi-
6239       larly,  \X{3} matches three Unicode extended grapheme clusters, each of
6240       which may be several data units long (and  they  may  be  of  different
6241       lengths).
6242
6243       The quantifier {0} is permitted, causing the expression to behave as if
6244       the previous item and the quantifier were not present. This may be use-
6245       ful  for  subpatterns that are referenced as subroutines from elsewhere
6246       in the pattern (but see also the section entitled "Defining subpatterns
6247       for  use  by  reference only" below). Items other than subpatterns that
6248       have a {0} quantifier are omitted from the compiled pattern.
6249
6250       For convenience, the three most common quantifiers have  single-charac-
6251       ter abbreviations:
6252
6253         *    is equivalent to {0,}
6254         +    is equivalent to {1,}
6255         ?    is equivalent to {0,1}
6256
6257       It  is  possible  to construct infinite loops by following a subpattern
6258       that can match no characters with a quantifier that has no upper limit,
6259       for example:
6260
6261         (a?)*
6262
6263       Earlier versions of Perl and PCRE used to give an error at compile time
6264       for such patterns. However, because there are cases where this  can  be
6265       useful,  such  patterns  are now accepted, but if any repetition of the
6266       subpattern does in fact match no characters, the loop is forcibly  bro-
6267       ken.
6268
6269       By  default,  the quantifiers are "greedy", that is, they match as much
6270       as possible (up to the maximum  number  of  permitted  times),  without
6271       causing  the  rest of the pattern to fail. The classic example of where
6272       this gives problems is in trying to match comments in C programs. These
6273       appear  between  /*  and  */ and within the comment, individual * and /
6274       characters may appear. An attempt to match C comments by  applying  the
6275       pattern
6276
6277         /\*.*\*/
6278
6279       to the string
6280
6281         /* first comment */  not comment  /* second comment */
6282
6283       fails,  because it matches the entire string owing to the greediness of
6284       the .*  item.
6285
6286       However, if a quantifier is followed by a question mark, it  ceases  to
6287       be greedy, and instead matches the minimum number of times possible, so
6288       the pattern
6289
6290         /\*.*?\*/
6291
6292       does the right thing with the C comments. The meaning  of  the  various
6293       quantifiers  is  not  otherwise  changed,  just the preferred number of
6294       matches.  Do not confuse this use of question mark with its  use  as  a
6295       quantifier  in its own right. Because it has two uses, it can sometimes
6296       appear doubled, as in
6297
6298         \d??\d
6299
6300       which matches one digit by preference, but can match two if that is the
6301       only way the rest of the pattern matches.
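
       The difference between greedy and minimizing repeats on the C comment
       example, and the doubled question mark, can be seen in this sketch.
       It uses Python's re module, on the assumption that its greedy and
       lazy quantifiers behave like PCRE's for these patterns.

```python
import re

text = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the string, swallowing everything.
greedy = re.findall(r"/\*.*\*/", text)

# Lazy .*? stops at the first */ it can, so each comment is found.
lazy = re.findall(r"/\*.*?\*/", text)

# \d??\d prefers one digit, but takes two when the rest of the match
# (here, reaching the end of the subject) requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy == [text])       # True
print(lazy)                   # ['/* first comment */', '/* second comment */']
print(one_digit, two_digits)  # 1 12
```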
6302
6303       If  the PCRE_UNGREEDY option is set (an option that is not available in
6304       Perl), the quantifiers are not greedy by default, but  individual  ones
6305       can  be  made  greedy  by following them with a question mark. In other
6306       words, it inverts the default behaviour.
6307
6308       When a parenthesized subpattern is quantified  with  a  minimum  repeat
6309       count  that is greater than 1 or with a limited maximum, more memory is
6310       required for the compiled pattern, in proportion to  the  size  of  the
6311       minimum or maximum.
6312
6313       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
6314       alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,
6315       the  pattern  is  implicitly anchored, because whatever follows will be
6316       tried against every character position in the subject string, so  there
6317       is  no  point  in  retrying the overall match at any position after the
6318       first. PCRE normally treats such a pattern as though it  were  preceded
6319       by \A.
6320
6321       In  cases  where  it  is known that the subject string contains no new-
6322       lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
6323       mization, or alternatively using ^ to indicate anchoring explicitly.
6324
6325       However,  there  are  some cases where the optimization cannot be used.
6326       When .*  is inside capturing parentheses that are the subject of a back
6327       reference elsewhere in the pattern, a match at the start may fail where
6328       a later one succeeds. Consider, for example:
6329
6330         (.*)abc\1
6331
6332       If the subject is "xyz123abc123" the match point is the fourth  charac-
6333       ter. For this reason, such a pattern is not implicitly anchored.
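
       A quick check of the match point, sketched with Python's re module
       (assumed to behave like PCRE here). Note that Python reports offsets
       counting from zero, so the fourth character is offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts starting at offsets 0, 1 and 2 fail because the text captured
# before "abc" is longer than the "123" that follows it.
match_start = m.start()
captured = m.group(1)
whole_match = m.group(0)

print(match_start, captured, whole_match)  # 3 123 123abc123
```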
6334
6335       Another  case where implicit anchoring is not applied is when the lead-
6336       ing .* is inside an atomic group. Once again, a match at the start  may
6337       fail where a later one succeeds. Consider this pattern:
6338
6339         (?>.*?a)b
6340
6341       It  matches "ab" in the subject "aab". The use of the backtracking con-
6342       trol verbs (*PRUNE) and (*SKIP) also disables this optimization.
6343
6344       When a capturing subpattern is repeated, the value captured is the sub-
6345       string that matched the final iteration. For example, after
6346
6347         (tweedle[dume]{3}\s*)+
6348
6349       has matched "tweedledum tweedledee" the value of the captured substring
6350       is "tweedledee". However, if there are  nested  capturing  subpatterns,
6351       the  corresponding captured values may have been set in previous itera-
6352       tions. For example, after
6353
6354         /(a|(b))+/
6355
6356       matches "aba" the value of the second captured substring is "b".
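
       Both cases can be observed in the sketch below. It relies on
       Python's re module, on the assumption that it keeps captured values
       across iterations in the same way as PCRE for these two patterns.

```python
import re

# The outer group is overwritten on every iteration, so only the text
# matched by the final iteration remains.
tweedle = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = tweedle.group(1)

# The nested group (b) takes no part in the final iteration, which
# matches "a", so it keeps the value set in the previous iteration.
nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)

print(last_iteration)  # tweedledee
print(outer, inner)    # a b
```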
6357
6358
6359ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
6360
6361       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
6362       repetition,  failure  of what follows normally causes the repeated item
6363       to be re-evaluated to see if a different number of repeats  allows  the
6364       rest  of  the pattern to match. Sometimes it is useful to prevent this,
6365       either to change the nature of the match, or to cause it to fail earlier
6366       than  it otherwise might, when the author of the pattern knows there is
6367       no point in carrying on.
6368
6369       Consider, for example, the pattern \d+foo when applied to  the  subject
6370       line
6371
6372         123456bar
6373
6374       After matching all 6 digits and then failing to match "foo", the normal
6375       action of the matcher is to try again with only 5 digits  matching  the
6376       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
6377       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
6378       the  means for specifying that once a subpattern has matched, it is not
6379       to be re-evaluated in this way.
6380
6381       If we use atomic grouping for the previous example, the  matcher  gives
6382       up  immediately  on failing to match "foo" the first time. The notation
6383       is a kind of special parenthesis, starting with (?> as in this example:
6384
6385         (?>\d+)foo
6386
6387       This kind of parenthesis "locks up" the  part of the  pattern  it  con-
6388       tains  once  it  has matched, and a failure further into the pattern is
6389       prevented from backtracking into it. Backtracking past it  to  previous
6390       items, however, works as normal.
6391
6392       An  alternative  description  is that a subpattern of this type matches
6393       the string of characters that an  identical  standalone  pattern  would
6394       match, if anchored at the current point in the subject string.
6395
6396       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
6397       such as the above example can be thought of as a maximizing repeat that
6398       must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
6399       pared to adjust the number of digits they match in order  to  make  the
6400       rest of the pattern match, (?>\d+) can only match an entire sequence of
6401       digits.
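
       The "must swallow everything" behaviour can be approximated in an
       engine that lacks (?>...). Python's re module accepts that syntax
       only from version 3.11, so this sketch substitutes a different
       technique: a lookahead that captures the digit run, followed by a
       back reference that consumes it. This emulation is a stand-in for
       illustration, not how PCRE implements atomic groups, and the group
       name "run" is hypothetical.

```python
import re

subject = "123456"

# Ordinary greedy repeat: \d+ gives back the final "6", so this matches.
backtracking = re.search(r"\d+6", subject)

# Emulated atomic group: the lookahead captures the longest digit run and
# the back reference consumes exactly that text. The engine does not
# backtrack into a lookahead that has already matched, so the run cannot
# shrink to leave a "6" for the rest of the pattern.
atomic_like = re.search(r"(?=(?P<run>\d+))(?P=run)6", subject)

print(backtracking.group())  # 123456
print(atomic_like)           # None
```

       A named reference is used so that the literal "6" following the
       back reference is not read as part of a group number.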
6402
6403       Atomic groups in general can of course contain arbitrarily  complicated
6404       subpatterns,  and  can  be  nested. However, when the subpattern for an
6405       atomic group is just a single repeated item, as in the example above, a
6406       simpler  notation,  called  a "possessive quantifier" can be used. This
6407       consists of an additional + character  following  a  quantifier.  Using
6408       this notation, the previous example can be rewritten as
6409
6410         \d++foo
6411
6412       Note that a possessive quantifier can be used with an entire group, for
6413       example:
6414
6415         (abc|xyz){2,3}+
6416
6417       Possessive  quantifiers  are  always  greedy;  the   setting   of   the
6418       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
6419       simpler forms of atomic group. However, there is no difference  in  the
6420       meaning  of  a  possessive  quantifier and the equivalent atomic group,
6421       though there may be a performance  difference;  possessive  quantifiers
6422       should be slightly faster.
6423
6424       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
6425       tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
6426       edition of his book. Mike McCloskey liked it, so implemented it when he
6427       built Sun's Java package, and PCRE copied it from there. It  ultimately
6428       found its way into Perl at release 5.10.
6429
6430       PCRE has an optimization that automatically "possessifies" certain sim-
6431       ple pattern constructs. For example, the sequence  A+B  is  treated  as
6432       A++B  because  there is no point in backtracking into a sequence of A's
6433       when B must follow.
6434
6435       When a pattern contains an unlimited repeat inside  a  subpattern  that
6436       can  itself  be  repeated  an  unlimited number of times, the use of an
6437       atomic group is the only way to avoid some  failing  matches  taking  a
6438       very long time indeed. The pattern
6439
6440         (\D+|<\d+>)*[!?]
6441
6442       matches  an  unlimited number of substrings that either consist of non-
6443       digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
6444       matches, it runs quickly. However, if it is applied to
6445
6446         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
6447
6448       it  takes  a  long  time  before reporting failure. This is because the
6449       string can be divided between the internal \D+ repeat and the  external
6450       *  repeat  in  a  large  number of ways, and all have to be tried. (The
6451       example uses [!?] rather than a single character at  the  end,  because
6452       both  PCRE  and  Perl have an optimization that allows for fast failure
6453       when a single character is used. They remember the last single  charac-
6454       ter  that  is required for a match, and fail early if it is not present
6455       in the string.) If the pattern is changed so that  it  uses  an  atomic
6456       group, like this:
6457
6458         ((?>\D+)|<\d+>)*[!?]
6459
6460       sequences of non-digits cannot be broken, and failure happens quickly.
6461
6462
6463BACK REFERENCES
6464
6465       Outside a character class, a backslash followed by a digit greater than
6466       0 (and possibly further digits) is a back reference to a capturing sub-
6467       pattern  earlier  (that is, to its left) in the pattern, provided there
6468       have been that many previous capturing left parentheses.
6469
6470       However, if the decimal number following the backslash is less than 10,
6471       it  is  always  taken  as a back reference, and causes an error only if
6472       there are not that many capturing left parentheses in the  entire  pat-
6473       tern.  In  other words, the parentheses that are referenced need not be
6474       to the left of the reference for numbers less than 10. A "forward  back
6475       reference"  of  this  type can make sense when a repetition is involved
6476       and the subpattern to the right has participated in an  earlier  itera-
6477       tion.
6478
6479       It  is  not  possible to have a numerical "forward back reference" to a
6480       subpattern whose number is 10 or  more  using  this  syntax  because  a
6481       sequence  such  as  \50 is interpreted as a character defined in octal.
6482       See the subsection entitled "Non-printing characters" above for further
6483       details  of  the  handling of digits following a backslash. There is no
6484       such problem when named parentheses are used. A back reference  to  any
6485       subpattern is possible using named parentheses (see below).
6486
6487       Another  way  of  avoiding  the ambiguity inherent in the use of digits
6488       following a backslash is to use the \g  escape  sequence.  This  escape
6489       must be followed by an unsigned number or a negative number, optionally
6490       enclosed in braces. These examples are all identical:
6491
6492         (ring), \1
6493         (ring), \g1
6494         (ring), \g{1}
6495
6496       An unsigned number specifies an absolute reference without the  ambigu-
6497       ity that is present in the older syntax. It is also useful when literal
6498       digits follow the reference. A negative number is a relative reference.
6499       Consider this example:
6500
6501         (abc(def)ghi)\g{-1}
6502
6503       The sequence \g{-1} is a reference to the most recently started captur-
6504       ing subpattern before \g, that is, it is equivalent to \2 in this exam-
6505       ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
6506       references can be helpful in long patterns, and also in  patterns  that
6507       are  created  by  joining  together  fragments  that contain references
6508       within themselves.
6509
6510       A back reference matches whatever actually matched the  capturing  sub-
6511       pattern  in  the  current subject string, rather than anything matching
6512       the subpattern itself (see "Subpatterns as subroutines" below for a way
6513       of doing that). So the pattern
6514
6515         (sens|respons)e and \1ibility
6516
6517       matches  "sense and sensibility" and "response and responsibility", but
6518       not "sense and responsibility". If caseful matching is in force at  the
6519       time  of the back reference, the case of letters is relevant. For exam-
6520       ple,
6521
6522         ((?i)rah)\s+\1
6523
6524       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
6525       original capturing subpattern is matched caselessly.
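
       Both examples can be tried in the following sketch, which uses
       Python's re module as a stand-in for PCRE. Recent Python versions
       reject (?i) in the middle of a pattern, so the second pattern is
       written with the scoped form (?i:rah); treating that as equivalent
       to the PCRE pattern above is an assumption.

```python
import re

# The back reference must repeat the text that group 1 actually matched.
sensibility = re.compile(r"(sens|respons)e and \1ibility")
same_word_1 = sensibility.fullmatch("sense and sensibility") is not None
same_word_2 = sensibility.fullmatch("response and responsibility") is not None
mixed_words = sensibility.fullmatch("sense and responsibility") is not None

# The group matches caselessly, but the back reference sits outside the
# caseless scope, so it compares the captured text with case respected.
rah = re.compile(r"((?i:rah))\s+\1")
lower_pair = rah.fullmatch("rah rah") is not None
upper_pair = rah.fullmatch("RAH RAH") is not None
mixed_pair = rah.fullmatch("RAH rah") is not None

print(same_word_1, same_word_2, mixed_words)  # True True False
print(lower_pair, upper_pair, mixed_pair)     # True True False
```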
6526
6527       There  are  several  different ways of writing back references to named
6528       subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
6529       \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
6530       unified back reference syntax, in which \g can be used for both numeric
6531       and  named  references,  is  also supported. We could rewrite the above
6532       example in any of the following ways:
6533
6534         (?<p1>(?i)rah)\s+\k<p1>
6535         (?'p1'(?i)rah)\s+\k{p1}
6536         (?P<p1>(?i)rah)\s+(?P=p1)
6537         (?<p1>(?i)rah)\s+\g{p1}
6538
6539       A subpattern that is referenced by  name  may  appear  in  the  pattern
6540       before or after the reference.
6541
6542       There  may be more than one back reference to the same subpattern. If a
6543       subpattern has not actually been used in a particular match,  any  back
6544       references to it always fail by default. For example, the pattern
6545
6546         (a|(bc))\2
6547
6548       always  fails  if  it starts to match "a" rather than "bc". However, if
6549       the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
6550       ence to an unset value matches an empty string.
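
       The default behaviour for an unset group can be observed with the
       sketch below. It uses Python's re module, which is assumed to share
       PCRE's default here; the effect of PCRE_JAVASCRIPT_COMPAT is not
       reproduced.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# When the first alternative matches "a", group 2 is never set, so the
# back reference \2 cannot match and the whole attempt fails.
unset_reference = pattern.search("aa")

# When the second alternative matches, group 2 holds "bc" and \2 repeats it.
set_reference = pattern.search("bcbc")

print(unset_reference)        # None
print(set_reference.group())  # bcbc
```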
6551
6552       Because  there may be many capturing parentheses in a pattern, all dig-
6553       its following a backslash are taken as part of a potential back  refer-
6554       ence  number.   If  the  pattern continues with a digit character, some
6555       delimiter must  be  used  to  terminate  the  back  reference.  If  the
6556       PCRE_EXTENDED  option  is  set, this can be white space. Otherwise, the
6557       \g{ syntax or an empty comment (see "Comments" below) can be used.
6558
6559   Recursive back references
6560
6561       A back reference that occurs inside the parentheses to which it  refers
6562       fails  when  the subpattern is first used, so, for example, (a\1) never
6563       matches.  However, such references can be useful inside  repeated  sub-
6564       patterns. For example, the pattern
6565
6566         (a|b\1)+
6567
6568       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
6569       ation of the subpattern,  the  back  reference  matches  the  character
6570       string  corresponding  to  the previous iteration. In order for this to
6571       work, the pattern must be such that the first iteration does  not  need
6572       to  match the back reference. This can be done using alternation, as in
6573       the example above, or by a quantifier with a minimum of zero.
6574
6575       Back references of this type cause the group that they reference to  be
6576       treated  as  an atomic group.  Once the whole group has been matched, a
6577       subsequent matching failure cannot cause backtracking into  the  middle
6578       of the group.
6579
6580
6581ASSERTIONS
6582
6583       An  assertion  is  a  test on the characters following or preceding the
6584       current matching point that does not actually consume  any  characters.
6585       The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
6586       described above.
6587
6588       More complicated assertions are coded as  subpatterns.  There  are  two
6589       kinds:  those  that  look  ahead of the current position in the subject
6590       string, and those that look  behind  it.  An  assertion  subpattern  is
6591       matched  in  the  normal way, except that it does not cause the current
6592       matching position to be changed.
6593
6594       Assertion subpatterns are not capturing subpatterns. If such an  asser-
6595       tion  contains  capturing  subpatterns within it, these are counted for
6596       the purposes of numbering the capturing subpatterns in the  whole  pat-
6597       tern.  However,  substring  capturing  is carried out only for positive
6598       assertions. (Perl sometimes, but not always, does do capturing in nega-
6599       tive assertions.)
6600
6601       For  compatibility  with  Perl,  assertion subpatterns may be repeated;
6602       though it makes no sense to assert the same thing  several  times,  the
6603       side  effect  of  capturing  parentheses may occasionally be useful. In
6604       practice, there are only three cases:
6605
6606       (1) If the quantifier is {0}, the  assertion  is  never  obeyed  during
6607       matching.   However,  it  may  contain internal capturing parenthesized
6608       groups that are called from elsewhere via the subroutine mechanism.
6609
6610       (2) If the quantifier is {0,n} where n is greater than zero, it is treated
6611       as  if  it  were  {0,1}.  At run time, the rest of the pattern match is
6612       tried with and without the assertion, the order depending on the greed-
6613       iness of the quantifier.
6614
6615       (3)  If  the minimum repetition is greater than zero, the quantifier is
6616       ignored.  The assertion is obeyed just  once  when  encountered  during
6617       matching.
6618
6619   Lookahead assertions
6620
6621       Lookahead assertions start with (?= for positive assertions and (?! for
6622       negative assertions. For example,
6623
6624         \w+(?=;)
6625
6626       matches a word followed by a semicolon, but does not include the  semi-
6627       colon in the match, and
6628
6629         foo(?!bar)
6630
6631       matches  any  occurrence  of  "foo" that is not followed by "bar". Note
6632       that the apparently similar pattern
6633
6634         (?!foo)bar
6635
6636       does not find an occurrence of "bar"  that  is  preceded  by  something
6637       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
6638       the assertion (?!foo) is always true when the next three characters are
6639       "bar". A lookbehind assertion is needed to achieve the other effect.
6640
6641       If you want to force a matching failure at some point in a pattern, the
6642       most convenient way to do it is  with  (?!)  because  an  empty  string
6643       always  matches, so an assertion that requires there not to be an empty
6644       string must always fail.  The backtracking control verb (*FAIL) or (*F)
6645       is a synonym for (?!).
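
       The lookahead examples above can be tried out in a few lines. This is
       a minimal sketch that uses Python's standard re module rather than
       PCRE itself; it assumes only that the two engines agree on these
       particular constructs. The variable names and subject strings are
       illustrative.

```python
import re

# \w+(?=;) : a word followed by a semicolon; the semicolon is not consumed.
word = re.search(r"\w+(?=;)", "total;")

# foo(?!bar) : "foo" only when "bar" does not follow.
foo_then_bar = re.search(r"foo(?!bar)", "foobar")
foo_then_baz = re.search(r"foo(?!bar)", "foobaz")

# (?!foo)bar : the lookahead inspects the text at the current position,
# which is "bar" itself, so the "foo" in front is never examined.
misleading = re.search(r"(?!foo)bar", "foobar")

# (?!) : an empty negative lookahead always fails, so the first branch
# can never succeed and only the second branch can match.
forced_fail = re.search(r"x(?!)|y", "xy")

print(word.group(), foo_then_bar, foo_then_baz.group())
print(misleading.span(), forced_fail.group())
```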
6646
6647   Lookbehind assertions
6648
6649       Lookbehind  assertions start with (?<= for positive assertions and (?<!
6650       for negative assertions. For example,
6651
6652         (?<!foo)bar
6653
6654       does find an occurrence of "bar" that is not  preceded  by  "foo".  The
6655       contents  of  a  lookbehind  assertion are restricted such that all the
6656       strings it matches must have a fixed length. However, if there are sev-
6657       eral  top-level  alternatives,  they  do  not all have to have the same
6658       fixed length. Thus
6659
6660         (?<=bullock|donkey)
6661
6662       is permitted, but
6663
6664         (?<!dogs?|cats?)
6665
6666       causes an error at compile time. Branches that match  different  length
6667       strings  are permitted only at the top level of a lookbehind assertion.
6668       This is an extension compared with Perl, which requires all branches to
6669       match the same length of string. An assertion such as
6670
6671         (?<=ab(c|de))
6672
6673       is  not  permitted,  because  its single top-level branch can match two
6674       different lengths, but it is acceptable to PCRE if rewritten to use two
6675       top-level branches:
6676
6677         (?<=abc|abde)
6678
6679       In  some  cases, the escape sequence \K (see above) can be used instead
6680       of a lookbehind assertion to get round the fixed-length restriction.
6681
6682       The implementation of lookbehind assertions is, for  each  alternative,
6683       to  temporarily  move the current position back by the fixed length and
6684       then try to match. If there are insufficient characters before the cur-
6685       rent position, the assertion fails.
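
       The step-back behaviour can be observed directly. The sketch below
       uses Python's re module as a stand-in for PCRE (an assumption; note
       that Python is stricter than PCRE and requires every alternative in a
       lookbehind to have the same length). The subject strings are made up
       for illustration.

```python
import re

# (?<!foo)bar : "bar" that is not preceded by "foo".
after_foo = re.search(r"(?<!foo)bar", "foobar")
after_baz = re.search(r"(?<!foo)bar", "bazbar")

# Only two characters precede "d" in "bcd", so a three-character
# lookbehind cannot step back far enough and the assertion fails.
too_short = re.search(r"(?<=abc)d", "bcd")
long_enough = re.search(r"(?<=abc)d", "abcd")

# For the same reason a negative lookbehind succeeds at the very start
# of the subject: the inner match cannot be attempted at all.
at_start = re.search(r"(?<!foo)bar", "bar")

print(after_foo, after_baz.span(), too_short)
print(long_enough.span(), at_start.span())
```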
6686
6687       In  a UTF mode, PCRE does not allow the \C escape (which matches a sin-
6688       gle data unit even in a UTF mode) to appear in  lookbehind  assertions,
6689       because  it  makes it impossible to calculate the length of the lookbe-
6690       hind. The \X and \R escapes, which can match different numbers of  data
6691       units, are also not permitted.
6692
6693       "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in
6694       lookbehinds, as long as the subpattern matches a  fixed-length  string.
6695       Recursion, however, is not supported.
6696
6697       Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
6698       assertions to specify efficient matching of fixed-length strings at the
6699       end of subject strings. Consider a simple pattern such as
6700
6701         abcd$
6702
6703       when  applied  to  a  long string that does not match. Because matching
6704       proceeds from left to right, PCRE will look for each "a" in the subject
6705       and  then  see  if what follows matches the rest of the pattern. If the
6706       pattern is specified as
6707
6708         ^.*abcd$
6709
6710       the initial .* matches the entire string at first, but when this  fails
6711       (because there is no following "a"), it backtracks to match all but the
6712       last character, then all but the last two characters, and so  on.  Once
6713       again  the search for "a" covers the entire string, from right to left,
6714       so we are no better off. However, if the pattern is written as
6715
6716         ^.*+(?<=abcd)
6717
6718       there can be no backtracking for the .*+ item; it can  match  only  the
6719       entire  string.  The subsequent lookbehind assertion does a single test
6720       on the last four characters. If it fails, the match fails  immediately.
6721       For  long  strings, this approach makes a significant difference to the
6722       processing time.
6723
6724   Using multiple assertions
6725
6726       Several assertions (of any sort) may occur in succession. For example,
6727
6728         (?<=\d{3})(?<!999)foo
6729
       matches "foo" preceded by three digits that are not "999". Notice that
       each of the assertions is applied independently at the same point in
       the subject string. First there is a check that the previous three
       characters are all digits, and then there is a check that the same
       three characters are not "999". This pattern does not match "foo"
       preceded by six characters, the first three of which are digits and the
       last three of which are not "999". For example, it does not match
       "123abcfoo". A pattern to do that is
6738
6739         (?<=\d{3}...)(?<!999)foo
6740
6741       This  time  the  first assertion looks at the preceding six characters,
6742       checking that the first three are digits, and then the second assertion
6743       checks that the preceding three characters are not "999".
6744
6745       Assertions can be nested in any combination. For example,
6746
6747         (?<=(?<!foo)bar)baz
6748
6749       matches  an occurrence of "baz" that is preceded by "bar" which in turn
6750       is not preceded by "foo", while
6751
6752         (?<=\d{3}(?!999)...)foo
6753
6754       is another pattern that matches "foo" preceded by three digits and  any
6755       three characters that are not "999".
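
       The four patterns in this section can be compared side by side. This
       is a sketch using Python's re module, which is assumed to treat these
       fixed-length lookbehinds in the same way as PCRE; the compiled pattern
       names are hypothetical.

```python
import re

# Both assertions look at the same three characters before "foo".
both_at_same_point = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks six characters back, the second three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

# A negative lookbehind nested inside a positive lookbehind.
nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")

# A negative lookahead nested inside a positive lookbehind.
nested_ahead = re.compile(r"(?<=\d{3}(?!999)...)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          bool(both_at_same_point.search(subject)),
          bool(six_back.search(subject)),
          bool(nested_ahead.search(subject)))
```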
6756
6757
6758CONDITIONAL SUBPATTERNS
6759
6760       It  is possible to cause the matching process to obey a subpattern con-
6761       ditionally or to choose between two alternative subpatterns,  depending
6762       on  the result of an assertion, or whether a specific capturing subpat-
6763       tern has already been matched. The two possible  forms  of  conditional
6764       subpattern are:
6765
6766         (?(condition)yes-pattern)
6767         (?(condition)yes-pattern|no-pattern)
6768
6769       If  the  condition is satisfied, the yes-pattern is used; otherwise the
6770       no-pattern (if present) is used. If there are more  than  two  alterna-
6771       tives  in  the subpattern, a compile-time error occurs. Each of the two
6772       alternatives may itself contain nested subpatterns of any form, includ-
6773       ing  conditional  subpatterns;  the  restriction  to  two  alternatives
6774       applies only at the level of the condition. This pattern fragment is an
6775       example where the alternatives are complex:
6776
6777         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
6778
6779
6780       There  are  four  kinds of condition: references to subpatterns, refer-
6781       ences to recursion, a pseudo-condition called DEFINE, and assertions.
6782
6783   Checking for a used subpattern by number
6784
6785       If the text between the parentheses consists of a sequence  of  digits,
6786       the condition is true if a capturing subpattern of that number has pre-
6787       viously matched. If there is more than one  capturing  subpattern  with
6788       the  same  number  (see  the earlier section about duplicate subpattern
6789       numbers), the condition is true if any of them have matched. An  alter-
6790       native  notation is to precede the digits with a plus or minus sign. In
6791       this case, the subpattern number is relative rather than absolute.  The
6792       most  recently opened parentheses can be referenced by (?(-1), the next
6793       most recent by (?(-2), and so on. Inside loops it can also  make  sense
6794       to refer to subsequent groups. The next parentheses to be opened can be
6795       referenced as (?(+1), and so on. (The value zero in any of these  forms
6796       is not used; it provokes a compile-time error.)
6797
6798       Consider  the  following  pattern, which contains non-significant white
6799       space to make it more readable (assume the PCRE_EXTENDED option) and to
6800       divide it into three parts for ease of discussion:
6801
6802         ( \( )?    [^()]+    (?(1) \) )
6803
       The first part matches an optional opening parenthesis, and if that
       character is present, sets it as the first captured substring. The
       second part matches one or more characters that are not parentheses.
       The third part is a conditional subpattern that tests whether or not
       the first set of parentheses matched. If they did, that is, if the
       subject started with an opening parenthesis, the condition is true, and
       so the yes-pattern is executed and a closing parenthesis is required.
       Otherwise, since no-pattern is not present, the subpattern matches
       nothing. In other words, this pattern matches a sequence of
       non-parentheses, optionally enclosed in parentheses.
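
       The behaviour described above can be checked with a short script.
       This sketch relies on Python's re module, which accepts the same
       (?(1)...) form, with re.VERBOSE standing in for PCRE_EXTENDED; that
       correspondence is an assumption of the example, and accepts() is a
       hypothetical helper.

```python
import re

# ( \( )?  [^()]+  (?(1) \) ) : the closing parenthesis is required only
# when group 1 captured an opening parenthesis.
paren_optional = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

def accepts(text):
    """Return True when the whole of text matches the pattern."""
    return paren_optional.fullmatch(text) is not None

for sample in ("(abc)", "abc", "(abc", "abc)"):
    print(repr(sample), accepts(sample))
```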
6814
6815       If you were embedding this pattern in a larger one,  you  could  use  a
6816       relative reference:
6817
6818         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
6819
6820       This  makes  the  fragment independent of the parentheses in the larger
6821       pattern.
6822
6823   Checking for a used subpattern by name
6824
6825       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
6826       used  subpattern  by  name.  For compatibility with earlier versions of
6827       PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
6828       also recognized.
6829
6830       Rewriting the above example to use a named subpattern gives this:
6831
6832         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
6833
6834       If  the  name used in a condition of this kind is a duplicate, the test
6835       is applied to all subpatterns of the same name, and is true if any  one
6836       of them has matched.
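
       A named condition can be exercised in the same way. Python's re
       module does not accept the Perl forms (?<name>...) and (?(<name>)...),
       so this sketch uses the (?P<name>...) and (?(name)...) spellings; it
       is an illustration of the idea, not of PCRE's own parser.

```python
import re

# Named version of the optional-parentheses pattern.
named = re.compile(
    r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE
)

def accepts_named(text):
    """Return True when the whole of text matches the named pattern."""
    return named.fullmatch(text) is not None

print(accepts_named("(abc)"), accepts_named("abc"), accepts_named("(abc"))
```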
6837
6838   Checking for pattern recursion
6839
6840       If the condition is the string (R), and there is no subpattern with the
6841       name R, the condition is true if a recursive call to the whole  pattern
6842       or any subpattern has been made. If digits or a name preceded by amper-
6843       sand follow the letter R, for example:
6844
6845         (?(R3)...) or (?(R&name)...)
6846
6847       the condition is true if the most recent recursion is into a subpattern
6848       whose number or name is given. This condition does not check the entire
6849       recursion stack. If the name used in a condition  of  this  kind  is  a
6850       duplicate, the test is applied to all subpatterns of the same name, and
6851       is true if any one of them is the most recent recursion.
6852
6853       At "top level", all these recursion test  conditions  are  false.   The
6854       syntax for recursive patterns is described below.
6855
6856   Defining subpatterns for use by reference only
6857
6858       If  the  condition  is  the string (DEFINE), and there is no subpattern
6859       with the name DEFINE, the condition is  always  false.  In  this  case,
6860       there  may  be  only  one  alternative  in the subpattern. It is always
6861       skipped if control reaches this point  in  the  pattern;  the  idea  of
6862       DEFINE  is that it can be used to define subroutines that can be refer-
6863       enced from elsewhere. (The use of subroutines is described below.)  For
6864       example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
6865       could be written like this (ignore white space and line breaks):
6866
6867         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
6868         \b (?&byte) (\.(?&byte)){3} \b
6869
       The first part of the pattern is a DEFINE group inside which another
       group named "byte" is defined. This matches an individual component of
       an IPv4 address (a number less than 256). When matching takes place,
       this part of the pattern is skipped because DEFINE acts like a false
       condition. The rest of the pattern uses references to the named group
       to match the four dot-separated components of an IPv4 address,
       insisting on a word boundary at each end.
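
       Engines without DEFINE or subroutine calls can approximate the same
       reuse by building the pattern from a shared string. The sketch below
       does this with Python's re module. It is a textual substitution, not
       a subroutine call, so it does not reproduce the atomic-group behaviour
       of PCRE subroutines. BYTE, IPV4 and is_ipv4() are hypothetical names.

```python
import re

# Plays the role of the (?<byte> ...) definition.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# \b (?&byte) (\.(?&byte)){3} \b with each call replaced by the text
# of the definition.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

def is_ipv4(text):
    """Return True when text is exactly one dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None

for sample in ("192.168.23.245", "256.1.1.1", "1.2.3"):
    print(sample, is_ipv4(sample))
```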
6877
6878   Assertion conditions
6879
6880       If the condition is not in any of the above  formats,  it  must  be  an
6881       assertion.   This may be a positive or negative lookahead or lookbehind
6882       assertion. Consider  this  pattern,  again  containing  non-significant
6883       white space, and with the two alternatives on the second line:
6884
6885         (?(?=[^a-z]*[a-z])
6886         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
6887
6888       The  condition  is  a  positive  lookahead  assertion  that  matches an
6889       optional sequence of non-letters followed by a letter. In other  words,
6890       it  tests  for the presence of at least one letter in the subject. If a
6891       letter is found, the subject is matched against the first  alternative;
6892       otherwise  it  is  matched  against  the  second.  This pattern matches
6893       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
6894       letters and dd are digits.
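
       Where assertion conditions are not available, the same choice can be
       expressed by guarding one alternative with the assertion and the other
       with its negation. The sketch below does this in Python's re module,
       which has no (?(?=...)...) form; it is an equivalent rewriting under
       that assumption, not the PCRE syntax. HAS_LETTER and date_like are
       hypothetical names.

```python
import re

# The condition from the pattern above: is there a letter further on?
HAS_LETTER = r"[^a-z]*[a-z]"

# (?(?=A) X | Y) rewritten as (?: (?=A) X | (?!A) Y ).
date_like = re.compile(
    rf"(?:(?={HAS_LETTER})\d{{2}}-[a-z]{{3}}-\d{{2}}"
    rf"|(?!{HAS_LETTER})\d{{2}}-\d{{2}}-\d{{2}})"
)

for sample in ("12-abc-34", "12-34-56", "12-ab-34"):
    print(sample, date_like.fullmatch(sample) is not None)
```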
6895
6896
6897COMMENTS
6898
6899       There are two ways of including comments in patterns that are processed
6900       by PCRE. In both cases, the start of the comment must not be in a char-
6901       acter class, nor in the middle of any other sequence of related charac-
6902       ters such as (?: or a subpattern name or number.  The  characters  that
6903       make up a comment play no part in the pattern matching.
6904
6905       The  sequence (?# marks the start of a comment that continues up to the
6906       next closing parenthesis. Nested parentheses are not permitted. If  the
6907       PCRE_EXTENDED option is set, an unescaped # character also introduces a
6908       comment, which in this case continues to  immediately  after  the  next
6909       newline  character  or character sequence in the pattern. Which charac-
6910       ters are interpreted as newlines is controlled by the options passed to
6911       a  compiling function or by a special sequence at the start of the pat-
6912       tern, as described in the section entitled "Newline conventions" above.
6913       Note that the end of this type of comment is a literal newline sequence
6914       in the pattern; escape sequences that happen to represent a newline  do
6915       not  count.  For  example,  consider this pattern when PCRE_EXTENDED is
6916       set, and the default newline convention is in force:
6917
6918         abc #comment \n still comment
6919
6920       On encountering the # character, pcre_compile()  skips  along,  looking
6921       for  a newline in the pattern. The sequence \n is still literal at this
6922       stage, so it does not terminate the comment. Only an  actual  character
6923       with the code value 0x0a (the default newline) does so.
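
       The difference between an escape sequence and a literal newline inside
       a comment can be demonstrated as follows. The sketch uses Python's re
       module with re.VERBOSE in place of PCRE_EXTENDED, on the assumption
       that the two engines treat this case alike.

```python
import re

# (?# ... ) : an inline comment that ends at the next ")".
inline = re.compile(r"ab(?#this text is ignored)c")

# Raw string: \n is a backslash followed by "n", so it does not end the
# comment and the effective pattern is just "abc".
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: \n is a real newline character, which ends the
# comment, so "def" is part of the pattern again.
literal = re.compile("abc #comment \n def", re.VERBOSE)

print(bool(inline.fullmatch("abc")))
print(bool(escaped.fullmatch("abc")))
print(bool(literal.fullmatch("abcdef")), bool(literal.fullmatch("abc")))
```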
6924
6925
6926RECURSIVE PATTERNS
6927
6928       Consider  the problem of matching a string in parentheses, allowing for
6929       unlimited nested parentheses. Without the use of  recursion,  the  best
6930       that  can  be  done  is  to use a pattern that matches up to some fixed
6931       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
6932       depth.
6933
6934       For some time, Perl has provided a facility that allows regular expres-
6935       sions to recurse (amongst other things). It does this by  interpolating
6936       Perl  code in the expression at run time, and the code can refer to the
6937       expression itself. A Perl pattern using code interpolation to solve the
6938       parentheses problem can be created like this:
6939
6940         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
6941
6942       The (?p{...}) item interpolates Perl code at run time, and in this case
6943       refers recursively to the pattern in which it appears.
6944
6945       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
6946       it  supports  special  syntax  for recursion of the entire pattern, and
6947       also for individual subpattern recursion.  After  its  introduction  in
6948       PCRE  and  Python,  this  kind of recursion was subsequently introduced
6949       into Perl at release 5.10.
6950
6951       A special item that consists of (? followed by a  number  greater  than
6952       zero  and  a  closing parenthesis is a recursive subroutine call of the
6953       subpattern of the given number, provided that  it  occurs  inside  that
6954       subpattern.  (If  not,  it is a non-recursive subroutine call, which is
6955       described in the next section.) The special item  (?R)  or  (?0)  is  a
6956       recursive call of the entire regular expression.
6957
6958       This  PCRE  pattern  solves  the nested parentheses problem (assume the
6959       PCRE_EXTENDED option is set so that white space is ignored):
6960
6961         \( ( [^()]++ | (?R) )* \)
6962
6963       First it matches an opening parenthesis. Then it matches any number  of
6964       substrings  which  can  either  be  a sequence of non-parentheses, or a
6965       recursive match of the pattern itself (that is, a  correctly  parenthe-
6966       sized substring).  Finally there is a closing parenthesis. Note the use
6967       of a possessive quantifier to avoid backtracking into sequences of non-
6968       parentheses.
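
       What the recursive pattern does can be spelled out procedurally.
       Python's re module has no (?R), so the sketch below is a hand-written
       recursive function that mirrors the three steps just described; it is
       an illustration of the logic, not of how PCRE is implemented, and
       match_parens() is a hypothetical name.

```python
def match_parens(s, i=0):
    """Match one balanced group starting at s[i].

    Returns the index just past the closing parenthesis, or -1 when
    there is no balanced group at that position.
    """
    if i >= len(s) or s[i] != "(":
        return -1                      # \(  : opening parenthesis
    i += 1
    while i < len(s):
        if s[i] == ")":
            return i + 1               # \)  : closing parenthesis
        if s[i] == "(":
            i = match_parens(s, i)     # (?R): nested group
            if i == -1:
                return -1
        else:
            i += 1                     # [^()]++ : non-parentheses
    return -1                          # ran out of text before ")"

print(match_parens("(ab(cd)ef)"), match_parens("(a()"))
```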
6969
6970       If  this  were  part of a larger pattern, you would not want to recurse
6971       the entire pattern, so instead you could use this:
6972
6973         ( \( ( [^()]++ | (?1) )* \) )
6974
6975       We have put the pattern into parentheses, and caused the  recursion  to
6976       refer to them instead of the whole pattern.
6977
6978       In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
6979       tricky. This is made easier by the use of relative references.  Instead
6980       of (?1) in the pattern above you can write (?-2) to refer to the second
6981       most recently opened parentheses  preceding  the  recursion.  In  other
6982       words,  a  negative  number counts capturing parentheses leftwards from
6983       the point at which it is encountered.
6984
6985       It is also possible to refer to  subsequently  opened  parentheses,  by
6986       writing  references  such  as (?+2). However, these cannot be recursive
6987       because the reference is not inside the  parentheses  that  are  refer-
6988       enced.  They are always non-recursive subroutine calls, as described in
6989       the next section.
6990
6991       An alternative approach is to use named parentheses instead.  The  Perl
6992       syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
6993       supported. We could rewrite the above example as follows:
6994
6995         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
6996
6997       If there is more than one subpattern with the same name,  the  earliest
6998       one is used.
6999
7000       This  particular  example pattern that we have been looking at contains
7001       nested unlimited repeats, and so the use of a possessive quantifier for
7002       matching strings of non-parentheses is important when applying the pat-
7003       tern to strings that do not match. For example, when  this  pattern  is
7004       applied to
7005
7006         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
7007
7008       it  yields  "no  match" quickly. However, if a possessive quantifier is
7009       not used, the match runs for a very long time indeed because there  are
7010       so  many  different  ways the + and * repeats can carve up the subject,
7011       and all have to be tested before failure can be reported.
7012
7013       At the end of a match, the values of capturing  parentheses  are  those
7014       from  the outermost level. If you want to obtain intermediate values, a
7015       callout function can be used (see below and the pcrecallout  documenta-
7016       tion). If the pattern above is matched against
7017
7018         (ab(cd)ef)
7019
7020       the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
7021       which is the last value taken on at the top level. If a capturing  sub-
7022       pattern  is  not  matched at the top level, its final captured value is
7023       unset, even if it was (temporarily) set at a deeper  level  during  the
7024       matching process.
7025
7026       If  there are more than 15 capturing parentheses in a pattern, PCRE has
7027       to obtain extra memory to store data during a recursion, which it  does
7028       by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
7029       can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
7030
7031       Do not confuse the (?R) item with the condition (R),  which  tests  for
7032       recursion.   Consider  this pattern, which matches text in angle brack-
7033       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
7034       brackets  (that is, when recursing), whereas any characters are permit-
7035       ted at the outer level.
7036
7037         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
7038
7039       In this pattern, (?(R) is the start of a conditional  subpattern,  with
7040       two  different  alternatives for the recursive and non-recursive cases.
7041       The (?R) item is the actual recursive call.
7042
7043   Differences in recursion processing between PCRE and Perl
7044
7045       Recursion processing in PCRE differs from Perl in two  important  ways.
7046       In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
7047       always treated as an atomic group. That is, once it has matched some of
7048       the subject string, it is never re-entered, even if it contains untried
7049       alternatives and there is a subsequent matching failure.  This  can  be
7050       illustrated  by the following pattern, which purports to match a palin-
7051       dromic string that contains an odd number of characters  (for  example,
7052       "a", "aba", "abcba", "abcdcba"):
7053
7054         ^(.|(.)(?1)\2)$
7055
       The idea is that it either matches a single character, or two identical
       characters surrounding a sub-palindrome. In Perl, this pattern works;
       in PCRE it does not if the subject string is longer than three
       characters. Consider the subject string "abcba":
7060
7061       At the top level, the first character is matched, but as it is  not  at
7062       the end of the string, the first alternative fails; the second alterna-
7063       tive is taken and the recursion kicks in. The recursive call to subpat-
7064       tern  1  successfully  matches the next character ("b"). (Note that the
7065       beginning and end of line tests are not part of the recursion).
7066
7067       Back at the top level, the next character ("c") is compared  with  what
7068       subpattern  2 matched, which was "a". This fails. Because the recursion
7069       is treated as an atomic group, there are now  no  backtracking  points,
7070       and  so  the  entire  match fails. (Perl is able, at this point, to re-
7071       enter the recursion and try the second alternative.)  However,  if  the
7072       pattern is written with the alternatives in the other order, things are
7073       different:
7074
7075         ^((.)(?1)\2|.)$
7076
7077       This time, the recursing alternative is tried first, and  continues  to
7078       recurse  until  it runs out of characters, at which point the recursion
7079       fails. But this time we do have  another  alternative  to  try  at  the
7080       higher  level.  That  is  the  big difference: in the previous case the
7081       remaining alternative is at a deeper recursion level, which PCRE cannot
7082       use.
7083
7084       To  change  the pattern so that it matches all palindromic strings, not
7085       just those with an odd number of characters, it is tempting  to  change
7086       the pattern to this:
7087
7088         ^((.)(?1)\2|.?)$
7089
7090       Again,  this  works  in Perl, but not in PCRE, and for the same reason.
7091       When a deeper recursion has matched a single character,  it  cannot  be
7092       entered  again  in  order  to match an empty string. The solution is to
7093       separate the two cases, and write out the odd and even cases as  alter-
7094       natives at the higher level:
7095
         ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
7097
7098       If  you  want  to match typical palindromic phrases, the pattern has to
7099       ignore all non-word characters, which can be done like this:
7100
7101         ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
7102
7103       If run with the PCRE_CASELESS option, this pattern matches phrases such
7104       as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
7105       Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
7106       ing  into  sequences of non-word characters. Without this, PCRE takes a
7107       great deal longer (ten times or more) to  match  typical  phrases,  and
7108       Perl takes so long that you think it has gone into a loop.
7109
7110       WARNING:  The  palindrome-matching patterns above work only if the sub-
7111       ject string does not start with a palindrome that is shorter  than  the
7112       entire  string.  For example, although "abcba" is correctly matched, if
7113       the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,
7114       then  fails at top level because the end of the string does not follow.
7115       Once again, it cannot jump back into the recursion to try other  alter-
7116       natives, so the entire match fails.
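
       When testing palindrome patterns it helps to have an independent
       check of the intended result. The function below is a plain Python
       sketch (hypothetical name, no recursion in the regular expression)
       that ignores non-word characters and case. It also accepts subjects
       such as "ababa", which the warning above says the patterns reject.

```python
import re

def is_palindromic_phrase(text):
    """Return True when text reads the same both ways, ignoring
    non-word characters and letter case."""
    # \W+ removes everything except letters, digits and underscores.
    letters = re.sub(r"\W+", "", text).casefold()
    return letters == letters[::-1]

print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))
print(is_palindromic_phrase("ababa"))
print(is_palindromic_phrase("not a palindrome"))
```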
7117
       The second way in which PCRE and Perl differ in their recursion
       processing is in the handling of captured values. In Perl, when a
       subpattern is called recursively or as a subroutine (see the next
       section), it has no access to any values that were captured outside the
       recursion, whereas in PCRE these values can be referenced. Consider
       this pattern:
7124
7125         ^(.)(\1|a(?2))
7126
7127       In PCRE, this pattern matches "bab". The  first  capturing  parentheses
7128       match  "b",  then in the second group, when the back reference \1 fails
7129       to match "b", the second alternative matches "a" and then recurses.  In
7130       the  recursion,  \1 does now match "b" and so the whole match succeeds.
7131       In Perl, the pattern fails to match because inside the  recursive  call
7132       \1 cannot access the externally set value.
7133
7134
7135SUBPATTERNS AS SUBROUTINES
7136
7137       If  the  syntax for a recursive subpattern call (either by number or by
7138       name) is used outside the parentheses to which it refers,  it  operates
7139       like  a subroutine in a programming language. The called subpattern may
7140       be defined before or after the reference. A numbered reference  can  be
7141       absolute or relative, as in these examples:
7142
7143         (...(absolute)...)...(?2)...
7144         (...(relative)...)...(?-1)...
7145         (...(?+1)...(relative)...
7146
7147       An earlier example pointed out that the pattern
7148
7149         (sens|respons)e and \1ibility
7150
7151       matches  "sense and sensibility" and "response and responsibility", but
7152       not "sense and responsibility". If instead the pattern
7153
7154         (sens|respons)e and (?1)ibility
7155
7156       is used, it does match "sense and responsibility" as well as the  other
7157       two  strings.  Another  example  is  given  in the discussion of DEFINE
7158       above.
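
       The contrast between a back reference and a subroutine call can be
       approximated in an engine that lacks (?1). In the sketch below, which
       uses Python's re module, the subroutine call is imitated by repeating
       the text of the group. That reproduces which strings match, though
       not the atomic-group behaviour described next. STEM and the pattern
       names are hypothetical.

```python
import re

STEM = r"sens|respons"

# \1 must repeat the text that group 1 actually matched.
same_text = re.compile(rf"({STEM})e and \1ibility")

# Stand-in for (?1): the group's pattern is applied afresh, so either
# alternative may be chosen the second time.
same_pattern = re.compile(rf"({STEM})e and (?:{STEM})ibility")

for phrase in ("sense and sensibility",
               "response and responsibility",
               "sense and responsibility"):
    print(phrase,
          bool(same_text.fullmatch(phrase)),
          bool(same_pattern.fullmatch(phrase)))
```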
7159
7160       All subroutine calls, whether recursive or not, are always  treated  as
7161       atomic  groups. That is, once a subroutine has matched some of the sub-
7162       ject string, it is never re-entered, even if it contains untried alter-
7163       natives  and  there  is  a  subsequent  matching failure. Any capturing
7164       parentheses that are set during the subroutine  call  revert  to  their
7165       previous values afterwards.
7166
7167       Processing  options  such as case-independence are fixed when a subpat-
7168       tern is defined, so if it is used as a subroutine, such options  cannot
7169       be changed for different calls. For example, consider this pattern:
7170
7171         (abc)(?i:(?-1))
7172
7173       It  matches  "abcabc". It does not match "abcABC" because the change of
7174       processing option does not affect the called subpattern.
7175
7176
7177ONIGURUMA SUBROUTINE SYNTAX
7178
       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes is
       an alternative syntax for referencing a subpattern as a subroutine,
       possibly recursively. Here are two of the examples used above,
       rewritten using this syntax:
7184
7185         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
7186         (sens|respons)e and \g'1'ibility
7187
7188       PCRE supports an extension to Oniguruma: if a number is preceded  by  a
7189       plus or a minus sign it is taken as a relative reference. For example:
7190
7191         (abc)(?i:\g<-1>)
7192
7193       Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
7194       synonymous. The former is a back reference; the latter is a  subroutine
7195       call.
7196
7197
7198CALLOUTS
7199
7200       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
7201       Perl code to be obeyed in the middle of matching a regular  expression.
7202       This makes it possible, amongst other things, to extract different sub-
7203       strings that match the same pair of parentheses when there is a repeti-
7204       tion.
7205
7206       PCRE provides a similar feature, but of course it cannot obey arbitrary
7207       Perl code. The feature is called "callout". The caller of PCRE provides
7208       an  external function by putting its entry point in the global variable
7209       pcre_callout (8-bit library) or pcre[16|32]_callout (16-bit  or  32-bit
7210       library).   By default, this variable contains NULL, which disables all
7211       calling out.
7212
7213       Within a regular expression, (?C) indicates the  points  at  which  the
7214       external  function  is  to be called. If you want to identify different
7215       callout points, you can put a number less than 256 after the letter  C.
7216       The  default  value is zero.  For example, this pattern has two callout
7217       points:
7218
7219         (?C1)abc(?C2)def
7220
7221       If the PCRE_AUTO_CALLOUT flag is passed to a compiling function,  call-
7222       outs  are automatically installed before each item in the pattern. They
7223       are all numbered 255. If there is a conditional group  in  the  pattern
7224       whose condition is an assertion, an additional callout is inserted just
7225       before the condition. An explicit callout may also be set at this posi-
7226       tion, as in this example:
7227
7228         (?(?C9)(?=a)abc|def)
7229
7230       Note that this applies only to assertion conditions, not to other types
7231       of condition.
7232
7233       During matching, when PCRE reaches a callout point, the external  func-
7234       tion  is  called.  It  is  provided with the number of the callout, the
7235       position in the pattern, and, optionally, one item of  data  originally
7236       supplied  by  the caller of the matching function. The callout function
7237       may cause matching to proceed, to backtrack, or to fail altogether.
7238
7239       By default, PCRE implements a number of optimizations at  compile  time
7240       and  matching  time, and one side-effect is that sometimes callouts are
7241       skipped. If you need all possible callouts to happen, you need  to  set
7242       options  that  disable  the relevant optimizations. More details, and a
7243       complete description of the interface  to  the  callout  function,  are
7244       given in the pcrecallout documentation.
7245
7246
7247BACKTRACKING CONTROL
7248
7249       Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
7250       which are still described in the Perl  documentation  as  "experimental
7251       and  subject to change or removal in a future version of Perl". It goes
7252       on to say: "Their usage in production code should  be  noted  to  avoid
7253       problems  during upgrades." The same remarks apply to the PCRE features
7254       described in this section.
7255
7256       The new verbs make use of what was previously invalid syntax: an  open-
7257       ing parenthesis followed by an asterisk. They are generally of the form
7258       (*VERB) or (*VERB:NAME). Some may take either form,  possibly  behaving
7259       differently  depending  on  whether or not a name is present. A name is
7260       any sequence of characters that does not include a closing parenthesis.
7261       The maximum length of name is 255 in the 8-bit library and 65535 in the
7262       16-bit and 32-bit libraries. If the name is  empty,  that  is,  if  the
7263       closing  parenthesis immediately follows the colon, the effect is as if
7264       the colon were not there.  Any number of these verbs  may  occur  in  a
7265       pattern.
7266
7267       Since  these  verbs  are  specifically related to backtracking, most of
7268       them can be used only when the pattern is to be matched  using  one  of
7269       the  traditional  matching  functions, because these use a backtracking
7270       algorithm. With the exception of (*FAIL), which behaves like a  failing
7271       negative  assertion,  the  backtracking control verbs cause an error if
7272       encountered by a DFA matching function.
7273
7274       The behaviour of these verbs in repeated  groups,  assertions,  and  in
7275       subpatterns called as subroutines (whether or not recursively) is docu-
7276       mented below.
7277
7278   Optimizations that affect backtracking verbs
7279
7280       PCRE contains some optimizations that are used to speed up matching  by
7281       running some checks at the start of each match attempt. For example, it
7282       may know the minimum length of matching subject, or that  a  particular
7283       character must be present. When one of these optimizations bypasses the
7284       running of a match,  any  included  backtracking  verbs  will  not,  of
7285       course, be processed. You can suppress the start-of-match optimizations
7286       by setting the PCRE_NO_START_OPTIMIZE  option  when  calling  pcre_com-
7287       pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
7288       There is more discussion of this option in the section entitled "Option
7289       bits for pcre_exec()" in the pcreapi documentation.
7290
7291       Experiments  with  Perl  suggest that it too has similar optimizations,
7292       sometimes leading to anomalous results.
7293
7294   Verbs that act immediately
7295
7296       The following verbs act as soon as they are encountered. They  may  not
7297       be followed by a name.
7298
7299          (*ACCEPT)
7300
7301       This  verb causes the match to end successfully, skipping the remainder
7302       of the pattern. However, when it is inside a subpattern that is  called
7303       as  a  subroutine, only that subpattern is ended successfully. Matching
7304       then continues at the outer level. If (*ACCEPT) is triggered in a
7305       positive assertion, the assertion succeeds; in a negative assertion, the
7306       assertion fails.
7307
7308       If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap-
7309       tured. For example:
7310
7311         A((?:A|B(*ACCEPT)|C)D)
7312
7313       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
7314       tured by the outer parentheses.
7315
7316         (*FAIL) or (*F)
7317
7318       This verb causes a matching failure, forcing backtracking to occur.  It
7319       is  equivalent to (?!) but easier to read. The Perl documentation notes
7320       that it is probably useful only when combined  with  (?{})  or  (??{}).
7321       Those  are,  of course, Perl features that are not present in PCRE. The
7322       nearest equivalent is the callout feature, as for example in this  pat-
7323       tern:
7324
7325         a+(?C)(*FAIL)
7326
7327       A  match  with the string "aaaa" always fails, but the callout is taken
7328       before each backtrack happens (in this example, 10 times).
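
       The figure of 10 can be checked by hand. With the subject "aaaa", a+
       can match 4, 3, 2, or 1 characters from the first starting point, then
       3, 2, or 1 from the second, and so on, and the callout is reached once
       for each of these. The Python sketch below is only a toy model of that
       arithmetic: it does not call PCRE, the name count_callouts is made up
       for illustration, and it assumes that no start-of-match optimization
       skips a match attempt.

```python
def count_callouts(subject):
    """Toy model of a+(?C)(*FAIL): count how often the callout is reached."""
    calls = 0
    for start in range(len(subject)):
        # Length of the run of "a" characters beginning at this start.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Greedy a+ first takes the whole run, then gives back one
        # character per backtrack; each length reaches (?C) once before
        # (*FAIL) forces the next backtrack.
        for _length in range(run, 0, -1):
            calls += 1
    return calls


print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1 = 10
```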
7329
7330   Recording which path was taken
7331
7332       There is one verb whose main purpose  is  to  track  how  a  match  was
7333       arrived  at,  though  it  also  has a secondary use in conjunction with
7334       advancing the match starting point (see (*SKIP) below).
7335
7336         (*MARK:NAME) or (*:NAME)
7337
7338       A name is always  required  with  this  verb.  There  may  be  as  many
7339       instances  of  (*MARK) as you like in a pattern, and their names do not
7340       have to be unique.
7341
7342       When a match succeeds, the name of the  last-encountered  (*MARK:NAME),
7343       (*PRUNE:NAME),  or  (*THEN:NAME) on the matching path is passed back to
7344       the caller as  described  in  the  section  entitled  "Extra  data  for
7345       pcre_exec()"  in  the  pcreapi  documentation.  Here  is  an example of
7346       pcretest output, where the /K modifier requests the retrieval and  out-
7347       putting of (*MARK) data:
7348
7349           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
7350         data> XY
7351          0: XY
7352         MK: A
7353         XZ
7354          0: XZ
7355         MK: B
7356
7357       The (*MARK) name is tagged with "MK:" in this output, and in this exam-
7358       ple it indicates which of the two alternatives matched. This is a  more
7359       efficient  way of obtaining this information than putting each alterna-
7360       tive in its own capturing parentheses.
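
       For comparison, the capturing-parentheses approach mentioned above can
       be sketched in Python. This uses Python's standard re module purely as
       a stand-in (it has no (*MARK) verb); the helper name which_alternative
       is an assumption made for this illustration.

```python
import re

# Each alternative sits in its own capturing group; the group that took
# part in the match tells the caller which alternative succeeded.
pattern = re.compile(r"(XY)|(XZ)")


def which_alternative(subject):
    """Return 1 or 2 for the alternative that matched, or None."""
    m = pattern.match(subject)
    return m.lastindex if m else None


print(which_alternative("XY"))  # 1
print(which_alternative("XZ"))  # 2
```

       The cost of this approach is one capturing group per alternative,
       which is what (*MARK) avoids.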
7361
7362       If a verb with a name is encountered in a positive  assertion  that  is
7363       true,  the  name  is recorded and passed back if it is the last-encoun-
7364       tered. This does not happen for negative assertions or failing positive
7365       assertions.
7366
7367       After  a  partial match or a failed match, the last encountered name in
7368       the entire match process is returned. For example:
7369
7370           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
7371         data> XP
7372         No match, mark = B
7373
7374       Note that in this unanchored example the  mark  is  retained  from  the
7375       match attempt that started at the letter "X" in the subject. Subsequent
7376       match attempts starting at "P" and then with an empty string do not get
7377       as far as the (*MARK) item, but nevertheless do not reset it.
7378
7379       If  you  are  interested  in  (*MARK)  values after failed matches, you
7380       should probably set the PCRE_NO_START_OPTIMIZE option  (see  above)  to
7381       ensure that the match is always attempted.
7382
7383   Verbs that act after backtracking
7384
7385       The following verbs do nothing when they are encountered. Matching con-
7386       tinues with what follows, but if there is no subsequent match,  causing
7387       a  backtrack  to  the  verb, a failure is forced. That is, backtracking
7388       cannot pass to the left of the verb. However, when one of  these  verbs
7389       appears inside an atomic group or an assertion that is true, its effect
7390       is confined to that group, because once the  group  has  been  matched,
7391       there is never any backtracking into it. In this situation, backtracking
7392       can "jump back" to the left of the entire atomic group or assertion.
7393       (Remember, as stated above, that this localization also applies in
7394       subroutine calls.)
7395
7396       These verbs differ in exactly what kind of failure  occurs  when  back-
7397       tracking  reaches  them.  The behaviour described below is what happens
7398       when the verb is not in a subroutine or an assertion.  Subsequent  sec-
7399       tions cover these special cases.
7400
7401         (*COMMIT)
7402
7403       This  verb, which may not be followed by a name, causes the whole match
7404       to fail outright if there is a later matching failure that causes back-
7405       tracking  to  reach  it.  Even if the pattern is unanchored, no further
7406       attempts to find a match by advancing the starting point take place. If
7407       (*COMMIT)  is  the  only backtracking verb that is encountered, once it
7408       has been passed pcre_exec() is committed to finding a match at the cur-
7409       rent starting point, or not at all. For example:
7410
7411         a+(*COMMIT)b
7412
7413       This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
7414       of dynamic anchor, or "I've started, so I must finish." The name of the
7415       most  recently passed (*MARK) in the path is passed back when (*COMMIT)
7416       forces a match failure.
7417
7418       If there is more than one backtracking verb in a pattern,  a  different
7419       one  that  follows  (*COMMIT) may be triggered first, so merely passing
7420       (*COMMIT) during a match does not always guarantee that a match must be
7421       at this starting point.
7422
7423       Note  that  (*COMMIT)  at  the start of a pattern is not the same as an
7424       anchor, unless PCRE's start-of-match optimizations are turned  off,  as
7425       shown in this output from pcretest:
7426
7427           re> /(*COMMIT)abc/
7428         data> xyzabc
7429          0: abc
7430         data> xyzabc\Y
7431         No match
7432
7433       For this pattern, PCRE knows that any match must start with "a", so the
7434       optimization skips along the subject to "a" before applying the pattern
7435       to  the first set of data. The match attempt then succeeds. In the sec-
7436       ond set of data, the escape sequence \Y is interpreted by the  pcretest
7437       program.  It  causes  the  PCRE_NO_START_OPTIMIZE option to be set when
7438       pcre_exec() is called.  This disables the optimization that skips along
7439       to the first character. The pattern is now applied starting at "x", and
7440       so the (*COMMIT) causes the match to  fail  without  trying  any  other
7441       starting points.
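
       The two pcretest results can be reproduced with a small Python model.
       This is a sketch of the behaviour described above, not PCRE code: the
       function search_commit_abc and its start_optimize parameter are
       hypothetical names, and only the "first character must be a"
       optimization is modelled.

```python
def search_commit_abc(subject, start_optimize=True):
    """Toy model of /(*COMMIT)abc/ with and without the first-character
    start-of-match optimization."""
    start = 0
    if start_optimize:
        # The optimization skips along the subject to the first "a".
        start = subject.find("a")
        if start == -1:
            return None
    # (*COMMIT) is passed at once, so this is the only starting point
    # that is ever tried: either "abc" matches here or the match fails.
    if subject.startswith("abc", start):
        return subject[start:start + 3]
    return None


print(search_commit_abc("xyzabc"))                        # abc
print(search_commit_abc("xyzabc", start_optimize=False))  # None
```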
7442
7443         (*PRUNE) or (*PRUNE:NAME)
7444
7445       This  verb causes the match to fail at the current starting position in
7446       the subject if there is a later matching failure that causes backtrack-
7447       ing  to  reach it. If the pattern is unanchored, the normal "bumpalong"
7448       advance to the next starting character then happens.  Backtracking  can
7449       occur  as  usual to the left of (*PRUNE), before it is reached, or when
7450       matching to the right of (*PRUNE), but if there  is  no  match  to  the
7451       right,  backtracking cannot cross (*PRUNE). In simple cases, the use of
7452       (*PRUNE) is just an alternative to an atomic group or possessive  quan-
7453       tifier, but there are some uses of (*PRUNE) that cannot be expressed in
7454       any other way. In an anchored pattern (*PRUNE) has the same  effect  as
7455       (*COMMIT).
7456
7457       The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
7458       It is like (*MARK:NAME) in that the name is remembered for passing back
7459       to the caller. However, (*SKIP:NAME) searches only for names set with
7460       (*MARK), so it ignores a name set by (*PRUNE:NAME).
7461
7462         (*SKIP)
7463
7464       This verb, when given without a name, is like (*PRUNE), except that  if
7465       the  pattern  is unanchored, the "bumpalong" advance is not to the next
7466       character, but to the position in the subject where (*SKIP) was encoun-
7467       tered.  (*SKIP)  signifies that whatever text was matched leading up to
7468       it cannot be part of a successful match. Consider:
7469
7470         a+(*SKIP)b
7471
7472       If the subject is "aaaac...",  after  the  first  match  attempt  fails
7473       (starting  at  the  first  character in the string), the starting point
7474       skips on to start the next attempt at "c". Note that a possessive
7475       quantifier does not have the same effect as this example; although it would
7476       suppress backtracking  during  the  first  match  attempt,  the  second
7477       attempt  would  start at the second character instead of skipping on to
7478       "c".
7479
7480         (*SKIP:NAME)
7481
7482       When (*SKIP) has an associated name, its behaviour is modified. When it
7483       is triggered, the previous path through the pattern is searched for the
7484       most recent (*MARK) that has the  same  name.  If  one  is  found,  the
7485       "bumpalong" advance is to the subject position that corresponds to that
7486       (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
7487       a matching name is found, the (*SKIP) is ignored.
7488
7489       Note  that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
7490       ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
7491
7492         (*THEN) or (*THEN:NAME)
7493
7494       This verb causes a skip to the next innermost  alternative  when  back-
7495       tracking  reaches  it.  That  is,  it  cancels any further backtracking
7496       within the current alternative. Its name  comes  from  the  observation
7497       that it can be used for a pattern-based if-then-else block:
7498
7499         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
7500
7501       If  the COND1 pattern matches, FOO is tried (and possibly further items
7502       after the end of the group if FOO succeeds); on  failure,  the  matcher
7503       skips  to  the second alternative and tries COND2, without backtracking
7504       into COND1. If that succeeds and BAR fails, COND3 is tried.  If  subse-
7505       quently  BAZ fails, there are no more alternatives, so there is a back-
7506       track to whatever came before the  entire  group.  If  (*THEN)  is  not
7507       inside an alternation, it acts like (*PRUNE).
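
       The if-then-else reading can be sketched as a toy matcher in Python.
       Everything here is an assumption made for illustration (the helpers
       literal, greedy_run and match_group are not part of PCRE): each
       alternative is a (COND, BODY) pair, and the use_then flag decides
       whether a failure in BODY may backtrack into COND or must move on to
       the next alternative.

```python
def literal(text):
    """Matcher for a fixed string: yields the end position if it matches."""
    def matcher(subject, pos):
        if subject.startswith(text, pos):
            yield pos + len(text)
    return matcher


def greedy_run(ch):
    """Matcher for ch+ : yields the longest run first, then shorter ones."""
    def matcher(subject, pos):
        end = pos
        while end < len(subject) and subject[end] == ch:
            end += 1
        for stop in range(end, pos, -1):
            yield stop
    return matcher


def match_group(branches, subject, use_then):
    """Try (COND, BODY) alternatives in order at position 0.

    Returns (alternative_index, end) for the first success, else None.
    """
    for index, (cond, body) in enumerate(branches):
        for mid in cond(subject, 0):
            for end in body(subject, mid):
                return index, end
            if use_then:
                # Backtracking reached (*THEN): give up this alternative
                # instead of retrying COND with a shorter match.
                break
    return None


# Models ( a+ (*THEN) ab | aa (*THEN) b ) against "aab".
branches = [(greedy_run("a"), literal("ab")), (literal("aa"), literal("b"))]
print(match_group(branches, "aab", use_then=False))  # (0, 3)
print(match_group(branches, "aab", use_then=True))   # (1, 3)
```

       Without the verb, the first alternative succeeds after a+ gives back
       one character; with it, the failure of "ab" sends the matcher straight
       to the second alternative.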
7508
7509       The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
7510       It is like (*MARK:NAME) in that the name is remembered for passing back
7511       to the caller. However, (*SKIP:NAME) searches only for names set with
7512       (*MARK), so it ignores a name set by (*THEN:NAME).
7513
7514       A subpattern that does not contain a | character is just a part of  the
7515       enclosing  alternative;  it  is  not a nested alternation with only one
7516       alternative. The effect of (*THEN) extends beyond such a subpattern  to
7517       the  enclosing alternative. Consider this pattern, where A, B, etc. are
7518       complex pattern fragments that do not contain any | characters at  this
7519       level:
7520
7521         A (B(*THEN)C) | D
7522
7523       If  A and B are matched, but there is a failure in C, matching does not
7524       backtrack into A; instead it moves to the next alternative, that is, D.
7525       However,  if the subpattern containing (*THEN) is given an alternative,
7526       it behaves differently:
7527
7528         A (B(*THEN)C | (*FAIL)) | D
7529
7530       The effect of (*THEN) is now confined to the inner subpattern. After  a
7531       failure in C, matching moves to (*FAIL), which causes the whole subpat-
7532       tern to fail because there are no more alternatives  to  try.  In  this
7533       case, matching does now backtrack into A.
7534
7535       Note  that  a  conditional  subpattern  is not considered as having two
7536       alternatives, because only one is ever used.  In  other  words,  the  |
7537       character in a conditional subpattern has a different meaning. Ignoring
7538       white space, consider:
7539
7540         ^.*? (?(?=a) a | b(*THEN)c )
7541
7542       If the subject is "ba", this pattern does not  match.  Because  .*?  is
7543       ungreedy,  it  initially  matches  zero characters. The condition (?=a)
7544       then fails, the character "b" is matched,  but  "c"  is  not.  At  this
7545       point,  matching does not backtrack to .*? as might perhaps be expected
7546       from the presence of the | character.  The  conditional  subpattern  is
7547       part of the single alternative that comprises the whole pattern, and so
7548       the match fails. (If there was a backtrack into  .*?,  allowing  it  to
7549       match "b", the match would succeed.)
7550
7551       The  verbs just described provide four different "strengths" of control
7552       when subsequent matching fails. (*THEN) is the weakest, carrying on the
7553       match  at  the next alternative. (*PRUNE) comes next, failing the match
7554       at the current starting position, but allowing an advance to  the  next
7555       character  (for an unanchored pattern). (*SKIP) is similar, except that
7556       the advance may be more than one character. (*COMMIT) is the strongest,
7557       causing the entire match to fail.
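
       The difference in strength shows up in which starting points are
       tried. The following Python sketch is a toy model of the unanchored
       pattern a+(*VERB)b, written for this illustration only (the function
       search is hypothetical and does not call PCRE). (*THEN) is left out
       because this pattern has no alternation, in which case it acts like
       (*PRUNE).

```python
def search(subject, verb=None):
    """Toy model of the unanchored pattern a+(*VERB)b.

    verb is None (no verb), "PRUNE", "SKIP", or "COMMIT".
    Returns (matched_text_or_None, list_of_starting_points_tried).
    """
    starts = []
    start = 0
    while start < len(subject):
        starts.append(start)
        # Measure the run of "a" characters that a+ can consume here.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        next_start = start + 1              # normal "bumpalong"
        for length in range(run, 0, -1):    # greedy, then backtrack
            pos = start + length            # where the verb is passed
            if subject.startswith("b", pos):
                return subject[start:pos + 1], starts
            # "b" failed, so backtracking reaches the verb (if any).
            if verb == "COMMIT":
                return None, starts         # whole match fails outright
            if verb == "SKIP":
                next_start = pos            # resume where (*SKIP) was met
            if verb in ("PRUNE", "SKIP"):
                break                       # fail at this starting point
        start = next_start
    return None, starts


print(search("aaaacab"))            # ('ab', [0, 1, 2, 3, 4, 5])
print(search("aaaacab", "SKIP"))    # ('ab', [0, 4, 5])
print(search("aaaacab", "COMMIT"))  # (None, [0])
```

       With (*PRUNE) the starting points are the same as with no verb; what
       changes is that each attempt gives up after the first failure instead
       of retrying shorter runs of "a".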
7558
7559   More than one backtracking verb
7560
7561       If  more  than  one  backtracking verb is present in a pattern, the one
7562       that is backtracked onto first acts. For example,  consider  this  pat-
7563       tern, where A, B, etc. are complex pattern fragments:
7564
7565         (A(*COMMIT)B(*THEN)C|ABD)
7566
7567       If  A matches but B fails, the backtrack to (*COMMIT) causes the entire
7568       match to fail. However, if A and B match, but C fails, the backtrack to
7569       (*THEN)  causes  the next alternative (ABD) to be tried. This behaviour
7570       is consistent, but is not always the same as Perl's. It means  that  if
7571       two or more backtracking verbs appear in succession, all but the last
7572       of them have no effect. Consider this example:
7573
7574         ...(*COMMIT)(*PRUNE)...
7575
7576       If there is a matching failure to the right, backtracking onto (*PRUNE)
7577       causes  it to be triggered, and its action is taken. There can never be
7578       a backtrack onto (*COMMIT).
7579
7580   Backtracking verbs in repeated groups
7581
7582       PCRE differs from  Perl  in  its  handling  of  backtracking  verbs  in
7583       repeated groups. For example, consider:
7584
7585         /(a(*COMMIT)b)+ac/
7586
7587       If  the  subject  is  "abac",  Perl matches, but PCRE fails because the
7588       (*COMMIT) in the second repeat of the group acts.
7589
7590   Backtracking verbs in assertions
7591
7592       (*FAIL) in an assertion has its normal effect: it forces  an  immediate
7593       backtrack.
7594
7595       (*ACCEPT) in a positive assertion causes the assertion to succeed with-
7596       out any further processing. In a negative assertion,  (*ACCEPT)  causes
7597       the assertion to fail without any further processing.
7598
7599       The  other  backtracking verbs are not treated specially if they appear
7600       in a positive assertion. In  particular,  (*THEN)  skips  to  the  next
7601       alternative  in  the  innermost  enclosing group that has alternations,
7602       whether or not this is within the assertion.
7603
7604       Negative assertions are, however, different, in order  to  ensure  that
7605       changing  a  positive  assertion  into a negative assertion changes its
7606       result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg-
7607       ative assertion to be true, without considering any further alternative
7608       branches in the assertion.  Backtracking into (*THEN) causes it to skip
7609       to  the next enclosing alternative within the assertion (the normal be-
7610       haviour), but if the assertion  does  not  have  such  an  alternative,
7611       (*THEN) behaves like (*PRUNE).
7612
7613   Backtracking verbs in subroutines
7614
7615       These  behaviours  occur whether or not the subpattern is called recur-
7616       sively.  Perl's treatment of subroutines is different in some cases.
7617
7618       (*FAIL) in a subpattern called as a subroutine has its  normal  effect:
7619       it forces an immediate backtrack.
7620
7621       (*ACCEPT)  in a subpattern called as a subroutine causes the subroutine
7622       match to succeed without any further processing. Matching then  contin-
7623       ues after the subroutine call.
7624
7625       (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine
7626       cause the subroutine match to fail.
7627
7628       (*THEN) skips to the next alternative in the innermost enclosing  group
7629       within  the subpattern that has alternatives. If there is no such group
7630       within the subpattern, (*THEN) causes the subroutine match to fail.
7631
7632
7633SEE ALSO
7634
7635       pcreapi(3), pcrecallout(3),  pcrematching(3),  pcresyntax(3),  pcre(3),
7636       pcre16(3), pcre32(3).
7637
7638
7639AUTHOR
7640
7641       Philip Hazel
7642       University Computing Service
7643       Cambridge CB2 3QH, England.
7644
7645
7646REVISION
7647
7648       Last updated: 08 January 2014
7649       Copyright (c) 1997-2014 University of Cambridge.
7650------------------------------------------------------------------------------
7651
7652
7653PCRESYNTAX(3)              Library Functions Manual              PCRESYNTAX(3)
7654
7655
7656
7657NAME
7658       PCRE - Perl-compatible regular expressions
7659
7660PCRE REGULAR EXPRESSION SYNTAX SUMMARY
7661
7662       The  full syntax and semantics of the regular expressions that are sup-
7663       ported by PCRE are described in  the  pcrepattern  documentation.  This
7664       document contains a quick-reference summary of the syntax.
7665
7666
7667QUOTING
7668
7669         \x         where x is non-alphanumeric is a literal x
7670         \Q...\E    treat enclosed characters as literal
7671
7672
7673CHARACTERS
7674
7675         \a         alarm, that is, the BEL character (hex 07)
7676         \cx        "control-x", where x is any ASCII character
7677         \e         escape (hex 1B)
7678         \f         form feed (hex 0C)
7679         \n         newline (hex 0A)
7680         \r         carriage return (hex 0D)
7681         \t         tab (hex 09)
7682         \0dd       character with octal code 0dd
7683         \ddd       character with octal code ddd, or backreference
7684         \o{ddd..}  character with octal code ddd..
7685         \xhh       character with hex code hh
7686         \x{hhh..}  character with hex code hhh..
7687
7688       Note that \0dd is always an octal code, and that \8 and \9 are the lit-
7689       eral characters "8" and "9".
7690
7691
7692CHARACTER TYPES
7693
7694         .          any character except newline;
7695                      in dotall mode, any character whatsoever
7696         \C         one data unit, even in UTF mode (best avoided)
7697         \d         a decimal digit
7698         \D         a character that is not a decimal digit
7699         \h         a horizontal white space character
7700         \H         a character that is not a horizontal white space character
7701         \N         a character that is not a newline
7702         \p{xx}     a character with the xx property
7703         \P{xx}     a character without the xx property
7704         \R         a newline sequence
7705         \s         a white space character
7706         \S         a character that is not a white space character
7707         \v         a vertical white space character
7708         \V         a character that is not a vertical white space character
7709         \w         a "word" character
7710         \W         a "non-word" character
7711         \X         a Unicode extended grapheme cluster
7712
7713       By default, \d, \s, and \w match only ASCII characters, even  in  UTF-8
7714       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
7715       matching is happening, \s and \w may also match characters with
7716       code  points  in  the range 128-255. If the PCRE_UCP option is set, the
7717       behaviour of these escape sequences is changed to use  Unicode  proper-
7718       ties and they match many more characters.
7719
7720
7721GENERAL CATEGORY PROPERTIES FOR \p and \P
7722
7723         C          Other
7724         Cc         Control
7725         Cf         Format
7726         Cn         Unassigned
7727         Co         Private use
7728         Cs         Surrogate
7729
7730         L          Letter
7731         Ll         Lower case letter
7732         Lm         Modifier letter
7733         Lo         Other letter
7734         Lt         Title case letter
7735         Lu         Upper case letter
7736         L&         Ll, Lu, or Lt
7737
7738         M          Mark
7739         Mc         Spacing mark
7740         Me         Enclosing mark
7741         Mn         Non-spacing mark
7742
7743         N          Number
7744         Nd         Decimal number
7745         Nl         Letter number
7746         No         Other number
7747
7748         P          Punctuation
7749         Pc         Connector punctuation
7750         Pd         Dash punctuation
7751         Pe         Close punctuation
7752         Pf         Final punctuation
7753         Pi         Initial punctuation
7754         Po         Other punctuation
7755         Ps         Open punctuation
7756
7757         S          Symbol
7758         Sc         Currency symbol
7759         Sk         Modifier symbol
7760         Sm         Mathematical symbol
7761         So         Other symbol
7762
7763         Z          Separator
7764         Zl         Line separator
7765         Zp         Paragraph separator
7766         Zs         Space separator
7767
7768
7769PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P
7770
7771         Xan        Alphanumeric: union of properties L and N
7772         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
7773         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
7774         Xuc        Universally-named character: one that can be
7775                      represented by a Universal Character Name
7776         Xwd        Perl word: property Xan or underscore
7777
7778       Perl and POSIX space are now the same. Perl added VT to its space char-
7779       acter set at release 5.18 and PCRE changed at release 8.34.
7780
7781
7782SCRIPT NAMES FOR \p AND \P
7783
7784       Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak,  Bengali,
7785       Bopomofo,  Brahmi,  Braille, Buginese, Buhid, Canadian_Aboriginal, Car-
7786       ian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cunei-
7787       form, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hiero-
7788       glyphs,  Elbasan,  Ethiopic,  Georgian,  Glagolitic,  Gothic,  Grantha,
7789       Greek,  Gujarati,  Gurmukhi,  Han,  Hangul,  Hanunoo, Hebrew, Hiragana,
7790       Imperial_Aramaic,    Inherited,     Inscriptional_Pahlavi,     Inscrip-
7791       tional_Parthian,   Javanese,   Kaithi,   Kannada,  Katakana,  Kayah_Li,
7792       Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha,  Limbu,  Lin-
7793       ear_A,  Linear_B,  Lisu,  Lycian, Lydian, Mahajani, Malayalam, Mandaic,
7794       Manichaean,     Meetei_Mayek,     Mende_Kikakui,      Meroitic_Cursive,
7795       Meroitic_Hieroglyphs,  Miao,  Modi, Mongolian, Mro, Myanmar, Nabataean,
7796       New_Tai_Lue,  Nko,  Ogham,  Ol_Chiki,  Old_Italic,   Old_North_Arabian,
7797       Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya,
7798       Pahawh_Hmong,    Palmyrene,    Pau_Cin_Hau,    Phags_Pa,    Phoenician,
7799       Psalter_Pahlavi,  Rejang,  Runic,  Samaritan, Saurashtra, Sharada, Sha-
7800       vian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri,  Syriac,
7801       Tagalog,  Tagbanwa,  Tai_Le,  Tai_Tham, Tai_Viet, Takri, Tamil, Telugu,
7802       Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic,  Vai,  Warang_Citi,
7803       Yi.
7804
7805
7806CHARACTER CLASSES
7807
7808         [...]       positive character class
7809         [^...]      negative character class
7810         [x-y]       range (can be used for hex characters)
7811         [[:xxx:]]   positive POSIX named set
7812         [[:^xxx:]]  negative POSIX named set
7813
7814         alnum       alphanumeric
7815         alpha       alphabetic
7816         ascii       0-127
7817         blank       space or tab
7818         cntrl       control character
7819         digit       decimal digit
7820         graph       printing, excluding space
7821         lower       lower case letter
7822         print       printing, including space
7823         punct       printing, excluding alphanumeric
7824         space       white space
7825         upper       upper case letter
7826         word        same as \w
7827         xdigit      hexadecimal digit
7828
7829       In  PCRE,  POSIX character set names recognize only ASCII characters by
7830       default, but some of them use Unicode properties if  PCRE_UCP  is  set.
7831       You can use \Q...\E inside a character class.
7832
7833
7834QUANTIFIERS
7835
7836         ?           0 or 1, greedy
7837         ?+          0 or 1, possessive
7838         ??          0 or 1, lazy
7839         *           0 or more, greedy
7840         *+          0 or more, possessive
7841         *?          0 or more, lazy
7842         +           1 or more, greedy
7843         ++          1 or more, possessive
7844         +?          1 or more, lazy
7845         {n}         exactly n
7846         {n,m}       at least n, no more than m, greedy
7847         {n,m}+      at least n, no more than m, possessive
7848         {n,m}?      at least n, no more than m, lazy
7849         {n,}        n or more, greedy
7850         {n,}+       n or more, possessive
7851         {n,}?       n or more, lazy
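
       The difference between greedy and lazy quantifiers can be tried out
       with Python's standard re module, used here only as a stand-in whose
       syntax for these forms agrees with the table above. The possessive
       forms are not shown because Python accepts them only from version
       3.11.

```python
import re

subject = "aaaaa"

# {n,m} is greedy: it takes as many repetitions as allowed.
print(re.match(r"a{2,4}", subject).group())    # aaaa
# {n,m}? is lazy: it takes as few repetitions as allowed.
print(re.match(r"a{2,4}?", subject).group())   # aa
# Laziness only changes the order of attempts; the rest of the
# pattern can still force a longer match.
print(re.match(r"a+?b", "aaab").group())       # aaab
```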
7852
7853
7854ANCHORS AND SIMPLE ASSERTIONS
7855
7856         \b          word boundary
7857         \B          not a word boundary
7858         ^           start of subject
7859                      also after internal newline in multiline mode
7860         \A          start of subject
7861         $           end of subject
7862                      also before newline at end of subject
7863                      also before internal newline in multiline mode
7864         \Z          end of subject
7865                      also before newline at end of subject
7866         \z          end of subject
7867         \G          first matching position in subject
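
       Some of these anchors can be illustrated with Python's standard re
       module, whose behaviour for ^, $ and \A matches the table above. This
       is a stand-in, not PCRE itself; \Z is deliberately left out because
       Python's \Z matches only at the very end of the subject, like PCRE's
       \z.

```python
import re

text = "one\ntwo\n"

# ^ matches only at the start of the subject by default ...
print(re.findall(r"^\w+", text))              # ['one']
# ... and also after internal newlines in multiline mode.
print(re.findall(r"(?m)^\w+", text))          # ['one', 'two']
# $ matches before a newline at the very end of the subject.
print(re.search(r"two$", text) is not None)   # True
# \A is not affected by multiline mode.
print(re.findall(r"(?m)\A\w+", text))         # ['one']
```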
7868
7869
7870MATCH POINT RESET
7871
7872         \K          reset start of match
7873
7874       \K is honoured in positive assertions, but ignored in negative ones.
7875
7876
7877ALTERNATION
7878
7879         expr|expr|expr...
7880
7881
7882CAPTURING
7883
7884         (...)           capturing group
7885         (?<name>...)    named capturing group (Perl)
7886         (?'name'...)    named capturing group (Perl)
7887         (?P<name>...)   named capturing group (Python)
7888         (?:...)         non-capturing group
7889         (?|...)         non-capturing group; reset group numbers for
7890                          capturing groups in each alternative
7891
7892
7893ATOMIC GROUPS
7894
7895         (?>...)         atomic, non-capturing group
7896
7897
7898COMMENT
7899
7900         (?#....)        comment (not nestable)
7901
7902
7903OPTION SETTING
7904
7905         (?i)            caseless
7906         (?J)            allow duplicate names
7907         (?m)            multiline
7908         (?s)            single line (dotall)
7909         (?U)            default ungreedy (lazy)
7910         (?x)            extended (ignore white space)
7911         (?-...)         unset option(s)
7912
7913       The  following  are  recognized  only at the very start of a pattern or
7914       after one of the newline or \R options with similar syntax.  More  than
7915       one of them may appear.
7916
7917         (*LIMIT_MATCH=d) set the match limit to d (decimal number)
7918         (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
7919         (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
7920         (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
7921         (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
7922         (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
7923         (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
7924         (*UTF)          set appropriate UTF mode for the library in use
7925         (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
7926
7927       Note  that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
7928       the limits set by the caller of pcre_exec(), not increase them.
7929
7930
7931NEWLINE CONVENTION
7932
7933       These are recognized only at the very start of  the  pattern  or  after
7934       option settings with a similar syntax.
7935
7936         (*CR)           carriage return only
7937         (*LF)           linefeed only
7938         (*CRLF)         carriage return followed by linefeed
7939         (*ANYCRLF)      all three of the above
7940         (*ANY)          any Unicode newline sequence
7941
7942
7943WHAT \R MATCHES
7944
7945       These  are  recognized  only  at the very start of the pattern or after
7946       option setting with a similar syntax.
7947
7948         (*BSR_ANYCRLF)  CR, LF, or CRLF
7949         (*BSR_UNICODE)  any Unicode newline sequence
7950
7951
7952LOOKAHEAD AND LOOKBEHIND ASSERTIONS
7953
7954         (?=...)         positive look ahead
7955         (?!...)         negative look ahead
7956         (?<=...)        positive look behind
7957         (?<!...)        negative look behind
7958
7959       Each top-level branch of a look behind must be of a fixed length.
7960
7961
7962BACKREFERENCES
7963
7964         \n              reference by number (can be ambiguous)
7965         \gn             reference by number
7966         \g{n}           reference by number
7967         \g{-n}          relative reference by number
7968         \k<name>        reference by name (Perl)
7969         \k'name'        reference by name (Perl)
7970         \g{name}        reference by name (Perl)
7971         \k{name}        reference by name (.NET)
7972         (?P=name)       reference by name (Python)
7973
7974
7975SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
7976
7977         (?R)            recurse whole pattern
7978         (?n)            call subpattern by absolute number
7979         (?+n)           call subpattern by relative number
7980         (?-n)           call subpattern by relative number
7981         (?&name)        call subpattern by name (Perl)
7982         (?P>name)       call subpattern by name (Python)
7983         \g<name>        call subpattern by name (Oniguruma)
7984         \g'name'        call subpattern by name (Oniguruma)
7985         \g<n>           call subpattern by absolute number (Oniguruma)
7986         \g'n'           call subpattern by absolute number (Oniguruma)
7987         \g<+n>          call subpattern by relative number (PCRE extension)
7988         \g'+n'          call subpattern by relative number (PCRE extension)
7989         \g<-n>          call subpattern by relative number (PCRE extension)
7990         \g'-n'          call subpattern by relative number (PCRE extension)
7991
7992
7993CONDITIONAL PATTERNS
7994
7995         (?(condition)yes-pattern)
7996         (?(condition)yes-pattern|no-pattern)
7997
7998         (?(n)...        absolute reference condition
7999         (?(+n)...       relative reference condition
8000         (?(-n)...       relative reference condition
8001         (?(<name>)...   named reference condition (Perl)
8002         (?('name')...   named reference condition (Perl)
8003         (?(name)...     named reference condition (PCRE)
8004         (?(R)...        overall recursion condition
8005         (?(Rn)...       specific group recursion condition
8006         (?(R&name)...   specific recursion condition
8007         (?(DEFINE)...   define subpattern for reference
8008         (?(assert)...   assertion condition
8009
8010
8011BACKTRACKING CONTROL
8012
       The following act as soon as they are reached:
8014
8015         (*ACCEPT)       force successful match
8016         (*FAIL)         force backtrack; synonym (*F)
8017         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
8018
8019       The following act only when a subsequent match failure causes  a  back-
8020       track to reach them. They all force a match failure, but they differ in
8021       what happens afterwards. Those that advance the start-of-match point do
8022       so only if the pattern is not anchored.
8023
8024         (*COMMIT)       overall failure, no advance of starting point
8025         (*PRUNE)        advance to next starting character
8026         (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
8027         (*SKIP)         advance to current matching position
8028         (*SKIP:NAME)    advance to position corresponding to an earlier
8029                         (*MARK:NAME); if not found, the (*SKIP) is ignored
8030         (*THEN)         local failure, backtrack to next alternation
8031         (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
8032
8033
8034CALLOUTS
8035
8036         (?C)      callout
8037         (?Cn)     callout with data n
8038
8039
8040SEE ALSO
8041
8042       pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
8043
8044
8045AUTHOR
8046
8047       Philip Hazel
8048       University Computing Service
8049       Cambridge CB2 3QH, England.
8050
8051
8052REVISION
8053
8054       Last updated: 08 January 2014
8055       Copyright (c) 1997-2014 University of Cambridge.
8056------------------------------------------------------------------------------
8057
8058
8059PCREUNICODE(3)             Library Functions Manual             PCREUNICODE(3)
8060
8061
8062
8063NAME
8064       PCRE - Perl-compatible regular expressions
8065
8066UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT
8067
8068       As well as UTF-8 support, PCRE also supports UTF-16 (from release 8.30)
8069       and UTF-32 (from release 8.32), by means of two  additional  libraries.
8070       They can be built as well as, or instead of, the 8-bit library.
8071
8072
8073UTF-8 SUPPORT
8074
       In order to process UTF-8 strings, you must build PCRE's 8-bit library
       with UTF support, and, in addition, you must call pcre_compile() with
       the PCRE_UTF8 option flag, or the pattern must start with the sequence
       (*UTF8) or (*UTF). When either of these is the case, both the pattern
       and any subject strings that are matched against it are treated as
       UTF-8 strings instead of strings of individual 1-byte characters.
8081
8082
8083UTF-16 AND UTF-32 SUPPORT
8084
       In order to process UTF-16 or UTF-32 strings, you must build PCRE's
       16-bit or 32-bit library with UTF support, and, in addition, you must
       call pcre16_compile() or pcre32_compile() with the PCRE_UTF16 or
       PCRE_UTF32 option flag, as appropriate. Alternatively, the pattern must
       start with the sequence (*UTF16) or (*UTF32), as appropriate, or
       (*UTF), which can be used with either library. When UTF mode is set,
       both the pattern and any subject strings that are matched against it
       are treated as UTF-16 or UTF-32 strings instead of strings of
       individual 16-bit or 32-bit characters.
8094
8095
8096UTF SUPPORT OVERHEAD
8097
8098       If you compile PCRE with UTF support, but do not use it  at  run  time,
8099       the  library will be a bit bigger, but the additional run time overhead
8100       is limited to  testing  the  PCRE_UTF[8|16|32]  flag  occasionally,  so
8101       should not be very big.
8102
8103
8104UNICODE PROPERTY SUPPORT
8105
       If PCRE is built with Unicode character property support (which implies
       UTF support), the escape sequences \p{..}, \P{..}, and \X can be used.
       The available properties that can be tested are limited to the general
       category properties such as Lu for an upper case letter or Nd for a
       decimal number, the Unicode script names such as Arabic or Han, and the
       derived properties Any and L&. Full lists are given in the pcrepattern
       and pcresyntax documentation. Only the short names for properties are
       supported. For example, \p{L} matches a letter. Its Perl synonym,
       \p{Letter}, is not supported. Furthermore, in Perl, many properties
       may optionally be prefixed by "Is", for compatibility with Perl 5.6.
       PCRE does not support this.
8117
8118   Validity of UTF-8 strings
8119
       When you set the PCRE_UTF8 flag, the byte strings passed as patterns
       and subjects are (by default) checked for validity on entry to the
       relevant functions. The entire string is checked before any other
       processing takes place. From release 7.3 of PCRE, the check is
       according to the rules of RFC 3629, which are themselves derived from
       the Unicode specification. Earlier releases of PCRE followed the rules
       of RFC 2279, which allows the full range of 31-bit values (0 to
       0x7FFFFFFF). The current check allows only values in the range U+0 to
       U+10FFFF, excluding the surrogate area. (From release 8.33 the
       so-called "non-character" code points are no longer excluded because
       Unicode corrigendum #9 makes it clear that they should not be.)
8131
8132       Characters  in  the "Surrogate Area" of Unicode are reserved for use by
8133       UTF-16, where they are used in pairs to encode codepoints  with  values
8134       greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
8135       are available independently in the  UTF-8  and  UTF-32  encodings.  (In
8136       other  words,  the  whole  surrogate  thing is a fudge for UTF-16 which
8137       unfortunately messes up UTF-8 and UTF-32.)
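
       The constraints described above can be sketched in plain C. The
       function below is a hypothetical helper, not part of the PCRE API and
       not PCRE's own checker. It assumes the RFC 3629 rules as stated here
       (range U+0 to U+10FFFF, surrogates excluded, non-characters allowed)
       and also rejects overlong encodings. It reports the offset of the
       first byte of the failing character.

```c
#include <stddef.h>

/* Hypothetical helper: returns the byte offset of the first invalid
   character, or -1 if the whole buffer is valid UTF-8. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need;          /* continuation bytes expected */
        unsigned long cp;     /* decoded code point */
        unsigned long min;    /* smallest value allowed for this length */

        if (c < 0x80) { i++; continue; }
        else if (c >= 0xC2 && c <= 0xDF) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0)     { need = 2; cp = c & 0x0F; min = 0x800; }
        else if (c >= 0xF0 && c <= 0xF4) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;  /* stray continuation, 0xC0/0xC1, or >= 0xF5 */

        if (len - i <= need) return (long)i;          /* truncated */

        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong forms, values above U+10FFFF, and surrogates. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (long)i;

        i += need + 1;
    }
    return -1;
}
```

       A caller that has validated a long subject once in this way could then
       reasonably skip the library's own check on repeated scans.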
8138
8139       If an invalid UTF-8 string is passed to PCRE, an error return is given.
8140       At  compile  time, the only additional information is the offset to the
8141       first byte of the failing character. The run-time functions pcre_exec()
8142       and  pcre_dfa_exec() also pass back this information, as well as a more
8143       detailed reason code if the caller has provided memory in which  to  do
8144       this.
8145
8146       In  some  situations, you may already know that your strings are valid,
8147       and therefore want to skip these checks in  order  to  improve  perfor-
8148       mance,  for  example in the case of a long subject string that is being
8149       scanned repeatedly.  If you set the PCRE_NO_UTF8_CHECK flag at  compile
8150       time  or  at  run  time, PCRE assumes that the pattern or subject it is
8151       given (respectively) contains only valid UTF-8 codes. In this case,  it
8152       does not diagnose an invalid UTF-8 string.
8153
8154       Note  that  passing  PCRE_NO_UTF8_CHECK to pcre_compile() just disables
8155       the check for the pattern; it does not also apply to  subject  strings.
8156       If  you  want  to  disable the check for a subject string you must pass
8157       this option to pcre_exec() or pcre_dfa_exec().
8158
8159       If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, the
8160       result is undefined and your program may crash.
8161
8162   Validity of UTF-16 strings
8163
8164       When you set the PCRE_UTF16 flag, the strings of 16-bit data units that
8165       are passed as patterns and subjects are (by default) checked for valid-
8166       ity  on entry to the relevant functions. Values other than those in the
8167       surrogate range U+D800 to U+DFFF are independent code points. Values in
8168       the surrogate range must be used in pairs in the correct manner.
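
       The pairing rule can be sketched as follows. This is a hypothetical
       helper in plain C, not PCRE's own checker; it assumes that "the
       correct manner" means a high surrogate (U+D800 to U+DBFF) immediately
       followed by a low surrogate (U+DC00 to U+DFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the index of the first data unit of the
   failing character, or -1 if the whole buffer is valid UTF-16. */
long utf16_first_invalid(const uint16_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t u = s[i];

        if (u < 0xD800 || u > 0xDFFF) continue;   /* independent code point */
        if (u >= 0xDC00) return (long)i;          /* low surrogate comes first */

        /* u is a high surrogate: a low surrogate must follow. */
        if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
            return (long)i;

        i++;                                      /* skip the low surrogate */
    }
    return -1;
}
```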
8169
8170       If  an  invalid  UTF-16  string  is  passed to PCRE, an error return is
8171       given. At compile time, the only additional information is  the  offset
8172       to the first data unit of the failing character. The run-time functions
8173       pcre16_exec() and pcre16_dfa_exec() also pass back this information, as
8174       well  as  a more detailed reason code if the caller has provided memory
8175       in which to do this.
8176
8177       In some situations, you may already know that your strings  are  valid,
8178       and  therefore  want  to  skip these checks in order to improve perfor-
8179       mance. If you set the PCRE_NO_UTF16_CHECK flag at compile  time  or  at
8180       run time, PCRE assumes that the pattern or subject it is given (respec-
8181       tively) contains only valid UTF-16 sequences. In this case, it does not
8182       diagnose  an  invalid  UTF-16 string.  However, if an invalid string is
8183       passed, the result is undefined.
8184
8185   Validity of UTF-32 strings
8186
8187       When you set the PCRE_UTF32 flag, the strings of 32-bit data units that
8188       are passed as patterns and subjects are (by default) checked for valid-
8189       ity on entry to the relevant functions.  This check allows only  values
8190       in  the  range  U+0 to U+10FFFF, excluding the surrogate area U+D800 to
8191       U+DFFF.
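
       Because every UTF-32 data unit is a complete code point, the check
       reduces to a range test. A minimal sketch, using a hypothetical helper
       that is not part of the PCRE API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the index of the first invalid data unit,
   or -1 if every unit is in U+0..U+10FFFF and outside U+D800..U+DFFF. */
long utf32_first_invalid(const uint32_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] > 0x10FFFF || (s[i] >= 0xD800 && s[i] <= 0xDFFF))
            return (long)i;
    }
    return -1;
}
```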
8192
8193       If an invalid UTF-32 string is passed  to  PCRE,  an  error  return  is
8194       given.  At  compile time, the only additional information is the offset
8195       to the first data unit of the failing character. The run-time functions
8196       pcre32_exec() and pcre32_dfa_exec() also pass back this information, as
8197       well as a more detailed reason code if the caller has  provided  memory
8198       in which to do this.
8199
8200       In  some  situations, you may already know that your strings are valid,
8201       and therefore want to skip these checks in  order  to  improve  perfor-
8202       mance.  If  you  set the PCRE_NO_UTF32_CHECK flag at compile time or at
8203       run time, PCRE assumes that the pattern or subject it is given (respec-
8204       tively) contains only valid UTF-32 sequences. In this case, it does not
8205       diagnose an invalid UTF-32 string.  However, if an  invalid  string  is
8206       passed, the result is undefined.
8207
8208   General comments about UTF modes
8209
8210       1.  Codepoints  less  than  256  can be specified in patterns by either
8211       braced or unbraced hexadecimal escape sequences (for example, \x{b3} or
8212       \xb3). Larger values have to use braced sequences.
8213
8214       2.  Octal  numbers  up  to  \777 are recognized, and in UTF-8 mode they
8215       match two-byte characters for values greater than \177.
8216
8217       3. Repeat quantifiers apply to complete UTF characters, not to individ-
8218       ual data units, for example: \x{100}{3}.
8219
8220       4.  The dot metacharacter matches one UTF character instead of a single
8221       data unit.
8222
8223       5. The escape sequence \C can be used to match a single byte  in  UTF-8
8224       mode,  or  a single 16-bit data unit in UTF-16 mode, or a single 32-bit
8225       data unit in UTF-32 mode, but its use can lead to some strange  effects
8226       because  it  breaks up multi-unit characters (see the description of \C
8227       in the pcrepattern documentation). The use of \C is  not  supported  in
8228       the  alternative  matching  function  pcre[16|32]_dfa_exec(), nor is it
8229       supported in UTF mode by the JIT optimization of pcre[16|32]_exec(). If
8230       JIT  optimization  is  requested for a UTF pattern that contains \C, it
8231       will not succeed, and so the matching will be carried out by the normal
8232       interpretive function.
8233
8234       6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
8235       test characters of any code value, but, by default, the characters that
8236       PCRE  recognizes  as digits, spaces, or word characters remain the same
8237       set as in non-UTF mode, all with values less  than  256.  This  remains
8238       true  even  when  PCRE  is  built  to include Unicode property support,
8239       because to do otherwise would slow down PCRE in many common cases. Note
8240       in  particular that this applies to \b and \B, because they are defined
8241       in terms of \w and \W. If you really want to test for a wider sense of,
8242       say,  "digit",  you  can  use  explicit  Unicode property tests such as
8243       \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the
8244       character  escapes  work is changed so that Unicode properties are used
8245       to determine which characters match. There are more details in the sec-
8246       tion on generic character types in the pcrepattern documentation.
8247
8248       7.  Similarly,  characters that match the POSIX named character classes
8249       are all low-valued characters, unless the PCRE_UCP option is set.
8250
8251       8. However, the horizontal and vertical white  space  matching  escapes
8252       (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,
8253       whether or not PCRE_UCP is set.
8254
8255       9. Case-insensitive matching applies only to  characters  whose  values
8256       are  less than 128, unless PCRE is built with Unicode property support.
8257       A few Unicode characters such as Greek sigma have more than  two  code-
8258       points that are case-equivalent. Up to and including PCRE release 8.31,
8259       only one-to-one case mappings were supported, but later releases  (with
8260       Unicode  property  support) do treat as case-equivalent all versions of
8261       characters such as Greek sigma.
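
       Items 2 and 3 above can be made concrete by computing how many bytes a
       code point occupies in UTF-8. The helper below is a hypothetical
       illustration in plain C, not a PCRE function.

```c
/* Hypothetical helper: number of bytes needed to encode cp in UTF-8,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
int utf8_length(unsigned long cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}
```

       Every octal value from \200 to \777 falls in the two-byte range, and
       \x{100}{3} matches three two-byte characters, that is, six bytes of
       subject.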
8262
8263
8264AUTHOR
8265
8266       Philip Hazel
8267       University Computing Service
8268       Cambridge CB2 3QH, England.
8269
8270
8271REVISION
8272
8273       Last updated: 27 February 2013
8274       Copyright (c) 1997-2013 University of Cambridge.
8275------------------------------------------------------------------------------
8276
8277
8278PCREJIT(3)                 Library Functions Manual                 PCREJIT(3)
8279
8280
8281
8282NAME
8283       PCRE - Perl-compatible regular expressions
8284
8285PCRE JUST-IN-TIME COMPILER SUPPORT
8286
8287       Just-in-time  compiling  is a heavyweight optimization that can greatly
8288       speed up pattern matching. However, it comes at the cost of extra  pro-
8289       cessing before the match is performed. Therefore, it is of most benefit
8290       when the same pattern is going to be matched many times. This does  not
8291       necessarily  mean  many calls of a matching function; if the pattern is
8292       not anchored, matching attempts may take place many  times  at  various
8293       positions  in  the  subject, even for a single call.  Therefore, if the
8294       subject string is very long, it may still pay to use  JIT  for  one-off
8295       matches.
8296
8297       JIT  support  applies  only to the traditional Perl-compatible matching
8298       function.  It does not apply when the DFA matching  function  is  being
8299       used. The code for this support was written by Zoltan Herczeg.
8300
8301
8-BIT, 16-BIT AND 32-BIT SUPPORT
8303
8304       JIT  support  is available for all of the 8-bit, 16-bit and 32-bit PCRE
8305       libraries. To keep this documentation simple, only the 8-bit  interface
8306       is described in what follows. If you are using the 16-bit library, sub-
8307       stitute the  16-bit  functions  and  16-bit  structures  (for  example,
8308       pcre16_jit_stack  instead  of  pcre_jit_stack).  If  you  are using the
8309       32-bit library, substitute the 32-bit functions and  32-bit  structures
8310       (for example, pcre32_jit_stack instead of pcre_jit_stack).
8311
8312
8313AVAILABILITY OF JIT SUPPORT
8314
8315       JIT  support  is  an  optional  feature of PCRE. The "configure" option
8316       --enable-jit (or equivalent CMake option) must  be  set  when  PCRE  is
8317       built  if  you want to use JIT. The support is limited to the following
8318       hardware platforms:
8319
8320         ARM v5, v7, and Thumb2
8321         Intel x86 32-bit and 64-bit
8322         MIPS 32-bit
8323         Power PC 32-bit and 64-bit
8324         SPARC 32-bit (experimental)
8325
8326       If --enable-jit is set on an unsupported platform, compilation fails.
8327
8328       A program that is linked with PCRE 8.20 or later can tell if  JIT  sup-
8329       port  is  available  by  calling pcre_config() with the PCRE_CONFIG_JIT
8330       option. The result is 1 when JIT is available, and  0  otherwise.  How-
8331       ever, a simple program does not need to check this in order to use JIT.
8332       The normal API is implemented in a way that falls back to the interpre-
8333       tive code if JIT is not available. For programs that need the best pos-
8334       sible performance, there is also a "fast path"  API  that  is  JIT-spe-
8335       cific.
8336
8337       If  your program may sometimes be linked with versions of PCRE that are
8338       older than 8.20, but you want to use JIT when it is available, you  can
8339       test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
8340       macro such as PCRE_CONFIG_JIT, for compile-time control of your code.
8341
8342
8343SIMPLE USE OF JIT
8344
8345       You have to do two things to make use of the JIT support  in  the  sim-
8346       plest way:
8347
8348         (1) Call pcre_study() with the PCRE_STUDY_JIT_COMPILE option for
8349             each compiled pattern, and pass the resulting pcre_extra block to
8350             pcre_exec().
8351
         (2) Use pcre_free_study() to free the pcre_extra block when it is
             no longer needed, instead of just freeing it yourself. This
             ensures that any JIT data is also freed.
8356
8357       For a program that may be linked with pre-8.20 versions  of  PCRE,  you
8358       can insert
8359
8360         #ifndef PCRE_STUDY_JIT_COMPILE
8361         #define PCRE_STUDY_JIT_COMPILE 0
8362         #endif
8363
8364       so  that  no  option  is passed to pcre_study(), and then use something
8365       like this to free the study data:
8366
8367         #ifdef PCRE_CONFIG_JIT
8368             pcre_free_study(study_ptr);
8369         #else
8370             pcre_free(study_ptr);
8371         #endif
8372
8373       PCRE_STUDY_JIT_COMPILE requests the JIT compiler to generate  code  for
8374       complete  matches.  If  you  want  to  run  partial  matches  using the
8375       PCRE_PARTIAL_HARD or  PCRE_PARTIAL_SOFT  options  of  pcre_exec(),  you
8376       should  set  one  or  both  of the following options in addition to, or
8377       instead of, PCRE_STUDY_JIT_COMPILE when you call pcre_study():
8378
8379         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
8380         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
8381
8382       The JIT compiler generates different optimized code  for  each  of  the
8383       three  modes  (normal, soft partial, hard partial). When pcre_exec() is
8384       called, the appropriate code is run if it is available. Otherwise,  the
8385       pattern is matched using interpretive code.
8386
8387       In  some circumstances you may need to call additional functions. These
8388       are described in the  section  entitled  "Controlling  the  JIT  stack"
8389       below.
8390
8391       If  JIT  support  is  not  available,  PCRE_STUDY_JIT_COMPILE  etc. are
8392       ignored, and no JIT data is created. Otherwise, the compiled pattern is
8393       passed  to the JIT compiler, which turns it into machine code that exe-
8394       cutes much faster than the normal interpretive code.  When  pcre_exec()
8395       is  passed  a  pcre_extra block containing a pointer to JIT code of the
8396       appropriate mode (normal or hard/soft  partial),  it  obeys  that  code
8397       instead  of  running  the interpreter. The result is identical, but the
8398       compiled JIT code runs much faster.
8399
8400       There are some pcre_exec() options that are not supported for JIT  exe-
8401       cution.  There  are  also  some  pattern  items that JIT cannot handle.
8402       Details are given below. In both cases, execution  automatically  falls
8403       back  to  the  interpretive  code.  If you want to know whether JIT was
8404       actually used for a particular match, you  should  arrange  for  a  JIT
8405       callback  function  to  be  set up as described in the section entitled
8406       "Controlling the JIT stack" below, even if you do not need to supply  a
8407       non-default  JIT stack. Such a callback function is called whenever JIT
8408       code is about to be obeyed. If the execution options are not right  for
8409       JIT execution, the callback function is not obeyed.
8410
8411       If  the  JIT  compiler finds an unsupported item, no JIT data is gener-
8412       ated. You can find out if JIT execution is available after  studying  a
8413       pattern  by  calling  pcre_fullinfo()  with the PCRE_INFO_JIT option. A
8414       result of 1 means that JIT compilation was successful. A  result  of  0
8415       means that JIT support is not available, or the pattern was not studied
8416       with PCRE_STUDY_JIT_COMPILE etc., or the JIT compiler was not  able  to
8417       handle the pattern.
8418
8419       Once a pattern has been studied, with or without JIT, it can be used as
8420       many times as you like for matching different subject strings.
8421
8422
8423UNSUPPORTED OPTIONS AND PATTERN ITEMS
8424
8425       The only pcre_exec() options that are supported for JIT  execution  are
8426       PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK, PCRE_NO_UTF32_CHECK, PCRE_NOT-
8427       BOL,  PCRE_NOTEOL,  PCRE_NOTEMPTY,   PCRE_NOTEMPTY_ATSTART,   PCRE_PAR-
8428       TIAL_HARD, and PCRE_PARTIAL_SOFT.
8429
8430       The  only  unsupported  pattern items are \C (match a single data unit)
8431       when running in a UTF mode, and a callout immediately before an  asser-
8432       tion condition in a conditional group.
8433
8434
8435RETURN VALUES FROM JIT EXECUTION
8436
8437       When  a  pattern  is matched using JIT execution, the return values are
8438       the same as those given by the interpretive pcre_exec() code, with  the
8439       addition  of  one new error code: PCRE_ERROR_JIT_STACKLIMIT. This means
8440       that the memory used for the JIT stack was insufficient. See  "Control-
8441       ling the JIT stack" below for a discussion of JIT stack usage. For com-
8442       patibility with the interpretive pcre_exec() code, no  more  than  two-
8443       thirds  of  the ovector argument is used for passing back captured sub-
8444       strings.
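
       The two-thirds rule translates into simple arithmetic when sizing the
       ovector. The helpers below are hypothetical illustrations, not PCRE
       functions; they assume that each captured substring needs one pair of
       offsets and that pair 0 holds the whole match.

```c
/* Hypothetical helper: number of (start, end) offset pairs, including
   pair 0 for the whole match, that an ovector of ovecsize ints can return. */
int ovector_pairs(int ovecsize)
{
    return ovecsize / 3;
}

/* Hypothetical helper: smallest ovector size that can return the whole
   match plus capture_count capturing groups. */
int ovector_size_for(int capture_count)
{
    return (capture_count + 1) * 3;
}
```

       For example, an ovector of 30 ints can return the whole match and up
       to nine captured substrings.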
8445
       The error code PCRE_ERROR_MATCHLIMIT is returned by the JIT code if
       searching a very large pattern tree goes on for too long, just as it is
       in the same circumstances when JIT is not used, but the details of
       exactly what is counted are not the same. The PCRE_ERROR_RECURSIONLIMIT
       error code is never returned by JIT execution.
8451
8452
8453SAVING AND RESTORING COMPILED PATTERNS
8454
8455       The code that is generated by the  JIT  compiler  is  architecture-spe-
8456       cific,  and  is also position dependent. For those reasons it cannot be
8457       saved (in a file or database) and restored later like the bytecode  and
8458       other  data  of  a compiled pattern. Saving and restoring compiled pat-
8459       terns is not something many people do. More detail about this  facility
8460       is  given in the pcreprecompile documentation. It should be possible to
8461       run pcre_study() on a saved and restored pattern, and thereby  recreate
8462       the  JIT  data, but because JIT compilation uses significant resources,
8463       it is probably not worth doing this; you might as  well  recompile  the
8464       original pattern.
8465
8466
8467CONTROLLING THE JIT STACK
8468
8469       When the compiled JIT code runs, it needs a block of memory to use as a
8470       stack.  By default, it uses 32K on the  machine  stack.  However,  some
8471       large   or   complicated  patterns  need  more  than  this.  The  error
8472       PCRE_ERROR_JIT_STACKLIMIT is given when  there  is  not  enough  stack.
8473       Three  functions  are provided for managing blocks of memory for use as
8474       JIT stacks. There is further discussion about the use of JIT stacks  in
8475       the section entitled "JIT stack FAQ" below.
8476
8477       The  pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
8478       are a starting size and a maximum size, and it returns a pointer to  an
8479       opaque  structure of type pcre_jit_stack, or NULL if there is an error.
8480       The pcre_jit_stack_free() function can be used to free a stack that  is
8481       no  longer  needed.  (For  the technically minded: the address space is
8482       allocated by mmap or VirtualAlloc.)
8483
8484       JIT uses far less memory for recursion than the interpretive code,  and
8485       a  maximum  stack size of 512K to 1M should be more than enough for any
8486       pattern.
8487
8488       The pcre_assign_jit_stack() function specifies  which  stack  JIT  code
8489       should use. Its arguments are as follows:
8490
8491         pcre_extra         *extra
8492         pcre_jit_callback  callback
8493         void               *data
8494
       The extra argument must be the result of studying a pattern with
       PCRE_STUDY_JIT_COMPILE etc. There are three cases for the values of the
       other two arguments:
8498
8499         (1) If callback is NULL and data is NULL, an internal 32K block
8500             on the machine stack is used.
8501
8502         (2) If callback is NULL and data is not NULL, data must be
8503             a valid JIT stack, the result of calling pcre_jit_stack_alloc().
8504
8505         (3) If callback is not NULL, it must point to a function that is
8506             called with data as an argument at the start of matching, in
8507             order to set up a JIT stack. If the return from the callback
8508             function is NULL, the internal 32K stack is used; otherwise the
8509             return value must be a valid JIT stack, the result of calling
8510             pcre_jit_stack_alloc().
8511
       A callback function is obeyed whenever JIT code is about to be run; it
       is not obeyed when pcre_exec() is called with options that are
       incompatible with JIT execution. A callback function can therefore be
       used to determine whether a match operation was executed by JIT or by
       the interpreter.
8517
       You may safely use the same JIT stack for more than one pattern (either
       by assigning directly or by callback), as long as the patterns are all
       matched sequentially in the same thread. In a multithreaded
       application, not specifying a JIT stack, or assigning or passing back
       NULL from a callback, is thread-safe, because each thread has its own
       machine stack. However, if you assign or pass back a non-NULL JIT
       stack, this must be a different stack for each thread so that the
       application is thread-safe.
8526
8527       Strictly speaking, even more is allowed. You can assign the  same  non-
8528       NULL  stack  to any number of patterns as long as they are not used for
8529       matching by multiple threads at the same time.  For  example,  you  can
8530       assign  the same stack to all compiled patterns, and use a global mutex
8531       in the callback to wait until the stack is available for use.  However,
8532       this is an inefficient solution, and not recommended.
8533
8534       This  is a suggestion for how a multithreaded program that needs to set
8535       up non-default JIT stacks might operate:
8536
         During thread initialization
           thread_local_var = pcre_jit_stack_alloc(...)

         During thread exit
           pcre_jit_stack_free(thread_local_var)

         Use a one-line callback function
           return thread_local_var
8545
8546       All the functions described in this section do nothing if  JIT  is  not
8547       available,  and  pcre_assign_jit_stack()  does nothing unless the extra
8548       argument is non-NULL and points to  a  pcre_extra  block  that  is  the
8549       result of a successful study with PCRE_STUDY_JIT_COMPILE etc.
8550
8551
8552JIT STACK FAQ
8553
8554       (1) Why do we need JIT stacks?
8555
       PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack
       where the local data of the current node is pushed before checking its
       child nodes. Extending the real machine stack is difficult on some
       platforms. For example, on PowerPC the stack chain has to be updated
       every time the stack is extended. This is possible, but the overhead
       of the updates reduces performance. So we do the recursion in memory.
8562
8563       (2) Why don't we simply allocate blocks of memory with malloc()?
8564
8565       Modern operating systems have a  nice  feature:  they  can  reserve  an
8566       address space instead of allocating memory. We can safely allocate mem-
8567       ory pages inside this address space, so the stack  could  grow  without
8568       moving memory data (this is important because of pointers). Thus we can
8569       allocate 1M address space, and use only a single memory  page  (usually
8570       4K)  if  that is enough. However, we can still grow up to 1M anytime if
8571       needed.
8572
8573       (3) Who "owns" a JIT stack?
8574
       The owner of the stack is the user program, not the JIT studied pattern
       or anything else. The user program must ensure that while a stack is
       in use by pcre_exec() (that is, while it is assigned to the pattern
       currently running), it is not used by any other thread, to avoid
       overwriting the same memory area. The best practice for multithreaded
       programs is to allocate a stack for each thread, and return this stack
       through the JIT callback function.
8582
8583       (4) When should a JIT stack be freed?
8584
       You can free a JIT stack at any time, as long as it will not be used by
       pcre_exec() again. When you assign the stack to a pattern, only a
       pointer is set. There is no reference counting or any other magic. You
       can free the patterns and stacks in any order, at any time. Just do not
       call pcre_exec() with a pattern pointing to an already freed stack, as
       that will cause a segmentation fault. (Also, do not free a stack
       currently used by pcre_exec() in another thread.) You can also replace
       the stack for a pattern at any time. You can even free the previous
       stack before assigning a replacement.
8594
8595       (5) Should I allocate/free a  stack  every  time  before/after  calling
8596       pcre_exec()?
8597
       No, because this is too costly in terms of resources. However, you
       could implement a scheme that releases the stack if it has not been
       used for, say, two minutes. The JIT callback can help to achieve this
       without keeping a list of the currently JIT studied patterns.
8602
8603       (6) OK, the stack is for long term memory allocation. But what  happens
8604       if  a pattern causes stack overflow with a stack of 1M? Is that 1M kept
8605       until the stack is freed?
8606
       Especially on embedded systems, it might be a good idea to release
       memory sometimes without freeing the stack. There is no API for this at
       the moment. A function that returns the memory currently allocated
       for a stack, and another that releases memory (shrinking the stack),
       would probably be a good idea if someone needs this.
8612
8613       (7) This is too much of a headache. Isn't there any better solution for
8614       JIT stack handling?
8615
8616       No,  thanks to Windows. If POSIX threads were used everywhere, we could
8617       throw out this complicated API.
8618
8619
8620EXAMPLE CODE
8621
8622       This is a single-threaded example that specifies a  JIT  stack  without
8623       using a callback.
8624
         /* pattern, subject and length are assumed to be set up by the
            caller */
         const char *error;
         int erroffset;
         int rc;
         int ovector[30];
         pcre *re;
         pcre_extra *extra;
         pcre_jit_stack *jit_stack;

         re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
         /* Check for errors */
         extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &error);
         jit_stack = pcre_jit_stack_alloc(32*1024, 512*1024);
         /* Check for error (NULL) */
         pcre_assign_jit_stack(extra, NULL, jit_stack);
         rc = pcre_exec(re, extra, subject, length, 0, 0, ovector, 30);
         /* Check results */
         pcre_free(re);
         pcre_free_study(extra);
         pcre_jit_stack_free(jit_stack);
8642
8643
8644JIT FAST PATH API
8645
8646       Because  the  API  described  above falls back to interpreted execution
8647       when JIT is not available, it is convenient for programs that are writ-
8648       ten  for  general  use  in  many environments. However, calling JIT via
8649       pcre_exec() does have a performance impact. Programs that  are  written
8650       for  use  where  JIT  is known to be available, and which need the best
8651       possible performance, can instead use a "fast path"  API  to  call  JIT
8652       execution  directly  instead of calling pcre_exec() (obviously only for
8653       patterns that have been successfully studied by JIT).
8654
8655       The fast path function is called pcre_jit_exec(), and it takes  exactly
8656       the  same  arguments  as pcre_exec(), plus one additional argument that
8657       must point to a JIT stack. The JIT stack arrangements  described  above
8658       do not apply. The return values are the same as for pcre_exec().
8659
       When you call pcre_exec(), as well as testing for invalid options, a
       number of other sanity checks are performed on the arguments. For
       example, if the subject pointer is NULL, or its length is negative, an
       immediate error is given. Also, unless PCRE_NO_UTF[8|16|32]_CHECK is
       set, a UTF subject string is tested for validity. In the interests of
       speed, these checks do not happen on the JIT fast path, and if invalid
       data is passed, the result is undefined.
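
       Because the fast path skips these checks, a caller that cannot
       guarantee valid arguments may want to test them itself first. The
       following is a minimal sketch of such a guard. fast_path_args_ok() is a
       hypothetical helper, the start offset range test is an assumption about
       what a careful caller would want, and UTF validity is not covered.

```c
#include <stddef.h>

/* Returns 1 if the arguments pass the basic checks described above,
   0 otherwise. UTF validity is deliberately not checked here. */
int fast_path_args_ok(const char *subject, int length, int start_offset)
{
    if (subject == NULL || length < 0)
        return 0;   /* the two cases named in the text */
    if (start_offset < 0 || start_offset > length)
        return 0;   /* assumption: offset must lie within the subject */
    return 1;
}
```

       A program would call the guard once per subject and use the fast path
       only when it returns 1.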
8667
8668       Bypassing  the  sanity  checks  and  the  pcre_exec() wrapping can give
8669       speedups of more than 10%.
8670
8671
8672SEE ALSO
8673
8674       pcreapi(3)
8675
8676
8677AUTHOR
8678
8679       Philip Hazel (FAQ by Zoltan Herczeg)
8680       University Computing Service
8681       Cambridge CB2 3QH, England.
8682
8683
8684REVISION
8685
8686       Last updated: 17 March 2013
8687       Copyright (c) 1997-2013 University of Cambridge.
8688------------------------------------------------------------------------------
8689
8690
8691PCREPARTIAL(3)             Library Functions Manual             PCREPARTIAL(3)
8692
8693
8694
8695NAME
8696       PCRE - Perl-compatible regular expressions
8697
8698PARTIAL MATCHING IN PCRE
8699
8700       In normal use of PCRE, if the subject string that is passed to a match-
8701       ing function matches as far as it goes, but is too short to  match  the
8702       entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances
8703       where it might be helpful to distinguish this case from other cases  in
8704       which there is no match.
8705
8706       Consider, for example, an application where a human is required to type
8707       in data for a field with specific formatting requirements.  An  example
8708       might be a date in the form ddmmmyy, defined by this pattern:
8709
8710         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
8711
8712       If the application sees the user's keystrokes one by one, and can check
8713       that what has been typed so far is potentially valid,  it  is  able  to
8714       raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
8715       reflecting the character that has been typed, for example. This immedi-
8716       ate  feedback is likely to be a better user interface than a check that
8717       is delayed until the entire string has been entered.  Partial  matching
8718       can  also be useful when the subject string is very long and is not all
8719       available at once.
8720
8721       PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and
8722       PCRE_PARTIAL_HARD  options,  which  can  be set when calling any of the
8723       matching functions. For backwards compatibility, PCRE_PARTIAL is a syn-
8724       onym  for  PCRE_PARTIAL_SOFT.  The essential difference between the two
8725       options is whether or not a partial match is preferred to  an  alterna-
8726       tive complete match, though the details differ between the two types of
8727       matching function. If both options  are  set,  PCRE_PARTIAL_HARD  takes
8728       precedence.
8729
8730       If  you  want to use partial matching with just-in-time optimized code,
8731       you must call pcre_study(), pcre16_study() or  pcre32_study() with  one
8732       or both of these options:
8733
8734         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
8735         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
8736
8737       PCRE_STUDY_JIT_COMPILE  should also be set if you are going to run non-
8738       partial matches on the same pattern. If the appropriate JIT study  mode
8739       has not been set for a match, the interpretive matching code is used.
8740
8741       Setting a partial matching option disables two of PCRE's standard opti-
8742       mizations. PCRE remembers the last literal data unit in a pattern,  and
8743       abandons  matching  immediately  if  it  is  not present in the subject
8744       string. This optimization cannot be used  for  a  subject  string  that
8745       might  match only partially. If the pattern was studied, PCRE knows the
8746       minimum length of a matching string, and does not  bother  to  run  the
8747       matching  function  on  shorter strings. This optimization is also dis-
8748       abled for partial matching.
8749
8750
8751PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()
8752
8753       A  partial   match   occurs   during   a   call   to   pcre_exec()   or
8754       pcre[16|32]_exec()  when  the end of the subject string is reached suc-
8755       cessfully, but matching cannot continue  because  more  characters  are
8756       needed.   However, at least one character in the subject must have been
8757       inspected. This character need not  form  part  of  the  final  matched
8758       string;  lookbehind  assertions and the \K escape sequence provide ways
8759       of inspecting characters before the start of a matched  substring.  The
8760       requirement  for  inspecting  at  least one character exists because an
8761       empty string can always be matched; without such  a  restriction  there
8762       would  always  be  a partial match of an empty string at the end of the
8763       subject.
8764
8765       If there are at least two slots in the offsets vector  when  a  partial
8766       match  is returned, the first slot is set to the offset of the earliest
8767       character that was inspected. For convenience, the second offset points
8768       to the end of the subject so that a substring can easily be identified.
8769       If there are at least three slots in the offsets vector, the third slot
8770       is set to the offset of the character where matching started.
8771
8772       For the majority of patterns, the contents of the first and third slots
8773       will be the same. However, for patterns that contain lookbehind  asser-
8774       tions, or begin with \b or \B, characters before the one where matching
8775       started may have been inspected while carrying out the match. For exam-
8776       ple, consider this pattern:
8777
8778         /(?<=abc)123/
8779
8780       This pattern matches "123", but only if it is preceded by "abc". If the
8781       subject string is "xyzabc12", the first two  offsets  after  a  partial
8782       match  are for the substring "abc12", because all these characters were
8783       inspected. However, the third offset is set to 6, because that  is  the
8784       offset where matching began.
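
       As an illustration of how the three offsets fit together, the
       following C sketch copies the inspected text out of the subject and
       reports where matching began relative to that text.
       copy_partial_text() is a hypothetical helper, not part of PCRE. It
       assumes that ovector holds the three values set for a partial match and
       treats them as byte offsets, as in the 8-bit library.

```c
#include <string.h>

/* Copies the inspected text subject[ovector[0] .. ovector[1]) into out,
   which must hold at least out_size bytes, and NUL-terminates it.
   Returns the offset, relative to the copied text, at which matching
   started (ovector[2] - ovector[0]), or -1 if out is too small. */
int copy_partial_text(const char *subject, const int *ovector,
                      char *out, size_t out_size)
{
    size_t len = (size_t)(ovector[1] - ovector[0]);

    if (len + 1 > out_size)
        return -1;
    memcpy(out, subject + ovector[0], len);
    out[len] = '\0';
    return ovector[2] - ovector[0];
}
```

       For the subject "xyzabc12" with offsets 3, 8 and 6, the copied text is
       "abc12" and the returned value is 3, the position of "1" within it.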
8785
       What happens when a partial match is identified depends on which of the
       two partial matching options is set.
8788
8789   PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre[16|32]_exec()
8790
8791       If PCRE_PARTIAL_SOFT is  set  when  pcre_exec()  or  pcre[16|32]_exec()
8792       identifies a partial match, the partial match is remembered, but match-
8793       ing continues as normal, and other  alternatives  in  the  pattern  are
8794       tried.  If  no  complete  match  can  be  found,  PCRE_ERROR_PARTIAL is
8795       returned instead of PCRE_ERROR_NOMATCH.
8796
8797       This option is "soft" because it prefers a complete match over  a  par-
8798       tial  match.   All the various matching items in a pattern behave as if
8799       the subject string is potentially complete. For example, \z, \Z, and  $
8800       match  at  the end of the subject, as normal, and for \b and \B the end
8801       of the subject is treated as a non-alphanumeric.
8802
8803       If there is more than one partial match, the first one that  was  found
8804       provides the data that is returned. Consider this pattern:
8805
8806         /123\w+X|dogY/
8807
8808       If  this is matched against the subject string "abc123dog", both alter-
8809       natives fail to match, but the end of the  subject  is  reached  during
8810       matching,  so  PCRE_ERROR_PARTIAL is returned. The offsets are set to 3
8811       and 9, identifying "123dog" as the first partial match that was  found.
8812       (In  this  example, there are two partial matches, because "dog" on its
8813       own partially matches the second alternative.)
8814
8815   PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre[16|32]_exec()
8816
8817       If PCRE_PARTIAL_HARD is  set  for  pcre_exec()  or  pcre[16|32]_exec(),
8818       PCRE_ERROR_PARTIAL  is  returned  as  soon as a partial match is found,
8819       without continuing to search for possible complete matches. This option
8820       is "hard" because it prefers an earlier partial match over a later com-
8821       plete match. For this reason, the assumption is made that  the  end  of
8822       the  supplied  subject  string may not be the true end of the available
8823       data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of the
8824       subject,  the  result is PCRE_ERROR_PARTIAL, provided that at least one
8825       character in the subject has been inspected.
8826
8827       Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16 subject
8828       strings  are checked for validity. Normally, an invalid sequence causes
8829       the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16.  However,  in  the
8830       special  case  of  a  truncated  character  at  the end of the subject,
8831       PCRE_ERROR_SHORTUTF8  or   PCRE_ERROR_SHORTUTF16   is   returned   when
8832       PCRE_PARTIAL_HARD is set.
8833
8834   Comparing hard and soft partial matching
8835
8836       The  difference  between the two partial matching options can be illus-
8837       trated by a pattern such as:
8838
8839         /dog(sbody)?/
8840
8841       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
8842       the  longer  string  if  possible). If it is matched against the string
8843       "dog" with PCRE_PARTIAL_SOFT, it yields a  complete  match  for  "dog".
8844       However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
8845       On the other hand, if the pattern is made ungreedy the result  is  dif-
8846       ferent:
8847
8848         /dog(sbody)??/
8849
8850       In  this  case  the  result  is always a complete match because that is
8851       found first, and matching never  continues  after  finding  a  complete
8852       match. It might be easier to follow this explanation by thinking of the
8853       two patterns like this:
8854
8855         /dog(sbody)?/    is the same as  /dogsbody|dog/
8856         /dog(sbody)??/   is the same as  /dog|dogsbody/
8857
8858       The second pattern will never match "dogsbody", because it will  always
8859       find the shorter match first.
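
       The effect of the order of alternatives can be imitated with a toy
       model. The following C sketch is not PCRE code: it tries literal
       alternatives in order at the start of the subject, which is only a
       rough stand-in for the way the standard functions explore a pattern.
       demo_result and demo_ordered_match() are hypothetical names.

```c
#include <string.h>

typedef enum { DEMO_NOMATCH, DEMO_PARTIAL, DEMO_MATCH } demo_result;

/* Tries each literal alternative in turn at the start of subject.
   hard is non-zero for hard partial matching, zero for soft. */
demo_result demo_ordered_match(const char *const *alts, int count,
                               const char *subject, int hard)
{
    size_t slen = strlen(subject);
    int saw_partial = 0;

    for (int i = 0; i < count; i++) {
        size_t alen = strlen(alts[i]);

        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            return DEMO_MATCH;        /* a complete match ends the search */

        /* The subject ran out inside this alternative, after at least
           one character was inspected: a partial match. */
        if (slen > 0 && slen < alen && strncmp(subject, alts[i], slen) == 0) {
            if (hard)
                return DEMO_PARTIAL;  /* hard: first partial match wins */
            saw_partial = 1;          /* soft: remember it, keep trying */
        }
    }
    return saw_partial ? DEMO_PARTIAL : DEMO_NOMATCH;
}
```

       With the alternatives ordered as "dogsbody", "dog" and the subject
       "dog", the model gives DEMO_MATCH for soft and DEMO_PARTIAL for hard
       matching. With the order reversed it gives DEMO_MATCH in both cases.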
8860
8861
8862PARTIAL MATCHING USING pcre_dfa_exec() OR pcre[16|32]_dfa_exec()
8863
8864       The DFA functions move along the subject string character by character,
8865       without backtracking, searching for  all  possible  matches  simultane-
8866       ously.  If the end of the subject is reached before the end of the pat-
8867       tern, there is the possibility of a partial match, again provided  that
8868       at least one character has been inspected.
8869
8870       When  PCRE_PARTIAL_SOFT  is set, PCRE_ERROR_PARTIAL is returned only if
8871       there have been no complete matches. Otherwise,  the  complete  matches
8872       are  returned.   However,  if PCRE_PARTIAL_HARD is set, a partial match
8873       takes precedence over any complete matches. The portion of  the  string
8874       that  was  inspected when the longest partial match was found is set as
8875       the first matching string, provided there are at least two slots in the
8876       offsets vector.
8877
8878       Because  the  DFA functions always search for all possible matches, and
8879       there is no difference between greedy and  ungreedy  repetition,  their
8880       behaviour  is  different  from  the  standard  functions when PCRE_PAR-
8881       TIAL_HARD is  set.  Consider  the  string  "dog"  matched  against  the
8882       ungreedy pattern shown above:
8883
8884         /dog(sbody)??/
8885
8886       Whereas  the  standard functions stop as soon as they find the complete
8887       match for "dog", the DFA functions also  find  the  partial  match  for
8888       "dogsbody", and so return that when PCRE_PARTIAL_HARD is set.
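
       The same kind of toy model shows why the order of alternatives does not
       matter here. The following C sketch is not PCRE code and does not
       return offsets: it examines every literal alternative before deciding,
       as a rough stand-in for the DFA functions' simultaneous search.
       demo_result and demo_dfa_style() are hypothetical names.

```c
#include <string.h>

typedef enum { DEMO_NOMATCH, DEMO_PARTIAL, DEMO_MATCH } demo_result;

/* Examines all literal alternatives at the start of subject before
   deciding. hard is non-zero for hard partial matching, zero for soft. */
demo_result demo_dfa_style(const char *const *alts, int count,
                           const char *subject, int hard)
{
    size_t slen = strlen(subject);
    int complete = 0, partial = 0;

    for (int i = 0; i < count; i++) {
        size_t alen = strlen(alts[i]);

        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            complete = 1;
        else if (slen > 0 && slen < alen &&
                 strncmp(subject, alts[i], slen) == 0)
            partial = 1;    /* subject ended inside this alternative */
    }
    if (hard && partial)
        return DEMO_PARTIAL;    /* hard: partial beats any complete match */
    if (complete)
        return DEMO_MATCH;      /* soft: complete matches are preferred */
    return partial ? DEMO_PARTIAL : DEMO_NOMATCH;
}
```

       For the alternatives "dog", "dogsbody" and the subject "dog", the model
       gives DEMO_PARTIAL for hard matching and DEMO_MATCH for soft matching.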
8889
8890
8891PARTIAL MATCHING AND WORD BOUNDARIES
8892
       If a pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
       counter-intuitive results. Consider this pattern:
8896
8897         /\bcat\b/
8898
8899       This matches "cat", provided there is a word boundary at either end. If
8900       the subject string is "the cat", the comparison of the final "t" with a
8901       following  character  cannot  take  place, so a partial match is found.
8902       However, normal matching carries on, and \b matches at the end  of  the
8903       subject  when  the  last  character is a letter, so a complete match is
8904       found.  The  result,  therefore,  is  not   PCRE_ERROR_PARTIAL.   Using
8905       PCRE_PARTIAL_HARD  in  this case does yield PCRE_ERROR_PARTIAL, because
8906       then the partial match takes precedence.
8907
8908
8909FORMERLY RESTRICTED PATTERNS
8910
       For releases of PCRE prior to 8.00, because of the way certain internal
       optimizations were implemented in the pcre_exec() function, the
       PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
       used with all patterns. From release 8.00 onwards, the restrictions no
       longer apply, and partial matching can be requested for any pattern.
8917
8918       Items that were formerly restricted were repeated single characters and
8919       repeated metasequences. If PCRE_PARTIAL was set for a pattern that  did
8920       not  conform  to  the restrictions, pcre_exec() returned the error code
8921       PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in  use.  The
8922       PCRE_INFO_OKPARTIAL  call  to pcre_fullinfo() to find out if a compiled
8923       pattern can be used for partial matching now always returns 1.
8924
8925
8926EXAMPLE OF PARTIAL MATCHING USING PCRETEST
8927
8928       If the escape sequence \P is present  in  a  pcretest  data  line,  the
8929       PCRE_PARTIAL_SOFT  option  is  used  for  the  match.  Here is a run of
8930       pcretest that uses the date example quoted above:
8931
           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\P
          0: 25jun04
          1: jun
         data> 25dec3\P
         Partial match: 25dec3
         data> 3ju\P
         Partial match: 3ju
         data> 3juj\P
         No match
         data> j\P
         No match
8944
8945       The first data string is matched  completely,  so  pcretest  shows  the
8946       matched  substrings.  The  remaining four strings do not match the com-
8947       plete pattern, but the first two are partial matches. Similar output is
8948       obtained if DFA matching is used.
8949
8950       If  the escape sequence \P is present more than once in a pcretest data
8951       line, the PCRE_PARTIAL_HARD option is set for the match.
8952
8953
8954MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16|32]_dfa_exec()
8955
8956       When a partial match has been found using a DFA matching  function,  it
8957       is  possible to continue the match by providing additional subject data
8958       and calling the function again with the same compiled  regular  expres-
8959       sion,  this time setting the PCRE_DFA_RESTART option. You must pass the
8960       same working space as before, because this is where details of the pre-
8961       vious  partial  match  are  stored.  Here is an example using pcretest,
8962       using the \R escape sequence to set  the  PCRE_DFA_RESTART  option  (\D
8963       specifies the use of the DFA matching function):
8964
8965           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
8966         data> 23ja\P\D
8967         Partial match: 23ja
8968         data> n05\R\D
8969          0: n05
8970
8971       The  first  call has "23ja" as the subject, and requests partial match-
8972       ing; the second call  has  "n05"  as  the  subject  for  the  continued
8973       (restarted)  match.   Notice  that when the match is complete, only the
8974       last part is shown; PCRE does  not  retain  the  previously  partially-
8975       matched  string. It is up to the calling program to do that if it needs
8976       to.
8977
8978       That means that, for an unanchored pattern, if a continued match fails,
8979       it  is  not  possible  to  try  again at a new starting point. All this
8980       facility is capable of doing is  continuing  with  the  previous  match
8981       attempt.  In  the previous example, if the second set of data is "ug23"
8982       the result is no match, even though there would be a match for  "aug23"
8983       if  the entire string were given at once. Depending on the application,
8984       this may or may not be what you want.  The only way to allow for start-
8985       ing  again  at  the next character is to retain the matched part of the
8986       subject and try a new complete match.
8987
8988       You can set the PCRE_PARTIAL_SOFT  or  PCRE_PARTIAL_HARD  options  with
8989       PCRE_DFA_RESTART  to  continue partial matching over multiple segments.
8990       This facility can be used to pass very long subject strings to the  DFA
8991       matching functions.
8992
8993
8994MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre[16|32]_exec()
8995
8996       From  release 8.00, the standard matching functions can also be used to
8997       do multi-segment matching. Unlike the DFA functions, it is not possible
8998       to  restart the previous match with a new segment of data. Instead, new
8999       data must be added to the previous subject string, and the entire match
9000       re-run,  starting from the point where the partial match occurred. Ear-
9001       lier data can be discarded.
9002
9003       It is best to use PCRE_PARTIAL_HARD in this situation, because it  does
9004       not  treat the end of a segment as the end of the subject when matching
9005       \z, \Z, \b, \B, and $. Consider  an  unanchored  pattern  that  matches
9006       dates:
9007
9008           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
9009         data> The date is 23ja\P\P
9010         Partial match: 23ja
9011
       At this stage, an application could discard the text preceding "23ja",
       add on text from the next segment, and call the matching function
       again. Unlike the DFA matching functions, the entire matching string
       must always be available, and the complete matching process occurs for
       each call, so more memory and more processing time are needed.
9017
9018       Note:  If  the pattern contains lookbehind assertions, or \K, or starts
9019       with \b or \B, the string that is returned for a partial match includes
9020       characters  that precede the start of what would be returned for a com-
9021       plete match, because it contains all the characters that were inspected
9022       during the partial match.
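
       The buffer handling between calls might look like the following C
       sketch. It is only an outline of the application's side of the work:
       carry_over() is a hypothetical helper, not part of PCRE, and it assumes
       that ovector holds the offsets set for a partial match, with ovector[0]
       being the earliest inspected character.

```c
#include <string.h>

/* Drops everything before the inspected text of a partial match and
   appends the next segment. buf currently holds len bytes and has room
   for cap bytes. Returns the new length, or -1 if the offsets are out
   of range or the result would not fit. */
int carry_over(char *buf, int len, int cap, const int *ovector,
               const char *segment, int seg_len)
{
    int keep_from = ovector[0];     /* earliest inspected character */
    int kept;

    if (keep_from < 0 || keep_from > len)
        return -1;
    kept = len - keep_from;
    if (kept + seg_len > cap)
        return -1;
    memmove(buf, buf + keep_from, (size_t)kept);    /* regions may overlap */
    memcpy(buf + kept, segment, (size_t)seg_len);
    return kept + seg_len;
}
```

       Starting from "The date is 23ja" with offsets 12 and 16, adding the
       segment "n05" leaves "23jan05" in the buffer, ready for the next call
       to the matching function.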
9023
9024
9025ISSUES WITH MULTI-SEGMENT MATCHING
9026
9027       Certain types of pattern may give problems with multi-segment matching,
9028       whichever matching function is used.
9029
       1. If the pattern contains a test for the beginning of a line, you need
       to pass the PCRE_NOTBOL option when the subject string for any call
       does not start at the beginning of a line. There is also a PCRE_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.
9035
9036       2. Lookbehind assertions that have already been obeyed are catered  for
9037       in the offsets that are returned for a partial match. However a lookbe-
9038       hind assertion later in the pattern could require even earlier  charac-
9039       ters   to  be  inspected.  You  can  handle  this  case  by  using  the
9040       PCRE_INFO_MAXLOOKBEHIND    option    of    the    pcre_fullinfo()    or
9041       pcre[16|32]_fullinfo()  functions  to  obtain the length of the longest
9042       lookbehind in the pattern. This length  is  given  in  characters,  not
9043       bytes.  If  you  always retain at least that many characters before the
9044       partially matched string, all should be  well.  (Of  course,  near  the
9045       start of the subject, fewer characters may be present; in that case all
9046       characters should be retained.)
9047
9048       From release 8.33, there is a more accurate way of deciding which char-
9049       acters  to  retain.  Instead  of  subtracting the length of the longest
9050       lookbehind from the  earliest  inspected  character  (offsets[0]),  the
9051       match  start  position  (offsets[2]) should be used, and the next match
9052       attempt started at the offsets[2] character by setting the  startoffset
9053       argument of pcre_exec() or pcre_dfa_exec().
9054
       For example, if the pattern "(?<=123)abc" is partially matched against
       the string "xx123a", the three offset values returned are 2, 6, and 5.
       This indicates that the matching process that gave a partial match
       started at offset 5, but the characters "123a" were all inspected. The
       maximum lookbehind for that pattern is 3, so taking that away from 5
       gives 2, which shows that we need only keep "123a". Once further
       characters have been added, the next match attempt can be started at
       offset 3 of the retained text (that is, at "a"). When the match start
       is not the earliest inspected character, pcretest shows it explicitly:
9064
9065           re> "(?<=123)abc"
9066         data> xx123a\P\P
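
       The arithmetic in this example can be written down as a short C sketch.
       restart_plan and plan_restart() are hypothetical names, not part of
       PCRE. The sketch assumes that max_lookbehind was obtained with
       PCRE_INFO_MAXLOOKBEHIND and that characters and bytes coincide (that
       is, a non-UTF 8-bit subject), since the lookbehind length is given in
       characters.

```c
/* Where to cut the old subject and where to restart after a partial
   match. */
typedef struct {
    int keep_from;      /* first offset of the old subject to retain */
    int start_offset;   /* startoffset for the next attempt, relative
                           to the retained text */
} restart_plan;

restart_plan plan_restart(const int *ovector, int max_lookbehind)
{
    restart_plan plan;

    /* Step back from the match start by the longest lookbehind. */
    plan.keep_from = ovector[2] - max_lookbehind;
    if (plan.keep_from < 0)
        plan.keep_from = 0;     /* near the start: keep everything */
    plan.start_offset = ovector[2] - plan.keep_from;
    return plan;
}
```

       For the offsets 2, 6 and 5 with a maximum lookbehind of 3, the sketch
       gives keep_from 2 and start_offset 3, matching the figures above.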
9067         Partial match at offset 5: 123a
9068
9069       3.  Because a partial match must always contain at least one character,
9070       what might be considered a partial match of an  empty  string  actually
9071       gives a "no match" result. For example:
9072
9073           re> /c(?<=abc)x/
9074         data> ab\P
9075         No match
9076
9077       If the next segment begins "cx", a match should be found, but this will
9078       only happen if characters from the previous segment are  retained.  For
9079       this  reason,  a  "no  match"  result should be interpreted as "partial
9080       match of an empty string" when the pattern contains lookbehinds.
9081
9082       4. Matching a subject string that is split into multiple  segments  may
9083       not  always produce exactly the same result as matching over one single
9084       long string, especially when PCRE_PARTIAL_SOFT  is  used.  The  section
9085       "Partial  Matching  and  Word Boundaries" above describes an issue that
9086       arises if the pattern ends with \b or \B. Another  kind  of  difference
9087       may  occur when there are multiple matching possibilities, because (for
9088       PCRE_PARTIAL_SOFT) a partial match result is given only when there  are
9089       no completed matches. This means that as soon as the shortest match has
9090       been found, continuation to a new subject segment is no  longer  possi-
9091       ble. Consider again this pcretest example:
9092
9093           re> /dog(sbody)?/
9094         data> dogsb\P
9095          0: dog
9096         data> do\P\D
9097         Partial match: do
9098         data> gsb\R\P\D
9099          0: g
9100         data> dogsbody\D
9101          0: dogsbody
9102          1: dog
9103
9104       The  first  data  line passes the string "dogsb" to a standard matching
9105       function, setting the PCRE_PARTIAL_SOFT option. Although the string  is
9106       a  partial  match for "dogsbody", the result is not PCRE_ERROR_PARTIAL,
9107       because the shorter string "dog" is a complete match.  Similarly,  when
9108       the  subject  is  presented to a DFA matching function in several parts
9109       ("do" and "gsb" being the first two) the match  stops  when  "dog"  has
9110       been  found, and it is not possible to continue.  On the other hand, if
9111       "dogsbody" is presented as a single string,  a  DFA  matching  function
9112       finds both matches.
9113
9114       Because  of  these  problems,  it is best to use PCRE_PARTIAL_HARD when
9115       matching multi-segment data. The example  above  then  behaves  differ-
9116       ently:
9117
9118           re> /dog(sbody)?/
9119         data> dogsb\P\P
9120         Partial match: dogsb
9121         data> do\P\D
9122         Partial match: do
9123         data> gsb\R\P\P\D
9124         Partial match: gsb
9125
9126       5. Patterns that contain alternatives at the top level which do not all
9127       start with the  same  pattern  item  may  not  work  as  expected  when
9128       PCRE_DFA_RESTART is used. For example, consider this pattern:
9129
9130         1234|3789
9131
9132       If  the  first  part of the subject is "ABC123", a partial match of the
9133       first alternative is found at offset 3. There is no partial  match  for
9134       the second alternative, because such a match does not start at the same
9135       point in the subject string. Attempting to  continue  with  the  string
9136       "7890"  does  not  yield  a  match because only those alternatives that
9137       match at one point in the subject are remembered.  The  problem  arises
9138       because  the  start  of the second alternative matches within the first
9139       alternative. There is no problem with  anchored  patterns  or  patterns
9140       such as:
9141
9142         1234|ABCD
9143
9144       where  no  string can be a partial match for both alternatives. This is
9145       not a problem if a standard matching  function  is  used,  because  the
9146       entire match has to be rerun each time:
9147
9148           re> /1234|3789/
9149         data> ABC123\P\P
9150         Partial match: 123
9151         data> 1237890
9152          0: 3789
9153
9154       Of course, instead of using PCRE_DFA_RESTART, the same technique of re-
9155       running the entire match can also be used with the DFA  matching  func-
9156       tions.  Another  possibility  is to work with two buffers. If a partial
9157       match at offset n in the first buffer is followed by  "no  match"  when
9158       PCRE_DFA_RESTART  is  used on the second buffer, you can then try a new
9159       match starting at offset n+1 in the first buffer.
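
       The re-run technique shown in the pcretest example above (a partial
       match of "123", then a fresh match against "1237890") amounts to
       carrying the partially matched tail over into the next buffer. The
       following is a minimal sketch of that buffer handling in plain C, not
       part of PCRE. The name carry_over() and its parameters are
       hypothetical, and keep_from is assumed to be the offset at which the
       partial match started, as reported by whichever matching function is
       in use. Choosing a slightly earlier keep_from would also retain the
       characters that a lookbehind needs (item 3 above).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Slides the bytes of buf from
   keep_from onwards to the front, then appends the next segment.
   Returns the new number of bytes in buf, or (size_t)-1 if the
   arguments are inconsistent or the result would not fit in cap bytes. */
size_t carry_over(char *buf, size_t used, size_t cap, size_t keep_from,
                  const char *next, size_t next_len)
{
    size_t tail;

    if (used > cap || keep_from > used)
        return (size_t)-1;
    tail = used - keep_from;              /* bytes to retain */
    if (next_len > cap - tail)
        return (size_t)-1;                /* result would overflow buf */
    memmove(buf, buf + keep_from, tail);  /* source and target may overlap */
    memcpy(buf + tail, next, next_len);
    return tail + next_len;
}
```

       With buf holding "ABC123", keep_from set to 3, and "7890" as the next
       segment, the buffer becomes "1237890", which is the subject used in
       the second data line of the example.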
9160
9161
9162AUTHOR
9163
9164       Philip Hazel
9165       University Computing Service
9166       Cambridge CB2 3QH, England.
9167
9168
9169REVISION
9170
9171       Last updated: 02 July 2013
9172       Copyright (c) 1997-2013 University of Cambridge.
9173------------------------------------------------------------------------------
9174
9175
9176PCREPRECOMPILE(3)          Library Functions Manual          PCREPRECOMPILE(3)
9177
9178
9179
9180NAME
9181       PCRE - Perl-compatible regular expressions
9182
9183SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
9184
9185       If  you  are running an application that uses a large number of regular
9186       expression patterns, it may be useful to store them  in  a  precompiled
9187       form  instead  of  having to compile them every time the application is
9188       run.  If you are not  using  any  private  character  tables  (see  the
9189       pcre_maketables()  documentation),  this is relatively straightforward.
9190       If you are using private tables, it is a little bit  more  complicated.
9191       However,  if you are using the just-in-time optimization feature, it is
9192       not possible to save and reload the JIT data.
9193
9194       If you save compiled patterns to a file, you can copy them to a differ-
9195       ent host and run them there. If the two hosts have different endianness
9196       (byte    order),    you     should     run     the     pcre[16|32]_pat-
9197       tern_to_host_byte_order()  function  on  the  new host before trying to
9198       match the pattern. The matching functions return  PCRE_ERROR_BADENDIAN-
9199       NESS if they detect a pattern with the wrong endianness.
9200
9201       Compiling  regular  expressions with one version of PCRE for use with a
9202       different version is not guaranteed to work and may cause crashes,  and
9203       saving  and  restoring  a  compiled  pattern loses any JIT optimization
9204       data.
9205
9206
9207SAVING A COMPILED PATTERN
9208
9209       The value returned by pcre[16|32]_compile() points to a single block of
9210       memory  that  holds  the  compiled pattern and associated data. You can
9211       find   the   length   of   this   block    in    bytes    by    calling
9212       pcre[16|32]_fullinfo() with an argument of PCRE_INFO_SIZE. You can then
9213       save the data in any appropriate manner. Here is sample  code  for  the
9214       8-bit  library  that  compiles  a  pattern  and writes it to a file. It
9215       assumes that the variable fd refers to a file that is open for output:
9216
9217         int erroroffset, rc;
9218         size_t size;            /* PCRE_INFO_SIZE yields a size_t */
9219         const char *error;      /* pcre_compile() takes a const char ** */
9220         pcre *re;
9221
9222         re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
9223         if (re == NULL) { ... handle errors ... }
9224         rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
9225         if (rc < 0) { ... handle errors ... }
9226         if (fwrite(re, 1, size, fd) != size) { ... handle errors ... }
9227
9228       In this example, the bytes  that  comprise  the  compiled  pattern  are
9229       copied  exactly.  Note that this is binary data that may contain any of
9230       the 256 possible byte  values.  On  systems  that  make  a  distinction
9231       between binary and non-binary data, be sure that the file is opened for
9232       binary output.
9233
9234       If you want to write more than one pattern to a file, you will have  to
9235       devise  a  way of separating them. For binary data, preceding each pat-
9236       tern with its length is probably  the  most  straightforward  approach.
9237       Another  possibility is to write out the data in hexadecimal instead of
9238       binary, one pattern to a line.
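
       The length-prefix approach can be sketched in plain C as follows. This
       is an illustration under stated assumptions, not part of PCRE: the
       names write_record() and read_record() are hypothetical, and the
       length is stored as a host-order size_t, so the file is usable only
       on hosts with the same size_t width and byte order.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helpers, not part of PCRE. Each record is a size_t
   length (host byte order) followed by that many bytes of data. */

/* Writes one record. Returns 0 on success, -1 on a write error. */
int write_record(FILE *f, const void *data, size_t size)
{
    if (fwrite(&size, sizeof size, 1, f) != 1)
        return -1;
    if (size > 0 && fwrite(data, 1, size, f) != size)
        return -1;
    return 0;
}

/* Reads the next record into a malloc()ed block and stores its length
   in *size. Returns NULL at end of file or on any error. */
void *read_record(FILE *f, size_t *size)
{
    void *block;

    if (fread(size, sizeof *size, 1, f) != 1)
        return NULL;
    block = malloc(*size > 0 ? *size : 1);
    if (block == NULL)
        return NULL;
    if (*size > 0 && fread(block, 1, *size, f) != *size) {
        free(block);
        return NULL;
    }
    return block;
}
```

       For a compiled pattern, data and size would be the pointer returned
       by the compile function and the value obtained with PCRE_INFO_SIZE.
       The file must be opened in binary mode, as noted above.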
9239
9240       Saving compiled patterns in a file is only one possible way of  storing
9241       them  for later use. They could equally well be saved in a database, or
9242       in the memory of some daemon process that passes them  via  sockets  to
9243       the processes that want them.
9244
9245       If the pattern has been studied, it is also possible to save the normal
9246       study data in a similar way to the compiled pattern itself. However, if
9247       the PCRE_STUDY_JIT_COMPILE option was used, the just-in-time data that
9248       is created cannot be saved because it is too dependent on the  current
9249       environment.   When   studying   generates   additional   information,
9250       pcre[16|32]_study() returns  a  pointer  to  a  pcre[16|32]_extra  data
9251       block.  Its  format  is defined in the section on matching a pattern in
9252       the pcreapi documentation. The study_data field points  to  the  binary
9253       study  data,  and this is what you must save (not the pcre[16|32]_extra
9254       block itself). The length of the study data can be obtained by  calling
9255       pcre[16|32]_fullinfo()  with an argument of PCRE_INFO_STUDYSIZE. Remem-
9256       ber to check that  pcre[16|32]_study()  did  return  a  non-NULL  value
9257       before trying to save the study data.
9258
9259
9260RE-USING A PRECOMPILED PATTERN
9261
9262       Re-using  a  precompiled pattern is straightforward. Having reloaded it
9263       into main memory,  called  pcre[16|32]_pattern_to_host_byte_order()  if
9264       necessary,    you   pass   its   pointer   to   pcre[16|32]_exec()   or
9265       pcre[16|32]_dfa_exec() in the usual way.
9266
9267       However, if you passed a pointer to custom character  tables  when  the
9268       pattern  was compiled (the tableptr argument of pcre[16|32]_compile()),
9269       you  must  now  pass  a  similar  pointer  to   pcre[16|32]_exec()   or
9270       pcre[16|32]_dfa_exec(), because the value  saved  with  the  compiled
9271       pattern will obviously be nonsense. A field in a pcre[16|32]_extra block
9272       is  used  to  pass this data, as described in the section on matching a
9273       pattern in the pcreapi documentation.
9274
9275       Warning: The tables that pcre_exec() and pcre_dfa_exec()  use  must  be
9276       the same as those that were used when the pattern was compiled. If this
9277       is not the case, the behaviour is undefined.
9278
9279       If you did not provide custom character tables  when  the  pattern  was
9280       compiled, the pointer in the compiled pattern is NULL, which causes the
9281       matching functions to use PCRE's internal tables. Thus, you do not need
9282       to take any special action at run time in this case.
9283
9284       If  you  saved study data with the compiled pattern, you need to create
9285       your own pcre[16|32]_extra data block and set the study_data  field  to
9286       point   to   the   reloaded   study   data.   You  must  also  set  the
9287       PCRE_EXTRA_STUDY_DATA bit in the flags field  to  indicate  that  study
9288       data  is present. Then pass the pcre[16|32]_extra block to the matching
9289       function in the usual way. If the pattern was studied for  just-in-time
9290       optimization,  that  data  cannot  be  saved,  and  so  is  lost  by  a
9291       save/restore cycle.
9292
9293
9294COMPATIBILITY WITH DIFFERENT PCRE RELEASES
9295
9296       In general, it is safest to  recompile  all  saved  patterns  when  you
9297       update  to  a new PCRE release, though not all updates actually require
9298       this.
9299
9300
9301AUTHOR
9302
9303       Philip Hazel
9304       University Computing Service
9305       Cambridge CB2 3QH, England.
9306
9307
9308REVISION
9309
9310       Last updated: 12 November 2013
9311       Copyright (c) 1997-2013 University of Cambridge.
9312------------------------------------------------------------------------------
9313
9314
9315PCREPERFORM(3)             Library Functions Manual             PCREPERFORM(3)
9316
9317
9318
9319NAME
9320       PCRE - Perl-compatible regular expressions
9321
9322PCRE PERFORMANCE
9323
9324       Two  aspects  of performance are discussed below: memory usage and pro-
9325       cessing time. The way you express your pattern as a regular  expression
9326       can affect both of them.
9327
9328
9329COMPILED PATTERN MEMORY USAGE
9330
9331       Patterns  are compiled by PCRE into a reasonably efficient interpretive
9332       code, so that most simple patterns do not  use  much  memory.  However,
9333       there  is  one case where the memory usage of a compiled pattern can be
9334       unexpectedly large. If a parenthesized subpattern has a quantifier with
9335       a minimum greater than 1 and/or a limited maximum, the whole subpattern
9336       is repeated in the compiled code. For example, the pattern
9337
9338         (abc|def){2,4}
9339
9340       is compiled as if it were
9341
9342         (abc|def)(abc|def)((abc|def)(abc|def)?)?
9343
9344       (Technical aside: It is done this way so that backtrack  points  within
9345       each of the repetitions can be independently maintained.)
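
       The replication can be reproduced textually with a short C sketch.
       This is an illustration only, not PCRE's compiler code: the name
       expand_repeat() is hypothetical, and it simply builds the equivalent
       pattern text in the form printed above, assuming the group passed in
       is already parenthesized.

```c
#include <stddef.h>
#include <string.h>

/* Appends s to the NUL-terminated string in out (cap bytes in total).
   Returns 0 on success, -1 if there is not enough room. */
static int append(char *out, size_t cap, const char *s)
{
    size_t used = strlen(out), len = strlen(s);

    if (len >= cap - used)          /* keep one byte for the NUL */
        return -1;
    memcpy(out + used, s, len + 1);
    return 0;
}

/* Hypothetical illustration. Writes the replicated form of
   group{min,max} into out: min mandatory copies followed by
   (max - min) nested optional copies. Requires 0 <= min <= max.
   Returns 0 on success, -1 on bad arguments or if out is too small. */
int expand_repeat(const char *group, int min, int max,
                  char *out, size_t cap)
{
    int i, optional = max - min;

    if (cap == 0 || min < 0 || max < min)
        return -1;
    out[0] = '\0';
    for (i = 0; i < min; i++)                  /* mandatory copies */
        if (append(out, cap, group) < 0)
            return -1;
    for (i = 1; i < optional; i++)             /* open outer optional groups */
        if (append(out, cap, "(") < 0 || append(out, cap, group) < 0)
            return -1;
    if (optional > 0)                          /* innermost optional copy */
        if (append(out, cap, group) < 0 || append(out, cap, "?") < 0)
            return -1;
    for (i = 1; i < optional; i++)             /* close outer optional groups */
        if (append(out, cap, ")?") < 0)
            return -1;
    return 0;
}
```

       Calling expand_repeat("(abc|def)", 2, 4, out, sizeof out) produces
       the expanded pattern shown above, and makes it easy to see how the
       text grows with the difference between the two limits.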
9346
9347       For  regular expressions whose quantifiers use only small numbers, this
9348       is not usually a problem. However, if the numbers are large,  and  par-
9349       ticularly  if  such repetitions are nested, the memory usage can become
9350       an embarrassment. For example, the very simple pattern
9351
9352         ((ab){1,1000}c){1,3}
9353
9354       uses 51K bytes when compiled using the 8-bit library. When PCRE is com-
9355       piled  with  its  default  internal pointer size of two bytes, the size
9356       limit on a compiled pattern is 64K data units, and this is reached with
9357       the  above  pattern  if  the outer repetition is increased from 3 to 4.
9358       PCRE can be compiled to use larger internal pointers  and  thus  handle
9359       larger  compiled patterns, but it is better to try to rewrite your pat-
9360       tern to use less memory if you can.
9361
9362       One way of reducing the memory usage for such patterns is to  make  use
9363       of PCRE's "subroutine" facility. Re-writing the above pattern as
9364
9365         ((ab)(?2){0,999}c)(?1){0,2}
9366
9367       reduces the memory requirements to 18K, and indeed it remains under 20K
9368       even with the outer repetition increased to 100. However, this  pattern
9369       is  not  exactly equivalent, because the "subroutine" calls are treated
9370       as atomic groups into which there can be no backtracking if there is  a
9371       subsequent  matching  failure.  Therefore,  PCRE cannot do this kind of
9372       rewriting automatically.  Furthermore, there is a  noticeable  loss  of
9373       speed  when executing the modified pattern. Nevertheless, if the atomic
9374       grouping is not a problem and the loss of  speed  is  acceptable,  this
9375       kind  of  rewriting will allow you to process patterns that PCRE cannot
9376       otherwise handle.
9377
9378
9379STACK USAGE AT RUN TIME
9380
9381       When  pcre[16|32]_exec()  is  used  for  matching,  certain
9382       kinds  of  pattern  can  cause  it  to use large amounts of the process
9383       stack. In some environments the default process stack is  quite  small,
9384       and  if it runs out the result is often SIGSEGV. This issue is probably
9385       the most frequently raised problem with PCRE.  Rewriting  your  pattern
9386       can  often  help.  The  pcrestack documentation discusses this issue in
9387       detail.
9388
9389
9390PROCESSING TIME
9391
9392       Certain items in regular expression patterns are processed  more  effi-
9393       ciently than others. It is more efficient to use a character class like
9394       [aeiou]  than  a  set  of   single-character   alternatives   such   as
9395       (a|e|i|o|u).  In  general,  the simplest construction that provides the
9396       required behaviour is usually the most efficient. Jeffrey Friedl's book
9397       contains  a  lot  of useful general discussion about optimizing regular
9398       expressions for efficient performance. This  document  contains  a  few
9399       observations about PCRE.
9400
9401       Using  Unicode  character  properties  (the  \p, \P, and \X escapes) is
9402       slow, because PCRE has to use a multi-stage table  lookup  whenever  it
9403       needs  a  character's  property. If you can find an alternative pattern
9404       that does not use character properties, it will probably be faster.
9405
9406       By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
9407       character  classes  such  as  [:alpha:]  do not use Unicode properties,
9408       partly for backwards compatibility, and partly for performance reasons.
9409       However,  you can set PCRE_UCP if you want Unicode character properties
9410       to be used. This can double the matching time for  items  such  as  \d,
9411       when matched with a traditional matching function; the performance loss
9412       is less with a DFA matching function, and in both cases  there  is  not
9413       much difference for \b.
9414
9415       When  a  pattern  begins  with .* not in parentheses, or in parentheses
9416       that are not the subject of a backreference, and the PCRE_DOTALL option
9417       is  set, the pattern is implicitly anchored by PCRE, since it can match
9418       only at the start of a subject string. However, if PCRE_DOTALL  is  not
9419       set,  PCRE  cannot  make this optimization, because the . metacharacter
9420       does not then match a newline, and if the subject string contains  new-
9421       lines,  the  pattern may match from the character immediately following
9422       one of them instead of from the very start. For example, the pattern
9423
9424         .*second
9425
9426       matches the subject "first\nand second" (where \n stands for a  newline
9427       character),  with the match starting at the seventh character. In order
9428       to do this, PCRE has to retry the match starting after every newline in
9429       the subject.
9430
9431       If  you  are using such a pattern with subject strings that do not con-
9432       tain newlines, the best performance is obtained by setting PCRE_DOTALL,
9433       or  starting  the pattern with ^.* or ^.*? to indicate explicit anchor-
9434       ing. That saves PCRE from having to scan along the subject looking  for
9435       a newline to restart at.
9436
9437       Beware  of  patterns  that contain nested indefinite repeats. These can
9438       take a long time to run when applied to a string that does  not  match.
9439       Consider the pattern fragment
9440
9441         ^(a+)*
9442
9443       This  can  match "aaaa" in 16 different ways, and this number increases
9444       very rapidly as the string gets longer. (The * repeat can match  0,  1,
9445       2,  3, or 4 times, and for each of those cases other than 0 or 4, the +
9446       repeats can match different numbers of times.) When  the  remainder  of
9447       the pattern is such that the entire match is going to fail, PCRE has in
9448       principle to try  every  possible  variation,  and  this  can  take  an
9449       extremely long time, even for relatively short strings.
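
       The count of 16 can be checked with a small C function that
       enumerates the possibilities in the way the parenthetical remark
       describes. This is a sketch for illustration, not PCRE code, and the
       name count_ways() is hypothetical.

```c
/* Hypothetical illustration, not PCRE code. Counts the distinct ways
   in which ^(a+)* can match at the start of a run of n "a" characters:
   the group may stop repeating at this point (one way), or a+ may take
   len characters and the group then repeats on what is left. */
unsigned long count_ways(unsigned int n)
{
    unsigned long ways = 1;     /* stop repeating at this point */
    unsigned int len;

    for (len = 1; len <= n; len++)
        ways += count_ways(n - len);
    return ways;
}
```

       The result doubles with each additional character: count_ways(4) is
       16, and count_ways(20) is over a million, which is consistent with
       the running times described here.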
9450
9451       An optimization catches some of the more simple cases such as
9452
9453         (a+)*b
9454
9455       where  a  literal  character  follows. Before embarking on the standard
9456       matching procedure, PCRE checks that there is a "b" later in  the  sub-
9457       ject  string, and if there is not, it fails the match immediately. How-
9458       ever, when there is no following literal this  optimization  cannot  be
9459       used. You can see the difference by comparing the behaviour of
9460
9461         (a+)*\d
9462
9463       with the pattern above. When applied to a whole line of "a" characters,
9464       (a+)*b  fails  almost  instantly, whereas (a+)*\d takes an appreciable
9465       time with strings longer than about 20 characters.
9466
9467       In many cases, the solution to this kind of performance issue is to use
9468       an atomic group or a possessive quantifier.
9469
9470
9471AUTHOR
9472
9473       Philip Hazel
9474       University Computing Service
9475       Cambridge CB2 3QH, England.
9476
9477
9478REVISION
9479
9480       Last updated: 25 August 2012
9481       Copyright (c) 1997-2012 University of Cambridge.
9482------------------------------------------------------------------------------
9483
9484
9485PCREPOSIX(3)               Library Functions Manual               PCREPOSIX(3)
9486
9487
9488
9489NAME
9490       PCRE - Perl-compatible regular expressions.
9491
9492SYNOPSIS
9493
9494       #include <pcreposix.h>
9495
9496       int regcomp(regex_t *preg, const char *pattern,
9497            int cflags);
9498
9499       int regexec(regex_t *preg, const char *string,
9500            size_t nmatch, regmatch_t pmatch[], int eflags);

9501       size_t regerror(int errcode, const regex_t *preg,
9502            char *errbuf, size_t errbuf_size);
9503
9504       void regfree(regex_t *preg);
9505
9506
9507DESCRIPTION
9508
9509       This  set  of functions provides a POSIX-style API for the PCRE regular
9510       expression  8-bit  library.  See  the  pcreapi  documentation  for  a
9511       description of PCRE's native  API,  which  contains  much  additional
9512       functionality.  There is no POSIX-style wrapper for PCRE's 16-bit or
9513       32-bit libraries.
9514
9515       The functions described here are just wrapper functions that ultimately
9516       call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the
9517       pcreposix.h  header  file,  and  on  Unix systems the library itself is
9518       called pcreposix.a, so can be accessed by  adding  -lpcreposix  to  the
9519       command  for  linking  an application that uses them. Because the POSIX
9520       functions call the native ones, it is also necessary to add -lpcre.
9521
9522       I have implemented only those POSIX option bits that can be  reasonably
9523       mapped  to PCRE native options. In addition, the option REG_EXTENDED is
9524       defined with the value zero. This has no  effect,  but  since  programs
9525       that  are  written  to  the POSIX interface often use it, this makes it
9526       easier to slot in PCRE as a replacement library.  Other  POSIX  options
9527       are not even defined.
9528
9529       There  are also some other options that are not defined by POSIX. These
9530       have been added at the request of users who want to make use of certain
9531       PCRE-specific features via the POSIX calling interface.
9532
9533       When  PCRE  is  called  via these functions, it is only the API that is
9534       POSIX-like in style. The syntax and semantics of  the  regular  expres-
9535       sions  themselves  are  still  those of Perl, subject to the setting of
9536       various PCRE options, as described below. "POSIX-like in  style"  means
9537       that  the  API  approximates  to  the POSIX definition; it is not fully
9538       POSIX-compatible, and in multi-byte encoding  domains  it  is  probably
9539       even less compatible.
9540
9541       The  header for these functions is supplied as pcreposix.h to avoid any
9542       potential clash with other POSIX  libraries.  It  can,  of  course,  be
9543       renamed or aliased as regex.h, which is the "correct" name. It provides
9544       two structure types, regex_t for  compiled  internal  forms,  and  reg-
9545       match_t  for  returning  captured substrings. It also defines some con-
9546       stants whose names start  with  "REG_";  these  are  used  for  setting
9547       options and identifying error codes.
9548
9549
9550COMPILING A PATTERN
9551
9552       The  function regcomp() is called to compile a pattern into an internal
9553       form. The pattern is a C string terminated by a  binary  zero,  and  is
9554       passed  in  the  argument  pattern. The preg argument is a pointer to a
9555       regex_t structure that is used as a base for storing information  about
9556       the compiled regular expression.
9557
9558       The argument cflags is either zero, or contains one or more of the bits
9559       defined by the following macros:
9560
9561         REG_DOTALL
9562
9563       The PCRE_DOTALL option is set when the regular expression is passed for
9564       compilation to the native function. Note that REG_DOTALL is not part of
9565       the POSIX standard.
9566
9567         REG_ICASE
9568
9569       The PCRE_CASELESS option is set when the regular expression  is  passed
9570       for compilation to the native function.
9571
9572         REG_NEWLINE
9573
9574       The  PCRE_MULTILINE option is set when the regular expression is passed
9575       for compilation to the native function. Note that this does  not  mimic
9576       the  defined  POSIX  behaviour  for REG_NEWLINE (see the following sec-
9577       tion).
9578
9579         REG_NOSUB
9580
9581       The PCRE_NO_AUTO_CAPTURE option is set when the regular  expression  is
9582       passed for compilation to the native function. In addition, when a pat-
9583       tern that is compiled with this flag is passed to regexec() for  match-
9584       ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured
9585       strings are returned.
9586
9587         REG_UCP
9588
9589       The PCRE_UCP option is set when the regular expression  is  passed  for
9590       compilation  to  the  native  function. This causes PCRE to use Unicode
9591       properties when matching \d, \w,  etc.,  instead  of  just  recognizing
9592       ASCII values. Note that REG_UCP is not part of the POSIX standard.
9593
9594         REG_UNGREEDY
9595
9596       The  PCRE_UNGREEDY  option is set when the regular expression is passed
9597       for compilation to the native function. Note that REG_UNGREEDY  is  not
9598       part of the POSIX standard.
9599
9600         REG_UTF8
9601
9602       The  PCRE_UTF8  option is set when the regular expression is passed for
9603       compilation to the native function. This causes the pattern itself  and
9604       all  data  strings used for matching it to be treated as UTF-8 strings.
9605       Note that REG_UTF8 is not part of the POSIX standard.
9606
9607       In the absence of these flags, no options  are  passed  to  the  native
9608       function.  This  means  that  the  regex is compiled with PCRE default
9609       semantics. In particular, the way it handles newline characters in  the
9610       subject  string  is  the Perl way, not the POSIX way. Note that setting
9611       PCRE_MULTILINE has only some of the effects specified for  REG_NEWLINE.
9612       It  does not affect the way newlines are matched by . (they are not) or
9613       by a negative class such as [^a] (they are).
9614
9615       The yield of regcomp() is zero on success, and non-zero otherwise.  The
9616       preg structure is filled in on success, and one member of the structure
9617       is public: re_nsub contains the number of capturing subpatterns in  the
9618       regular expression. Various error codes are defined in the header file.
9619
9620       NOTE:  If  the  yield of regcomp() is non-zero, you must not attempt to
9621       use the contents of the preg structure. If, for example, you pass it to
9622       regexec(), the result is undefined and your program is likely to crash.
9623
9624
9625MATCHING NEWLINE CHARACTERS
9626
9627       This area is not simple, because POSIX and Perl take different views of
9628       things.  It is not possible to get PCRE to obey  POSIX  semantics,  but
9629       then  PCRE was never intended to be a POSIX engine. The following table
9630       lists the different possibilities for matching  newline  characters  in
9631       PCRE:
9632
9633                                 Default   Change with
9634
9635         . matches newline          no     PCRE_DOTALL
9636         newline matches [^a]       yes    not changeable
9637         $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY
9638         $ matches \n in middle     no     PCRE_MULTILINE
9639         ^ matches \n in middle     no     PCRE_MULTILINE
9640
9641       This is the equivalent table for POSIX:
9642
9643                                 Default   Change with
9644
9645         . matches newline          yes    REG_NEWLINE
9646         newline matches [^a]       yes    REG_NEWLINE
9647         $ matches \n at end        no     REG_NEWLINE
9648         $ matches \n in middle     no     REG_NEWLINE
9649         ^ matches \n in middle     no     REG_NEWLINE
9650
9651       PCRE's behaviour is the same as Perl's, except that there is no equiva-
9652       lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl,  there  is
9653       no way to stop newline from matching [^a].
9654
9655       The   default  POSIX  newline  handling  can  be  obtained  by  setting
9656       PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to  make  PCRE
9657       behave exactly as for the REG_NEWLINE action.
9658
9659
9660MATCHING A PATTERN
9661
9662       The  function  regexec()  is  called  to  match a compiled pattern preg
9663       against a given string, which is by default terminated by a  zero  byte
9664       (but  see  REG_STARTEND below), subject to the options in eflags. These
9665       can be:
9666
9667         REG_NOTBOL
9668
9669       The PCRE_NOTBOL option is set when calling the underlying PCRE matching
9670       function.
9671
9672         REG_NOTEMPTY
9673
9674       The PCRE_NOTEMPTY option is set when calling the underlying PCRE match-
9675       ing function. Note that REG_NOTEMPTY is not part of the POSIX standard.
9676       However, setting this option can give more POSIX-like behaviour in some
9677       situations.
9678
9679         REG_NOTEOL
9680
9681       The PCRE_NOTEOL option is set when calling the underlying PCRE matching
9682       function.
9683
9684         REG_STARTEND
9685
9686       The  string  is  considered to start at string + pmatch[0].rm_so and to
9687       have a terminating NUL located at string + pmatch[0].rm_eo (there  need
9688       not  actually  be  a  NUL at that location), regardless of the value of
9689       nmatch. This is a BSD extension, compatible with but not  specified  by
9690       IEEE  Standard  1003.2  (POSIX.2),  and  should be used with caution in
9691       software intended to be portable to other systems. Note that a non-zero
9692       rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
9693       of the string, not how it is matched.
9694
9695       If the pattern was compiled with the REG_NOSUB flag, no data about  any
9696       matched  strings  is  returned.  The  nmatch  and  pmatch  arguments of
9697       regexec() are ignored.
9698
9699       If the value of nmatch is zero, or if the value of pmatch is  NULL,  no
9700       data about any matched strings is returned.
9701
9702       Otherwise, the portion of the string that was matched, and also any
9703       captured substrings, are returned via the pmatch argument, which points
9704       to an array of nmatch structures of type regmatch_t, containing the
9705       members rm_so and rm_eo. These contain the offset to the first character
9706       of each substring and the offset to the first character after the end of
9707       each substring, respectively. The 0th element of the vector relates to
9708       the entire portion of the string that was matched; subsequent elements
9709       relate to the capturing subpatterns of the regular expression. Unused
9710       entries in the array have both structure members set to -1.
9711
9712       A  successful  match  yields  a  zero  return;  various error codes are
9713       defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
9714       failure code.
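
       The compile, match, and free sequence described above can be sketched
       as follows. The helper name find_groups() and the MAX_GROUPS limit
       are hypothetical. The sketch includes <regex.h>, which assumes either
       that pcreposix.h has been installed under that name or that a system
       POSIX regex library is being used; with PCRE itself, include
       pcreposix.h and link with -lpcreposix and -lpcre. REG_EXTENDED is
       passed because system libraries need it for this pattern, and it is
       defined as zero by pcreposix.h.

```c
#include <regex.h>   /* assumption: pcreposix.h under this name, or a
                        system POSIX regex header */

#define MAX_GROUPS 10   /* hypothetical limit for this sketch */

/* Hypothetical helper. Compiles pattern, matches it against subject,
   and fills pmatch with the offsets of the whole match (element 0)
   and of the captured substrings. Returns 0 on a match, otherwise the
   non-zero code from regcomp() or regexec(). */
int find_groups(const char *pattern, const char *subject,
                regmatch_t pmatch[MAX_GROUPS])
{
    regex_t re;
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;      /* re must not be used after a failed compile */
    rc = regexec(&re, subject, MAX_GROUPS, pmatch, 0);
    regfree(&re);       /* release the memory tied to re */
    return rc;
}
```

       For the pattern "([a-z]+):([0-9]+)" and the subject "ruby:1234", a
       successful call leaves rm_so and rm_eo of element 1 delimiting
       "ruby" and those of element 2 delimiting "1234".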
9715
9716
9717ERROR MESSAGES
9718
9719       The regerror() function maps a non-zero error code from either regcomp()
9720       or regexec() to a printable message. If preg is  not  NULL,  the  error
9721       should have arisen from the use of that structure. A message terminated
9722       by a binary zero is placed  in  errbuf.  The  length  of  the  message,
9723       including  the  zero, is limited to errbuf_size. The yield of the func-
9724       tion is the size of buffer needed to hold the whole message.
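
       Because the yield is the size needed for the whole message, a buffer
       of the right size can be obtained with two calls. The following is a
       minimal sketch; the helper name error_message() is hypothetical, the
       first call relies on the POSIX rule that errbuf is ignored when
       errbuf_size is zero, and <regex.h> again stands for pcreposix.h under
       that name or for a system POSIX regex header.

```c
#include <regex.h>   /* assumption: pcreposix.h under this name, or a
                        system POSIX regex header */
#include <stdlib.h>

/* Hypothetical helper. Returns a malloc()ed message for errcode, sized
   by first asking regerror() how much space the whole message needs.
   Returns NULL if the allocation fails. preg may be NULL. */
char *error_message(int errcode, const regex_t *preg)
{
    size_t needed = regerror(errcode, preg, NULL, 0);
    char *msg = malloc(needed > 0 ? needed : 1);

    if (msg == NULL)
        return NULL;
    msg[0] = '\0';
    if (needed > 0)
        regerror(errcode, preg, msg, needed);   /* includes the zero byte */
    return msg;
}
```

       The caller is responsible for passing the returned pointer to
       free().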
9725
9726
9727MEMORY USAGE
9728
9729       Compiling a regular expression causes memory to be allocated and  asso-
9730       ciated  with  the preg structure. The function regfree() frees all such
9731       memory, after which preg may no longer be used as  a  compiled  expres-
9732       sion.
9733
9734
9735AUTHOR
9736
9737       Philip Hazel
9738       University Computing Service
9739       Cambridge CB2 3QH, England.
9740
9741
9742REVISION
9743
9744       Last updated: 09 January 2012
9745       Copyright (c) 1997-2012 University of Cambridge.
9746------------------------------------------------------------------------------
9747
9748
9749PCRECPP(3)                 Library Functions Manual                 PCRECPP(3)
9750
9751
9752
9753NAME
9754       PCRE - Perl-compatible regular expressions.
9755
9756SYNOPSIS OF C++ WRAPPER
9757
9758       #include <pcrecpp.h>
9759
9760
9761DESCRIPTION
9762
9763       The  C++  wrapper  for PCRE was provided by Google Inc. Some additional
9764       functionality was added by Giuseppe Maxia. This brief man page was con-
9765       structed  from  the  notes  in the pcrecpp.h file, which should be con-
9766       sulted for further details. Note that the C++ wrapper supports only the
9767       original  8-bit  PCRE  library. There is no 16-bit or 32-bit support at
9768       present.
9769
9770
9771MATCHING INTERFACE
9772
       The "FullMatch" operation checks that the supplied text matches a
       supplied pattern exactly. If pointer arguments are supplied, it copies
       the substrings matched by sub-patterns into them.
9776
9777         Example: successful match
9778            pcrecpp::RE re("h.*o");
9779            re.FullMatch("hello");
9780
9781         Example: unsuccessful match (requires full match):
9782            pcrecpp::RE re("e");
9783            !re.FullMatch("hello");
9784
9785         Example: creating a temporary RE object:
9786            pcrecpp::RE("h.*o").FullMatch("hello");
9787
9788       You can pass in a "const char*" or a "string" for "text". The  examples
9789       below  tend to use a const char*. You can, as in the different examples
9790       above, store the RE object explicitly in a variable or use a  temporary
9791       RE  object.  The  examples below use one mode or the other arbitrarily.
9792       Either could correctly be used for any of these examples.
9793
9794       You must supply extra pointer arguments to extract matched subpieces.
9795
9796         Example: extracts "ruby" into "s" and 1234 into "i"
9797            int i;
9798            string s;
9799            pcrecpp::RE re("(\\w+):(\\d+)");
9800            re.FullMatch("ruby:1234", &s, &i);
9801
9802         Example: does not try to extract any extra sub-patterns
9803            re.FullMatch("ruby:1234", &s);
9804
9805         Example: does not try to extract into NULL
9806            re.FullMatch("ruby:1234", NULL, &i);
9807
9808         Example: integer overflow causes failure
9809            !re.FullMatch("ruby:1234567891234", NULL, &i);
9810
9811         Example: fails because there aren't enough sub-patterns:
9812            !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
9813
9814         Example: fails because string cannot be stored in integer
9815            !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
9816
9817       The provided pointer arguments can be pointers to  any  scalar  numeric
9818       type, or one of:
9819
9820          string        (matched piece is copied to string)
9821          StringPiece   (StringPiece is mutated to point to matched piece)
9822          T             (where "bool T::ParseFrom(const char*, int)" exists)
9823          NULL          (the corresponding matched sub-pattern is not copied)
9824
9825       The  function returns true iff all of the following conditions are sat-
9826       isfied:
9827
9828         a. "text" matches "pattern" exactly;
9829
9830         b. The number of matched sub-patterns is >= number of supplied
9831            pointers;
9832
         c. The "i"th argument has a suitable type for holding the
            string captured as the "i"th sub-pattern. If you pass in
            void * NULL for the "i"th argument, or a non-void * NULL
            of the correct type, or pass fewer arguments than the
            number of sub-patterns, the "i"th captured sub-pattern is
            ignored.
9839
9840       CAVEAT: An optional sub-pattern that does  not  exist  in  the  matched
9841       string  is  assigned  the  empty  string. Therefore, the following will
9842       return false (because the empty string is not a valid number):
9843
          int number;
          pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
9846
9847       The matching interface supports at most 16 arguments per call.  If  you
9848       need    more,    consider    using    the    more   general   interface
9849       pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
9850
9851       NOTE: Do not use no_arg, which is used internally to mark the end of  a
9852       list  of optional arguments, as a placeholder for missing arguments, as
9853       this can lead to segfaults.
9854
9855
9856QUOTING METACHARACTERS
9857
9858       You can use the "QuoteMeta" operation to insert backslashes before  all
9859       potentially  meaningful  characters  in  a string. The returned string,
9860       used as a regular expression, will exactly match the original string.
9861
9862         Example:
9863            string quoted = RE::QuoteMeta(unquoted);
9864
       Note that it's legal to escape a character even if it has no special
       meaning in a regular expression, so this function does that. (This
       also makes it identical to the Perl function of the same name; see
       "perldoc -f quotemeta".) For example, "1.5-2.0?" becomes
       "1\.5\-2\.0\?".
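
       The escaping rule can be sketched with the standard library alone.
       This is a simplified illustration and not the wrapper's own code: the
       name quote_meta_sketch is hypothetical, leaving bytes with the top bit
       set unescaped is an assumption made so that UTF-8 sequences stay
       intact, and embedded zero bytes are not treated specially.

```cpp
#include <string>

// Inserts a backslash before every byte that is not an ASCII letter,
// digit, or underscore. Bytes >= 0x80 are copied as they are (assumption).
std::string quote_meta_sketch(const std::string &unquoted) {
  std::string result;
  result.reserve(unquoted.size() * 2);
  for (std::string::size_type i = 0; i < unquoted.size(); ++i) {
    unsigned char c = static_cast<unsigned char>(unquoted[i]);
    bool plain = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                 (c >= '0' && c <= '9') || c == '_' || c >= 0x80;
    if (!plain) result += '\\';
    result += unquoted[i];
  }
  return result;
}
```

       Applied to "1.5-2.0?", the sketch produces the escaped form quoted
       above.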
9870
9871
9872PARTIAL MATCHES
9873
9874       You can use the "PartialMatch" operation when you want the  pattern  to
9875       match any substring of the text.
9876
9877         Example: simple search for a string:
9878            pcrecpp::RE("ell").PartialMatch("hello");
9879
9880         Example: find first number in a string:
9881            int number;
9882            pcrecpp::RE re("(\\d+)");
9883            re.PartialMatch("x*100 + 20", &number);
9884            assert(number == 100);
9885
9886
9887UTF-8 AND THE MATCHING INTERFACE
9888
       By default, pattern and text are plain text, one byte per character.
       The UTF8 flag, passed to the constructor, causes both pattern and
       string to be treated as UTF-8 text, still a byte stream but potentially
       multiple bytes per character. In practice, the text is likelier to be
       UTF-8 than the pattern, but the match returned may depend on the UTF8
       flag, so always use it when matching UTF-8 text. For example, "." will
       match one byte normally but with UTF8 set may match all the bytes (up
       to four) of a multi-byte character.
9897
9898         Example:
9899            pcrecpp::RE_Options options;
9900            options.set_utf8();
9901            pcrecpp::RE re(utf8_pattern, options);
9902            re.FullMatch(utf8_string);
9903
9904         Example: using the convenience function UTF8():
9905            pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
9906            re.FullMatch(utf8_string);
9907
9908       NOTE: The UTF8 flag is ignored if pcre was not configured with the
9909             --enable-utf8 flag.
9910
9911
9912PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE
9913
9914       PCRE defines some modifiers to  change  the  behavior  of  the  regular
9915       expression   engine.  The  C++  wrapper  defines  an  auxiliary  class,
9916       RE_Options, as a vehicle to pass such modifiers to  a  RE  class.  Cur-
9917       rently, the following modifiers are supported:
9918
          modifier              description               Perl equivalent

          PCRE_CASELESS         case insensitive match      /i
          PCRE_MULTILINE        multiple lines match        /m
          PCRE_DOTALL           dot matches newlines        /s
          PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
          PCRE_EXTRA            strict escape parsing       N/A
          PCRE_EXTENDED         ignore white spaces         /x
          PCRE_UTF8             handles UTF8 chars          built-in
          PCRE_UNGREEDY         reverses * and *?           N/A
          PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
9930
       (*) Both Perl and PCRE allow non-capturing parentheses by means of the
       "?:" modifier within the pattern itself. For example, (?:ab|cd) does
       not capture, while (ab|cd) does.

       For a full account of how each modifier works, please check the PCRE
       API reference page.
9937
       For each modifier, there are two member functions whose names are
       derived from the modifier name in lowercase, without the "PCRE_"
       prefix. For instance, PCRE_CASELESS is handled by
9941
9942         bool caseless()
9943
9944       which returns true if the modifier is set, and
9945
9946         RE_Options & set_caseless(bool)
9947
       which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
       be accessed through the set_match_limit() and match_limit() member
       functions. Setting match_limit to a non-zero value limits the execution
       of pcre, to keep it from exhausting the stack or taking an excessive
       time to return a result. A value of 5000 is good enough to stop stack
       blowup in a 2MB thread stack. Setting match_limit to zero disables
       match limiting. Alternatively, you can call set_match_limit_recursion(),
       which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
       recurses. The match limit restricts the total number of times PCRE
       calls its internal match() function; the recursion limit restricts the
       depth of internal recursion, and therefore the amount of stack that is
       used.
9959
9960       Normally, to pass one or more modifiers to a RE class,  you  declare  a
9961       RE_Options object, set the appropriate options, and pass this object to
9962       a RE constructor. Example:
9963
9964          RE_Options opt;
9965          opt.set_caseless(true);
9966          if (RE("HELLO", opt).PartialMatch("hello world")) ...
9967
       RE_Options has two constructors. The default constructor takes no
       arguments and creates a set of flags that are all off. The other takes
       an option_flags parameter, which is intended to ease the transfer of
       legacy code from C programs. This lets you do
9972
9973          RE(pattern,
9974            RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
9975
9976       However, new code is better off doing
9977
9978          RE(pattern,
9979            RE_Options().set_caseless(true).set_multiline(true))
9980              .PartialMatch(str);
9981
9982       If you are going to pass one of the most used modifiers, there are some
9983       convenience functions that return a RE_Options class with the appropri-
9984       ate  modifier  already  set: CASELESS(), UTF8(), MULTILINE(), DOTALL(),
9985       and EXTENDED().
9986
       If you need to set several options at once, and you don't want to go
       to the trouble of declaring a RE_Options object and setting each
       option separately, you can do it on the fly. You can chain several
       set_xxxxx() member functions, since each of them returns a reference
       to its class object. For example, to pass PCRE_CASELESS, PCRE_EXTENDED,
       and PCRE_MULTILINE to a RE with one statement, you may write:
9994
9995          RE(" ^ xyz \\s+ .* blah$",
9996            RE_Options()
9997              .set_caseless(true)
9998              .set_extended(true)
9999              .set_multiline(true)).PartialMatch(sometext);
10000
10001
10002SCANNING TEXT INCREMENTALLY
10003
10004       The "Consume" operation may be useful if you want to  repeatedly  match
10005       regular expressions at the front of a string and skip over them as they
10006       match. This requires use of the "StringPiece" type, which represents  a
10007       sub-range  of  a  real  string.  Like RE, StringPiece is defined in the
10008       pcrecpp namespace.
10009
10010         Example: read lines of the form "var = value" from a string.
10011            string contents = ...;                 // Fill string somehow
10012            pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
10013
10014            string var;
10015            int value;
10016            pcrecpp::RE re("(\\w+) = (\\d+)\n");
10017            while (re.Consume(&input, &var, &value)) {
10018              ...;
10019            }
10020
       Each successful call to "Consume" sets "var" and "value", and also
       advances "input" so that it points past the matched text.
10023
10024       The  "FindAndConsume"  operation  is  similar to "Consume" but does not
10025       anchor your match at the beginning of  the  string.  For  example,  you
10026       could extract all words from a string by repeatedly calling
10027
10028         pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
10029
10030
10031PARSING HEX/OCTAL/C-RADIX NUMBERS
10032
10033       By default, if you pass a pointer to a numeric value, the corresponding
10034       text is interpreted as a base-10  number.  You  can  instead  wrap  the
10035       pointer with a call to one of the operators Hex(), Octal(), or CRadix()
10036       to interpret the text in another base. The CRadix  operator  interprets
10037       C-style  "0"  (base-8)  and  "0x"  (base-16)  prefixes, but defaults to
10038       base-10.
10039
10040         Example:
10041           int a, b, c, d;
10042           pcrecpp::RE re("(.*) (.*) (.*) (.*)");
10043           re.FullMatch("100 40 0100 0x40",
10044                        pcrecpp::Octal(&a), pcrecpp::Hex(&b),
10045                        pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
10046
10047       will leave 64 in a, b, c, and d.
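
       The radix rules can be reproduced with the C library function
       strtol(), which accepts a base of 8, 10, or 16, and treats a base of 0
       as a request to honour C-style "0" and "0x" prefixes. The following is
       a sketch of that idea only; it is an assumption that it mirrors what
       the wrapper does, and the name parse_int_radix is hypothetical.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Parses the whole of `text` as an int in the given radix (0 selects
// C-style prefix detection). Returns false for empty text, trailing
// characters, or a value that does not fit in an int.
bool parse_int_radix(const char *text, int radix, int *out) {
  char *end = NULL;
  errno = 0;
  long v = std::strtol(text, &end, radix);
  if (end == text || *end != '\0') return false;   // nothing or junk parsed
  if (errno == ERANGE || v < INT_MIN || v > INT_MAX) return false;
  *out = static_cast<int>(v);
  return true;
}
```

       With this helper, "100" in radix 8, "40" in radix 16, and both "0100"
       and "0x40" in radix 0 all produce 64, while the empty string and
       "1234567891234" are rejected, matching the failure cases described
       for the matching interface.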
10048
10049
10050REPLACING PARTS OF STRINGS
10051
       You can replace the first match of "pattern" in "str" with "rewrite".
       Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
       insert the text matched by the corresponding parenthesized group of
       the pattern. \0 in "rewrite" refers to the entire matching text. For
       example:
10056
10057         string s = "yabba dabba doo";
10058         pcrecpp::RE("b+").Replace("d", &s);
10059
10060       will  leave  "s" containing "yada dabba doo". The result is true if the
10061       pattern matches and a replacement occurs, false otherwise.
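
       How a "rewrite" string is expanded can be sketched independently of
       the matching itself. The function below is a hypothetical helper, not
       part of the wrapper: it assumes the captured groups have already been
       collected into a vector (element 0 holding the entire match) and that
       a backslash followed by a non-digit stands for that character itself.

```cpp
#include <string>
#include <vector>

// Expands \0 to \9 in `rewrite` using `groups`; groups[0] is the whole
// match. A reference to a group that is not present expands to nothing.
std::string expand_rewrite(const std::string &rewrite,
                           const std::vector<std::string> &groups) {
  std::string out;
  for (std::string::size_type i = 0; i < rewrite.size(); ++i) {
    char c = rewrite[i];
    if (c == '\\' && i + 1 < rewrite.size()) {
      char next = rewrite[++i];
      if (next >= '0' && next <= '9') {
        std::string::size_type n = next - '0';
        if (n < groups.size()) out += groups[n];
      } else {
        out += next;          // e.g. two backslashes give one backslash
      }
    } else {
      out += c;               // ordinary character, copied unchanged
    }
  }
  return out;
}
```

       Given the groups "ruby:1234", "ruby", and "1234", the rewrite string
       written in C++ source as "\\2=\\1" expands to "1234=ruby".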
10062
10063       GlobalReplace is like Replace except that it replaces  all  occurrences
10064       of  the  pattern  in  the string with the rewrite. Replacements are not
10065       subject to re-matching. For example:
10066
10067         string s = "yabba dabba doo";
10068         pcrecpp::RE("b+").GlobalReplace("d", &s);
10069
10070       will leave "s" containing "yada dada doo". It  returns  the  number  of
10071       replacements made.
10072
       Extract is like Replace, except that if the pattern matches, "rewrite"
       is copied into "out" (an additional argument) with substitutions. The
       non-matching portions of "text" are ignored. It returns true iff a
       match occurred and the extraction happened successfully; if no match
       occurs, "out" is left unaffected.
10078
10079
10080AUTHOR
10081
10082       The C++ wrapper was contributed by Google Inc.
10083       Copyright (c) 2007 Google Inc.
10084
10085
10086REVISION
10087
10088       Last updated: 08 January 2012
10089------------------------------------------------------------------------------
10090
10091
10092PCRESAMPLE(3)              Library Functions Manual              PCRESAMPLE(3)
10093
10094
10095
10096NAME
10097       PCRE - Perl-compatible regular expressions
10098
10099PCRE SAMPLE PROGRAM
10100
10101       A simple, complete demonstration program, to get you started with using
10102       PCRE, is supplied in the file pcredemo.c in the  PCRE  distribution.  A
10103       listing  of this program is given in the pcredemo documentation. If you
10104       do not have a copy of the PCRE distribution, you can save this  listing
10105       to re-create pcredemo.c.
10106
10107       The  demonstration program, which uses the original PCRE 8-bit library,
10108       compiles the regular expression that is its first argument, and matches
10109       it  against  the subject string in its second argument. No PCRE options
10110       are set, and default character tables are used. If  matching  succeeds,
10111       the  program  outputs the portion of the subject that matched, together
10112       with the contents of any captured substrings.
10113
10114       If the -g option is given on the command line, the program then goes on
10115       to check for further matches of the same regular expression in the same
10116       subject string. The logic is a little bit tricky because of the  possi-
10117       bility  of  matching an empty string. Comments in the code explain what
10118       is going on.
10119
10120       If PCRE is installed in the standard include  and  library  directories
10121       for your operating system, you should be able to compile the demonstra-
10122       tion program using this command:
10123
10124         gcc -o pcredemo pcredemo.c -lpcre
10125
10126       If PCRE is installed elsewhere, you may need to add additional  options
10127       to  the  command line. For example, on a Unix-like system that has PCRE
10128       installed in /usr/local, you  can  compile  the  demonstration  program
10129       using a command like this:
10130
10131         gcc -o pcredemo -I/usr/local/include pcredemo.c \
10132             -L/usr/local/lib -lpcre
10133
10134       In  a  Windows  environment, if you want to statically link the program
10135       against a non-dll pcre.a file, you must uncomment the line that defines
10136       PCRE_STATIC  before  including  pcre.h, because otherwise the pcre_mal-
10137       loc()   and   pcre_free()   exported   functions   will   be   declared
10138       __declspec(dllimport), with unwanted results.
10139
10140       Once  you  have  compiled and linked the demonstration program, you can
10141       run simple tests like this:
10142
10143         ./pcredemo 'cat|dog' 'the cat sat on the mat'
10144         ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
10145
       Note that there is a much more comprehensive test program, called
       pcretest, which supports many more facilities for testing regular
       expressions and all of the PCRE libraries. The pcredemo program is
       provided as a simple coding example.
10150
10151       If  you  try to run pcredemo when PCRE is not installed in the standard
10152       library directory, you may get an error like  this  on  some  operating
10153       systems (e.g. Solaris):
10154
         ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory
10157
10158       This is caused by the way shared library support works  on  those  sys-
10159       tems. You need to add
10160
10161         -R/usr/local/lib
10162
10163       (for example) to the compile command to get round this problem.
10164
10165
10166AUTHOR
10167
10168       Philip Hazel
10169       University Computing Service
10170       Cambridge CB2 3QH, England.
10171
10172
10173REVISION
10174
10175       Last updated: 10 January 2012
10176       Copyright (c) 1997-2012 University of Cambridge.
10177------------------------------------------------------------------------------
10178PCRELIMITS(3)              Library Functions Manual              PCRELIMITS(3)
10179
10180
10181
10182NAME
10183       PCRE - Perl-compatible regular expressions
10184
10185SIZE AND OTHER LIMITATIONS
10186
10187       There  are some size limitations in PCRE but it is hoped that they will
10188       never in practice be relevant.
10189
10190       The maximum length of a compiled  pattern  is  approximately  64K  data
10191       units  (bytes  for  the  8-bit  library,  16-bit  units  for the 16-bit
10192       library, and 32-bit units for the 32-bit library) if PCRE  is  compiled
10193       with  the default internal linkage size, which is 2 bytes for the 8-bit
10194       and 16-bit libraries, and 4 bytes for the 32-bit library. If  you  want
10195       to process regular expressions that are truly enormous, you can compile
10196       PCRE with an internal linkage size of 3 or 4 (when building the  16-bit
10197       or  32-bit  library,  3 is rounded up to 4). See the README file in the
10198       source distribution and the pcrebuild  documentation  for  details.  In
10199       these  cases  the limit is substantially larger.  However, the speed of
10200       execution is slower.
10201
10202       All values in repeating quantifiers must be less than 65536.
10203
10204       There is no limit to the number of parenthesized subpatterns, but there
10205       can  be  no more than 65535 capturing subpatterns. There is, however, a
10206       limit to the depth of  nesting  of  parenthesized  subpatterns  of  all
10207       kinds.  This  is  imposed  in order to limit the amount of system stack
10208       used at compile time. The limit can be specified when  PCRE  is  built;
10209       the default is 250.
10210
10211       There is a limit to the number of forward references to subsequent sub-
10212       patterns of around 200,000.  Repeated  forward  references  with  fixed
10213       upper  limits,  for example, (?2){0,100} when subpattern number 2 is to
10214       the right, are included in the count. There is no limit to  the  number
10215       of backward references.
10216
10217       The maximum length of name for a named subpattern is 32 characters, and
10218       the maximum number of named subpatterns is 10000.
10219
10220       The maximum length of a  name  in  a  (*MARK),  (*PRUNE),  (*SKIP),  or
10221       (*THEN)  verb is 255 for the 8-bit library and 65535 for the 16-bit and
10222       32-bit libraries.
10223
10224       The maximum length of a subject string is the largest  positive  number
10225       that  an integer variable can hold. However, when using the traditional
10226       matching function, PCRE uses recursion to handle subpatterns and indef-
10227       inite  repetition.  This means that the available stack space may limit
10228       the size of a subject string that can be processed by certain patterns.
10229       For a discussion of stack issues, see the pcrestack documentation.
10230
10231
10232AUTHOR
10233
10234       Philip Hazel
10235       University Computing Service
10236       Cambridge CB2 3QH, England.
10237
10238
10239REVISION
10240
10241       Last updated: 05 November 2013
10242       Copyright (c) 1997-2013 University of Cambridge.
10243------------------------------------------------------------------------------
10244
10245
10246PCRESTACK(3)               Library Functions Manual               PCRESTACK(3)
10247
10248
10249
10250NAME
10251       PCRE - Perl-compatible regular expressions
10252
10253PCRE DISCUSSION OF STACK USAGE
10254
10255       When  you call pcre[16|32]_exec(), it makes use of an internal function
10256       called match(). This calls itself recursively at branch points  in  the
10257       pattern,  in  order  to  remember the state of the match so that it can
10258       back up and try a different alternative if  the  first  one  fails.  As
10259       matching proceeds deeper and deeper into the tree of possibilities, the
10260       recursion depth increases. The match() function is also called in other
10261       circumstances,  for  example,  whenever  a parenthesized sub-pattern is
10262       entered, and in certain cases of repetition.
10263
10264       Not all calls of match() increase the recursion depth; for an item such
10265       as  a* it may be called several times at the same level, after matching
10266       different numbers of a's. Furthermore, in a number of cases  where  the
10267       result  of  the  recursive call would immediately be passed back as the
10268       result of the current call (a "tail recursion"), the function  is  just
10269       restarted instead.
10270
10271       The  above  comments apply when pcre[16|32]_exec() is run in its normal
10272       interpretive  manner.   If   the   pattern   was   studied   with   the
10273       PCRE_STUDY_JIT_COMPILE  option, and just-in-time compiling was success-
10274       ful, and the options passed to pcre[16|32]_exec() were  not  incompati-
10275       ble,  the  matching  process  uses the JIT-compiled code instead of the
10276       match() function. In this case, the  memory  requirements  are  handled
10277       entirely differently. See the pcrejit documentation for details.
10278
10279       The  pcre[16|32]_dfa_exec()  function operates in an entirely different
10280       way, and uses recursion only when there is a regular expression  recur-
10281       sion or subroutine call in the pattern. This includes the processing of
10282       assertion and "once-only" subpatterns, which are handled  like  subrou-
10283       tine  calls.  Normally, these are never very deep, and the limit on the
10284       complexity of pcre[16|32]_dfa_exec() is controlled  by  the  amount  of
10285       workspace  it is given.  However, it is possible to write patterns with
10286       runaway    infinite    recursions;    such    patterns    will    cause
10287       pcre[16|32]_dfa_exec()  to  run  out  of stack. At present, there is no
10288       protection against this.
10289
10290       The comments that follow do NOT apply to  pcre[16|32]_dfa_exec();  they
10291       are relevant only for pcre[16|32]_exec() without the JIT optimization.
10292
10293   Reducing pcre[16|32]_exec()'s stack usage
10294
10295       Each  time  that match() is actually called recursively, it uses memory
10296       from the process stack. For certain kinds of  pattern  and  data,  very
10297       large  amounts of stack may be needed, despite the recognition of "tail
10298       recursion".  You can often reduce the amount of recursion,  and  there-
10299       fore  the  amount of stack used, by modifying the pattern that is being
10300       matched. Consider, for example, this pattern:
10301
10302         ([^<]|<(?!inet))+
10303
10304       It matches from wherever it starts until it encounters "<inet"  or  the
10305       end  of  the  data,  and is the kind of pattern that might be used when
10306       processing an XML file. Each iteration of the outer parentheses matches
10307       either  one  character that is not "<" or a "<" that is not followed by
10308       "inet". However, each time a  parenthesis  is  processed,  a  recursion
10309       occurs, so this formulation uses a stack frame for each matched charac-
10310       ter. For a long string, a lot of stack is required. Consider  now  this
10311       rewritten pattern, which matches exactly the same strings:
10312
10313         ([^<]++|<(?!inet))+
10314
10315       This  uses very much less stack, because runs of characters that do not
10316       contain "<" are "swallowed" in one item inside the parentheses.  Recur-
10317       sion  happens  only when a "<" character that is not followed by "inet"
10318       is encountered (and we assume this is relatively  rare).  A  possessive
10319       quantifier  is  used  to stop any backtracking into the runs of non-"<"
10320       characters, but that is not related to stack usage.
10321
10322       This example shows that one way of avoiding stack problems when  match-
10323       ing long subject strings is to write repeated parenthesized subpatterns
10324       to match more than one character whenever possible.
10325
10326   Compiling PCRE to use heap instead of stack for pcre[16|32]_exec()
10327
10328       In environments where stack memory is constrained, you  might  want  to
10329       compile  PCRE to use heap memory instead of stack for remembering back-
10330       up points when pcre[16|32]_exec() is running. This makes it run  a  lot
10331       more slowly, however.  Details of how to do this are given in the pcre-
10332       build documentation. When built in  this  way,  instead  of  using  the
10333       stack,  PCRE obtains and frees memory by calling the functions that are
10334       pointed to by the pcre[16|32]_stack_malloc  and  pcre[16|32]_stack_free
10335       variables.  By default, these point to malloc() and free(), but you can
10336       replace the pointers to cause PCRE to use your own functions. Since the
10337       block sizes are always the same, and are always freed in reverse order,
10338       it may be possible to implement customized  memory  handlers  that  are
10339       more efficient than the standard functions.
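
       As an illustration of such a customized handler, the sketch below
       keeps freed blocks on a last-in, first-out list and hands them out
       again instead of returning to the system allocator each time. All of
       the names are hypothetical, the code is not thread-safe, and it is
       only an assumption that a cache of this kind gives a worthwhile gain
       on a particular system.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

namespace {
// Each block carries a header recording its usable size. The union pads
// the header so that the memory following it stays suitably aligned.
union BlockHeader {
  std::size_t size;
  std::max_align_t align;
};
std::vector<BlockHeader *> spare_blocks;   // freed blocks, most recent last
}

// Same shape as malloc(): reuse the most recently freed block if it is
// big enough, otherwise fall back to the system allocator.
void *frame_alloc(std::size_t size) {
  if (!spare_blocks.empty() && spare_blocks.back()->size >= size) {
    BlockHeader *h = spare_blocks.back();
    spare_blocks.pop_back();
    return h + 1;
  }
  BlockHeader *h =
      static_cast<BlockHeader *>(std::malloc(sizeof(BlockHeader) + size));
  if (h == NULL) return NULL;
  h->size = size;
  return h + 1;
}

// Same shape as free(): keep the block for reuse instead of releasing it.
void frame_free(void *block) {
  if (block == NULL) return;
  spare_blocks.push_back(static_cast<BlockHeader *>(block) - 1);
}

// Returns every cached block to the system, for use once matching is done.
void frame_cache_release() {
  for (std::size_t i = 0; i < spare_blocks.size(); ++i)
    std::free(spare_blocks[i]);
  spare_blocks.clear();
}
```

       In a build of PCRE that uses the heap for recursion, functions of
       this shape would be installed by pointing the stack_malloc and
       stack_free variables named above at them before matching starts.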
10340
10341   Limiting pcre[16|32]_exec()'s stack usage
10342
10343       You  can set limits on the number of times that match() is called, both
10344       in total and recursively. If a limit  is  exceeded,  pcre[16|32]_exec()
10345       returns  an  error code. Setting suitable limits should prevent it from
10346       running out of stack. The default values of the limits are very  large,
10347       and  unlikely  ever to operate. They can be changed when PCRE is built,
10348       and they can also be set when pcre[16|32]_exec() is called. For details
10349       of these interfaces, see the pcrebuild documentation and the section on
10350       extra data for pcre[16|32]_exec() in the pcreapi documentation.
10351
       As a very rough rule of thumb, you should reckon on about 500 bytes per
       recursion. Thus, if you want to limit your stack usage to 8MB, you
       should set the limit at 16000 recursions. A 64MB stack, on the other
       hand, can support around 128000 recursions.
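
       The arithmetic behind this rule of thumb is a single division, shown
       here as a small helper. The function name and the idea of holding some
       stack back for the rest of the program are assumptions made for
       illustration; the 500-byte frame size is the rough figure quoted above
       and should be replaced by a measured value where one is available.

```cpp
#include <cstddef>

// Largest recursion depth whose frames fit in `stack_bytes`, after
// keeping `reserve_bytes` back for everything else on the stack.
unsigned long recursion_limit_for(std::size_t stack_bytes,
                                  std::size_t frame_bytes,
                                  std::size_t reserve_bytes) {
  if (frame_bytes == 0 || stack_bytes <= reserve_bytes) return 0;
  return static_cast<unsigned long>(
      (stack_bytes - reserve_bytes) / frame_bytes);
}
```

       For an 8MB stack with nothing reserved, the helper gives 16777 when
       the frame size is 500 bytes and 13107 when it is 640 bytes. The chosen
       value is the kind of number that would be supplied as the recursion
       limit when pcre[16|32]_exec() is called.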
10356
10357       In Unix-like environments, the pcretest test program has a command line
10358       option (-S) that can be used to increase the size of its stack. As long
10359       as  the  stack is large enough, another option (-M) can be used to find
10360       the smallest limits that allow a particular pattern to  match  a  given
10361       subject  string.  This is done by calling pcre[16|32]_exec() repeatedly
10362       with different limits.
10363
10364   Obtaining an estimate of stack usage
10365
10366       The actual amount of stack used per recursion can  vary  quite  a  lot,
10367       depending on the compiler that was used to build PCRE and the optimiza-
10368       tion or debugging options that were set for it. The rule of thumb value
10369       of  500  bytes  mentioned  above  may be larger or smaller than what is
10370       actually needed. A better approximation can be obtained by running this
10371       command:
10372
10373         pcretest -m -C
10374
10375       The  -C  option causes pcretest to output information about the options
10376       with which PCRE was compiled. When -m is also given (before -C), infor-
10377       mation about stack use is given in a line like this:
10378
10379         Match recursion uses stack: approximate frame size = 640 bytes
10380
10381       The value is approximate because some recursions need a bit more (up to
10382       perhaps 16 more bytes).
10383
10384       If the above command is given when PCRE is compiled  to  use  the  heap
10385       instead  of  the  stack  for recursion, the value that is output is the
10386       size of each block that is obtained from the heap.
10387
10388   Changing stack size in Unix-like systems
10389
       In Unix-like environments, there is not often a problem with the stack
       unless very long strings are involved, though the default limit on
       stack size varies from system to system. Values from 8MB to 64MB are
       common. You can find your default limit by running the command:
10394
10395         ulimit -s
10396
10397       Unfortunately,  the  effect  of  running out of stack is often SIGSEGV,
10398       though sometimes a more explicit error message is given. You  can  nor-
10399       mally increase the limit on stack size by code such as this:
10400
         #include <sys/resource.h>

         struct rlimit rlim;
         getrlimit(RLIMIT_STACK, &rlim);      /* read soft and hard limits */
         rlim.rlim_cur = 100*1024*1024;       /* ask for a 100MB soft limit */
         setrlimit(RLIMIT_STACK, &rlim);
10405
       This reads the current limits (soft and hard) using getrlimit(), then
       attempts to increase the soft limit to 100MB using setrlimit(). The
       attempt fails if the requested value exceeds the hard limit. You must
       do this before calling pcre[16|32]_exec().
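
       A slightly more defensive variant checks the results of both calls
       and never lowers a limit that is already high enough. This is a
       sketch under the assumption of a POSIX system that provides
       <sys/resource.h>; the function name ensure_stack_limit is
       hypothetical.

```cpp
#include <sys/resource.h>

// Makes sure the soft stack limit is at least `wanted` bytes. Returns
// true if the limit is already sufficient or was raised successfully.
bool ensure_stack_limit(rlim_t wanted) {
  struct rlimit rlim;
  if (getrlimit(RLIMIT_STACK, &rlim) != 0) return false;

  // Nothing to do if the soft limit is unlimited or already big enough.
  if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted) return true;

  // The soft limit cannot be raised above the hard limit.
  if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max) return false;

  rlim.rlim_cur = wanted;
  return setrlimit(RLIMIT_STACK, &rlim) == 0;
}
```

       A program would call, for example, ensure_stack_limit(100*1024*1024)
       early in main() and fall back to stricter recursion limits if the
       call returns false.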
10409
10410   Changing stack size in Mac OS X
10411
10412       Using setrlimit(), as described above, should also work on Mac OS X. It
10413       is also possible to set a stack size when linking a program. There is a
10414       discussion   about   stack  sizes  in  Mac  OS  X  at  this  web  site:
10415       http://developer.apple.com/qa/qa2005/qa1419.html.
10416
10417
10418AUTHOR
10419
10420       Philip Hazel
10421       University Computing Service
10422       Cambridge CB2 3QH, England.
10423
10424
10425REVISION
10426
10427       Last updated: 24 June 2012
10428       Copyright (c) 1997-2012 University of Cambridge.
10429------------------------------------------------------------------------------
10430
10431
10432