-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcredemo program. There are separate text files for the pcregrep and
pcretest commands.
-----------------------------------------------------------------------------


PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions


INTRODUCTION

       The PCRE library is a set of functions that implement regular
       expression pattern matching using the same syntax and semantics as
       Perl, with just a few differences. Some features that appeared in
       Python and PCRE before they appeared in Perl are also available using
       the Python syntax, there is some support for one or two .NET and
       Oniguruma syntax items, and there is an option for requesting some
       minor changes that give better JavaScript compatibility.

       Starting with release 8.30, it is possible to compile two separate PCRE
       libraries: the original, which supports 8-bit character strings
       (including UTF-8 strings), and a second library that supports 16-bit
       character strings (including UTF-16 strings). The build process allows
       either one or both to be built. The majority of the work to make this
       possible was done by Zoltan Herczeg.

       Starting with release 8.32 it is possible to compile a third separate
       PCRE library, which supports 32-bit character strings (including UTF-32
       strings). The build process allows any set of the 8-, 16- and 32-bit
       libraries. The work to make this possible was done by Christian Persch.

       The three libraries contain identical sets of functions, except that
       the names in the 16-bit library start with pcre16_ instead of pcre_,
       and the names in the 32-bit library start with pcre32_ instead of
       pcre_. To avoid over-complication and reduce the documentation
       maintenance load, most of the documentation describes the 8-bit
       library, with the differences for the 16-bit and 32-bit libraries
       described separately in the pcre16 and pcre32 pages. References to
       functions or structures of the form pcre[16|32]_xxx should be read as
       meaning "pcre_xxx when using the 8-bit library, pcre16_xxx when using
       the 16-bit library, or pcre32_xxx when using the 32-bit library".

       The current implementation of PCRE corresponds approximately with Perl
       5.12, including support for UTF-8/16/32 encoded strings and Unicode
       general category properties. However, UTF-8/16/32 and Unicode support
       has to be explicitly enabled; it is not the default. The Unicode tables
       correspond to Unicode release 6.2.0.

       In addition to the Perl-compatible matching function, PCRE contains an
       alternative function that matches the same compiled patterns in a
       different way. In certain circumstances, the alternative function has
       some advantages. For a discussion of the two matching algorithms, see
       the pcrematching page.

       PCRE is written in C and released as a C library. A number of people
       have written wrappers and interfaces of various kinds. In particular,
       Google Inc. have provided a comprehensive C++ wrapper for the 8-bit
       library. This is now included as part of the PCRE distribution. The
       pcrecpp page has details of this interface.
       Other people's contributions can be found in the Contrib directory at
       the primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details of exactly which Perl regular expression features are and are
       not supported by PCRE are given in separate documents. See the
       pcrepattern and pcrecompat pages. There is a syntax summary in the
       pcresyntax page.

       Some features of PCRE can be included, excluded, or changed when the
       library is built. The pcre_config() function makes it possible for a
       client to discover which features are available. The features
       themselves are described in the pcrebuild page. Documentation about
       building PCRE for various operating systems can be found in the README
       and NON-AUTOTOOLS-BUILD files in the source distribution.

       The libraries contain a number of undocumented internal functions and
       data tables that are used by more than one of the exported external
       functions, but which are not intended for use by external callers.
       Their names all begin with "_pcre_" or "_pcre16_" or "_pcre32_", which
       hopefully will not provoke any name clashes. In some environments, it
       is possible to control which external symbols are exported when a
       shared library is built, and in these cases the undocumented symbols
       are not exported.


SECURITY CONSIDERATIONS

       If you are using PCRE in a non-UTF application that permits users to
       supply arbitrary patterns for compilation, you should be aware of a
       feature that allows users to turn on UTF support from within a pattern,
       provided that PCRE was built with UTF support. For example, an 8-bit
       pattern that begins with "(*UTF8)" or "(*UTF)" turns on UTF-8 mode,
       which interprets patterns and subjects as strings of UTF-8 characters
       instead of individual 8-bit characters. This causes both the pattern
       and any data against which it is matched to be checked for UTF-8
       validity.
       If the data string is very long, such a check might use sufficiently
       many resources as to cause your application to lose performance.

       The best way of guarding against this possibility is to use the
       pcre_fullinfo() function to check the compiled pattern's options for
       UTF.

       If your application is one that supports UTF, be aware that validity
       checking can take time. If the same data string is to be matched many
       times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second
       and subsequent matches to save redundant checks.

       Another way that performance can be hit is by running a pattern that
       has a very large search tree against a string that will never match.
       Nested unlimited repeats in a pattern are a common example. PCRE
       provides some protection against this: see the PCRE_EXTRA_MATCH_LIMIT
       feature in the pcreapi page.


USER DOCUMENTATION

       The user documentation for PCRE comprises a number of different
       sections. In the "man" format, each of these is a separate "man page".
       In the HTML format, each is a separate page, linked from the index
       page. In the plain text format, all the sections, except the pcredemo
       section, are concatenated, for ease of searching.
       The sections are as follows:

         pcre              this document
         pcre16            details of the 16-bit library
         pcre32            details of the 32-bit library
         pcre-config       show PCRE installation configuration information
         pcreapi           details of PCRE's native C API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcrecpp           details of the C++ wrapper for the 8-bit library
         pcredemo          a demonstration C program that uses PCRE
         pcregrep          description of the pcregrep command (8-bit only)
         pcrejit           discussion of the just-in-time optimization support
         pcrelimits        details of size and other limits
         pcrematching      discussion of the two matching algorithms
         pcrepartial       details of the partial matching facility
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible C API for the 8-bit library
         pcreprecompile    details of saving and re-using precompiled patterns
         pcresample        discussion of the pcredemo program
         pcrestack         discussion of stack usage
         pcresyntax        quick syntax reference
         pcretest          description of the pcretest testing command
         pcreunicode       discussion of Unicode and UTF-8/16/32 support

       In addition, in the "man" and HTML formats, there is a short page for
       each C library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam magnet,
       so I've taken it away. If you want to email me, use my two initials,
       followed by the two digits 10, at the domain cam.ac.uk.


REVISION

       Last updated: 11 November 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------


PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

       #include <pcre.h>


PCRE 16-BIT API BASIC FUNCTIONS

       pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
            int *errorcodeptr,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre16_extra *pcre16_study(const pcre16 *code, int options,
            const char **errptr);

       void pcre16_free_study(pcre16_extra *extra);

       int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
            PCRE_SPTR16 subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
            PCRE_SPTR16 subject, int length, int startoffset,
            int options, int *ovector, int ovecsize,
            int *workspace, int wscount);


PCRE 16-BIT API STRING EXTRACTION FUNCTIONS

       int pcre16_copy_named_substring(const pcre16 *code,
            PCRE_SPTR16 subject, int *ovector,
            int stringcount, PCRE_SPTR16 stringname,
            PCRE_UCHAR16 *buffer, int buffersize);

       int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
            int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
            int buffersize);

       int pcre16_get_named_substring(const pcre16 *code,
            PCRE_SPTR16 subject, int *ovector,
            int stringcount, PCRE_SPTR16 stringname,
            PCRE_SPTR16 *stringptr);

       int pcre16_get_stringnumber(const pcre16 *code,
            PCRE_SPTR16 name);

       int pcre16_get_stringtable_entries(const pcre16 *code,
            PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);

       int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
            int stringcount, int stringnumber,
            PCRE_SPTR16 *stringptr);

       int pcre16_get_substring_list(PCRE_SPTR16 subject,
            int *ovector, int stringcount, PCRE_SPTR16 **listptr);

       void pcre16_free_substring(PCRE_SPTR16 stringptr);

       void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);


PCRE 16-BIT API AUXILIARY FUNCTIONS

       pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);

       void pcre16_jit_stack_free(pcre16_jit_stack *stack);

       void pcre16_assign_jit_stack(pcre16_extra *extra,
            pcre16_jit_callback callback, void *data);

       const unsigned char *pcre16_maketables(void);

       int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
            int what, void *where);

       int pcre16_refcount(pcre16 *code, int adjust);

       int pcre16_config(int what, void *where);

       const char *pcre16_version(void);

       int pcre16_pattern_to_host_byte_order(pcre16 *code,
            pcre16_extra *extra, const unsigned char *tables);


PCRE 16-BIT API INDIRECTED FUNCTIONS

       void *(*pcre16_malloc)(size_t);

       void (*pcre16_free)(void *);

       void *(*pcre16_stack_malloc)(size_t);

       void (*pcre16_stack_free)(void *);

       int (*pcre16_callout)(pcre16_callout_block *);


PCRE 16-BIT API 16-BIT-ONLY FUNCTION

       int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
            PCRE_SPTR16 input, int length, int *byte_order,
            int keep_boms);


THE PCRE 16-BIT LIBRARY

       Starting with release 8.30, it is possible to compile a PCRE library
       that supports 16-bit character strings, including UTF-16 strings, as
       well as or instead of the original 8-bit library. The majority of the
       work to make this possible was done by Zoltan Herczeg. The two
       libraries contain identical sets of functions, used in exactly the same
       way. Only the names of the functions and the data types of their
       arguments and results are different.
       To avoid over-complication and reduce the documentation maintenance
       load, most of the PCRE documentation describes the 8-bit library, with
       only occasional references to the 16-bit library. This page describes
       what is different when you use the 16-bit library.

       WARNING: A single application can be linked with both libraries, but
       you must take care when processing any particular pattern to use
       functions from just one library. For example, if you want to study a
       pattern that was compiled with pcre16_compile(), you must do so with
       pcre16_study(), not pcre_study(), and you must free the study data with
       pcre16_free_study().


THE HEADER FILE

       There is only one header file, pcre.h. It contains prototypes for all
       the functions in all libraries, as well as definitions of flags,
       structures, error codes, etc.


THE LIBRARY NAME

       In Unix-like systems, the 16-bit library is called libpcre16, and can
       normally be accessed by adding -lpcre16 to the command for linking an
       application that uses PCRE.


STRING TYPES

       In the 8-bit library, strings are passed to PCRE library functions as
       vectors of bytes with the C type "char *". In the 16-bit library,
       strings are passed as vectors of unsigned 16-bit quantities. The macro
       PCRE_UCHAR16 specifies an appropriate data type, and PCRE_SPTR16 is
       defined as "const PCRE_UCHAR16 *". In very many environments, "short
       int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
       as "unsigned short int", but checks that it really is a 16-bit data
       type. If it is not, the build fails with an error message telling the
       maintainer to modify the definition appropriately.


STRUCTURE TYPES

       The types of the opaque structures that are used for compiled 16-bit
       patterns and JIT stacks are pcre16 and pcre16_jit_stack respectively.
       The type of the user-accessible structure that is returned by
       pcre16_study() is pcre16_extra, and the type of the structure that is
       used for passing data to a callout function is pcre16_callout_block.
       These structures contain the same fields, with the same names, as their
       8-bit counterparts. The only difference is that pointers to character
       strings are 16-bit instead of 8-bit types.


16-BIT FUNCTIONS

       For every function in the 8-bit library there is a corresponding
       function in the 16-bit library with a name that starts with pcre16_
       instead of pcre_. The prototypes are listed above. In addition, there
       is one extra function, pcre16_utf16_to_host_byte_order(). This is a
       utility function that converts a UTF-16 character string to host byte
       order if necessary. The other 16-bit functions expect the strings they
       are passed to be in host byte order.

       The input and output arguments of pcre16_utf16_to_host_byte_order() may
       point to the same address, that is, conversion in place is supported.
       The output buffer must be at least as long as the input.

       The length argument specifies the number of 16-bit data units in the
       input string; a negative value specifies a zero-terminated string.

       If byte_order is NULL, it is assumed that the string starts off in host
       byte order. This may be changed by byte-order marks (BOMs) anywhere in
       the string (commonly as the first character).

       If byte_order is not NULL, a non-zero value of the integer to which it
       points means that the input starts off in host byte order, otherwise
       the opposite order is assumed. Again, BOMs in the string can change
       this. The final byte order is passed back at the end of processing.

       If keep_boms is not zero, byte-order mark characters (0xfeff) are
       copied into the output string. Otherwise they are discarded.
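
       The conversion rules given so far can be illustrated with a small
       stand-alone C sketch. It is not the library's implementation: the name
       sketch_utf16_to_host_order() and the use of uint16_t in place of
       PCRE_UCHAR16 are assumptions made only for this example. The sketch
       relies on the fact that a BOM read as 0xfeff is already in host order,
       whereas one read as 0xfffe is byte-swapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pcre16_utf16_to_host_byte_order(). It follows
   the rules described in the text but is not PCRE's own code. */
int sketch_utf16_to_host_order(uint16_t *output, const uint16_t *input,
                               int length, int *byte_order, int keep_boms)
{
    /* A NULL byte_order means "assume host order to begin with". */
    int host_order = (byte_order != NULL) ? (*byte_order != 0) : 1;
    uint16_t *out = output;

    assert(output != NULL && input != NULL);

    /* A negative length means zero-terminated; include the terminator. */
    if (length < 0) {
        length = 0;
        while (input[length] != 0) length++;
        length++;
    }

    for (int i = 0; i < length; i++) {
        uint16_t c = input[i];
        if (c == 0xfeff || c == 0xfffe) {
            /* A BOM anywhere in the string resets the assumed order. */
            host_order = (c == 0xfeff);
            if (keep_boms != 0) *out++ = 0xfeff;
        } else {
            /* Swap the two bytes only when the input is in the other order. */
            *out++ = host_order ? c : (uint16_t)((c >> 8) | (c << 8));
        }
    }

    /* Pass the final byte order back to the caller, if requested. */
    if (byte_order != NULL) *byte_order = host_order;
    return (int)(out - output);
}
```

       Because each unit is read before the corresponding output position is
       written, and the output never grows faster than the input is consumed,
       the sketch also works when output and input point to the same buffer.
       With the zero-terminated input 0xfffe 0x4100 0x4200 and keep_boms set
       to zero, it writes 0x0041 0x0042 0x0000.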

       The result of the function is the number of 16-bit units placed into
       the output buffer, including the zero terminator if the string was
       zero-terminated.


SUBJECT STRING OFFSETS

       The offsets within subject strings that are returned by the matching
       functions are in 16-bit units rather than bytes.


NAMED SUBPATTERNS

       The name-to-number translation table that is maintained for named
       subpatterns uses 16-bit characters. The
       pcre16_get_stringtable_entries() function returns the length of each
       entry in the table as the number of 16-bit data units.


OPTION NAMES

       There are two new general option names, PCRE_UTF16 and
       PCRE_NO_UTF16_CHECK, which correspond to PCRE_UTF8 and
       PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options
       define the same bits in the options word. There is a discussion about
       the validity of UTF-16 strings in the pcreunicode page.

       For the pcre16_config() function there is an option PCRE_CONFIG_UTF16
       that returns 1 if UTF-16 support is configured, otherwise 0. If this
       option is given to pcre_config() or pcre32_config(), or if the
       PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF32 option is given to
       pcre16_config(), the result is the PCRE_ERROR_BADOPTION error.


CHARACTER CODES

       In 16-bit mode, when PCRE_UTF16 is not set, character values are
       treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
       that they can range from 0 to 0xffff instead of 0 to 0xff. Character
       types for characters less than 0xff can therefore be influenced by the
       locale in the same way as before. Characters greater than 0xff have
       only one case, and no "type" (such as letter or digit).

       In UTF-16 mode, the character code is Unicode, in the range 0 to
       0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
       because those are "surrogate" values that are used in pairs to encode
       values greater than 0xffff.

       A UTF-16 string can indicate its endianness by a special code known as
       a byte-order mark (BOM). The PCRE functions do not handle this,
       expecting strings to be in host byte order. A utility function called
       pcre16_utf16_to_host_byte_order() is provided to help with this (see
       above).


ERROR NAMES

       The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16
       correspond to their 8-bit counterparts. The error PCRE_ERROR_BADMODE is
       given when a compiled pattern is passed to a function that processes
       patterns in the other mode, for example, if a pattern compiled with
       pcre_compile() is passed to pcre16_exec().

       There are new error codes whose names begin with PCRE_UTF16_ERR for
       invalid UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for
       UTF-8 strings that are described in the section entitled "Reason codes
       for invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
       are:

         PCRE_UTF16_ERR1  Missing low surrogate at end of string
         PCRE_UTF16_ERR2  Invalid low surrogate follows high surrogate
         PCRE_UTF16_ERR3  Isolated low surrogate
         PCRE_UTF16_ERR4  Non-character


ERROR TEXTS

       If there is an error while compiling a pattern, the error text that is
       passed back by pcre16_compile() or pcre16_compile2() is still an 8-bit
       character string, zero-terminated.


CALLOUTS

       The subject and mark fields in the callout block that is passed to a
       callout function point to 16-bit vectors.


TESTING

       The pcretest program continues to operate with 8-bit input and output
       files, but it can be used for testing the 16-bit library.
       If it is run with the command line option -16, patterns and subject
       strings are converted from 8-bit to 16-bit before being passed to PCRE,
       and the 16-bit library functions are used instead of the 8-bit ones.
       Returned 16-bit strings are converted to 8-bit for output. If neither
       the 8-bit nor the 32-bit library was compiled, pcretest defaults to
       16-bit and the -16 option is ignored.

       When PCRE is being built, the RunTest script that is called by "make
       check" uses the pcretest -C option to discover which of the 8-bit,
       16-bit and 32-bit libraries have been built, and runs the tests
       appropriately.


NOT SUPPORTED IN 16-BIT MODE

       Not all the features of the 8-bit library are available with the 16-bit
       library. The C++ and POSIX wrapper functions support only the 8-bit
       library, and the pcregrep program is at present 8-bit only.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 08 November 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------


PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

       #include <pcre.h>


PCRE 32-BIT API BASIC FUNCTIONS

       pcre32 *pcre32_compile(PCRE_SPTR32 pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre32 *pcre32_compile2(PCRE_SPTR32 pattern, int options,
            int *errorcodeptr,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre32_extra *pcre32_study(const pcre32 *code, int options,
            const char **errptr);

       void pcre32_free_study(pcre32_extra *extra);

       int pcre32_exec(const pcre32 *code, const pcre32_extra *extra,
            PCRE_SPTR32 subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       int pcre32_dfa_exec(const pcre32 *code, const pcre32_extra *extra,
            PCRE_SPTR32 subject, int length, int startoffset,
            int options, int *ovector, int ovecsize,
            int *workspace, int wscount);


PCRE 32-BIT API STRING EXTRACTION FUNCTIONS

       int pcre32_copy_named_substring(const pcre32 *code,
            PCRE_SPTR32 subject, int *ovector,
            int stringcount, PCRE_SPTR32 stringname,
            PCRE_UCHAR32 *buffer, int buffersize);

       int pcre32_copy_substring(PCRE_SPTR32 subject, int *ovector,
            int stringcount, int stringnumber, PCRE_UCHAR32 *buffer,
            int buffersize);

       int pcre32_get_named_substring(const pcre32 *code,
            PCRE_SPTR32 subject, int *ovector,
            int stringcount, PCRE_SPTR32 stringname,
            PCRE_SPTR32 *stringptr);

       int pcre32_get_stringnumber(const pcre32 *code,
            PCRE_SPTR32 name);

       int pcre32_get_stringtable_entries(const pcre32 *code,
            PCRE_SPTR32 name, PCRE_UCHAR32 **first, PCRE_UCHAR32 **last);

       int pcre32_get_substring(PCRE_SPTR32 subject, int *ovector,
            int stringcount, int stringnumber,
            PCRE_SPTR32 *stringptr);

       int pcre32_get_substring_list(PCRE_SPTR32 subject,
            int *ovector, int stringcount, PCRE_SPTR32 **listptr);

       void pcre32_free_substring(PCRE_SPTR32 stringptr);

       void pcre32_free_substring_list(PCRE_SPTR32 *stringptr);


PCRE 32-BIT API AUXILIARY FUNCTIONS

       pcre32_jit_stack *pcre32_jit_stack_alloc(int startsize, int maxsize);

       void pcre32_jit_stack_free(pcre32_jit_stack *stack);

       void pcre32_assign_jit_stack(pcre32_extra *extra,
            pcre32_jit_callback callback, void *data);

       const unsigned char *pcre32_maketables(void);

       int pcre32_fullinfo(const pcre32 *code, const pcre32_extra *extra,
            int what, void *where);

       int pcre32_refcount(pcre32 *code, int adjust);

       int pcre32_config(int what, void *where);

       const char *pcre32_version(void);

       int pcre32_pattern_to_host_byte_order(pcre32 *code,
            pcre32_extra *extra, const unsigned char *tables);


PCRE 32-BIT API INDIRECTED FUNCTIONS

       void *(*pcre32_malloc)(size_t);

       void (*pcre32_free)(void *);

       void *(*pcre32_stack_malloc)(size_t);

       void (*pcre32_stack_free)(void *);

       int (*pcre32_callout)(pcre32_callout_block *);


PCRE 32-BIT API 32-BIT-ONLY FUNCTION

       int pcre32_utf32_to_host_byte_order(PCRE_UCHAR32 *output,
            PCRE_SPTR32 input, int length, int *byte_order,
            int keep_boms);


THE PCRE 32-BIT LIBRARY

       Starting with release 8.32, it is possible to compile a PCRE library
       that supports 32-bit character strings, including UTF-32 strings, as
       well as or instead of the original 8-bit library. This work was done by
       Christian Persch, based on the work done by Zoltan Herczeg for the
       16-bit library. All three libraries contain identical sets of
       functions, used in exactly the same way. Only the names of the
       functions and the data types of their arguments and results are
       different.
       To avoid over-complication and reduce the documentation maintenance
       load, most of the PCRE documentation describes the 8-bit library, with
       only occasional references to the 16-bit and 32-bit libraries. This
       page describes what is different when you use the 32-bit library.

       WARNING: A single application can be linked with all or any of the
       three libraries, but you must take care when processing any particular
       pattern to use functions from just one library. For example, if you
       want to study a pattern that was compiled with pcre32_compile(), you
       must do so with pcre32_study(), not pcre_study(), and you must free the
       study data with pcre32_free_study().


THE HEADER FILE

       There is only one header file, pcre.h. It contains prototypes for all
       the functions in all libraries, as well as definitions of flags,
       structures, error codes, etc.


THE LIBRARY NAME

       In Unix-like systems, the 32-bit library is called libpcre32, and can
       normally be accessed by adding -lpcre32 to the command for linking an
       application that uses PCRE.


STRING TYPES

       In the 8-bit library, strings are passed to PCRE library functions as
       vectors of bytes with the C type "char *". In the 32-bit library,
       strings are passed as vectors of unsigned 32-bit quantities. The macro
       PCRE_UCHAR32 specifies an appropriate data type, and PCRE_SPTR32 is
       defined as "const PCRE_UCHAR32 *". In very many environments, "unsigned
       int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32
       as "unsigned int", but checks that it really is a 32-bit data type. If
       it is not, the build fails with an error message telling the maintainer
       to modify the definition appropriately.


STRUCTURE TYPES

       The types of the opaque structures that are used for compiled 32-bit
       patterns and JIT stacks are pcre32 and pcre32_jit_stack respectively.
       The type of the user-accessible structure that is returned by
       pcre32_study() is pcre32_extra, and the type of the structure that is
       used for passing data to a callout function is pcre32_callout_block.
       These structures contain the same fields, with the same names, as their
       8-bit counterparts. The only difference is that pointers to character
       strings are 32-bit instead of 8-bit types.


32-BIT FUNCTIONS

       For every function in the 8-bit library there is a corresponding
       function in the 32-bit library with a name that starts with pcre32_
       instead of pcre_. The prototypes are listed above. In addition, there
       is one extra function, pcre32_utf32_to_host_byte_order(). This is a
       utility function that converts a UTF-32 character string to host byte
       order if necessary. The other 32-bit functions expect the strings they
       are passed to be in host byte order.

       The input and output arguments of pcre32_utf32_to_host_byte_order() may
       point to the same address, that is, conversion in place is supported.
       The output buffer must be at least as long as the input.

       The length argument specifies the number of 32-bit data units in the
       input string; a negative value specifies a zero-terminated string.

       If byte_order is NULL, it is assumed that the string starts off in host
       byte order. This may be changed by byte-order marks (BOMs) anywhere in
       the string (commonly as the first character).

       If byte_order is not NULL, a non-zero value of the integer to which it
       points means that the input starts off in host byte order, otherwise
       the opposite order is assumed. Again, BOMs in the string can change
       this. The final byte order is passed back at the end of processing.

       If keep_boms is not zero, byte-order mark characters (0xfeff) are
       copied into the output string. Otherwise they are discarded.
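
       As a rough illustration of these rules for 32-bit units, the following
       stand-alone C sketch may help. The names sketch_swap32() and
       sketch_utf32_to_host_order(), and the use of uint32_t in place of
       PCRE_UCHAR32, are assumptions made for this example; it is not the
       library's implementation. A BOM read as 0x0000feff is in host order,
       and its byte-swapped form is 0xfffe0000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the four bytes of one 32-bit unit. */
static uint32_t sketch_swap32(uint32_t c)
{
    return (c >> 24) | ((c >> 8) & 0x0000ff00u) |
           ((c << 8) & 0x00ff0000u) | (c << 24);
}

/* Hypothetical stand-in for pcre32_utf32_to_host_byte_order(). It follows
   the rules described in the text but is not PCRE's own code. */
int sketch_utf32_to_host_order(uint32_t *output, const uint32_t *input,
                               int length, int *byte_order, int keep_boms)
{
    /* A NULL byte_order means "assume host order to begin with". */
    int host_order = (byte_order != NULL) ? (*byte_order != 0) : 1;
    uint32_t *out = output;

    assert(output != NULL && input != NULL);

    /* A negative length means zero-terminated; include the terminator. */
    if (length < 0) {
        length = 0;
        while (input[length] != 0) length++;
        length++;
    }

    for (int i = 0; i < length; i++) {
        uint32_t c = input[i];
        if (c == 0x0000feffu || c == 0xfffe0000u) {
            /* A BOM anywhere in the string resets the assumed order. */
            host_order = (c == 0x0000feffu);
            if (keep_boms != 0) *out++ = 0x0000feffu;
        } else {
            *out++ = host_order ? c : sketch_swap32(c);
        }
    }

    /* Pass the final byte order back to the caller, if requested. */
    if (byte_order != NULL) *byte_order = host_order;
    return (int)(out - output);
}
```

       With the zero-terminated input 0xfffe0000 0x41000000 and keep_boms set
       to a non-zero value, the sketch writes 0x0000feff 0x00000041
       0x00000000.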

       The result of the function is the number of 32-bit units placed into
       the output buffer, including the zero terminator if the string was
       zero-terminated.


SUBJECT STRING OFFSETS

       The offsets within subject strings that are returned by the matching
       functions are in 32-bit units rather than bytes.


NAMED SUBPATTERNS

       The name-to-number translation table that is maintained for named
       subpatterns uses 32-bit characters. The
       pcre32_get_stringtable_entries() function returns the length of each
       entry in the table as the number of 32-bit data units.


OPTION NAMES

       There are two new general option names, PCRE_UTF32 and
       PCRE_NO_UTF32_CHECK, which correspond to PCRE_UTF8 and
       PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options
       define the same bits in the options word. There is a discussion about
       the validity of UTF-32 strings in the pcreunicode page.

       For the pcre32_config() function there is an option PCRE_CONFIG_UTF32
       that returns 1 if UTF-32 support is configured, otherwise 0. If this
       option is given to pcre_config() or pcre16_config(), or if the
       PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF16 option is given to
       pcre32_config(), the result is the PCRE_ERROR_BADOPTION error.


CHARACTER CODES

       In 32-bit mode, when PCRE_UTF32 is not set, character values are
       treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
       that they can range from 0 to 0x7fffffff instead of 0 to 0xff.
       Character types for characters less than 0xff can therefore be
       influenced by the locale in the same way as before. Characters greater
       than 0xff have only one case, and no "type" (such as letter or digit).

       In UTF-32 mode, the character code is Unicode, in the range 0 to
       0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
       because those are "surrogate" values that are ill-formed in UTF-32.
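
       The two range rules for UTF-32 mode can be expressed as a short
       stand-alone C check. This is only a sketch: the function name
       sketch_check_utf32() and the status names are hypothetical, they are
       not PCRE identifiers, and PCRE's own validity checking may test more
       than these two ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status values for this sketch only; not PCRE error codes. */
enum sketch_utf32_status {
    SKETCH_UTF32_OK,
    SKETCH_UTF32_SURROGATE,   /* unit in the range 0xd800 to 0xdfff */
    SKETCH_UTF32_TOO_LARGE    /* unit greater than 0x10ffff */
};

/* Check len 32-bit units against the two range rules. On failure, the
   offset of the offending unit is stored via bad_offset if it is not NULL. */
enum sketch_utf32_status sketch_check_utf32(const uint32_t *s, size_t len,
                                            size_t *bad_offset)
{
    assert(s != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        uint32_t c = s[i];
        enum sketch_utf32_status status = SKETCH_UTF32_OK;

        if (c >= 0xd800u && c <= 0xdfffu)
            status = SKETCH_UTF32_SURROGATE;
        else if (c > 0x10ffffu)
            status = SKETCH_UTF32_TOO_LARGE;

        if (status != SKETCH_UTF32_OK) {
            if (bad_offset != NULL) *bad_offset = i;
            return status;
        }
    }
    return SKETCH_UTF32_OK;
}
```

       The offset reported by the sketch is counted in 32-bit units, in
       keeping with the way subject string offsets are expressed in the 32-bit
       library.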

       A UTF-32 string can indicate its endianness by a special code known as
       a byte-order mark (BOM). The PCRE functions do not handle this,
       expecting strings to be in host byte order. A utility function called
       pcre32_utf32_to_host_byte_order() is provided to help with this (see
       above).


ERROR NAMES

       The error PCRE_ERROR_BADUTF32 corresponds to its 8-bit counterpart.
       The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed
       to a function that processes patterns in the other mode, for example,
       if a pattern compiled with pcre_compile() is passed to pcre32_exec().

       There are new error codes whose names begin with PCRE_UTF32_ERR for
       invalid UTF-32 strings, corresponding to the PCRE_UTF8_ERR codes for
       UTF-8 strings that are described in the section entitled "Reason codes
       for invalid UTF-8 strings" in the main pcreapi page. The UTF-32 errors
       are:

         PCRE_UTF32_ERR1  Surrogate character (range from 0xd800 to 0xdfff)
         PCRE_UTF32_ERR2  Non-character
         PCRE_UTF32_ERR3  Character > 0x10ffff


ERROR TEXTS

       If there is an error while compiling a pattern, the error text that is
       passed back by pcre32_compile() or pcre32_compile2() is still an 8-bit
       character string, zero-terminated.


CALLOUTS

       The subject and mark fields in the callout block that is passed to a
       callout function point to 32-bit vectors.


TESTING

       The pcretest program continues to operate with 8-bit input and output
       files, but it can be used for testing the 32-bit library. If it is run
       with the command line option -32, patterns and subject strings are
       converted from 8-bit to 32-bit before being passed to PCRE, and the
       32-bit library functions are used instead of the 8-bit ones. Returned
       32-bit strings are converted to 8-bit for output.
If both the 8-bit and the 807 16-bit libraries were not compiled, pcretest defaults to 32-bit and the 808 -32 option is ignored. 809 810 When PCRE is being built, the RunTest script that is called by "make 811 check" uses the pcretest -C option to discover which of the 8-bit, 812 16-bit and 32-bit libraries has been built, and runs the tests appro- 813 priately. 814 815 816NOT SUPPORTED IN 32-BIT MODE 817 818 Not all the features of the 8-bit library are available with the 32-bit 819 library. The C++ and POSIX wrapper functions support only the 8-bit 820 library, and the pcregrep program is at present 8-bit only. 821 822 823AUTHOR 824 825 Philip Hazel 826 University Computing Service 827 Cambridge CB2 3QH, England. 828 829 830REVISION 831 832 Last updated: 08 November 2012 833 Copyright (c) 1997-2012 University of Cambridge. 834------------------------------------------------------------------------------ 835 836 837PCREBUILD(3) PCREBUILD(3) 838 839 840NAME 841 PCRE - Perl-compatible regular expressions 842 843 844PCRE BUILD-TIME OPTIONS 845 846 This document describes the optional features of PCRE that can be 847 selected when the library is compiled. It assumes use of the configure 848 script, where the optional features are selected or deselected by pro- 849 viding options to configure before running the make command. However, 850 the same options can be selected in both Unix-like and non-Unix-like 851 environments using the GUI facility of cmake-gui if you are using CMake 852 instead of configure to build PCRE. 853 854 There is a lot more information about building PCRE without using con- 855 figure (including information about using CMake or building "by hand") 856 in the file called NON-AUTOTOOLS-BUILD, which is part of the PCRE dis- 857 tribution. You should consult this file as well as the README file if 858 you are building in a non-Unix-like environment. 

       The complete list of options for configure (which includes the standard
       ones such as the selection of the installation directory) can be
       obtained by running

         ./configure --help

       The following sections include descriptions of options whose names
       begin with --enable or --disable. These settings specify changes to the
       defaults for the configure command. Because of the way that configure
       works, --enable and --disable always come in pairs, so the
       complementary option always exists as well, but as it specifies the
       default, it is not described.


BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES

       By default, a library called libpcre is built, containing functions
       that take string arguments contained in vectors of bytes, either as
       single-byte characters, or interpreted as UTF-8 strings. You can also
       build a separate library, called libpcre16, in which strings are
       contained in vectors of 16-bit data units and interpreted either as
       single-unit characters or UTF-16 strings, by adding

         --enable-pcre16

       to the configure command. You can also build a separate library, called
       libpcre32, in which strings are contained in vectors of 32-bit data
       units and interpreted either as single-unit characters or UTF-32
       strings, by adding

         --enable-pcre32

       to the configure command. If you do not want the 8-bit library, add

         --disable-pcre8

       as well. At least one of the three libraries must be built. Note that
       the C++ and POSIX wrappers are for the 8-bit library only, and that
       pcregrep is an 8-bit program. None of these are built if you select
       only the 16-bit or 32-bit libraries.


BUILDING SHARED AND STATIC LIBRARIES

       The PCRE building process uses libtool to build both shared and static
       Unix libraries by default. You can suppress one of these by adding one
       of

         --disable-shared
         --disable-static

       to the configure command, as required.


C++ SUPPORT

       By default, if the 8-bit library is being built, the configure script
       will search for a C++ compiler and C++ header files. If it finds them,
       it automatically builds the C++ wrapper library (which supports only
       8-bit strings). You can disable this by adding

         --disable-cpp

       to the configure command.


UTF-8, UTF-16 AND UTF-32 SUPPORT

       To build PCRE with support for UTF Unicode character strings, add

         --enable-utf

       to the configure command. This setting applies to all three libraries,
       adding support for UTF-8 to the 8-bit library, support for UTF-16 to
       the 16-bit library, and support for UTF-32 to the 32-bit library.
       There are no separate options for enabling UTF-8, UTF-16 and UTF-32
       independently because that would allow ridiculous settings such as
       requesting UTF-16 support while building only the 8-bit library. It
       is not possible to build one library with UTF support and another
       without in the same configuration. (For backwards compatibility,
       --enable-utf8 is a synonym of --enable-utf.)

       Of itself, this setting does not make PCRE treat strings as UTF-8,
       UTF-16 or UTF-32. As well as compiling PCRE with this option, you also
       have to set the PCRE_UTF8, PCRE_UTF16 or PCRE_UTF32 option (as
       appropriate) when you call one of the pattern compiling functions.

       If you set --enable-utf when compiling in an EBCDIC environment, PCRE
       expects its input to be either ASCII or UTF-8 (depending on the
       run-time option). It is not possible to support both EBCDIC and UTF-8
       codes in the same version of the library. Consequently, --enable-utf
       and --enable-ebcdic are mutually exclusive.


UNICODE CHARACTER PROPERTY SUPPORT

       UTF support allows the libraries to process character codepoints up to
       0x10ffff in the strings that they handle. On its own, however, it does
       not provide any facilities for accessing the properties of such
       characters. If you want to be able to use the pattern escapes \P, \p,
       and \X, which refer to Unicode character properties, you must add

         --enable-unicode-properties

       to the configure command. This implies UTF support, even if you have
       not explicitly requested it.

       Including Unicode property support adds around 30K of tables to the
       PCRE library. Only the general category properties such as Lu and Nd
       are supported. Details are given in the pcrepattern documentation.


JUST-IN-TIME COMPILER SUPPORT

       Just-in-time compiler support is included in the build by specifying

         --enable-jit

       This support is available only for certain hardware architectures. If
       this option is set for an unsupported architecture, a compile time
       error occurs. See the pcrejit documentation for a discussion of JIT
       usage. When JIT support is enabled, pcregrep automatically makes use of
       it, unless you add

         --disable-pcregrep-jit

       to the configure command.


CODE VALUE OF NEWLINE

       By default, PCRE interprets the linefeed (LF) character as indicating
       the end of a line. This is the normal newline character on Unix-like
       systems. You can compile PCRE to use carriage return (CR) instead, by
       adding

         --enable-newline-is-cr

       to the configure command. There is also a --enable-newline-is-lf
       option, which explicitly specifies linefeed as the newline character.

       Alternatively, you can specify that line endings are to be indicated by
       the two-character sequence CRLF. If you want this, add

         --enable-newline-is-crlf

       to the configure command. There is a fourth option, specified by

         --enable-newline-is-anycrlf

       which causes PCRE to recognize any of the three sequences CR, LF, or
       CRLF as indicating a line ending. Finally, a fifth option, specified by

         --enable-newline-is-any

       causes PCRE to recognize any Unicode newline sequence.

       Whatever line ending convention is selected when PCRE is built can be
       overridden when the library functions are called. At build time it is
       conventional to use the standard for your operating system.


WHAT \R MATCHES

       By default, the sequence \R in a pattern matches any Unicode newline
       sequence, whatever has been selected as the line ending sequence. If
       you specify

         --enable-bsr-anycrlf

       the default is changed so that \R matches only CR, LF, or CRLF.
       Whatever is selected when PCRE is built can be overridden when the
       library functions are called.


POSIX MALLOC USAGE

       When the 8-bit library is called through the POSIX interface (see the
       pcreposix documentation), additional working storage is required for
       holding the pointers to capturing substrings, because PCRE requires
       three integers per substring, whereas the POSIX interface provides only
       two. If the number of expected substrings is small, the wrapper
       function uses space on the stack, because this is faster than using
       malloc() for each call. The default threshold above which the stack is
       no longer used is 10; it can be changed by adding a setting such as

         --with-posix-malloc-threshold=20

       to the configure command.


HANDLING VERY LARGE PATTERNS

       Within a compiled pattern, offset values are used to point from one
       part to another (for example, from an opening parenthesis to an
       alternation metacharacter). By default, in the 8-bit and 16-bit
       libraries, two-byte values are used for these offsets, leading to a
       maximum size for a compiled pattern of around 64K. This is sufficient
       to handle all but the most gigantic patterns. Nevertheless, some people
       do want to process truly enormous patterns, so it is possible to
       compile PCRE to use three-byte or four-byte offsets by adding a setting
       such as

         --with-link-size=3

       to the configure command. The value given must be 2, 3, or 4. For the
       16-bit library, a value of 3 is rounded up to 4. In these libraries,
       using longer offsets slows down the operation of PCRE because it has to
       load additional data when handling them. For the 32-bit library the
       value is always 4 and cannot be overridden; the value of
       --with-link-size is ignored.


AVOIDING EXCESSIVE STACK USAGE

       When matching with the pcre_exec() function, PCRE implements
       backtracking by making recursive calls to an internal function called
       match(). In environments where the size of the stack is limited, this
       can severely limit PCRE's operation. (The Unix environment does not
       usually suffer from this problem, but it may sometimes be necessary to
       increase the maximum stack size. There is a discussion in the pcrestack
       documentation.) An alternative approach to recursion that uses memory
       from the heap to remember data, instead of using recursive function
       calls, has been implemented to work round the problem of limited stack
       size. If you want to build a version of PCRE that works this way, add

         --disable-stack-for-recursion

       to the configure command. With this configuration, PCRE will use the
       pcre_stack_malloc and pcre_stack_free variables to call memory
       management functions. By default these point to malloc() and free(),
       but you can replace the pointers so that your own functions are used
       instead.
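
       As a sketch of what such replacement functions might look like, the
       following C code wraps malloc() and free() and records how many blocks
       are live at once, which gives a rough picture of how deep the heap
       "recursion" went. The counting_ names and the counters are hypothetical
       and for illustration only; the two function signatures are the ones
       that the pcre_stack_malloc and pcre_stack_free pointers require.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping for illustration; not part of PCRE. */
static size_t blocks_in_use = 0;   /* blocks currently allocated */
static size_t blocks_peak = 0;     /* highest value blocks_in_use reached */

/* Same shape as the pcre_stack_malloc pointer: void *(*)(size_t). */
void *counting_stack_malloc(size_t size)
{
    void *block = malloc(size);
    if (block != NULL) {
        blocks_in_use++;
        if (blocks_in_use > blocks_peak) blocks_peak = blocks_in_use;
    }
    return block;
}

/* Same shape as the pcre_stack_free pointer: void (*)(void *). */
void counting_stack_free(void *block)
{
    if (block == NULL) return;
    assert(blocks_in_use > 0);   /* a free with no matching malloc is a bug */
    blocks_in_use--;
    free(block);
}

size_t counting_stack_in_use(void) { return blocks_in_use; }
size_t counting_stack_peak(void)   { return blocks_peak; }

/* In a program that includes <pcre.h> and links a library built with
   --disable-stack-for-recursion, the functions would be installed with:
       pcre_stack_malloc = counting_stack_malloc;
       pcre_stack_free   = counting_stack_free;
*/
```

       The counters are plain static variables, so this sketch is not safe
       for use from several threads at once.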

       Separate functions are provided rather than using pcre_malloc and
       pcre_free because the usage is very predictable: the block sizes
       requested are always the same, and the blocks are always freed in
       reverse order. A calling program might be able to implement optimized
       functions that perform better than malloc() and free(). PCRE runs
       noticeably more slowly when built in this way. This option affects only
       the pcre_exec() function; it is not relevant for pcre_dfa_exec().


LIMITING PCRE RESOURCE USAGE

       Internally, PCRE has a function called match(), which it calls
       repeatedly (sometimes recursively) when matching a pattern with the
       pcre_exec() function. By controlling the maximum number of times this
       function may be called during a single matching operation, a limit can
       be placed on the resources used by a single call to pcre_exec(). The
       limit can be changed at run time, as described in the pcreapi
       documentation. The default is 10 million, but this can be changed by
       adding a setting such as

         --with-match-limit=500000

       to the configure command. This setting has no effect on the
       pcre_dfa_exec() matching function.

       In some environments it is desirable to limit the depth of recursive
       calls of match() more strictly than the total number of calls, in order
       to restrict the maximum amount of stack (or heap, if
       --disable-stack-for-recursion is specified) that is used. A second
       limit controls this; it defaults to the value that is set for
       --with-match-limit, which imposes no additional constraints. However,
       you can set a lower limit by adding, for example,

         --with-match-limit-recursion=10000

       to the configure command. This value can also be overridden at run
       time.
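
       The difference between the two limits can be seen in a toy
       backtracking matcher. The C sketch below is not PCRE's match()
       function; it is a deliberately tiny matcher (literals, "." and "*"
       only, anchored at the start of the subject) whose names, return codes
       and structure are all hypothetical. One counter caps the total number
       of calls, and a separate check caps the recursion depth.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this toy; they are not PCRE error codes. */
enum { TOY_ERR_DEPTH = -2, TOY_ERR_CALLS = -1, TOY_NOMATCH = 0, TOY_MATCH = 1 };

typedef struct {
    unsigned long calls;        /* calls made so far */
    unsigned long call_limit;   /* plays the role of the match limit */
    unsigned long depth_limit;  /* plays the role of the recursion limit */
} toy_limits;

int toy_match(const char *pat, const char *sub, toy_limits *lim,
              unsigned long depth)
{
    assert(pat != NULL && sub != NULL && lim != NULL);
    if (++lim->calls > lim->call_limit) return TOY_ERR_CALLS;
    if (depth > lim->depth_limit) return TOY_ERR_DEPTH;
    if (pat[0] == '\0') return TOY_MATCH;       /* pattern used up */

    if (pat[1] == '*') {
        /* Greedy repeat: take as many characters as possible, then
           give them back one at a time (this is the backtracking). */
        size_t n = 0;
        while (sub[n] != '\0' && (pat[0] == '.' || pat[0] == sub[n])) n++;
        for (;;) {
            int rc = toy_match(pat + 2, sub + n, lim, depth + 1);
            if (rc != TOY_NOMATCH) return rc;   /* match, or a limit hit */
            if (n == 0) return TOY_NOMATCH;
            n--;
        }
    }
    if (sub[0] != '\0' && (pat[0] == '.' || pat[0] == sub[0]))
        return toy_match(pat + 1, sub + 1, lim, depth + 1);
    return TOY_NOMATCH;
}

/* Convenience wrapper: start a match with fresh counters. */
int toy_exec(const char *pat, const char *sub,
             unsigned long call_limit, unsigned long depth_limit)
{
    toy_limits lim = { 0, call_limit, depth_limit };
    return toy_match(pat, sub, &lim, 0);
}
```

       With this toy, the pattern a*a*a*b applied to a run of twenty "a"
       characters backtracks heavily without ever recursing deeply, so it
       trips a small call limit, whereas a long literal pattern makes few
       calls but nests them, so it trips a small depth limit.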


CREATING CHARACTER TABLES AT BUILD TIME

       PCRE uses fixed tables for processing characters whose code values are
       less than 256. By default, PCRE is built with a set of tables that are
       distributed in the file pcre_chartables.c.dist. These tables are for
       ASCII codes only. If you add

         --enable-rebuild-chartables

       to the configure command, the distributed tables are no longer used.
       Instead, a program called dftables is compiled and run. This outputs
       the source for a new set of tables, created in the default locale of
       your C run-time system. (This method of replacing the tables does not
       work if you are cross compiling, because dftables is run on the local
       host. If you need to create alternative tables when cross compiling,
       you will have to do so "by hand".)


USING EBCDIC CODE

       PCRE assumes by default that it will run in an environment where the
       character code is ASCII (or Unicode, which is a superset of ASCII).
       This is the case for most computer operating systems. PCRE can,
       however, be compiled to run in an EBCDIC environment by adding

         --enable-ebcdic

       to the configure command. This setting implies
       --enable-rebuild-chartables. You should only use it if you know that
       you are in an EBCDIC environment (for example, an IBM mainframe
       operating system). The --enable-ebcdic option is incompatible with
       --enable-utf.

       The EBCDIC character that corresponds to an ASCII LF is assumed to have
       the value 0x15 by default. However, in some EBCDIC environments, 0x25
       is used. In such an environment you should use

         --enable-ebcdic-nl25

       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
       has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
       0x25 is not chosen as LF is made to correspond to the Unicode NEL
       character (which, in Unicode, is 0x85).

       The options that select newline behaviour, such as
       --enable-newline-is-cr, and equivalent run-time options, refer to
       these character values in an EBCDIC environment.


PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT

       By default, pcregrep reads all files as plain text. You can build it so
       that it recognizes files whose names end in .gz or .bz2, and reads them
       with libz or libbz2, respectively, by adding one or both of

         --enable-pcregrep-libz
         --enable-pcregrep-libbz2

       to the configure command. These options naturally require that the
       relevant libraries are installed on your system. Configuration will
       fail if they are not.


PCREGREP BUFFER SIZE

       pcregrep uses an internal buffer to hold a "window" on the file it is
       scanning, in order to be able to output "before" and "after" lines when
       it finds a match. The size of the buffer is controlled by a parameter
       whose default value is 20K. The buffer itself is three times this size,
       but because of the way it is used for holding "before" lines, the
       longest line that is guaranteed to be processable is the parameter
       size. You can change the default parameter value by adding, for
       example,

         --with-pcregrep-bufsize=50K

       to the configure command. The caller of pcregrep can, however, override
       this value by specifying a run-time option.


PCRETEST OPTION FOR LIBREADLINE SUPPORT

       If you add

         --enable-pcretest-libreadline

       to the configure command, pcretest is linked with the libreadline
       library, and when its input is from a terminal, it reads it using the
       readline() function. This provides line-editing and history facilities.
       Note that libreadline is GPL-licensed, so if you distribute a binary of
       pcretest linked in this way, there may be licensing issues.

       Setting this option causes the -lreadline option to be added to the
       pcretest build. In many operating environments with a system-installed
       libreadline this is sufficient. However, in some environments (e.g. if
       an unmodified distribution version of readline is in use), some extra
       configuration may be necessary. The INSTALL file for libreadline says
       this:

         "Readline uses the termcap functions, but does not link with the
         termcap or curses library itself, allowing applications which link
         with readline the to choose an appropriate library."

       If your environment has not been set up so that an appropriate library
       is automatically included, you may need to add something like

         LIBS="-lncurses"

       immediately before the configure command.


DEBUGGING WITH VALGRIND SUPPORT

       If you add the

         --enable-valgrind

       option to the configure command, PCRE will use valgrind annotations
       to mark certain memory regions as unaddressable. This allows it to
       detect invalid memory accesses, and is mostly useful for debugging PCRE
       itself.


CODE COVERAGE REPORTING

       If your C compiler is gcc, you can build a version of PCRE that can
       generate a code coverage report for its test suite. To enable this, you
       must install lcov version 1.6 or above. Then specify

         --enable-coverage

       to the configure command and build PCRE in the usual way.

       Note that using ccache (a caching C compiler) is incompatible with code
       coverage reporting. If you have configured ccache to run automatically
       on your system, you must set the environment variable

         CCACHE_DISABLE=1

       before running make to build PCRE, so that ccache is not used.

       When --enable-coverage is used, the following additional targets are
       added to the Makefile:

         make coverage

       This creates a fresh coverage report for the PCRE test suite. It is
       equivalent to running "make coverage-reset", "make coverage-baseline",
       "make check", and then "make coverage-report".

         make coverage-reset

       This zeroes the coverage counters, but does nothing else.

         make coverage-baseline

       This captures baseline coverage information.

         make coverage-report

       This creates the coverage report.

         make coverage-clean-report

       This removes the generated coverage report without cleaning the
       coverage data itself.

         make coverage-clean-data

       This removes the captured coverage data without removing the coverage
       files created at compile time (*.gcno).

         make coverage-clean

       This cleans all coverage data including the generated coverage report.
       For more information about code coverage, see the gcov and lcov
       documentation.


SEE ALSO

       pcreapi(3), pcre16, pcre32, pcre_config(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 30 October 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------


PCREMATCHING(3)                                                PCREMATCHING(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE MATCHING ALGORITHMS

       This document describes the two different algorithms that are available
       in PCRE for matching a compiled regular expression against a given
       subject string. The "standard" algorithm is the one provided by the
       pcre_exec(), pcre16_exec() and pcre32_exec() functions. These work in
       the same way as Perl's matching function, and provide a Perl-compatible
       matching operation. The just-in-time (JIT) optimization that is
       described in the pcrejit documentation is compatible with these
       functions.

       An alternative algorithm is provided by the pcre_dfa_exec(),
       pcre16_dfa_exec() and pcre32_dfa_exec() functions; they operate in a
       different way, and are not Perl-compatible. This alternative has
       advantages and disadvantages compared with the standard algorithm, and
       these are described below.

       When there is only one possible way in which a given subject string can
       match a pattern, the two algorithms give the same answer. A difference
       arises, however, when there are multiple possibilities. For example, if
       the pattern

         ^<.*>

       is matched against the string

         <something> <something else> <something further>

       there are three possible answers. The standard algorithm finds only one
       of them, whereas the alternative algorithm finds all three.


REGULAR EXPRESSIONS AS TREES

       The set of strings that are matched by a regular expression can be
       represented as a tree structure. An unlimited repetition in the pattern
       makes the tree of infinite size, but it is still a tree. Matching the
       pattern to a given subject string (from a given starting point) can be
       thought of as a search of the tree. There are two ways to search a
       tree: depth-first and breadth-first, and these correspond to the two
       matching algorithms provided by PCRE.


THE STANDARD MATCHING ALGORITHM

       In the terminology of Jeffrey Friedl's book "Mastering Regular
       Expressions", the standard algorithm is an "NFA algorithm". It conducts
       a depth-first search of the pattern tree. That is, it proceeds along a
       single path through the tree, checking that the subject matches what is
       required. When there is a mismatch, the algorithm tries any
       alternatives at the current point, and if they all fail, it backs up to
       the previous branch point in the tree, and tries the next alternative
       branch at that level. This often involves backing up (moving to the
       left) in the subject string as well. The order in which repetition
       branches are tried is controlled by the greedy or ungreedy nature of
       the quantifier.

       If a leaf node is reached, a matching string has been found, and at
       that point the algorithm stops. Thus, if there is more than one
       possible match, this algorithm returns the first one that it finds.
       Whether this is the shortest, the longest, or some intermediate length
       depends on the way the greedy and ungreedy repetition quantifiers are
       specified in the pattern.

       Because it ends up with a single path through the tree, it is
       relatively straightforward for this algorithm to keep track of the
       substrings that are matched by portions of the pattern in parentheses.
       This provides support for capturing parentheses and back references.


THE ALTERNATIVE MATCHING ALGORITHM

       This algorithm conducts a breadth-first search of the tree. Starting
       from the first matching point in the subject, it scans the subject
       string from left to right, once, character by character, and as it does
       this, it remembers all the paths through the tree that represent valid
       matches. In Friedl's terminology, this is a kind of "DFA algorithm",
       though it is not implemented as a traditional finite state machine (it
       keeps multiple states active simultaneously).
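
       The contrast between returning one match and remembering all of them
       can be made concrete for the pattern ^<.*> and the subject string used
       earlier in this page. The C sketch below does not call PCRE and is not
       a regular expression engine: it is a hand-coded stand-in for that one
       pattern, and the function names are hypothetical. It reports every
       match that starts at offset 0, longest first, and also the single
       match that a greedy or ungreedy quantifier would select.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-coded stand-in for the one pattern ^<.*> (hypothetical helper).
   Stores the length of each match that starts at offset 0 in lengths[],
   longest first, and returns how many were stored. Only the first
   max_lengths matches found while scanning left to right are kept. */
size_t all_matches_longest_first(const char *subject, size_t *lengths,
                                 size_t max_lengths)
{
    size_t count = 0;
    assert(subject != NULL && lengths != NULL);
    if (subject[0] != '<') return 0;
    /* One left-to-right scan; "." is taken not to match a newline. */
    for (size_t i = 1; subject[i] != '\0' && subject[i] != '\n'; i++) {
        if (subject[i] == '>' && count < max_lengths) lengths[count++] = i + 1;
    }
    /* The scan found the shortest match first, so reverse the order. */
    for (size_t a = 0, b = count; a + 1 < b; a++, b--) {
        size_t tmp = lengths[a];
        lengths[a] = lengths[b - 1];
        lengths[b - 1] = tmp;
    }
    return count;
}

/* The single answer of a first-match search: the longest match when the
   quantifier is greedy, the shortest when it is not; 0 means no match. */
size_t single_match(const char *subject, int greedy)
{
    size_t best = 0;
    assert(subject != NULL);
    if (subject[0] != '<') return 0;
    for (size_t i = 1; subject[i] != '\0' && subject[i] != '\n'; i++) {
        if (subject[i] == '>') {
            best = i + 1;
            if (!greedy) break;   /* ungreedy: stop at the first ">" */
        }
    }
    return best;
}
```

       For the subject string shown earlier, the first function reports
       three lengths, one for each ">" in the string, while the second
       reports only one of them.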

       Although the general principle of this matching algorithm is that it
       scans the subject string only once, without backtracking, there is one
       exception: when a lookaround assertion is encountered, the characters
       following or preceding the current point have to be independently
       inspected.

       The scan continues until either the end of the subject is reached, or
       there are no more unterminated paths. At this point, terminated paths
       represent the different matching possibilities (if there are none, the
       match has failed). Thus, if there is more than one possible match,
       this algorithm finds all of them, and in particular, it finds the
       longest. The matches are returned in decreasing order of length. There
       is an option to stop the algorithm after the first match (which is
       necessarily the shortest) is found.

       Note that all the matches that are found start at the same point in the
       subject. If the pattern

         cat(er(pillar)?)?

       is matched against the string "the caterpillar catchment", the result
       will be the three strings "caterpillar", "cater", and "cat" that start
       at the fifth character of the subject. The algorithm does not
       automatically move on to find matches that start at later positions.

       There are a number of features of PCRE regular expressions that are not
       supported by the alternative matching algorithm. They are as follows:

       1. Because the algorithm finds all possible matches, the greedy or
       ungreedy nature of repetition quantifiers is not relevant. Greedy and
       ungreedy quantifiers are treated in exactly the same way. However,
       possessive quantifiers can make a difference when what follows could
       also match what is quantified, for example in a pattern like this:

         ^a++\w!

       This pattern matches "aaab!" but not "aaa!", which would be matched by
       a non-possessive quantifier. Similarly, if an atomic group is present,
       it is matched as if it were a standalone pattern at the current point,
       and the longest match is then "locked in" for the rest of the overall
       pattern.

       2. When dealing with multiple paths through the tree simultaneously, it
       is not straightforward to keep track of captured substrings for the
       different matching possibilities, and PCRE's implementation of this
       algorithm does not attempt to do this. This means that no captured
       substrings are available.

       3. Because no substrings are captured, back references within the
       pattern are not supported, and cause errors if encountered.

       4. For the same reason, conditional expressions that use a
       backreference as the condition or test for a specific group recursion
       are not supported.

       5. Because many paths through the tree may be active, the \K escape
       sequence, which resets the start of the match when encountered (but may
       be on some paths and not on others), is not supported. It causes an
       error if encountered.

       6. Callouts are supported, but the value of the capture_top field is
       always 1, and the value of the capture_last field is always -1.

       7. The \C escape sequence, which (in the standard algorithm) always
       matches a single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is
       not supported in these modes, because the alternative algorithm moves
       through the subject string one character (not data unit) at a time, for
       all active paths through the tree.

       8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
       are not supported. (*FAIL) is supported, and behaves like a failing
       negative assertion.


ADVANTAGES OF THE ALTERNATIVE ALGORITHM

       Using the alternative matching algorithm provides the following
       advantages:

       1. All possible matches (at a single point in the subject) are
       automatically found, and in particular, the longest match is found. To
       find more than one match using the standard algorithm, you have to do
       kludgy things with callouts.

       2. Because the alternative algorithm scans the subject string just
       once, and never needs to backtrack (except for lookbehinds), it is
       possible to pass very long subject strings to the matching function in
       several pieces, checking for partial matching each time. Although it is
       possible to do multi-segment matching using the standard algorithm by
       retaining partially matched substrings, it is more complicated. The
       pcrepartial documentation gives details of partial matching and
       discusses multi-segment matching.


DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

       The alternative algorithm suffers from a number of disadvantages:

       1. It is substantially slower than the standard algorithm. This is
       partly because it has to search for all possible matches, but is also
       because it is less susceptible to optimization.

       2. Capturing parentheses and back references are not supported.

       3. Although atomic groups are supported, their use does not provide the
       performance advantage that it does for the standard algorithm.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 08 January 2012
       Copyright (c) 1997-2012 University of Cambridge.
1533------------------------------------------------------------------------------ 1534 1535 1536PCREAPI(3) PCREAPI(3) 1537 1538 1539NAME 1540 PCRE - Perl-compatible regular expressions 1541 1542 #include <pcre.h> 1543 1544 1545PCRE NATIVE API BASIC FUNCTIONS 1546 1547 pcre *pcre_compile(const char *pattern, int options, 1548 const char **errptr, int *erroffset, 1549 const unsigned char *tableptr); 1550 1551 pcre *pcre_compile2(const char *pattern, int options, 1552 int *errorcodeptr, 1553 const char **errptr, int *erroffset, 1554 const unsigned char *tableptr); 1555 1556 pcre_extra *pcre_study(const pcre *code, int options, 1557 const char **errptr); 1558 1559 void pcre_free_study(pcre_extra *extra); 1560 1561 int pcre_exec(const pcre *code, const pcre_extra *extra, 1562 const char *subject, int length, int startoffset, 1563 int options, int *ovector, int ovecsize); 1564 1565 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra, 1566 const char *subject, int length, int startoffset, 1567 int options, int *ovector, int ovecsize, 1568 int *workspace, int wscount); 1569 1570 1571PCRE NATIVE API STRING EXTRACTION FUNCTIONS 1572 1573 int pcre_copy_named_substring(const pcre *code, 1574 const char *subject, int *ovector, 1575 int stringcount, const char *stringname, 1576 char *buffer, int buffersize); 1577 1578 int pcre_copy_substring(const char *subject, int *ovector, 1579 int stringcount, int stringnumber, char *buffer, 1580 int buffersize); 1581 1582 int pcre_get_named_substring(const pcre *code, 1583 const char *subject, int *ovector, 1584 int stringcount, const char *stringname, 1585 const char **stringptr); 1586 1587 int pcre_get_stringnumber(const pcre *code, 1588 const char *name); 1589 1590 int pcre_get_stringtable_entries(const pcre *code, 1591 const char *name, char **first, char **last); 1592 1593 int pcre_get_substring(const char *subject, int *ovector, 1594 int stringcount, int stringnumber, 1595 const char **stringptr); 1596 1597 int 
       pcre_get_substring_list(const char *subject,
            int *ovector, int stringcount, const char ***listptr);

       void pcre_free_substring(const char *stringptr);

       void pcre_free_substring_list(const char **stringptr);


PCRE NATIVE API AUXILIARY FUNCTIONS

       int pcre_jit_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize,
            pcre_jit_stack *jstack);

       pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);

       void pcre_jit_stack_free(pcre_jit_stack *stack);

       void pcre_assign_jit_stack(pcre_extra *extra,
            pcre_jit_callback callback, void *data);

       const unsigned char *pcre_maketables(void);

       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
            int what, void *where);

       int pcre_refcount(pcre *code, int adjust);

       int pcre_config(int what, void *where);

       const char *pcre_version(void);

       int pcre_pattern_to_host_byte_order(pcre *code,
            pcre_extra *extra, const unsigned char *tables);


PCRE NATIVE API INDIRECTED FUNCTIONS

       void *(*pcre_malloc)(size_t);

       void (*pcre_free)(void *);

       void *(*pcre_stack_malloc)(size_t);

       void (*pcre_stack_free)(void *);

       int (*pcre_callout)(pcre_callout_block *);


PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       As well as support for 8-bit character strings, PCRE also supports
       16-bit strings (from release 8.30) and 32-bit strings (from release
       8.32), by means of two additional libraries. They can be built as well
       as, or instead of, the 8-bit library. To avoid too much complication,
       this document describes the 8-bit versions of the functions, with only
       occasional references to the 16-bit and 32-bit libraries.

       The 16-bit and 32-bit functions operate in the same way as their 8-bit
       counterparts; they just use different data types for their arguments
       and results, and their names start with pcre16_ or pcre32_ instead of
       pcre_. For every option that has UTF8 in its name (for example,
       PCRE_UTF8), there are corresponding 16-bit and 32-bit names with UTF8
       replaced by UTF16 or UTF32, respectively. This facility is in fact just
       cosmetic; the 16-bit and 32-bit option names define the same bit val-
       ues.

       References to bytes and UTF-8 in this document should be read as refer-
       ences to 16-bit data quantities and UTF-16 when using the 16-bit
       library, or 32-bit data quantities and UTF-32 when using the 32-bit
       library, unless specified otherwise. More details of the specific dif-
       ferences for the 16-bit and 32-bit libraries are given in the pcre16
       and pcre32 pages.


PCRE API OVERVIEW

       PCRE has its own native API, which is described in this document. There
       are also some wrapper functions (for the 8-bit library only) that cor-
       respond to the POSIX regular expression API, but they do not give
       access to all the functionality. They are described in the pcreposix
       documentation. Both of these APIs define a set of C function calls. A
       C++ wrapper (again for the 8-bit library only) is also distributed with
       PCRE. It is documented in the pcrecpp page.

       The native API C function prototypes are defined in the header file
       pcre.h, and on Unix-like systems the (8-bit) library itself is called
       libpcre. It can normally be accessed by adding -lpcre to the command
       for linking an application that uses PCRE. The header file defines the
       macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
       numbers for the library. Applications can use these to include support
       for different releases of PCRE.

       In a Windows environment, if you want to statically link an application
       program against a non-dll pcre.a file, you must define PCRE_STATIC
       before including pcre.h or pcrecpp.h, because otherwise the pcre_mal-
       loc() and pcre_free() exported functions will be declared
       __declspec(dllimport), with unwanted results.

       The functions pcre_compile(), pcre_compile2(), pcre_study(), and
       pcre_exec() are used for compiling and matching regular expressions in
       a Perl-compatible manner. A sample program that demonstrates the sim-
       plest way of using them is provided in the file called pcredemo.c in
       the PCRE source distribution. A listing of this program is given in the
       pcredemo documentation, and the pcresample documentation describes how
       to compile and run it.

       Just-in-time compiler support is an optional feature of PCRE that can
       be built in appropriate hardware environments. It greatly speeds up the
       matching performance of many patterns. Simple programs can easily
       request that it be used if available, by setting an option that is
       ignored when it is not relevant. More complicated programs might need
       to make use of the functions pcre_jit_stack_alloc(),
       pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control
       the JIT code's memory usage.

       From release 8.32 there is also a direct interface for JIT execution,
       which gives improved performance. The JIT-specific functions are dis-
       cussed in the pcrejit documentation.

       A second matching function, pcre_dfa_exec(), which is not Perl-compati-
       ble, is also provided. This uses a different algorithm for the match-
       ing. The alternative algorithm finds all possible matches (at a given
       point in the subject), and scans the subject just once (unless there
       are lookbehind assertions). However, this algorithm does not return
       captured substrings.
       A description of the two matching algorithms and
       their advantages and disadvantages is given in the pcrematching docu-
       mentation.

       In addition to the main compiling and matching functions, there are
       convenience functions for extracting captured substrings from a subject
       string that is matched by pcre_exec(). They are:

         pcre_copy_substring()
         pcre_copy_named_substring()
         pcre_get_substring()
         pcre_get_named_substring()
         pcre_get_substring_list()
         pcre_get_stringnumber()
         pcre_get_stringtable_entries()

       pcre_free_substring() and pcre_free_substring_list() are also provided,
       to free the memory used for extracted strings.

       The function pcre_maketables() is used to build a set of character
       tables in the current locale for passing to pcre_compile(),
       pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
       provided for specialist use. Most commonly, no special tables are
       passed, in which case internal tables that are generated when PCRE is
       built are used.

       The function pcre_fullinfo() is used to find out information about a
       compiled pattern. The function pcre_version() returns a pointer to a
       string containing the version of PCRE and its date of release.

       The function pcre_refcount() maintains a reference count in a data
       block containing a compiled pattern. This is provided for the benefit
       of object-oriented applications.

       The global variables pcre_malloc and pcre_free initially contain the
       entry points of the standard malloc() and free() functions, respec-
       tively. PCRE calls the memory management functions via these variables,
       so a calling program can replace them if it wishes to intercept the
       calls. This should be done before calling any PCRE functions.

       The global variables pcre_stack_malloc and pcre_stack_free are also
       indirections to memory management functions.
       These special functions
       are used only when PCRE is compiled to use the heap for remembering
       data, instead of recursive function calls, when running the pcre_exec()
       function. See the pcrebuild documentation for details of how to do
       this. It is a non-standard way of building PCRE, for use in environ-
       ments that have limited stacks. Because of the greater use of memory
       management, it runs more slowly. Separate functions are provided so
       that special-purpose external code can be used for this case. When
       used, these functions are always called in a stack-like manner (last
       obtained, first freed), and always for memory blocks of the same size.
       There is a discussion about PCRE's stack usage in the pcrestack docu-
       mentation.

       The global variable pcre_callout initially contains NULL. It can be set
       by the caller to a "callout" function, which PCRE will then call at
       specified points during a matching operation. Details are given in the
       pcrecallout documentation.


NEWLINES

       PCRE supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF (line-
       feed) character, the two-character sequence CRLF, any of the three pre-
       ceding, or any Unicode newline sequence. The Unicode newline sequences
       are the three just mentioned, plus the single characters VT (vertical
       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
       separator, U+2028), and PS (paragraph separator, U+2029).

       Each of the first three conventions is used by at least one operating
       system as its standard newline sequence. When PCRE is built, a default
       can be specified. If none is specified, the default is LF, which is the
       Unix standard. When PCRE is run, the default can be overridden, either
       when a pattern is compiled, or when it is matched.

       At compile time, the newline convention can be specified by the options
       argument of pcre_compile(), or it can be specified by special text at
       the start of the pattern itself; this overrides any other settings. See
       the pcrepattern page for details of the special character sequences.

       In the PCRE documentation the word "newline" is used to mean "the char-
       acter or pair of characters that indicate a line break". The choice of
       newline convention affects the handling of the dot, circumflex, and
       dollar metacharacters, the handling of #-comments in /x mode, and, when
       CRLF is a recognized line ending sequence, the match position advance-
       ment for a non-anchored pattern. There is more detail about this in the
       section on pcre_exec() options below.

       The choice of newline convention does not affect the interpretation of
       the \n or \r escape sequences, nor does it affect what \R matches,
       which is controlled in a similar way, but by separate options.


MULTITHREADING

       The PCRE functions can be used in multi-threading applications, with
       the proviso that the memory management functions pointed to by
       pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
       callout function pointed to by pcre_callout, are shared by all threads.

       The compiled form of a regular expression is not altered during match-
       ing, so the same compiled pattern can safely be used by several threads
       at once.

       If the just-in-time optimization feature is being used, it needs sepa-
       rate memory stack areas for each thread. See the pcrejit documentation
       for more details.


SAVING PRECOMPILED PATTERNS FOR LATER USE

       The compiled form of a regular expression can be saved and re-used at a
       later time, possibly by a different program, and even on a host other
       than the one on which it was compiled.
       Details are given in the
       pcreprecompile documentation, which includes a description of the
       pcre_pattern_to_host_byte_order() function. However, compiling a regu-
       lar expression with one version of PCRE for use with a different ver-
       sion is not guaranteed to work and may cause crashes.


CHECKING BUILD-TIME OPTIONS

       int pcre_config(int what, void *where);

       The function pcre_config() makes it possible for a PCRE client to dis-
       cover which optional features have been compiled into the PCRE library.
       The pcrebuild documentation has more details about these optional fea-
       tures.

       The first argument for pcre_config() is an integer, specifying which
       information is required; the second argument is a pointer to a variable
       into which the information is placed. The returned value is zero on
       success, or the negative error code PCRE_ERROR_BADOPTION if the value
       in the first argument is not recognized. The following information is
       available:

         PCRE_CONFIG_UTF8

       The output is an integer that is set to one if UTF-8 support is avail-
       able; otherwise it is set to zero. This value should normally be given
       to the 8-bit version of this function, pcre_config(). If it is given to
       the 16-bit or 32-bit version of this function, the result is
       PCRE_ERROR_BADOPTION.

         PCRE_CONFIG_UTF16

       The output is an integer that is set to one if UTF-16 support is avail-
       able; otherwise it is set to zero. This value should normally be given
       to the 16-bit version of this function, pcre16_config(). If it is given
       to the 8-bit or 32-bit version of this function, the result is
       PCRE_ERROR_BADOPTION.

         PCRE_CONFIG_UTF32

       The output is an integer that is set to one if UTF-32 support is avail-
       able; otherwise it is set to zero. This value should normally be given
       to the 32-bit version of this function, pcre32_config().
       If it is given
       to the 8-bit or 16-bit version of this function, the result is
       PCRE_ERROR_BADOPTION.

         PCRE_CONFIG_UNICODE_PROPERTIES

       The output is an integer that is set to one if support for Unicode
       character properties is available; otherwise it is set to zero.

         PCRE_CONFIG_JIT

       The output is an integer that is set to one if support for just-in-time
       compiling is available; otherwise it is set to zero.

         PCRE_CONFIG_JITTARGET

       The output is a pointer to a zero-terminated "const char *" string. If
       JIT support is available, the string contains the name of the architec-
       ture for which the JIT compiler is configured, for example "x86 32bit
       (little endian + unaligned)". If JIT support is not available, the
       result is NULL.

         PCRE_CONFIG_NEWLINE

       The output is an integer whose value specifies the default character
       sequence that is recognized as meaning "newline". The values that are
       supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
       for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC environments, CR,
       ANYCRLF, and ANY yield the same values. However, the value for LF is
       normally 21, though some EBCDIC environments use 37. The corresponding
       values for CRLF are 3349 and 3365. The default should normally corre-
       spond to the standard sequence for your operating system.

         PCRE_CONFIG_BSR

       The output is an integer whose value indicates what character sequences
       the \R escape sequence matches by default. A value of 0 means that \R
       matches any Unicode line ending sequence; a value of 1 means that \R
       matches only CR, LF, or CRLF. The default can be overridden when a pat-
       tern is compiled or matched.

         PCRE_CONFIG_LINK_SIZE

       The output is an integer that contains the number of bytes used for
       internal linkage in compiled regular expressions.
       For the 8-bit
       library, the value can be 2, 3, or 4. For the 16-bit library, the value
       is either 2 or 4 and is still a number of bytes. For the 32-bit
       library, the value is either 2 or 4 and is still a number of bytes. The
       default value of 2 is sufficient for all but the most massive patterns,
       since it allows the compiled pattern to be up to 64K in size. Larger
       values allow larger regular expressions to be compiled, at the expense
       of slower matching.

         PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

       The output is an integer that contains the threshold above which the
       POSIX interface uses malloc() for output vectors. Further details are
       given in the pcreposix documentation.

         PCRE_CONFIG_MATCH_LIMIT

       The output is a long integer that gives the default limit for the num-
       ber of internal matching function calls in a pcre_exec() execution.
       Further details are given with pcre_exec() below.

         PCRE_CONFIG_MATCH_LIMIT_RECURSION

       The output is a long integer that gives the default limit for the depth
       of recursion when calling the internal matching function in a
       pcre_exec() execution. Further details are given with pcre_exec()
       below.

         PCRE_CONFIG_STACKRECURSE

       The output is an integer that is set to one if internal recursion when
       running pcre_exec() is implemented by recursive function calls that use
       the stack to remember their state. This is the usual way that PCRE is
       compiled. The output is zero if PCRE was compiled to use blocks of data
       on the heap instead of recursive function calls. In this case,
       pcre_stack_malloc and pcre_stack_free are called to manage memory
       blocks on the heap, thus avoiding the use of the stack.
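
       The PCRE_CONFIG_NEWLINE values listed above are consistent with a sim-
       ple packing: a two-character sequence is reported as the first charac-
       ter code times 256 plus the second, so 3338 is 13*256 + 10 (CR, LF),
       and the EBCDIC values 3349 and 3365 are 13*256 + 21 and 13*256 + 37.
       The sketch below decodes such a value in plain C. Both function names
       are hypothetical helpers invented for this illustration, not PCRE
       functions; in a real program the value would normally be obtained by
       passing the address of an int to pcre_config() with
       PCRE_CONFIG_NEWLINE.

```c
#include <stdio.h>

/* Hypothetical helper (not part of PCRE): if value encodes a two-character
   newline sequence, stores the two character codes and returns 1.
   Returns 0 for single-character values and for the negative values that
   stand for ANY and ANYCRLF. */
int split_newline_pair(int value, int *first, int *second)
{
  if (value <= 255) return 0;
  *first = (value >> 8) & 255;
  *second = value & 255;
  return 1;
}

/* Hypothetical helper (not part of PCRE): writes a short description of a
   PCRE_CONFIG_NEWLINE value into buf. */
void describe_newline(int value, char *buf, size_t buflen)
{
  int first, second;

  if (value == -1)
    snprintf(buf, buflen, "ANY");
  else if (value == -2)
    snprintf(buf, buflen, "ANYCRLF");
  else if (split_newline_pair(value, &first, &second))
    snprintf(buf, buflen, "pair %d,%d", first, second);
  else
    snprintf(buf, buflen, "single %d", value);
}
```

       Working from the character codes rather than from fixed names keeps
       the helper usable in both ASCII and EBCDIC environments, where LF has
       different codes.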


COMPILING A PATTERN

       pcre *pcre_compile(const char *pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre *pcre_compile2(const char *pattern, int options,
            int *errorcodeptr,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       Either of the functions pcre_compile() or pcre_compile2() can be called
       to compile a pattern into an internal form. The only difference between
       the two interfaces is that pcre_compile2() has an additional argument,
       errorcodeptr, via which a numerical error code can be returned. To
       avoid too much repetition, we refer just to pcre_compile() below, but
       the information applies equally to pcre_compile2().

       The pattern is a C string terminated by a binary zero, and is passed in
       the pattern argument. A pointer to a single block of memory that is
       obtained via pcre_malloc is returned. This contains the compiled code
       and related data. The pcre type is defined for the returned block; this
       is a typedef for a structure whose contents are not externally defined.
       It is up to the caller to free the memory (via pcre_free) when it is no
       longer required.

       Although the compiled code of a PCRE regex is relocatable, that is, it
       does not depend on memory location, the complete pcre data block is not
       fully relocatable, because it may contain a copy of the tableptr argu-
       ment, which is an address (see below).

       The options argument contains various bit settings that affect the com-
       pilation. It should be zero if no options are required. The available
       options are described below. Some of them (in particular, those that
       are compatible with Perl, but some others as well) can also be set and
       unset from within the pattern (see the detailed description in the
       pcrepattern documentation).
       For those options that can be different in
       different parts of the pattern, the contents of the options argument
       specify their settings at the start of compilation and execution. The
       PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
       PCRE_NO_START_OPTIMIZE options can be set at the time of matching as
       well as at compile time.

       If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
       if compilation of a pattern fails, pcre_compile() returns NULL, and
       sets the variable pointed to by errptr to point to a textual error mes-
       sage. This is a static string that is part of the library. You must not
       try to free it. Normally, the offset from the start of the pattern to
       the byte that was being processed when the error was discovered is
       placed in the variable pointed to by erroffset, which must not be NULL
       (if it is, an immediate error is given). However, for an invalid UTF-8
       string, the offset is that of the first byte of the failing character.

       Some errors are not detected until the whole pattern has been scanned;
       in these cases, the offset passed back is the length of the pattern.
       Note that the offset is in bytes, not characters, even in UTF-8 mode.
       It may sometimes point into the middle of a UTF-8 character.

       If pcre_compile2() is used instead of pcre_compile(), and the error-
       codeptr argument is not NULL, a non-zero error code number is returned
       via this argument in the event of an error. This is in addition to the
       textual error message. Error codes and messages are listed below.

       If the final argument, tableptr, is NULL, PCRE uses a default set of
       character tables that are built when PCRE is compiled, using the
       default C locale. Otherwise, tableptr must be an address that is the
       result of a call to pcre_maketables().
       This value is stored with the
       compiled pattern, and used again by pcre_exec(), unless another table
       pointer is passed to it. For more discussion, see the section on locale
       support below.

       This code fragment shows a typical straightforward call to pcre_com-
       pile():

         pcre *re;
         const char *error;
         int erroffset;
         re = pcre_compile(
           "^A.*Z",          /* the pattern */
           0,                /* default options */
           &error,           /* for error message */
           &erroffset,       /* for error offset */
           NULL);            /* use default character tables */

       The following names for option bits are defined in the pcre.h header
       file:

         PCRE_ANCHORED

       If this bit is set, the pattern is forced to be "anchored", that is, it
       is constrained to match only at the first matching point in the string
       that is being searched (the "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern itself, which is the
       only way to do it in Perl.

         PCRE_AUTO_CALLOUT

       If this bit is set, pcre_compile() automatically inserts callout items,
       all with number 255, before each pattern item. For discussion of the
       callout facility, see the pcrecallout documentation.

         PCRE_BSR_ANYCRLF
         PCRE_BSR_UNICODE

       These options (which are mutually exclusive) control what the \R escape
       sequence matches. The choice is either to match only CR, LF, or CRLF,
       or to match any Unicode newline sequence. The default is specified when
       PCRE is built. It can be overridden from within the pattern, or by set-
       ting an option when a compiled pattern is matched.

         PCRE_CASELESS

       If this bit is set, letters in the pattern match both upper and lower
       case letters. It is equivalent to Perl's /i option, and it can be
       changed within a pattern by a (?i) option setting.
       In UTF-8 mode, PCRE
       always understands the concept of case for characters whose values are
       less than 128, so caseless matching is always possible. For characters
       with higher values, the concept of case is supported if PCRE is com-
       piled with Unicode property support, but not otherwise. If you want to
       use caseless matching for characters 128 and above, you must ensure
       that PCRE is compiled with Unicode property support as well as with
       UTF-8 support.

         PCRE_DOLLAR_ENDONLY

       If this bit is set, a dollar metacharacter in the pattern matches only
       at the end of the subject string. Without this option, a dollar also
       matches immediately before a newline at the end of the string (but not
       before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
       if PCRE_MULTILINE is set. There is no equivalent to this option in
       Perl, and no way to set it within a pattern.

         PCRE_DOTALL

       If this bit is set, a dot metacharacter in the pattern matches a char-
       acter of any value, including one that indicates a newline. However, it
       only ever matches one character, even if newlines are coded as CRLF.
       Without this option, a dot does not match when the current position is
       at a newline. This option is equivalent to Perl's /s option, and it can
       be changed within a pattern by a (?s) option setting. A negative class
       such as [^a] always matches newline characters, independent of the set-
       ting of this option.

         PCRE_DUPNAMES

       If this bit is set, names used to identify capturing subpatterns need
       not be unique. This can be helpful for certain types of pattern when it
       is known that only one instance of the named subpattern can ever be
       matched. There are more details of named subpatterns below; see also
       the pcrepattern documentation.

         PCRE_EXTENDED

       If this bit is set, white space data characters in the pattern are
       totally ignored except when escaped or inside a character class. White
       space does not include the VT character (code 11). In addition, charac-
       ters between an unescaped # outside a character class and the next new-
       line, inclusive, are also ignored. This is equivalent to Perl's /x
       option, and it can be changed within a pattern by a (?x) option set-
       ting.

       Which characters are interpreted as newlines is controlled by the
       options passed to pcre_compile() or by a special sequence at the start
       of the pattern, as described in the section entitled "Newline conven-
       tions" in the pcrepattern documentation. Note that the end of this type
       of comment is a literal newline sequence in the pattern; escape
       sequences that happen to represent a newline do not count.

       This option makes it possible to include comments inside complicated
       patterns. Note, however, that this applies only to data characters.
       White space characters may never appear within special character
       sequences in a pattern, for example within the sequence (?( that intro-
       duces a conditional subpattern.

         PCRE_EXTRA

       This option was invented in order to turn on additional functionality
       of PCRE that is incompatible with Perl, but it is currently of very
       little use. When set, any backslash in a pattern that is followed by a
       letter that has no special meaning causes an error, thus reserving
       these combinations for future expansion. By default, as in Perl, a
       backslash followed by a letter with no special meaning is treated as a
       literal. (Perl can, however, be persuaded to give an error for this, by
       running it with the -w option.) There are at present no other features
       controlled by this option. It can also be set by a (?X) option setting
       within a pattern.

         PCRE_FIRSTLINE

       If this option is set, an unanchored pattern is required to match
       before or at the first newline in the subject string, though the
       matched text may continue over the newline.

         PCRE_JAVASCRIPT_COMPAT

       If this option is set, PCRE's behaviour is changed in some ways so that
       it is compatible with JavaScript rather than Perl. The changes are as
       follows:

       (1) A lone closing square bracket in a pattern causes a compile-time
       error, because this is illegal in JavaScript (by default it is treated
       as a data character). Thus, the pattern AB]CD becomes illegal when this
       option is set.

       (2) At run time, a back reference to an unset subpattern group matches
       an empty string (by default this causes the current matching alterna-
       tive to fail). A pattern such as (\1)(a) succeeds when this option is
       set (assuming it can find an "a" in the subject), whereas it fails by
       default, for Perl compatibility.

       (3) \U matches an upper case "U" character; by default \U causes a com-
       pile time error (Perl uses \U to upper case subsequent characters).

       (4) \u matches a lower case "u" character unless it is followed by four
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, \u causes a compile time error (Perl
       uses it to upper case the following character).

       (5) \x matches a lower case "x" character unless it is followed by two
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, as in Perl, a hexadecimal number is
       always expected after \x, but it may have zero, one, or two digits (so,
       for example, \xz matches a binary zero character followed by z).

         PCRE_MULTILINE

       By default, PCRE treats the subject string as consisting of a single
       line of characters (even if it actually contains newlines). The "start
       of line" metacharacter (^) matches only at the start of the string,
       while the "end of line" metacharacter ($) matches only at the end of
       the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
       is set). This is the same as Perl.

       When PCRE_MULTILINE is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before internal
       newlines in the subject string, respectively, as well as at the very
       start and end. This is equivalent to Perl's /m option, and it can be
       changed within a pattern by a (?m) option setting. If there are no new-
       lines in a subject string, or no occurrences of ^ or $ in a pattern,
       setting PCRE_MULTILINE has no effect.

         PCRE_NEWLINE_CR
         PCRE_NEWLINE_LF
         PCRE_NEWLINE_CRLF
         PCRE_NEWLINE_ANYCRLF
         PCRE_NEWLINE_ANY

       These options override the default newline definition that was chosen
       when PCRE was built. Setting the first or the second specifies that a
       newline is indicated by a single character (CR or LF, respectively).
       Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
       two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
       that any of the three preceding sequences should be recognized. Setting
       PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
       recognized.

       In an ASCII/Unicode environment, the Unicode newline sequences are the
       three just mentioned, plus the single characters VT (vertical tab,
       U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep-
       arator, U+2028), and PS (paragraph separator, U+2029). For the 8-bit
       library, the last two are recognized only in UTF-8 mode.

       When PCRE is compiled to run in an EBCDIC (mainframe) environment, the
       code for CR is 0x0d, the same as ASCII. However, the character code for
       LF is normally 0x15, though in some EBCDIC environments 0x25 is used.
       Whichever of these is not LF is made to correspond to Unicode's NEL
       character. EBCDIC codes are all less than 256. For more details, see
       the pcrebuild documentation.

       The newline setting in the options word uses three bits that are
       treated as a number, giving eight possibilities. Currently only six are
       used (default plus the five values above). This means that if you set
       more than one newline option, the combination may or may not be sensi-
       ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
       PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
       cause an error.

       The only time that a line break in a pattern is specially recognized
       when compiling is when PCRE_EXTENDED is set. CR and LF are white space
       characters, and so are ignored in this mode. Also, an unescaped # out-
       side a character class indicates a comment that lasts until after the
       next line break sequence. In other circumstances, line break sequences
       in patterns are treated as literal data.

       The newline option that is set at compile time becomes the default that
       is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.

         PCRE_NO_AUTO_CAPTURE

       If this option is set, it disables the use of numbered capturing paren-
       theses in the pattern. Any opening parenthesis that is not followed by
       ? behaves as if it were followed by ?: but named parentheses can still
       be used for capturing (and they acquire numbers in the usual way).
       There is no equivalent of this option in Perl.

         PCRE_NO_START_OPTIMIZE

       This is an option that acts at matching time; that is, it is really an
       option for pcre_exec() or pcre_dfa_exec(). If it is set at compile
       time, it is remembered with the compiled pattern and assumed at match-
       ing time. For details see the discussion of PCRE_NO_START_OPTIMIZE
       below.

         PCRE_UCP

       This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
       \w, and some of the POSIX character classes. By default, only ASCII
       characters are recognized, but if PCRE_UCP is set, Unicode properties
       are used instead to classify characters. More details are given in the
       section on generic character types in the pcrepattern page. If you set
       PCRE_UCP, matching one of the items it affects takes much longer. The
       option is available only if PCRE has been compiled with Unicode prop-
       erty support.

         PCRE_UNGREEDY

       This option inverts the "greediness" of the quantifiers so that they
       are not greedy by default, but become greedy if followed by "?". It is
       not compatible with Perl. It can also be set by a (?U) option setting
       within the pattern.

         PCRE_UTF8

       This option causes PCRE to regard both the pattern and the subject as
       strings of UTF-8 characters instead of single-byte strings. However, it
       is available only when PCRE is built to include UTF support. If not,
       the use of this option provokes an error. Details of how this option
       changes the behaviour of PCRE are given in the pcreunicode page.

         PCRE_NO_UTF8_CHECK

       When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
       automatically checked. There is a discussion about the validity of
       UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence is
       found, pcre_compile() returns an error.
       If you already know that your
       pattern is valid, and you want to skip this check for performance rea-
       sons, you can set the PCRE_NO_UTF8_CHECK option. When it is set, the
       effect of passing an invalid UTF-8 string as a pattern is undefined. It
       may cause your program to crash. Note that this option can also be
       passed to pcre_exec() and pcre_dfa_exec(), to suppress the validity
       checking of subject strings only. If the same string is being matched
       many times, the option can be safely set for the second and subsequent
       matchings to improve performance.


COMPILATION ERROR CODES

       The following table lists the error codes that may be returned by
       pcre_compile2(), along with the error messages that may be returned by
       both compiling functions. Note that error messages are always 8-bit
       ASCII strings, even in 16-bit or 32-bit mode. As PCRE has developed,
       some error codes have fallen out of use. To avoid confusion, they have
       not been re-used.

          0  no error
          1  \ at end of pattern
          2  \c at end of pattern
          3  unrecognized character follows \
          4  numbers out of order in {} quantifier
          5  number too big in {} quantifier
          6  missing terminating ] for character class
          7  invalid escape sequence in character class
          8  range out of order in character class
          9  nothing to repeat
         10  [this code is not in use]
         11  internal error: unexpected repeat
         12  unrecognized character after (? or (?-
         13  POSIX named classes are supported only within a class
         14  missing )
         15  reference to non-existent subpattern
         16  erroffset passed as NULL
         17  unknown option bit(s) set
         18  missing ) after comment
         19  [this code is not in use]
         20  regular expression is too large
         21  failed to get memory
         22  unmatched parentheses
         23  internal error: code overflow
         24  unrecognized character after (?<
         25  lookbehind assertion is not fixed length
         26  malformed number or name after (?(
         27  conditional group contains more than two branches
         28  assertion expected after (?(
         29  (?R or (?[+-]digits must be followed by )
         30  unknown POSIX class name
         31  POSIX collating elements are not supported
         32  this version of PCRE is compiled without UTF support
         33  [this code is not in use]
         34  character value in \x{...} sequence is too large
         35  invalid condition (?(0)
         36  \C not allowed in lookbehind assertion
         37  PCRE does not support \L, \l, \N{name}, \U, or \u
         38  number after (?C is > 255
         39  closing ) for (?C expected
         40  recursive call could loop indefinitely
         41  unrecognized character after (?P
         42  syntax error in subpattern name (missing terminator)
         43  two named subpatterns have the same name
         44  invalid UTF-8 string (specifically UTF-8)
         45  support for \P, \p, and \X has not been compiled
         46  malformed \P or \p sequence
         47  unknown property name after \P or \p
         48  subpattern name is too long (maximum 32 characters)
         49  too many named subpatterns (maximum 10000)
         50  [this code is not in use]
         51  octal value is greater than \377 in 8-bit non-UTF-8 mode
         52  internal error: overran compiling workspace
         53  internal error: previously-checked referenced subpattern
               not found
         54  DEFINE group contains more than one branch
         55  repeating a DEFINE group is not allowed
         56  inconsistent NEWLINE options
         57  \g is not followed by a braced, angle-bracketed, or quoted
               name/number or by a plain number
         58  a numbered reference must not be zero
         59  an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
         60  (*VERB) not recognized
         61  number is too big
         62  subpattern name expected
         63  digit expected after (?+
         64  ] is an invalid data character in JavaScript compatibility mode
         65  different names for subpatterns of the same number are
               not allowed
         66  (*MARK) must have an argument
         67  this version of PCRE is not compiled with Unicode property
               support
         68  \c must be followed by an ASCII character
         69  \k is not followed by a braced, angle-bracketed, or quoted name
         70  internal error: unknown opcode in find_fixedlength()
         71  \N is not supported in a class
         72  too many forward references
         73  disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
         74  invalid UTF-16 string (specifically UTF-16)
         75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
         76  character value in \u.... sequence is too large
         77  invalid UTF-32 string (specifically UTF-32)

       The numbers 32 and 10000 in errors 48 and 49 are defaults; different
       values may be used if the limits were changed when PCRE was built.


STUDYING A PATTERN

       pcre_extra *pcre_study(const pcre *code, int options,
            const char **errptr);

       If a compiled pattern is going to be used several times, it is worth
       spending more time analyzing it in order to speed up the time taken for
       matching. The function pcre_study() takes a pointer to a compiled pat-
       tern as its first argument. If studying the pattern produces additional
       information that will help speed up matching, pcre_study() returns a
       pointer to a pcre_extra block, in which the study_data field points to
       the results of the study.

       The returned value from pcre_study() can be passed directly to
       pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con-
       tains other fields that can be set by the caller before the block is
       passed; these are described below in the section on matching a pattern.

       If studying the pattern does not produce any useful information,
       pcre_study() returns NULL by default. In that circumstance, if the
       calling program wants to pass any of the other fields to pcre_exec() or
       pcre_dfa_exec(), it must set up its own pcre_extra block. However, if
       pcre_study() is called with the PCRE_STUDY_EXTRA_NEEDED option, it
       returns a pcre_extra block even if studying did not find any additional
       information. It may still return NULL, however, if an error occurs in
       pcre_study().

       The second argument of pcre_study() contains option bits. There are
       three further options in addition to PCRE_STUDY_EXTRA_NEEDED:

         PCRE_STUDY_JIT_COMPILE
         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE

       If any of these are set, and the just-in-time compiler is available,
       the pattern is further compiled into machine code that executes much
       faster than the pcre_exec() interpretive matching function. If the
       just-in-time compiler is not available, these options are ignored. All
       undefined bits in the options argument must be zero.

       JIT compilation is a heavyweight optimization. It can take some time
       for patterns to be analyzed, and for one-off matches and simple pat-
       terns the benefit of faster execution might be offset by a much slower
       study time. Not all patterns can be optimized by the JIT compiler. For
       those that cannot be handled, matching automatically falls back to the
       pcre_exec() interpreter. For more details, see the pcrejit documenta-
       tion.

       The third argument for pcre_study() is a pointer for an error message.
       If studying succeeds (even if no data is returned), the variable it
       points to is set to NULL. Otherwise it is set to point to a textual
       error message. This is a static string that is part of the library. You
       must not try to free it. You should test the error pointer for NULL
       after calling pcre_study(), to be sure that it has run successfully.

       When you are finished with a pattern, you can free the memory used for
       the study data by calling pcre_free_study(). This function was added to
       the API for release 8.20. For earlier versions, the memory could be
       freed with pcre_free(), just like the pattern itself. This will still
       work in cases where JIT optimization is not used, but it is advisable
       to change to the new function when convenient.

       This is a typical way in which pcre_study() is used (except that in a
       real application there should be tests for errors):

         int rc;
         int erroroffset;
         int ovector[30];
         const char *error;
         pcre *re;
         pcre_extra *sd;
         re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
         sd = pcre_study(
           re,             /* result of pcre_compile() */
           0,              /* no options */
           &error);        /* set to NULL or points to a message */
         rc = pcre_exec(   /* see below for details of pcre_exec() options */
           re, sd, "subject", 7, 0, 0, ovector, 30);
         ...
         pcre_free_study(sd);
         pcre_free(re);

       Studying a pattern does two things: first, a lower bound for the length
       of subject string that is needed to match the pattern is computed. This
       does not mean that there are any strings of that length that match, but
       it does guarantee that no shorter strings match. The value is used to
       avoid wasting time by trying to match strings that are shorter than the
       lower bound. You can find out the value in a calling program via the
       pcre_fullinfo() function.

       Studying a pattern is also useful for non-anchored patterns that do not
       have a single fixed starting character. A bitmap of possible starting
       bytes is created. This speeds up finding a position in the subject at
       which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
       values less than 256. In 32-bit mode, the bitmap is used for 32-bit
       values less than 256.)

       These two optimizations apply to both pcre_exec() and pcre_dfa_exec(),
       and the information is also used by the JIT compiler. The optimiza-
       tions can be disabled by setting the PCRE_NO_START_OPTIMIZE option when
       calling pcre_exec() or pcre_dfa_exec(), but if this is done, JIT execu-
       tion is also disabled. You might want to do this if your pattern con-
       tains callouts or (*MARK) and you want to make use of these facilities
       in cases where matching fails. See the discussion of
       PCRE_NO_START_OPTIMIZE below.


LOCALE SUPPORT

       PCRE handles caseless matching, and determines whether characters are
       letters, digits, or whatever, by reference to a set of tables, indexed
       by character value. When running in UTF-8 mode, this applies only to
       characters with codes less than 128. By default, higher-valued codes
       never match escapes such as \w or \d, but they can be tested with \p if
       PCRE is built with Unicode character property support. Alternatively,
       the PCRE_UCP option can be set at compile time; this causes \w and
       friends to use Unicode property support instead of built-in tables. The
       use of locales with Unicode is discouraged. If you are handling charac-
       ters with codes greater than 128, you should either use UTF-8 and Uni-
       code, or use locales, but not try to mix the two.

       PCRE contains an internal set of tables that are used when the final
       argument of pcre_compile() is NULL. These are sufficient for many
       applications.
       Normally, the internal tables recognize only ASCII char-
       acters. However, when PCRE is built, it is possible to cause the inter-
       nal tables to be rebuilt in the default "C" locale of the local system,
       which may cause them to be different.

       The internal tables can always be overridden by tables supplied by the
       application that calls PCRE. These may be created in a different locale
       from the default. As more and more applications change to using Uni-
       code, the need for this locale support is expected to die away.

       External tables are built by calling the pcre_maketables() function,
       which has no arguments, in the relevant locale. The result can then be
       passed to pcre_compile() or pcre_exec() as often as necessary. For
       example, to build and use tables that are appropriate for the French
       locale (where accented characters with values greater than 128 are
       treated as letters), the following code could be used:

         setlocale(LC_CTYPE, "fr_FR");
         tables = pcre_maketables();
         re = pcre_compile(..., tables);

       The locale name "fr_FR" is used on Linux and other Unix-like systems;
       if you are using Windows, the name for the French locale is "french".

       When pcre_maketables() runs, the tables are built in memory that is
       obtained via pcre_malloc. It is the caller's responsibility to ensure
       that the memory containing the tables remains available for as long as
       it is needed.

       The pointer that is passed to pcre_compile() is saved with the compiled
       pattern, and the same tables are used via this pointer by pcre_study()
       and normally also by pcre_exec(). Thus, by default, for any single pat-
       tern, compilation, studying and matching all happen in the same locale,
       but different patterns can be compiled in different locales.

       It is possible to pass a table pointer or NULL (indicating the use of
       the internal tables) to pcre_exec(). Although not intended for this
       purpose, this facility could be used to match a pattern in a different
       locale from the one in which it was compiled. Passing table pointers at
       run time is discussed below in the section on matching a pattern.


INFORMATION ABOUT A PATTERN

       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
            int what, void *where);

       The pcre_fullinfo() function returns information about a compiled pat-
       tern. It replaces the pcre_info() function, which was removed from the
       library at version 8.30, after more than 10 years of obsolescence.

       The first argument for pcre_fullinfo() is a pointer to the compiled
       pattern. The second argument is the result of pcre_study(), or NULL if
       the pattern was not studied. The third argument specifies which piece
       of information is required, and the fourth argument is a pointer to a
       variable to receive the data. The yield of the function is zero for
       success, or one of the following negative numbers:

         PCRE_ERROR_NULL           the argument code was NULL
                                   the argument where was NULL
         PCRE_ERROR_BADMAGIC       the "magic number" was not found
         PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
                                   endianness
         PCRE_ERROR_BADOPTION      the value of what was invalid

       The "magic number" is placed at the start of each compiled pattern as
       a simple check against passing an arbitrary memory pointer. The endi-
       anness error can occur if a compiled pattern is saved and reloaded on a
       different host.
       Here is a typical call of pcre_fullinfo(), to obtain
       the length of the compiled pattern:

         int rc;
         size_t length;
         rc = pcre_fullinfo(
           re,               /* result of pcre_compile() */
           sd,               /* result of pcre_study(), or NULL */
           PCRE_INFO_SIZE,   /* what is required */
           &length);         /* where to put the data */

       The possible values for the third argument are defined in pcre.h, and
       are as follows:

         PCRE_INFO_BACKREFMAX

       Return the number of the highest back reference in the pattern. The
       fourth argument should point to an int variable. Zero is returned if
       there are no back references.

         PCRE_INFO_CAPTURECOUNT

       Return the number of capturing subpatterns in the pattern. The fourth
       argument should point to an int variable.

         PCRE_INFO_DEFAULT_TABLES

       Return a pointer to the internal default character tables within PCRE.
       The fourth argument should point to an unsigned char * variable. This
       information call is provided for internal use by the pcre_study() func-
       tion. External callers can cause PCRE to use its internal tables by
       passing a NULL table pointer.

         PCRE_INFO_FIRSTBYTE

       Return information about the first data unit of any matched string, for
       a non-anchored pattern. (The name of this option refers to the 8-bit
       library, where data units are bytes.) The fourth argument should point
       to an int variable.

       If there is a fixed first value, for example, the letter "c" from a
       pattern such as (cat|cow|coyote), its value is returned. In the 8-bit
       library, the value is always less than 256. In the 16-bit library the
       value can be up to 0xffff. In the 32-bit library the value can be up to
       0x10ffff.

       If there is no fixed first value, and if either

       (a) the pattern was compiled with the PCRE_MULTILINE option, and every
       branch starts with "^", or

       (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
       set (if it were set, the pattern would be anchored),

       -1 is returned, indicating that the pattern matches only at the start
       of a subject string or after any newline within the string. Otherwise
       -2 is returned. For anchored patterns, -2 is returned.

       This value is deprecated because, for the 32-bit library in non-UTF-32
       mode, it cannot return the full 32-bit range of character values; use
       PCRE_INFO_FIRSTCHARACTERFLAGS and PCRE_INFO_FIRSTCHARACTER instead.

         PCRE_INFO_FIRSTTABLE

       If the pattern was studied, and this resulted in the construction of a
       256-bit table indicating a fixed set of values for the first data unit
       in any matching string, a pointer to the table is returned. Otherwise
       NULL is returned. The fourth argument should point to an unsigned char
       * variable.

         PCRE_INFO_HASCRORLF

       Return 1 if the pattern contains any explicit matches for CR or LF
       characters, otherwise 0. The fourth argument should point to an int
       variable. An explicit match is either a literal CR or LF character, or
       \r or \n.

         PCRE_INFO_JCHANGED

       Return 1 if the (?J) or (?-J) option setting is used in the pattern,
       otherwise 0. The fourth argument should point to an int variable. (?J)
       and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.

         PCRE_INFO_JIT

       Return 1 if the pattern was studied with one of the JIT options, and
       just-in-time compiling was successful. The fourth argument should point
       to an int variable.
       A return value of 0 means that JIT support is not
       available in this version of PCRE, or that the pattern was not studied
       with a JIT option, or that the JIT compiler could not handle this par-
       ticular pattern. See the pcrejit documentation for details of what can
       and cannot be handled.

         PCRE_INFO_JITSIZE

       If the pattern was successfully studied with a JIT option, return the
       size of the JIT compiled code, otherwise return zero. The fourth argu-
       ment should point to a size_t variable.

         PCRE_INFO_LASTLITERAL

       Return the value of the rightmost literal data unit that must exist in
       any matched string, other than at its start, if such a value has been
       recorded. The fourth argument should point to an int variable. If there
       is no such value, -1 is returned. For anchored patterns, a last literal
       value is recorded only if it follows something of variable length. For
       example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
       /^a\dz\d/ the returned value is -1.

       This value is deprecated because, for the 32-bit library in non-UTF-32
       mode, it cannot return the full 32-bit range of character values; use
       PCRE_INFO_REQUIREDCHARFLAGS and PCRE_INFO_REQUIREDCHAR instead.

         PCRE_INFO_MAXLOOKBEHIND

       Return the number of characters (NB not bytes) in the longest lookbe-
       hind assertion in the pattern. Note that the simple assertions \b and
       \B require a one-character lookbehind. This information is useful when
       doing multi-segment matching using the partial matching facilities.

         PCRE_INFO_MINLENGTH

       If the pattern was studied and a minimum length for matching subject
       strings was computed, its value is returned. Otherwise the returned
       value is -1. The value is a number of characters, which in UTF-8 mode
       may be different from the number of bytes.
       The fourth argument should
       point to an int variable. A non-negative value is a lower bound to the
       length of any matching string. There may not be any strings of that
       length that do actually match, but every string that does match is at
       least that long.

         PCRE_INFO_NAMECOUNT
         PCRE_INFO_NAMEENTRYSIZE
         PCRE_INFO_NAMETABLE

       PCRE supports the use of named as well as numbered capturing parenthe-
       ses. The names are just an additional way of identifying the parenthe-
       ses, which still acquire numbers. Several convenience functions such as
       pcre_get_named_substring() are provided for extracting captured sub-
       strings by name. It is also possible to extract the data directly, by
       first converting the name to a number in order to access the correct
       pointers in the output vector (described with pcre_exec() below). To do
       the conversion, you need to use the name-to-number map, which is
       described by these three values.

       The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
       gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
       of each entry; both of these return an int value. The entry size
       depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
       a pointer to the first entry of the table. This is a pointer to char in
       the 8-bit library, where the first two bytes of each entry are the num-
       ber of the capturing parenthesis, most significant byte first. In the
       16-bit library, the pointer points to 16-bit data units, the first of
       which contains the parenthesis number. In the 32-bit library, the
       pointer points to 32-bit data units, the first of which contains the
       parenthesis number. The rest of the entry is the corresponding name,
       zero terminated.

       The names are in alphabetical order.
       Duplicate names may appear if (?|
       is used to create multiple groups with the same number, as described in
       the section on duplicate subpattern numbers in the pcrepattern page.
       Duplicate names for subpatterns with different numbers are permitted
       only if PCRE_DUPNAMES is set. In all cases of duplicate names, they
       appear in the table in the order in which they were found in the pat-
       tern. In the absence of (?| this is the order of increasing number;
       when (?| is used this is not necessarily the case because later subpat-
       terns may have lower numbers.

       As a simple example of the name/number table, consider the following
       pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
       set, so white space - including newlines - is ignored):

         (?<date> (?<year>(\d\d)?\d\d) -
         (?<month>\d\d) - (?<day>\d\d) )

       There are four named subpatterns, so the table has four entries, and
       each entry in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When writing code to extract data from named subpatterns using the
       name-to-number map, remember that the length of the entries is likely
       to be different for each compiled pattern.

         PCRE_INFO_OKPARTIAL

       Return 1 if the pattern can be used for partial matching with
       pcre_exec(), otherwise 0. The fourth argument should point to an int
       variable. From release 8.00, this always returns 1, because the
       restrictions that previously applied to partial matching have been
       lifted. The pcrepartial documentation gives details of partial match-
       ing.

         PCRE_INFO_OPTIONS

       Return a copy of the options with which the pattern was compiled.
       The
       fourth argument should point to an unsigned long int variable. These
       option bits are those specified in the call to pcre_compile(), modified
       by any top-level option settings at the start of the pattern itself. In
       other words, they are the options that will be in force when matching
       starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
       the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
       and PCRE_EXTENDED.

       A pattern is automatically anchored by PCRE if all of its top-level
       alternatives begin with one of the following:

         ^     unless PCRE_MULTILINE is set
         \A    always
         \G    always
         .*    if PCRE_DOTALL is set and there are no back
                 references to the subpattern in which .* appears

       For such patterns, the PCRE_ANCHORED bit is set in the options returned
       by pcre_fullinfo().

         PCRE_INFO_SIZE

       Return the size of the compiled pattern in bytes (for all three
       libraries). The fourth argument should point to a size_t variable. This
       value does not include the size of the pcre structure that is returned
       by pcre_compile(). The value that is passed as the argument to
       pcre_malloc() when pcre_compile() is getting memory in which to place
       the compiled data is the value returned by this option plus the size of
       the pcre structure. Studying a compiled pattern, with or without JIT,
       does not alter the value returned by this option.

         PCRE_INFO_STUDYSIZE

       Return the size in bytes of the data block pointed to by the study_data
       field in a pcre_extra block. If pcre_extra is NULL, or there is no
       study data, zero is returned. The fourth argument should point to a
       size_t variable. The study_data field is set by pcre_study() to record
       information that will speed up matching (see the section entitled
       "Studying a pattern" above).
       The format of the study_data block is pri-
       vate, but its length is made available via this option so that it can
       be saved and restored (see the pcreprecompile documentation for
       details).

         PCRE_INFO_FIRSTCHARACTERFLAGS

       Return information about the first data unit of any matched string, for
       a non-anchored pattern. The fourth argument should point to an int
       variable.

       If there is a fixed first value, for example, the letter "c" from a
       pattern such as (cat|cow|coyote), 1 is returned, and the character
       value can be retrieved using PCRE_INFO_FIRSTCHARACTER.

       If there is no fixed first value, and if either

       (a) the pattern was compiled with the PCRE_MULTILINE option, and every
       branch starts with "^", or

       (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
       set (if it were set, the pattern would be anchored),

       2 is returned, indicating that the pattern matches only at the start of
       a subject string or after any newline within the string. Otherwise 0 is
       returned. For anchored patterns, 0 is returned.

         PCRE_INFO_FIRSTCHARACTER

       Return the fixed first character value, if PCRE_INFO_FIRSTCHARACTER-
       FLAGS returned 1; otherwise return 0. The fourth argument should point
       to a uint32_t variable.

       In the 8-bit library, the value is always less than 256. In the 16-bit
       library the value can be up to 0xffff. In the 32-bit library in UTF-32
       mode the value can be up to 0x10ffff, and up to 0xffffffff when not
       using UTF-32 mode.
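
       The name-to-number map described under PCRE_INFO_NAMETABLE can be
       walked with a few lines of code. The following is a minimal sketch for
       the 8-bit library layout only (two bytes of group number, most
       significant first, then the zero-terminated name). The function
       lookup_group_number() and the array date_table are hypothetical names;
       date_table reproduces the four-entry date example, with the undefined
       bytes arbitrarily set to zero. In a real program the table pointer,
       entry count and entry size would come from pcre_fullinfo().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Search an 8-bit name table for a name. Returns the capturing group number,
   or -1 if the name is not present. */
int lookup_group_number(const unsigned char *table, int name_count,
                        int entry_size, const char *name)
{
    assert(table != NULL && name != NULL && entry_size >= 3);
    for (int i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* Bytes 0 and 1 hold the number; the name starts at byte 2. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The table from the date example: four entries of eight bytes each. */
const unsigned char date_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0,
};
```

       With this table, looking up "month" yields 4 and "day" yields 5. A
       linear scan is used for simplicity; because the names are stored in
       alphabetical order, a binary search is also possible when there are no
       duplicate names.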

         PCRE_INFO_REQUIREDCHARFLAGS

       Returns 1 if there is a rightmost literal data unit that must exist in
       any matched string, other than at its start. The fourth argument should
       point to an int variable. If there is no such value, 0 is returned. If
       returning 1, the character value itself can be retrieved using
       PCRE_INFO_REQUIREDCHAR.

       For anchored patterns, a last literal value is recorded only if it fol-
       lows something of variable length. For example, for the pattern
       /^a\d+z\d+/ the returned value is 1 (with "z" returned from
       PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.

         PCRE_INFO_REQUIREDCHAR

       Return the value of the rightmost literal data unit that must exist in
       any matched string, other than at its start, if such a value has been
       recorded. The fourth argument should point to a uint32_t variable. If
       there is no such value, 0 is returned.


REFERENCE COUNTS

       int pcre_refcount(pcre *code, int adjust);

       The pcre_refcount() function is used to maintain a reference count in
       the data block that contains a compiled pattern. It is provided for the
       benefit of applications that operate in an object-oriented manner,
       where different parts of the application may be using the same compiled
       pattern, but you want to free the block when they are all done.
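
       The arithmetic that pcre_refcount() applies (the count starts at zero,
       the adjust value is added, and the result is forced into the range 0
       to 65535) can be modelled without the library. The function
       adjust_refcount() below is a hypothetical stand-in that only mirrors
       the documented rule; it does not touch a compiled pattern.

```c
#include <assert.h>

/* Model of the documented rule: add adjust to the current count and force
   the result to lie between 0 and 65535 inclusive. Returns the new count. */
int adjust_refcount(int current, int adjust)
{
    assert(current >= 0 && current <= 65535);
    long long updated = (long long)current + adjust;
    if (updated < 0)
        updated = 0;
    if (updated > 65535)
        updated = 65535;
    return (int)updated;
}
```

       One consequence of the clamping is that a count forced to a limit no
       longer reflects the true number of users, so an application that can
       exceed 65535 references needs its own bookkeeping.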

       When a pattern is compiled, the reference count field is initialized to
       zero. It is changed only by calling this function, whose action is to
       add the adjust value (which may be positive or negative) to it. The
       yield of the function is the new value. However, the value of the count
       is constrained to lie between 0 and 65535, inclusive. If the new value
       is outside these limits, it is forced to the appropriate limit value.

       Except when it is zero, the reference count is not correctly preserved
       if a pattern is compiled on one host and then transferred to a host
       whose byte-order is different. (This seems a highly unlikely scenario.)


MATCHING A PATTERN: THE TRADITIONAL FUNCTION

       int pcre_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       The function pcre_exec() is called to match a subject string against a
       compiled pattern, which is passed in the code argument. If the pattern
       was studied, the result of the study should be passed in the extra
       argument. You can call pcre_exec() with the same code and extra argu-
       ments as many times as you like, in order to match different subject
       strings with the same pattern.

       This function is the main matching facility of the library, and it
       operates in a Perl-like manner. For specialist use there is also an
       alternative matching function, which is described below in the section
       about the pcre_dfa_exec() function.

       In most applications, the pattern will have been compiled (and option-
       ally studied) in the same process that calls pcre_exec(). However, it
       is possible to save compiled patterns and study data, and then use them
       later in different processes, possibly even on different hosts. For a
       discussion about this, see the pcreprecompile documentation.

       Here is an example of a simple call to pcre_exec():

         int rc;
         int ovector[30];
         rc = pcre_exec(
           re,             /* result of pcre_compile() */
           NULL,           /* we didn't study the pattern */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           ovector,        /* vector of integers for substring information */
           30);            /* number of elements (NOT size in bytes) */

   Extra data for pcre_exec()

       If the extra argument is not NULL, it must point to a pcre_extra data
       block. The pcre_study() function returns such a block (when it doesn't
       return NULL), but you can also create one for yourself, and pass addi-
       tional information in it. The pcre_extra block contains the following
       fields (not necessarily in this order):

         unsigned long int flags;
         void *study_data;
         void *executable_jit;
         unsigned long int match_limit;
         unsigned long int match_limit_recursion;
         void *callout_data;
         const unsigned char *tables;
         unsigned char **mark;

       In the 16-bit version of this structure, the mark field has type
       "PCRE_UCHAR16 **".

       In the 32-bit version of this structure, the mark field has type
       "PCRE_UCHAR32 **".

       The flags field is used to specify which of the other fields are set.
       The flag bits are:

         PCRE_EXTRA_CALLOUT_DATA
         PCRE_EXTRA_EXECUTABLE_JIT
         PCRE_EXTRA_MARK
         PCRE_EXTRA_MATCH_LIMIT
         PCRE_EXTRA_MATCH_LIMIT_RECURSION
         PCRE_EXTRA_STUDY_DATA
         PCRE_EXTRA_TABLES

       Other flag bits should be set to zero. The study_data field and some-
       times the executable_jit field are set in the pcre_extra block that is
       returned by pcre_study(), together with the appropriate flag bits.
       You should not set these yourself, but you may add to the block by
       setting other fields and their corresponding flag bits.

       The match_limit field provides a means of preventing PCRE from using up
       a vast amount of resources when running patterns that are not going to
       match, but which have a very large number of possibilities in their
       search trees. The classic example is a pattern that uses nested unlim-
       ited repeats.

       Internally, pcre_exec() uses a function called match(), which it calls
       repeatedly (sometimes recursively). The limit set by match_limit is
       imposed on the number of times this function is called during a match,
       which has the effect of limiting the amount of backtracking that can
       take place. For patterns that are not anchored, the count restarts from
       zero for each position in the subject string.

       When pcre_exec() is called with a pattern that was successfully studied
       with a JIT option, the way that the matching is executed is entirely
       different. However, there is still the possibility of runaway matching
       that goes on for a very long time, and so the match_limit value is also
       used in this case (but in a different way) to limit how long the match-
       ing can continue.

       The default value for the limit can be set when PCRE is built; unless
       it is changed at build time, it is 10 million, which handles all but
       the most extreme cases. You can override the default by supplying
       pcre_exec() with a pcre_extra block in which match_limit is set, and
       PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
       exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.

       The match_limit_recursion field is similar to match_limit, but instead
       of limiting the total number of times that match() is called, it limits
       the depth of recursion.
       The recursion depth is a smaller number than
       the total number of calls, because not all calls to match() are recur-
       sive. This limit is of use only if it is set smaller than match_limit.

       Limiting the recursion depth limits the amount of machine stack that
       can be used, or, when PCRE has been compiled to use memory on the heap
       instead of the stack, the amount of heap memory that can be used. This
       limit is not relevant, and is ignored, when matching is done using JIT
       compiled code.

       The default value for match_limit_recursion can be set when PCRE is
       built; unless it is changed at build time, it is the same value as the
       default for match_limit. You can override the default by supplying
       pcre_exec() with a pcre_extra block in which match_limit_recursion is
       set, and PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If
       the limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.

       The callout_data field is used in conjunction with the "callout" fea-
       ture, and is described in the pcrecallout documentation.

       The tables field is used to pass a character tables pointer to
       pcre_exec(); this overrides the value that is stored with the compiled
       pattern. A non-NULL value is stored with the compiled pattern only if
       custom tables were supplied to pcre_compile() via its tableptr argu-
       ment. If NULL is passed to pcre_exec() using this mechanism, it forces
       PCRE's internal tables to be used. This facility is helpful when re-
       using patterns that have been saved after compiling with an external
       set of tables, because the external tables might be at a different
       address when pcre_exec() is called. See the pcreprecompile documenta-
       tion for a discussion of saving compiled patterns for later use.

       If PCRE_EXTRA_MARK is set in the flags field, the mark field must be
       set to point to a suitable variable.
       If the pattern contains any back-
       tracking control verbs such as (*MARK:NAME), and the execution ends up
       with a name to pass back, a pointer to the name string (zero termi-
       nated) is placed in the variable pointed to by the mark field. The
       names are within the compiled pattern; if you wish to retain such a
       name you must copy it before freeing the memory of a compiled pattern.
       If there is no name to pass back, the variable pointed to by the mark
       field is set to NULL. For details of the backtracking control verbs,
       see the section entitled "Backtracking control" in the pcrepattern doc-
       umentation.

   Option bits for pcre_exec()

       The unused bits of the options argument for pcre_exec() must be zero.
       The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
       PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
       PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and
       PCRE_PARTIAL_SOFT.

       If the pattern was successfully studied with one of the just-in-time
       (JIT) compile options, the only supported options for JIT execution are
       PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
       PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If an
       unsupported option is used, JIT execution is disabled and the normal
       interpretive code in pcre_exec() is run.

         PCRE_ANCHORED

       The PCRE_ANCHORED option limits pcre_exec() to matching at the first
       matching position. If a pattern was compiled with PCRE_ANCHORED, or
       turned out to be anchored by virtue of its contents, it cannot be made
       unanchored at matching time.

         PCRE_BSR_ANYCRLF
         PCRE_BSR_UNICODE

       These options (which are mutually exclusive) control what the \R escape
       sequence matches. The choice is either to match only CR, LF, or CRLF,
       or to match any Unicode newline sequence.
       These options override the
       choice that was made or defaulted when the pattern was compiled.

         PCRE_NEWLINE_CR
         PCRE_NEWLINE_LF
         PCRE_NEWLINE_CRLF
         PCRE_NEWLINE_ANYCRLF
         PCRE_NEWLINE_ANY

       These options override the newline definition that was chosen or
       defaulted when the pattern was compiled. For details, see the descrip-
       tion of pcre_compile() above. During matching, the newline choice
       affects the behaviour of the dot, circumflex, and dollar metacharac-
       ters. It may also alter the way the match position is advanced after a
       match failure for an unanchored pattern.

       When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
       set, and a match attempt for an unanchored pattern fails when the cur-
       rent position is at a CRLF sequence, and the pattern contains no
       explicit matches for CR or LF characters, the match position is
       advanced by two characters instead of one, in other words, to after the
       CRLF.

       The above rule is a compromise that makes the most common cases work as
       expected. For example, if the pattern is .+A (and the PCRE_DOTALL
       option is not set), it does not match the string "\r\nA" because, after
       failing at the start, it skips both the CR and the LF before retrying.
       However, the pattern [\r\n]A does match that string, because it con-
       tains an explicit CR or LF reference, and so advances only by one char-
       acter after the first failure.

       An explicit match for CR or LF is either a literal appearance of one of
       those characters, or one of the \r or \n escape sequences. Implicit
       matches such as [^X] do not count, nor does \s (which includes CR and
       LF in the characters that it matches).

       Notwithstanding the above, anomalous effects may still occur when CRLF
       is a valid newline sequence and explicit \r or \n escapes appear in the
       pattern.
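
       The advance rule described above can be sketched as a small stand-alone
       function. This is a toy model of the rule as stated in the text, not
       PCRE's internal code. The function name and its two boolean parameters
       are hypothetical, and the sketch assumes the 8-bit library outside
       UTF-8 mode, so that one character is one byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns the next position to try
   after a match attempt for an unanchored pattern has failed at `pos`.

   crlf_is_newline               - the newline setting accepts CRLF
                                   (CRLF, ANYCRLF, or ANY)
   pattern_has_explicit_cr_or_lf - the pattern contains a literal CR or LF,
                                   or a \r or \n escape */
size_t next_attempt_position(const char *subject, size_t length, size_t pos,
                             bool crlf_is_newline,
                             bool pattern_has_explicit_cr_or_lf)
{
    assert(pos < length);   /* there must be a character to step over */

    if (crlf_is_newline && !pattern_has_explicit_cr_or_lf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;     /* skip the whole CRLF pair */

    return pos + 1;         /* ordinary case: advance by one character */
}
```

       For the subject "\r\nA" and the pattern .+A, the model moves from
       position 0 to position 2, so the retry starts at "A", where .+A cannot
       match. For [\r\n]A the explicit reference makes it return 1, and the
       retry at the LF succeeds.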

         PCRE_NOTBOL

       This option specifies that the first character of the subject string is
       not the beginning of a line, so the circumflex metacharacter should not
       match before it. Setting this without PCRE_MULTILINE (at compile time)
       causes circumflex never to match. This option affects only the behav-
       iour of the circumflex metacharacter. It does not affect \A.

         PCRE_NOTEOL

       This option specifies that the end of the subject string is not the end
       of a line, so the dollar metacharacter should not match it nor (except
       in multiline mode) a newline immediately before it. Setting this with-
       out PCRE_MULTILINE (at compile time) causes dollar never to match. This
       option affects only the behaviour of the dollar metacharacter. It does
       not affect \Z or \z.

         PCRE_NOTEMPTY

       An empty string is not considered to be a valid match if this option is
       set. If there are alternatives in the pattern, they are tried. If all
       the alternatives match the empty string, the entire match fails. For
       example, if the pattern

         a?b?

       is applied to a string not beginning with "a" or "b", it matches an
       empty string at the start of the subject. With PCRE_NOTEMPTY set, this
       match is not valid, so PCRE searches further into the string for occur-
       rences of "a" or "b".

         PCRE_NOTEMPTY_ATSTART

       This is like PCRE_NOTEMPTY, except that an empty string match that is
       not at the start of the subject is permitted. If the pattern is
       anchored, such a match can occur only if the pattern contains \K.

       Perl has no direct equivalent of PCRE_NOTEMPTY or
       PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern
       match of the empty string within its split() function, and when using
       the /g modifier.
       It is possible to emulate Perl's behaviour after
       matching a null string by first trying the match again at the same off-
       set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that
       fails, by advancing the starting offset (see below) and trying an ordi-
       nary match again. There is some code that demonstrates how to do this
       in the pcredemo sample program. In the most general case, you have to
       check to see if the newline convention recognizes CRLF as a newline,
       and if so, and the current character is CR followed by LF, advance the
       starting offset by two characters instead of one.

         PCRE_NO_START_OPTIMIZE

       There are a number of optimizations that pcre_exec() uses at the start
       of a match, in order to speed up the process. For example, if it is
       known that an unanchored match must start with a specific character, it
       searches the subject for that character, and fails immediately if it
       cannot find it, without actually running the main matching function.
       This means that a special item such as (*COMMIT) at the start of a pat-
       tern is not considered until after a suitable starting point for the
       match has been found. When callouts or (*MARK) items are in use, these
       "start-up" optimizations can cause them to be skipped if the pattern is
       never actually used. The start-up optimizations are in effect a pre-
       scan of the subject that takes place before the pattern is run.

       The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
       possibly causing performance to suffer, but ensuring that in cases
       where the result is "no match", the callouts do occur, and that items
       such as (*COMMIT) and (*MARK) are considered at every possible starting
       position in the subject string. If PCRE_NO_START_OPTIMIZE is set at
       compile time, it cannot be unset at matching time.
       The use of
       PCRE_NO_START_OPTIMIZE disables JIT execution; when it is set, matching
       is always done interpretively.

       Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching
       operation. Consider the pattern

         (*COMMIT)ABC

       When this is compiled, PCRE records the fact that a match must start
       with the character "A". Suppose the subject string is "DEFABC". The
       start-up optimization scans along the subject, finds "A" and runs the
       first match attempt from there. The (*COMMIT) item means that the pat-
       tern must match at the current starting position, which in this case,
       it does. However, if the same match is run with PCRE_NO_START_OPTIMIZE
       set, the initial scan along the subject string does not happen. The
       first match attempt is run starting from "D" and when this fails,
       (*COMMIT) prevents any further matches being tried, so the overall
       result is "no match". If the pattern is studied, more start-up opti-
       mizations may be used. For example, a minimum length for the subject
       may be recorded. Consider the pattern

         (*MARK:A)(X|Y)

       The minimum length for a match is one character. If the subject is
       "ABC", there will be attempts to match "ABC", "BC", "C", and then
       finally an empty string. If the pattern is studied, the final attempt
       does not take place, because PCRE knows that the subject is too short,
       and so the (*MARK) is never encountered. In this case, studying the
       pattern does not affect the overall match result, which is still "no
       match", but it does affect the auxiliary information that is returned.

         PCRE_NO_UTF8_CHECK

       When PCRE_UTF8 is set at compile time, the validity of the subject as a
       UTF-8 string is automatically checked when pcre_exec() is subsequently
       called. The entire string is checked before any other processing takes
       place.
       The value of startoffset is also checked to ensure that it
       points to the start of a UTF-8 character. There is a discussion about
       the validity of UTF-8 strings in the pcreunicode page. If an invalid
       sequence of bytes is found, pcre_exec() returns the error
       PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
       truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
       both cases, information about the precise nature of the error may also
       be returned (see the descriptions of these errors in the section enti-
       tled Error return values from pcre_exec() below). If startoffset con-
       tains a value that does not point to the start of a UTF-8 character (or
       to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.

       If you already know that your subject is valid, and you want to skip
       these checks for performance reasons, you can set the
       PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
       do this for the second and subsequent calls to pcre_exec() if you are
       making repeated calls to find all the matches in a single subject
       string. However, you should be sure that the value of startoffset
       points to the start of a character (or the end of the subject). When
       PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
       subject or an invalid value of startoffset is undefined. Your program
       may crash.

         PCRE_PARTIAL_HARD
         PCRE_PARTIAL_SOFT

       These options turn on the partial matching feature. For backwards com-
       patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
       match occurs if the end of the subject string is reached successfully,
       but there are not enough subject characters to complete the match. If
       this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
       matching continues by testing any remaining alternatives.
       Only if no
       complete match can be found is PCRE_ERROR_PARTIAL returned instead of
       PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the
       caller is prepared to handle a partial match, but only if no complete
       match can be found.

       If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
       case, if a partial match is found, pcre_exec() immediately returns
       PCRE_ERROR_PARTIAL, without considering any other alternatives. In
       other words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
       ered to be more important than an alternative complete match.

       In both cases, the portion of the string that was inspected when the
       partial match was found is set as the first matching string. There is a
       more detailed discussion of partial and multi-segment matching, with
       examples, in the pcrepartial documentation.

   The string to be matched by pcre_exec()

       The subject string is passed to pcre_exec() as a pointer in subject, a
       length in bytes in length, and a starting byte offset in startoffset.
       If this is negative or greater than the length of the subject,
       pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
       zero, the search for a match starts at the beginning of the subject,
       and this is by far the most common case. In UTF-8 mode, the byte offset
       must point to the start of a UTF-8 character (or the end of the sub-
       ject). Unlike the pattern string, the subject may contain binary zero
       bytes.

       A non-zero starting offset is useful when searching for another match
       in the same subject by calling pcre_exec() again after a previous suc-
       cess. Setting startoffset differs from just passing over a shortened
       string and setting PCRE_NOTBOL in the case of a pattern that begins
       with any kind of lookbehind.
       For example, consider the pattern

         \Biss\B

       which finds occurrences of "iss" in the middle of words. (\B matches
       only if the current position in the subject is not a word boundary.)
       When applied to the string "Mississippi" the first call to pcre_exec()
       finds the first occurrence. If pcre_exec() is called again with just
       the remainder of the subject, namely "issippi", it does not match,
       because \B is always false at the start of the subject, which is deemed
       to be a word boundary. However, if pcre_exec() is passed the entire
       string again, but with startoffset set to 4, it finds the second occur-
       rence of "iss" because it is able to look behind the starting point to
       discover that it is preceded by a letter.

       Finding all the matches in a subject is tricky when the pattern can
       match an empty string. It is possible to emulate Perl's /g behaviour by
       first trying the match again at the same offset, with the
       PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
       fails, advancing the starting offset and trying an ordinary match
       again. There is some code that demonstrates how to do this in the
       pcredemo sample program. In the most general case, you have to check to
       see if the newline convention recognizes CRLF as a newline, and if so,
       and the current character is CR followed by LF, advance the starting
       offset by two characters instead of one.

       If a non-zero starting offset is passed when the pattern is anchored,
       one attempt to match at the given offset is made. This can only succeed
       if the pattern does not require the match to be at the start of the
       subject.

   How pcre_exec() returns captured substrings

       In general, a pattern matches a certain portion of the subject, and in
       addition, further substrings from the subject may be picked out by
       parts of the pattern.
       Following the usage in Jeffrey Friedl's book,
       this is called "capturing" in what follows, and the phrase "capturing
       subpattern" is used for a fragment of a pattern that picks out a sub-
       string. PCRE supports several other kinds of parenthesized subpattern
       that do not cause substrings to be captured.

       Captured substrings are returned to the caller via a vector of integers
       whose address is passed in ovector. The number of elements in the vec-
       tor is passed in ovecsize, which must be a non-negative number. Note:
       this argument is NOT the size of ovector in bytes.

       The first two-thirds of the vector is used to pass back captured sub-
       strings, each substring using a pair of integers. The remaining third
       of the vector is used as workspace by pcre_exec() while matching cap-
       turing subpatterns, and is not available for passing back information.
       The number passed in ovecsize should always be a multiple of three. If
       it is not, it is rounded down.

       When a match is successful, information about captured substrings is
       returned in pairs of integers, starting at the beginning of ovector,
       and continuing up to two-thirds of its length at the most. The first
       element of each pair is set to the byte offset of the first character
       in a substring, and the second is set to the byte offset of the first
       character after the end of a substring. Note: these values are always
       byte offsets, even in UTF-8 mode. They are not character counts.

       The first pair of integers, ovector[0] and ovector[1], identifies the
       portion of the subject string matched by the entire pattern. The next
       pair is used for the first capturing subpattern, and so on. The value
       returned by pcre_exec() is one more than the highest numbered pair that
       has been set. For example, if two substrings have been captured, the
       returned value is 3.
       If there are no capturing subpatterns, the return
       value from a successful match is 1, indicating that just the first pair
       of offsets has been set.

       If a capturing subpattern is matched repeatedly, it is the last portion
       of the string that it matched that is returned.

       If the vector is too small to hold all the captured substring offsets,
       it is used as far as possible (up to two-thirds of its length), and the
       function returns a value of zero. If neither the actual string matched
       nor any captured substrings are of interest, pcre_exec() may be called
       with ovector passed as NULL and ovecsize as zero. However, if the pat-
       tern contains back references and the ovector is not big enough to
       remember the related substrings, PCRE has to get additional memory for
       use during matching. Thus it is usually advisable to supply an ovector
       of reasonable size.

       There are some cases where zero is returned (indicating vector over-
       flow) when in fact the vector is exactly the right size for the final
       match. For example, consider the pattern

         (a)(?:(b)c|bd)

       If a vector of 6 elements (allowing for only 1 captured substring) is
       given with subject string "abd", pcre_exec() will try to set the second
       captured string, thereby recording a vector overflow, before failing to
       match "c" and backing up to try the second alternative. The zero
       return, however, does correctly indicate that the maximum number of
       slots (namely 2) have been filled. In similar cases where there is tem-
       porary overflow, but the final number of used slots is actually less
       than the maximum, a non-zero value is returned.

       The pcre_fullinfo() function can be used to find out how many capturing
       subpatterns there are in a compiled pattern.
       The smallest size for
       ovector that will allow for n captured substrings, in addition to the
       offsets of the substring matched by the whole pattern, is (n+1)*3.

       It is possible for capturing subpattern number n+1 to match some part
       of the subject when subpattern n has not been used at all. For example,
       if the string "abc" is matched against the pattern (a|(z))(bc) the
       return from the function is 4, and subpatterns 1 and 3 are matched, but
       2 is not. When this happens, both values in the offset pairs corre-
       sponding to unused subpatterns are set to -1.

       Offset values that correspond to unused subpatterns at the end of the
       expression are also set to -1. For example, if the string "abc" is
       matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
       matched. The return from the function is 2, because the highest used
       capturing subpattern number is 1, and the offsets for the second and
       third capturing subpatterns (assuming the vector is large enough, of
       course) are set to -1.

       Note: Elements in the first two-thirds of ovector that do not corre-
       spond to capturing parentheses in the pattern are never changed. That
       is, if a pattern contains n capturing parentheses, no more than
       ovector[0] to ovector[2n+1] are set by pcre_exec(). The other elements
       (in the first two-thirds) retain whatever values they previously had.

       Some convenience functions are provided for extracting the captured
       substrings as separate strings. These are described below.

   Error return values from pcre_exec()

       If pcre_exec() fails, it returns a negative number. The following are
       defined in the header file:

         PCRE_ERROR_NOMATCH (-1)

       The subject string did not match the pattern.

         PCRE_ERROR_NULL (-2)

       Either code or subject was passed as NULL, or ovector was NULL and
       ovecsize was not zero.

         PCRE_ERROR_BADOPTION (-3)

       An unrecognized bit was set in the options argument.

         PCRE_ERROR_BADMAGIC (-4)

       PCRE stores a 4-byte "magic number" at the start of the compiled code,
       to catch the case when it is passed a junk pointer and to detect when a
       pattern that was compiled in an environment of one endianness is run in
       an environment with the other endianness. This is the error that PCRE
       gives when the magic number is not present.

         PCRE_ERROR_UNKNOWN_OPCODE (-5)

       While running the pattern match, an unknown item was encountered in the
       compiled pattern. This error could be caused by a bug in PCRE or by
       overwriting of the compiled pattern.

         PCRE_ERROR_NOMEMORY (-6)

       If a pattern contains back references, but the ovector that is passed
       to pcre_exec() is not big enough to remember the referenced substrings,
       PCRE gets a block of memory at the start of matching to use for this
       purpose. If the call via pcre_malloc() fails, this error is given. The
       memory is automatically freed at the end of matching.

       This error is also given if pcre_stack_malloc() fails in pcre_exec().
       This can happen only when PCRE has been compiled with
       --disable-stack-for-recursion.

         PCRE_ERROR_NOSUBSTRING (-7)

       This error is used by the pcre_copy_substring(), pcre_get_substring(),
       and pcre_get_substring_list() functions (see below). It is never
       returned by pcre_exec().

         PCRE_ERROR_MATCHLIMIT (-8)

       The backtracking limit, as specified by the match_limit field in a
       pcre_extra structure (or defaulted) was reached. See the description
       above.

         PCRE_ERROR_CALLOUT (-9)

       This error is never generated by pcre_exec() itself. It is provided for
       use by callout functions that want to yield a distinctive error code.
       See the pcrecallout documentation for details.

         PCRE_ERROR_BADUTF8 (-10)

       A string that contains an invalid UTF-8 byte sequence was passed as a
       subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
       the output vector (ovecsize) is at least 2, the byte offset to the
       start of the invalid UTF-8 character is placed in the first ele-
       ment, and a reason code is placed in the second element. The reason
       codes are listed in the following section. For backward compatibility,
       if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
       acter at the end of the subject (reason codes 1 to 5),
       PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.

         PCRE_ERROR_BADUTF8_OFFSET (-11)

       The UTF-8 byte sequence that was passed as a subject was checked and
       found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
       value of startoffset did not point to the beginning of a UTF-8 charac-
       ter or the end of the subject.

         PCRE_ERROR_PARTIAL (-12)

       The subject string did not match, but it did match partially. See the
       pcrepartial documentation for details of partial matching.

         PCRE_ERROR_BADPARTIAL (-13)

       This code is no longer in use. It was formerly returned when the
       PCRE_PARTIAL option was used with a compiled pattern containing items
       that were not supported for partial matching. From release 8.00
       onwards, there are no restrictions on partial matching.

         PCRE_ERROR_INTERNAL (-14)

       An unexpected internal error has occurred. This error could be caused
       by a bug in PCRE or by overwriting of the compiled pattern.

         PCRE_ERROR_BADCOUNT (-15)

       This error is given if the value of the ovecsize argument is negative.

         PCRE_ERROR_RECURSIONLIMIT (-21)

       The internal recursion limit, as specified by the match_limit_recursion
       field in a pcre_extra structure (or defaulted) was reached.
       See the description above.

         PCRE_ERROR_BADNEWLINE (-23)

       An invalid combination of PCRE_NEWLINE_xxx options was given.

         PCRE_ERROR_BADOFFSET (-24)

       The value of startoffset was negative or greater than the length of the
       subject, that is, the value in length.

         PCRE_ERROR_SHORTUTF8 (-25)

       This error is returned instead of PCRE_ERROR_BADUTF8 when the subject
       string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
       option is set. Information about the failure is returned as for
       PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but
       this special error code for PCRE_PARTIAL_HARD precedes the implementa-
       tion of returned information; it is retained for backwards compatibil-
       ity.

         PCRE_ERROR_RECURSELOOP (-26)

       This error is returned when pcre_exec() detects a recursion loop within
       the pattern. Specifically, it means that either the whole pattern or a
       subpattern has been called recursively for the second time at the same
       position in the subject string. Some simple patterns that might do this
       are detected and faulted at compile time, but more complicated cases,
       in particular mutual recursions between two different subpatterns, can-
       not be detected until run time.

         PCRE_ERROR_JIT_STACKLIMIT (-27)

       This error is returned when a pattern that was successfully studied
       using a JIT compile option is being matched, but the memory available
       for the just-in-time processing stack is not large enough. See the
       pcrejit documentation for more details.

         PCRE_ERROR_BADMODE (-28)

       This error is given if a pattern that was compiled by the 8-bit library
       is passed to a 16-bit or 32-bit library function, or vice versa.

         PCRE_ERROR_BADENDIANNESS   (-29)

       This error is given if a pattern that was compiled and saved is
       reloaded on a host with different endianness. The utility function
       pcre_pattern_to_host_byte_order() can be used to convert such a pattern
       so that it runs on the new host.

         PCRE_ERROR_JIT_BADOPTION   (-31)

       This error is returned when a pattern that was successfully studied
       using a JIT compile option is being matched, but the matching mode
       (partial or complete match) does not correspond to any JIT compilation
       mode. When the JIT fast path function is used, this error may be also
       given for invalid options. See the pcrejit documentation for more
       details.

         PCRE_ERROR_BADLENGTH       (-32)

       This error is given if pcre_exec() is called with a negative value for
       the length argument.

       Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().

   Reason codes for invalid UTF-8 strings

       This section applies only to the 8-bit library. The corresponding
       information for the 16-bit and 32-bit libraries is given in the pcre16
       and pcre32 pages.

       When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
       UTF8, and the size of the output vector (ovecsize) is at least 2, the
       offset of the start of the invalid UTF-8 character is placed in the
       first output vector element (ovector[0]) and a reason code is placed in
       the second element (ovector[1]). The reason codes are given names in
       the pcre.h header file:

         PCRE_UTF8_ERR1
         PCRE_UTF8_ERR2
         PCRE_UTF8_ERR3
         PCRE_UTF8_ERR4
         PCRE_UTF8_ERR5

       The string ends with a truncated UTF-8 character; the code specifies
       how many bytes are missing (1 to 5).
       Although RFC 3629 restricts UTF-8
       characters to be no longer than 4 bytes, the encoding scheme (origi-
       nally defined by RFC 2279) allows for up to 6 bytes, and this is
       checked first; hence the possibility of 4 or 5 missing bytes.

         PCRE_UTF8_ERR6
         PCRE_UTF8_ERR7
         PCRE_UTF8_ERR8
         PCRE_UTF8_ERR9
         PCRE_UTF8_ERR10

       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
       the character do not have the binary value 0b10 (that is, either the
       most significant bit is 0, or the next bit is 1).

         PCRE_UTF8_ERR11
         PCRE_UTF8_ERR12

       A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
       long; these code points are excluded by RFC 3629.

         PCRE_UTF8_ERR13

       A 4-byte character has a value greater than 0x10ffff; these code points
       are excluded by RFC 3629.

         PCRE_UTF8_ERR14

       A 3-byte character has a value in the range 0xd800 to 0xdfff; this
       range of code points is reserved by RFC 3629 for use with UTF-16, and
       so is excluded from UTF-8.

         PCRE_UTF8_ERR15
         PCRE_UTF8_ERR16
         PCRE_UTF8_ERR17
         PCRE_UTF8_ERR18
         PCRE_UTF8_ERR19

       A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
       for a value that can be represented by fewer bytes, which is invalid.
       For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
       rect coding uses just one byte.

         PCRE_UTF8_ERR20

       The two most significant bits of the first byte of a character have the
       binary value 0b10 (that is, the most significant bit is 1 and the sec-
       ond is 0). Such a byte can only validly occur as the second or subse-
       quent byte of a multi-byte character.

         PCRE_UTF8_ERR21

       The first byte of a character has the value 0xfe or 0xff. These values
       can never occur in a valid UTF-8 string.

         PCRE_UTF8_ERR22

       Non-character.
       These are the last two characters in each plane (0xfffe,
       0xffff, 0x1fffe, 0x1ffff .. 0x10fffe, 0x10ffff), and the characters
       0xfdd0..0xfdef.


EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre_copy_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber, char *buffer,
            int buffersize);

       int pcre_get_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber,
            const char **stringptr);

       int pcre_get_substring_list(const char *subject,
            int *ovector, int stringcount, const char ***listptr);

       Captured substrings can be accessed directly by using the offsets
       returned by pcre_exec() in ovector. For convenience, the functions
       pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-
       string_list() are provided for extracting captured substrings as new,
       separate, zero-terminated strings. These functions identify substrings
       by number. The next section describes functions for extracting named
       substrings.

       A substring that contains a binary zero is correctly extracted and has
       a further zero added on the end, but the result is not, of course, a C
       string. However, you can process such a string by referring to the
       length that is returned by pcre_copy_substring() and pcre_get_sub-
       string(). Unfortunately, the interface to pcre_get_substring_list() is
       not adequate for handling strings containing binary zeros, because the
       end of the final string is not independently indicated.

       The first three arguments are the same for all three of these func-
       tions: subject is the subject string that has just been successfully
       matched, ovector is a pointer to the vector of integer offsets that was
       passed to pcre_exec(), and stringcount is the number of substrings that
       were captured by the match, including the substring that matched the
       entire regular expression.
       This is the value returned by pcre_exec() if
       it is greater than zero. If pcre_exec() returned zero, indicating that
       it ran out of space in ovector, the value passed as stringcount should
       be the number of elements in the vector divided by three.

       The functions pcre_copy_substring() and pcre_get_substring() extract a
       single substring, whose number is given as stringnumber. A value of
       zero extracts the substring that matched the entire pattern, whereas
       higher values extract the captured substrings. For pcre_copy_sub-
       string(), the string is placed in buffer, whose length is given by
       buffersize, while for pcre_get_substring() a new block of memory is
       obtained via pcre_malloc, and its address is returned via stringptr.
       The yield of the function is the length of the string, not including
       the terminating zero, or one of these error codes:

         PCRE_ERROR_NOMEMORY       (-6)

       The buffer was too small for pcre_copy_substring(), or the attempt to
       get memory failed for pcre_get_substring().

         PCRE_ERROR_NOSUBSTRING    (-7)

       There is no substring whose number is stringnumber.

       The pcre_get_substring_list() function extracts all available sub-
       strings and builds a list of pointers to them. All this is done in a
       single block of memory that is obtained via pcre_malloc. The address of
       the memory block is returned via listptr, which is also the start of
       the list of string pointers. The end of the list is marked by a NULL
       pointer. The yield of the function is zero if all went well, or the
       error code

         PCRE_ERROR_NOMEMORY       (-6)

       if the attempt to get the memory block failed.

       When any of these functions encounter a substring that is unset, which
       can happen when capturing subpattern number n+1 matches some part of
       the subject, but subpattern n has not been used at all, they return an
       empty string.
       This can be distinguished from a genuine zero-length sub-
       string by inspecting the appropriate offset in ovector, which is nega-
       tive for unset substrings.

       The two convenience functions pcre_free_substring() and pcre_free_sub-
       string_list() can be used to free the memory returned by a previous
       call of pcre_get_substring() or pcre_get_substring_list(), respec-
       tively. They do nothing more than call the function pointed to by
       pcre_free, which of course could be called directly from a C program.
       However, PCRE is used in some situations where it is linked via a spe-
       cial interface to another programming language that cannot use
       pcre_free directly; it is for these cases that the functions are pro-
       vided.


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre_get_stringnumber(const pcre *code,
            const char *name);

       int pcre_copy_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            char *buffer, int buffersize);

       int pcre_get_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            const char **stringptr);

       To extract a substring by name, you first have to find the associated
       number. For example, for this pattern

         (a+)b(?<xxx>\d+)...

       the number of the subpattern called "xxx" is 2. If the name is known to
       be unique (PCRE_DUPNAMES was not set), you can find the number from the
       name by calling pcre_get_stringnumber(). The first argument is the com-
       piled pattern, and the second is the name. The yield of the function is
       the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
       subpattern of that name.

       Given the number, you can extract the substring directly, or use one of
       the functions described in the previous section.
       For convenience, there
       are also two functions that do the whole job.

       Most of the arguments of pcre_copy_named_substring() and
       pcre_get_named_substring() are the same as those for the similarly
       named functions that extract by number. As these are described in the
       previous section, they are not re-described here. There are just two
       differences:

       First, instead of a substring number, a substring name is given. Sec-
       ond, there is an extra argument, given at the start, which is a pointer
       to the compiled pattern. This is needed in order to gain access to the
       name-to-number translation table.

       These functions call pcre_get_stringnumber(), and if it succeeds, they
       then call pcre_copy_substring() or pcre_get_substring(), as appropri-
       ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
       behaviour may not be what you want (see the next section).

       Warning: If the pattern uses the (?| feature to set up multiple subpat-
       terns with the same number, as described in the section on duplicate
       subpattern numbers in the pcrepattern page, you cannot use names to
       distinguish the different subpatterns, because names are not included
       in the compiled code. The matching process uses only numbers. For this
       reason, the use of different names for subpatterns of the same number
       causes an error at compile time.


DUPLICATE SUBPATTERN NAMES

       int pcre_get_stringtable_entries(const pcre *code,
            const char *name, char **first, char **last);

       When a pattern is compiled with the PCRE_DUPNAMES option, names for
       subpatterns are not required to be unique. (Duplicate names are always
       allowed for subpatterns with the same number, created by using the (?|
       feature. Indeed, if such subpatterns are named, they are required to
       use the same names.)

       Normally, patterns with duplicate names are such that in any one match,
       only one of the named subpatterns participates. An example is shown in
       the pcrepattern documentation.

       When duplicates are present, pcre_copy_named_substring() and
       pcre_get_named_substring() return the first substring corresponding to
       the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
       (-7) is returned; no data is returned. The pcre_get_stringnumber()
       function returns one of the numbers that are associated with the name,
       but it is not defined which it is.

       If you want to get full details of all captured substrings for a given
       name, you must use the pcre_get_stringtable_entries() function. The
       first argument is the compiled pattern, and the second is the name. The
       third and fourth are pointers to variables which are updated by the
       function. After it has run, they point to the first and last entries in
       the name-to-number table for the given name. The function itself
       returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
       there are none. The format of the table is described above in the sec-
       tion entitled Information about a pattern. Given all the relevant
       entries for the name, you can extract each of their numbers, and
       hence the captured data, if any.


FINDING ALL POSSIBLE MATCHES

       The traditional matching function uses a similar algorithm to Perl,
       which stops when it finds the first match, starting at a given point in
       the subject. If you want to find all possible matches, or the longest
       possible match, consider using the alternative matching function (see
       below) instead. If you cannot use the alternative function, but still
       need to find all possible matches, you can kludge it up by making use
       of the callout facility, which is described in the pcrecallout documen-
       tation.

       What you have to do is to insert a callout right at the end of the pat-
       tern.  When your callout function is called, extract and save the cur-
       rent matched substring. Then return 1, which forces pcre_exec() to
       backtrack and try other alternatives. Ultimately, when it runs out of
       matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.


OBTAINING AN ESTIMATE OF STACK USAGE

       Matching certain patterns using pcre_exec() can use a lot of process
       stack, which in certain environments can be rather limited in size.
       Some users find it helpful to have an estimate of the amount of stack
       that is used by pcre_exec(), to help them set recursion limits, as
       described in the pcrestack documentation. The estimate that is output
       by pcretest when called with the -m and -C options is obtained by call-
       ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its
       first five arguments.

       Normally, if its first argument is NULL, pcre_exec() immediately
       returns the negative error code PCRE_ERROR_NULL, but with this special
       combination of arguments, it returns instead a negative number whose
       absolute value is the approximate stack frame size in bytes. (A nega-
       tive number is used so that it is clear that no match has happened.)
       The value is approximate because in some cases, recursive calls to
       pcre_exec() occur when there are one or two additional variables on the
       stack.

       If PCRE has been compiled to use the heap instead of the stack for
       recursion, the value returned is the size of each block that is
       obtained from the heap.
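
       As a rough illustration of how the estimate might be turned into a
       recursion limit, the sketch below divides a stack budget by the
       reported frame size. The helper recursion_limit_from_estimate() and
       the idea of a fixed "stack budget" are assumptions made for this
       example; they are not part of PCRE, and the pcrestack documentation
       remains the reference for choosing limits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function).
 *
 * estimate     - the negative value returned by the special pcre_exec()
 *                call described above; its absolute value is the
 *                approximate stack frame size in bytes.
 * stack_budget - number of bytes of stack the caller is willing to let
 *                matching consume (an assumption of this sketch).
 *
 * Returns a candidate value for the match_limit_recursion field.
 */
unsigned long recursion_limit_from_estimate(int estimate, size_t stack_budget)
{
    size_t frame_size;

    assert(estimate < 0);   /* a non-negative value is not an estimate */

    frame_size = (size_t)(-(long)estimate);
    return (unsigned long)(stack_budget / frame_size);
}
```

       In a real program the estimate would come from a call such as
       pcre_exec(NULL, NULL, NULL, -999, -999, 0, NULL, 0), where the last
       three arguments are placeholders chosen for this example. Because the
       frame size is itself approximate, it is probably wise to leave some
       headroom rather than use the whole stack as the budget.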


MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

       int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize,
            int *workspace, int wscount);

       The function pcre_dfa_exec() is called to match a subject string
       against a compiled pattern, using a matching algorithm that scans the
       subject string just once, and does not backtrack. This has different
       characteristics to the normal algorithm, and is not compatible with
       Perl. Some of the features of PCRE patterns are not supported. Never-
       theless, there are times when this kind of matching can be useful. For
       a discussion of the two matching algorithms, and a list of features
       that pcre_dfa_exec() does not support, see the pcrematching documenta-
       tion.

       The arguments for the pcre_dfa_exec() function are the same as for
       pcre_exec(), plus two extras. The ovector argument is used in a differ-
       ent way, and this is described below. The other common arguments are
       used in the same way as for pcre_exec(), so their description is not
       repeated here.

       The two additional arguments provide workspace for the function. The
       workspace vector should contain at least 20 elements. It is used for
       keeping track of multiple paths through the pattern tree. More
       workspace will be needed for patterns and subjects where there are a
       lot of potential matches.

       Here is an example of a simple call to pcre_dfa_exec():

         int rc;
         int ovector[10];
         int wspace[20];
         rc = pcre_dfa_exec(
           re,             /* result of pcre_compile() */
           NULL,           /* we didn't study the pattern */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           ovector,        /* vector of integers for substring information */
           10,             /* number of elements (NOT size in bytes) */
           wspace,         /* working space vector */
           20);            /* number of elements (NOT size in bytes) */

   Option bits for pcre_dfa_exec()

       The unused bits of the options argument for pcre_dfa_exec() must be
       zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
       LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
       PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
       PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
       TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
       four of these are exactly the same as for pcre_exec(), so their
       description is not repeated here.

         PCRE_PARTIAL_HARD
         PCRE_PARTIAL_SOFT

       These have the same general effect as they do for pcre_exec(), but the
       details are slightly different. When PCRE_PARTIAL_HARD is set for
       pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
       ject is reached and there is still at least one matching possibility
       that requires additional characters. This happens even if some complete
       matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
       code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
       of the subject is reached, there have been no complete matches, but
       there is still at least one matching possibility.
       The portion of the
       string that was inspected when the longest partial match was found is
       set as the first matching string in both cases. There is a more
       detailed discussion of partial and multi-segment matching, with exam-
       ples, in the pcrepartial documentation.

         PCRE_DFA_SHORTEST

       Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
       stop as soon as it has found one match. Because of the way the alterna-
       tive algorithm works, this is necessarily the shortest possible match
       at the first possible matching point in the subject string.

         PCRE_DFA_RESTART

       When pcre_dfa_exec() returns a partial match, it is possible to call it
       again, with additional subject characters, and have it continue with
       the same match. The PCRE_DFA_RESTART option requests this action; when
       it is set, the workspace and wscount options must reference the same
       vector as before because data about the match so far is left in them
       after a partial match. There is more discussion of this facility in the
       pcrepartial documentation.

   Successful returns from pcre_dfa_exec()

       When pcre_dfa_exec() succeeds, it may have matched more than one sub-
       string in the subject. Note, however, that all the matches from one run
       of the function start at the same point in the subject. The shorter
       matches are all initial substrings of the longer matches. For example,
       if the pattern

         <.*>

       is matched against the string

         This is <something> <something else> <something further> no more

       the three matched strings are

         <something>
         <something> <something else>
         <something> <something else> <something further>

       On success, the yield of the function is a number greater than zero,
       which is the number of matched substrings. The substrings themselves
       are returned in ovector.
       Each string uses two elements; the first is
       the offset to the start, and the second is the offset to the end. In
       fact, all the strings have the same start offset. (Space could have
       been saved by giving this only once, but it was decided to retain some
       compatibility with the way pcre_exec() returns data, even though the
       meaning of the strings is different.)

       The strings are returned in reverse order of length; that is, the long-
       est matching string is given first. If there were too many matches to
       fit into ovector, the yield of the function is zero, and the vector is
       filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()
       can use the entire ovector for returning matched strings.

   Error returns from pcre_dfa_exec()

       The pcre_dfa_exec() function returns a negative number when it fails.
       Many of the errors are the same as for pcre_exec(), and these are
       described above.  There are in addition the following errors that are
       specific to pcre_dfa_exec():

         PCRE_ERROR_DFA_UITEM      (-16)

       This return is given if pcre_dfa_exec() encounters an item in the pat-
       tern that it does not support, for instance, the use of \C or a back
       reference.

         PCRE_ERROR_DFA_UCOND      (-17)

       This return is given if pcre_dfa_exec() encounters a condition item
       that uses a back reference for the condition, or a test for recursion
       in a specific group. These are not supported.

         PCRE_ERROR_DFA_UMLIMIT    (-18)

       This return is given if pcre_dfa_exec() is called with an extra block
       that contains a setting of the match_limit or match_limit_recursion
       fields. This is not supported (these fields are meaningless for DFA
       matching).

         PCRE_ERROR_DFA_WSSIZE     (-19)

       This return is given if pcre_dfa_exec() runs out of space in the
       workspace vector.
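
       One possible way of reacting to PCRE_ERROR_DFA_WSSIZE is to retry the
       match with a larger workspace. The sketch below shows only the retry
       loop. To keep it independent of pcre.h, the match itself is hidden
       behind a hypothetical callback type, dfa_attempt_fn, which a real
       program would implement by calling pcre_dfa_exec(); the DEMO_ con-
       stants merely repeat the numeric values listed in this document, and
       fake_attempt() is a stand-in used for demonstration.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_ERROR_NOMEMORY   (-6)   /* value listed for PCRE_ERROR_NOMEMORY */
#define DEMO_ERROR_DFA_WSSIZE (-19)  /* value listed for PCRE_ERROR_DFA_WSSIZE */

/* Hypothetical callback: performs one match attempt using the supplied
   workspace and returns whatever pcre_dfa_exec() returned. */
typedef int (*dfa_attempt_fn)(int *workspace, int wscount, void *user_data);

/* Run attempt() with a workspace of initial_wscount elements, doubling the
   size (capped at max_wscount) each time the workspace proves too small. */
int dfa_match_growing_workspace(dfa_attempt_fn attempt, void *user_data,
                                int initial_wscount, int max_wscount)
{
    int wscount = initial_wscount;

    assert(attempt != NULL);
    assert(initial_wscount >= 20 && max_wscount >= initial_wscount);

    for (;;) {
        int rc;
        int *workspace = malloc((size_t)wscount * sizeof *workspace);

        if (workspace == NULL)
            return DEMO_ERROR_NOMEMORY;
        rc = attempt(workspace, wscount, user_data);
        free(workspace);

        /* Stop on any result other than "workspace too small", or when
           the cap has already been tried. */
        if (rc != DEMO_ERROR_DFA_WSSIZE || wscount >= max_wscount)
            return rc;
        wscount = (wscount > max_wscount / 2) ? max_wscount : wscount * 2;
    }
}

/* Stand-in for a real attempt: "succeeds" with one match once the workspace
   has at least *user_data elements. */
int fake_attempt(int *workspace, int wscount, void *user_data)
{
    int needed = *(const int *)user_data;

    (void)workspace;
    return (wscount >= needed) ? 1 : DEMO_ERROR_DFA_WSSIZE;
}
```

       Because this sketch frees the workspace after every attempt, it is not
       suitable when a partial match is to be continued with PCRE_DFA_RESTART,
       which requires the same workspace vector to be passed again.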

         PCRE_ERROR_DFA_RECURSE    (-20)

       When a recursive subpattern is processed, the matching function calls
       itself recursively, using private vectors for ovector and workspace.
       This error is given if the output vector is not large enough. This
       should be extremely rare, as a vector of size 1000 is used.

         PCRE_ERROR_DFA_BADRESTART (-30)

       When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some
       plausibility checks are made on the contents of the workspace, which
       should contain data about the previous partial match. If any of these
       checks fail, this error is given.


SEE ALSO

       pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3),
       pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-
       sample(3), pcrestack(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 08 November 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------


PCRECALLOUT(3)                                                  PCRECALLOUT(3)


NAME
       PCRE - Perl-compatible regular expressions


SYNOPSIS

       #include <pcre.h>

       int (*pcre_callout)(pcre_callout_block *);

       int (*pcre16_callout)(pcre16_callout_block *);

       int (*pcre32_callout)(pcre32_callout_block *);


DESCRIPTION

       PCRE provides a feature called "callout", which is a means of temporar-
       ily passing control to the caller of PCRE in the middle of pattern
       matching. The caller of PCRE provides an external function by putting
       its entry point in the global variable pcre_callout (pcre16_callout for
       the 16-bit library, pcre32_callout for the 32-bit library). By default,
       this variable contains NULL, which disables all calling out.

       Within a regular expression, (?C) indicates the points at which the
       external function is to be called. Different callout points can be
       identified by putting a number less than 256 after the letter C. The
       default value is zero.  For example, this pattern has two callout
       points:

         (?C1)abc(?C2)def

       If the PCRE_AUTO_CALLOUT option bit is set when a pattern is compiled,
       PCRE automatically inserts callouts, all with number 255, before each
       item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the
       pattern

         A(\d{2}|--)

       it is processed as if it were

       (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

       Notice that there is a callout before and after each parenthesis and
       alternation bar. Automatic callouts can be used for tracking the
       progress of pattern matching. The pcretest command has an option that
       sets automatic callouts; when it is used, the output indicates how the
       pattern is matched. This is useful information when you are trying to
       optimize the performance of a particular pattern.

       The use of callouts in a pattern makes it ineligible for optimization
       by the just-in-time compiler. Studying such a pattern with the
       PCRE_STUDY_JIT_COMPILE option always fails.


MISSING CALLOUTS

       You should be aware that, because of optimizations in the way PCRE
       matches patterns by default, callouts sometimes do not happen. For
       example, if the pattern is

         ab(?C4)cd

       PCRE knows that any matching string must contain the letter "d". If the
       subject string is "abyz", the lack of "d" means that matching doesn't
       ever start, and the callout is never reached. However, with "abyd",
       though the result is still no match, the callout is obeyed.

       If the pattern is studied, PCRE knows the minimum length of a matching
       string, and will immediately give a "no match" return without actually
       running a match if the subject is not long enough, or, for unanchored
       patterns, if it has been scanned far enough.

       You can disable these optimizations by passing the PCRE_NO_START_OPTI-
       MIZE option to the matching function, or by starting the pattern with
       (*NO_START_OPT). This slows down the matching process, but does ensure
       that callouts such as the example above are obeyed.


THE CALLOUT INTERFACE

       During matching, when PCRE reaches a callout point, the external func-
       tion defined by pcre_callout or pcre[16|32]_callout is called (if it is
       set). This applies to both normal and DFA matching. The only argument
       to the callout function is a pointer to a pcre_callout or
       pcre[16|32]_callout block. These structures contain the following
       fields:

         int           version;
         int           callout_number;
         int          *offset_vector;
         const char   *subject;           (8-bit version)
         PCRE_SPTR16   subject;           (16-bit version)
         PCRE_SPTR32   subject;           (32-bit version)
         int           subject_length;
         int           start_match;
         int           current_position;
         int           capture_top;
         int           capture_last;
         void         *callout_data;
         int           pattern_position;
         int           next_item_length;
         const unsigned char *mark;       (8-bit version)
         const PCRE_UCHAR16  *mark;       (16-bit version)
         const PCRE_UCHAR32  *mark;       (32-bit version)

       The version field is an integer containing the version number of the
       block format. The initial version was 0; the current version is 2. The
       version number will change again in future if additional fields are
       added, but the intention is never to remove any of the existing fields.

       The callout_number field contains the number of the callout, as com-
       piled into the pattern (that is, the number after ?C for manual call-
       outs, and 255 for automatically generated callouts).

       The offset_vector field is a pointer to the vector of offsets that was
       passed by the caller to the matching function. When pcre_exec() or
       pcre[16|32]_exec() is used, the contents can be inspected, in order to
       extract substrings that have been matched so far, in the same way as
       for extracting substrings after a match has completed. For the DFA
       matching functions, this field is not useful.

       The subject and subject_length fields contain copies of the values that
       were passed to the matching function.

       The start_match field normally contains the offset within the subject
       at which the current match attempt started. However, if the escape
       sequence \K has been encountered, this value is changed to reflect the
       modified starting point. If the pattern is not anchored, the callout
       function may be called several times from the same point in the pattern
       for different starting points in the subject.

       The current_position field contains the offset within the subject of
       the current match pointer.

       When pcre_exec() or pcre[16|32]_exec() is used, the capture_top
       field contains one more than the number of the highest numbered cap-
       tured substring so far. If no substrings have been captured, the value
       of capture_top is one. This is always the case when the DFA functions
       are used, because they do not support captured substrings.

       The capture_last field contains the number of the most recently cap-
       tured substring. If no substrings have been captured, its value is -1.
       This is always the case for the DFA matching functions.
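
       The subject, start_match, and current_position fields are enough to
       recover the text matched so far in the current attempt. The sketch
       below illustrates this for the 8-bit case. To stay independent of
       pcre.h it uses a stand-in structure, demo_callout_block, holding only
       the fields it needs; both that structure and copy_current_match() are
       hypothetical names for this example. A real callout function would
       read the same fields from the pcre_callout_block it is given.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the subset of callout block fields used in this sketch. */
typedef struct {
    const char *subject;           /* subject passed to the matching function */
    int         subject_length;    /* its length */
    int         start_match;       /* offset where this match attempt started */
    int         current_position;  /* offset of the current match pointer */
} demo_callout_block;

/* Copy the text between start_match and current_position into buffer as a
   zero-terminated string. Returns the number of bytes copied (excluding the
   terminating zero), or -1 if the offsets are inconsistent or the buffer is
   too small. */
int copy_current_match(const demo_callout_block *cb, char *buffer,
                       int buffersize)
{
    int length;

    assert(cb != NULL && buffer != NULL);

    if (cb->start_match < 0 ||
        cb->current_position < cb->start_match ||
        cb->current_position > cb->subject_length)
        return -1;

    length = cb->current_position - cb->start_match;
    if (length >= buffersize)
        return -1;

    memcpy(buffer, cb->subject + cb->start_match, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

       A function of this kind could serve the "extract and save the current
       matched substring" step when a callout at the end of a pattern is used
       to collect all possible matches. Note that the copied text may contain
       binary zeros, in which case the returned length, not strlen(), gives
       its true size.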

       The callout_data field contains a value that is passed to a matching
       function specifically so that it can be passed back in callouts. It is
       passed in the callout_data field of a pcre_extra or pcre[16|32]_extra
       data structure. If no such data was passed, the value of callout_data
       in a callout block is NULL. There is a description of the pcre_extra
       structure in the pcreapi documentation.

       The pattern_position field is present from version 1 of the callout
       structure. It contains the offset to the next item to be matched in the
       pattern string.

       The next_item_length field is present from version 1 of the callout
       structure. It contains the length of the next item to be matched in the
       pattern string. When the callout immediately precedes an alternation
       bar, a closing parenthesis, or the end of the pattern, the length is
       zero. When the callout precedes an opening parenthesis, the length is
       that of the entire subpattern.

       The pattern_position and next_item_length fields are intended to help
       in distinguishing between different automatic callouts, which all have
       the same callout number. However, they are set for all callouts.

       The mark field is present from version 2 of the callout structure. In
       callouts from pcre_exec() or pcre[16|32]_exec() it contains a pointer
       to the zero-terminated name of the most recently passed (*MARK),
       (*PRUNE), or (*THEN) item in the match, or NULL if no such items have
       been passed. Instances of (*PRUNE) or (*THEN) without a name do not
       obliterate a previous (*MARK). In callouts from the DFA matching func-
       tions this field always contains NULL.


RETURN VALUES

       The external callout function returns an integer to PCRE. If the value
       is zero, matching proceeds as normal.
       If the value is greater than zero, matching fails at the current
       point, but the testing of other matching possibilities goes ahead,
       just as if a lookahead assertion had failed. If the value is less than
       zero, the match is abandoned, and the matching function returns the
       negative value.

       Negative values should normally be chosen from the set of
       PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a
       standard "no match" failure. The error number PCRE_ERROR_CALLOUT is
       reserved for use by callout functions; it will never be used by PCRE
       itself.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 24 June 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------


PCRECOMPAT(3)                                                    PCRECOMPAT(3)


NAME
       PCRE - Perl-compatible regular expressions


DIFFERENCES BETWEEN PCRE AND PERL

       This document describes the differences in the ways that PCRE and Perl
       handle regular expressions. The differences described here are with
       respect to Perl versions 5.10 and above.

       1. PCRE has only a subset of Perl's Unicode support. Details of what
       it does have are given in the pcreunicode page.

       2. PCRE allows repeat quantifiers only on parenthesized assertions,
       but they do not mean what you might think. For example, (?!a){3} does
       not assert that the next three characters are not "a". It just asserts
       that the next character is not "a" three times (in principle: PCRE
       optimizes this to run the assertion just once). Perl allows repeat
       quantifiers on other assertions such as \b, but these do not seem to
       have any use.

       3.
       Capturing subpatterns that occur inside negative lookahead assertions
       are counted, but their entries in the offsets vector are never set.
       Perl sets its numerical variables from any such patterns that are
       matched before the assertion fails to match something (thereby
       succeeding), but only if the negative lookahead assertion contains
       just one branch.

       4. Though binary zero characters are supported in the subject string,
       they are not allowed in a pattern string because it is passed as a
       normal C string, terminated by zero. The escape sequence \0 can be
       used in the pattern to represent a binary zero.

       5. The following Perl escape sequences are not supported: \l, \u, \L,
       \U, and \N when followed by a character name or Unicode value. (\N on
       its own, matching a non-newline character, is supported.) In fact
       these are implemented by Perl's general string-handling and are not
       part of its pattern matching engine. If any of these are encountered
       by PCRE, an error is generated by default. However, if the
       PCRE_JAVASCRIPT_COMPAT option is set, \U and \u are interpreted as
       JavaScript interprets them.

       6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
       is built with Unicode character property support. The properties that
       can be tested with \p and \P are limited to the general category
       properties such as Lu and Nd, script names such as Greek or Han, and
       the derived properties Any and L&. PCRE does support the Cs
       (surrogate) property, which Perl does not; the Perl documentation says
       "Because Perl hides the need for the user to understand the internal
       representation of Unicode characters, there is no need to implement
       the somewhat messy concept of surrogates."

       7. PCRE does support the \Q...\E escape for quoting substrings.
       Characters in between are treated as literals.
       This is slightly different from Perl in that $ and @ are also handled
       as literals inside the quotes. In Perl, they cause variable
       interpolation (but of course PCRE does not have variables). Note the
       following examples:

           Pattern            PCRE matches      Perl matches

           \Qabc$xyz\E        abc$xyz           abc followed by the
                                                  contents of $xyz
           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes.

       8. Fairly obviously, PCRE does not support the (?{code}) and
       (??{code}) constructions. However, there is support for recursive
       patterns. This is not available in Perl 5.8, but it is in Perl 5.10.
       Also, the PCRE "callout" feature allows an external function to be
       called during pattern matching. See the pcrecallout documentation for
       details.

       9. Subpatterns that are called as subroutines (whether or not
       recursively) are always treated as atomic groups in PCRE. This is like
       Python, but unlike Perl. Captured values that are set outside a
       subroutine call can be referenced from inside in PCRE, but not in
       Perl. There is a discussion that explains these differences in more
       detail in the section on recursion differences from Perl in the
       pcrepattern page.

       10. If any of the backtracking control verbs are used in an assertion
       or in a subpattern that is called as a subroutine (whether or not
       recursively), their effect is confined to that subpattern; it does not
       extend to the surrounding pattern. This is not always the case in
       Perl. In particular, if (*THEN) is present in a group that is called
       as a subroutine, its action is limited to that group, even if the
       group does not contain any | characters.
       There is one exception to this: the name from a (*MARK), (*PRUNE), or
       (*THEN) that is encountered in a successful positive assertion is
       passed back when a match succeeds (compare capturing parentheses in
       assertions). Note that such subpatterns are processed as anchored at
       the point where they are tested.

       11. There are some differences that are concerned with the settings of
       captured strings when part of a pattern is repeated. For example,
       matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
       unset, but in PCRE it is set to "b".

       12. PCRE's handling of duplicate subpattern numbers and duplicate
       subpattern names is not as general as Perl's. This is a consequence of
       the fact that PCRE works internally just with numbers, using an
       external table to translate between numbers and names. In particular,
       a pattern such as (?|(?<a>A)|(?<b>B)), where the two capturing
       parentheses have the same number but different names, is not
       supported, and causes an error at compile time. If it were allowed, it
       would not be possible to distinguish which parentheses matched,
       because both names map to capturing subpattern number 1. To avoid this
       confusing situation, an error is given at compile time.

       13. Perl recognizes comments in some places that PCRE does not, for
       example, between the ( and ? at the start of a subpattern. If the /x
       modifier is set, Perl allows white space between ( and ? but PCRE
       never does, even if the PCRE_EXTENDED option is set.

       14. PCRE provides some extensions to the Perl regular expression
       facilities. Perl 5.10 includes new features that are not in earlier
       versions of Perl, some of which (such as named parentheses) have been
       in PCRE for some time.
       This list is with respect to Perl 5.10:

       (a) Although lookbehind assertions in PCRE must match fixed length
       strings, each alternative branch of a lookbehind assertion can match a
       different length of string. Perl requires them all to have the same
       length.

       (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
       meta-character matches only at the very end of the string.

       (c) If PCRE_EXTRA is set, a backslash followed by a letter with no
       special meaning is faulted. Otherwise, like Perl, the backslash is
       quietly ignored. (Perl can be made to issue a warning.)

       (d) If PCRE_UNGREEDY is set, the greediness of the repetition
       quantifiers is inverted, that is, by default they are not greedy, but
       if followed by a question mark they are.

       (e) PCRE_ANCHORED can be used at matching time to force a pattern to
       be tried only at the first matching position in the subject string.

       (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
       PCRE_NOTEMPTY_ATSTART, and PCRE_NO_AUTO_CAPTURE options for
       pcre_exec() have no Perl equivalents.

       (g) The \R escape sequence can be restricted to match only CR, LF, or
       CRLF by the PCRE_BSR_ANYCRLF option.

       (h) The callout facility is PCRE-specific.

       (i) The partial matching facility is PCRE-specific.

       (j) Patterns compiled by PCRE can be saved and re-used at a later
       time, even on different hosts that have the other endianness. However,
       this does not apply to optimized data created by the just-in-time
       compiler.

       (k) The alternative matching functions (pcre_dfa_exec(),
       pcre16_dfa_exec() and pcre32_dfa_exec()) match in a different way and
       are not Perl-compatible.

       (l) PCRE recognizes some special sequences such as (*CR) at the start
       of a pattern that set overall options that cannot be changed within
       the pattern.
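
       The \Q...\E table in item 7 can be checked mechanically. The Python
       sketch below rewrites each quoted run as individually escaped
       characters, which shows that $ and @ get no special treatment inside
       the quotes. The function name is hypothetical, and this is not how
       PCRE processes \Q...\E internally; it only illustrates the literal
       interpretation. It relies on the rule that a backslash before a
       non-alphanumeric character always makes that character literal.

```python
def expand_quoted(pattern):
    # Rewrite \Q...\E runs as backslash-escaped literal characters.
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if not quoting and two == r"\Q":
            quoting = True
            i += 2
        elif two == r"\E":
            quoting = False  # an \E with no preceding \Q is simply dropped
            i += 2
        elif quoting:
            ch = pattern[i]
            # Letters and digits need no escape; everything else gets one.
            out.append(ch if ch.isalnum() else "\\" + ch)
            i += 1
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(two)  # copy an ordinary escape pair unchanged
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


for pat in (r"\Qabc$xyz\E", r"\Qabc\$xyz\E", r"\Qabc\E\$\Qxyz\E"):
    print(pat, "->", expand_quoted(pat))
```

       The first and third patterns expand to the same escaped form, which
       agrees with the "PCRE matches" column, where both match abc$xyz.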


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 25 August 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------


PCREPATTERN(3)                                                  PCREPATTERN(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions that are supported
       by PCRE are described in detail below. There is a quick-reference
       syntax summary in the pcresyntax page. PCRE tries to match Perl syntax
       and semantics as closely as it can. PCRE also supports some
       alternative regular expression syntax (which does not conflict with
       the Perl syntax) in order to provide some compatibility with regular
       expressions in Python, .NET, and Oniguruma.

       Perl's regular expressions are described in its own documentation, and
       regular expressions in general are covered in a number of books, some
       of which have copious examples. Jeffrey Friedl's "Mastering Regular
       Expressions", published by O'Reilly, covers regular expressions in
       great detail. This description of PCRE's regular expressions is
       intended as reference material.

       The original operation of PCRE was on strings of one-byte characters.
       However, there is now also support for UTF-8 strings in the original
       library, an extra library that supports 16-bit and UTF-16 character
       strings, and a third library that supports 32-bit and UTF-32 character
       strings. To use these features, PCRE must be built to include
       appropriate support.
       When using UTF strings you must either call the compiling function
       with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option, or the pattern
       must start with one of these special sequences:

         (*UTF8)
         (*UTF16)
         (*UTF32)
         (*UTF)

       (*UTF) is a generic sequence that can be used with any of the
       libraries. Starting a pattern with such a sequence is equivalent to
       setting the relevant option. This feature is not Perl-compatible. How
       setting a UTF mode affects pattern matching is mentioned in several
       places below. There is also a summary of features in the pcreunicode
       page.

       Another special sequence that may appear at the start of a pattern or
       in combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:

         (*UCP)

       This has the same effect as setting the PCRE_UCP option: it causes
       sequences such as \d and \w to use Unicode properties to determine
       character types, instead of recognizing only characters with codes
       less than 128 via a lookup table.

       If a pattern starts with (*NO_START_OPT), it has the same effect as
       setting the PCRE_NO_START_OPTIMIZE option either at compile or
       matching time. There are also some more of these special sequences
       that are concerned with the handling of newlines; they are described
       below.

       The remainder of this document discusses the patterns that are
       supported by PCRE when one of its main matching functions, pcre_exec()
       (8-bit) or pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has
       alternative matching functions, pcre_dfa_exec() and
       pcre[16|32]_dfa_exec(), which match using a different algorithm that
       is not Perl-compatible. Some of the features discussed below are not
       available when DFA matching is used. The advantages and disadvantages
       of the alternative functions, and how they differ from the normal
       functions, are discussed in the pcrematching page.
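
       The start-of-pattern sequences described above can be separated from
       the rest of a pattern with a few lines of code. The Python sketch
       below is only an illustration of "recognized at the start of a
       pattern": the function name and the KNOWN set are hypothetical, the
       set lists only the names mentioned so far in this section, and
       accepting the sequences in any order is an assumption of the sketch
       rather than a statement about PCRE's own parser.

```python
import re

# Names of the special sequences mentioned in the text above.
KNOWN = {"UTF8", "UTF16", "UTF32", "UTF", "UCP", "NO_START_OPT"}


def split_leading_settings(pattern):
    # Peel (*NAME) items off the front until something else is found.
    settings = []
    pos = 0
    while True:
        m = re.match(r"\(\*([A-Z0-9_]+)\)", pattern[pos:])
        if not m or m.group(1) not in KNOWN:
            break
        settings.append(m.group(1))
        pos += m.end()
    return settings, pattern[pos:]


print(split_leading_settings(r"(*UTF8)(*UCP)\w+"))
print(split_leading_settings(r"a(*UCP)"))
```

       In the second call nothing is peeled off, because the sequence is not
       at the very start of the pattern.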


EBCDIC CHARACTER CODES

       PCRE can be compiled to run in an environment that uses EBCDIC as its
       character code rather than ASCII or Unicode (typically a mainframe
       system). In the sections below, character code values are ASCII or
       Unicode; in an EBCDIC environment these characters may have different
       code values, and there are no code points greater than 255.


NEWLINE CONVENTIONS

       PCRE supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF
       (linefeed) character, the two-character sequence CRLF, any of the
       three preceding, or any Unicode newline sequence. The pcreapi page has
       further discussion about newlines, and shows how to set the newline
       convention in the options arguments for the compiling and matching
       functions.

       It is also possible to specify a newline convention by starting a
       pattern string with one of the following five sequences:

         (*CR)        carriage return
         (*LF)        linefeed
         (*CRLF)      carriage return, followed by linefeed
         (*ANYCRLF)   any of the three above
         (*ANY)       all Unicode newline sequences

       These override the default and the options given to the compiling
       function. For example, on a Unix system where LF is the default
       newline sequence, the pattern

         (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF
       is no longer a newline. Note that these special settings, which are
       not Perl-compatible, are recognized only at the very start of a
       pattern, and that they must be in upper case. If more than one of them
       is present, the last one is used.

       The newline convention affects where the circumflex and dollar
       assertions are true. It also affects the interpretation of the dot
       metacharacter when PCRE_DOTALL is not set, and the behaviour of \N.
       However, it does not affect what the \R escape sequence matches. By
       default, this is any Unicode newline sequence, for Perl compatibility.
       However, this can be changed; see the description of \R in the section
       entitled "Newline sequences" below. A change of \R setting can be
       combined with a change of newline convention.


CHARACTERS AND METACHARACTERS

       A regular expression is a pattern that is matched against a subject
       string from left to right. Most characters stand for themselves in a
       pattern, and match the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself.
       When caseless matching is specified (the PCRE_CASELESS option),
       letters are matched independently of case. In a UTF mode, PCRE always
       understands the concept of case for characters whose values are less
       than 128, so caseless matching is always possible. For characters with
       higher values, the concept of case is supported if PCRE is compiled
       with Unicode property support, but not otherwise. If you want to use
       caseless matching for characters 128 and above, you must ensure that
       PCRE is compiled with Unicode property support as well as with UTF
       support.

       The power of regular expressions comes from the ability to include
       alternatives and repetitions in the pattern. These are encoded in the
       pattern by the use of metacharacters, which do not stand for
       themselves but instead are interpreted in some special way.

       There are two different sets of metacharacters: those that are
       recognized anywhere in the pattern except within square brackets, and
       those that are recognized within square brackets.
       Outside square brackets, the metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets is called a "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the metacharacters.


BACKSLASH

       The backslash character has several uses. Firstly, if it is followed
       by a character that is not a number or a letter, it takes away any
       special meaning that character may have. This use of backslash as an
       escape character applies both inside and outside character classes.

       For example, if you want to match a * character, you write \* in the
       pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a
       backslash, you write \\.

       In a UTF mode, only ASCII numbers and letters have any special meaning
       after a backslash.
       All other characters (in particular, those whose codepoints are
       greater than 127) are treated as literals.

       If a pattern is compiled with the PCRE_EXTENDED option, white space in
       the pattern (other than in a character class) and characters between a
       # outside a character class and the next newline are ignored. An
       escaping backslash can be used to include a white space or # character
       as part of the pattern.

       If you want to remove the special meaning from a sequence of
       characters, you can do so by putting them between \Q and \E. This is
       different from Perl in that $ and @ are handled as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause variable
       interpolation. Note the following examples:

         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes. An isolated \E that is not preceded by \Q is ignored. If \Q
       is not followed by \E later in the pattern, the literal interpretation
       continues to the end of the pattern (that is, \E is assumed at the
       end). If the isolated \Q is inside a character class, this causes an
       error, because the character class is not terminated.

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing
       characters in patterns in a visible manner.
       There is no restriction on the appearance of non-printing characters,
       apart from the binary zero that terminates a pattern, but when a
       pattern is being prepared by text editing, it is often easier to use
       one of the following escape sequences than the binary character it
       represents:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any ASCII character
         \e        escape (hex 1B)
         \f        form feed (hex 0C)
         \n        linefeed (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or back reference
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
         \uhhhh    character with hex code hhhh (JavaScript mode only)

       The precise effect of \cx on ASCII characters is as follows: if x is a
       lower case letter, it is converted to upper case. Then bit 6 of the
       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex
       1A (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c;
       becomes hex 7B (; is 3B). If the data item (byte or 16-bit value)
       following \c has a value greater than 127, a compile-time error
       occurs. This locks out non-ASCII characters in all modes.

       The \c facility was designed for use with ASCII characters, but with
       the extension to Unicode it is even less useful than it once was. It
       is, however, recognized when PCRE is compiled in EBCDIC mode, where
       data items are always bytes. In this mode, all values are valid after
       \c. If the next character is a lower case letter, it is converted to
       upper case. Then the 0xc0 bits of the byte are inverted. Thus \cA
       becomes hex 01, as in ASCII (A is C1), but because the EBCDIC letters
       are disjoint, \cZ becomes hex 29 (Z is E9), and other characters also
       generate different values.
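
       The arithmetic behind \cx is small enough to check by hand. The
       Python sketch below reproduces the values quoted above; the function
       names are hypothetical, and the EBCDIC helper assumes it is given the
       byte value of a character that has already been converted to upper
       case.

```python
def control_escape_ascii(ch):
    # Value produced by \c followed by the ASCII character ch.
    code = ord(ch)
    if code > 127:
        raise ValueError("non-ASCII character after \\c")
    if "a" <= ch <= "z":
        code = ord(ch.upper())  # lower case letters are upper-cased first
    return code ^ 0x40  # invert bit 6


def control_escape_ebcdic(byte):
    # EBCDIC mode: invert the 0xc0 bits of the (upper case) byte.
    return byte ^ 0xC0


for ch in "AZa{;":
    print(ch, format(control_escape_ascii(ch), "02X"))
print("EBCDIC A", format(control_escape_ebcdic(0xC1), "02X"))
print("EBCDIC Z", format(control_escape_ebcdic(0xE9), "02X"))
```

       Because the operation is an exclusive OR, applying it twice returns
       the original value, which is why \c{ gives 3B and \c; gives 7B.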

       By default, after \x, from zero to two hexadecimal digits are read
       (letters can be in upper or lower case). Any number of hexadecimal
       digits may appear between \x{ and }, but the character code is
       constrained as follows:

         8-bit non-UTF mode    less than 0x100
         8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
         16-bit non-UTF mode   less than 0x10000
         16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
         32-bit non-UTF mode   less than 0x80000000
         32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint

       Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the
       so-called "surrogate" codepoints), and 0xffef.

       If characters other than hexadecimal digits appear between \x{ and },
       or if there is no terminating }, this form of escape is not
       recognized. Instead, the initial \x will be interpreted as a basic
       hexadecimal escape, with no following digits, giving a character whose
       value is zero.

       If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
       is as just described only when it is followed by two hexadecimal
       digits. Otherwise, it matches a literal "x" character. In JavaScript
       mode, support for code points greater than 256 is provided by \u,
       which must be followed by four hexadecimal digits; otherwise it
       matches a literal "u" character. Character codes specified by \u in
       JavaScript mode are constrained in the same way as those specified by
       \x in non-JavaScript mode.

       Characters whose value is less than 256 can be defined by either of
       the two syntaxes for \x (or by \u in JavaScript mode). There is no
       difference in the way they are handled. For example, \xdc is exactly
       the same as \x{dc} (or \u00dc in JavaScript mode).

       After \0 up to two further octal digits are read. If there are fewer
       than two digits, just those that are present are used.
       Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
       character (code value 7). Make sure you supply two digits after the
       initial zero if the pattern character that follows is itself an octal
       digit.

       The handling of a backslash followed by a digit other than 0 is
       complicated. Outside a character class, PCRE reads it and any
       following digits as a decimal number. If the number is less than 10,
       or if there have been at least that many previous capturing left
       parentheses in the expression, the entire sequence is taken as a back
       reference. A description of how this works is given later, following
       the discussion of parenthesized subpatterns.

       Inside a character class, or if the decimal number is greater than 9
       and there have not been that many capturing subpatterns, PCRE re-reads
       up to three octal digits following the backslash, and uses them to
       generate a data character. Any subsequent digits stand for themselves.
       The value of the character is constrained in the same way as
       characters specified in hexadecimal. For example:

         \040   is another way of writing an ASCII space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the value 255 (decimal)
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note that octal values of 100 or greater must not be introduced by a
       leading zero, because no more than three octal digits are ever read.
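
       The decision rules for a backslash followed by digits can be written
       out as a short function. The Python sketch below follows the prose
       above and is not taken from PCRE's source; the function name, its
       arguments, and the shape of the return value are all hypothetical.
       The digits argument is the whole run of decimal digits that follows
       the backslash, and groups_so_far is the number of capturing left
       parentheses seen earlier in the pattern.

```python
OCTAL_DIGITS = "01234567"


def interpret_backslash_digits(digits, groups_so_far, in_class=False):
    # Returns ("backref", n, "") or ("char", code, trailing_literal_digits).
    if digits[0] == "0":
        # Backslash-zero: read up to two further octal digits.
        n = 1
        while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
            n += 1
        return ("char", int(digits[:n], 8), digits[n:])
    number = int(digits)
    if not in_class and (number < 10 or number <= groups_so_far):
        return ("backref", number, "")
    # Otherwise re-read up to three octal digits after the backslash.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
        n += 1
    code = int(digits[:n], 8) if n else 0  # no octal digits: binary zero
    return ("char", code, digits[n:])


for digits, groups in (("040", 0), ("7", 0), ("11", 2), ("11", 11),
                       ("0113", 0), ("377", 0), ("81", 0)):
    print(digits, groups, interpret_backslash_digits(digits, groups))
```

       The \11 case is the one most likely to surprise: the same two
       characters give a tab when the pattern has fewer than 11 capturing
       groups before that point and a back reference otherwise.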

       All the sequences that define a single character value can be used
       both inside and outside character classes. In addition, inside a
       character class, \b is interpreted as the backspace character (hex
       08).

       \N is not allowed in a character class. \B, \R, and \X are not special
       inside a character class. Like other unrecognized escape sequences,
       they are treated as the literal characters "B", "R", and "X" by
       default, but cause an error if the PCRE_EXTRA option is set. Outside a
       character class, these sequences have different meanings.

   Unsupported escape sequences

       In Perl, the sequences \l, \L, \u, and \U are recognized by its string
       handler and used to modify the case of following characters. By
       default, PCRE does not support these escape sequences. However, if the
       PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and
       \u can be used to define a character by code point, as described in
       the previous section.

   Absolute and relative back references

       The sequence \g followed by an unsigned or a negative number,
       optionally enclosed in braces, is an absolute or relative back
       reference. A named back reference can be coded as \g{name}. Back
       references are discussed later, following the discussion of
       parenthesized subpatterns.

   Absolute and relative subroutine calls

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes,
       is an alternative syntax for referencing a subpattern as a
       "subroutine". Details are discussed later. Note that \g{...} (Perl
       syntax) and \g<...> (Oniguruma syntax) are not synonymous. The former
       is a back reference; the latter is a subroutine call.
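
       Since the two \g forms differ only in their delimiters, a small
       classifier makes the distinction concrete. This Python sketch is
       illustrative only: the function name is hypothetical, and the regular
       expressions are a simplified reading of the two paragraphs above, not
       PCRE's own parsing rules.

```python
import re


def classify_g_escape(text):
    # Bare or braced number (optionally negative), or a braced name.
    if re.fullmatch(r"\\g(?:-?\d+|\{-?\d+\}|\{\w+\})", text):
        return "back reference"
    # Name or number in angle brackets or single quotes.
    if re.fullmatch(r"\\g(?:<[^>]+>|'[^']+')", text):
        return "subroutine call"
    return "unrecognized"


for item in (r"\g2", r"\g{-1}", r"\g{name}", r"\g<name>", r"\g'1'"):
    print(item, "->", classify_g_escape(item))
```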

   Generic character types

       Another use of backslash is for specifying generic character types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal white space character
         \H     any character that is not a horizontal white space character
         \s     any white space character
         \S     any character that is not a white space character
         \v     any vertical white space character
         \V     any character that is not a vertical white space character
         \w     any "word" character
         \W     any "non-word" character

       There is also the single sequence \N, which matches a non-newline
       character. This is the same as the "." metacharacter when PCRE_DOTALL
       is not set. Perl also uses \N to match characters by name; PCRE does
       not support this.

       Each pair of lower and upper case escape sequences partitions the
       complete set of characters into two disjoint sets. Any given character
       matches one, and only one, of each pair. The sequences can appear both
       inside and outside character classes. They each match one character of
       the appropriate type. If the current matching point is at the end of
       the subject string, all of them fail, because there is no character to
       match.

       For compatibility with Perl, \s does not match the VT character (code
       11). This makes it different from the POSIX "space" class. The \s
       characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
       "use locale;" is included in a Perl script, \s may match the VT
       character. In PCRE, it never does.

       A "word" character is an underscore or any character that is a letter
       or digit. By default, the definition of letters and digits is
       controlled by PCRE's low-valued character tables, and may vary if
       locale-specific matching is taking place (see "Locale support" in the
       pcreapi page).
       For example, in a French locale such as "fr_FR" in Unix-like systems,
       or "french" in Windows, some character codes greater than 128 are used
       for accented letters, and these are then matched by \w. The use of
       locales with Unicode is discouraged.

       By default, in a UTF mode, characters with values greater than 128
       never match \d, \s, or \w, and always match \D, \S, and \W. These
       sequences retain their original meanings from before UTF support was
       available, mainly for efficiency reasons. However, if PCRE is compiled
       with Unicode property support, and the PCRE_UCP option is set, the
       behaviour is changed so that Unicode properties are used to determine
       character types, as follows:

         \d  any character that \p{Nd} matches (decimal digit)
         \s  any character that \p{Z} matches, plus HT, LF, FF, CR
         \w  any character that \p{L} or \p{N} matches, plus underscore

       The upper case escapes match the inverse sets of characters. Note that
       \d matches only decimal digits, whereas \w matches any Unicode digit,
       as well as any Unicode letter, and underscore. Note also that PCRE_UCP
       affects \b and \B because they are defined in terms of \w and \W.
       Matching these sequences is noticeably slower when PCRE_UCP is set.

       The sequences \h, \H, \v, and \V are features that were added to Perl
       at release 5.10. In contrast to the other sequences, which match only
       ASCII characters by default, these always match certain high-valued
       codepoints, whether or not PCRE_UCP is set.
       The horizontal space characters are:

         U+0009     Horizontal tab (HT)
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

       The vertical space characters are:

         U+000A     Linefeed (LF)
         U+000B     Vertical tab (VT)
         U+000C     Form feed (FF)
         U+000D     Carriage return (CR)
         U+0085     Next line (NEL)
         U+2028     Line separator
         U+2029     Paragraph separator

       In 8-bit, non-UTF-8 mode, only the characters with codepoints less
       than 256 are relevant.

   Newline sequences

       Outside a character class, by default, the escape sequence \R matches
       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
       to the following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given
       below. This particular group matches either the two-character sequence
       CR followed by LF, or one of the single characters LF (linefeed,
       U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR
       (carriage return, U+000D), or NEL (next line, U+0085). The
       two-character sequence is treated as a single unit that cannot be
       split.

       In other modes, two additional characters whose codepoints are greater
       than 255 are added: LS (line separator, U+2028) and PS (paragraph
       separator, U+2029). Unicode character property support is not needed
       for these characters to be recognized.
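
       The two white space lists and the default \R set translate directly
       into lookup tables. The Python sketch below is a model of the
       definitions above, not of PCRE's implementation; the names
       HORIZONTAL, VERTICAL, and newline_sequence_length are hypothetical,
       and the eight_bit_non_utf flag is an assumed way of representing the
       8-bit non-UTF-8 mode.

```python
# Code points matched by the horizontal and vertical white space escapes.
HORIZONTAL = {0x09, 0x20, 0xA0, 0x1680, 0x180E,
              *range(0x2000, 0x200B),  # U+2000 .. U+200A
              0x202F, 0x205F, 0x3000}
VERTICAL = {0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029}


def newline_sequence_length(subject, pos, eight_bit_non_utf=False):
    # Length of the default backslash-R match at pos, or 0 if none.
    if subject.startswith("\r\n", pos):
        return 2  # CRLF is taken as one unit
    if pos < len(subject):
        code = ord(subject[pos])
        if code in VERTICAL and not (eight_bit_non_utf and code > 0xFF):
            return 1
    return 0


print(len(HORIZONTAL), len(VERTICAL))
print(newline_sequence_length("a\r\nb", 1))
print(newline_sequence_length("\u2028", 0))
print(newline_sequence_length("\u2028", 0, eight_bit_non_utf=True))
```

       The single-character alternatives of the default \R set are exactly
       the vertical space list, which is why one table serves both purposes
       here.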

       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set of Unicode line endings) by setting the option
       PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
       (BSR is an abbreviation for "backslash R".) This can be made the default
       when PCRE is built; if this is the case, the other behaviour can be
       requested via the PCRE_BSR_UNICODE option. It is also possible to
       specify these settings by starting a pattern string with one of the
       following sequences:

         (*BSR_ANYCRLF)   CR, LF, or CRLF only
         (*BSR_UNICODE)   any Unicode newline sequence

       These override the default and the options given to the compiling func-
       tion, but they can themselves be overridden by options given to a
       matching function. Note that these special settings, which are not
       Perl-compatible, are recognized only at the very start of a pattern,
       and that they must be in upper case. If more than one of them is
       present, the last one is used. They can be combined with a change of
       newline convention; for example, a pattern can start with:

         (*ANY)(*BSR_ANYCRLF)

       They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF)
       or (*UCP) special sequences. Inside a character class, \R is treated as
       an unrecognized escape sequence, and so matches the letter "R" by
       default, but causes an error if PCRE_EXTRA is set.

   Unicode character properties

       When PCRE is built with Unicode character property support, three addi-
       tional escape sequences that match characters with specific properties
       are available. When in 8-bit non-UTF-8 mode, these sequences are of
       course limited to testing characters whose codepoints are less than
       256, but they do work in this mode.
       The extra escape sequences are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       a Unicode extended grapheme cluster

       The property names represented by xx above are limited to the Unicode
       script names, the general category properties, "Any", which matches any
       character (including newline), and some special PCRE properties
       (described in the next section). Other Perl properties such as "InMu-
       sicalSymbols" are not currently supported by PCRE. Note that \P{Any}
       does not match any characters, so always causes a match failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A character from one of these sets can be matched using a script name.
       For example:

         \p{Greek}
         \P{Han}

       Those that are not part of an identified script are lumped together as
       "Common". The current list of scripts is:

       Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
       Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma,
       Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
       Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic,
       Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
       gana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip-
       tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
       Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
       Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive,
       Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko,
       Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic,
       Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari-
       tan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese,
       Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet,
       Takri, Tamil,
       Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,
       Yi.

       Each character has exactly one Unicode general category property, spec-
       ified by a two-letter abbreviation. For compatibility with Perl, nega-
       tion can be specified by including a circumflex between the opening
       brace and the property name. For example, \p{^Lu} is the same as
       \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the gen-
       eral category properties that start with that letter. In this case, in
       the absence of negation, the curly brackets in the escape sequence are
       optional; these two examples have the same effect:

         \p{L}
         \pL

       The following general category property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       The special property L& is also supported: it matches a character that
       has the Lu, Ll, or Lt property, in other words, a letter that is not
       classified as a modifier or "other".

       The Cs (Surrogate) property applies only to characters in the range
       U+D800 to U+DFFF.
       Such characters are not valid in Unicode strings and
       so cannot be tested by PCRE, unless UTF validity checking has been
       turned off (see the discussion of PCRE_NO_UTF8_CHECK,
       PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page). Perl
       does not support the Cs property.

       The long synonyms for property names that Perl supports (such as
       \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned) prop-
       erty. Instead, this property is assumed for any code point that is not
       in the Unicode table.

       Specifying caseless matching does not affect these escape sequences.
       For example, \p{Lu} always matches only upper case letters.

       Matching characters by Unicode property is not fast, because PCRE has
       to do a multistage table lookup in order to find a character's prop-
       erty. That is why the traditional escape sequences such as \d and \w do
       not use Unicode properties in PCRE by default, though you can make them
       do so by setting the PCRE_UCP option or by starting the pattern with
       (*UCP).

   Extended grapheme clusters

       The \X escape matches any number of Unicode characters that form an
       "extended grapheme cluster", and treats the sequence as an atomic group
       (see below).  Up to and including release 8.31, PCRE matched an ear-
       lier, simpler definition that was equivalent to

         (?>\PM\pM*)

       That is, it matched a character without the "mark" property, followed
       by zero or more characters with the "mark" property. Characters with
       the "mark" property are typically non-spacing accents that affect the
       preceding character.
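
       The earlier, simpler definition can be sketched by hand. The code
       below is Python using the standard unicodedata module, not PCRE;
       the function name simple_clusters is hypothetical, and it uses
       whatever Unicode tables ship with the Python interpreter rather
       than PCRE's own.

```python
import unicodedata


def simple_clusters(text):
    """Group text as the old (?>\\PM\\pM*) definition would: one character
    without the "mark" property, followed by any marks that come after it."""
    clusters = []
    for ch in text:
        # General categories Mc, Me and Mn all start with "M" (mark).
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch   # attach the mark to the preceding cluster
        else:
            # Simplification: a mark with nothing before it starts its own
            # cluster here, whereas the pattern would not match at a mark.
            clusters.append(ch)
    return clusters


# "e" followed by COMBINING ACUTE ACCENT (U+0301), then a plain "x".
result = simple_clusters("e\u0301x")
print(len(result))
```

       The later rules for extended grapheme clusters add cases (Hangul
       syllables, CR LF, prepend characters) that this sketch does not
       attempt to cover.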

       This simple definition was extended in Unicode to include more compli-
       cated kinds of composite character by giving each character a grapheme
       breaking property, and creating rules that use these properties to
       define the boundaries of extended grapheme clusters. In releases of
       PCRE later than 8.31, \X matches one of these clusters.

       \X always matches at least one character. Then it decides whether to
       add additional characters according to the following rules for ending a
       cluster:

       1. End at the end of the subject string.

       2. Do not end between CR and LF; otherwise end after any control char-
       acter.

       3. Do not break Hangul (a Korean script) syllable sequences. Hangul
       characters are of five types: L, V, T, LV, and LVT. An L character may
       be followed by an L, V, LV, or LVT character; an LV or V character may
       be followed by a V or T character; an LVT or T character may be followed
       only by a T character.

       4. Do not end before extending characters or spacing marks. Characters
       with the "mark" property always have the "extend" grapheme breaking
       property.

       5. Do not end after prepend characters.

       6. Otherwise, end the cluster.

   PCRE's additional properties

       As well as the standard Unicode properties described above, PCRE sup-
       ports four more that make it possible to convert traditional escape
       sequences such as \w and \s and POSIX character classes to use Unicode
       properties. PCRE uses these non-standard, non-Perl properties inter-
       nally when PCRE_UCP is set. They are:

         Xan   Any alphanumeric character
         Xps   Any POSIX space character
         Xsp   Any Perl space character
         Xwd   Any Perl "word" character

       Xan matches characters that have either the L (letter) or the N (num-
       ber) property.
       Xps matches the characters tab, linefeed, vertical tab,
       form feed, or carriage return, and any other character that has the Z
       (separator) property.  Xsp is the same as Xps, except that vertical tab
       is excluded. Xwd matches the same characters as Xan, plus underscore.

   Resetting the match start

       The escape sequence \K causes any previously matched characters not to
       be included in the final matched sequence. For example, the pattern:

         foo\Kbar

       matches "foobar", but reports that it has matched "bar". This feature
       is similar to a lookbehind assertion (described below).  However, in
       this case, the part of the subject before the real match does not have
       to be of fixed length, as lookbehind assertions do. The use of \K does
       not interfere with the setting of captured substrings.  For example,
       when the pattern

         (foo)\Kbar

       matches "foobar", the first substring is still set to "foo".

       Perl documents that the use of \K within assertions is "not well
       defined". In PCRE, \K is acted upon when it occurs inside positive
       assertions, but is ignored in negative assertions.

   Simple assertions

       The final use of backslash is for certain simple assertions. An asser-
       tion specifies a condition that has to be met at a particular point in
       a match, without consuming any characters from the subject string. The
       use of subpatterns for more complicated assertions is described below.
       The backslashed assertions are:

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at the start of the subject
         \Z     matches at the end of the subject
                 also matches before a newline at the end of the subject
         \z     matches only at the end of the subject
         \G     matches at the first matching position in the subject

       Inside a character class, \b has a different meaning; it matches the
       backspace character. If any other of these assertions appears in a
       character class, by default it matches the corresponding literal char-
       acter (for example, \B matches the letter B). However, if the
       PCRE_EXTRA option is set, an "invalid escape sequence" error is gener-
       ated instead.

       A word boundary is a position in the subject string where the current
       character and the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or end of the
       string if the first or last character matches \w, respectively. In a
       UTF mode, the meanings of \w and \W can be changed by setting the
       PCRE_UCP option. When this is done, it also affects \b and \B. Neither
       PCRE nor Perl has a separate "start of word" or "end of word" metase-
       quence. However, whatever follows \b normally determines which it is.
       For example, the fragment \ba matches "a" at the start of a word.

       The \A, \Z, and \z assertions differ from the traditional circumflex
       and dollar (described in the next section) in that they only ever match
       at the very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These three asser-
       tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
       affect only the behaviour of the circumflex and dollar metacharacters.
       However, if the startoffset argument of pcre_exec() is non-zero, indi-
       cating that matching is to start at a point other than the beginning of
       the subject, \A can never match. The difference between \Z and \z is
       that \Z matches before a newline at the end of the string as well as at
       the very end, whereas \z matches only at the end.

       The \G assertion is true only when the current matching position is at
       the start point of the match, as specified by the startoffset argument
       of pcre_exec(). It differs from \A when the value of startoffset is
       non-zero. By calling pcre_exec() multiple times with appropriate argu-
       ments, you can mimic Perl's /g option, and it is in this kind of imple-
       mentation where \G can be useful.

       Note, however, that PCRE's interpretation of \G, as the start of the
       current match, is subtly different from Perl's, which defines it as the
       end of the previous match. In Perl, these can be different when the
       previously matched string was empty. Because PCRE does just one match
       at a time, it cannot reproduce this behaviour.

       If all the alternatives of a pattern begin with \G, the expression is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

       The circumflex and dollar metacharacters are zero-width assertions.
       That is, they test for a particular condition being true without con-
       suming any characters from the subject string.

       Outside a character class, in the default matching mode, the circumflex
       character is an assertion that is true only if the current matching
       point is at the start of the subject string. If the startoffset argu-
       ment of pcre_exec() is non-zero, circumflex can never match if the
       PCRE_MULTILINE option is unset.
       Inside a character class, circumflex
       has an entirely different meaning (see below).

       Circumflex need not be the first character of the pattern if a number
       of alternatives are involved, but it should be the first thing in each
       alternative in which it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to match only at the start of the sub-
       ject, it is said to be an "anchored" pattern. (There are also other
       constructs that can cause a pattern to be anchored.)

       The dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline at the end of the string (by default). Note, however,
       that it does not actually match the newline. Dollar need not be the
       last character of the pattern if a number of alternatives are involved,
       but it should be the last item in any branch in which it appears. Dol-
       lar has no special meaning in a character class.

       The meaning of dollar can be changed so that it matches only at the
       very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are changed if the
       PCRE_MULTILINE option is set. When this is the case, a circumflex
       matches immediately after internal newlines as well as at the start of
       the subject string. It does not match after a newline that ends the
       string. A dollar matches before any newlines in the string, as well as
       at the very end, when PCRE_MULTILINE is set. When newline is specified
       as the two-character sequence CRLF, isolated CR and LF characters do
       not indicate newlines.
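
       The effect of multiline mode on circumflex and dollar can be tried
       out with the sketch below. It uses Python's standard re module as a
       stand-in, on the assumption that its re.MULTILINE flag behaves like
       PCRE_MULTILINE for these simple cases; it is not a PCRE program,
       and the subject strings are invented for the example.

```python
import re

subject = "first\nsecond"

# Default mode: circumflex is true only at the very start of the subject,
# so a pattern anchored to "second" finds nothing.
single_line = re.search(r"^second$", subject)

# Multiline mode: circumflex also matches just after the internal newline.
multi_line = re.search(r"^second$", subject, re.MULTILINE)

# By default dollar is also true just before a newline that ends the
# subject, but the newline itself is not part of the match.
before_final_newline = re.search(r"second$", "second\n")

print(single_line, multi_line.start(), before_final_newline.group())
```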

       For example, the pattern /^abc$/ matches the subject string "def\nabc"
       (where \n represents a newline) in multiline mode, but not otherwise.
       Consequently, patterns that are anchored in single line mode because
       all branches start with ^ are not anchored in multiline mode, and a
       match for circumflex is possible when the startoffset argument of
       pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
       PCRE_MULTILINE is set.

       Note that the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a pattern
       start with \A it is always anchored, whether or not PCRE_MULTILINE is
       set.


FULL STOP (PERIOD, DOT) AND \N

       Outside a character class, a dot in the pattern matches any one charac-
       ter in the subject string except (by default) a character that signi-
       fies the end of a line.

       When a line ending is defined as a single character, dot never matches
       that character; when the two-character sequence CRLF is used, dot does
       not match CR if it is immediately followed by LF, but otherwise it
       matches all characters (including isolated CRs and LFs). When any Uni-
       code line endings are being recognized, dot does not match CR or LF or
       any of the other line ending characters.

       The behaviour of dot with regard to newlines can be changed. If the
       PCRE_DOTALL option is set, a dot matches any one character, without
       exception. If the two-character sequence CRLF is present in the subject
       string, it takes two dots to match it.

       The handling of dot is entirely independent of the handling of circum-
       flex and dollar, the only relationship being that they both involve
       newlines. Dot has no special meaning in a character class.

       The escape sequence \N behaves like a dot, except that it is not
       affected by the PCRE_DOTALL option.
       In other words, it matches any
       character except one that signifies the end of a line. Perl also uses
       \N to match characters by name; PCRE does not support this.


MATCHING A SINGLE DATA UNIT

       Outside a character class, the escape sequence \C matches any one data
       unit, whether or not a UTF mode is set. In the 8-bit library, one data
       unit is one byte; in the 16-bit library it is a 16-bit unit; in the
       32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
       line-ending characters. The feature is provided in Perl in order to
       match individual bytes in UTF-8 mode, but it is unclear how it can use-
       fully be used. Because \C breaks up characters into individual data
       units, matching one unit with \C in a UTF mode means that the rest of
       the string may start with a malformed UTF character. This has undefined
       results, because PCRE assumes that it is dealing with valid UTF strings
       (and by default it checks this at the start of processing unless the
       PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or PCRE_NO_UTF32_CHECK option
       is used).

       PCRE does not allow \C to appear in lookbehind assertions (described
       below) in a UTF mode, because this would make it impossible to calcu-
       late the length of the lookbehind.

       In general, the \C escape sequence is best avoided. However, one way of
       using it that avoids the problem of malformed UTF characters is to use
       a lookahead to check the length of the next character, as in this pat-
       tern, which could be used with a UTF-8 string (ignore white space and
       line breaks):

         (?| (?=[\x00-\x7f])(\C) |
             (?=[\x80-\x{7ff}])(\C)(\C) |
             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

       A group that starts with (?| resets the capturing parentheses numbers
       in each alternative (see "Duplicate Subpattern Numbers" below).
       The assertions at the start of each branch check the next UTF-8 character
       for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
       character's individual bytes are then captured by the appropriate num-
       ber of groups.


SQUARE BRACKETS AND CHARACTER CLASSES

       An opening square bracket introduces a character class, terminated by a
       closing square bracket. A closing square bracket on its own is not spe-
       cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
       a lone closing square bracket causes a compile-time error. If a closing
       square bracket is required as a member of the class, it should be the
       first data character in the class (after an initial circumflex, if
       present) or escaped with a backslash.

       A character class matches a single character in the subject. In a UTF
       mode, the character may be more than one data unit long. A matched
       character must be in the set of characters defined by the class, unless
       the first character in the class definition is a circumflex, in which
       case the subject character must not be in the set defined by the class.
       If a circumflex is actually required as a member of the class, ensure
       it is not the first character, or escape it with a backslash.

       For example, the character class [aeiou] matches any lower case vowel,
       while [^aeiou] matches any character that is not a lower case vowel.
       Note that a circumflex is just a convenient notation for specifying the
       characters that are in the class by enumerating those that are not. A
       class that starts with a circumflex is not an assertion; it still con-
       sumes a character from the subject string, and therefore it fails if
       the current pointer is at the end of the string.
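
       The point that a negated class still consumes a character can be
       checked with a few lines of code. The sketch below uses Python's
       standard re module rather than PCRE, on the assumption that the two
       agree for these basic classes; the variable names are made up for
       the illustration.

```python
import re

VOWEL = re.compile(r"[aeiou]")
NOT_VOWEL = re.compile(r"[^aeiou]")

vowel_hit = VOWEL.fullmatch("e") is not None          # in the class
consonant_hit = NOT_VOWEL.fullmatch("z") is not None  # not in the class
vowel_rejected = NOT_VOWEL.fullmatch("e") is None     # excluded by the ^

# A negated class is not an assertion: it needs a character to consume,
# so it fails when the matching point is already at the end of the string.
fails_at_end = re.fullmatch(r"ab[^c]", "ab") is None

# A circumflex that is not first in the class is an ordinary member, and
# a closing bracket that is the first data character is also a member.
literal_circumflex = re.fullmatch(r"[a^]", "^") is not None
literal_bracket = re.fullmatch(r"[]x]", "]") is not None

print(vowel_hit, consonant_hit, vowel_rejected, fails_at_end)
```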

       In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255
       (0xffff) can be included in a class as a literal string of data units,
       or by using the \x{ escaping mechanism.

       When caseless matching is set, any letters in a class represent both
       their upper case and lower case versions, so for example, a caseless
       [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
       match "A", whereas a caseful version would. In a UTF mode, PCRE always
       understands the concept of case for characters whose values are less
       than 128, so caseless matching is always possible. For characters with
       higher values, the concept of case is supported if PCRE is compiled
       with Unicode property support, but not otherwise.  If you want to use
       caseless matching in a UTF mode for characters 128 and above, you must
       ensure that PCRE is compiled with Unicode property support as well as
       with UTF support.

       Characters that might indicate line breaks are never treated in any
       special way when matching character classes, whatever line-ending
       sequence is in use, and whatever setting of the PCRE_DOTALL and
       PCRE_MULTILINE options is used. A class such as [^a] always matches one
       of these characters.

       The minus (hyphen) character can be used to specify a range of charac-
       ters in a character class. For example, [d-m] matches any letter
       between d and m, inclusive. If a minus character is required in a
       class, it must be escaped with a backslash or appear in a position
       where it cannot be interpreted as indicating a range, typically as the
       first or last character in the class.

       It is not possible to have the literal character "]" as the end charac-
       ter of a range. A pattern such as [W-]46] is interpreted as a class of
       two characters ("W" and "-") followed by a literal string "46]", so it
       would match "W46]" or "-46]".
       However, if the "]" is escaped with a
       backslash it is interpreted as the end of range, so [W-\]46] is inter-
       preted as a class containing a range followed by two other characters.
       The octal or hexadecimal representation of "]" can also be used to end
       a range.

       Ranges operate in the collating sequence of character values. They can
       also be used for characters specified numerically, for example
       [\000-\037]. Ranges can include any characters that are valid for the
       current mode.

       If a range that includes letters is used when caseless matching is set,
       it matches the letters in either case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
       character tables for a French locale are in use, [\xc8-\xcb] matches
       accented E characters in both cases. In UTF modes, PCRE supports the
       concept of case for characters with values greater than 128 only when
       it is compiled with Unicode property support.

       The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
       \w, and \W may appear in a character class, and add the characters that
       they match to the class. For example, [\dABCDEF] matches any hexadeci-
       mal digit. In UTF modes, the PCRE_UCP option affects the meanings of
       \d, \s, \w and their upper case partners, just as it does when they
       appear outside a character class, as described in the section entitled
       "Generic character types" above. The escape sequence \b has a different
       meaning inside a character class; it matches the backspace character.
       The sequences \B, \N, \R, and \X are not special inside a character
       class. Like any other unrecognized escape sequences, they are treated
       as the literal characters "B", "N", "R", and "X" by default, but cause
       an error if the PCRE_EXTRA option is set.
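
       Some of the class features described above (an escaped "]" ending a
       range, \d inside a class, and \b as backspace) can be exercised
       with the following sketch. It is written for Python's standard re
       module, not PCRE; the re.ASCII flag is used here as an assumption
       to keep \d limited to 0-9, roughly as PCRE does without PCRE_UCP,
       and the constant names are invented.

```python
import re

# [W-]46] is a two-character class ("W" and "-") followed by literal "46]".
CLASS_THEN_LITERAL = re.compile(r"[W-]46]")

# With the "]" escaped, the range runs from "W" to "]", and "4" and "6"
# are two further members of the same class.
RANGE_TO_BRACKET = re.compile(r"[W-\]46]")

# \d contributes the decimal digits to the class.
HEX_DIGIT = re.compile(r"[\dABCDEF]", re.ASCII)

# Inside a class, \b is the backspace character, not a word boundary.
BACKSPACE = re.compile(r"[\b]")

in_range = [c for c in "VWZ]^456-" if RANGE_TO_BRACKET.fullmatch(c)]
print(in_range)
```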

       A circumflex can conveniently be used with the upper case character
       types to specify a more restricted set of characters than the matching
       lower case type.  For example, the class [^\W_] matches any letter or
       digit, but not underscore, whereas [\w] includes underscore. A positive
       character class should be read as "something OR something OR ..." and a
       negative class as "NOT something AND NOT something AND NOT ...".

       The only metacharacters that are recognized in character classes are
       backslash, hyphen (only where it can be interpreted as specifying a
       range), circumflex (only at the start), opening square bracket (only
       when it can be interpreted as introducing a POSIX class name - see the
       next section), and the terminating closing square bracket. However,
       escaping other non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation for character classes. This uses names
       enclosed by [: and :] within the enclosing square brackets. PCRE also
       supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are:

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits and space
         space    white space (not quite the same as \s)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
       and space (32). Notice that this list includes the VT character (code
       11).
       This makes "space" different to \s, which does not include VT (for
       Perl compatibility).

       The name "word" is a Perl extension, and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is negation, which is indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported, and an error is given if they are encountered.

       By default, in UTF modes, characters with values greater than 128 do
       not match any of the POSIX character classes. However, if the PCRE_UCP
       option is passed to pcre_compile(), some of the classes are changed so
       that Unicode character properties are used. This is achieved by replac-
       ing the POSIX classes by other sequences, as follows:

         [:alnum:]  becomes  \p{Xan}
         [:alpha:]  becomes  \p{L}
         [:blank:]  becomes  \h
         [:digit:]  becomes  \p{Nd}
         [:lower:]  becomes  \p{Ll}
         [:space:]  becomes  \p{Xps}
         [:upper:]  becomes  \p{Lu}
         [:word:]   becomes  \p{Xwd}

       Negated versions, such as [:^alpha:] use \P instead of \p. The other
       POSIX classes are unchanged, and match only characters with code points
       less than 128.


VERTICAL BAR

       Vertical bar characters are used to separate alternative patterns. For
       example, the pattern

         gilbert|sullivan

       matches either "gilbert" or "sullivan". Any number of alternatives may
       appear, and an empty alternative is permitted (matching the empty
       string). The matching process tries each alternative in turn, from left
       to right, and the first one that succeeds is used. If the alternatives
       are within a subpattern (defined below), "succeeds" means matching the
       rest of the main pattern as well as the alternative in the subpattern.
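
       The left-to-right rule for alternatives is easy to observe. The
       sketch below uses Python's standard re module, which is assumed to
       follow the same rule as PCRE's Perl-compatible matching function;
       it is not a PCRE program, and the short patterns other than
       gilbert|sullivan are invented for the example.

```python
import re

either = re.fullmatch(r"gilbert|sullivan", "sullivan")

# Alternatives are tried from left to right and the first that succeeds
# is used, even when a later alternative would match more text.
leftmost = re.match(r"a|ab", "ab")

# Inside a subpattern, "succeeds" includes the rest of the pattern: "a"
# alone cannot be followed by "c" here, so the second alternative is used.
with_rest = re.fullmatch(r"(a|ab)c", "abc")

# An empty alternative matches the empty string.
empty_alternative = re.fullmatch(r"cat(s|)", "cat")

print(leftmost.group(), with_rest.group(1))
```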


INTERNAL OPTION SETTING

       The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
       PCRE_EXTENDED options (which are Perl-compatible) can be changed from
       within the pattern by a sequence of Perl option letters enclosed
       between "(?" and ")".  The option letters are

         i  for PCRE_CASELESS
         m  for PCRE_MULTILINE
         s  for PCRE_DOTALL
         x  for PCRE_EXTENDED

       For example, (?im) sets caseless, multiline matching. It is also possi-
       ble to unset these options by preceding the letter with a hyphen, and a
       combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
       LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
       is also permitted. If a letter appears both before and after the
       hyphen, the option is unset.

       The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
       can be changed in the same way as the Perl-compatible options by using
       the characters J, U and X respectively.

       When one of these option changes occurs at top level (that is, not
       inside subpattern parentheses), the change applies to the remainder of
       the pattern that follows. If the change is placed right at the start of
       a pattern, PCRE extracts it into the global options (and it will there-
       fore show up in data extracted by the pcre_fullinfo() function).

       An option change within a subpattern (see below for a description of
       subpatterns) affects only that part of the subpattern that follows it,
       so

         (a(?i)b)c

       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
       used).  By this means, options can be made to have different settings
       in different parts of the pattern. Any changes made in one alternative
       do carry on into subsequent branches within the same subpattern.
       For example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though when matching "C" the
       first branch is abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There would be
       some very weird behaviour otherwise.

       Note: There are other PCRE-specific options that can be set by the
       application when the compiling or matching functions are called. In
       some cases the pattern can contain special leading sequences such as
       (*CRLF) to override what the application has set or what has been
       defaulted. Details are given in the section entitled "Newline
       sequences" above. There are also the (*UTF8), (*UTF16), (*UTF32), and
       (*UCP) leading sequences that can be used to set UTF and Unicode
       property modes; they are equivalent to setting the PCRE_UTF8,
       PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP options, respectively. The
       (*UTF) sequence is a generic version that can be used with any of the
       libraries.


SUBPATTERNS

       Subpatterns are delimited by parentheses (round brackets), which can be
       nested. Turning part of a pattern into a subpattern does two things:

       1. It localizes a set of alternatives. For example, the pattern

         cat(aract|erpillar|)

       matches "cataract", "caterpillar", or "cat". Without the parentheses,
       it would match "cataract", "erpillar" or an empty string.

       2. It sets up the subpattern as a capturing subpattern. This means
       that, when the whole pattern matches, that portion of the subject
       string that matched the subpattern is passed back to the caller via the
       ovector argument of the matching function. (This applies only to the
       traditional matching functions; the DFA matching functions do not
       support capturing.)
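
       The two effects listed above can be observed in a small script. This
       is a sketch under stated assumptions: Python's re module is used as a
       stand-in for PCRE, its group(1) call plays the role that the ovector
       plays for a PCRE caller, and the helper name full() is hypothetical.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

def full(pattern, subject):
    """True if the whole subject is matched by the compiled pattern."""
    return pattern.fullmatch(subject) is not None

# Effect 1: the parentheses confine the alternatives to the text after "cat".
# Effect 2: the text matched inside the parentheses is captured as group 1.
captured = grouped.fullmatch("caterpillar").group(1)
captured_empty = grouped.fullmatch("cat").group(1)
```

       With the parentheses, "cat" is matched in full and group 1 is the
       empty string; without them, "cat" is not matched in full at all.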

       Opening parentheses are counted from left to right (starting from 1) to
       obtain numbers for the capturing subpatterns. For example, if the
       string "the red king" is matched against the pattern

         the ((red|white) (king|queen))

       the captured substrings are "red king", "red", and "king", and are
       numbered 1, 2, and 3, respectively.

       The fact that plain parentheses fulfil two functions is not always
       helpful. There are often times when a grouping subpattern is required
       without a capturing requirement. If an opening parenthesis is followed
       by a question mark and a colon, the subpattern does not do any
       capturing, and is not counted when computing the number of any
       subsequent capturing subpatterns. For example, if the string "the white
       queen" is matched against the pattern

         the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1 and 2. The maximum number of capturing subpatterns is 65535.

       As a convenient shorthand, if any option settings are required at the
       start of a non-capturing subpattern, the option letters may appear
       between the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried from left to right, and options are not reset until the end of
       the subpattern is reached, an option setting in one branch does affect
       subsequent branches, so the above patterns match "SUNDAY" as well as
       "Saturday".


DUPLICATE SUBPATTERN NUMBERS

       Perl 5.10 introduced a feature whereby each alternative in a subpattern
       uses the same numbers for its capturing parentheses. Such a subpattern
       starts with (?| and is itself a non-capturing subpattern.
       For example, consider this pattern:

         (?|(Sat)ur|(Sun))day

       Because the two alternatives are inside a (?| group, both sets of
       capturing parentheses are numbered one. Thus, when the pattern matches,
       you can look at captured substring number one, whichever alternative
       matched. This construct is useful when you want to capture part, but
       not all, of one of a number of alternatives. Inside a (?| group,
       parentheses are numbered as usual, but the number is reset at the start
       of each branch. The numbers of any capturing parentheses that follow
       the subpattern start after the highest number used in any branch. The
       following example is taken from the Perl documentation. The numbers
       underneath show in which buffer the captured content will be stored.

         # before ---------------branch-reset----------- after
         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
         # 1            2         2  3        2     3     4

       A back reference to a numbered subpattern uses the most recent value
       that is set for that number by any subpattern. The following pattern
       matches "abcabc" or "defdef":

         /(?|(abc)|(def))\1/

       In contrast, a subroutine call to a numbered subpattern always refers
       to the first one in the pattern with the given number. The following
       pattern matches "abcabc" or "defabc":

         /(?|(abc)|(def))(?1)/

       If a condition test for a subpattern's having matched refers to a
       non-unique number, the test is true if any of the subpatterns of that
       number have matched.

       An alternative approach to using this "branch reset" feature is to use
       duplicate named subpatterns, as described in the next section.


NAMED SUBPATTERNS

       Identifying capturing parentheses by number is simple, but it can be
       very hard to keep track of the numbers in complicated regular
       expressions.
       Furthermore, if an expression is modified, the numbers may
       change. To help with this difficulty, PCRE supports the naming of
       subpatterns. This feature was not added to Perl until release 5.10.
       Python had the feature earlier, and PCRE introduced it at release 4.0,
       using the Python syntax. PCRE now supports both the Perl and the Python
       syntax. Perl allows identically numbered subpatterns to have different
       names, but PCRE does not.

       In PCRE, a subpattern can be named in one of three ways: (?<name>...)
       or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
       to capturing parentheses from other parts of the pattern, such as back
       references, recursion, and conditions, can be made by name as well as
       by number.

       Names consist of up to 32 alphanumeric characters and underscores.
       Named capturing parentheses are still allocated numbers as well as
       names, exactly as if the names were not present. The PCRE API provides
       function calls for extracting the name-to-number translation table from
       a compiled pattern. There is also a convenience function for extracting
       a captured substring by name.

       By default, a name must be unique within a pattern, but it is possible
       to relax this constraint by setting the PCRE_DUPNAMES option at compile
       time. (Duplicate names are also always permitted for subpatterns with
       the same number, set up as described in the previous section.)
       Duplicate names can be useful for patterns where only one instance of
       the named parentheses can match. Suppose you want to match the name of
       a weekday, either as a 3-letter abbreviation or as the full name, and
       in both cases you want to extract the abbreviation. This pattern
       (ignoring the line breaks) does the job:

         (?<DN>Mon|Fri|Sun)(?:day)?|
         (?<DN>Tue)(?:sday)?|
         (?<DN>Wed)(?:nesday)?|
         (?<DN>Thu)(?:rsday)?|
         (?<DN>Sat)(?:urday)?

       There are five capturing substrings, but only one is ever set after a
       match. (An alternative way of solving this problem is to use a "branch
       reset" subpattern, as described in the previous section.)

       The convenience function for extracting the data by name returns the
       substring for the first (and in this example, the only) subpattern of
       that name that matched. This saves searching to find which numbered
       subpattern it was.

       If you make a back reference to a non-unique named subpattern from
       elsewhere in the pattern, the one that corresponds to the first
       occurrence of the name is used. In the absence of duplicate numbers
       (see the previous section) this is the one with the lowest number. If
       you use a named reference in a condition test (see the section about
       conditions below), either to check whether a subpattern has matched, or
       to check for recursion, all subpatterns with the same name are tested.
       If the condition is true for any one of them, the overall condition is
       true. This is the same behaviour as testing by number. For further
       details of the interfaces for handling named subpatterns, see the
       pcreapi documentation.

       Warning: You cannot use different names to distinguish between two
       subpatterns with the same number because PCRE uses only the numbers
       when matching. For this reason, an error is given at compile time if
       different names are given to subpatterns with the same number. However,
       you can give the same name to subpatterns with the same number, even
       when PCRE_DUPNAMES is not set.
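
       The relationship between names and numbers can be illustrated with the
       Python form of the syntax, which PCRE also accepts. This is a sketch
       only: Python's re module is used as a stand-in for PCRE, and its
       groupindex attribute is merely an analogue of the name-to-number table
       that the PCRE API exposes (see pcreapi for the real interface). Unique
       names are used, because Python has no equivalent of PCRE_DUPNAMES.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
# (?P<name>...) is the Python naming syntax, which PCRE also supports.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

m = date.fullmatch("2013-05")

# Named parentheses are still numbered, exactly as if they were unnamed,
# so the same substring is reachable by name and by number.
by_name = m.group("year")
by_number = m.group(1)

# Analogue of the name-to-number translation table.
name_table = dict(date.groupindex)
```

       Looking a group up by name avoids having to renumber references when
       the pattern is later modified.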


REPETITION

       Repetition is specified by quantifiers, which can follow any of the
       following items:

         a literal data character
         the dot metacharacter
         the \C escape sequence
         the \X escape sequence
         the \R escape sequence
         an escape such as \d or \pL that matches a single character
         a character class
         a back reference (see next section)
         a parenthesized subpattern (including assertions)
         a subroutine call to a subpattern (recursive or otherwise)

       The general repetition quantifier specifies a minimum and maximum
       number of permitted matches, by giving the two numbers in curly
       brackets (braces), separated by a comma. The numbers must be less than
       65536, and the first must be less than or equal to the second. For
       example:

         z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
       special character. If the second number is omitted, but the comma is
       present, there is no upper limit; if the second number and the comma
       are both omitted, the quantifier specifies an exact number of required
       matches. Thus

         [aeiou]{3,}

       matches at least 3 successive vowels, but may match many more, while

         \d{8}

       matches exactly 8 digits. An opening curly bracket that appears in a
       position where a quantifier is not allowed, or one that does not match
       the syntax of a quantifier, is taken as a literal character. For
       example, {,6} is not a quantifier, but a literal string of four
       characters.

       In UTF modes, quantifiers apply to characters rather than to individual
       data units. Thus, for example, \x{100}{2} matches two characters, each
       of which is represented by a two-byte sequence in a UTF-8 string.
       Similarly, \X{3} matches three Unicode extended grapheme clusters, each
       of which may be several data units long (and they may be of different
       lengths).
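
       The brace quantifiers described above can be tried out as follows.
       This is a sketch only, using Python's re module as a stand-in for PCRE;
       the helper name matches() is hypothetical. The {,6} case is left out
       on purpose, because Python treats it as a quantifier whereas PCRE
       treats it as literal text.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
def matches(pattern, subject):
    """True if the whole subject is matched by the pattern."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4}: collect the run lengths of "z" (0 to 6) that are accepted.
z_counts = [n for n in range(7) if matches(r"z{2,4}", "z" * n)]

# [aeiou]{3,}: at least three vowels, with no upper limit.
three_or_more = matches(r"[aeiou]{3,}", "aeiouaeiou")
too_few = matches(r"[aeiou]{3,}", "ae")

# \d{8}: exactly eight digits.
exactly_eight = matches(r"\d{8}", "20130528")
only_seven = matches(r"\d{8}", "2013052")
```

       Only the run lengths 2, 3 and 4 end up in z_counts, which is the
       minimum and maximum given in the braces.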

       The quantifier {0} is permitted, causing the expression to behave as if
       the previous item and the quantifier were not present. This may be
       useful for subpatterns that are referenced as subroutines from
       elsewhere in the pattern (but see also the section entitled "Defining
       subpatterns for use by reference only" below). Items other than
       subpatterns that have a {0} quantifier are omitted from the compiled
       pattern.

       For convenience, the three most common quantifiers have
       single-character abbreviations:

         *    is equivalent to {0,}
         +    is equivalent to {1,}
         ?    is equivalent to {0,1}

       It is possible to construct infinite loops by following a subpattern
       that can match no characters with a quantifier that has no upper limit,
       for example:

         (a?)*

       Earlier versions of Perl and PCRE used to give an error at compile time
       for such patterns. However, because there are cases where this can be
       useful, such patterns are now accepted, but if any repetition of the
       subpattern does in fact match no characters, the loop is forcibly
       broken.

       By default, the quantifiers are "greedy", that is, they match as much
       as possible (up to the maximum number of permitted times), without
       causing the rest of the pattern to fail. The classic example of where
       this gives problems is in trying to match comments in C programs. These
       appear between /* and */ and within the comment, individual * and /
       characters may appear. An attempt to match C comments by applying the
       pattern

         /\*.*\*/

       to the string

         /* first comment */  not comment  /* second comment */

       fails, because it matches the entire string owing to the greediness of
       the .* item.
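
       The greedy behaviour described above can be reproduced with a short
       script. This is a sketch only: Python's re module is used as a
       stand-in for PCRE (greedy quantifiers behave the same way in both),
       and the subject is written with single spaces.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
subject = "/* first comment */ not comment /* second comment */"

# The greedy .* runs to the end of the subject and then gives back only
# as much as is needed for the final \*/ to match.
greedy = re.search(r"/\*.*\*/", subject)

# Offset just past the first comment's closing "*/", for comparison.
first_comment_end = subject.index("*/") + 2
```

       The match in greedy covers the whole subject, so it ends well beyond
       first_comment_end instead of stopping at the first comment.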

       However, if a quantifier is followed by a question mark, it ceases to
       be greedy, and instead matches the minimum number of times possible, so
       the pattern

         /\*.*?\*/

       does the right thing with the C comments. The meaning of the various
       quantifiers is not otherwise changed, just the preferred number of
       matches. Do not confuse this use of question mark with its use as a
       quantifier in its own right. Because it has two uses, it can sometimes
       appear doubled, as in

         \d??\d

       which matches one digit by preference, but can match two if that is the
       only way the rest of the pattern matches.

       If the PCRE_UNGREEDY option is set (an option that is not available in
       Perl), the quantifiers are not greedy by default, but individual ones
       can be made greedy by following them with a question mark. In other
       words, it inverts the default behaviour.

       When a parenthesized subpattern is quantified with a minimum repeat
       count that is greater than 1 or with a limited maximum, more memory is
       required for the compiled pattern, in proportion to the size of the
       minimum or maximum.

       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option
       (equivalent to Perl's /s) is set, thus allowing the dot to match
       newlines, the pattern is implicitly anchored, because whatever follows
       will be tried against every character position in the subject string,
       so there is no point in retrying the overall match at any position
       after the first. PCRE normally treats such a pattern as though it were
       preceded by \A.

       In cases where it is known that the subject string contains no
       newlines, it is worth setting PCRE_DOTALL in order to obtain this
       optimization, or alternatively using ^ to indicate anchoring
       explicitly.

       However, there are some cases where the optimization cannot be used.
       When .* is inside capturing parentheses that are the subject of a back
       reference elsewhere in the pattern, a match at the start may fail where
       a later one succeeds. Consider, for example:

         (.*)abc\1

       If the subject is "xyz123abc123" the match point is the fourth
       character. For this reason, such a pattern is not implicitly anchored.

       Another case where implicit anchoring is not applied is when the
       leading .* is inside an atomic group. Once again, a match at the start
       may fail where a later one succeeds. Consider this pattern:

         (?>.*?a)b

       It matches "ab" in the subject "aab". The use of the backtracking
       control verbs (*PRUNE) and (*SKIP) also disables this optimization.

       When a capturing subpattern is repeated, the value captured is the
       substring that matched the final iteration. For example, after

         (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of the captured substring
       is "tweedledee". However, if there are nested capturing subpatterns,
       the corresponding captured values may have been set in previous
       iterations. For example, after

         /(a|(b))+/

       matches "aba" the value of the second captured substring is "b".


ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

       With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the repeated item
       to be re-evaluated to see if a different number of repeats allows the
       rest of the pattern to match. Sometimes it is useful to prevent this,
       either to change the nature of the match, or to cause it to fail
       earlier than it otherwise might, when the author of the pattern knows
       there is no point in carrying on.
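
       The re-evaluation described above, which atomic grouping is designed
       to prevent, can be made visible by capturing what each repeat ends up
       with. This is a sketch only: Python's re module is used as a stand-in
       for PCRE, and the capturing parentheses are added purely so that the
       result of the backtracking can be inspected.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
subject = "123456"

# Greedy: \d+ first takes all six digits, the following \d then fails,
# so \d+ is re-evaluated and gives one digit back.
greedy = re.fullmatch(r"(\d+)(\d)", subject).groups()

# Lazy: \d+? starts with a single digit and takes more only if the rest
# of the pattern cannot otherwise match; here one digit is enough.
lazy = re.fullmatch(r"(\d+?)(\d+)", subject).groups()
```

       In both cases the number of digits taken by the first repeat is
       adjusted so that the rest of the pattern can match.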

       Consider, for example, the pattern \d+foo when applied to the subject
       line

         123456bar

       After matching all 6 digits and then failing to match "foo", the normal
       action of the matcher is to try again with only 5 digits matching the
       \d+ item, and then with 4, and so on, before ultimately failing.
       "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
       the means for specifying that once a subpattern has matched, it is not
       to be re-evaluated in this way.

       If we use atomic grouping for the previous example, the matcher gives
       up immediately on failing to match "foo" the first time. The notation
       is a kind of special parenthesis, starting with (?> as in this example:

         (?>\d+)foo

       This kind of parenthesis "locks up" the part of the pattern it
       contains once it has matched, and a failure further into the pattern is
       prevented from backtracking into it. Backtracking past it to previous
       items, however, works as normal.

       An alternative description is that a subpattern of this type matches
       the string of characters that an identical standalone pattern would
       match, if anchored at the current point in the subject string.

       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
       such as the above example can be thought of as a maximizing repeat that
       must swallow everything it can. So, while both \d+ and \d+? are
       prepared to adjust the number of digits they match in order to make the
       rest of the pattern match, (?>\d+) can only match an entire sequence of
       digits.

       Atomic groups in general can of course contain arbitrarily complicated
       subpatterns, and can be nested. However, when the subpattern for an
       atomic group is just a single repeated item, as in the example above, a
       simpler notation, called a "possessive quantifier" can be used.
       This consists of an additional + character following a quantifier.
       Using this notation, the previous example can be rewritten as

         \d++foo

       Note that a possessive quantifier can be used with an entire group, for
       example:

         (abc|xyz){2,3}+

       Possessive quantifiers are always greedy; the setting of the
       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
       simpler forms of atomic group. However, there is no difference in the
       meaning of a possessive quantifier and the equivalent atomic group,
       though there may be a performance difference; possessive quantifiers
       should be slightly faster.

       The possessive quantifier syntax is an extension to the Perl 5.8
       syntax. Jeffrey Friedl originated the idea (and the name) in the first
       edition of his book. Mike McCloskey liked it, so implemented it when he
       built Sun's Java package, and PCRE copied it from there. It ultimately
       found its way into Perl at release 5.10.

       PCRE has an optimization that automatically "possessifies" certain
       simple pattern constructs. For example, the sequence A+B is treated as
       A++B because there is no point in backtracking into a sequence of A's
       when B must follow.

       When a pattern contains an unlimited repeat inside a subpattern that
       can itself be repeated an unlimited number of times, the use of an
       atomic group is the only way to avoid some failing matches taking a
       very long time indeed. The pattern

         (\D+|<\d+>)*[!?]

       matches an unlimited number of substrings that either consist of
       non-digits, or digits enclosed in <>, followed by either ! or ?. When
       it matches, it runs quickly. However, if it is applied to

         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it takes a long time before reporting failure.
       This is because the
       string can be divided between the internal \D+ repeat and the external
       * repeat in a large number of ways, and all have to be tried. (The
       example uses [!?] rather than a single character at the end, because
       both PCRE and Perl have an optimization that allows for fast failure
       when a single character is used. They remember the last single
       character that is required for a match, and fail early if it is not
       present in the string.) If the pattern is changed so that it uses an
       atomic group, like this:

         ((?>\D+)|<\d+>)*[!?]

       sequences of non-digits cannot be broken, and failure happens quickly.


BACK REFERENCES

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a back reference to a capturing
       subpattern earlier (that is, to its left) in the pattern, provided
       there have been that many previous capturing left parentheses.

       However, if the decimal number following the backslash is less than 10,
       it is always taken as a back reference, and causes an error only if
       there are not that many capturing left parentheses in the entire
       pattern. In other words, the parentheses that are referenced need not
       be to the left of the reference for numbers less than 10. A "forward
       back reference" of this type can make sense when a repetition is
       involved and the subpattern to the right has participated in an earlier
       iteration.

       It is not possible to have a numerical "forward back reference" to a
       subpattern whose number is 10 or more using this syntax because a
       sequence such as \50 is interpreted as a character defined in octal.
       See the subsection entitled "Non-printing characters" above for further
       details of the handling of digits following a backslash. There is no
       such problem when named parentheses are used.
       A back reference to any
       subpattern is possible using named parentheses (see below).

       Another way of avoiding the ambiguity inherent in the use of digits
       following a backslash is to use the \g escape sequence. This escape
       must be followed by an unsigned number or a negative number, optionally
       enclosed in braces. These examples are all identical:

         (ring), \1
         (ring), \g1
         (ring), \g{1}

       An unsigned number specifies an absolute reference without the
       ambiguity that is present in the older syntax. It is also useful when
       literal digits follow the reference. A negative number is a relative
       reference. Consider this example:

         (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a reference to the most recently started
       capturing subpattern before \g, that is, it is equivalent to \2 in this
       example. Similarly, \g{-2} would be equivalent to \1. The use of
       relative references can be helpful in long patterns, and also in
       patterns that are created by joining together fragments that contain
       references within themselves.

       A back reference matches whatever actually matched the capturing
       subpattern in the current subject string, rather than anything matching
       the subpattern itself (see "Subpatterns as subroutines" below for a way
       of doing that). So the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility", but
       not "sense and responsibility". If caseful matching is in force at the
       time of the back reference, the case of letters is relevant. For
       example,

         ((?i)rah)\s+\1

       matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
       original capturing subpattern is matched caselessly.

       There are several different ways of writing back references to named
       subpatterns.
       The .NET syntax \k{name} and the Perl syntax \k<name> or
       \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
       unified back reference syntax, in which \g can be used for both numeric
       and named references, is also supported. We could rewrite the above
       example in any of the following ways:

         (?<p1>(?i)rah)\s+\k<p1>
         (?'p1'(?i)rah)\s+\k{p1}
         (?P<p1>(?i)rah)\s+(?P=p1)
         (?<p1>(?i)rah)\s+\g{p1}

       A subpattern that is referenced by name may appear in the pattern
       before or after the reference.

       There may be more than one back reference to the same subpattern. If a
       subpattern has not actually been used in a particular match, any back
       references to it always fail by default. For example, the pattern

         (a|(bc))\2

       always fails if it starts to match "a" rather than "bc". However, if
       the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back
       reference to an unset value matches an empty string.

       Because there may be many capturing parentheses in a pattern, all
       digits following a backslash are taken as part of a potential back
       reference number. If the pattern continues with a digit character, some
       delimiter must be used to terminate the back reference. If the
       PCRE_EXTENDED option is set, this can be white space. Otherwise, the
       \g{ syntax or an empty comment (see "Comments" below) can be used.

   Recursive back references

       A back reference that occurs inside the parentheses to which it refers
       fails when the subpattern is first used, so, for example, (a\1) never
       matches. However, such references can be useful inside repeated
       subpatterns. For example, the pattern

         (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each
       iteration of the subpattern, the back reference matches the character
       string corresponding to the previous iteration.
       In order for this to
       work, the pattern must be such that the first iteration does not need
       to match the back reference. This can be done using alternation, as in
       the example above, or by a quantifier with a minimum of zero.

       Back references of this type cause the group that they reference to be
       treated as an atomic group. Once the whole group has been matched, a
       subsequent matching failure cannot cause backtracking into the middle
       of the group.


ASSERTIONS

       An assertion is a test on the characters following or preceding the
       current matching point that does not actually consume any characters.
       The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
       described above.

       More complicated assertions are coded as subpatterns. There are two
       kinds: those that look ahead of the current position in the subject
       string, and those that look behind it. An assertion subpattern is
       matched in the normal way, except that it does not cause the current
       matching position to be changed.

       Assertion subpatterns are not capturing subpatterns. If such an
       assertion contains capturing subpatterns within it, these are counted
       for the purposes of numbering the capturing subpatterns in the whole
       pattern. However, substring capturing is carried out only for positive
       assertions, because it does not make sense for negative assertions.

       For compatibility with Perl, assertion subpatterns may be repeated;
       though it makes no sense to assert the same thing several times, the
       side effect of capturing parentheses may occasionally be useful. In
       practice, there are only three cases:

       (1) If the quantifier is {0}, the assertion is never obeyed during
       matching. However, it may contain internal capturing parenthesized
       groups that are called from elsewhere via the subroutine mechanism.

       (2) If the quantifier is {0,n} where n is greater than zero, it is
       treated as if it were {0,1}. At run time, the rest of the pattern match
       is tried with and without the assertion, the order depending on the
       greediness of the quantifier.

       (3) If the minimum repetition is greater than zero, the quantifier is
       ignored. The assertion is obeyed just once when encountered during
       matching.

   Lookahead assertions

       Lookahead assertions start with (?= for positive assertions and (?! for
       negative assertions. For example,

         \w+(?=;)

       matches a word followed by a semicolon, but does not include the
       semicolon in the match, and

         foo(?!bar)

       matches any occurrence of "foo" that is not followed by "bar". Note
       that the apparently similar pattern

         (?!foo)bar

       does not find an occurrence of "bar" that is preceded by something
       other than "foo"; it finds any occurrence of "bar" whatsoever, because
       the assertion (?!foo) is always true when the next three characters are
       "bar". A lookbehind assertion is needed to achieve the other effect.

       If you want to force a matching failure at some point in a pattern, the
       most convenient way to do it is with (?!) because an empty string
       always matches, so an assertion that requires there not to be an empty
       string must always fail. The backtracking control verb (*FAIL) or (*F)
       is a synonym for (?!).

   Lookbehind assertions

       Lookbehind assertions start with (?<= for positive assertions and (?<!
       for negative assertions. For example,

         (?<!foo)bar

       does find an occurrence of "bar" that is not preceded by "foo". The
       contents of a lookbehind assertion are restricted such that all the
       strings it matches must have a fixed length.
       However, if there are several
       top-level alternatives, they do not all have to have the same
       fixed length. Thus

         (?<=bullock|donkey)

       is permitted, but

         (?<!dogs?|cats?)

       causes an error at compile time. Branches that match different length
       strings are permitted only at the top level of a lookbehind assertion.
       This is an extension compared with Perl, which requires all branches to
       match the same length of string. An assertion such as

         (?<=ab(c|de))

       is not permitted, because its single top-level branch can match two
       different lengths, but it is acceptable to PCRE if rewritten to use two
       top-level branches:

         (?<=abc|abde)

       In some cases, the escape sequence \K (see above) can be used instead
       of a lookbehind assertion to get round the fixed-length restriction.

       The implementation of lookbehind assertions is, for each alternative,
       to temporarily move the current position back by the fixed length and
       then try to match. If there are insufficient characters before the
       current position, the assertion fails.

       In a UTF mode, PCRE does not allow the \C escape (which matches a
       single data unit even in a UTF mode) to appear in lookbehind
       assertions, because it makes it impossible to calculate the length of
       the lookbehind. The \X and \R escapes, which can match different
       numbers of data units, are also not permitted.

       "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
       lookbehinds, as long as the subpattern matches a fixed-length string.
       Recursion, however, is not supported.

       Possessive quantifiers can be used in conjunction with lookbehind
       assertions to specify efficient matching of fixed-length strings at the
       end of subject strings. Consider a simple pattern such as

         abcd$

       when applied to a long string that does not match.
       Because matching proceeds from left to right, PCRE will look for
       each "a" in the subject and then see if what follows matches the
       rest of the pattern. If the pattern is specified as

         ^.*abcd$

       the initial .* matches the entire string at first, but when this
       fails (because there is no following "a"), it backtracks to match
       all but the last character, then all but the last two characters,
       and so on. Once again the search for "a" covers the entire string,
       from right to left, so we are no better off. However, if the pattern
       is written as

         ^.*+(?<=abcd)

       there can be no backtracking for the .*+ item; it can match only the
       entire string. The subsequent lookbehind assertion does a single
       test on the last four characters. If it fails, the match fails
       immediately. For long strings, this approach makes a significant
       difference to the processing time.

   Using multiple assertions

       Several assertions (of any sort) may occur in succession. For
       example,

         (?<=\d{3})(?<!999)foo

       matches "foo" preceded by three digits that are not "999". Notice
       that each of the assertions is applied independently at the same
       point in the subject string. First there is a check that the
       previous three characters are all digits, and then there is a check
       that the same three characters are not "999". This pattern does not
       match "foo" preceded by six characters, the first three of which are
       digits and the last three of which are not "999". For example, it
       doesn't match "123abcfoo". A pattern to do that is

         (?<=\d{3}...)(?<!999)foo

       This time the first assertion looks at the preceding six characters,
       checking that the first three are digits, and then the second
       assertion checks that the preceding three characters are not "999".

       Assertions can be nested in any combination.
       For example,

         (?<=(?<!foo)bar)baz

       matches an occurrence of "baz" that is preceded by "bar" which in
       turn is not preceded by "foo", while

         (?<=\d{3}(?!999)...)foo

       is another pattern that matches "foo" preceded by three digits and
       any three characters that are not "999".


CONDITIONAL SUBPATTERNS

       It is possible to cause the matching process to obey a subpattern
       conditionally or to choose between two alternative subpatterns,
       depending on the result of an assertion, or whether a specific
       capturing subpattern has already been matched. The two possible
       forms of conditional subpattern are:

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

       If the condition is satisfied, the yes-pattern is used; otherwise
       the no-pattern (if present) is used. If there are more than two
       alternatives in the subpattern, a compile-time error occurs. Each of
       the two alternatives may itself contain nested subpatterns of any
       form, including conditional subpatterns; the restriction to two
       alternatives applies only at the level of the condition. This
       pattern fragment is an example where the alternatives are complex:

         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )


       There are four kinds of condition: references to subpatterns,
       references to recursion, a pseudo-condition called DEFINE, and
       assertions.

   Checking for a used subpattern by number

       If the text between the parentheses consists of a sequence of
       digits, the condition is true if a capturing subpattern of that
       number has previously matched. If there is more than one capturing
       subpattern with the same number (see the earlier section about
       duplicate subpattern numbers), the condition is true if any of them
       have matched. An alternative notation is to precede the digits with
       a plus or minus sign.
       In this case, the subpattern number is relative rather than
       absolute. The most recently opened parentheses can be referenced by
       (?(-1), the next most recent by (?(-2), and so on. Inside loops it
       can also make sense to refer to subsequent groups. The next
       parentheses to be opened can be referenced as (?(+1), and so on.
       (The value zero in any of these forms is not used; it provokes a
       compile-time error.)

       Consider the following pattern, which contains non-significant white
       space to make it more readable (assume the PCRE_EXTENDED option) and
       to divide it into three parts for ease of discussion:

         ( \( )?    [^()]+    (?(1) \) )

       The first part matches an optional opening parenthesis, and if that
       character is present, sets it as the first captured substring. The
       second part matches one or more characters that are not parentheses.
       The third part is a conditional subpattern that tests whether or not
       the first set of parentheses matched. If they did, that is, if the
       subject started with an opening parenthesis, the condition is true,
       and so the yes-pattern is executed and a closing parenthesis is
       required. Otherwise, since no-pattern is not present, the subpattern
       matches nothing. In other words, this pattern matches a sequence of
       non-parentheses, optionally enclosed in parentheses.

       If you were embedding this pattern in a larger one, you could use a
       relative reference:

         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...

       This makes the fragment independent of the parentheses in the larger
       pattern.

   Checking for a used subpattern by name

       Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
       used subpattern by name. For compatibility with earlier versions of
       PCRE, which had this facility before Perl, the syntax (?(name)...)
       is also recognized.
       However, there is a possible ambiguity with this syntax, because
       subpattern names may consist entirely of digits. PCRE looks first
       for a named subpattern; if it cannot find one and the name consists
       entirely of digits, PCRE looks for a subpattern of that number,
       which must be greater than zero. Using subpattern names that consist
       entirely of digits is not recommended.

       Rewriting the above example to use a named subpattern gives this:

         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )

       If the name used in a condition of this kind is a duplicate, the
       test is applied to all subpatterns of the same name, and is true if
       any one of them has matched.

   Checking for pattern recursion

       If the condition is the string (R), and there is no subpattern with
       the name R, the condition is true if a recursive call to the whole
       pattern or any subpattern has been made. If digits or a name
       preceded by ampersand follow the letter R, for example:

         (?(R3)...) or (?(R&name)...)

       the condition is true if the most recent recursion is into a
       subpattern whose number or name is given. This condition does not
       check the entire recursion stack. If the name used in a condition of
       this kind is a duplicate, the test is applied to all subpatterns of
       the same name, and is true if any one of them is the most recent
       recursion.

       At "top level", all these recursion test conditions are false. The
       syntax for recursive patterns is described below.

   Defining subpatterns for use by reference only

       If the condition is the string (DEFINE), and there is no subpattern
       with the name DEFINE, the condition is always false. In this case,
       there may be only one alternative in the subpattern.
       It is always skipped if control reaches this point in the pattern;
       the idea of DEFINE is that it can be used to define subroutines that
       can be referenced from elsewhere. (The use of subroutines is
       described below.) For example, a pattern to match an IPv4 address
       such as "192.168.23.245" could be written like this (ignore white
       space and line breaks):

         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
         \b (?&byte) (\.(?&byte)){3} \b

       The first part of the pattern is a DEFINE group inside which another
       group named "byte" is defined. This matches an individual component
       of an IPv4 address (a number less than 256). When matching takes
       place, this part of the pattern is skipped because DEFINE acts like
       a false condition. The rest of the pattern uses references to the
       named group to match the four dot-separated components of an IPv4
       address, insisting on a word boundary at each end.

   Assertion conditions

       If the condition is not in any of the above formats, it must be an
       assertion. This may be a positive or negative lookahead or
       lookbehind assertion. Consider this pattern, again containing
       non-significant white space, and with the two alternatives on the
       second line:

         (?(?=[^a-z]*[a-z])
         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

       The condition is a positive lookahead assertion that matches an
       optional sequence of non-letters followed by a letter. In other
       words, it tests for the presence of at least one letter in the
       subject. If a letter is found, the subject is matched against the
       first alternative; otherwise it is matched against the second. This
       pattern matches strings in one of the two forms dd-aaa-dd or
       dd-dd-dd, where aaa are letters and dd are digits.


COMMENTS

       There are two ways of including comments in patterns that are
       processed by PCRE.
       In both cases, the start of the comment must not be in a character
       class, nor in the middle of any other sequence of related characters
       such as (?: or a subpattern name or number. The characters that make
       up a comment play no part in the pattern matching.

       The sequence (?# marks the start of a comment that continues up to
       the next closing parenthesis. Nested parentheses are not permitted.
       If the PCRE_EXTENDED option is set, an unescaped # character also
       introduces a comment, which in this case continues to immediately
       after the next newline character or character sequence in the
       pattern. Which characters are interpreted as newlines is controlled
       by the options passed to a compiling function or by a special
       sequence at the start of the pattern, as described in the section
       entitled "Newline conventions" above. Note that the end of this type
       of comment is a literal newline sequence in the pattern; escape
       sequences that happen to represent a newline do not count. For
       example, consider this pattern when PCRE_EXTENDED is set, and the
       default newline convention is in force:

         abc #comment \n still comment

       On encountering the # character, pcre_compile() skips along, looking
       for a newline in the pattern. The sequence \n is still literal at
       this stage, so it does not terminate the comment. Only an actual
       character with the code value 0x0a (the default newline) does so.


RECURSIVE PATTERNS

       Consider the problem of matching a string in parentheses, allowing
       for unlimited nested parentheses. Without the use of recursion, the
       best that can be done is to use a pattern that matches up to some
       fixed depth of nesting. It is not possible to handle an arbitrary
       nesting depth.

       For some time, Perl has provided a facility that allows regular
       expressions to recurse (amongst other things).
       It does this by interpolating Perl code in the expression at run
       time, and the code can refer to the expression itself. A Perl
       pattern using code interpolation to solve the parentheses problem
       can be created like this:

         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       The (?p{...}) item interpolates Perl code at run time, and in this
       case refers recursively to the pattern in which it appears.

       Obviously, PCRE cannot support the interpolation of Perl code.
       Instead, it supports special syntax for recursion of the entire
       pattern, and also for individual subpattern recursion. After its
       introduction in PCRE and Python, this kind of recursion was
       subsequently introduced into Perl at release 5.10.

       A special item that consists of (? followed by a number greater than
       zero and a closing parenthesis is a recursive subroutine call of the
       subpattern of the given number, provided that it occurs inside that
       subpattern. (If not, it is a non-recursive subroutine call, which is
       described in the next section.) The special item (?R) or (?0) is a
       recursive call of the entire regular expression.

       This PCRE pattern solves the nested parentheses problem (assume the
       PCRE_EXTENDED option is set so that white space is ignored):

         \( ( [^()]++ | (?R) )* \)

       First it matches an opening parenthesis. Then it matches any number
       of substrings which can either be a sequence of non-parentheses, or
       a recursive match of the pattern itself (that is, a correctly
       parenthesized substring). Finally there is a closing parenthesis.
       Note the use of a possessive quantifier to avoid backtracking into
       sequences of non-parentheses.
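
       The structure of the nested parentheses pattern can be illustrated
       outside PCRE. What follows is a minimal Python sketch, not PCRE
       code: the function name match_parens is hypothetical, and the
       recursion is written by hand because Python's standard re module
       has no (?R) item. It mirrors the three parts of the pattern: an
       opening parenthesis, any number of non-parenthesis characters or
       recursive matches, and a closing parenthesis.

```python
def match_parens(s, pos=0):
    # Hand-written analogue of  \( ( [^()]++ | (?R) )* \)  anchored at pos.
    # Returns the index just past the matching ")" or None if no match.
    if pos >= len(s) or s[pos] != "(":
        return None                    # \( must match first
    i = pos + 1
    while i < len(s):
        if s[i] == ")":
            return i + 1               # closing \) found
        if s[i] == "(":
            end = match_parens(s, i)   # plays the role of (?R)
            if end is None:
                return None            # nested group never closed
            i = end
        else:
            i += 1                     # part of a [^()] run
    return None                        # ran out of subject before \)


print(match_parens("(ab(cd)ef)"))      # 10
print(match_parens("(a(b)"))           # None
```

       Because every character is consumed exactly once, this sketch has
       no backtracking at all, which is the behaviour the possessive
       quantifier is intended to approximate in the real pattern.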

       If this were part of a larger pattern, you would not want to recurse
       the entire pattern, so instead you could use this:

         ( \( ( [^()]++ | (?1) )* \) )

       We have put the pattern into parentheses, and caused the recursion
       to refer to them instead of the whole pattern.

       In a larger pattern, keeping track of parenthesis numbers can be
       tricky. This is made easier by the use of relative references.
       Instead of (?1) in the pattern above you can write (?-2) to refer to
       the second most recently opened parentheses preceding the recursion.
       In other words, a negative number counts capturing parentheses
       leftwards from the point at which it is encountered.

       It is also possible to refer to subsequently opened parentheses, by
       writing references such as (?+2). However, these cannot be recursive
       because the reference is not inside the parentheses that are
       referenced. They are always non-recursive subroutine calls, as
       described in the next section.

       An alternative approach is to use named parentheses instead. The
       Perl syntax for this is (?&name); PCRE's earlier syntax (?P>name) is
       also supported. We could rewrite the above example as follows:

         (?<pn> \( ( [^()]++ | (?&pn) )* \) )

       If there is more than one subpattern with the same name, the
       earliest one is used.

       This particular example pattern that we have been looking at
       contains nested unlimited repeats, and so the use of a possessive
       quantifier for matching strings of non-parentheses is important when
       applying the pattern to strings that do not match. For example, when
       this pattern is applied to

         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

       it yields "no match" quickly.
       However, if a possessive quantifier is not used, the match runs for
       a very long time indeed because there are so many different ways the
       + and * repeats can carve up the subject, and all have to be tested
       before failure can be reported.

       At the end of a match, the values of capturing parentheses are those
       from the outermost level. If you want to obtain intermediate values,
       a callout function can be used (see below and the pcrecallout
       documentation). If the pattern above is matched against

         (ab(cd)ef)

       the value for the inner capturing parentheses (numbered 2) is "ef",
       which is the last value taken on at the top level. If a capturing
       subpattern is not matched at the top level, its final captured value
       is unset, even if it was (temporarily) set at a deeper level during
       the matching process.

       If there are more than 15 capturing parentheses in a pattern, PCRE
       has to obtain extra memory to store data during a recursion, which
       it does by using pcre_malloc, freeing it via pcre_free afterwards.
       If no memory can be obtained, the match fails with the
       PCRE_ERROR_NOMEMORY error.

       Do not confuse the (?R) item with the condition (R), which tests for
       recursion. Consider this pattern, which matches text in angle
       brackets, allowing for arbitrary nesting. Only digits are allowed in
       nested brackets (that is, when recursing), whereas any characters
       are permitted at the outer level.

         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >

       In this pattern, (?(R) is the start of a conditional subpattern,
       with two different alternatives for the recursive and non-recursive
       cases. The (?R) item is the actual recursive call.

   Differences in recursion processing between PCRE and Perl

       Recursion processing in PCRE differs from Perl in two important
       ways.
       In PCRE (like Python, but unlike Perl), a recursive subpattern call
       is always treated as an atomic group. That is, once it has matched
       some of the subject string, it is never re-entered, even if it
       contains untried alternatives and there is a subsequent matching
       failure. This can be illustrated by the following pattern, which
       purports to match a palindromic string that contains an odd number
       of characters (for example, "a", "aba", "abcba", "abcdcba"):

         ^(.|(.)(?1)\2)$

       The idea is that it either matches a single character, or two
       identical characters surrounding a sub-palindrome. In Perl, this
       pattern works; in PCRE it does not if the subject string is longer
       than three characters. Consider the subject string "abcba":

       At the top level, the first character is matched, but as it is not
       at the end of the string, the first alternative fails; the second
       alternative is taken and the recursion kicks in. The recursive call
       to subpattern 1 successfully matches the next character ("b"). (Note
       that the beginning and end of line tests are not part of the
       recursion.)

       Back at the top level, the next character ("c") is compared with
       what subpattern 2 matched, which was "a". This fails. Because the
       recursion is treated as an atomic group, there are now no
       backtracking points, and so the entire match fails. (Perl is able,
       at this point, to re-enter the recursion and try the second
       alternative.) However, if the pattern is written with the
       alternatives in the other order, things are different:

         ^((.)(?1)\2|.)$

       This time, the recursing alternative is tried first, and continues
       to recurse until it runs out of characters, at which point the
       recursion fails. But this time we do have another alternative to try
       at the higher level.
       That is the big difference: in the previous case the remaining
       alternative is at a deeper recursion level, which PCRE cannot use.

       To change the pattern so that it matches all palindromic strings,
       not just those with an odd number of characters, it is tempting to
       change the pattern to this:

         ^((.)(?1)\2|.?)$

       Again, this works in Perl, but not in PCRE, and for the same reason.
       When a deeper recursion has matched a single character, it cannot be
       entered again in order to match an empty string. The solution is to
       separate the two cases, and write out the odd and even cases as
       alternatives at the higher level:

         ^(?:((.)(?1)\2|)|((.)(?3)\4|.))

       If you want to match typical palindromic phrases, the pattern has to
       ignore all non-word characters, which can be done like this:

         ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$

       If run with the PCRE_CASELESS option, this pattern matches phrases
       such as "A man, a plan, a canal: Panama!" and it works well in both
       PCRE and Perl. Note the use of the possessive quantifier *+ to avoid
       backtracking into sequences of non-word characters. Without this,
       PCRE takes a great deal longer (ten times or more) to match typical
       phrases, and Perl takes so long that you think it has gone into a
       loop.

       WARNING: The palindrome-matching patterns above work only if the
       subject string does not start with a palindrome that is shorter than
       the entire string. For example, although "abcba" is correctly
       matched, if the subject is "ababa", PCRE finds the palindrome "aba"
       at the start, then fails at top level because the end of the string
       does not follow. Once again, it cannot jump back into the recursion
       to try other alternatives, so the entire match fails.
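
       The effect of atomic recursion on the pattern ^((.)(?1)\2|.)$ can
       be reproduced with a small hand-written matcher. This is a Python
       sketch under stated assumptions, not PCRE or Perl code: the names
       group_ends and matches are hypothetical, and the atomic flag models
       only the one difference described above, namely whether a recursive
       call keeps just its first result or may be re-entered for further
       results.

```python
from itertools import islice


def group_ends(s, p, atomic):
    # Yield, in the order a backtracking matcher would try them, the end
    # positions at which group 1 of ^((.)(?1)\2|.)$ can match from index p.
    n = len(s)
    if p >= n:
        return                            # both alternatives need a character
    inner = group_ends(s, p + 1, atomic)  # the (?1) call that follows (.)
    if atomic:
        # Atomic call: only the first successful result is ever used.
        inner = islice(inner, 1)
    for e in inner:
        # \2 must repeat the character captured by (.) at index p.
        if e < n and s[e] == s[p]:
            yield e + 1
    # Second alternative: a single character.
    yield p + 1


def matches(s, atomic=True):
    # The top-level group is not a call, so each of its results may be
    # tried against the final $ (end of subject).
    return any(e == len(s) for e in group_ends(s, 0, atomic))


print(matches("abcba"))                # True
print(matches("ababa"))                # False
print(matches("ababa", atomic=False))  # True
```

       With atomic=True the sketch rejects "ababa", as the warning
       describes; with atomic=False the call can be re-entered, which
       approximates the behaviour the text attributes to Perl.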

       The second way in which PCRE and Perl differ in their recursion
       processing is in the handling of captured values. In Perl, when a
       subpattern is called recursively or as a subroutine (see the next
       section), it has no access to any values that were captured outside
       the recursion, whereas in PCRE these values can be referenced.
       Consider this pattern:

         ^(.)(\1|a(?2))

       In PCRE, this pattern matches "bab". The first capturing parentheses
       match "b", then in the second group, when the back reference \1
       fails to match "b", the second alternative matches "a" and then
       recurses. In the recursion, \1 does now match "b" and so the whole
       match succeeds. In Perl, the pattern fails to match because inside
       the recursive call \1 cannot access the externally set value.


SUBPATTERNS AS SUBROUTINES

       If the syntax for a recursive subpattern call (either by number or
       by name) is used outside the parentheses to which it refers, it
       operates like a subroutine in a programming language. The called
       subpattern may be defined before or after the reference. A numbered
       reference can be absolute or relative, as in these examples:

         (...(absolute)...)...(?2)...
         (...(relative)...)...(?-1)...
         (...(?+1)...(relative)...

       An earlier example pointed out that the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility",
       but not "sense and responsibility". If instead the pattern

         (sens|respons)e and (?1)ibility

       is used, it does match "sense and responsibility" as well as the
       other two strings. Another example is given in the discussion of
       DEFINE above.

       All subroutine calls, whether recursive or not, are always treated
       as atomic groups.
       That is, once a subroutine has matched some of the subject string,
       it is never re-entered, even if it contains untried alternatives and
       there is a subsequent matching failure. Any capturing parentheses
       that are set during the subroutine call revert to their previous
       values afterwards.

       Processing options such as case-independence are fixed when a
       subpattern is defined, so if it is used as a subroutine, such
       options cannot be changed for different calls. For example, consider
       this pattern:

         (abc)(?i:(?-1))

       It matches "abcabc". It does not match "abcABC" because the change
       of processing option does not affect the called subpattern.


ONIGURUMA SUBROUTINE SYNTAX

       For compatibility with Oniguruma, the non-Perl syntax \g followed by
       a name or a number enclosed either in angle brackets or single
       quotes, is an alternative syntax for referencing a subpattern as a
       subroutine, possibly recursively. Here are two of the examples used
       above, rewritten using this syntax:

         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
         (sens|respons)e and \g'1'ibility

       PCRE supports an extension to Oniguruma: if a number is preceded by
       a plus or a minus sign it is taken as a relative reference. For
       example:

         (abc)(?i:\g<-1>)

       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are
       not synonymous. The former is a back reference; the latter is a
       subroutine call.


CALLOUTS

       Perl has a feature whereby using the sequence (?{...}) causes
       arbitrary Perl code to be obeyed in the middle of matching a regular
       expression. This makes it possible, amongst other things, to extract
       different substrings that match the same pair of parentheses when
       there is a repetition.

       PCRE provides a similar feature, but of course it cannot obey
       arbitrary Perl code. The feature is called "callout".
       The caller of PCRE provides an external function by putting its
       entry point in the global variable pcre_callout (8-bit library) or
       pcre[16|32]_callout (16-bit or 32-bit library). By default, this
       variable contains NULL, which disables all calling out.

       Within a regular expression, (?C) indicates the points at which the
       external function is to be called. If you want to identify different
       callout points, you can put a number less than 256 after the letter
       C. The default value is zero. For example, this pattern has two
       callout points:

         (?C1)abc(?C2)def

       If the PCRE_AUTO_CALLOUT flag is passed to a compiling function,
       callouts are automatically installed before each item in the
       pattern. They are all numbered 255.

       During matching, when PCRE reaches a callout point, the external
       function is called. It is provided with the number of the callout,
       the position in the pattern, and, optionally, one item of data
       originally supplied by the caller of the matching function. The
       callout function may cause matching to proceed, to backtrack, or to
       fail altogether. A complete description of the interface to the
       callout function is given in the pcrecallout documentation.


BACKTRACKING CONTROL

       Perl 5.10 introduced a number of "Special Backtracking Control
       Verbs", which are described in the Perl documentation as
       "experimental and subject to change or removal in a future version
       of Perl". It goes on to say: "Their usage in production code should
       be noted to avoid problems during upgrades." The same remarks apply
       to the PCRE features described in this section.

       Since these verbs are specifically related to backtracking, most of
       them can be used only when the pattern is to be matched using one of
       the traditional matching functions, which use a backtracking
       algorithm.
       With the exception of (*FAIL), which behaves like a failing negative
       assertion, they cause an error if encountered by a DFA matching
       function.

       If any of these verbs are used in an assertion or in a subpattern
       that is called as a subroutine (whether or not recursively), their
       effect is confined to that subpattern; it does not extend to the
       surrounding pattern, with one exception: the name from a (*MARK),
       (*PRUNE), or (*THEN) that is encountered in a successful positive
       assertion is passed back when a match succeeds (compare capturing
       parentheses in assertions). Note that such subpatterns are processed
       as anchored at the point where they are tested. Note also that
       Perl's treatment of subroutines and assertions is different in some
       cases.

       The new verbs make use of what was previously invalid syntax: an
       opening parenthesis followed by an asterisk. They are generally of
       the form (*VERB) or (*VERB:NAME). Some may take either form, with
       differing behaviour, depending on whether or not an argument is
       present. A name is any sequence of characters that does not include
       a closing parenthesis. The maximum length of a name is 255 in the
       8-bit library and 65535 in the 16-bit and 32-bit libraries. If the
       name is empty, that is, if the closing parenthesis immediately
       follows the colon, the effect is as if the colon were not there. Any
       number of these verbs may occur in a pattern.

   Optimizations that affect backtracking verbs

       PCRE contains some optimizations that are used to speed up matching
       by running some checks at the start of each match attempt. For
       example, it may know the minimum length of matching subject, or that
       a particular character must be present. When one of these
       optimizations suppresses the running of a match, any included
       backtracking verbs will not, of course, be processed.
       You can suppress the start-of-match optimizations by setting the
       PCRE_NO_START_OPTIMIZE option when calling pcre_compile() or
       pcre_exec(), or by starting the pattern with (*NO_START_OPT). There
       is more discussion of this option in the section entitled "Option
       bits for pcre_exec()" in the pcreapi documentation.

       Experiments with Perl suggest that it too has similar optimizations,
       sometimes leading to anomalous results.

   Verbs that act immediately

       The following verbs act as soon as they are encountered. They may
       not be followed by a name.

          (*ACCEPT)

       This verb causes the match to end successfully, skipping the
       remainder of the pattern. However, when it is inside a subpattern
       that is called as a subroutine, only that subpattern is ended
       successfully. Matching then continues at the outer level. If
       (*ACCEPT) is inside capturing parentheses, the data so far is
       captured. For example:

         A((?:A|B(*ACCEPT)|C)D)

       This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is
       captured by the outer parentheses.

         (*FAIL) or (*F)

       This verb causes a matching failure, forcing backtracking to occur.
       It is equivalent to (?!) but easier to read. The Perl documentation
       notes that it is probably useful only when combined with (?{}) or
       (??{}). Those are, of course, Perl features that are not present in
       PCRE. The nearest equivalent is the callout feature, as for example
       in this pattern:

         a+(?C)(*FAIL)

       A match with the string "aaaa" always fails, but the callout is
       taken before each backtrack happens (in this example, 10 times).

   Recording which path was taken

       There is one verb whose main purpose is to track how a match was
       arrived at, though it also has a secondary use in conjunction with
       advancing the match starting point (see (*SKIP) below).

         (*MARK:NAME) or (*:NAME)

       A name is always required with this verb. There may be as many
       instances of (*MARK) as you like in a pattern, and their names do
       not have to be unique.

       When a match succeeds, the name of the last-encountered (*MARK) on
       the matching path is passed back to the caller as described in the
       section entitled "Extra data for pcre_exec()" in the pcreapi
       documentation. Here is an example of pcretest output, where the /K
       modifier requests the retrieval and outputting of (*MARK) data:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
         data> XY
          0: XY
         MK: A
         XZ
          0: XZ
         MK: B

       The (*MARK) name is tagged with "MK:" in this output, and in this
       example it indicates which of the two alternatives matched. This is
       a more efficient way of obtaining this information than putting each
       alternative in its own capturing parentheses.

       If (*MARK) is encountered in a positive assertion, its name is
       recorded and passed back if it is the last-encountered. This does
       not happen for negative assertions.

       After a partial match or a failed match, the name of the last
       encountered (*MARK) in the entire match process is returned. For
       example:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
         data> XP
         No match, mark = B

       Note that in this unanchored example the mark is retained from the
       match attempt that started at the letter "X" in the subject.
       Subsequent match attempts starting at "P" and then with an empty
       string do not get as far as the (*MARK) item, but nevertheless do
       not reset it.

       If you are interested in (*MARK) values after failed matches, you
       should probably set the PCRE_NO_START_OPTIMIZE option (see above) to
       ensure that the match is always attempted.

   Verbs that act after backtracking

       The following verbs do nothing when they are encountered.
       Matching continues with what follows, but if there is no subsequent
       match, causing a backtrack to the verb, a failure is forced. That is,
       backtracking cannot pass to the left of the verb. However, when one of
       these verbs appears inside an atomic group, its effect is confined to
       that group, because once the group has been matched, there is never
       any backtracking into it. In this situation, backtracking can "jump
       back" to the left of the entire atomic group. (Remember also, as
       stated above, that this localization also applies in subroutine calls
       and assertions.)

       These verbs differ in exactly what kind of failure occurs when
       backtracking reaches them.

          (*COMMIT)

       This verb, which may not be followed by a name, causes the whole match
       to fail outright if the rest of the pattern does not match. Even if
       the pattern is unanchored, no further attempts to find a match by
       advancing the starting point take place. Once (*COMMIT) has been
       passed, pcre_exec() is committed to finding a match at the current
       starting point, or not at all. For example:

         a+(*COMMIT)b

       This matches "xxaab" but not "aacaab". It can be thought of as a kind
       of dynamic anchor, or "I've started, so I must finish." The name of
       the most recently passed (*MARK) in the path is passed back when
       (*COMMIT) forces a match failure.

       Note that (*COMMIT) at the start of a pattern is not the same as an
       anchor, unless PCRE's start-of-match optimizations are turned off, as
       shown in this pcretest example:

           re> /(*COMMIT)abc/
         data> xyzabc
          0: abc
         xyzabc\Y
         No match

       PCRE knows that any match must start with "a", so the optimization
       skips along the subject to "a" before running the first match attempt,
       which succeeds.
       When the optimization is disabled by the \Y escape in the second
       subject, the match starts at "x" and so the (*COMMIT) causes it to
       fail without trying any other starting points.

          (*PRUNE) or (*PRUNE:NAME)

       This verb causes the match to fail at the current starting position in
       the subject if the rest of the pattern does not match. If the pattern
       is unanchored, the normal "bumpalong" advance to the next starting
       character then happens. Backtracking can occur as usual to the left of
       (*PRUNE), before it is reached, or when matching to the right of
       (*PRUNE), but if there is no match to the right, backtracking cannot
       cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an
       alternative to an atomic group or possessive quantifier, but there are
       some uses of (*PRUNE) that cannot be expressed in any other way. The
       behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
       anchored pattern (*PRUNE) has the same effect as (*COMMIT).

          (*SKIP)

       This verb, when given without a name, is like (*PRUNE), except that if
       the pattern is unanchored, the "bumpalong" advance is not to the next
       character, but to the position in the subject where (*SKIP) was
       encountered. (*SKIP) signifies that whatever text was matched leading
       up to it cannot be part of a successful match. Consider:

         a+(*SKIP)b

       If the subject is "aaaac...", after the first match attempt fails
       (starting at the first character in the string), the starting point
       skips on to start the next attempt at "c". Note that a possessive
       quantifier does not have the same effect as this example; although it
       would suppress backtracking during the first match attempt, the second
       attempt would start at the second character instead of skipping on to
       "c".

          (*SKIP:NAME)

       When (*SKIP) has an associated name, its behaviour is modified.
       If the following pattern fails to match, the previous path through the
       pattern is searched for the most recent (*MARK) that has the same
       name. If one is found, the "bumpalong" advance is to the subject
       position that corresponds to that (*MARK) instead of to where (*SKIP)
       was encountered. If no (*MARK) with a matching name is found, the
       (*SKIP) is ignored.

          (*THEN) or (*THEN:NAME)

       This verb causes a skip to the next innermost alternative if the rest
       of the pattern does not match. That is, it cancels pending
       backtracking, but only within the current alternative. Its name comes
       from the observation that it can be used for a pattern-based
       if-then-else block:

         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If the COND1 pattern matches, FOO is tried (and possibly further items
       after the end of the group if FOO succeeds); on failure, the matcher
       skips to the second alternative and tries COND2, without backtracking
       into COND1. The behaviour of (*THEN:NAME) is exactly the same as
       (*MARK:NAME)(*THEN). If (*THEN) is not inside an alternation, it acts
       like (*PRUNE).

       Note that a subpattern that does not contain a | character is just a
       part of the enclosing alternative; it is not a nested alternation with
       only one alternative. The effect of (*THEN) extends beyond such a
       subpattern to the enclosing alternative. Consider this pattern, where
       A, B, etc. are complex pattern fragments that do not contain any |
       characters at this level:

         A (B(*THEN)C) | D

       If A and B are matched, but there is a failure in C, matching does not
       backtrack into A; instead it moves to the next alternative, that is, D.
       However, if the subpattern containing (*THEN) is given an alternative,
       it behaves differently:

         A (B(*THEN)C | (*FAIL)) | D

       The effect of (*THEN) is now confined to the inner subpattern. After a
       failure in C, matching moves to (*FAIL), which causes the whole
       subpattern to fail because there are no more alternatives to try. In
       this case, matching does now backtrack into A.

       Note also that a conditional subpattern is not considered as having
       two alternatives, because only one is ever used. In other words, the |
       character in a conditional subpattern has a different meaning.
       Ignoring white space, consider:

         ^.*? (?(?=a) a | b(*THEN)c )

       If the subject is "ba", this pattern does not match. Because .*? is
       ungreedy, it initially matches zero characters. The condition (?=a)
       then fails, the character "b" is matched, but "c" is not. At this
       point, matching does not backtrack to .*? as might perhaps be expected
       from the presence of the | character. The conditional subpattern is
       part of the single alternative that comprises the whole pattern, and
       so the match fails. (If there was a backtrack into .*?, allowing it to
       match "b", the match would succeed.)

       The verbs just described provide four different "strengths" of control
       when subsequent matching fails. (*THEN) is the weakest, carrying on
       the match at the next alternative. (*PRUNE) comes next, failing the
       match at the current starting position, but allowing an advance to the
       next character (for an unanchored pattern). (*SKIP) is similar, except
       that the advance may be more than one character. (*COMMIT) is the
       strongest, causing the entire match to fail.

       If more than one such verb is present in a pattern, the "strongest"
       one wins. For example, consider this pattern, where A, B, etc.
are complex 7156 pattern fragments: 7157 7158 (A(*COMMIT)B(*THEN)C|D) 7159 7160 Once A has matched, PCRE is committed to this match, at the current 7161 starting position. If subsequently B matches, but C does not, the nor- 7162 mal (*THEN) action of trying the next alternative (that is, D) does not 7163 happen because (*COMMIT) overrides. 7164 7165 7166SEE ALSO 7167 7168 pcreapi(3), pcrecallout(3), pcrematching(3), pcresyntax(3), pcre(3), 7169 pcre16(3), pcre32(3). 7170 7171 7172AUTHOR 7173 7174 Philip Hazel 7175 University Computing Service 7176 Cambridge CB2 3QH, England. 7177 7178 7179REVISION 7180 7181 Last updated: 11 November 2012 7182 Copyright (c) 1997-2012 University of Cambridge. 7183------------------------------------------------------------------------------ 7184 7185 7186PCRESYNTAX(3) PCRESYNTAX(3) 7187 7188 7189NAME 7190 PCRE - Perl-compatible regular expressions 7191 7192 7193PCRE REGULAR EXPRESSION SYNTAX SUMMARY 7194 7195 The full syntax and semantics of the regular expressions that are sup- 7196 ported by PCRE are described in the pcrepattern documentation. This 7197 document contains a quick-reference summary of the syntax. 7198 7199 7200QUOTING 7201 7202 \x where x is non-alphanumeric is a literal x 7203 \Q...\E treat enclosed characters as literal 7204 7205 7206CHARACTERS 7207 7208 \a alarm, that is, the BEL character (hex 07) 7209 \cx "control-x", where x is any ASCII character 7210 \e escape (hex 1B) 7211 \f form feed (hex 0C) 7212 \n newline (hex 0A) 7213 \r carriage return (hex 0D) 7214 \t tab (hex 09) 7215 \ddd character with octal code ddd, or backreference 7216 \xhh character with hex code hh 7217 \x{hhh..} character with hex code hhh.. 7218 7219 7220CHARACTER TYPES 7221 7222 . 
any character except newline; 7223 in dotall mode, any character whatsoever 7224 \C one data unit, even in UTF mode (best avoided) 7225 \d a decimal digit 7226 \D a character that is not a decimal digit 7227 \h a horizontal white space character 7228 \H a character that is not a horizontal white space character 7229 \N a character that is not a newline 7230 \p{xx} a character with the xx property 7231 \P{xx} a character without the xx property 7232 \R a newline sequence 7233 \s a white space character 7234 \S a character that is not a white space character 7235 \v a vertical white space character 7236 \V a character that is not a vertical white space character 7237 \w a "word" character 7238 \W a "non-word" character 7239 \X a Unicode extended grapheme cluster 7240 7241 In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII 7242 characters, even in a UTF mode. However, this can be changed by setting 7243 the PCRE_UCP option. 7244 7245 7246GENERAL CATEGORY PROPERTIES FOR \p and \P 7247 7248 C Other 7249 Cc Control 7250 Cf Format 7251 Cn Unassigned 7252 Co Private use 7253 Cs Surrogate 7254 7255 L Letter 7256 Ll Lower case letter 7257 Lm Modifier letter 7258 Lo Other letter 7259 Lt Title case letter 7260 Lu Upper case letter 7261 L& Ll, Lu, or Lt 7262 7263 M Mark 7264 Mc Spacing mark 7265 Me Enclosing mark 7266 Mn Non-spacing mark 7267 7268 N Number 7269 Nd Decimal number 7270 Nl Letter number 7271 No Other number 7272 7273 P Punctuation 7274 Pc Connector punctuation 7275 Pd Dash punctuation 7276 Pe Close punctuation 7277 Pf Final punctuation 7278 Pi Initial punctuation 7279 Po Other punctuation 7280 Ps Open punctuation 7281 7282 S Symbol 7283 Sc Currency symbol 7284 Sk Modifier symbol 7285 Sm Mathematical symbol 7286 So Other symbol 7287 7288 Z Separator 7289 Zl Line separator 7290 Zp Paragraph separator 7291 Zs Space separator 7292 7293 7294PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P 7295 7296 Xan Alphanumeric: union of properties L and N 7297 Xps 
POSIX space: property Z or tab, NL, VT, FF, CR 7298 Xsp Perl space: property Z or tab, NL, FF, CR 7299 Xwd Perl word: property Xan or underscore 7300 7301 7302SCRIPT NAMES FOR \p AND \P 7303 7304 Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo, 7305 Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma, 7306 Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, 7307 Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, 7308 Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- 7309 gana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip- 7310 tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, 7311 Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, 7312 Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive, 7313 Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko, 7314 Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, 7315 Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari- 7316 tan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese, 7317 Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, 7318 Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, 7319 Yi. 7320 7321 7322CHARACTER CLASSES 7323 7324 [...] positive character class 7325 [^...] 
negative character class 7326 [x-y] range (can be used for hex characters) 7327 [[:xxx:]] positive POSIX named set 7328 [[:^xxx:]] negative POSIX named set 7329 7330 alnum alphanumeric 7331 alpha alphabetic 7332 ascii 0-127 7333 blank space or tab 7334 cntrl control character 7335 digit decimal digit 7336 graph printing, excluding space 7337 lower lower case letter 7338 print printing, including space 7339 punct printing, excluding alphanumeric 7340 space white space 7341 upper upper case letter 7342 word same as \w 7343 xdigit hexadecimal digit 7344 7345 In PCRE, POSIX character set names recognize only ASCII characters by 7346 default, but some of them use Unicode properties if PCRE_UCP is set. 7347 You can use \Q...\E inside a character class. 7348 7349 7350QUANTIFIERS 7351 7352 ? 0 or 1, greedy 7353 ?+ 0 or 1, possessive 7354 ?? 0 or 1, lazy 7355 * 0 or more, greedy 7356 *+ 0 or more, possessive 7357 *? 0 or more, lazy 7358 + 1 or more, greedy 7359 ++ 1 or more, possessive 7360 +? 1 or more, lazy 7361 {n} exactly n 7362 {n,m} at least n, no more than m, greedy 7363 {n,m}+ at least n, no more than m, possessive 7364 {n,m}? at least n, no more than m, lazy 7365 {n,} n or more, greedy 7366 {n,}+ n or more, possessive 7367 {n,}? n or more, lazy 7368 7369 7370ANCHORS AND SIMPLE ASSERTIONS 7371 7372 \b word boundary 7373 \B not a word boundary 7374 ^ start of subject 7375 also after internal newline in multiline mode 7376 \A start of subject 7377 $ end of subject 7378 also before newline at end of subject 7379 also before internal newline in multiline mode 7380 \Z end of subject 7381 also before newline at end of subject 7382 \z end of subject 7383 \G first matching position in subject 7384 7385 7386MATCH POINT RESET 7387 7388 \K reset start of match 7389 7390 7391ALTERNATION 7392 7393 expr|expr|expr... 7394 7395 7396CAPTURING 7397 7398 (...) capturing group 7399 (?<name>...) named capturing group (Perl) 7400 (?'name'...) 
named capturing group (Perl) 7401 (?P<name>...) named capturing group (Python) 7402 (?:...) non-capturing group 7403 (?|...) non-capturing group; reset group numbers for 7404 capturing groups in each alternative 7405 7406 7407ATOMIC GROUPS 7408 7409 (?>...) atomic, non-capturing group 7410 7411 7412COMMENT 7413 7414 (?#....) comment (not nestable) 7415 7416 7417OPTION SETTING 7418 7419 (?i) caseless 7420 (?J) allow duplicate names 7421 (?m) multiline 7422 (?s) single line (dotall) 7423 (?U) default ungreedy (lazy) 7424 (?x) extended (ignore white space) 7425 (?-...) unset option(s) 7426 7427 The following are recognized only at the start of a pattern or after 7428 one of the newline-setting options with similar syntax: 7429 7430 (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE) 7431 (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8) 7432 (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16) 7433 (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32) 7434 (*UTF) set appropriate UTF mode for the library in use 7435 (*UCP) set PCRE_UCP (use Unicode properties for \d etc) 7436 7437 7438LOOKAHEAD AND LOOKBEHIND ASSERTIONS 7439 7440 (?=...) positive look ahead 7441 (?!...) negative look ahead 7442 (?<=...) positive look behind 7443 (?<!...) negative look behind 7444 7445 Each top-level branch of a look behind must be of a fixed length. 
7446 7447 7448BACKREFERENCES 7449 7450 \n reference by number (can be ambiguous) 7451 \gn reference by number 7452 \g{n} reference by number 7453 \g{-n} relative reference by number 7454 \k<name> reference by name (Perl) 7455 \k'name' reference by name (Perl) 7456 \g{name} reference by name (Perl) 7457 \k{name} reference by name (.NET) 7458 (?P=name) reference by name (Python) 7459 7460 7461SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) 7462 7463 (?R) recurse whole pattern 7464 (?n) call subpattern by absolute number 7465 (?+n) call subpattern by relative number 7466 (?-n) call subpattern by relative number 7467 (?&name) call subpattern by name (Perl) 7468 (?P>name) call subpattern by name (Python) 7469 \g<name> call subpattern by name (Oniguruma) 7470 \g'name' call subpattern by name (Oniguruma) 7471 \g<n> call subpattern by absolute number (Oniguruma) 7472 \g'n' call subpattern by absolute number (Oniguruma) 7473 \g<+n> call subpattern by relative number (PCRE extension) 7474 \g'+n' call subpattern by relative number (PCRE extension) 7475 \g<-n> call subpattern by relative number (PCRE extension) 7476 \g'-n' call subpattern by relative number (PCRE extension) 7477 7478 7479CONDITIONAL PATTERNS 7480 7481 (?(condition)yes-pattern) 7482 (?(condition)yes-pattern|no-pattern) 7483 7484 (?(n)... absolute reference condition 7485 (?(+n)... relative reference condition 7486 (?(-n)... relative reference condition 7487 (?(<name>)... named reference condition (Perl) 7488 (?('name')... named reference condition (Perl) 7489 (?(name)... named reference condition (PCRE) 7490 (?(R)... overall recursion condition 7491 (?(Rn)... specific group recursion condition 7492 (?(R&name)... specific recursion condition 7493 (?(DEFINE)... define subpattern for reference 7494 (?(assert)... 
assertion condition 7495 7496 7497BACKTRACKING CONTROL 7498 7499 The following act immediately they are reached: 7500 7501 (*ACCEPT) force successful match 7502 (*FAIL) force backtrack; synonym (*F) 7503 (*MARK:NAME) set name to be passed back; synonym (*:NAME) 7504 7505 The following act only when a subsequent match failure causes a back- 7506 track to reach them. They all force a match failure, but they differ in 7507 what happens afterwards. Those that advance the start-of-match point do 7508 so only if the pattern is not anchored. 7509 7510 (*COMMIT) overall failure, no advance of starting point 7511 (*PRUNE) advance to next starting character 7512 (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE) 7513 (*SKIP) advance to current matching position 7514 (*SKIP:NAME) advance to position corresponding to an earlier 7515 (*MARK:NAME); if not found, the (*SKIP) is ignored 7516 (*THEN) local failure, backtrack to next alternation 7517 (*THEN:NAME) equivalent to (*MARK:NAME)(*THEN) 7518 7519 7520NEWLINE CONVENTIONS 7521 7522 These are recognized only at the very start of the pattern or after a 7523 (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option. 7524 7525 (*CR) carriage return only 7526 (*LF) linefeed only 7527 (*CRLF) carriage return followed by linefeed 7528 (*ANYCRLF) all three of the above 7529 (*ANY) any Unicode newline sequence 7530 7531 7532WHAT \R MATCHES 7533 7534 These are recognized only at the very start of the pattern or after a 7535 (*...) option that sets the newline convention or a UTF or UCP mode. 7536 7537 (*BSR_ANYCRLF) CR, LF, or CRLF 7538 (*BSR_UNICODE) any Unicode newline sequence 7539 7540 7541CALLOUTS 7542 7543 (?C) callout 7544 (?Cn) callout with data n 7545 7546 7547SEE ALSO 7548 7549 pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3). 7550 7551 7552AUTHOR 7553 7554 Philip Hazel 7555 University Computing Service 7556 Cambridge CB2 3QH, England. 


REVISION

       Last updated: 11 November 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------


PCREUNICODE(3)                                                  PCREUNICODE(3)


NAME
       PCRE - Perl-compatible regular expressions


UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT

       As well as UTF-8 support, PCRE also supports UTF-16 (from release
       8.30) and UTF-32 (from release 8.32), by means of two additional
       libraries. They can be built as well as, or instead of, the 8-bit
       library.


UTF-8 SUPPORT

       In order to process UTF-8 strings, you must build PCRE's 8-bit library
       with UTF support, and, in addition, you must call pcre_compile() with
       the PCRE_UTF8 option flag, or the pattern must start with the sequence
       (*UTF8) or (*UTF). When either of these is the case, both the pattern
       and any subject strings that are matched against it are treated as
       UTF-8 strings instead of strings of individual 1-byte characters.


UTF-16 AND UTF-32 SUPPORT

       In order to process UTF-16 or UTF-32 strings, you must build PCRE's
       16-bit or 32-bit library with UTF support, and, in addition, you must
       call pcre16_compile() or pcre32_compile() with the PCRE_UTF16 or
       PCRE_UTF32 option flag, as appropriate. Alternatively, the pattern
       must start with the sequence (*UTF16), (*UTF32), as appropriate, or
       (*UTF), which can be used with either library. When UTF mode is set,
       both the pattern and any subject strings that are matched against it
       are treated as UTF-16 or UTF-32 strings instead of strings of
       individual 16-bit or 32-bit characters.
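
       To illustrate the difference between 16-bit data units and characters
       that UTF-16 mode introduces, the following C sketch counts the
       characters in a buffer of 16-bit data units, treating a correctly
       formed surrogate pair as a single character. It is not part of PCRE;
       the function name utf16_char_count and its error convention are
       assumptions made for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE function): count the characters in a
   UTF-16 string of `len` 16-bit data units. A high surrogate followed by
   a low surrogate counts as one character. Returns (size_t)-1 if a
   surrogate is isolated or out of order. */
static size_t utf16_char_count(const uint16_t *s, size_t len)
{
    size_t chars = 0;

    for (size_t i = 0; i < len; i++) {
        uint16_t u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: the next unit must be a low surrogate. */
            if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return (size_t)-1;
            i++;                      /* consume the low surrogate too */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return (size_t)-1;        /* low surrogate with no high one */
        }
        chars++;
    }
    return chars;
}
```

       With the four data units 0x0041 0xD835 0xDD38 0x0042 the function
       reports three characters, whereas a library working on individual
       16-bit units would see four.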


UTF SUPPORT OVERHEAD

       If you compile PCRE with UTF support, but do not use it at run time,
       the library will be a bit bigger, but the additional run time overhead
       is limited to testing the PCRE_UTF[8|16|32] flag occasionally, so
       should not be very big.


UNICODE PROPERTY SUPPORT

       If PCRE is built with Unicode character property support (which
       implies UTF support), the escape sequences \p{..}, \P{..}, and \X can
       be used. The available properties that can be tested are limited to
       the general category properties such as Lu for an upper case letter or
       Nd for a decimal number, the Unicode script names such as Arabic or
       Han, and the derived properties Any and L&. Full lists are given in
       the pcrepattern and pcresyntax documentation. Only the short names for
       properties are supported. For example, \p{L} matches a letter. Its
       Perl synonym, \p{Letter}, is not supported. Furthermore, in Perl, many
       properties may optionally be prefixed by "Is", for compatibility with
       Perl 5.6. PCRE does not support this.

   Validity of UTF-8 strings

       When you set the PCRE_UTF8 flag, the byte strings passed as patterns
       and subjects are (by default) checked for validity on entry to the
       relevant functions. The entire string is checked before any other
       processing takes place. From release 7.3 of PCRE, the check is
       according to the rules of RFC 3629, which are themselves derived from
       the Unicode specification. Earlier releases of PCRE followed the rules
       of RFC 2279, which allows the full range of 31-bit values (0 to
       0x7FFFFFFF). The current check allows only values in the range U+0 to
       U+10FFFF, excluding the surrogate area and the non-characters.

       Characters in the "Surrogate Area" of Unicode are reserved for use by
       UTF-16, where they are used in pairs to encode codepoints with values
       greater than 0xFFFF.
       The code points that are encoded by UTF-16 pairs are available
       independently in the UTF-8 and UTF-32 encodings. (In other words, the
       whole surrogate thing is a fudge for UTF-16 which unfortunately messes
       up UTF-8 and UTF-32.)

       Also excluded are the "Non-Character" code points, which are U+FDD0 to
       U+FDEF and the last two code points in each plane, U+??FFFE and
       U+??FFFF.

       If an invalid UTF-8 string is passed to PCRE, an error return is
       given. At compile time, the only additional information is the offset
       to the first byte of the failing character. The run-time functions
       pcre_exec() and pcre_dfa_exec() also pass back this information, as
       well as a more detailed reason code if the caller has provided memory
       in which to do this.

       In some situations, you may already know that your strings are valid,
       and therefore want to skip these checks in order to improve
       performance, for example in the case of a long subject string that is
       being scanned repeatedly. If you set the PCRE_NO_UTF8_CHECK flag at
       compile time or at run time, PCRE assumes that the pattern or subject
       it is given (respectively) contains only valid UTF-8 codes. In this
       case, it does not diagnose an invalid UTF-8 string.

       Note that passing PCRE_NO_UTF8_CHECK to pcre_compile() just disables
       the check for the pattern; it does not also apply to subject strings.
       If you want to disable the check for a subject string you must pass
       this option to pcre_exec() or pcre_dfa_exec().

       If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
       the result is undefined and your program may crash.

   Validity of UTF-16 strings

       When you set the PCRE_UTF16 flag, the strings of 16-bit data units
       that are passed as patterns and subjects are (by default) checked for
       validity on entry to the relevant functions.
       Values other than those in the surrogate range U+D800 to U+DFFF are
       independent code points. Values in the surrogate range must be used in
       pairs in the correct manner.

       Excluded are the "Non-Character" code points, which are U+FDD0 to
       U+FDEF and the last two code points in each plane, U+??FFFE and
       U+??FFFF.

       If an invalid UTF-16 string is passed to PCRE, an error return is
       given. At compile time, the only additional information is the offset
       to the first data unit of the failing character. The run-time
       functions pcre16_exec() and pcre16_dfa_exec() also pass back this
       information, as well as a more detailed reason code if the caller has
       provided memory in which to do this.

       In some situations, you may already know that your strings are valid,
       and therefore want to skip these checks in order to improve
       performance. If you set the PCRE_NO_UTF16_CHECK flag at compile time
       or at run time, PCRE assumes that the pattern or subject it is given
       (respectively) contains only valid UTF-16 sequences. In this case, it
       does not diagnose an invalid UTF-16 string. However, if an invalid
       string is passed, the result is undefined.

   Validity of UTF-32 strings

       When you set the PCRE_UTF32 flag, the strings of 32-bit data units
       that are passed as patterns and subjects are (by default) checked for
       validity on entry to the relevant functions. This check allows only
       values in the range U+0 to U+10FFFF, excluding the surrogate area
       U+D800 to U+DFFF, and the "Non-Character" code points, which are
       U+FDD0 to U+FDEF and the last two characters in each plane, U+??FFFE
       and U+??FFFF.

       If an invalid UTF-32 string is passed to PCRE, an error return is
       given. At compile time, the only additional information is the offset
       to the first data unit of the failing character.
       The run-time functions pcre32_exec() and pcre32_dfa_exec() also pass
       back this information, as well as a more detailed reason code if the
       caller has provided memory in which to do this.

       In some situations, you may already know that your strings are valid,
       and therefore want to skip these checks in order to improve
       performance. If you set the PCRE_NO_UTF32_CHECK flag at compile time
       or at run time, PCRE assumes that the pattern or subject it is given
       (respectively) contains only valid UTF-32 sequences. In this case, it
       does not diagnose an invalid UTF-32 string. However, if an invalid
       string is passed, the result is undefined.

   General comments about UTF modes

       1. Codepoints less than 256 can be specified in patterns by either
       braced or unbraced hexadecimal escape sequences (for example, \x{b3}
       or \xb3). Larger values have to use braced sequences.

       2. Octal numbers up to \777 are recognized, and in UTF-8 mode they
       match two-byte characters for values greater than \177.

       3. Repeat quantifiers apply to complete UTF characters, not to
       individual data units, for example: \x{100}{3}.

       4. The dot metacharacter matches one UTF character instead of a single
       data unit.

       5. The escape sequence \C can be used to match a single byte in UTF-8
       mode, or a single 16-bit data unit in UTF-16 mode, or a single 32-bit
       data unit in UTF-32 mode, but its use can lead to some strange effects
       because it breaks up multi-unit characters (see the description of \C
       in the pcrepattern documentation). The use of \C is not supported in
       the alternative matching function pcre[16|32]_dfa_exec(), nor is it
       supported in UTF mode by the JIT optimization of pcre[16|32]_exec().
       If JIT optimization is requested for a UTF pattern that contains \C,
       it will not succeed, and so the matching will be carried out by the
       normal interpretive function.

       6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
       test characters of any code value, but, by default, the characters
       that PCRE recognizes as digits, spaces, or word characters remain the
       same set as in non-UTF mode, all with values less than 256. This
       remains true even when PCRE is built to include Unicode property
       support, because to do otherwise would slow down PCRE in many common
       cases. Note in particular that this applies to \b and \B, because they
       are defined in terms of \w and \W. If you really want to test for a
       wider sense of, say, "digit", you can use explicit Unicode property
       tests such as \p{Nd}. Alternatively, if you set the PCRE_UCP option,
       the way that the character escapes work is changed so that Unicode
       properties are used to determine which characters match. There are
       more details in the section on generic character types in the
       pcrepattern documentation.

       7. Similarly, characters that match the POSIX named character classes
       are all low-valued characters, unless the PCRE_UCP option is set.

       8. However, the horizontal and vertical white space matching escapes
       (\h, \H, \v, and \V) do match all the appropriate Unicode characters,
       whether or not PCRE_UCP is set.

       9. Case-insensitive matching applies only to characters whose values
       are less than 128, unless PCRE is built with Unicode property support.
       A few Unicode characters such as Greek sigma have more than two
       codepoints that are case-equivalent. Up to and including PCRE release
       8.31, only one-to-one case mappings were supported, but later releases
       (with Unicode property support) do treat as case-equivalent all
       versions of characters such as Greek sigma.
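
       The UTF-8 validity rules given under "Validity of UTF-8 strings" above
       (values U+0 to U+10FFFF only, no surrogates, no non-characters) can be
       sketched in C as follows. This is an illustration written for this
       page, not PCRE's own checking code; the function name
       utf8_first_invalid and its return convention are assumptions. It
       returns the byte offset of the first failing character, in the same
       spirit as the offset that PCRE reports.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE function): return the byte offset of
   the first invalid character in s[0..len-1], or -1 if the whole string
   is acceptable under the rules described in the text. */
static long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        uint32_t cp, min;
        size_t extra;

        if (c < 0x80)                { cp = c;        extra = 0; min = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; min = 0x10000; }
        else return (long)i;   /* stray continuation or 5/6-byte lead byte */

        if (extra > len - i - 1) return (long)i;       /* truncated */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }

        if (cp < min) return (long)i;                  /* overlong form */
        if (cp > 0x10FFFF) return (long)i;             /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i;   /* surrogate */
        if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
            return (long)i;                            /* non-character */

        i += extra + 1;
    }
    return -1;
}
```

       A caller that has already validated a long subject once in this way is
       in the situation described above, where PCRE_NO_UTF8_CHECK can be
       passed on later calls to avoid repeating the check.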


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 11 November 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------


PCREJIT(3)                                                          PCREJIT(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE JUST-IN-TIME COMPILER SUPPORT

       Just-in-time compiling is a heavyweight optimization that can greatly
       speed up pattern matching. However, it comes at the cost of extra
       processing before the match is performed. Therefore, it is of most
       benefit when the same pattern is going to be matched many times. This
       does not necessarily mean many calls of a matching function; if the
       pattern is not anchored, matching attempts may take place many times
       at various positions in the subject, even for a single call.
       Therefore, if the subject string is very long, it may still pay to use
       JIT for one-off matches.

       JIT support applies only to the traditional Perl-compatible matching
       function. It does not apply when the DFA matching function is being
       used. The code for this support was written by Zoltan Herczeg.


8-BIT, 16-BIT AND 32-BIT SUPPORT

       JIT support is available for all of the 8-bit, 16-bit and 32-bit PCRE
       libraries. To keep this documentation simple, only the 8-bit interface
       is described in what follows. If you are using the 16-bit library,
       substitute the 16-bit functions and 16-bit structures (for example,
       pcre16_jit_stack instead of pcre_jit_stack). If you are using the
       32-bit library, substitute the 32-bit functions and 32-bit structures
       (for example, pcre32_jit_stack instead of pcre_jit_stack).


AVAILABILITY OF JIT SUPPORT

       JIT support is an optional feature of PCRE.
       The "configure" option --enable-jit (or equivalent CMake option) must
       be set when PCRE is built if you want to use JIT. The support is
       limited to the following hardware platforms:

         ARM v5, v7, and Thumb2
         Intel x86 32-bit and 64-bit
         MIPS 32-bit
         Power PC 32-bit and 64-bit
         SPARC 32-bit (experimental)

       If --enable-jit is set on an unsupported platform, compilation fails.

       A program that is linked with PCRE 8.20 or later can tell if JIT sup-
       port is available by calling pcre_config() with the PCRE_CONFIG_JIT
       option. The result is 1 when JIT is available, and 0 otherwise. How-
       ever, a simple program does not need to check this in order to use JIT.
       The normal API is implemented in a way that falls back to the interpre-
       tive code if JIT is not available. For programs that need the best pos-
       sible performance, there is also a "fast path" API that is JIT-spe-
       cific.

       If your program may sometimes be linked with versions of PCRE that are
       older than 8.20, but you want to use JIT when it is available, you can
       test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
       macro such as PCRE_CONFIG_JIT, for compile-time control of your code.


SIMPLE USE OF JIT

       You have to do two things to make use of the JIT support in the sim-
       plest way:

         (1) Call pcre_study() with the PCRE_STUDY_JIT_COMPILE option for
             each compiled pattern, and pass the resulting pcre_extra block to
             pcre_exec().

         (2) Use pcre_free_study() to free the pcre_extra block when it is
             no longer needed, instead of just freeing it yourself. This
             ensures that any JIT data is also freed.
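
       The PCRE_MAJOR and PCRE_MINOR test mentioned under "Availability of
       JIT support" might be sketched as follows. This is only an
       illustration: RELEASE_HAS_JIT_API and release_has_jit_api() are
       hypothetical names that are not part of PCRE, and the only fact they
       encode is the 8.20 threshold stated above.

```c
#include <assert.h>

/* Hypothetical helper (not part of PCRE): non-zero when the release
   identified by major.minor is 8.20 or later, the first release for
   which this page describes JIT support. */
#define RELEASE_HAS_JIT_API(major, minor) \
    ((major) > 8 || ((major) == 8 && (minor) >= 20))

/* Run-time form of the same test, for version numbers obtained elsewhere. */
static int release_has_jit_api(int major, int minor)
{
    assert(major >= 0 && minor >= 0);
    return RELEASE_HAS_JIT_API(major, minor);
}
```

       In a program that includes pcre.h, the macro form could guard
       JIT-specific calls at compile time with
       "#if RELEASE_HAS_JIT_API(PCRE_MAJOR, PCRE_MINOR)".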

       For a program that may be linked with pre-8.20 versions of PCRE, you
       can insert

         #ifndef PCRE_STUDY_JIT_COMPILE
         #define PCRE_STUDY_JIT_COMPILE 0
         #endif

       so that no option is passed to pcre_study(), and then use something
       like this to free the study data:

         #ifdef PCRE_CONFIG_JIT
             pcre_free_study(study_ptr);
         #else
             pcre_free(study_ptr);
         #endif

       PCRE_STUDY_JIT_COMPILE requests the JIT compiler to generate code for
       complete matches. If you want to run partial matches using the
       PCRE_PARTIAL_HARD or PCRE_PARTIAL_SOFT options of pcre_exec(), you
       should set one or both of the following options in addition to, or
       instead of, PCRE_STUDY_JIT_COMPILE when you call pcre_study():

         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE

       The JIT compiler generates different optimized code for each of the
       three modes (normal, soft partial, hard partial). When pcre_exec() is
       called, the appropriate code is run if it is available. Otherwise, the
       pattern is matched using interpretive code.

       In some circumstances you may need to call additional functions. These
       are described in the section entitled "Controlling the JIT stack"
       below.

       If JIT support is not available, PCRE_STUDY_JIT_COMPILE etc. are
       ignored, and no JIT data is created. Otherwise, the compiled pattern is
       passed to the JIT compiler, which turns it into machine code that exe-
       cutes much faster than the normal interpretive code. When pcre_exec()
       is passed a pcre_extra block containing a pointer to JIT code of the
       appropriate mode (normal or hard/soft partial), it obeys that code
       instead of running the interpreter. The result is identical, but the
       compiled JIT code runs much faster.

       There are some pcre_exec() options that are not supported for JIT exe-
       cution.
       There are also some pattern items that JIT cannot handle.
       Details are given below. In both cases, execution automatically falls
       back to the interpretive code. If you want to know whether JIT was
       actually used for a particular match, you should arrange for a JIT
       callback function to be set up as described in the section entitled
       "Controlling the JIT stack" below, even if you do not need to supply a
       non-default JIT stack. Such a callback function is called whenever JIT
       code is about to be obeyed. If the execution options are not right for
       JIT execution, the callback function is not obeyed.

       If the JIT compiler finds an unsupported item, no JIT data is gener-
       ated. You can find out if JIT execution is available after studying a
       pattern by calling pcre_fullinfo() with the PCRE_INFO_JIT option. A
       result of 1 means that JIT compilation was successful. A result of 0
       means that JIT support is not available, or the pattern was not studied
       with PCRE_STUDY_JIT_COMPILE etc., or the JIT compiler was not able to
       handle the pattern.

       Once a pattern has been studied, with or without JIT, it can be used as
       many times as you like for matching different subject strings.


UNSUPPORTED OPTIONS AND PATTERN ITEMS

       The only pcre_exec() options that are supported for JIT execution are
       PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK, PCRE_NO_UTF32_CHECK, PCRE_NOT-
       BOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_PAR-
       TIAL_HARD, and PCRE_PARTIAL_SOFT.

       The unsupported pattern items are:

         \C             match a single byte; not supported in UTF-8 mode
         (?Cn)          callouts
         (*PRUNE)       )
         (*SKIP)        ) backtracking control verbs
         (*THEN)        )

       Support for some of these may be added in future.


RETURN VALUES FROM JIT EXECUTION

       When a pattern is matched using JIT execution, the return values are
       the same as those given by the interpretive pcre_exec() code, with the
       addition of one new error code: PCRE_ERROR_JIT_STACKLIMIT. This means
       that the memory used for the JIT stack was insufficient. See "Control-
       ling the JIT stack" below for a discussion of JIT stack usage. For com-
       patibility with the interpretive pcre_exec() code, no more than two-
       thirds of the ovector argument is used for passing back captured sub-
       strings.

       The error code PCRE_ERROR_MATCHLIMIT is returned by the JIT code if
       searching a very large pattern tree goes on for too long, as it is in
       the same circumstance when JIT is not used, but the details of exactly
       what is counted are not the same. The PCRE_ERROR_RECURSIONLIMIT error
       code is never returned by JIT execution.


SAVING AND RESTORING COMPILED PATTERNS

       The code that is generated by the JIT compiler is architecture-spe-
       cific, and is also position dependent. For those reasons it cannot be
       saved (in a file or database) and restored later like the bytecode and
       other data of a compiled pattern. Saving and restoring compiled pat-
       terns is not something many people do. More detail about this facility
       is given in the pcreprecompile documentation. It should be possible to
       run pcre_study() on a saved and restored pattern, and thereby recreate
       the JIT data, but because JIT compilation uses significant resources,
       it is probably not worth doing this; you might as well recompile the
       original pattern.


CONTROLLING THE JIT STACK

       When the compiled JIT code runs, it needs a block of memory to use as a
       stack. By default, it uses 32K on the machine stack. However, some
       large or complicated patterns need more than this.
       The error
       PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
       Three functions are provided for managing blocks of memory for use as
       JIT stacks. There is further discussion about the use of JIT stacks in
       the section entitled "JIT stack FAQ" below.

       The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
       are a starting size and a maximum size, and it returns a pointer to an
       opaque structure of type pcre_jit_stack, or NULL if there is an error.
       The pcre_jit_stack_free() function can be used to free a stack that is
       no longer needed. (For the technically minded: the address space is
       allocated by mmap or VirtualAlloc.)

       JIT uses far less memory for recursion than the interpretive code, and
       a maximum stack size of 512K to 1M should be more than enough for any
       pattern.

       The pcre_assign_jit_stack() function specifies which stack JIT code
       should use. Its arguments are as follows:

         pcre_extra         *extra
         pcre_jit_callback  callback
         void               *data

       The extra argument must be the result of studying a pattern with
       PCRE_STUDY_JIT_COMPILE etc. There are three cases for the values of the
       other two arguments:

         (1) If callback is NULL and data is NULL, an internal 32K block
             on the machine stack is used.

         (2) If callback is NULL and data is not NULL, data must be
             a valid JIT stack, the result of calling pcre_jit_stack_alloc().

         (3) If callback is not NULL, it must point to a function that is
             called with data as an argument at the start of matching, in
             order to set up a JIT stack. If the return from the callback
             function is NULL, the internal 32K stack is used; otherwise the
             return value must be a valid JIT stack, the result of calling
             pcre_jit_stack_alloc().

       A callback function is obeyed whenever JIT code is about to be run; it
       is not obeyed when pcre_exec() is called with options that are incom-
       patible for JIT execution. A callback function can therefore be used to
       determine whether a match operation was executed by JIT or by the
       interpreter.

       You may safely use the same JIT stack for more than one pattern (either
       by assigning directly or by callback), as long as the patterns are all
       matched sequentially in the same thread. In a multithread application,
       if you do not specify a JIT stack, or if you assign or pass back NULL
       from a callback, that is thread-safe, because each thread has its own
       machine stack. However, if you assign or pass back a non-NULL JIT
       stack, this must be a different stack for each thread so that the
       application is thread-safe.

       Strictly speaking, even more is allowed. You can assign the same non-
       NULL stack to any number of patterns as long as they are not used for
       matching by multiple threads at the same time. For example, you can
       assign the same stack to all compiled patterns, and use a global mutex
       in the callback to wait until the stack is available for use. However,
       this is an inefficient solution, and not recommended.

       This is a suggestion for how a multithreaded program that needs to set
       up non-default JIT stacks might operate:

         During thread initialization
           thread_local_var = pcre_jit_stack_alloc(...)

         During thread exit
           pcre_jit_stack_free(thread_local_var)

         Use a one-line callback function
           return thread_local_var

       All the functions described in this section do nothing if JIT is not
       available, and pcre_assign_jit_stack() does nothing unless the extra
       argument is non-NULL and points to a pcre_extra block that is the
       result of a successful study with PCRE_STUDY_JIT_COMPILE etc.


JIT STACK FAQ

       (1) Why do we need JIT stacks?

       PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack
       where the local data of the current node is pushed before checking its
       child nodes. Allocating real machine stack on some platforms is diffi-
       cult. For example, on PowerPC the stack chain needs to be updated every
       time the stack is extended. Although this is possible, the time over-
       head of updating it decreases performance. So we do the recursion in
       memory.

       (2) Why don't we simply allocate blocks of memory with malloc()?

       Modern operating systems have a nice feature: they can reserve an
       address space instead of allocating memory. We can safely allocate mem-
       ory pages inside this address space, so the stack could grow without
       moving memory data (this is important because of pointers). Thus we can
       allocate 1M address space, and use only a single memory page (usually
       4K) if that is enough. However, we can still grow up to 1M anytime if
       needed.

       (3) Who "owns" a JIT stack?

       The owner of the stack is the user program, not the JIT studied pattern
       or anything else. The user program must ensure that while a stack is
       being used by pcre_exec() (that is, it is assigned to the pattern cur-
       rently running), it is not used by any other thread (to avoid over-
       writing the same memory area). The best practice for multithreaded pro-
       grams is to allocate a stack for each thread, and return this stack
       through the JIT callback function.

       (4) When should a JIT stack be freed?

       You can free a JIT stack at any time, as long as it will not be used by
       pcre_exec() again. When you assign the stack to a pattern, only a
       pointer is set. There is no reference counting or any other magic. You
       can free the patterns and stacks in any order, anytime.
       Just do not
       call pcre_exec() with a pattern pointing to an already freed stack, as
       that will cause a segmentation fault. (Also, do not free a stack cur-
       rently used by pcre_exec() in another thread.) You can also replace the
       stack for a pattern at any time. You can even free the previous stack
       before assigning a replacement.

       (5) Should I allocate/free a stack every time before/after calling
       pcre_exec()?

       No, because this is too costly in terms of resources. However, you
       could implement some clever scheme that releases the stack if it has
       not been used for, say, two minutes. The JIT callback can help to
       achieve this without keeping a list of the currently JIT studied pat-
       terns.

       (6) OK, the stack is for long term memory allocation. But what happens
       if a pattern causes stack overflow with a stack of 1M? Is that 1M kept
       until the stack is freed?

       Especially on embedded systems, it might be a good idea to release mem-
       ory sometimes without freeing the stack. There is no API for this at
       the moment. A function that returns the memory currently allocated for
       a stack, and another that releases memory (shrinking the stack), would
       probably be a good idea if someone needs this.

       (7) This is too much of a headache. Isn't there any better solution for
       JIT stack handling?

       No, thanks to Windows. If POSIX threads were used everywhere, we could
       throw out this complicated API.


EXAMPLE CODE

       This is a single-threaded example that specifies a JIT stack without
       using a callback.

         /* pattern, subject, and length are supplied by the caller */
         int rc;
         int ovector[30];
         int erroffset;
         const char *error;
         pcre *re;
         pcre_extra *extra;
         pcre_jit_stack *jit_stack;

         re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
         /* Check for errors */
         extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &error);
         jit_stack = pcre_jit_stack_alloc(32*1024, 512*1024);
         /* Check for error (NULL) */
         pcre_assign_jit_stack(extra, NULL, jit_stack);
         rc = pcre_exec(re, extra, subject, length, 0, 0, ovector, 30);
         /* Check results */
         pcre_free(re);
         pcre_free_study(extra);
         pcre_jit_stack_free(jit_stack);


JIT FAST PATH API

       Because the API described above falls back to interpreted execution
       when JIT is not available, it is convenient for programs that are writ-
       ten for general use in many environments. However, calling JIT via
       pcre_exec() does have a performance impact. Programs that are written
       for use where JIT is known to be available, and which need the best
       possible performance, can instead use a "fast path" API to call JIT
       execution directly instead of calling pcre_exec() (obviously only for
       patterns that have been successfully studied by JIT).

       The fast path function is called pcre_jit_exec(), and it takes exactly
       the same arguments as pcre_exec(), plus one additional argument that
       must point to a JIT stack. The JIT stack arrangements described above
       do not apply. The return values are the same as for pcre_exec().

       When you call pcre_exec(), as well as testing for invalid options, a
       number of other sanity checks are performed on the arguments. For exam-
       ple, if the subject pointer is NULL, or its length is negative, an
       immediate error is given. Also, unless PCRE_NO_UTF[8|16|32]_CHECK is
       set, a UTF subject string is tested for validity. In the interests of
       speed, these checks do not happen on the JIT fast path, and if invalid
       data is passed, the result is undefined.

       Bypassing the sanity checks and the pcre_exec() wrapping can give
       speedups of more than 10%.


SEE ALSO

       pcreapi(3)


AUTHOR

       Philip Hazel (FAQ by Zoltan Herczeg)
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 31 October 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------


PCREPARTIAL(3)                                                  PCREPARTIAL(3)


NAME
       PCRE - Perl-compatible regular expressions


PARTIAL MATCHING IN PCRE

       In normal use of PCRE, if the subject string that is passed to a match-
       ing function matches as far as it goes, but is too short to match the
       entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances
       where it might be helpful to distinguish this case from other cases in
       which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements. An example
       might be a date in the form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid, it is able to
       raise an error as soon as a mistake is made, by beeping and not
       reflecting the character that has been typed, for example. This immedi-
       ate feedback is likely to be a better user interface than a check that
       is delayed until the entire string has been entered. Partial matching
       can also be useful when the subject string is very long and is not all
       available at once.

       PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
       PCRE_PARTIAL_HARD options, which can be set when calling any of the
       matching functions. For backwards compatibility, PCRE_PARTIAL is a syn-
       onym for PCRE_PARTIAL_SOFT. The essential difference between the two
       options is whether or not a partial match is preferred to an alterna-
       tive complete match, though the details differ between the two types of
       matching function. If both options are set, PCRE_PARTIAL_HARD takes
       precedence.

       If you want to use partial matching with just-in-time optimized code,
       you must call pcre_study(), pcre16_study() or pcre32_study() with one
       or both of these options:

         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE

       PCRE_STUDY_JIT_COMPILE should also be set if you are going to run non-
       partial matches on the same pattern. If the appropriate JIT study mode
       has not been set for a match, the interpretive matching code is used.

       Setting a partial matching option disables two of PCRE's standard opti-
       mizations. PCRE remembers the last literal data unit in a pattern, and
       abandons matching immediately if it is not present in the subject
       string. This optimization cannot be used for a subject string that
       might match only partially. If the pattern was studied, PCRE knows the
       minimum length of a matching string, and does not bother to run the
       matching function on shorter strings. This optimization is also dis-
       abled for partial matching.


PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()

       A partial match occurs during a call to pcre_exec() or
       pcre[16|32]_exec() when the end of the subject string is reached suc-
       cessfully, but matching cannot continue because more characters are
       needed. However, at least one character in the subject must have been
       inspected.
       This character need not form part of the final matched
       string; lookbehind assertions and the \K escape sequence provide ways
       of inspecting characters before the start of a matched substring. The
       requirement for inspecting at least one character exists because an
       empty string can always be matched; without such a restriction there
       would always be a partial match of an empty string at the end of the
       subject.

       If there are at least two slots in the offsets vector when a partial
       match is returned, the first slot is set to the offset of the earliest
       character that was inspected. For convenience, the second offset points
       to the end of the subject so that a substring can easily be identified.

       For the majority of patterns, the first offset identifies the start of
       the partially matched string. However, for patterns that contain look-
       behind assertions, or \K, or begin with \b or \B, earlier characters
       have been inspected while carrying out the match. For example:

         /(?<=abc)123/

       This pattern matches "123", but only if it is preceded by "abc". If the
       subject string is "xyzabc12", the offsets after a partial match are for
       the substring "abc12", because all these characters are needed if
       another match is tried with extra characters added to the subject.

       What happens when a partial match is identified depends on which of the
       two partial matching options is set.

   PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre[16|32]_exec()

       If PCRE_PARTIAL_SOFT is set when pcre_exec() or pcre[16|32]_exec()
       identifies a partial match, the partial match is remembered, but match-
       ing continues as normal, and other alternatives in the pattern are
       tried. If no complete match can be found, PCRE_ERROR_PARTIAL is
       returned instead of PCRE_ERROR_NOMATCH.
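
       When PCRE_ERROR_PARTIAL is returned, the two offsets described
       earlier in this section identify the text that has to be kept for a
       later attempt. The following is a minimal sketch of extracting that
       text in plain C. The function name copy_partial_substring() is
       hypothetical and not part of PCRE, and the sketch assumes the 8-bit
       library, where offsets are byte offsets into the subject.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copies the substring identified by a partial
   match into out. ovector[0] is the offset of the earliest inspected
   character and ovector[1] is the end of the subject.
   Returns the number of bytes copied, or -1 if out is too small. */
static int copy_partial_substring(const char *subject, const int *ovector,
                                  char *out, size_t outsize)
{
    assert(ovector[0] >= 0 && ovector[0] <= ovector[1]);
    size_t len = (size_t)(ovector[1] - ovector[0]);
    if (len + 1 > outsize)
        return -1;                      /* no room for text plus NUL */
    memcpy(out, subject + ovector[0], len);
    out[len] = '\0';
    return (int)len;
}
```

       For the lookbehind example above, a subject of "xyzabc12" with
       offsets 3 and 8 gives "abc12".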

       This option is "soft" because it prefers a complete match over a par-
       tial match. All the various matching items in a pattern behave as if
       the subject string is potentially complete. For example, \z, \Z, and $
       match at the end of the subject, as normal, and for \b and \B the end
       of the subject is treated as a non-alphanumeric.

       If there is more than one partial match, the first one that was found
       provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If this is matched against the subject string "abc123dog", both alter-
       natives fail to match, but the end of the subject is reached during
       matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3
       and 9, identifying "123dog" as the first partial match that was found.
       (In this example, there are two partial matches, because "dog" on its
       own partially matches the second alternative.)

   PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre[16|32]_exec()

       If PCRE_PARTIAL_HARD is set for pcre_exec() or pcre[16|32]_exec(),
       PCRE_ERROR_PARTIAL is returned as soon as a partial match is found,
       without continuing to search for possible complete matches. This option
       is "hard" because it prefers an earlier partial match over a later com-
       plete match. For this reason, the assumption is made that the end of
       the supplied subject string may not be the true end of the available
       data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of the
       subject, the result is PCRE_ERROR_PARTIAL, provided that at least one
       character in the subject has been inspected.

       Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16 subject
       strings are checked for validity. Normally, an invalid sequence causes
       the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16.
       However, in the
       special case of a truncated character at the end of the subject,
       PCRE_ERROR_SHORTUTF8 or PCRE_ERROR_SHORTUTF16 is returned when
       PCRE_PARTIAL_HARD is set.

   Comparing hard and soft partial matching

       The difference between the two partial matching options can be illus-
       trated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it prefers
       the longer string if possible). If it is matched against the string
       "dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
       However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
       On the other hand, if the pattern is made ungreedy the result is dif-
       ferent:

         /dog(sbody)??/

       In this case the result is always a complete match because that is
       found first, and matching never continues after finding a complete
       match. It might be easier to follow this explanation by thinking of the
       two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will always
       find the shorter match first.


PARTIAL MATCHING USING pcre_dfa_exec() OR pcre[16|32]_dfa_exec()

       The DFA functions move along the subject string character by character,
       without backtracking, searching for all possible matches simultane-
       ously. If the end of the subject is reached before the end of the pat-
       tern, there is the possibility of a partial match, again provided that
       at least one character has been inspected.

       When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
       there have been no complete matches. Otherwise, the complete matches
       are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches.
       The portion of the string
       that was inspected when the longest partial match was found is set as
       the first matching string, provided there are at least two slots in the
       offsets vector.

       Because the DFA functions always search for all possible matches, and
       there is no difference between greedy and ungreedy repetition, their
       behaviour is different from the standard functions when PCRE_PAR-
       TIAL_HARD is set. Consider the string "dog" matched against the
       ungreedy pattern shown above:

         /dog(sbody)??/

       Whereas the standard functions stop as soon as they find the complete
       match for "dog", the DFA functions also find the partial match for
       "dogsbody", and so return that when PCRE_PARTIAL_HARD is set.


PARTIAL MATCHING AND WORD BOUNDARIES

       If a pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
       counter-intuitive results. Consider this pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following character cannot take place, so a partial match is found.
       However, normal matching carries on, and \b matches at the end of the
       subject when the last character is a letter, so a complete match is
       found. The result, therefore, is not PCRE_ERROR_PARTIAL. Using
       PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
       then the partial match takes precedence.


FORMERLY RESTRICTED PATTERNS

       For releases of PCRE prior to 8.00, because of the way certain internal
       optimizations were implemented in the pcre_exec() function, the
       PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
       used with all patterns.
       From release 8.00 onwards, the restrictions no
       longer apply, and partial matching can be requested for any pattern.

       Items that were formerly restricted were repeated single characters and
       repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
       not conform to the restrictions, pcre_exec() returned the error code
       PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
       PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
       pattern can be used for partial matching now always returns 1.


EXAMPLE OF PARTIAL MATCHING USING PCRETEST

       If the escape sequence \P is present in a pcretest data line, the
       PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
       pcretest that uses the date example quoted above:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\P
          0: 25jun04
          1: jun
         data> 25dec3\P
         Partial match: 25dec3
         data> 3ju\P
         Partial match: 3ju
         data> 3juj\P
         No match
         data> j\P
         No match

       The first data string is matched completely, so pcretest shows the
       matched substrings. The remaining four strings do not match the com-
       plete pattern, but the first two are partial matches. Similar output is
       obtained if DFA matching is used.

       If the escape sequence \P is present more than once in a pcretest data
       line, the PCRE_PARTIAL_HARD option is set for the match.


MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16|32]_dfa_exec()

       When a partial match has been found using a DFA matching function, it
       is possible to continue the match by providing additional subject data
       and calling the function again with the same compiled regular expres-
       sion, this time setting the PCRE_DFA_RESTART option.
       You must pass the
       same working space as before, because this is where details of the pre-
       vious partial match are stored. Here is an example using pcretest,
       using the \R escape sequence to set the PCRE_DFA_RESTART option (\D
       specifies the use of the DFA matching function):

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\P\D
         Partial match: 23ja
         data> n05\R\D
          0: n05

       The first call has "23ja" as the subject, and requests partial match-
       ing; the second call has "n05" as the subject for the continued
       (restarted) match. Notice that when the match is complete, only the
       last part is shown; PCRE does not retain the previously partially-
       matched string. It is up to the calling program to do that if it needs
       to.

       You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
       PCRE_DFA_RESTART to continue partial matching over multiple segments.
       This facility can be used to pass very long subject strings to the DFA
       matching functions.


MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre[16|32]_exec()

       From release 8.00, the standard matching functions can also be used to
       do multi-segment matching. Unlike the DFA functions, it is not possible
       to restart the previous match with a new segment of data. Instead, new
       data must be added to the previous subject string, and the entire match
       re-run, starting from the point where the partial match occurred. Ear-
       lier data can be discarded.

       It is best to use PCRE_PARTIAL_HARD in this situation, because it does
       not treat the end of a segment as the end of the subject when matching
       \z, \Z, \b, \B, and $.
       Consider an unanchored pattern that matches dates:

         re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
       data> The date is 23ja\P\P
       Partial match: 23ja

       At this stage, an application could discard the text preceding "23ja",
       add on text from the next segment, and call the matching function
       again. Unlike the DFA matching functions, the entire matching string
       must always be available, and the complete matching process occurs for
       each call, so more memory and more processing time are needed.

       Note: If the pattern contains lookbehind assertions, or \K, or starts
       with \b or \B, the string that is returned for a partial match includes
       characters that precede the partially matched string itself, because
       these must be retained when adding on more characters for a subsequent
       matching attempt. However, in some cases you may need to retain even
       earlier characters, as discussed in the next section.


ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains a test for the beginning of a line, you need
       to pass the PCRE_NOTBOL option when the subject string for any call
       does not start at the beginning of a line. There is also a PCRE_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.

       2. Lookbehind assertions that have already been obeyed are catered for
       in the offsets that are returned for a partial match. However, a
       lookbehind assertion later in the pattern could require even earlier
       characters to be inspected.
You can handle this case by using the 8547 PCRE_INFO_MAXLOOKBEHIND option of the pcre_fullinfo() or 8548 pcre[16|32]_fullinfo() functions to obtain the length of the largest 8549 lookbehind in the pattern. This length is given in characters, not 8550 bytes. If you always retain at least that many characters before the 8551 partially matched string, all should be well. (Of course, near the 8552 start of the subject, fewer characters may be present; in that case all 8553 characters should be retained.) 8554 8555 3. Because a partial match must always contain at least one character, 8556 what might be considered a partial match of an empty string actually 8557 gives a "no match" result. For example: 8558 8559 re> /c(?<=abc)x/ 8560 data> ab\P 8561 No match 8562 8563 If the next segment begins "cx", a match should be found, but this will 8564 only happen if characters from the previous segment are retained. For 8565 this reason, a "no match" result should be interpreted as "partial 8566 match of an empty string" when the pattern contains lookbehinds. 8567 8568 4. Matching a subject string that is split into multiple segments may 8569 not always produce exactly the same result as matching over one single 8570 long string, especially when PCRE_PARTIAL_SOFT is used. The section 8571 "Partial Matching and Word Boundaries" above describes an issue that 8572 arises if the pattern ends with \b or \B. Another kind of difference 8573 may occur when there are multiple matching possibilities, because (for 8574 PCRE_PARTIAL_SOFT) a partial match result is given only when there are 8575 no completed matches. This means that as soon as the shortest match has 8576 been found, continuation to a new subject segment is no longer possi- 8577 ble. 
Consider again this pcretest example: 8578 8579 re> /dog(sbody)?/ 8580 data> dogsb\P 8581 0: dog 8582 data> do\P\D 8583 Partial match: do 8584 data> gsb\R\P\D 8585 0: g 8586 data> dogsbody\D 8587 0: dogsbody 8588 1: dog 8589 8590 The first data line passes the string "dogsb" to a standard matching 8591 function, setting the PCRE_PARTIAL_SOFT option. Although the string is 8592 a partial match for "dogsbody", the result is not PCRE_ERROR_PARTIAL, 8593 because the shorter string "dog" is a complete match. Similarly, when 8594 the subject is presented to a DFA matching function in several parts 8595 ("do" and "gsb" being the first two) the match stops when "dog" has 8596 been found, and it is not possible to continue. On the other hand, if 8597 "dogsbody" is presented as a single string, a DFA matching function 8598 finds both matches. 8599 8600 Because of these problems, it is best to use PCRE_PARTIAL_HARD when 8601 matching multi-segment data. The example above then behaves differ- 8602 ently: 8603 8604 re> /dog(sbody)?/ 8605 data> dogsb\P\P 8606 Partial match: dogsb 8607 data> do\P\D 8608 Partial match: do 8609 data> gsb\R\P\P\D 8610 Partial match: gsb 8611 8612 5. Patterns that contain alternatives at the top level which do not all 8613 start with the same pattern item may not work as expected when 8614 PCRE_DFA_RESTART is used. For example, consider this pattern: 8615 8616 1234|3789 8617 8618 If the first part of the subject is "ABC123", a partial match of the 8619 first alternative is found at offset 3. There is no partial match for 8620 the second alternative, because such a match does not start at the same 8621 point in the subject string. Attempting to continue with the string 8622 "7890" does not yield a match because only those alternatives that 8623 match at one point in the subject are remembered. The problem arises 8624 because the start of the second alternative matches within the first 8625 alternative. 
There is no problem with anchored patterns or patterns 8626 such as: 8627 8628 1234|ABCD 8629 8630 where no string can be a partial match for both alternatives. This is 8631 not a problem if a standard matching function is used, because the 8632 entire match has to be rerun each time: 8633 8634 re> /1234|3789/ 8635 data> ABC123\P\P 8636 Partial match: 123 8637 data> 1237890 8638 0: 3789 8639 8640 Of course, instead of using PCRE_DFA_RESTART, the same technique of re- 8641 running the entire match can also be used with the DFA matching func- 8642 tions. Another possibility is to work with two buffers. If a partial 8643 match at offset n in the first buffer is followed by "no match" when 8644 PCRE_DFA_RESTART is used on the second buffer, you can then try a new 8645 match starting at offset n+1 in the first buffer. 8646 8647 8648AUTHOR 8649 8650 Philip Hazel 8651 University Computing Service 8652 Cambridge CB2 3QH, England. 8653 8654 8655REVISION 8656 8657 Last updated: 24 June 2012 8658 Copyright (c) 1997-2012 University of Cambridge. 8659------------------------------------------------------------------------------ 8660 8661 8662PCREPRECOMPILE(3) PCREPRECOMPILE(3) 8663 8664 8665NAME 8666 PCRE - Perl-compatible regular expressions 8667 8668 8669SAVING AND RE-USING PRECOMPILED PCRE PATTERNS 8670 8671 If you are running an application that uses a large number of regular 8672 expression patterns, it may be useful to store them in a precompiled 8673 form instead of having to compile them every time the application is 8674 run. If you are not using any private character tables (see the 8675 pcre_maketables() documentation), this is relatively straightforward. 8676 If you are using private tables, it is a little bit more complicated. 8677 However, if you are using the just-in-time optimization feature, it is 8678 not possible to save and reload the JIT data. 8679 8680 If you save compiled patterns to a file, you can copy them to a differ- 8681 ent host and run them there. 
       If the two hosts have different endianness (byte order), you should
       run the pcre[16|32]_pattern_to_host_byte_order() function on the new
       host before trying to match the pattern. The matching functions return
       PCRE_ERROR_BADENDIANNESS if they detect a pattern with the wrong
       endianness.

       Compiling regular expressions with one version of PCRE for use with a
       different version is not guaranteed to work and may cause crashes, and
       saving and restoring a compiled pattern loses any JIT optimization
       data.


SAVING A COMPILED PATTERN

       The value returned by pcre[16|32]_compile() points to a single block of
       memory that holds the compiled pattern and associated data. You can
       find the length of this block in bytes by calling
       pcre[16|32]_fullinfo() with an argument of PCRE_INFO_SIZE. You can then
       save the data in any appropriate manner. Here is sample code for the
       8-bit library that compiles a pattern and writes it to a file. It
       assumes that the variable fd is a FILE * stream that is open for
       output:

         int erroroffset, rc;
         size_t size, written;   /* PCRE_INFO_SIZE yields a size_t */
         const char *error;
         pcre *re;

         re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
         if (re == NULL) { ... handle errors ... }
         rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
         if (rc < 0) { ... handle errors ... }
         written = fwrite(re, 1, size, fd);
         if (written != size) { ... handle errors ... }

       In this example, the bytes that comprise the compiled pattern are
       copied exactly. Note that this is binary data that may contain any of
       the 256 possible byte values. On systems that make a distinction
       between binary and non-binary data, be sure that the file is opened for
       binary output.

       If you want to write more than one pattern to a file, you will have to
       devise a way of separating them.
       For binary data, preceding each pattern with its length is probably
       the most straightforward approach. Another possibility is to write
       out the data in hexadecimal instead of binary, one pattern to a line.

       Saving compiled patterns in a file is only one possible way of storing
       them for later use. They could equally well be saved in a database, or
       in the memory of some daemon process that passes them via sockets to
       the processes that want them.

       If the pattern has been studied, it is also possible to save the normal
       study data in a similar way to the compiled pattern itself. However, if
       the PCRE_STUDY_JIT_COMPILE option was used, the just-in-time data that
       is created cannot be saved because it is too dependent on the current
       environment. When studying generates additional information,
       pcre[16|32]_study() returns a pointer to a pcre[16|32]_extra data
       block. Its format is defined in the section on matching a pattern in
       the pcreapi documentation. The study_data field points to the binary
       study data, and this is what you must save (not the pcre[16|32]_extra
       block itself). The length of the study data can be obtained by calling
       pcre[16|32]_fullinfo() with an argument of PCRE_INFO_STUDYSIZE.
       Remember to check that pcre[16|32]_study() did return a non-NULL value
       before trying to save the study data.


RE-USING A PRECOMPILED PATTERN

       Re-using a precompiled pattern is straightforward. Having reloaded it
       into main memory, and called pcre[16|32]_pattern_to_host_byte_order()
       if necessary, you pass its pointer to pcre[16|32]_exec() or
       pcre[16|32]_dfa_exec() in the usual way.
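
       The suggestion of storing several patterns in one file, each preceded
       by its length, and then reloading them into main memory could be
       sketched as follows. This is only an illustration that uses standard C
       I/O; the names write_blob() and read_blob() and the 4-byte length
       prefix are assumptions made for the example and are not part of PCRE.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write one record as a 4-byte length (in host byte
   order) followed by the raw bytes. Returns 0 on success, -1 on failure. */
int write_blob(FILE *f, const void *data, uint32_t length)
{
    assert(f != NULL && (data != NULL || length == 0));
    if (fwrite(&length, sizeof length, 1, f) != 1) return -1;
    if (length > 0 && fwrite(data, 1, length, f) != length) return -1;
    return 0;
}

/* Hypothetical helper: read the next record into freshly allocated memory.
   Returns NULL at end of file or on error; the caller frees the result. */
void *read_blob(FILE *f, uint32_t *length)
{
    void *data;

    assert(f != NULL && length != NULL);
    if (fread(length, sizeof *length, 1, f) != 1) return NULL;
    data = malloc(*length > 0 ? *length : 1);
    if (data == NULL) return NULL;
    if (*length > 0 && fread(data, 1, *length, f) != *length) {
        free(data);
        return NULL;
    }
    return data;
}
```

       In an application, the data passed to write_blob() would be the block
       returned by pcre_compile(), with the length obtained from
       PCRE_INFO_SIZE, and the pointer returned by read_blob() would be the
       one passed to the matching function. Because the length prefix is
       written in host byte order, a file written this way has the same
       endianness restriction as the compiled patterns themselves.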

       However, if you passed a pointer to custom character tables when the
       pattern was compiled (the tableptr argument of pcre[16|32]_compile()),
       you must now pass a similar pointer to pcre[16|32]_exec() or
       pcre[16|32]_dfa_exec(), because the pointer value saved with the
       compiled pattern is no longer valid. A field in a pcre[16|32]_extra
       block is used to pass this data, as described in the section on
       matching a pattern in the pcreapi documentation.

       If you did not provide custom character tables when the pattern was
       compiled, the pointer in the compiled pattern is NULL, which causes the
       matching functions to use PCRE's internal tables. Thus, you do not need
       to take any special action at run time in this case.

       If you saved study data with the compiled pattern, you need to create
       your own pcre[16|32]_extra data block and set the study_data field to
       point to the reloaded study data. You must also set the
       PCRE_EXTRA_STUDY_DATA bit in the flags field to indicate that study
       data is present. Then pass the pcre[16|32]_extra block to the matching
       function in the usual way. If the pattern was studied for just-in-time
       optimization, that data cannot be saved, and so is lost by a
       save/restore cycle.


COMPATIBILITY WITH DIFFERENT PCRE RELEASES

       In general, it is safest to recompile all saved patterns when you
       update to a new PCRE release, though not all updates actually require
       this.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 24 June 2012
       Copyright (c) 1997-2012 University of Cambridge.
8794------------------------------------------------------------------------------ 8795 8796 8797PCREPERFORM(3) PCREPERFORM(3) 8798 8799 8800NAME 8801 PCRE - Perl-compatible regular expressions 8802 8803 8804PCRE PERFORMANCE 8805 8806 Two aspects of performance are discussed below: memory usage and pro- 8807 cessing time. The way you express your pattern as a regular expression 8808 can affect both of them. 8809 8810 8811COMPILED PATTERN MEMORY USAGE 8812 8813 Patterns are compiled by PCRE into a reasonably efficient interpretive 8814 code, so that most simple patterns do not use much memory. However, 8815 there is one case where the memory usage of a compiled pattern can be 8816 unexpectedly large. If a parenthesized subpattern has a quantifier with 8817 a minimum greater than 1 and/or a limited maximum, the whole subpattern 8818 is repeated in the compiled code. For example, the pattern 8819 8820 (abc|def){2,4} 8821 8822 is compiled as if it were 8823 8824 (abc|def)(abc|def)((abc|def)(abc|def)?)? 8825 8826 (Technical aside: It is done this way so that backtrack points within 8827 each of the repetitions can be independently maintained.) 8828 8829 For regular expressions whose quantifiers use only small numbers, this 8830 is not usually a problem. However, if the numbers are large, and par- 8831 ticularly if such repetitions are nested, the memory usage can become 8832 an embarrassment. For example, the very simple pattern 8833 8834 ((ab){1,1000}c){1,3} 8835 8836 uses 51K bytes when compiled using the 8-bit library. When PCRE is com- 8837 piled with its default internal pointer size of two bytes, the size 8838 limit on a compiled pattern is 64K data units, and this is reached with 8839 the above pattern if the outer repetition is increased from 3 to 4. 8840 PCRE can be compiled to use larger internal pointers and thus handle 8841 larger compiled patterns, but it is better to try to rewrite your pat- 8842 tern to use less memory if you can. 
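
       The way a bounded repeat such as (abc|def){2,4} is written out in full
       can be illustrated with a small text-level sketch. It only reproduces
       the expanded form shown above as a string, to make the growth in size
       visible; it is not how PCRE builds its compiled code, and the function
       name expand_repeat() is an assumption made for this example.

```c
#include <assert.h>
#include <string.h>

/* Append src to the string in dst if it fits in size bytes.
   Returns 0 on success, -1 if there is not enough room. */
static int append(char *dst, size_t size, const char *src)
{
    size_t used = strlen(dst), need = strlen(src);
    if (used + need + 1 > size) return -1;
    memcpy(dst + used, src, need + 1);
    return 0;
}

/* Append the nested optional tail for `extra` further copies of the group:
   G? for one copy, (GG?)? for two, (G(GG?)?)? for three, and so on. */
static int append_optional(char *dst, size_t size, const char *group,
                           int extra)
{
    if (extra <= 0) return 0;
    if (extra == 1)
        return (append(dst, size, group) || append(dst, size, "?")) ? -1 : 0;
    if (append(dst, size, "(") || append(dst, size, group)) return -1;
    if (append_optional(dst, size, group, extra - 1)) return -1;
    return append(dst, size, ")?");
}

/* Hypothetical helper: write the expansion of group{min,max} into out.
   Returns 0 on success, -1 if out is too small. */
int expand_repeat(const char *group, int min, int max, char *out, size_t size)
{
    int i;

    assert(group != NULL && out != NULL && size > 0);
    assert(min >= 0 && max >= min);
    out[0] = '\0';
    for (i = 0; i < min; i++)              /* the mandatory copies */
        if (append(out, size, group)) return -1;
    return append_optional(out, size, group, max - min);
}
```

       Calling expand_repeat("(abc|def)", 2, 4, buf, sizeof buf) fills buf
       with the expanded pattern quoted above. The length of the result grows
       in proportion to the maximum repeat count, which is why large or
       nested counts make the compiled pattern so big.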
8843 8844 One way of reducing the memory usage for such patterns is to make use 8845 of PCRE's "subroutine" facility. Re-writing the above pattern as 8846 8847 ((ab)(?2){0,999}c)(?1){0,2} 8848 8849 reduces the memory requirements to 18K, and indeed it remains under 20K 8850 even with the outer repetition increased to 100. However, this pattern 8851 is not exactly equivalent, because the "subroutine" calls are treated 8852 as atomic groups into which there can be no backtracking if there is a 8853 subsequent matching failure. Therefore, PCRE cannot do this kind of 8854 rewriting automatically. Furthermore, there is a noticeable loss of 8855 speed when executing the modified pattern. Nevertheless, if the atomic 8856 grouping is not a problem and the loss of speed is acceptable, this 8857 kind of rewriting will allow you to process patterns that PCRE cannot 8858 otherwise handle. 8859 8860 8861STACK USAGE AT RUN TIME 8862 8863 When pcre_exec() or pcre[16|32]_exec() is used for matching, certain 8864 kinds of pattern can cause it to use large amounts of the process 8865 stack. In some environments the default process stack is quite small, 8866 and if it runs out the result is often SIGSEGV. This issue is probably 8867 the most frequently raised problem with PCRE. Rewriting your pattern 8868 can often help. The pcrestack documentation discusses this issue in 8869 detail. 8870 8871 8872PROCESSING TIME 8873 8874 Certain items in regular expression patterns are processed more effi- 8875 ciently than others. It is more efficient to use a character class like 8876 [aeiou] than a set of single-character alternatives such as 8877 (a|e|i|o|u). In general, the simplest construction that provides the 8878 required behaviour is usually the most efficient. Jeffrey Friedl's book 8879 contains a lot of useful general discussion about optimizing regular 8880 expressions for efficient performance. This document contains a few 8881 observations about PCRE. 
8882 8883 Using Unicode character properties (the \p, \P, and \X escapes) is 8884 slow, because PCRE has to use a multi-stage table lookup whenever it 8885 needs a character's property. If you can find an alternative pattern 8886 that does not use character properties, it will probably be faster. 8887 8888 By default, the escape sequences \b, \d, \s, and \w, and the POSIX 8889 character classes such as [:alpha:] do not use Unicode properties, 8890 partly for backwards compatibility, and partly for performance reasons. 8891 However, you can set PCRE_UCP if you want Unicode character properties 8892 to be used. This can double the matching time for items such as \d, 8893 when matched with a traditional matching function; the performance loss 8894 is less with a DFA matching function, and in both cases there is not 8895 much difference for \b. 8896 8897 When a pattern begins with .* not in parentheses, or in parentheses 8898 that are not the subject of a backreference, and the PCRE_DOTALL option 8899 is set, the pattern is implicitly anchored by PCRE, since it can match 8900 only at the start of a subject string. However, if PCRE_DOTALL is not 8901 set, PCRE cannot make this optimization, because the . metacharacter 8902 does not then match a newline, and if the subject string contains new- 8903 lines, the pattern may match from the character immediately following 8904 one of them instead of from the very start. For example, the pattern 8905 8906 .*second 8907 8908 matches the subject "first\nand second" (where \n stands for a newline 8909 character), with the match starting at the seventh character. In order 8910 to do this, PCRE has to retry the match starting after every newline in 8911 the subject. 8912 8913 If you are using such a pattern with subject strings that do not con- 8914 tain newlines, the best performance is obtained by setting PCRE_DOTALL, 8915 or starting the pattern with ^.* or ^.*? to indicate explicit anchor- 8916 ing. 
       That saves PCRE from having to scan along the subject looking for a
       newline to restart at.

       Beware of patterns that contain nested indefinite repeats. These can
       take a long time to run when applied to a string that does not match.
       Consider the pattern fragment

         ^(a+)*

       This can match "aaaa" in 16 different ways, and this number increases
       very rapidly as the string gets longer. (The * repeat can match 0, 1,
       2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
       repeats can match different numbers of times.) When the remainder of
       the pattern is such that the entire match is going to fail, PCRE has in
       principle to try every possible variation, and this can take an
       extremely long time, even for relatively short strings.

       An optimization catches some of the simpler cases such as

         (a+)*b

       where a literal character follows. Before embarking on the standard
       matching procedure, PCRE checks that there is a "b" later in the
       subject string, and if there is not, it fails the match immediately.
       However, when there is no following literal this optimization cannot be
       used. You can see the difference by comparing the behaviour of

         (a+)*\d

       with the pattern above. The pattern above, with its literal "b", gives
       a failure almost instantly when applied to a whole line of "a"
       characters, whereas the \d version takes an appreciable time with
       strings longer than about 20 characters.

       In many cases, the solution to this kind of performance issue is to use
       an atomic group or a possessive quantifier.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 25 August 2012
       Copyright (c) 1997-2012 University of Cambridge.
8964------------------------------------------------------------------------------ 8965 8966 8967PCREPOSIX(3) PCREPOSIX(3) 8968 8969 8970NAME 8971 PCRE - Perl-compatible regular expressions. 8972 8973 8974SYNOPSIS OF POSIX API 8975 8976 #include <pcreposix.h> 8977 8978 int regcomp(regex_t *preg, const char *pattern, 8979 int cflags); 8980 8981 int regexec(regex_t *preg, const char *string, 8982 size_t nmatch, regmatch_t pmatch[], int eflags); 8983 8984 size_t regerror(int errcode, const regex_t *preg, 8985 char *errbuf, size_t errbuf_size); 8986 8987 void regfree(regex_t *preg); 8988 8989 8990DESCRIPTION 8991 8992 This set of functions provides a POSIX-style API for the PCRE regular 8993 expression 8-bit library. See the pcreapi documentation for a descrip- 8994 tion of PCRE's native API, which contains much additional functional- 8995 ity. There is no POSIX-style wrapper for PCRE's 16-bit and 32-bit 8996 library. 8997 8998 The functions described here are just wrapper functions that ultimately 8999 call the PCRE native API. Their prototypes are defined in the 9000 pcreposix.h header file, and on Unix systems the library itself is 9001 called pcreposix.a, so can be accessed by adding -lpcreposix to the 9002 command for linking an application that uses them. Because the POSIX 9003 functions call the native ones, it is also necessary to add -lpcre. 9004 9005 I have implemented only those POSIX option bits that can be reasonably 9006 mapped to PCRE native options. In addition, the option REG_EXTENDED is 9007 defined with the value zero. This has no effect, but since programs 9008 that are written to the POSIX interface often use it, this makes it 9009 easier to slot in PCRE as a replacement library. Other POSIX options 9010 are not even defined. 9011 9012 There are also some other options that are not defined by POSIX. These 9013 have been added at the request of users who want to make use of certain 9014 PCRE-specific features via the POSIX calling interface. 
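
       The four functions in the synopsis are typically used together. The
       following is a minimal sketch rather than a definitive usage pattern.
       So that it builds without PCRE installed, it includes the system
       <regex.h>; with PCRE you would include <pcreposix.h> instead and link
       with -lpcreposix -lpcre. The function name match_pair() and the
       pattern are assumptions made for the example, and the pattern was
       chosen to mean the same thing in both POSIX extended and Perl syntax.

```c
#include <assert.h>
#include <regex.h>      /* with PCRE: #include <pcreposix.h> */
#include <stddef.h>

/* Hypothetical helper: compile a fixed pattern, match it against subject,
   and return the regexec() or regcomp() result (zero means a match).
   On failure a printable message is placed in errbuf. */
int match_pair(const char *subject, regmatch_t pmatch[3],
               char *errbuf, size_t errbuf_size)
{
    regex_t preg;
    int rc;

    assert(subject != NULL && pmatch != NULL && errbuf != NULL);

    /* REG_EXTENDED is needed by the system library; pcreposix.h defines
       it as zero, so it is harmless there. */
    rc = regcomp(&preg, "([a-z]+):([0-9]+)", REG_EXTENDED);
    if (rc != 0) {
        regerror(rc, &preg, errbuf, errbuf_size);
        return rc;      /* preg must not be used after a failed regcomp() */
    }

    /* pmatch[0] receives the whole match, pmatch[1] and pmatch[2] the
       two captured substrings. */
    rc = regexec(&preg, subject, 3, pmatch, 0);
    if (rc != 0)
        regerror(rc, &preg, errbuf, errbuf_size);

    regfree(&preg);     /* release the memory allocated by regcomp() */
    return rc;
}
```

       For the subject "ruby:1234" the call returns zero, with the offsets of
       "ruby" and "1234" in pmatch[1] and pmatch[2]; for a subject with no
       colon it returns REG_NOMATCH.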
9015 9016 When PCRE is called via these functions, it is only the API that is 9017 POSIX-like in style. The syntax and semantics of the regular expres- 9018 sions themselves are still those of Perl, subject to the setting of 9019 various PCRE options, as described below. "POSIX-like in style" means 9020 that the API approximates to the POSIX definition; it is not fully 9021 POSIX-compatible, and in multi-byte encoding domains it is probably 9022 even less compatible. 9023 9024 The header for these functions is supplied as pcreposix.h to avoid any 9025 potential clash with other POSIX libraries. It can, of course, be 9026 renamed or aliased as regex.h, which is the "correct" name. It provides 9027 two structure types, regex_t for compiled internal forms, and reg- 9028 match_t for returning captured substrings. It also defines some con- 9029 stants whose names start with "REG_"; these are used for setting 9030 options and identifying error codes. 9031 9032 9033COMPILING A PATTERN 9034 9035 The function regcomp() is called to compile a pattern into an internal 9036 form. The pattern is a C string terminated by a binary zero, and is 9037 passed in the argument pattern. The preg argument is a pointer to a 9038 regex_t structure that is used as a base for storing information about 9039 the compiled regular expression. 9040 9041 The argument cflags is either zero, or contains one or more of the bits 9042 defined by the following macros: 9043 9044 REG_DOTALL 9045 9046 The PCRE_DOTALL option is set when the regular expression is passed for 9047 compilation to the native function. Note that REG_DOTALL is not part of 9048 the POSIX standard. 9049 9050 REG_ICASE 9051 9052 The PCRE_CASELESS option is set when the regular expression is passed 9053 for compilation to the native function. 9054 9055 REG_NEWLINE 9056 9057 The PCRE_MULTILINE option is set when the regular expression is passed 9058 for compilation to the native function. 
       Note that this does not mimic the defined POSIX behaviour for
       REG_NEWLINE (see the following section).

         REG_NOSUB

       The PCRE_NO_AUTO_CAPTURE option is set when the regular expression is
       passed for compilation to the native function. In addition, when a
       pattern that is compiled with this flag is passed to regexec() for
       matching, the nmatch and pmatch arguments are ignored, and no captured
       strings are returned.

         REG_UCP

       The PCRE_UCP option is set when the regular expression is passed for
       compilation to the native function. This causes PCRE to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
       ASCII values. Note that REG_UCP is not part of the POSIX standard.

         REG_UNGREEDY

       The PCRE_UNGREEDY option is set when the regular expression is passed
       for compilation to the native function. Note that REG_UNGREEDY is not
       part of the POSIX standard.

         REG_UTF8

       The PCRE_UTF8 option is set when the regular expression is passed for
       compilation to the native function. This causes the pattern itself and
       all data strings used for matching it to be treated as UTF-8 strings.
       Note that REG_UTF8 is not part of the POSIX standard.

       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by . (they are not) or
       by a negative class such as [^a] (they are).

       The yield of regcomp() is zero on success, and non-zero otherwise.
The 9099 preg structure is filled in on success, and one member of the structure 9100 is public: re_nsub contains the number of capturing subpatterns in the 9101 regular expression. Various error codes are defined in the header file. 9102 9103 NOTE: If the yield of regcomp() is non-zero, you must not attempt to 9104 use the contents of the preg structure. If, for example, you pass it to 9105 regexec(), the result is undefined and your program is likely to crash. 9106 9107 9108MATCHING NEWLINE CHARACTERS 9109 9110 This area is not simple, because POSIX and Perl take different views of 9111 things. It is not possible to get PCRE to obey POSIX semantics, but 9112 then PCRE was never intended to be a POSIX engine. The following table 9113 lists the different possibilities for matching newline characters in 9114 PCRE: 9115 9116 Default Change with 9117 9118 . matches newline no PCRE_DOTALL 9119 newline matches [^a] yes not changeable 9120 $ matches \n at end yes PCRE_DOLLARENDONLY 9121 $ matches \n in middle no PCRE_MULTILINE 9122 ^ matches \n in middle no PCRE_MULTILINE 9123 9124 This is the equivalent table for POSIX: 9125 9126 Default Change with 9127 9128 . matches newline yes REG_NEWLINE 9129 newline matches [^a] yes REG_NEWLINE 9130 $ matches \n at end no REG_NEWLINE 9131 $ matches \n in middle no REG_NEWLINE 9132 ^ matches \n in middle no REG_NEWLINE 9133 9134 PCRE's behaviour is the same as Perl's, except that there is no equiva- 9135 lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is 9136 no way to stop newline from matching [^a]. 9137 9138 The default POSIX newline handling can be obtained by setting 9139 PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE 9140 behave exactly as for the REG_NEWLINE action. 


MATCHING A PATTERN

       The function regexec() is called to match a compiled pattern preg
       against a given string, which is by default terminated by a zero byte
       (but see REG_STARTEND below), subject to the options in eflags. These
       can be:

         REG_NOTBOL

       The PCRE_NOTBOL option is set when calling the underlying PCRE matching
       function.

         REG_NOTEMPTY

       The PCRE_NOTEMPTY option is set when calling the underlying PCRE
       matching function. Note that REG_NOTEMPTY is not part of the POSIX
       standard. However, setting this option can give more POSIX-like
       behaviour in some situations.

         REG_NOTEOL

       The PCRE_NOTEOL option is set when calling the underlying PCRE matching
       function.

         REG_STARTEND

       The string is considered to start at string + pmatch[0].rm_so and to
       have a terminating NUL located at string + pmatch[0].rm_eo (there need
       not actually be a NUL at that location), regardless of the value of
       nmatch. This is a BSD extension, compatible with but not specified by
       IEEE Standard 1003.2 (POSIX.2), and should be used with caution in
       software intended to be portable to other systems. Note that a non-zero
       rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
       of the string, not how it is matched.

       If the pattern was compiled with the REG_NOSUB flag, no data about any
       matched strings is returned. The nmatch and pmatch arguments of
       regexec() are ignored.

       If the value of nmatch is zero, or if the value of pmatch is NULL, no
       data about any matched strings is returned.

       Otherwise, the portion of the string that was matched, and also any
       captured substrings, are returned via the pmatch argument, which points
       to an array of nmatch structures of type regmatch_t, containing the
       members rm_so and rm_eo.
These contain the offset to the first character 9189 of each substring and the offset to the first character after the end 9190 of each substring, respectively. The 0th element of the vector relates 9191 to the entire portion of string that was matched; subsequent elements 9192 relate to the capturing subpatterns of the regular expression. Unused 9193 entries in the array have both structure members set to -1. 9194 9195 A successful match yields a zero return; various error codes are 9196 defined in the header file, of which REG_NOMATCH is the "expected" 9197 failure code. 9198 9199 9200ERROR MESSAGES 9201 9202 The regerror() function maps a non-zero errorcode from either regcomp() 9203 or regexec() to a printable message. If preg is not NULL, the error 9204 should have arisen from the use of that structure. A message terminated 9205 by a binary zero is placed in errbuf. The length of the message, 9206 including the zero, is limited to errbuf_size. The yield of the func- 9207 tion is the size of buffer needed to hold the whole message. 9208 9209 9210MEMORY USAGE 9211 9212 Compiling a regular expression causes memory to be allocated and asso- 9213 ciated with the preg structure. The function regfree() frees all such 9214 memory, after which preg may no longer be used as a compiled expres- 9215 sion. 9216 9217 9218AUTHOR 9219 9220 Philip Hazel 9221 University Computing Service 9222 Cambridge CB2 3QH, England. 9223 9224 9225REVISION 9226 9227 Last updated: 09 January 2012 9228 Copyright (c) 1997-2012 University of Cambridge. 9229------------------------------------------------------------------------------ 9230 9231 9232PCRECPP(3) PCRECPP(3) 9233 9234 9235NAME 9236 PCRE - Perl-compatible regular expressions. 9237 9238 9239SYNOPSIS OF C++ WRAPPER 9240 9241 #include <pcrecpp.h> 9242 9243 9244DESCRIPTION 9245 9246 The C++ wrapper for PCRE was provided by Google Inc. Some additional 9247 functionality was added by Giuseppe Maxia. 
       This brief man page was con-
       structed from the notes in the pcrecpp.h file, which should be con-
       sulted for further details. Note that the C++ wrapper supports only the
       original 8-bit PCRE library. There is no 16-bit or 32-bit support at
       present.


MATCHING INTERFACE

       The "FullMatch" operation checks that supplied text matches a supplied
       pattern exactly. If pointer arguments are supplied, it copies matched
       sub-strings that match sub-patterns into them.

         Example: successful match
            pcrecpp::RE re("h.*o");
            re.FullMatch("hello");

         Example: unsuccessful match (requires full match):
            pcrecpp::RE re("e");
            !re.FullMatch("hello");

         Example: creating a temporary RE object:
            pcrecpp::RE("h.*o").FullMatch("hello");

       You can pass in a "const char*" or a "string" for "text". The examples
       below tend to use a const char*. You can, as in the different examples
       above, store the RE object explicitly in a variable or use a temporary
       RE object. The examples below use one mode or the other arbitrarily.
       Either could correctly be used for any of these examples.

       You must supply extra pointer arguments to extract matched subpieces.

         Example: extracts "ruby" into "s" and 1234 into "i"
            int i;
            string s;
            pcrecpp::RE re("(\\w+):(\\d+)");
            re.FullMatch("ruby:1234", &s, &i);

         Example: does not try to extract any extra sub-patterns
            re.FullMatch("ruby:1234", &s);

         Example: does not try to extract into NULL
            re.FullMatch("ruby:1234", NULL, &i);

         Example: integer overflow causes failure
            !re.FullMatch("ruby:1234567891234", NULL, &i);

         Example: fails because there aren't enough sub-patterns:
            !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

         Example: fails because string cannot be stored in integer
            !pcrecpp::RE("(.*)").FullMatch("ruby", &i);

       The provided pointer arguments can be pointers to any scalar numeric
       type, or one of:

          string        (matched piece is copied to string)
          StringPiece   (StringPiece is mutated to point to matched piece)
          T             (where "bool T::ParseFrom(const char*, int)" exists)
          NULL          (the corresponding matched sub-pattern is not copied)

       The function returns true iff all of the following conditions are sat-
       isfied:

         a. "text" matches "pattern" exactly;

         b. The number of matched sub-patterns is >= number of supplied
            pointers;

         c. The "i"th argument has a suitable type for holding the
            string captured as the "i"th sub-pattern. If you pass in
            void * NULL for the "i"th argument, or a non-void * NULL
            of the correct type, or pass fewer arguments than the
            number of sub-patterns, the "i"th captured sub-pattern is
            ignored.

       CAVEAT: An optional sub-pattern that does not exist in the matched
       string is assigned the empty string. Therefore, the following will
       return false (because the empty string is not a valid number):

          int number;
          pcrecpp::RE::FullMatch("abc", "[a-z]+(\\d+)?", &number);

       The matching interface supports at most 16 arguments per call.
       If you need more, consider using the more general interface
       pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.

       NOTE: Do not use no_arg, which is used internally to mark the end of a
       list of optional arguments, as a placeholder for missing arguments, as
       this can lead to segfaults.


QUOTING METACHARACTERS

       You can use the "QuoteMeta" operation to insert backslashes before all
       potentially meaningful characters in a string. The returned string,
       used as a regular expression, will exactly match the original string.

         Example:
            string quoted = RE::QuoteMeta(unquoted);

       Note that it's legal to escape a character even if it has no special
       meaning in a regular expression -- so this function does that. (This
       also makes it identical to the Perl function of the same name; see
       "perldoc -f quotemeta".) For example, "1.5-2.0?" becomes
       "1\.5\-2\.0\?".


PARTIAL MATCHES

       You can use the "PartialMatch" operation when you want the pattern to
       match any substring of the text.

         Example: simple search for a string:
            pcrecpp::RE("ell").PartialMatch("hello");

         Example: find first number in a string:
            int number;
            pcrecpp::RE re("(\\d+)");
            re.PartialMatch("x*100 + 20", &number);
            assert(number == 100);


UTF-8 AND THE MATCHING INTERFACE

       By default, pattern and text are plain text, one byte per character.
       The UTF8 flag, passed to the constructor, causes both pattern and
       string to be treated as UTF-8 text, still a byte stream but potentially
       multiple bytes per character. In practice, the text is likelier to be
       UTF-8 than the pattern, but the match returned may depend on the UTF8
       flag, so always use it when matching UTF-8 text. For example, "." will
       match one byte normally but with UTF8 set may match up to four bytes
       of a multi-byte character.

         Example:
            pcrecpp::RE_Options options;
            options.set_utf8();
            pcrecpp::RE re(utf8_pattern, options);
            re.FullMatch(utf8_string);

         Example: using the convenience function UTF8():
            pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
            re.FullMatch(utf8_string);

       NOTE: The UTF8 flag is ignored if pcre was not configured with the
             --enable-utf8 flag.


PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE

       PCRE defines some modifiers to change the behavior of the regular
       expression engine. The C++ wrapper defines an auxiliary class,
       RE_Options, as a vehicle to pass such modifiers to a RE class. Cur-
       rently, the following modifiers are supported:

          modifier              description                 Perl equivalent

          PCRE_CASELESS         case insensitive match      /i
          PCRE_MULTILINE        multiple lines match        /m
          PCRE_DOTALL           dot matches newlines        /s
          PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
          PCRE_EXTRA            strict escape parsing       N/A
          PCRE_EXTENDED         ignore white spaces         /x
          PCRE_UTF8             handles UTF8 chars          built-in
          PCRE_UNGREEDY         reverses * and *?           N/A
          PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)

       (*) Both Perl and PCRE allow non-capturing parentheses by means of the
       "?:" modifier within the pattern itself, e.g. (?:ab|cd) does not cap-
       ture, while (ab|cd) does.

       For a full account of how each modifier works, please check the PCRE
       API reference page.

       For each modifier, there are two member functions whose name is made
       out of the modifier in lowercase, without the "PCRE_" prefix. For
       instance, PCRE_CASELESS is handled by

         bool caseless()

       which returns true if the modifier is set, and

         RE_Options & set_caseless(bool)

       which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
       be accessed through the set_match_limit() and match_limit() member
       functions.
       Setting match_limit to a non-zero value will limit the exe-
       cution of pcre to keep it from doing bad things like blowing the stack
       or taking an eternity to return a result. A value of 5000 is good
       enough to stop stack blowup in a 2MB thread stack. Setting match_limit
       to zero disables match limiting. Alternatively, you can call
       set_match_limit_recursion(), which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION
       to limit how much PCRE recurses. match_limit() limits the number of
       matches PCRE does; match_limit_recursion() limits the depth of internal
       recursion, and therefore the amount of stack that is used.

       Normally, to pass one or more modifiers to a RE class, you declare a
       RE_Options object, set the appropriate options, and pass this object to
       a RE constructor. Example:

          RE_Options opt;
          opt.set_caseless(true);
          if (RE("HELLO", opt).PartialMatch("hello world")) ...

       RE_Options has two constructors. The default constructor takes no argu-
       ments and creates a set of flags that are off by default. The optional
       parameter option_flags is to facilitate transfer of legacy code from C
       programs. This lets you do

          RE(pattern,
            RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);

       However, new code is better off doing

          RE(pattern,
            RE_Options().set_caseless(true).set_multiline(true))
              .PartialMatch(str);

       If you are going to pass one of the most used modifiers, there are some
       convenience functions that return a RE_Options class with the appropri-
       ate modifier already set: CASELESS(), UTF8(), MULTILINE(), DOTALL(),
       and EXTENDED().

       If you need to set several options at once, and you don't want to go
       through the pains of declaring a RE_Options object and setting several
       options, there is a parallel method that gives you such ability on the
       fly.
       You can concatenate several set_xxxxx() member functions, since
       each of them returns a reference to its class object. For example, to
       pass PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
       statement, you may write:

          RE(" ^ xyz \\s+ .* blah$",
            RE_Options()
              .set_caseless(true)
              .set_extended(true)
              .set_multiline(true)).PartialMatch(sometext);


SCANNING TEXT INCREMENTALLY

       The "Consume" operation may be useful if you want to repeatedly match
       regular expressions at the front of a string and skip over them as they
       match. This requires use of the "StringPiece" type, which represents a
       sub-range of a real string. Like RE, StringPiece is defined in the
       pcrecpp namespace.

         Example: read lines of the form "var = value" from a string.
            string contents = ...;                 // Fill string somehow
            pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

            string var;
            int value;
            pcrecpp::RE re("(\\w+) = (\\d+)\n");
            while (re.Consume(&input, &var, &value)) {
              ...;
            }

       Each successful call to "Consume" will set "var/value", and also
       advance "input" so it points past the matched text.

       The "FindAndConsume" operation is similar to "Consume" but does not
       anchor your match at the beginning of the string. For example, you
       could extract all words from a string by repeatedly calling

         pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)


PARSING HEX/OCTAL/C-RADIX NUMBERS

       By default, if you pass a pointer to a numeric value, the corresponding
       text is interpreted as a base-10 number. You can instead wrap the
       pointer with a call to one of the operators Hex(), Octal(), or CRadix()
       to interpret the text in another base. The CRadix operator interprets
       C-style "0" (base-8) and "0x" (base-16) prefixes, but defaults to
       base-10.
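
       As an analogy only, the same three radix rules can be seen in the
       standard C function strtol(), where base 8 and base 16 correspond to
       Octal() and Hex(), and base 0 gives the C-style prefix handling
       described for CRadix(). The helper name parse_in_base() is hypothetical
       and this sketch makes no claim about how the wrapper is implemented.

```c
#include <stdlib.h>

/* Hypothetical helper, for illustration only: parse text in the given
   base. With base 0, strtol() honours a leading "0" (octal) or "0x"
   (hexadecimal) prefix and otherwise treats the text as decimal. */
static long parse_in_base(const char *text, int base)
{
  return strtol(text, NULL, base);
}
```

       With this helper, "100" in base 8, "40" in base 16, and "0100" and
       "0x40" in base 0 all yield the same value, while "100" in base 0 is
       read as decimal.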

         Example:
           int a, b, c, d;
           pcrecpp::RE re("(.*) (.*) (.*) (.*)");
           re.FullMatch("100 40 0100 0x40",
                        pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                        pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));

       will leave 64 in a, b, c, and d.


REPLACING PARTS OF STRINGS

       You can replace the first match of "pattern" in "str" with "rewrite".
       Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
       insert text matching the corresponding parenthesized group from the
       pattern. \0 in "rewrite" refers to the entire matching text. For
       example:

         string s = "yabba dabba doo";
         pcrecpp::RE("b+").Replace("d", &s);

       will leave "s" containing "yada dabba doo". The result is true if the
       pattern matches and a replacement occurs, false otherwise.

       GlobalReplace is like Replace except that it replaces all occurrences
       of the pattern in the string with the rewrite. Replacements are not
       subject to re-matching. For example:

         string s = "yabba dabba doo";
         pcrecpp::RE("b+").GlobalReplace("d", &s);

       will leave "s" containing "yada dada doo". It returns the number of
       replacements made.

       Extract is like Replace, except that if the pattern matches, "rewrite"
       is copied into "out" (an additional argument) with substitutions. The
       non-matching portions of "text" are ignored. Returns true iff a match
       occurred and the extraction happened successfully; if no match occurs,
       the string is left unaffected.


AUTHOR

       The C++ wrapper was contributed by Google Inc.
       Copyright (c) 2007 Google Inc.


REVISION

       Last updated: 08 January 2012
------------------------------------------------------------------------------


PCRESAMPLE(3)                                                    PCRESAMPLE(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE SAMPLE PROGRAM

       A simple, complete demonstration program, to get you started with using
       PCRE, is supplied in the file pcredemo.c in the PCRE distribution. A
       listing of this program is given in the pcredemo documentation. If you
       do not have a copy of the PCRE distribution, you can save this listing
       to re-create pcredemo.c.

       The demonstration program, which uses the original PCRE 8-bit library,
       compiles the regular expression that is its first argument, and matches
       it against the subject string in its second argument. No PCRE options
       are set, and default character tables are used. If matching succeeds,
       the program outputs the portion of the subject that matched, together
       with the contents of any captured substrings.

       If the -g option is given on the command line, the program then goes on
       to check for further matches of the same regular expression in the same
       subject string. The logic is a little bit tricky because of the possi-
       bility of matching an empty string. Comments in the code explain what
       is going on.

       If PCRE is installed in the standard include and library directories
       for your operating system, you should be able to compile the demonstra-
       tion program using this command:

         gcc -o pcredemo pcredemo.c -lpcre

       If PCRE is installed elsewhere, you may need to add additional options
       to the command line.
       For example, on a Unix-like system that has PCRE
       installed in /usr/local, you can compile the demonstration program
       using a command like this:

         gcc -o pcredemo -I/usr/local/include pcredemo.c \
             -L/usr/local/lib -lpcre

       In a Windows environment, if you want to statically link the program
       against a non-dll pcre.a file, you must uncomment the line that defines
       PCRE_STATIC before including pcre.h, because otherwise the pcre_mal-
       loc() and pcre_free() exported functions will be declared
       __declspec(dllimport), with unwanted results.

       Once you have compiled and linked the demonstration program, you can
       run simple tests like this:

         ./pcredemo 'cat|dog' 'the cat sat on the mat'
         ./pcredemo -g 'cat|dog' 'the dog sat on the cat'

       Note that there is a much more comprehensive test program, called
       pcretest, which supports many more facilities for testing regular
       expressions and all three PCRE libraries. The pcredemo program is
       provided as a simple coding example.

       If you try to run pcredemo when PCRE is not installed in the standard
       library directory, you may get an error like this on some operating
       systems (e.g. Solaris):

         ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
         directory

       This is caused by the way shared library support works on those sys-
       tems. You need to add

         -R/usr/local/lib

       (for example) to the compile command to get round this problem.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 10 January 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------
PCRELIMITS(3)                                                    PCRELIMITS(3)


NAME
       PCRE - Perl-compatible regular expressions


SIZE AND OTHER LIMITATIONS

       There are some size limitations in PCRE but it is hoped that they will
       never in practice be relevant.

       The maximum length of a compiled pattern is approximately 64K data
       units (bytes for the 8-bit library, 16-bit units for the 16-bit
       library, and 32-bit units for the 32-bit library) if PCRE is compiled
       with the default internal linkage size of 2 bytes. If you want to
       process regular expressions that are truly enormous, you can compile
       PCRE with an internal linkage size of 3 or 4 (when building the 16-bit
       or 32-bit library, 3 is rounded up to 4). See the README file in the
       source distribution and the pcrebuild documentation for details. In
       these cases the limit is substantially larger. However, the speed of
       execution is slower.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but there
       can be no more than 65535 capturing subpatterns.

       There is a limit to the number of forward references to subsequent sub-
       patterns of around 200,000. Repeated forward references with fixed
       upper limits, for example, (?2){0,100} when subpattern number 2 is to
       the right, are included in the count. There is no limit to the number
       of backward references.

       The maximum length of name for a named subpattern is 32 characters, and
       the maximum number of named subpatterns is 10000.

       The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or
       (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit and
       32-bit libraries.

       The maximum length of a subject string is the largest positive number
       that an integer variable can hold.
       However, when using the traditional
       matching function, PCRE uses recursion to handle subpatterns and indef-
       inite repetition. This means that the available stack space may limit
       the size of a subject string that can be processed by certain patterns.
       For a discussion of stack issues, see the pcrestack documentation.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 04 May 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------


PCRESTACK(3)                                                      PCRESTACK(3)


NAME
       PCRE - Perl-compatible regular expressions


PCRE DISCUSSION OF STACK USAGE

       When you call pcre[16|32]_exec(), it makes use of an internal function
       called match(). This calls itself recursively at branch points in the
       pattern, in order to remember the state of the match so that it can
       back up and try a different alternative if the first one fails. As
       matching proceeds deeper and deeper into the tree of possibilities, the
       recursion depth increases. The match() function is also called in other
       circumstances, for example, whenever a parenthesized sub-pattern is
       entered, and in certain cases of repetition.

       Not all calls of match() increase the recursion depth; for an item such
       as a* it may be called several times at the same level, after matching
       different numbers of a's. Furthermore, in a number of cases where the
       result of the recursive call would immediately be passed back as the
       result of the current call (a "tail recursion"), the function is just
       restarted instead.

       The above comments apply when pcre[16|32]_exec() is run in its normal
       interpretive manner.
       If the pattern was studied with the
       PCRE_STUDY_JIT_COMPILE option, and just-in-time compiling was success-
       ful, and the options passed to pcre[16|32]_exec() were not incompati-
       ble, the matching process uses the JIT-compiled code instead of the
       match() function. In this case, the memory requirements are handled
       entirely differently. See the pcrejit documentation for details.

       The pcre[16|32]_dfa_exec() function operates in an entirely different
       way, and uses recursion only when there is a regular expression recur-
       sion or subroutine call in the pattern. This includes the processing of
       assertion and "once-only" subpatterns, which are handled like subrou-
       tine calls. Normally, these are never very deep, and the limit on the
       complexity of pcre[16|32]_dfa_exec() is controlled by the amount of
       workspace it is given. However, it is possible to write patterns with
       runaway infinite recursions; such patterns will cause
       pcre[16|32]_dfa_exec() to run out of stack. At present, there is no
       protection against this.

       The comments that follow do NOT apply to pcre[16|32]_dfa_exec(); they
       are relevant only for pcre[16|32]_exec() without the JIT optimization.

   Reducing pcre[16|32]_exec()'s stack usage

       Each time that match() is actually called recursively, it uses memory
       from the process stack. For certain kinds of pattern and data, very
       large amounts of stack may be needed, despite the recognition of "tail
       recursion". You can often reduce the amount of recursion, and there-
       fore the amount of stack used, by modifying the pattern that is being
       matched. Consider, for example, this pattern:

         ([^<]|<(?!inet))+

       It matches from wherever it starts until it encounters "<inet" or the
       end of the data, and is the kind of pattern that might be used when
       processing an XML file.
       Each iteration of the outer parentheses matches
       either one character that is not "<" or a "<" that is not followed by
       "inet". However, each time a parenthesis is processed, a recursion
       occurs, so this formulation uses a stack frame for each matched charac-
       ter. For a long string, a lot of stack is required. Consider now this
       rewritten pattern, which matches exactly the same strings:

         ([^<]++|<(?!inet))+

       This uses very much less stack, because runs of characters that do not
       contain "<" are "swallowed" in one item inside the parentheses. Recur-
       sion happens only when a "<" character that is not followed by "inet"
       is encountered (and we assume this is relatively rare). A possessive
       quantifier is used to stop any backtracking into the runs of non-"<"
       characters, but that is not related to stack usage.

       This example shows that one way of avoiding stack problems when match-
       ing long subject strings is to write repeated parenthesized subpatterns
       to match more than one character whenever possible.

   Compiling PCRE to use heap instead of stack for pcre[16|32]_exec()

       In environments where stack memory is constrained, you might want to
       compile PCRE to use heap memory instead of stack for remembering back-
       up points when pcre[16|32]_exec() is running. This makes it run a lot
       more slowly, however. Details of how to do this are given in the pcre-
       build documentation. When built in this way, instead of using the
       stack, PCRE obtains and frees memory by calling the functions that are
       pointed to by the pcre[16|32]_stack_malloc and pcre[16|32]_stack_free
       variables. By default, these point to malloc() and free(), but you can
       replace the pointers to cause PCRE to use your own functions.
       Since the
       block sizes are always the same, and are always freed in reverse order,
       it may be possible to implement customized memory handlers that are
       more efficient than the standard functions.

   Limiting pcre[16|32]_exec()'s stack usage

       You can set limits on the number of times that match() is called, both
       in total and recursively. If a limit is exceeded, pcre[16|32]_exec()
       returns an error code. Setting suitable limits should prevent it from
       running out of stack. The default values of the limits are very large,
       and unlikely ever to operate. They can be changed when PCRE is built,
       and they can also be set when pcre[16|32]_exec() is called. For details
       of these interfaces, see the pcrebuild documentation and the section on
       extra data for pcre[16|32]_exec() in the pcreapi documentation.

       As a very rough rule of thumb, you should reckon on about 500 bytes per
       recursion. Thus, if you want to limit your stack usage to 8Mb, you
       should set the limit at 16000 recursions. A 64Mb stack, on the other
       hand, can support around 128000 recursions.

       In Unix-like environments, the pcretest test program has a command line
       option (-S) that can be used to increase the size of its stack. As long
       as the stack is large enough, another option (-M) can be used to find
       the smallest limits that allow a particular pattern to match a given
       subject string. This is done by calling pcre[16|32]_exec() repeatedly
       with different limits.

   Obtaining an estimate of stack usage

       The actual amount of stack used per recursion can vary quite a lot,
       depending on the compiler that was used to build PCRE and the optimiza-
       tion or debugging options that were set for it. The rule of thumb value
       of 500 bytes mentioned above may be larger or smaller than what is
       actually needed.
       A better approximation can be obtained by running this
       command:

         pcretest -m -C

       The -C option causes pcretest to output information about the options
       with which PCRE was compiled. When -m is also given (before -C), infor-
       mation about stack use is given in a line like this:

         Match recursion uses stack: approximate frame size = 640 bytes

       The value is approximate because some recursions need a bit more (up to
       perhaps 16 more bytes).

       If the above command is given when PCRE is compiled to use the heap
       instead of the stack for recursion, the value that is output is the
       size of each block that is obtained from the heap.

   Changing stack size in Unix-like systems

       In Unix-like environments, there is not often a problem with the stack
       unless very long strings are involved, though the default limit on
       stack size varies from system to system. Values from 8Mb to 64Mb are
       common. You can find your default limit by running the command:

         ulimit -s

       Unfortunately, the effect of running out of stack is often SIGSEGV,
       though sometimes a more explicit error message is given. You can nor-
       mally increase the limit on stack size by code such as this:

         struct rlimit rlim;
         getrlimit(RLIMIT_STACK, &rlim);
         rlim.rlim_cur = 100*1024*1024;
         setrlimit(RLIMIT_STACK, &rlim);

       This reads the current limits (soft and hard) using getrlimit(), then
       attempts to increase the soft limit to 100Mb using setrlimit(). You
       must do this before calling pcre[16|32]_exec().

   Changing stack size in Mac OS X

       Using setrlimit(), as described above, should also work on Mac OS X. It
       is also possible to set a stack size when linking a program. There is a
       discussion about stack sizes in Mac OS X at this web site:
       http://developer.apple.com/qa/qa2005/qa1419.html.
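
   A sketch combining the steps above

       The advice in the preceding sections can be combined roughly as
       follows. This is a minimal sketch under stated assumptions, not part of
       PCRE: the helper names recursion_limit_for() and raise_stack_limit()
       and the reserve_bytes parameter are hypothetical. The first helper
       applies the rule-of-thumb arithmetic (stack budget divided by frame
       size); the second is a more defensive form of the setrlimit() fragment
       that checks return values and respects the hard limit.

```c
#include <sys/resource.h>

/* Hypothetical helper: estimate how many match() recursions fit in a
   stack budget. frame_bytes is an approximate frame size, for example the
   figure reported by "pcretest -m -C"; reserve_bytes is stack kept back
   for the rest of the program. Returns 0 if nothing sensible fits. */
static unsigned long recursion_limit_for(unsigned long stack_bytes,
                                         unsigned long frame_bytes,
                                         unsigned long reserve_bytes)
{
  if (frame_bytes == 0 || stack_bytes <= reserve_bytes) return 0;
  return (stack_bytes - reserve_bytes) / frame_bytes;
}

/* Hypothetical helper: try to raise the soft stack limit to at least
   "wanted" bytes, clamped to the hard limit. Returns 0 on success (or if
   the soft limit is already large enough), -1 on failure. */
static int raise_stack_limit(rlim_t wanted)
{
  struct rlimit rlim;

  if (getrlimit(RLIMIT_STACK, &rlim) != 0) return -1;

  /* Nothing to do if the soft limit is unlimited or already sufficient. */
  if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted) return 0;

  /* The soft limit can never exceed the hard limit. */
  if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
    wanted = rlim.rlim_max;

  rlim.rlim_cur = wanted;
  return setrlimit(RLIMIT_STACK, &rlim);
}
```

       With the rule-of-thumb figures, recursion_limit_for(8000000, 500, 0)
       gives 16000, matching the estimate quoted earlier. A value computed in
       this way is the kind of number that could be used as the recursion
       limit passed in the extra data for pcre[16|32]_exec(), as described in
       the pcreapi documentation.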


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.


REVISION

       Last updated: 24 June 2012
       Copyright (c) 1997-2012 University of Cambridge.
------------------------------------------------------------------------------

