1Technical Notes about PCRE 2-------------------------- 3 4These are very rough technical notes that record potentially useful information 5about PCRE internals. For information about testing PCRE, see the pcretest 6documentation and the comment at the head of the RunTest file. 7 8 9Historical note 1 10----------------- 11 12Many years ago I implemented some regular expression functions to an algorithm 13suggested by Martin Richards. These were not Unix-like in form, and were quite 14restricted in what they could do by comparison with Perl. The interesting part 15about the algorithm was that the amount of space required to hold the compiled 16form of an expression was known in advance. The code to apply an expression did 17not operate by backtracking, as the original Henry Spencer code and current 18Perl code does, but instead checked all possibilities simultaneously by keeping 19a list of current states and checking all of them as it advanced through the 20subject string. In the terminology of Jeffrey Friedl's book, it was a "DFA 21algorithm", though it was not a traditional Finite State Machine (FSM). When 22the pattern was all used up, all remaining states were possible matches, and 23the one matching the longest subset of the subject string was chosen. This did 24not necessarily maximize the individual wild portions of the pattern, as is 25expected in Unix and Perl-style regular expressions. 26 27 28Historical note 2 29----------------- 30 31By contrast, the code originally written by Henry Spencer (which was 32subsequently heavily modified for Perl) compiles the expression twice: once in 33a dummy mode in order to find out how much store will be needed, and then for 34real. (The Perl version probably doesn't do this any more; I'm talking about 35the original library.) The execution function operates by backtracking and 36maximizing (or, optionally, minimizing in Perl) the amount of the subject that 37matches individual wild portions of the pattern. This is an "NFA algorithm" in 38Friedl's terminology. 39 40 41OK, here's the real stuff 42------------------------- 43 44For the set of functions that form the "basic" PCRE library (which are 45unrelated to those mentioned above), I tried at first to invent an algorithm 46that used an amount of store bounded by a multiple of the number of characters 47in the pattern, to save on compiling time. However, because of the greater 48complexity in Perl regular expressions, I couldn't do this. In any case, a 49first pass through the pattern is helpful for other reasons. 50 51 52Support for 16-bit and 32-bit data strings 53------------------------------------------- 54 55From release 8.30, PCRE supports 16-bit as well as 8-bit data strings; and from 56release 8.32, PCRE supports 32-bit data strings. The library can be compiled 57in any combination of 8-bit, 16-bit or 32-bit modes, creating different 58libraries. In the description that follows, the word "short" is 59used for a 16-bit data quantity, and the word "unit" is used for a quantity 60that is a byte in 8-bit mode, a short in 16-bit mode and a 32-bit unsigned 61integer in 32-bit mode. However, so as not to over-complicate the text, the 62names of PCRE functions are given in 8-bit form only. 63 64 65Computing the memory requirement: how it was 66-------------------------------------------- 67 68Up to and including release 6.7, PCRE worked by running a very degenerate first 69pass to calculate a maximum store size, and then a second pass to do the real 70compile - which might use a bit less than the predicted amount of memory. The 71idea was that this would turn out faster than the Henry Spencer code because 72the first pass is degenerate and the second pass can just store stuff straight 73into the vector, which it knows is big enough. 74 75 76Computing the memory requirement: how it is 77------------------------------------------- 78 79By the time I was working on a potential 6.8 release, the degenerate first pass 80had become very complicated and hard to maintain. Indeed one of the early 81things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then 82I had a flash of inspiration as to how I could run the real compile function in 83a "fake" mode that enables it to compute how much memory it would need, while 84actually only ever using a few hundred bytes of working memory, and without too 85many tests of the mode that might slow it down. So I refactored the compiling 86functions to work this way. This got rid of about 600 lines of source. It 87should make future maintenance and development easier. As this was such a major 88change, I never released 6.8, instead upping the number to 7.0 (other quite 89major changes were also present in the 7.0 release). 90 91A side effect of this work was that the previous limit of 200 on the nesting 92depth of parentheses was removed. However, there is a downside: pcre_compile() 93runs more slowly than before (30% or more, depending on the pattern) because it 94is doing a full analysis of the pattern. My hope was that this would not be a 95big issue, and in the event, nobody has commented on it. 96 97 98Traditional matching function 99----------------------------- 100 101The "traditional", and original, matching function is called pcre_exec(), and 102it implements an NFA algorithm, similar to the original Henry Spencer algorithm 103and the way that Perl works. This is not surprising, since it is intended to be 104as compatible with Perl as possible. This is the function most users of PCRE 105will use most of the time. From release 8.20, if PCRE is compiled with 106just-in-time (JIT) support, and studying a compiled pattern with JIT is 107successful, the JIT code is run instead of the normal pcre_exec() code, but the 108result is the same. 109 110 111Supplementary matching function 112------------------------------- 113 114From PCRE 6.0, there is also a supplementary matching function called 115pcre_dfa_exec(). This implements a DFA matching algorithm that searches 116simultaneously for all possible matches that start at one point in the subject 117string. (Going back to my roots: see Historical Note 1 above.) This function 118intreprets the same compiled pattern data as pcre_exec(); however, not all the 119facilities are available, and those that are do not always work in quite the 120same way. See the user documentation for details. 121 122The algorithm that is used for pcre_dfa_exec() is not a traditional FSM, 123because it may have a number of states active at one time. More work would be 124needed at compile time to produce a traditional FSM where only one state is 125ever active at once. I believe some other regex matchers work this way. 126 127 128Changeable options 129------------------ 130 131The /i, /m, or /s options (PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL) may 132change in the middle of patterns. From PCRE 8.13, their processing is handled 133entirely at compile time by generating different opcodes for the different 134settings. The runtime functions do not need to keep track of an options state 135any more. 136 137 138Format of compiled patterns 139--------------------------- 140 141The compiled form of a pattern is a vector of units (bytes in 8-bit mode, or 142shorts in 16-bit mode, 32-bit unsigned integers in 32-bit mode), containing 143items of variable length. The first unit in an item contains an opcode, and 144the length of the item is either implicit in the opcode or contained in the 145data that follows it. 146 147In many cases listed below, LINK_SIZE data values are specified for offsets 148within the compiled pattern. LINK_SIZE always specifies a number of bytes. The 149default value for LINK_SIZE is 2, but PCRE can be compiled to use 3-byte or 1504-byte values for these offsets, although this impairs the performance. (3-byte 151LINK_SIZE values are available only in 8-bit mode.) Specifing a LINK_SIZE 152larger than 2 is necessary only when patterns whose compiled length is greater 153than 64K are going to be processed. In this description, we assume the "normal" 154compilation options. Data values that are counts (e.g. for quantifiers) are 155always just two bytes long (one short in 16-bit mode). 156 157Opcodes with no following data 158------------------------------ 159 160These items are all just one unit long 161 162 OP_END end of pattern 163 OP_ANY match any one character other than newline 164 OP_ALLANY match any one character, including newline 165 OP_ANYBYTE match any single byte, even in UTF-8 mode 166 OP_SOD match start of data: \A 167 OP_SOM, start of match (subject + offset): \G 168 OP_SET_SOM, set start of match (\K) 169 OP_CIRC ^ (start of data) 170 OP_CIRCM ^ multiline mode (start of data or after newline) 171 OP_NOT_WORD_BOUNDARY \W 172 OP_WORD_BOUNDARY \w 173 OP_NOT_DIGIT \D 174 OP_DIGIT \d 175 OP_NOT_HSPACE \H 176 OP_HSPACE \h 177 OP_NOT_WHITESPACE \S 178 OP_WHITESPACE \s 179 OP_NOT_VSPACE \V 180 OP_VSPACE \v 181 OP_NOT_WORDCHAR \W 182 OP_WORDCHAR \w 183 OP_EODN match end of data or \n at end: \Z 184 OP_EOD match end of data: \z 185 OP_DOLL $ (end of data, or before final newline) 186 OP_DOLLM $ multiline mode (end of data or before newline) 187 OP_EXTUNI match an extended Unicode character 188 OP_ANYNL match any Unicode newline sequence 189 190 OP_ACCEPT ) These are Perl 5.10's "backtracking control 191 OP_COMMIT ) verbs". If OP_ACCEPT is inside capturing 192 OP_FAIL ) parentheses, it may be preceded by one or more 193 OP_PRUNE ) OP_CLOSE, followed by a 2-byte number, 194 OP_SKIP ) indicating which parentheses must be closed. 195 196 197Backtracking control verbs with (optional) data 198----------------------------------------------- 199 200(*THEN) without an argument generates the opcode OP_THEN and no following data. 201OP_MARK is followed by the mark name, preceded by a one-unit length, and 202followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments, 203the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used, with the name 204following in the same format. 205 206 207Matching literal characters 208--------------------------- 209 210The OP_CHAR opcode is followed by a single character that is to be matched 211casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes, 212the character may be more than one unit long. In UTF-32 mode, characters 213are always exactly one unit long. 214 215 216Repeating single characters 217--------------------------- 218 219The common repeats (*, +, ?), when applied to a single character, use the 220following opcodes, which come in caseful and caseless versions: 221 222 Caseful Caseless 223 OP_STAR OP_STARI 224 OP_MINSTAR OP_MINSTARI 225 OP_POSSTAR OP_POSSTARI 226 OP_PLUS OP_PLUSI 227 OP_MINPLUS OP_MINPLUSI 228 OP_POSPLUS OP_POSPLUSI 229 OP_QUERY OP_QUERYI 230 OP_MINQUERY OP_MINQUERYI 231 OP_POSQUERY OP_POSQUERYI 232 233Each opcode is followed by the character that is to be repeated. In ASCII mode, 234these are two-unit items; in UTF-8 or UTF-16 modes, the length is variable; in 235UTF-32 mode these are one-unit items. 236Those with "MIN" in their names are the minimizing versions. Those with "POS" 237in their names are possessive versions. Other repeats make use of these 238opcodes: 239 240 Caseful Caseless 241 OP_UPTO OP_UPTOI 242 OP_MINUPTO OP_MINUPTOI 243 OP_POSUPTO OP_POSUPTOI 244 OP_EXACT OP_EXACTI 245 246Each of these is followed by a two-byte (one short) count (most significant 247byte first in 8-bit mode) and then the repeated character. OP_UPTO matches from 2480 to the given number. A repeat with a non-zero minimum and a fixed maximum is 249coded as an OP_EXACT followed by an OP_UPTO (or OP_MINUPTO or OPT_POSUPTO). 250 251 252Repeating character types 253------------------------- 254 255Repeats of things like \d are done exactly as for single characters, except 256that instead of a character, the opcode for the type is stored in the data 257unit. The opcodes are: 258 259 OP_TYPESTAR 260 OP_TYPEMINSTAR 261 OP_TYPEPOSSTAR 262 OP_TYPEPLUS 263 OP_TYPEMINPLUS 264 OP_TYPEPOSPLUS 265 OP_TYPEQUERY 266 OP_TYPEMINQUERY 267 OP_TYPEPOSQUERY 268 OP_TYPEUPTO 269 OP_TYPEMINUPTO 270 OP_TYPEPOSUPTO 271 OP_TYPEEXACT 272 273 274Match by Unicode property 275------------------------- 276 277OP_PROP and OP_NOTPROP are used for positive and negative matches of a 278character by testing its Unicode property (the \p and \P escape sequences). 279Each is followed by two units that encode the desired property as a type and a 280value. 281 282Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by 283three units: OP_PROP or OP_NOTPROP, and then the desired property type and 284value. 285 286 287Character classes 288----------------- 289 290If there is only one character in the class, OP_CHAR or OP_CHARI is used for a 291positive class, and OP_NOT or OP_NOTI for a negative one (that is, for 292something like [^a]). 293 294Another set of 13 repeating opcodes (called OP_NOTSTAR etc.) are used for 295repeated, negated, single-character classes. The normal single-character 296opcodes (OP_STAR, etc.) are used for repeated positive single-character 297classes. 298 299When there is more than one character in a class and all the characters are 300less than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a 301negative one. In either case, the opcode is followed by a 32-byte (16-short) 302bit map containing a 1 bit for every character that is acceptable. The bits are 303counted from the least significant end of each unit. In caseless mode, bits for 304both cases are set. 305 306The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16/32 mode, 307subject characters with values greater than 255 can be handled correctly. For 308OP_CLASS they do not match, whereas for OP_NCLASS they do. 309 310For classes containing characters with values greater than 255, OP_XCLASS is 311used. It optionally uses a bit map (if any characters lie within it), followed 312by a list of pairs (for a range) and single characters. In caseless mode, both 313cases are explicitly listed. There is a flag character than indicates whether 314it is a positive or a negative class. 315 316 317Back references 318--------------- 319 320OP_REF (caseful) or OP_REFI (caseless) is followed by two bytes (one short) 321containing the reference number. 322 323 324Repeating character classes and back references 325----------------------------------------------- 326 327Single-character classes are handled specially (see above). This section 328applies to OP_CLASS and OP_REF[I]. In both cases, the repeat information 329follows the base item. The matching code looks at the following opcode to see 330if it is one of 331 332 OP_CRSTAR 333 OP_CRMINSTAR 334 OP_CRPLUS 335 OP_CRMINPLUS 336 OP_CRQUERY 337 OP_CRMINQUERY 338 OP_CRRANGE 339 OP_CRMINRANGE 340 341All but the last two are just single-unit items. The others are followed by 342four bytes (two shorts) of data, comprising the minimum and maximum repeat 343counts. There are no special possessive opcodes for these repeats; a possessive 344repeat is compiled into an atomic group. 345 346 347Brackets and alternation 348------------------------ 349 350A pair of non-capturing (round) brackets is wrapped round each expression at 351compile time, so alternation always happens in the context of brackets. 352 353[Note for North Americans: "bracket" to some English speakers, including 354myself, can be round, square, curly, or pointy. Hence this usage rather than 355"parentheses".] 356 357Non-capturing brackets use the opcode OP_BRA. Originally PCRE was limited to 99 358capturing brackets and it used a different opcode for each one. From release 3593.5, the limit was removed by putting the bracket number into the data for 360higher-numbered brackets. From release 7.0 all capturing brackets are handled 361this way, using the single opcode OP_CBRA. 362 363A bracket opcode is followed by LINK_SIZE bytes which give the offset to the 364next alternative OP_ALT or, if there aren't any branches, to the matching 365OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to 366the next one, or to the OP_KET opcode. For capturing brackets, the bracket 367number immediately follows the offset, always as a 2-byte (one short) item. 368 369OP_KET is used for subpatterns that do not repeat indefinitely, and 370OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or 371maximally respectively (see below for possessive repetitions). All three are 372followed by LINK_SIZE bytes giving (as a positive number) the offset back to 373the matching bracket opcode. 374 375If a subpattern is quantified such that it is permitted to match zero times, it 376is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are 377single-unit opcodes that tell the matcher that skipping the following 378subpattern entirely is a valid branch. In the case of the first two, not 379skipping the pattern is also valid (greedy and non-greedy). The third is used 380when a pattern has the quantifier {0,0}. It cannot be entirely discarded, 381because it may be called as a subroutine from elsewhere in the regex. 382 383A subpattern with an indefinite maximum repetition is replicated in the 384compiled data its minimum number of times (or once with OP_BRAZERO if the 385minimum is zero), with the final copy terminating with OP_KETRMIN or OP_KETRMAX 386as appropriate. 387 388A subpattern with a bounded maximum repetition is replicated in a nested 389fashion up to the maximum number of times, with OP_BRAZERO or OP_BRAMINZERO 390before each replication after the minimum, so that, for example, (abc){2,5} is 391compiled as (abc)(abc)((abc)((abc)(abc)?)?)?, except that each bracketed group 392has the same number. 393 394When a repeated subpattern has an unbounded upper limit, it is checked to see 395whether it could match an empty string. If this is the case, the opcode in the 396final replication is changed to OP_SBRA or OP_SCBRA. This tells the matcher 397that it needs to check for matching an empty string when it hits OP_KETRMIN or 398OP_KETRMAX, and if so, to break the loop. 399 400Possessive brackets 401------------------- 402 403When a repeated group (capturing or non-capturing) is marked as possessive by 404the "+" notation, e.g. (abc)++, different opcodes are used. Their names all 405have POS on the end, e.g. OP_BRAPOS instead of OP_BRA and OP_SCPBRPOS instead 406of OP_SCBRA. The end of such a group is marked by OP_KETRPOS. If the minimum 407repetition is zero, the group is preceded by OP_BRAPOSZERO. 408 409 410Assertions 411---------- 412 413Forward assertions are just like other subpatterns, but starting with one of 414the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes 415OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion 416is OP_REVERSE, followed by a two byte (one short) count of the number of 417characters to move back the pointer in the subject string. In ASCII mode, the 418count is a number of units, but in UTF-8/16 mode each character may occupy more 419than one unit; in UTF-32 mode each character occupies exactly one unit. 420A separate count is present in each alternative of a lookbehind 421assertion, allowing them to have different fixed lengths. 422 423 424Once-only (atomic) subpatterns 425------------------------------ 426 427These are also just like other subpatterns, but they start with the opcode 428OP_ONCE. The check for matching an empty string in an unbounded repeat is 429handled entirely at runtime, so there is just this one opcode. 430 431 432Conditional subpatterns 433----------------------- 434 435These are like other subpatterns, but they start with the opcode OP_COND, or 436OP_SCOND for one that might match an empty string in an unbounded repeat. If 437the condition is a back reference, this is stored at the start of the 438subpattern using the opcode OP_CREF followed by two bytes (one short) 439containing the reference number. OP_NCREF is used instead if the reference was 440generated by name (so that the runtime code knows to check for duplicate 441names). 442 443If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of 444group x" (coded as "(?(Rx)"), the group number is stored at the start of the 445subpattern using the opcode OP_RREF or OP_NRREF (cf OP_NCREF), and a value of 446zero for "the whole pattern". For a DEFINE condition, just the single unit 447OP_DEF is used (it has no associated data). Otherwise, a conditional subpattern 448always starts with one of the assertions. 449 450 451Recursion 452--------- 453 454Recursion either matches the current regex, or some subexpression. The opcode 455OP_RECURSE is followed by an value which is the offset to the starting bracket 456from the start of the whole pattern. From release 6.5, OP_RECURSE is 457automatically wrapped inside OP_ONCE brackets (because otherwise some patterns 458broke it). OP_RECURSE is also used for "subroutine" calls, even though they 459are not strictly a recursion. 460 461 462Callout 463------- 464 465OP_CALLOUT is followed by one unit of data that holds a callout number in the 466range 0 to 254 for manual callouts, or 255 for an automatic callout. In both 467cases there follows a two-byte (one short) value giving the offset in the 468pattern to the start of the following item, and another two-byte (one short) 469item giving the length of the next item. 470 471 472Philip Hazel 473February 2012 474