xref: /PHP-5.3/ext/pcre/pcrelib/HACKING (revision 357ab3cb)
1Technical Notes about PCRE
2--------------------------
3
4These are very rough technical notes that record potentially useful information
5about PCRE internals. For information about testing PCRE, see the pcretest
6documentation and the comment at the head of the RunTest file.
7
8
9Historical note 1
10-----------------
11
12Many years ago I implemented some regular expression functions to an algorithm
13suggested by Martin Richards. These were not Unix-like in form, and were quite
14restricted in what they could do by comparison with Perl. The interesting part
15about the algorithm was that the amount of space required to hold the compiled
16form of an expression was known in advance. The code to apply an expression did
17not operate by backtracking, as the original Henry Spencer code and current
18Perl code does, but instead checked all possibilities simultaneously by keeping
19a list of current states and checking all of them as it advanced through the
20subject string. In the terminology of Jeffrey Friedl's book, it was a "DFA
21algorithm", though it was not a traditional Finite State Machine (FSM). When
22the pattern was all used up, all remaining states were possible matches, and
23the one matching the longest subset of the subject string was chosen. This did
24not necessarily maximize the individual wild portions of the pattern, as is
25expected in Unix and Perl-style regular expressions.
26
27
28Historical note 2
29-----------------
30
31By contrast, the code originally written by Henry Spencer (which was
32subsequently heavily modified for Perl) compiles the expression twice: once in
33a dummy mode in order to find out how much store will be needed, and then for
34real. (The Perl version probably doesn't do this any more; I'm talking about
35the original library.) The execution function operates by backtracking and
36maximizing (or, optionally, minimizing in Perl) the amount of the subject that
37matches individual wild portions of the pattern. This is an "NFA algorithm" in
38Friedl's terminology.
39
40
41OK, here's the real stuff
42-------------------------
43
44For the set of functions that form the "basic" PCRE library (which are
45unrelated to those mentioned above), I tried at first to invent an algorithm
46that used an amount of store bounded by a multiple of the number of characters
47in the pattern, to save on compiling time. However, because of the greater
48complexity in Perl regular expressions, I couldn't do this. In any case, a
49first pass through the pattern is helpful for other reasons.
50
51
52Support for 16-bit and 32-bit data strings
53-------------------------------------------
54
55From release 8.30, PCRE supports 16-bit as well as 8-bit data strings; and from
56release 8.32, PCRE supports 32-bit data strings. The library can be compiled
57in any combination of 8-bit, 16-bit or 32-bit modes, creating different
58libraries. In the description that follows, the word "short" is
59used for a 16-bit data quantity, and the word "unit" is used for a quantity
60that is a byte in 8-bit mode, a short in 16-bit mode and a 32-bit unsigned
61integer in 32-bit mode. However, so as not to over-complicate the text, the
62names of PCRE functions are given in 8-bit form only.
63
64
65Computing the memory requirement: how it was
66--------------------------------------------
67
68Up to and including release 6.7, PCRE worked by running a very degenerate first
69pass to calculate a maximum store size, and then a second pass to do the real
70compile - which might use a bit less than the predicted amount of memory. The
71idea was that this would turn out faster than the Henry Spencer code because
72the first pass is degenerate and the second pass can just store stuff straight
73into the vector, which it knows is big enough.
74
75
76Computing the memory requirement: how it is
77-------------------------------------------
78
79By the time I was working on a potential 6.8 release, the degenerate first pass
80had become very complicated and hard to maintain. Indeed one of the early
81things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
82I had a flash of inspiration as to how I could run the real compile function in
83a "fake" mode that enables it to compute how much memory it would need, while
84actually only ever using a few hundred bytes of working memory, and without too
85many tests of the mode that might slow it down. So I refactored the compiling
86functions to work this way. This got rid of about 600 lines of source. It
87should make future maintenance and development easier. As this was such a major
88change, I never released 6.8, instead upping the number to 7.0 (other quite
89major changes were also present in the 7.0 release).
90
91A side effect of this work was that the previous limit of 200 on the nesting
92depth of parentheses was removed. However, there is a downside: pcre_compile()
93runs more slowly than before (30% or more, depending on the pattern) because it
94is doing a full analysis of the pattern. My hope was that this would not be a
95big issue, and in the event, nobody has commented on it.
96
97
98Traditional matching function
99-----------------------------
100
101The "traditional", and original, matching function is called pcre_exec(), and
102it implements an NFA algorithm, similar to the original Henry Spencer algorithm
103and the way that Perl works. This is not surprising, since it is intended to be
104as compatible with Perl as possible. This is the function most users of PCRE
105will use most of the time. From release 8.20, if PCRE is compiled with
106just-in-time (JIT) support, and studying a compiled pattern with JIT is
107successful, the JIT code is run instead of the normal pcre_exec() code, but the
108result is the same.
109
110
111Supplementary matching function
112-------------------------------
113
114From PCRE 6.0, there is also a supplementary matching function called
115pcre_dfa_exec(). This implements a DFA matching algorithm that searches
116simultaneously for all possible matches that start at one point in the subject
117string. (Going back to my roots: see Historical Note 1 above.) This function
118intreprets the same compiled pattern data as pcre_exec(); however, not all the
119facilities are available, and those that are do not always work in quite the
120same way. See the user documentation for details.
121
122The algorithm that is used for pcre_dfa_exec() is not a traditional FSM,
123because it may have a number of states active at one time. More work would be
124needed at compile time to produce a traditional FSM where only one state is
125ever active at once. I believe some other regex matchers work this way.
126
127
128Changeable options
129------------------
130
131The /i, /m, or /s options (PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL) may
132change in the middle of patterns. From PCRE 8.13, their processing is handled
133entirely at compile time by generating different opcodes for the different
134settings. The runtime functions do not need to keep track of an options state
135any more.
136
137
138Format of compiled patterns
139---------------------------
140
141The compiled form of a pattern is a vector of units (bytes in 8-bit mode, or
142shorts in 16-bit mode, 32-bit unsigned integers in 32-bit mode), containing
143items of variable length. The first unit in an item contains an opcode, and
144the length of the item is either implicit in the opcode or contained in the
145data that follows it.
146
147In many cases listed below, LINK_SIZE data values are specified for offsets
148within the compiled pattern. LINK_SIZE always specifies a number of bytes. The
149default value for LINK_SIZE is 2, but PCRE can be compiled to use 3-byte or
1504-byte values for these offsets, although this impairs the performance. (3-byte
151LINK_SIZE values are available only in 8-bit mode.) Specifing a LINK_SIZE
152larger than 2 is necessary only when patterns whose compiled length is greater
153than 64K are going to be processed. In this description, we assume the "normal"
154compilation options. Data values that are counts (e.g. for quantifiers) are
155always just two bytes long (one short in 16-bit mode).
156
157Opcodes with no following data
158------------------------------
159
160These items are all just one unit long
161
162  OP_END                 end of pattern
163  OP_ANY                 match any one character other than newline
164  OP_ALLANY              match any one character, including newline
165  OP_ANYBYTE             match any single byte, even in UTF-8 mode
166  OP_SOD                 match start of data: \A
167  OP_SOM,                start of match (subject + offset): \G
168  OP_SET_SOM,            set start of match (\K)
169  OP_CIRC                ^ (start of data)
170  OP_CIRCM               ^ multiline mode (start of data or after newline)
171  OP_NOT_WORD_BOUNDARY   \W
172  OP_WORD_BOUNDARY       \w
173  OP_NOT_DIGIT           \D
174  OP_DIGIT               \d
175  OP_NOT_HSPACE          \H
176  OP_HSPACE              \h
177  OP_NOT_WHITESPACE      \S
178  OP_WHITESPACE          \s
179  OP_NOT_VSPACE          \V
180  OP_VSPACE              \v
181  OP_NOT_WORDCHAR        \W
182  OP_WORDCHAR            \w
183  OP_EODN                match end of data or \n at end: \Z
184  OP_EOD                 match end of data: \z
185  OP_DOLL                $ (end of data, or before final newline)
186  OP_DOLLM               $ multiline mode (end of data or before newline)
187  OP_EXTUNI              match an extended Unicode character
188  OP_ANYNL               match any Unicode newline sequence
189
190  OP_ACCEPT              ) These are Perl 5.10's "backtracking control
191  OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
192  OP_FAIL                ) parentheses, it may be preceded by one or more
193  OP_PRUNE               ) OP_CLOSE, followed by a 2-byte number,
194  OP_SKIP                ) indicating which parentheses must be closed.
195
196
197Backtracking control verbs with (optional) data
198-----------------------------------------------
199
200(*THEN) without an argument generates the opcode OP_THEN and no following data.
201OP_MARK is followed by the mark name, preceded by a one-unit length, and
202followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,
203the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used, with the name
204following in the same format.
205
206
207Matching literal characters
208---------------------------
209
210The OP_CHAR opcode is followed by a single character that is to be matched
211casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
212the character may be more than one unit long. In UTF-32 mode, characters
213are always exactly one unit long.
214
215
216Repeating single characters
217---------------------------
218
219The common repeats (*, +, ?), when applied to a single character, use the
220following opcodes, which come in caseful and caseless versions:
221
222  Caseful         Caseless
223  OP_STAR         OP_STARI
224  OP_MINSTAR      OP_MINSTARI
225  OP_POSSTAR      OP_POSSTARI
226  OP_PLUS         OP_PLUSI
227  OP_MINPLUS      OP_MINPLUSI
228  OP_POSPLUS      OP_POSPLUSI
229  OP_QUERY        OP_QUERYI
230  OP_MINQUERY     OP_MINQUERYI
231  OP_POSQUERY     OP_POSQUERYI
232
233Each opcode is followed by the character that is to be repeated. In ASCII mode,
234these are two-unit items; in UTF-8 or UTF-16 modes, the length is variable; in
235UTF-32 mode these are one-unit items.
236Those with "MIN" in their names are the minimizing versions. Those with "POS"
237in their names are possessive versions. Other repeats make use of these
238opcodes:
239
240  Caseful         Caseless
241  OP_UPTO         OP_UPTOI
242  OP_MINUPTO      OP_MINUPTOI
243  OP_POSUPTO      OP_POSUPTOI
244  OP_EXACT        OP_EXACTI
245
246Each of these is followed by a two-byte (one short) count (most significant
247byte first in 8-bit mode) and then the repeated character. OP_UPTO matches from
2480 to the given number. A repeat with a non-zero minimum and a fixed maximum is
249coded as an OP_EXACT followed by an OP_UPTO (or OP_MINUPTO or OPT_POSUPTO).
250
251
252Repeating character types
253-------------------------
254
255Repeats of things like \d are done exactly as for single characters, except
256that instead of a character, the opcode for the type is stored in the data
257unit. The opcodes are:
258
259  OP_TYPESTAR
260  OP_TYPEMINSTAR
261  OP_TYPEPOSSTAR
262  OP_TYPEPLUS
263  OP_TYPEMINPLUS
264  OP_TYPEPOSPLUS
265  OP_TYPEQUERY
266  OP_TYPEMINQUERY
267  OP_TYPEPOSQUERY
268  OP_TYPEUPTO
269  OP_TYPEMINUPTO
270  OP_TYPEPOSUPTO
271  OP_TYPEEXACT
272
273
274Match by Unicode property
275-------------------------
276
277OP_PROP and OP_NOTPROP are used for positive and negative matches of a
278character by testing its Unicode property (the \p and \P escape sequences).
279Each is followed by two units that encode the desired property as a type and a
280value.
281
282Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
283three units: OP_PROP or OP_NOTPROP, and then the desired property type and
284value.
285
286
287Character classes
288-----------------
289
290If there is only one character in the class, OP_CHAR or OP_CHARI is used for a
291positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
292something like [^a]).
293
294Another set of 13 repeating opcodes (called OP_NOTSTAR etc.) are used for
295repeated, negated, single-character classes. The normal single-character
296opcodes (OP_STAR, etc.) are used for repeated positive single-character
297classes.
298
299When there is more than one character in a class and all the characters are
300less than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a
301negative one. In either case, the opcode is followed by a 32-byte (16-short)
302bit map containing a 1 bit for every character that is acceptable. The bits are
303counted from the least significant end of each unit. In caseless mode, bits for
304both cases are set.
305
306The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16/32 mode,
307subject characters with values greater than 255 can be handled correctly. For
308OP_CLASS they do not match, whereas for OP_NCLASS they do.
309
310For classes containing characters with values greater than 255, OP_XCLASS is
311used. It optionally uses a bit map (if any characters lie within it), followed
312by a list of pairs (for a range) and single characters. In caseless mode, both
313cases are explicitly listed. There is a flag character than indicates whether
314it is a positive or a negative class.
315
316
317Back references
318---------------
319
320OP_REF (caseful) or OP_REFI (caseless) is followed by two bytes (one short)
321containing the reference number.
322
323
324Repeating character classes and back references
325-----------------------------------------------
326
327Single-character classes are handled specially (see above). This section
328applies to OP_CLASS and OP_REF[I]. In both cases, the repeat information
329follows the base item. The matching code looks at the following opcode to see
330if it is one of
331
332  OP_CRSTAR
333  OP_CRMINSTAR
334  OP_CRPLUS
335  OP_CRMINPLUS
336  OP_CRQUERY
337  OP_CRMINQUERY
338  OP_CRRANGE
339  OP_CRMINRANGE
340
341All but the last two are just single-unit items. The others are followed by
342four bytes (two shorts) of data, comprising the minimum and maximum repeat
343counts. There are no special possessive opcodes for these repeats; a possessive
344repeat is compiled into an atomic group.
345
346
347Brackets and alternation
348------------------------
349
350A pair of non-capturing (round) brackets is wrapped round each expression at
351compile time, so alternation always happens in the context of brackets.
352
353[Note for North Americans: "bracket" to some English speakers, including
354myself, can be round, square, curly, or pointy. Hence this usage rather than
355"parentheses".]
356
357Non-capturing brackets use the opcode OP_BRA. Originally PCRE was limited to 99
358capturing brackets and it used a different opcode for each one. From release
3593.5, the limit was removed by putting the bracket number into the data for
360higher-numbered brackets. From release 7.0 all capturing brackets are handled
361this way, using the single opcode OP_CBRA.
362
363A bracket opcode is followed by LINK_SIZE bytes which give the offset to the
364next alternative OP_ALT or, if there aren't any branches, to the matching
365OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
366the next one, or to the OP_KET opcode. For capturing brackets, the bracket
367number immediately follows the offset, always as a 2-byte (one short) item.
368
369OP_KET is used for subpatterns that do not repeat indefinitely, and
370OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or
371maximally respectively (see below for possessive repetitions). All three are
372followed by LINK_SIZE bytes giving (as a positive number) the offset back to
373the matching bracket opcode.
374
375If a subpattern is quantified such that it is permitted to match zero times, it
376is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
377single-unit opcodes that tell the matcher that skipping the following
378subpattern entirely is a valid branch. In the case of the first two, not
379skipping the pattern is also valid (greedy and non-greedy). The third is used
380when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
381because it may be called as a subroutine from elsewhere in the regex.
382
383A subpattern with an indefinite maximum repetition is replicated in the
384compiled data its minimum number of times (or once with OP_BRAZERO if the
385minimum is zero), with the final copy terminating with OP_KETRMIN or OP_KETRMAX
386as appropriate.
387
388A subpattern with a bounded maximum repetition is replicated in a nested
389fashion up to the maximum number of times, with OP_BRAZERO or OP_BRAMINZERO
390before each replication after the minimum, so that, for example, (abc){2,5} is
391compiled as (abc)(abc)((abc)((abc)(abc)?)?)?, except that each bracketed group
392has the same number.
393
394When a repeated subpattern has an unbounded upper limit, it is checked to see
395whether it could match an empty string. If this is the case, the opcode in the
396final replication is changed to OP_SBRA or OP_SCBRA. This tells the matcher
397that it needs to check for matching an empty string when it hits OP_KETRMIN or
398OP_KETRMAX, and if so, to break the loop.
399
400Possessive brackets
401-------------------
402
403When a repeated group (capturing or non-capturing) is marked as possessive by
404the "+" notation, e.g. (abc)++, different opcodes are used. Their names all
405have POS on the end, e.g. OP_BRAPOS instead of OP_BRA and OP_SCPBRPOS instead
406of OP_SCBRA. The end of such a group is marked by OP_KETRPOS. If the minimum
407repetition is zero, the group is preceded by OP_BRAPOSZERO.
408
409
410Assertions
411----------
412
413Forward assertions are just like other subpatterns, but starting with one of
414the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
415OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
416is OP_REVERSE, followed by a two byte (one short) count of the number of
417characters to move back the pointer in the subject string. In ASCII mode, the
418count is a number of units, but in UTF-8/16 mode each character may occupy more
419than one unit; in UTF-32 mode each character occupies exactly one unit.
420A separate count is present in each alternative of a lookbehind
421assertion, allowing them to have different fixed lengths.
422
423
424Once-only (atomic) subpatterns
425------------------------------
426
427These are also just like other subpatterns, but they start with the opcode
428OP_ONCE. The check for matching an empty string in an unbounded repeat is
429handled entirely at runtime, so there is just this one opcode.
430
431
432Conditional subpatterns
433-----------------------
434
435These are like other subpatterns, but they start with the opcode OP_COND, or
436OP_SCOND for one that might match an empty string in an unbounded repeat. If
437the condition is a back reference, this is stored at the start of the
438subpattern using the opcode OP_CREF followed by two bytes (one short)
439containing the reference number. OP_NCREF is used instead if the reference was
440generated by name (so that the runtime code knows to check for duplicate
441names).
442
443If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of
444group x" (coded as "(?(Rx)"), the group number is stored at the start of the
445subpattern using the opcode OP_RREF or OP_NRREF (cf OP_NCREF), and a value of
446zero for "the whole pattern". For a DEFINE condition, just the single unit
447OP_DEF is used (it has no associated data). Otherwise, a conditional subpattern
448always starts with one of the assertions.
449
450
451Recursion
452---------
453
454Recursion either matches the current regex, or some subexpression. The opcode
455OP_RECURSE is followed by an value which is the offset to the starting bracket
456from the start of the whole pattern. From release 6.5, OP_RECURSE is
457automatically wrapped inside OP_ONCE brackets (because otherwise some patterns
458broke it). OP_RECURSE is also used for "subroutine" calls, even though they
459are not strictly a recursion.
460
461
462Callout
463-------
464
465OP_CALLOUT is followed by one unit of data that holds a callout number in the
466range 0 to 254 for manual callouts, or 255 for an automatic callout. In both
467cases there follows a two-byte (one short) value giving the offset in the
468pattern to the start of the following item, and another two-byte (one short)
469item giving the length of the next item.
470
471
472Philip Hazel
473February 2012
474