Lexer component documentation
=============================

The lexer is responsible for providing tokens to the parser. Typical use of the library does not require direct
interaction with the lexer, as an appropriate lexer is created by `PhpParser\ParserFactory`. The tokens produced
by the lexer can then be retrieved using `PhpParser\Parser::getTokens()`.

Emulation
---------

While this library implements a custom parser, it relies on PHP's `ext/tokenizer` extension to perform lexing. However,
this extension only supports lexing code for the PHP version you are running on, while this library also wants to support
parsing newer code. For that reason, the lexer performs additional "emulation" in three layers:

First, PhpParser uses the `PhpToken` based representation introduced in PHP 8.0, rather than the array-based tokens from
previous versions. The `PhpParser\Token` class extends either the native `PhpToken` (on PHP 8.0 and later) or a polyfill
implementation. The polyfill implementation also performs two emulations that are required by the parser and cannot be
disabled:

 * Single-line comments use the PHP 8.0 representation that does not include a trailing newline. The newline will be
   part of a following `T_WHITESPACE` token.
 * Namespaced names use the PHP 8.0 representation using `T_NAME_FULLY_QUALIFIED`, `T_NAME_QUALIFIED` and
   `T_NAME_RELATIVE` tokens, rather than the previous representation using a sequence of `T_STRING` and `T_NS_SEPARATOR`.
   This means that certain code that is legal on older versions (namespaced names including whitespace, such as `A \ B`)
   will not be accepted by the parser.

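To see the two PHP 8.0 representations concretely, the following sketch uses PHP's native `PhpToken::tokenize()` rather
than PhpParser itself, so it assumes a PHP 8.0+ runtime where no polyfill is involved. The sample source string is made
up for illustration.

```php
// Assumes PHP >= 8.0, where PhpToken and the T_NAME_* tokens are native.
$code = "<?php\n// note\nnew A\\B; new \\C\\D; new namespace\\E;";

$summary = [];
foreach (PhpToken::tokenize($code) as $token) {
    // Keep only comments and the three namespaced-name token kinds.
    if ($token->is([T_COMMENT, T_NAME_QUALIFIED, T_NAME_FULLY_QUALIFIED, T_NAME_RELATIVE])) {
        $summary[] = [$token->getTokenName(), $token->text];
    }
}

// The comment text stops before the newline, and each name is a single token.
var_export($summary);
```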
Second, the `PhpParser\Lexer` base class will convert `&` tokens into the PHP 8.1 representation of either
`T_AMPERSAND_FOLLOWED_BY_VAR_OR_VARARG` or `T_AMPERSAND_NOT_FOLLOWED_BY_VAR_OR_VARARG`. This is required by the parser
and cannot be disabled.

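The difference between the two ampersand tokens can be observed directly with the native tokenizer. This is only a
sketch: it assumes a PHP 8.1+ runtime (where both tokens exist natively) and uses a made-up function as input.

```php
// Assumes PHP >= 8.1, where both ampersand tokens are native.
$code = '<?php function f(&$a) { return $a & 1; }';

$ampersands = [];
foreach (PhpToken::tokenize($code) as $token) {
    if ($token->text === '&') {
        $ampersands[] = $token->getTokenName();
    }
}

// First entry: by-reference parameter (`&$a`). Second entry: bitwise and (`$a & 1`).
var_export($ampersands);
```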
Finally, `PhpParser\Lexer\Emulative` performs other, optional emulations. This lexer is parameterized by `PhpVersion`
and will try to emulate `ext/tokenizer` output for that version. This is done using separate `TokenEmulator`s for each
emulated feature.

Emulation is usually used to support newer PHP versions, but there is also very limited support for reverse emulation to
older PHP versions, which can make keywords from newer versions non-reserved.

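As a rough illustration of forward and reverse emulation, the sketch below tokenizes the same snippet for two target
versions. It assumes PHP-Parser 5.x is installed via Composer (the autoload path is an assumption) and uses the `match`
keyword, introduced in PHP 8.0, as the example feature.

```php
// Assumes nikic/php-parser 5.x installed via Composer; adjust the autoload path as needed.
require __DIR__ . '/vendor/autoload.php';

use PhpParser\Lexer\Emulative;
use PhpParser\PhpVersion;

$code = '<?php match ($x) {};';

// Target PHP 8.0: `match` is lexed as a keyword, whatever PHP version runs this script.
$as80 = (new Emulative(PhpVersion::fromString('8.0')))->tokenize($code);
// Target PHP 7.4: `match` is lexed as an ordinary identifier.
$as74 = (new Emulative(PhpVersion::fromString('7.4')))->tokenize($code);

// Index 0 is the opening tag, index 1 is the `match` token.
var_dump($as80[1]->id === T_MATCH);
var_dump($as74[1]->id === T_STRING);
```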
Tokens, positions and attributes
--------------------------------

The `Lexer::tokenize()` method returns an array of `PhpParser\Token`s. The most important parts of the interface can be
summarized as follows:

```php
class Token {
    /** @var int Token ID, either T_* or ord($char) for single-character tokens. */
    public int $id;
    /** @var string The textual content of the token. */
    public string $text;
    /** @var int The 1-based starting line of the token (or -1 if unknown). */
    public int $line;
    /** @var int The 0-based starting position of the token (or -1 if unknown). */
    public int $pos;

    /** @param int|string|(int|string)[] $kind Token ID or text (or array of them) */
    public function is($kind): bool;
}
```

Unlike PHP's own `PhpToken::tokenize()` output, the token array is terminated by a sentinel token with ID 0.

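A minimal sketch of calling the lexer directly and inspecting the result, including the sentinel. It assumes PHP-Parser
5.x is installed via Composer (the autoload path is an assumption); the input string is arbitrary.

```php
// Assumes nikic/php-parser 5.x installed via Composer; adjust the autoload path as needed.
require __DIR__ . '/vendor/autoload.php';

$lexer = new PhpParser\Lexer();
$tokens = $lexer->tokenize('<?php echo 1;');

// $tokens[0] is the opening tag, $tokens[1] the `echo` keyword.
$echo = $tokens[1];
var_dump($echo->is(T_ECHO));            // match by token ID
var_dump($echo->is('echo'));            // match by text
var_dump($echo->is([T_PRINT, T_ECHO])); // match any of several kinds
var_dump($echo->line, $echo->pos);      // 1-based line, 0-based offset

// The final element is the sentinel token with ID 0, not a real source token.
$last = $tokens[count($tokens) - 1];
var_dump($last->id === 0);
```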
The lexer is normally invoked implicitly by the parser. In that case, the tokens for the last parse can be retrieved
using `Parser::getTokens()`.

Nodes in the AST produced by the parser always correspond to some range of tokens. The parser adds a number of
positioning attributes to allow mapping nodes back to lines, tokens or file offsets:

 * `startLine`: Line in which the node starts. Used by `$node->getStartLine()`.
 * `endLine`: Line in which the node ends. Used by `$node->getEndLine()`.
 * `startTokenPos`: Offset into the token array of the first token in the node. Used by `$node->getStartTokenPos()`.
 * `endTokenPos`: Offset into the token array of the last token in the node. Used by `$node->getEndTokenPos()`.
 * `startFilePos`: Offset into the code string of the first character that is part of the node. Used by `$node->getStartFilePos()`.
 * `endFilePos`: Offset into the code string of the last character that is part of the node. Used by `$node->getEndFilePos()`.

Note that `start`/`end` here are closed rather than half-open ranges. This means that a node consisting of a single
token will have `startTokenPos == endTokenPos` rather than `startTokenPos + 1 == endTokenPos`. This also means that a
zero-length node will have `startTokenPos - 1 == endTokenPos`.

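The closed-range convention matters most when slicing: the length is `end - start + 1`, not `end - start`. The helper
below is a hypothetical illustration using hand-picked file offsets; in real code the offsets would come from
`$node->getStartFilePos()` and `$node->getEndFilePos()`.

```php
/**
 * Hypothetical helper: returns the text covered by a closed [start, end] offset range.
 */
function sliceClosedRange(string $code, int $startFilePos, int $endFilePos): string {
    // Both ends are inclusive, hence the +1 when computing the length.
    return substr($code, $startFilePos, $endFilePos - $startFilePos + 1);
}

$code = '<?php $x = 42;';

var_dump(sliceClosedRange($code, 11, 12)); // string(2) "42"
var_dump(sliceClosedRange($code, 9, 9));   // string(1) "="  (start == end)
var_dump(sliceClosedRange($code, 11, 10)); // string(0) ""   (start - 1 == end)
```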
### Using token positions

> **Note:** The example in this section is outdated, because this information is directly available in the AST. While
> `$property->isPublic()` does not distinguish between `public` and `var`, checking whether
> `($property->flags & Class_::VISIBILITY_MODIFIER_MASK) === 0` makes this distinction without resorting to
> tokens. However, the general idea behind the example still applies in other cases.

The token offset information is useful if you wish to examine the exact formatting used for a node. For example, the AST
does not distinguish whether a property was declared using `public` or using `var`, but you can retrieve this
information based on the token position:

```php
/** @param PhpParser\Token[] $tokens */
function isDeclaredUsingVar(array $tokens, PhpParser\Node\Stmt\Property $prop) {
    $i = $prop->getStartTokenPos();
    return $tokens[$i]->id === T_VAR;
}
```

In order to make use of this function, you will have to provide the tokens from the lexer to your node visitor using
code similar to the following:

```php
class MyNodeVisitor extends PhpParser\NodeVisitorAbstract {
    /** @var PhpParser\Token[] Tokens of the file currently being traversed */
    private array $tokens = [];

    public function setTokens(array $tokens) {
        $this->tokens = $tokens;
    }

    public function leaveNode(PhpParser\Node $node) {
        if ($node instanceof PhpParser\Node\Stmt\Property) {
            var_dump(isDeclaredUsingVar($this->tokens, $node));
        }
    }
}

$parser = (new PhpParser\ParserFactory())->createForHostVersion();

$visitor = new MyNodeVisitor();
$traverser = new PhpParser\NodeTraverser($visitor);

try {
    // $code holds the PHP source to analyse.
    $stmts = $parser->parse($code);
    // Hand the tokens of this parse to the visitor before traversing.
    $visitor->setTokens($parser->getTokens());
    $stmts = $traverser->traverse($stmts);
} catch (PhpParser\Error $e) {
    echo 'Parse Error: ', $e->getMessage();
}
```

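The same token positions can also be used to recover the exact original text of a node, including its formatting. The
sketch below assumes PHP-Parser 5.x installed via Composer (the autoload path is an assumption); `getOriginalText()` is a
hypothetical helper, not part of the library.

```php
// Assumes nikic/php-parser 5.x installed via Composer; adjust the autoload path as needed.
require __DIR__ . '/vendor/autoload.php';

use PhpParser\Node;
use PhpParser\NodeFinder;
use PhpParser\ParserFactory;

/**
 * Hypothetical helper: joins the text of all tokens belonging to a node.
 * @param PhpParser\Token[] $tokens
 */
function getOriginalText(array $tokens, Node $node): string {
    $text = '';
    // Token positions are closed ranges, so the end position is included.
    for ($i = $node->getStartTokenPos(); $i <= $node->getEndTokenPos(); $i++) {
        $text .= $tokens[$i]->text;
    }
    return $text;
}

// The irregular spacing after `var` is deliberate, to show that formatting is preserved.
$code = '<?php class A { var   $x = 1; }';
$parser = (new ParserFactory())->createForHostVersion();
$stmts = $parser->parse($code);

$property = (new NodeFinder())->findFirstInstanceOf($stmts, Node\Stmt\Property::class);
echo getOriginalText($parser->getTokens(), $property), "\n";
```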