#####################
 High-level overview
#####################

PHP is an interpreted language. Interpreted languages differ from compiled ones in that they aren't
compiled into machine code ahead of time. Instead, the source files are read, processed and
interpreted when the program is executed. This can be very convenient for developers for rapid
prototyping, as it skips a lengthy compilation phase. However, it also poses some unique challenges
to performance, which is one of the primary reasons interpreters can be complex. php-src borrows
many concepts from other compilers and interpreters.

**********
 Pipeline
**********

The goal of the interpreter is to read the user's source files and to simulate the user's intent.
This process can be split into distinct phases that are easier to understand and implement.

-  Tokenization - splitting whole source files into words, called tokens.
-  Parsing - building a tree structure from tokens, called an AST (abstract syntax tree).
-  Compilation - traversing the AST and building a list of operations, called opcodes.
-  Interpretation - reading and executing opcodes.

php-src as a whole can be seen as a pipeline consisting of these stages, each using the output of
the previous phase as its input and producing some output for the next.

.. code:: haskell

   source_code
     |> tokenizer   -- tokens
     |> parser      -- ast
     |> compiler    -- opcodes
     |> interpreter

Let's go into each phase in a bit more detail.

**************
 Tokenization
**************

Tokenization, often called "lexing" or "scanning", is the process of taking an entire program file
and splitting it into a list of words and symbols. Tokens generally consist of a type (a simple
integer constant identifying the kind of token) and a lexeme (the literal string used in the source
code).

.. code:: php

   if ($cond) {
       echo "Cond is true\n";
   }

.. code:: text

   T_IF                       "if"
   T_WHITESPACE               " "
                              "("
   T_VARIABLE                 "$cond"
                              ")"
   T_WHITESPACE               " "
                              "{"
   T_WHITESPACE               "\n    "
   T_ECHO                     "echo"
   T_WHITESPACE               " "
   T_CONSTANT_ENCAPSED_STRING '"Cond is true\n"'
                              ";"
   T_WHITESPACE               "\n"
                              "}"
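
The token list above can be reproduced from userland. The following is a minimal sketch, assuming
PHP 8.0 or later with the tokenizer extension enabled. The ``$source`` and ``$rows`` variable names
are chosen here for illustration only. Unlike the listing above, the snippet has to start with an
opening ``<?php`` tag, so the output contains an additional ``T_OPEN_TAG`` token.

.. code:: php

   <?php
   // Source to tokenize. The opening tag is required; without it the scanner
   // treats the whole input as inline HTML.
   $source = <<<'PHP'
   <?php
   if ($cond) {
       echo "Cond is true\n";
   }
   PHP;

   $rows = [];
   foreach (PhpToken::tokenize($source) as $token) {
       // Single-character tokens such as "(" have no T_* name; for those,
       // getTokenName() returns the character itself.
       $rows[] = [$token->getTokenName(), $token->text];
   }

   foreach ($rows as [$name, $text]) {
       // var_export() makes whitespace and newlines in the lexeme visible.
       printf("%-27s %s\n", $name, var_export($text, true));
   }

Each row pairs the token name with its lexeme, which mirrors the type and lexeme columns of the
listing above.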

While tokenizers are not difficult to write by hand, PHP uses a tool called ``re2c`` to automate
this process. It takes a definition file and generates efficient C code to build these tokens from a
stream of characters. The definition for PHP lives in ``Zend/zend_language_scanner.l``. Check the
`re2c documentation`_ for details.

.. _re2c documentation: https://re2c.org/

*********
 Parsing
*********

Parsing is the process of reading the tokens generated by the tokenizer and building a tree
structure from them. To humans, how source code elements are grouped seems obvious through
whitespace and the usage of symbols like ``()`` and ``{}``. However, computers cannot visually
glance over the code to determine these boundaries quickly. To make it easier and faster to work
with, we build a tree structure from the tokens to more closely reflect the source code the way
humans see it.

Here is a simplified example of what an AST from the tokens above might look like.

.. code:: text

   ZEND_AST_IF {
       ZEND_AST_IF_ELEM {
           ZEND_AST_VAR {
               ZEND_AST_ZVAL { "cond" },
           },
           ZEND_AST_STMT_LIST {
               ZEND_AST_ECHO {
                   ZEND_AST_ZVAL { "Cond is true\n" },
               },
           },
       },
   }

Each AST node has a type and may have children. Nodes also store their original position in the
source code, and may define some arbitrary flags. These are omitted for brevity.
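
To make the shape of a node more concrete, here is a rough userland model of the tree above. This
is only a sketch: php-src implements AST nodes in C, and the ``AstNode`` class, its property names
and the ``dumpAst()`` helper are hypothetical names invented for this example. The line numbers are
illustrative as well.

.. code:: php

   <?php
   // Hypothetical userland model of an AST node; php-src uses C structs instead.
   final class AstNode
   {
       public function __construct(
           public string $kind,          // node type, e.g. "ZEND_AST_IF"
           public array $children = [],  // child nodes, or literal values for leaves
           public int $lineno = 0,       // original position in the source code
           public int $flags = 0,        // arbitrary per-node flags
       ) {}
   }

   // Renders the tree with one node per line, indented by depth.
   function dumpAst(AstNode|string $node, int $depth = 0): string
   {
       $indent = str_repeat('    ', $depth);
       if (is_string($node)) {
           return $indent . var_export($node, true) . "\n";
       }
       $out = "{$indent}{$node->kind} @ line {$node->lineno}\n";
       foreach ($node->children as $child) {
           $out .= dumpAst($child, $depth + 1);
       }
       return $out;
   }

   $ast = new AstNode('ZEND_AST_IF', [
       new AstNode('ZEND_AST_IF_ELEM', [
           new AstNode('ZEND_AST_VAR', [new AstNode('ZEND_AST_ZVAL', ['cond'], 1)], 1),
           new AstNode('ZEND_AST_STMT_LIST', [
               new AstNode('ZEND_AST_ECHO', [
                   new AstNode('ZEND_AST_ZVAL', ["Cond is true\n"], 2),
               ], 2),
           ], 1),
       ], 1),
   ], 1);

   echo dumpAst($ast);

The recursive ``dumpAst()`` function shows the typical way of working with such a tree: handle the
current node, then visit each child in order.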

As with tokenization, we use a tool, ``Bison``, to generate the parser implementation from a
grammar specification. The grammar lives in the ``Zend/zend_language_parser.y`` file. Check the
`Bison documentation`_ for details. Luckily, the syntax is quite approachable.

.. _bison documentation: https://www.gnu.org/software/bison/manual/

Parsing is described in more detail in its `dedicated chapter <todo>`__.

*************
 Compilation
*************

Computers don't understand human language, or even programming languages. They only understand
machine code, which consists of sequences of simple, mostly atomic instructions that each do one
thing. For example, they may add two numbers, load some memory from RAM, jump to an instruction
under a certain condition, etc. It turns out that even the most complex expressions can be reduced
to a number of these simple instructions.

PHP is a bit different, in that it does not execute machine code directly. Instead, instructions run
on a "virtual machine", often abbreviated to VM. This is just a fancy way of saying that there is no
physical machine you can buy that understands these instructions, but that this machine is
implemented in software. This is our interpreter. This also means that we are free to make up
instructions ourselves at will. Some of these instructions look very similar to something you'd find
in an actual CPU instruction set (e.g. adding two numbers), while others are much more high-level
(e.g. load property of object by name).

With that little detour out of the way, the job of the compiler is to read the AST and translate it
into our virtual machine instructions, also called opcodes. The code responsible for this
transformation lives in ``Zend/zend_compile.c``. It essentially traverses the AST and generates a
number of instructions for each node, before going to the next node.

Here's what the surprisingly compact opcodes for the AST above might look like:

.. code:: text

   0000 JMPZ CV0($cond) 0002
   0001 ECHO string("Cond is true\n")
   0002 RETURN int(1)
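
The traversal described above can be sketched in a few lines of PHP. This is a toy model under
stated assumptions, not how ``Zend/zend_compile.c`` is written: nested arrays stand in for AST
nodes, each opcode is a plain ``[name, operand1, operand2]`` array, and the ``compileNode()``
function and the node kinds ``if``, ``stmt_list`` and ``echo`` are hypothetical names. It does show
one detail that any such compiler has to solve: the jump target is not known when the jump is
emitted, so it is patched once the body has been compiled.

.. code:: php

   <?php
   // Hypothetical mini-compiler: appends opcodes for $node to $opcodes.
   function compileNode(array $node, array &$opcodes): void
   {
       switch ($node['kind']) {
           case 'stmt_list':
               foreach ($node['children'] as $child) {
                   compileNode($child, $opcodes);
               }
               break;
           case 'if':
               // Emit the jump first; its target is unknown until the body is compiled.
               $jump = count($opcodes);
               $opcodes[] = ['JMPZ', $node['cond'], null];
               compileNode($node['body'], $opcodes);
               // Patch the jump so that it points just past the body.
               $opcodes[$jump][2] = count($opcodes);
               break;
           case 'echo':
               $opcodes[] = ['ECHO', $node['value'], null];
               break;
           default:
               throw new InvalidArgumentException("Unknown node kind: {$node['kind']}");
       }
   }

   $ast = [
       'kind' => 'if',
       'cond' => '$cond',
       'body' => ['kind' => 'stmt_list', 'children' => [
           ['kind' => 'echo', 'value' => "Cond is true\n"],
       ]],
   ];

   $opcodes = [];
   compileNode($ast, $opcodes);
   $opcodes[] = ['RETURN', 1, null]; // the script ends with a return

   foreach ($opcodes as $i => [$name, $op1, $op2]) {
       $target = $op2 === null ? '' : sprintf('%04d', $op2);
       printf("%04d %s %s %s\n", $i, $name, json_encode($op1), $target);
   }

The resulting list has the same three instructions as the listing above, with ``JMPZ`` pointing at
the ``RETURN`` instruction.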

****************
 Interpretation
****************

Finally, the opcodes are read and executed by the interpreter. PHP uses `three-address code`_ for
instructions. This essentially means that each instruction may have a result value and at most two
operands. Many modern CPUs use a similar format. Both result and operands in PHP are :doc:`zvals
<../core/data-structures/zval>`.

.. _three-address code: https://en.wikipedia.org/wiki/Three-address_code

How exactly each opcode behaves depends on its purpose. You can find a complete list of opcodes in
the generated ``Zend/zend_vm_opcodes.h`` file. The behavior of each instruction is defined in
``Zend/zend_vm_def.h``.

Let's step through the opcodes from the example above:

-  We start at the top, i.e. ``JMPZ``. If its first operand contains a "falsy" value, it will jump
   to the instruction encoded in its second operand. If it is truthy, it will simply fall through to
   the next instruction.

-  The ``ECHO`` instruction prints its first operand.

-  The ``RETURN`` instruction terminates the current function.

With these simple rules, we can see that the interpreter will ``echo`` only when ``$cond`` is
truthy, and skip over the ``echo`` otherwise.
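
The three rules above are enough to write a toy interpreter loop. The sketch below is an
illustration only and does not reflect how the real VM is implemented: opcodes are modeled as plain
``[name, operand1, operand2]`` arrays, variables live in an associative array, and the ``execute()``
function is a hypothetical name. To keep the result easy to inspect, it collects the echoed text and
returns it, rather than returning the operand of ``RETURN``.

.. code:: php

   <?php
   // Hypothetical mini-VM: runs the opcodes and returns everything that was echoed.
   function execute(array $opcodes, array $variables): string
   {
       $output = '';
       $ip = 0; // instruction pointer: index of the opcode to execute next

       while (true) {
           [$name, $op1, $op2] = $opcodes[$ip];
           switch ($name) {
               case 'JMPZ':
                   // Jump to $op2 if the variable is falsy, otherwise fall through.
                   $ip = $variables[$op1] ? $ip + 1 : $op2;
                   break;
               case 'ECHO':
                   $output .= $op1;
                   $ip++;
                   break;
               case 'RETURN':
                   return $output;
               default:
                   throw new RuntimeException("Unknown opcode: $name");
           }
       }
   }

   $opcodes = [
       ['JMPZ', 'cond', 2],
       ['ECHO', "Cond is true\n", null],
       ['RETURN', 1, null],
   ];

   echo execute($opcodes, ['cond' => true]);  // prints "Cond is true"
   echo execute($opcodes, ['cond' => false]); // prints nothing

Running the same opcodes with a truthy and a falsy ``cond`` exercises both paths through ``JMPZ``.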

That's it! This is how PHP works, fundamentally. Of course, we skipped over a ton of details. The VM
is quite complex, and will be discussed separately in the `virtual machine <todo>`__ chapter.

*********
 Opcache
*********

As you may imagine, running this whole pipeline every time PHP serves a request is time-consuming.
Luckily, it is also not necessary. We can cache the opcodes in memory between requests, to skip over
all of the phases except for the execution phase. This is precisely what the opcache extension
does. It lives in the ``ext/opcache`` directory.

Opcache also performs some optimizations on the opcodes before caching them. As cached opcodes are
expected to be reused many times, it is profitable to spend some additional time simplifying them if
possible to improve performance during execution. The optimizer lives in ``Zend/Optimizer``.

JIT
===

The opcache also implements a JIT compiler, which stands for just-in-time compiler. This compiler
takes the virtual PHP opcodes and turns them into actual machine instructions, with additional
information gained at runtime. JITs are very complex pieces of software, so this book will likely
barely scratch the surface of how they work. The JIT lives in ``ext/opcache/jit``.
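
Whether opcache and the JIT are active for a given process can be checked from userland. The
following is a minimal sketch, assuming PHP 8.0 or later. The ``$opcacheEnabled`` and
``$jitEnabled`` variable names are chosen for illustration, and the presence of the ``jit`` key in
the status array is treated as an assumption, with a fallback if it is missing. On the CLI, opcache
is disabled unless ``opcache.enable_cli`` is set.

.. code:: php

   <?php
   // opcache_get_status() may be unavailable if the extension is not loaded, and
   // it returns false when opcache is disabled for the current script.
   $status = function_exists('opcache_get_status') ? opcache_get_status(false) : false;

   $opcacheEnabled = is_array($status) && ($status['opcache_enabled'] ?? false);
   // Assumption: the "jit" key is present on PHP 8.0+; fall back to false otherwise.
   $jitEnabled = is_array($status) && ($status['jit']['enabled'] ?? false);

   printf("opcache: %s\n", $opcacheEnabled ? 'enabled' : 'disabled');
   printf("JIT:     %s\n", $jitEnabled ? 'enabled' : 'disabled');

The output depends entirely on the configuration of the PHP binary that runs the script.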