#####################
 High-level overview
#####################

PHP is an interpreted language. Interpreted languages differ from compiled ones in that they aren't
compiled into machine-readable code ahead of time. Instead, the source files are read, processed and
interpreted when the program is executed. This can be very convenient for developers for rapid
prototyping, as it skips a lengthy compilation phase. However, it also poses some unique challenges
to performance, which is one of the primary reasons interpreters can be complex. php-src borrows
many concepts from other compilers and interpreters.

**********
 Pipeline
**********

The goal of the interpreter is to read the user's source files and to simulate the user's intent.
This process can be split into distinct phases that are easier to understand and implement.

- Tokenization - splitting whole source files into words, called tokens.
- Parsing - building a tree structure from tokens, called AST (abstract syntax tree).
- Compilation - traversing the AST and building a list of operations, called opcodes.
- Interpretation - reading and executing opcodes.

php-src as a whole can be seen as a pipeline consisting of these stages, each using the output of
the previous phase and producing some output for the next.

.. code:: haskell

   source_code
     |> tokenizer   -- tokens
     |> parser      -- ast
     |> compiler    -- opcodes
     |> interpreter

Let's go into each phase in a bit more detail.

**************
 Tokenization
**************

Tokenization, often called "lexing" or "scanning", is the process of taking an entire program file
and splitting it into a list of words and symbols. Tokens generally consist of a type, a simple
integer constant representing the token, and a lexeme, the literal string used in the source code.

.. code:: php

   if ($cond) {
       echo "Cond is true\n";
   }

.. code:: text

   T_IF                       "if"
   T_WHITESPACE               " "
                              "("
   T_VARIABLE                 "$cond"
                              ")"
   T_WHITESPACE               " "
                              "{"
   T_WHITESPACE               "\n    "
   T_ECHO                     "echo"
   T_WHITESPACE               " "
   T_CONSTANT_ENCAPSED_STRING '"Cond is true\n"'
                              ";"
   T_WHITESPACE               "\n"
                              "}"

While tokenizers are not difficult to write by hand, PHP uses a tool called ``re2c`` to automate
this process. It takes a definition file and generates efficient C code to build these tokens from a
stream of characters. The definition for PHP lives in ``Zend/zend_language_scanner.l``. Check the
`re2c documentation`_ for details.

.. _re2c documentation: https://re2c.org/

*********
 Parsing
*********

Parsing is the process of reading the tokens generated by the tokenizer and building a tree
structure from them. To humans, how source code elements are grouped seems obvious through
whitespace and the usage of symbols like ``()`` and ``{}``. However, computers cannot visually
glance over the code to determine these boundaries quickly. To make the code easier and faster to
work with, we build a tree structure from the tokens that more closely reflects the source code the
way humans see it.

Here is a simplified example of what an AST from the tokens above might look like.

.. code:: text

   ZEND_AST_IF {
       ZEND_AST_IF_ELEM {
           ZEND_AST_VAR {
               ZEND_AST_ZVAL { "cond" },
           },
           ZEND_AST_STMT_LIST {
               ZEND_AST_ECHO {
                   ZEND_AST_ZVAL { "Cond is true\n" },
               },
           },
       },
   }

Each AST node has a type and may have children. Nodes also store their original position in the
source code, and may define some arbitrary flags. These are omitted for brevity.

As with tokenization, we use a tool, in this case ``Bison``, to generate the parser implementation
from a grammar specification. The grammar lives in the ``Zend/zend_language_parser.y`` file. Check
the `Bison documentation`_ for details. Luckily, the syntax is quite approachable.

.. _bison documentation: https://www.gnu.org/software/bison/manual/

Parsing is described in more detail in its `dedicated chapter <todo>`__.

*************
 Compilation
*************

Computers don't understand human language, or even programming languages. They only understand
machine code, which is a sequence of simple, mostly atomic instructions that each do one thing. For
example, they may add two numbers, load some memory from RAM, jump to an instruction under a certain
condition, etc. It turns out that even the most complex expressions can be reduced to a number of
these simple instructions.

PHP is a bit different, in that it does not execute machine code directly. Instead, instructions run
on a "virtual machine", often abbreviated to VM. This is just a fancy way of saying that there is no
physical machine you can buy that understands these instructions, but that this machine is
implemented in software. This is our interpreter. This also means that we are free to make up
instructions ourselves at will. Some of these instructions look very similar to something you'd find
in an actual CPU instruction set (e.g. adding two numbers), while others are much more high-level
(e.g. load property of object by name).

With that little detour out of the way, the job of the compiler is to read the AST and translate it
into our virtual machine instructions, also called opcodes. The code responsible for this
transformation lives in ``Zend/zend_compile.c``. It essentially traverses the AST and generates a
number of instructions for each node before going to the next one.

Here's what the surprisingly compact opcodes for the AST above might look like:

.. code:: text

   0000 JMPZ CV0($cond) 0002
   0001 ECHO string("Cond is true\n")
   0002 RETURN int(1)

****************
 Interpretation
****************

Finally, the opcodes are read and executed by the interpreter.
PHP uses `three-address code`_ for instructions. This essentially means that each instruction may
have a result value, and at most two operands. Many CPU instruction sets also use this format. Both
result and operands in PHP are :doc:`zvals <../core/data-structures/zval>`.

.. _three-address code: https://en.wikipedia.org/wiki/Three-address_code

How exactly each opcode behaves depends on its purpose. You can find a complete list of opcodes in
the generated ``Zend/zend_vm_opcodes.h`` file. The behavior of each instruction is defined in
``Zend/zend_vm_def.h``.

Let's step through the opcodes from the example above:

- We start at the top, i.e. ``JMPZ``. If its first operand contains a "falsy" value, it will jump
  to the instruction encoded in its second operand. If it is truthy, it will simply fall through to
  the next instruction.

- The ``ECHO`` instruction prints its first operand.

- The ``RETURN`` instruction terminates the current function.

With these simple rules, we can see that the interpreter will ``echo`` only when ``$cond`` is
truthy, and skip over the ``echo`` otherwise.

That's it! This is how PHP works, fundamentally. Of course, we skipped over a ton of details. The VM
is quite complex, and will be discussed separately in the `virtual machine <todo>`__ chapter.

*********
 Opcache
*********

As you may imagine, running this whole pipeline every time PHP serves a request is time-consuming.
Luckily, it is also not necessary. We can cache the opcodes in memory between requests, to skip over
all of the phases except for the execution phase. This is precisely what the opcache extension
does. It lives in the ``ext/opcache`` directory.

Opcache also performs some optimizations on the opcodes before caching them.
As cached opcodes are expected to be reused many times, it is profitable to spend some additional
time simplifying them if possible to improve performance during execution. The optimizer lives in
``Zend/Optimizer``.

JIT
===

The opcache also implements a JIT compiler, which stands for just-in-time compiler. This compiler
takes the virtual PHP opcodes and turns them into actual machine instructions, using additional
information gained at runtime. JITs are very complex pieces of software, so this book will likely
only scratch the surface of how it works. It lives in ``ext/opcache/jit``.
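
To tie the stages together, the interpretation step described earlier can be illustrated with a toy
dispatch loop for the three example opcodes. This is only a sketch written in PHP for readability:
the ``run_toy_vm`` function and the array-based instruction layout are hypothetical and do not
reflect how php-src represents or executes opcodes. The real VM is written in C and is considerably
more involved.

.. code:: php

   <?php

   // Toy model only: run_toy_vm() and the [name, op1, op2] layout are
   // hypothetical and do not mirror php-src's real opcode structures.
   function run_toy_vm(array $opcodes, array $vars): string
   {
       $output = '';
       $ip = 0; // index of the next instruction to execute

       while (true) {
           [$name, $op1, $op2] = $opcodes[$ip];

           if ($name === 'JMPZ') {
               // Jump to the target in op2 if the variable named by op1 is
               // falsy, otherwise fall through to the next instruction.
               $ip = $vars[$op1] ? $ip + 1 : $op2;
           } elseif ($name === 'ECHO') {
               // Collect the output instead of printing it, so the caller
               // decides what to do with it.
               $output .= $op1;
               $ip++;
           } elseif ($name === 'RETURN') {
               // Terminate the current function.
               return $output;
           } else {
               throw new InvalidArgumentException("Unknown opcode: $name");
           }
       }
   }

   // The same three instructions as in the opcode listing above.
   $program = [
       ['JMPZ', 'cond', 2],
       ['ECHO', "Cond is true\n", null],
       ['RETURN', 1, null],
   ];

   echo run_toy_vm($program, ['cond' => true]);

Calling ``run_toy_vm($program, ['cond' => false])`` instead returns an empty string, because
``JMPZ`` jumps straight to the ``RETURN`` at index 2 and the ``ECHO`` is never reached.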