#############
 zend_string
#############

In C, strings are represented as sequential lists of characters, ``char*`` or ``char[]``. The end of
the string is usually indicated by the special NUL character, ``'\0'``. This comes with a few
significant downsides:

-  Calculating the length of the string is expensive, as it requires walking the entire string to
   look for the terminating NUL character.
-  The string may not contain the NUL character itself.
-  It is easy to run into buffer overflows if the NUL byte is accidentally missing.
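
The first two downsides can be reproduced with a few lines of standard C. This is a minimal
standalone sketch, not php-src code; ``measure_c_string`` and ``payload`` are hypothetical names
chosen for illustration.

.. code:: c

   #include <assert.h>
   #include <stddef.h>
   #include <string.h>

   /* Hypothetical helper: returns what strlen() reports for a buffer. */
   static size_t measure_c_string(const char *buffer)
   {
       assert(buffer != NULL);
       /* strlen() scans byte by byte until it finds '\0', so it is O(n). */
       return strlen(buffer);
   }

   /* Seven bytes of data with an embedded NUL, plus the implicit terminator. */
   static const char payload[] = "foo\0bar";

Here ``measure_c_string(payload)`` reports 3, although the array holds 7 bytes of data before its
terminator: everything after the embedded NUL is invisible to ``strlen``.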

php-src uses the ``zend_string`` struct as an abstraction over ``char*``, which explicitly stores
the string's length, along with some other fields. It looks as follows:

.. code:: c

   struct _zend_string {
       zend_refcounted_h gc;
       zend_ulong        h; /* hash value */
       size_t            len;
       char              val[1];
   };

The ``gc`` field is used for :doc:`./reference-counting`. The ``h`` field contains a hash value,
which is used for `hash table <todo>`__ lookups. The ``len`` field stores the length of the string
in bytes, and the ``val`` field contains the actual string data.

You may wonder why the ``val`` field is declared as ``char val[1]``. This is called the `struct
hack`_ in C. It is used to create structs with a flexible size, namely by allowing the last element
to be expanded arbitrarily. In this case, the size of ``zend_string`` depends on the string's length,
which is determined at runtime (see ``_ZSTR_STRUCT_SIZE``). When allocating the string, we append
enough bytes to the allocation to hold the string's content.

.. _struct hack: https://www.geeksforgeeks.org/struct-hack/
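
To make the technique concrete, here is a standalone sketch of the struct hack, assuming a
simplified ``toy_string`` type without the ``gc`` and ``h`` fields. The names ``toy_string``,
``toy_string_size`` and ``toy_string_new`` are hypothetical. The size computation is only meant to
resemble what ``_ZSTR_STRUCT_SIZE`` does, and plain ``malloc`` stands in for PHP's allocators.

.. code:: c

   #include <assert.h>
   #include <stddef.h>
   #include <stdlib.h>
   #include <string.h>

   /* Simplified stand-in for zend_string: no gc header, no hash field. */
   typedef struct {
       size_t len;
       char   val[1]; /* struct hack: the allocation really holds len + 1 bytes */
   } toy_string;

   /* Bytes needed for the header, len characters and a trailing NUL. */
   static size_t toy_string_size(size_t len)
   {
       return offsetof(toy_string, val) + len + 1;
   }

   /* Allocates a toy_string and copies len bytes from src into it. */
   static toy_string *toy_string_new(const char *src, size_t len)
   {
       assert(src != NULL);
       toy_string *s = malloc(toy_string_size(len));
       if (s == NULL) {
           return NULL;
       }
       s->len = len;
       memcpy(s->val, src, len);
       s->val[len] = '\0'; /* keep the data usable as a C string */
       return s;
   }

Because the length is stored next to the data, reading ``s->len`` is a constant-time operation, and
the header and the characters live in a single allocation.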

Here's a basic example of how to use ``zend_string``:

.. code:: c

   // Allocate the string.
   zend_string *string = ZSTR_INIT_LITERAL("Hello world!", /* persistent */ false);
   // Write it to the output buffer.
   zend_write(ZSTR_VAL(string), ZSTR_LEN(string));
   // Decrease the reference count and free it if necessary.
   zend_string_release(string);

``ZSTR_INIT_LITERAL`` creates a ``zend_string`` from a string literal. It is just a wrapper around
``zend_string_init(const char *string, size_t length, bool persistent)`` that provides the length of
the string at compile time. The ``persistent`` parameter indicates whether the string is allocated
using ``malloc`` (``persistent == true``) or ``emalloc``, `PHP's custom allocator <todo>`__
(``persistent == false``) that is emptied after each request.

When you're done using the string, you must call ``zend_string_release``, or the memory will leak.
``zend_string_release`` will automatically call ``free`` or ``efree``, depending on how the
string was allocated. After releasing the string, you must not access any of its fields anymore, as
it may have been freed if you were its last user.
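
The release rule can be modeled with a toy reference-counted object. This sketch is a
simplification under stated assumptions: ``toy_refcounted`` and its functions are hypothetical, only
``malloc`` and ``free`` are used, and the real implementation in php-src handles more cases (such
as interned strings).

.. code:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stdlib.h>

   /* Toy model: a reference count stored in front of a payload. */
   typedef struct {
       unsigned refcount;
       int      payload;
   } toy_refcounted;

   static toy_refcounted *toy_new(int payload)
   {
       toy_refcounted *obj = malloc(sizeof(*obj));
       if (obj != NULL) {
           obj->refcount = 1; /* the creator holds the first reference */
           obj->payload = payload;
       }
       return obj;
   }

   /* Takes an additional reference to the same object. */
   static toy_refcounted *toy_copy(toy_refcounted *obj)
   {
       assert(obj != NULL);
       obj->refcount++;
       return obj;
   }

   /* Drops one reference. Returns true if the object was freed. */
   static bool toy_release(toy_refcounted *obj)
   {
       assert(obj != NULL && obj->refcount > 0);
       if (--obj->refcount == 0) {
           free(obj);
           return true;
       }
       return false;
   }

Once ``toy_release`` has been called, the caller cannot know whether it held the last reference, so
it must treat the pointer as invalid from then on.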

*****
 API
*****

The string API is defined in ``Zend/zend_string.h``. It provides a number of functions for creating
new strings.

.. list-table:: ``zend_string`` creation
   :header-rows: 1

   -  -  Function/Macro [#persistent]_
      -  Description

   -  -  ``ZSTR_INIT_LITERAL(s, p)``
      -  Creates a new string from a string literal.

   -  -  ``zend_string_init(s, l, p)``
      -  Creates a new string from a character buffer.

   -  -  ``zend_string_alloc(l, p)``
      -  Creates a new string of a given length without initializing its content.

   -  -  ``zend_string_concat2(s1, l1, s2, l2)``
      -  Creates a non-persistent string by concatenating two character buffers.

   -  -  ``zend_string_concat3(...)``
      -  Same as ``zend_string_concat2``, but for three character buffers.

   -  -  ``ZSTR_EMPTY_ALLOC()``
      -  Gets an immutable, empty string. This does not allocate memory.

   -  -  ``ZSTR_CHAR(char)``
      -  Gets an immutable, single-character string. This does not allocate memory.

   -  -  ``ZSTR_KNOWN(ZEND_STR_const)``

      -  Gets an immutable, predefined string. Used for strings common within PHP itself, e.g.
         ``"class"``. See ``ZEND_KNOWN_STRINGS`` in ``Zend/zend_string.h``. This does not allocate
         memory.

.. [#persistent]

   ``s`` = string (a ``char*`` buffer or string literal), ``l`` = length, ``p`` = persistent.

As per php-src fashion, you are not supposed to access the ``zend_string`` fields directly. Instead,
use the following macros. There are macros for both ``zend_string`` and ``zval`` values known to
contain strings.

.. list-table:: Accessor macros
   :header-rows: 1

   -  -  ``zend_string``
      -  ``zval``
      -  Description

   -  -  ``ZSTR_LEN``
      -  ``Z_STRLEN[_P]``
      -  Returns the length of the string in bytes.

   -  -  ``ZSTR_VAL``
      -  ``Z_STRVAL[_P]``
      -  Returns the string data as a ``char*``.

   -  -  ``ZSTR_HASH``
      -  ``Z_STRHASH[_P]``
      -  Computes the string hash if it hasn't been computed already, and returns it.

   -  -  ``ZSTR_H``
      -  \-
      -  Returns the string hash. This macro assumes that the hash has already been computed.
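
The difference between ``ZSTR_HASH`` and ``ZSTR_H`` is lazy caching. The following standalone sketch
models that idea, assuming that a stored value of 0 means "not computed yet". The ``toy_hashed``
type, the function names and the DJB-style hash are illustrative choices; the hash function that
php-src actually uses may differ in detail.

.. code:: c

   #include <assert.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Toy string view with a cached hash; 0 means "not computed yet". */
   typedef struct {
       uint64_t    h;
       size_t      len;
       const char *val;
   } toy_hashed;

   /* Counts how often the hash is really computed. */
   static unsigned toy_hash_computations = 0;

   /* DJB-style "times 33" hash, chosen for illustration only. */
   static uint64_t toy_hash_bytes(const char *data, size_t len)
   {
       uint64_t hash = 5381;
       for (size_t i = 0; i < len; i++) {
           hash = hash * 33 + (unsigned char)data[i];
       }
       toy_hash_computations++;
       /* Force the top bit so that a computed hash is never 0. */
       return hash | UINT64_C(0x8000000000000000);
   }

   /* Like ZSTR_HASH in spirit: compute on first use, then reuse the cache. */
   static uint64_t toy_get_hash(toy_hashed *s)
   {
       assert(s != NULL);
       if (s->h == 0) {
           s->h = toy_hash_bytes(s->val, s->len);
       }
       return s->h;
   }

Reading ``s->h`` directly corresponds to ``ZSTR_H`` in this model: it is only meaningful after
``toy_get_hash`` has run at least once.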

.. list-table:: Reference counting functions
   :header-rows: 1

   -  -  Function
      -  Description

   -  -  ``zend_string_copy(s)``
      -  Increases the reference count and returns the same string. The reference count is not
         increased if the string is interned.

   -  -  ``zend_string_release(s)``
      -  Decreases the reference count and frees the string if it goes to 0.

   -  -  ``zend_string_dup(s, p)``
      -  Creates a true copy of the string in a new allocation, except if the string is interned.

   -  -  ``zend_string_separate(s, p)``
      -  Duplicates the string if the reference count is greater than 1. See
         :doc:`./reference-counting` for details.

   -  -  ``zend_string_realloc(s, l, p)``

      -  Changes the size of the string. If the string has a reference count greater than 1 or if
         the string is interned, a new string is created. You must always use the return value of
         this function, as the original string may have been moved to a new location in memory.
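
The rule about always using the return value is the same one that applies to the standard
``realloc``. The sketch below shows it on a simplified, hypothetical ``toy_buf`` type. It ignores
reference counts and interning, which ``zend_string_realloc`` additionally has to consider.

.. code:: c

   #include <assert.h>
   #include <stddef.h>
   #include <stdlib.h>
   #include <string.h>

   /* Simplified stand-in for zend_string (no gc header, no hash). */
   typedef struct {
       size_t len;
       char   val[1];
   } toy_buf;

   static toy_buf *toy_buf_new(const char *src, size_t len)
   {
       assert(src != NULL);
       toy_buf *buf = malloc(offsetof(toy_buf, val) + len + 1);
       if (buf == NULL) {
           return NULL;
       }
       buf->len = len;
       memcpy(buf->val, src, len);
       buf->val[len] = '\0';
       return buf;
   }

   /* Resizes buf. On success the old pointer must no longer be used,
    * because realloc() may have moved the allocation. Bytes added by
    * growing are uninitialized until the caller writes them. */
   static toy_buf *toy_buf_resize(toy_buf *buf, size_t new_len)
   {
       assert(buf != NULL);
       toy_buf *resized = realloc(buf, offsetof(toy_buf, val) + new_len + 1);
       if (resized == NULL) {
           return NULL; /* buf itself is still valid in this case */
       }
       resized->len = new_len;
       resized->val[new_len] = '\0';
       return resized;
   }

A caller continues with the pointer returned by ``toy_buf_resize`` and forgets the one it passed in,
since the two may or may not be equal.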

There are various functions to compare strings. The ``zend_string_equals`` function compares two
strings in full, while ``zend_string_starts_with`` checks whether the first argument starts with the
second. There are variations for ``_ci`` and ``_literal``, i.e. case-insensitive comparison and
literal strings, respectively. We won't go over all variations here, as they are straightforward to
use.
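
Storing the length makes both kinds of comparison cheap to express. As a rough standalone sketch
(the ``toy_equals`` and ``toy_starts_with`` helpers are hypothetical and operate on raw buffers, not
on ``zend_string``), the checks could look like this:

.. code:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>
   #include <string.h>

   /* Equality on length-prefixed data: compare lengths first, then bytes. */
   static bool toy_equals(const char *a, size_t a_len, const char *b, size_t b_len)
   {
       assert(a != NULL && b != NULL);
       return a_len == b_len && memcmp(a, b, a_len) == 0;
   }

   /* Prefix check: the prefix must fit, and its bytes must match. */
   static bool toy_starts_with(const char *str, size_t str_len,
                               const char *prefix, size_t prefix_len)
   {
       assert(str != NULL && prefix != NULL);
       return prefix_len <= str_len && memcmp(str, prefix, prefix_len) == 0;
   }

Because ``memcmp`` is used instead of ``strcmp``, embedded NUL bytes are compared like any other
byte, and strings of different lengths are rejected without looking at their content.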

******************
 Interned strings
******************

Programs use some strings many times. For example, if your program declares a class called
``MyClass``, it would be wasteful to allocate a new string ``"MyClass"`` every time it is referenced
within your program. Instead, when repeated strings are expected, php-src uses a technique called
string interning. Essentially, this is just a simple `HashTable <todo>`__ where existing interned
strings are stored. When creating a new interned string, php-src first checks the interned string
buffer. If the string is already there, it returns a pointer to the existing string. If it isn't, it
allocates a new string and adds it to the buffer.

.. code:: c

   zend_string *str1 = zend_new_interned_string(
       ZSTR_INIT_LITERAL("MyClass", /* persistent */ false));

   // In some other place entirely.
   zend_string *str2 = zend_new_interned_string(
       ZSTR_INIT_LITERAL("MyClass", /* persistent */ false));

   assert(ZSTR_IS_INTERNED(str1));
   assert(ZSTR_IS_INTERNED(str2));
   assert(str1 == str2);

Interned strings are *not* reference counted, as they are expected to live for the entire request,
or longer.
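
The lookup-or-insert logic behind interning can be sketched without any php-src types. The version
below is a deliberately naive model: ``toy_intern`` is a hypothetical function, it works on plain C
strings, it scans a fixed-size array instead of using a hash table, and it never frees its entries.

.. code:: c

   #include <assert.h>
   #include <stddef.h>
   #include <stdlib.h>
   #include <string.h>

   #define TOY_INTERN_CAPACITY 64

   /* Toy intern table: a linear array instead of a real hash table. */
   static const char *toy_interned[TOY_INTERN_CAPACITY];
   static size_t toy_interned_count = 0;

   /* Returns the canonical copy of str, or NULL if the table is full
    * or memory could not be allocated. */
   static const char *toy_intern(const char *str)
   {
       assert(str != NULL);
       for (size_t i = 0; i < toy_interned_count; i++) {
           if (strcmp(toy_interned[i], str) == 0) {
               return toy_interned[i]; /* already known: reuse the pointer */
           }
       }
       if (toy_interned_count == TOY_INTERN_CAPACITY) {
           return NULL;
       }
       size_t size = strlen(str) + 1;
       char *copy = malloc(size);
       if (copy == NULL) {
           return NULL;
       }
       memcpy(copy, str, size);
       toy_interned[toy_interned_count++] = copy;
       return copy;
   }

Two calls with equal content return the same pointer, so equality of interned strings can be checked
by comparing addresses.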

With opcache, this goes one step further by sharing strings across different processes. For example,
if you're using php-fpm with 8 workers, all workers will share the same interned strings buffer.
However, the details are a bit more complicated: during requests, no interned strings are actually
created. Instead, this is delayed until the script is persisted to shared memory. This means that
``zend_new_interned_string`` may not actually return an interned string if opcache is enabled.
Usually you don't have to worry about this.