#
9e1447db |
| 05-Sep-2021 |
Alex Dowad |
Rename KANA2HIRA and HIRA2KANA constants (for mb_convert_kana) mb_convert_kana is able to convert fullwidth katakana to fullwidth hiragana (and vice versa). The constants referring to these modes had names like MBFL_FILT_TL_ZEN2HAN_KANA2HIRA. The "ZEN2HAN" part of the name is misleading, since these modes do not convert fullwidth (zenkaku) kana to halfwidth (hankaku). The converted characters are fullwidth both before and after the conversion. So... let's name the constants accordingly.
|
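The fullwidth-to-fullwidth kana conversion that commit 9e1447db refers to can be illustrated with a small sketch. It relies on the fact that the Unicode hiragana and katakana blocks are laid out 0x60 code points apart. The function names and the exact ranges handled here are assumptions for illustration; this is not mbstring's implementation, which may cover a different set of characters.

```c
#include <stdint.h>

/* Hypothetical helpers, not taken from mbstring.
 * Assumption: only the directly paired ranges are converted:
 * hiragana U+3041..U+3096 and katakana U+30A1..U+30F6. */
enum { SKETCH_KANA_OFFSET = 0x60 };

/* Fullwidth hiragana -> fullwidth katakana; other code points pass through. */
uint32_t sketch_hira_to_kata(uint32_t cp)
{
    if (cp >= 0x3041 && cp <= 0x3096) {
        return cp + SKETCH_KANA_OFFSET;
    }
    return cp;
}

/* Fullwidth katakana -> fullwidth hiragana; other code points pass through. */
uint32_t sketch_kata_to_hira(uint32_t cp)
{
    if (cp >= 0x30A1 && cp <= 0x30F6) {
        return cp - SKETCH_KANA_OFFSET;
    }
    return cp;
}
```

Both input and output stay fullwidth, which is why "ZEN2HAN" in the old constant names was misleading.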
#
776296e1 |
| 30-Aug-2021 |
Alex Dowad |
mbstring no longer provides 'long' substitutions for erroneous input bytes

Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`.

However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyway. More to the point, a search of publicly available PHP code indicates that nobody is really using this feature anyway.

Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames.

However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported.

With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1.

One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.
|
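The return-value convention described in commit 776296e1 can be sketched as a two-stage filter chain. Everything below is hypothetical (struct layout, function names, the ASCII-only decoder); it only illustrates why a filter must return 0 rather than echo its input once -1 doubles as the error marker passed downstream.

```c
#include <stddef.h>

/* Hypothetical names throughout; a sketch of the convention, not mbstring code. */
#define SKETCH_BAD_INPUT (-1) /* single error marker, passed along like a codepoint */

typedef struct sketch_filter sketch_filter;
struct sketch_filter {
    int (*feed)(int c, sketch_filter *self); /* returns 0 on success, -1 to abort */
    sketch_filter *next;
    int errors;    /* error markers received */
    int out_count; /* valid codepoints received */
};

/* Input side: an ASCII-only decoder that flags any byte >= 0x80 as erroneous. */
int sketch_ascii_decode(int byte, sketch_filter *self)
{
    int cp = (byte >= 0 && byte < 0x80) ? byte : SKETCH_BAD_INPUT;
    return self->next->feed(cp, self->next);
}

/* Output side: records what it sees. Returning the received value here would
 * turn an error marker into an 'abort', so it always returns 0. */
int sketch_count_output(int cp, sketch_filter *self)
{
    if (cp == SKETCH_BAD_INPUT) {
        self->errors++;
    } else {
        self->out_count++;
    }
    return 0;
}

/* Driver: stops only when a filter explicitly asks to abort. */
int sketch_convert(const unsigned char *s, size_t len, sketch_filter *decoder)
{
    for (size_t i = 0; i < len; i++) {
        if (decoder->feed(s[i], decoder) < 0) {
            return -1;
        }
    }
    return 0;
}
```

Under this sketch, an erroneous byte in the middle of the input is counted as an error and conversion continues with the bytes that follow.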
#
63901584 |
| 08-Jul-2021 |
Nikita Popov |
Deprecate calling mb_check_encoding() without argument Part of https://wiki.php.net/rfc/deprecations_php_8_1.
|
#
e7135cb8 |
| 14-May-2021 |
George Peter Banyard |
Use zend_string_equals_* API in a couple more places Closes GH-6979
|
#
aca6aefd |
| 14-May-2021 |
George Peter Banyard |
Remove 'register' type qualifier (#6980) The compiler should be smart enough to optimize this on its own
|
#
01b3fc03 |
| 06-May-2021 |
KsaR |
Update http->https in license (#6945)
1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https.
2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier".
3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted.
4. fixed indentation in some files before |
|
#
0cafd53d |
| 04-May-2021 |
Christoph M. Becker |
Fix #81011: mb_convert_encoding removes references from arrays We need to dereference references. Closes GH-6938.
|
#
09efad61 |
| 08-Apr-2021 |
George Peter Banyard |
Use zend_string_equals_(literal_)ci() API more often Also drive-by usage of zend_ini_parse_bool() Closes GH-6844
|
#
5caaf40b |
| 29-Sep-2020 |
George Peter Banyard |
Introduce pseudo-keyword ZEND_FALLTHROUGH And use it instead of comments
|
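A pseudo-keyword of the kind commit 5caaf40b introduces can be sketched as a macro that expands to a compiler attribute where one is available. The macro name and definition below are assumptions for illustration; the actual ZEND_FALLTHROUGH definition may differ.

```c
/* Hypothetical macro, modelled on the idea behind ZEND_FALLTHROUGH. */
#if defined(__has_attribute)
# if __has_attribute(fallthrough)
#  define SKETCH_FALLTHROUGH __attribute__((fallthrough))
# endif
#endif
#ifndef SKETCH_FALLTHROUGH
# define SKETCH_FALLTHROUGH ((void)0) /* no-op on compilers without the attribute */
#endif

/* Each case deliberately falls into the next one. The macro documents that
 * intent to the compiler instead of relying on a comment. */
int sketch_steps_from(int level)
{
    int steps = 0;
    switch (level) {
    case 3:
        steps++;
        SKETCH_FALLTHROUGH;
    case 2:
        steps++;
        SKETCH_FALLTHROUGH;
    case 1:
        steps++;
        break;
    default:
        break;
    }
    return steps;
}
```

Unlike a comment, the attribute form lets the compiler's implicit-fallthrough warnings distinguish intended fallthrough from a forgotten `break`.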
#
a06c20a1 |
| 18-Oct-2020 |
Alex Dowad |
Remove useless constant MBFL_ENCTYPE_MBCS This flag indicated that an encoding was 'multi-byte'; it can use a variable number of bytes to encode each character. As it turns out, we don't actually need to check this flag anywhere, so it's better to remove it.
|
#
3e01f5af |
| 15-Jan-2021 |
Nikita Popov |
Replace zend_bool uses with bool We're starting to see a mix between uses of zend_bool and bool. Replace all usages with the standard bool type everywhere. Of course, zend_bool is retained as an alias.
|
#
72660c41 |
| 20-Sep-2020 |
Alex Dowad |
Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants These flags identify text encodings in mbstring which use a constant number of bytes per character. While some parts of the code do use these flags, usually to detect cases which can be optimized due to constant-width encoding, nothing cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian). So we can simplify things by combining constants.
|
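The kind of constant-width optimization that commit 72660c41 mentions can be sketched as follows. The flag names and values are hypothetical, not mbstring's actual constants; the point is only that a character count needs the width of each character, not its byte order.

```c
#include <stddef.h>

/* Hypothetical flags; one per character width, with no LE/BE variants. */
enum {
    SKETCH_ENCTYPE_SBCS = 1 << 0, /* 1 byte per character */
    SKETCH_ENCTYPE_WCS2 = 1 << 1, /* 2 bytes per character */
    SKETCH_ENCTYPE_WCS4 = 1 << 2  /* 4 bytes per character */
};

#define SKETCH_UNKNOWN_COUNT ((size_t)-1)

/* Returns the character count for constant-width encodings without scanning
 * the string, or SKETCH_UNKNOWN_COUNT if the string would have to be decoded. */
size_t sketch_char_count(size_t byte_len, unsigned flags)
{
    if (flags & SKETCH_ENCTYPE_SBCS) {
        return byte_len;
    }
    if (flags & SKETCH_ENCTYPE_WCS2) {
        return byte_len / 2;
    }
    if (flags & SKETCH_ENCTYPE_WCS4) {
        return byte_len / 4;
    }
    return SKETCH_UNKNOWN_COUNT;
}
```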
#
e169ad3b |
| 03-Nov-2020 |
Alex Dowad |
Consolidate all single-byte encodings in one source file We can squeeze out a lot of duplicated code in this way.
|
#
3e7acf90 |
| 04-Nov-2020 |
Alex Dowad |
Remove mbstring identify filters

mbstring had an 'identify filter' for almost every supported text encoding which was used when auto-detecting the most likely encoding for a string. It would run over the string and set a 'flag' if it saw anything which did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared less likely to be the correct one were completely rejected, even if there was no better candidate. Another problem was that the 'identify filters' had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text encoding, use conversion filters to see whether the input string is valid in candidate encodings or not. At the same time, watch the type of codepoints which the string decodes to and mark it as less likely if non-printable characters (ESC, form feed, bell, etc.) or 'private use area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed' and the JIS string is now auto-detected as JIS.
|
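The detection approach described in commit 3e7acf90 can be sketched roughly as below: decode the input in each candidate encoding, discard candidates in which it is invalid, and prefer the candidate whose decoded codepoints look least suspicious. The struct, the function names, and the exact scoring (one demerit per control or private-use codepoint in the Basic Multilingual Plane) are assumptions; mbstring's actual heuristics may weigh things differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one candidate encoding. */
typedef struct {
    const char *name; /* candidate encoding name */
    int valid;        /* 0 if decoding the input hit an error */
    size_t demerits;  /* suspicious codepoints seen while decoding */
} sketch_candidate;

/* Counts codepoints that make a candidate less likely: non-printable
 * control characters (other than tab, LF, CR) and BMP private-use codepoints. */
size_t sketch_count_demerits(const uint32_t *cps, size_t n)
{
    size_t demerits = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = cps[i];
        int is_control = cp < 0x20 && cp != '\t' && cp != '\n' && cp != '\r';
        int is_private_use = cp >= 0xE000 && cp <= 0xF8FF;
        if (is_control || is_private_use) {
            demerits++;
        }
    }
    return demerits;
}

/* Picks the valid candidate with the fewest demerits, or NULL if none is
 * valid. A merely 'less likely' candidate still wins if nothing beats it. */
const sketch_candidate *sketch_pick(const sketch_candidate *c, size_t n)
{
    const sketch_candidate *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!c[i].valid) {
            continue;
        }
        if (best == NULL || c[i].demerits < best->demerits) {
            best = &c[i];
        }
    }
    return best;
}
```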
#
be1a2155 |
| 29-Aug-2020 |
Alex Dowad |
Optimize (AND FIX) mb_check_encoding (cut execution time by 50%+)

Previously, `mb_check_encoding` did an awful lot of unneeded work. In order to determine whether a string was valid or not, it would convert the whole string into wchar (code points), which required dynamically allocating a (potentially large) buffer. Then it would turn right around and convert that big 'ol buffer of code points back to the original encoding again. Finally, it would check whether any invalid bytes were detected during that long and onerous process.

The thing is, mbstring _already_ has machinery for detecting whether a string is valid in a certain encoding or not, and it doesn't require copying any data around or allocating buffers. Better yet, it can fail fast when an invalid byte is found. Why not use it? It's sure a lot faster!

Further, the legacy code was also badly broken. Why? Because aside from checking whether illegal characters were detected, it would also check whether the conversion to and from wchars was lossless. But, some encodings have more than one valid encoding for the same character. In such cases, it is not possible to make the conversion to and from wchars lossless for every valid character. So `mb_check_encoding` would actually reject good strings in a lot of encodings!
|
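The fail-fast, allocation-free style of validation that commit be1a2155 argues for can be illustrated with a standalone UTF-8 checker. This is a sketch under the assumption that UTF-8 is the encoding being checked; the function name is hypothetical and the code is not mbstring's own validation machinery.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical validator: scans once, allocates nothing, and returns at the
 * first invalid byte instead of converting the whole string and back. */
bool sketch_utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t need;           /* continuation bytes expected */
        unsigned long cp, min; /* decoded value and smallest non-overlong value */

        if (b < 0x80) {
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false; /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < need) {
            return false; /* sequence truncated at end of input */
        }
        for (size_t k = 1; k <= need; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) {
                return false;
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        /* Reject overlong forms, values past U+10FFFF, and surrogates. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        i += need + 1;
    }
    return true;
}
```

Note that the check never compares a re-encoded copy against the input, so it cannot reject a valid string merely because a round trip through code points would pick a different byte sequence for the same character.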
#
7dc16374 |
| 12-Oct-2020 |
Alex Dowad |
Remove unused IS_SJIS1 and IS_SJIS2 macros
|
#
9b4094c3 |
| 13-Oct-2020 |
Nikita Popov |
Fix incorrect zpp parameter count in mb_substr() / mb_strcut() These functions only accept 4 params.
|
#
124bce3c |
| 13-Oct-2020 |
Nikita Popov |
Fix argument nullability in mbstring These arguments were declared nullable in stubs (and should be nullable), but didn't accept null in zpp.
|
#
0ffc1f55 |
| 16-Jul-2020 |
Alex Dowad |
Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c
- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for things which will never be NULL)
- Don't return success/failure code from functions which can never fail
- For encoding structs, don't use pointers to pointers to pointers for the list of alias strings. Pointers to pointers (2 levels of indirection) is what actually makes sense. This gets rid of some extraneous dereference operations.
|
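The alias-list point in commit 0ffc1f55 can be illustrated with a sketch in which the encoding struct holds a NULL-terminated array of strings, that is, two levels of indirection. The struct, field names, and the case-sensitive comparison are assumptions for illustration and do not reproduce mbstring's actual encoding struct or lookup.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding descriptor. */
typedef struct {
    const char *name;
    const char **aliases; /* NULL-terminated array of strings, or NULL */
} sketch_encoding;

static const char *sketch_utf8_aliases[] = { "utf8", NULL };

const sketch_encoding sketch_utf8 = { "UTF-8", sketch_utf8_aliases };

/* Returns 1 if `name` is the encoding's name or one of its aliases.
 * Assumption: exact, case-sensitive matching keeps the sketch short. */
int sketch_encoding_matches(const sketch_encoding *enc, const char *name)
{
    if (strcmp(enc->name, name) == 0) {
        return 1;
    }
    if (enc->aliases != NULL) {
        for (const char **a = enc->aliases; *a != NULL; a++) {
            if (strcmp(*a, name) == 0) {
                return 1;
            }
        }
    }
    return 0;
}
```

Walking the list needs one dereference per alias; a third level of pointer would add a dereference without adding information.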
#
e950ca13 |
| 20-Sep-2020 |
Máté Kocsis |
Consolidate the usage of "either" and "one of" in error messages Closes GH-6173
|
#
c37a1cd6 |
| 10-Sep-2020 |
Máté Kocsis |
Promote a few remaining errors in ext/standard Closes GH-6110
|
#
1c81a345 |
| 14-Sep-2020 |
Máté Kocsis |
Make mb_send_mail() consistent with mail() The $additional_headers parameter shouldn't accept null.
|
#
c98d4769 |
| 10-Sep-2020 |
Máté Kocsis |
Consolidate new union type ZPP macro names They will now follow the canonical order of types. Older macros are left intact due to maintaining BC. Closes GH-6112
|
#
f33fd9b7 |
| 11-Sep-2020 |
Nikita Popov |
Throw ValueError on null bytes in mb_send_mail() Instead of silently replacing with spaces.
|
#
5b78d76e |
| 08-Sep-2020 |
Alex Dowad |
mb_str_split is already documented on php.net So remove TODO comment which implies that it's not.
|