mbstring.c - OpenGrok history log for /PHP-8.2/ext/mbstring/mbstring.c

Revision	Date	Author	Comments
# 9e1447db	05-Sep-2021	Alex Dowad	Rename KANA2HIRA and HIRA2KANA constants (for mb_convert_kana) mb_convert_kana is able to convert fullwidth katakana to fullwidth hiragana (and vice versa). The constants referring to these modes had names like MBFL_FILT_TL_ZEN2HAN_KANA2HIRA. The "ZEN2HAN" part of the name is misleading, since these modes do not convert fullwidth (zenkaku) kana to halfwidth (hankaku). The converted characters are fullwidth both before and after the conversion. So... let's name the constants accordingly.
# 776296e1	30-Aug-2021	Alex Dowad	mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyways. More to the point, a search of publicly available PHP code indicates that nobody is really using this feature anyways. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1. One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.
# 63901584	08-Jul-2021	Nikita Popov	Deprecate calling mb_check_encoding() without argument Part of https://wiki.php.net/rfc/deprecations_php_8_1.
# e7135cb8	14-May-2021	George Peter Banyard	Use zend_string_equals_* API in a couple more places Closes GH-6979
# aca6aefd	14-May-2021	George Peter Banyard	Remove 'register' type qualifier (#6980) The compiler should be smart enough to optimize this on its own
# 01b3fc03	06-May-2021	KsaR	Update http->https in license (#6945) 1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https. 2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier". 3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted. 4. fixed indentation in some files before \|
# 0cafd53d	04-May-2021	Christoph M. Becker	Fix #81011: mb_convert_encoding removes references from arrays We need to dereference references. Closes GH-6938.
# 09efad61	08-Apr-2021	George Peter Banyard	Use zend_string_equals_(literal_)ci() API more often Also drive-by usage of zend_ini_parse_bool() Closes GH-6844
# 5caaf40b	29-Sep-2020	George Peter Banyard	Introduce pseudo-keyword ZEND_FALLTHROUGH And use it instead of comments
# a06c20a1	18-Oct-2020	Alex Dowad	Remove useless constant MBFL_ENCTYPE_MBCS This flag indicated that an encoding was 'multi-byte'; it can use a variable number of bytes to encode each character. As it turns out, we don't actually need to check this flag anywhere, so it's better to remove it.
# 3e01f5af	15-Jan-2021	Nikita Popov	Replace zend_bool uses with bool We're starting to see a mix between uses of zend_bool and bool. Replace all usages with the standard bool type everywhere. Of course, zend_bool is retained as an alias.
# 72660c41	20-Sep-2020	Alex Dowad	Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants These flags identify text encodings in mbstring which use a constant number of bytes per character. While some parts of the code do use these flags, usually to detect cases which can be optimized due to constant-width encoding, nothing cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian). So we can simplify things by combining constants.
# e169ad3b	03-Nov-2020	Alex Dowad	Consolidate all single-byte encodings in one source file We can squeeze out a lot of duplicated code in this way.
# 3e7acf90	04-Nov-2020	Alex Dowad	Remove mbstring identify filters mbstring had an 'identify filter' for almost every supported text encoding which was used when auto-detecting the most likely encoding for a string. It would run over the string and set a 'flag' if it saw anything which did not appear likely to be the encoding in question. One problem with this scheme was that encodings which merely appeared less likely to be the correct one were completely rejected, even if there was no better candidate. Another problem was that the 'identify filters' had a huge amount of code duplication with the 'conversion filters'. Eliminate the identify filters. Instead, when auto-detecting text encoding, use conversion filters to see whether the input string is valid in candidate encodings or not. At the same time, watch the type of codepoints which the string decodes to and mark it as less likely if non-printable characters (ESC, form feed, bell, etc.) or 'private use area' codepoints are seen. Interestingly, one old test case in which JIS text was misidentified as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed' and the JIS string is now auto-detected as JIS.
# be1a2155	29-Aug-2020	Alex Dowad	Optimize (AND FIX) mb_check_encoding (cut execution time by 50%+) Previously, `mb_check_encoding` did an awful lot of unneeded work. In order to determine whether a string was valid or not, it would convert the whole string into wchar (code points), which required dynamically allocating a (potentially large) buffer. Then it would turn right around and convert that big 'ol buffer of code points back to the original encoding again. Finally, it would check whether any invalid bytes were detected during that long and onerous process. The thing is, mbstring _already_ has machinery for detecting whether a string is valid in a certain encoding or not, and it doesn't require copying any data around or allocating buffers. Better yet, it can fail fast when an invalid byte is found. Why not use it? It's sure a lot faster! Further, the legacy code was also badly broken. Why? Because aside from checking whether illegal characters were detected, it would also check whether the conversion to and from wchars was lossless. But, some encodings have more than one valid encoding for the same character. In such cases, it is not possible to make the conversion to and from wchars lossless for every valid character. So `mb_check_encoding` would actually reject good strings in a lot of encodings!
# 7dc16374	12-Oct-2020	Alex Dowad	Remove unused IS_SJIS1 and IS_SJIS2 macros
# 9b4094c3	13-Oct-2020	Nikita Popov	Fix incorrect zpp parameter count in mb_substr() / mb_strcut() These functions only accept 4 params.
# 124bce3c	13-Oct-2020	Nikita Popov	Fix argument nullability in mbstring These arguments were declared nullable in stubs (and should be nullable), but didn't accept null in zpp.
# 0ffc1f55	16-Jul-2020	Alex Dowad	Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c - Make everything less gratuitously verbose - Don't litter the code with lots of unneeded NULL checks (for things which will never be NULL) - Don't return success/failure code from functions which can never fail - For encoding structs, don't use pointers to pointers to pointers for the list of alias strings. Pointers to pointers (2 levels of indirection) is what actually makes sense. This gets rid of some extraneous dereference operations.
# e950ca13	20-Sep-2020	Máté Kocsis	Consolidate the usage of "either" and "one of" in error messages Closes GH-6173
# c37a1cd6	10-Sep-2020	Máté Kocsis	Promote a few remaining errors in ext/standard Closes GH-6110
# 1c81a345	14-Sep-2020	Máté Kocsis	Make mb_send_mail() consistent with mail() The $additional_headers parameter shouldn't accept null.
# c98d4769	10-Sep-2020	Máté Kocsis	Consolidate new union type ZPP macro names They will now follow the canonical order of types. Older macros are left intact due to maintaining BC. Closes GH-6112
# f33fd9b7	11-Sep-2020	Nikita Popov	Throw ValueError on null bytes in mb_send_mail() Instead of silently replacing with spaces.
# 5b78d76e	08-Sep-2020	Alex Dowad	mb_str_split is already documented on php.net So remove TODO comment which implies that it's not.