#
5fdb2724 |
| 22-Dec-2023 |
Alex Dowad |
Add mbstring support for GB18030-2022 text encoding The previous version of the GB-18030 standard was published in 2005. This commit adds support for the updated (2022) version of this text encoding. The existing GB18030 implementation has been left unchanged for backwards compatibility; users who want to use the new standard must explicitly indicate the desired text encoding is 'GB18030-2022'. The document which defines GB18030-2022, published by the government of the People's Republic of China, defines three levels of standards compliance. This implementation is intended to achieve Implementation Level 3, which is the highest level of compliance. Experts in the GB18030 standard are requested to assess this implementation and report any deviation from the standard.
|
#
cffdeb81 |
| 14-Dec-2023 |
Alex Dowad |
Add specialized implementation of mb_strcut for GB18030 For GB18030, it is not generally possible to identify character boundaries without scanning through the entire string. Therefore, implement mb_strcut using a similar strategy as the mblen_table based implementation in mbstring.c. The difference is that for GB18030, we need to look at two leading bytes to determine the byte length of a multi-byte character. The new implementation is 4-5x faster for short strings, and more than 10x faster for long strings. (Part of the reason why this new code has such a great performance advantage is because it is replacing code based on the older text conversion filters provided by libmbfl, which were quite slow.) The behavior is the same as before for valid GB18030 strings; for some invalid strings, mb_strcut will choose different 'cut' points as compared to before. (Clang's libFuzzer was used to compare the old and new implementations, searching for test cases where they had different behavior; no such cases were found.)
|
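The scanning strategy described in the GB18030 `mb_strcut` commit can be illustrated with a minimal sketch. This is not the mbstring C code: the function names `gb18030_char_len` and `strcut_gb18030` are hypothetical, and the only thing assumed from the encoding is its general shape, where a lead byte of 0x81-0xFE followed by a byte in 0x30-0x39 begins a four-byte character and any other second byte begins a two-byte character. The handling of invalid or truncated input here is a simplification and may differ from what PHP does.

```python
def gb18030_char_len(data: bytes, pos: int) -> int:
    """Guess the byte length of the character at pos from its first two bytes."""
    b1 = data[pos]
    # ASCII, stray bytes outside 0x81-0xFE, or a lead byte at the very end:
    # treat as a single byte (simplified error handling, an assumption).
    if b1 < 0x81 or b1 == 0xFF or pos + 1 >= len(data):
        return 1
    b2 = data[pos + 1]
    # A digit (0x30-0x39) in second position marks a four-byte character.
    return 4 if 0x30 <= b2 <= 0x39 else 2


def strcut_gb18030(data: bytes, start: int, length: int) -> bytes:
    """Cut by byte offsets, moving both cut points back to character boundaries."""
    n = len(data)
    start = min(max(start, 0), n)
    end = min(start + max(length, 0), n)

    # Scan from the beginning: boundaries cannot be found by looking
    # only at the bytes near the cut point.
    pos = 0
    while pos < n:
        nxt = min(pos + gb18030_char_len(data, pos), n)
        if nxt > start:
            break
        pos = nxt
    cut_start = pos  # last boundary at or before the requested start

    while pos < n:
        nxt = min(pos + gb18030_char_len(data, pos), n)
        if nxt > end:
            break
        pos = nxt
    return data[cut_start:pos]  # pos is the last boundary at or before end
```

For example, cutting `b"a\xd6\xd0b"` at byte offset 2 lands in the middle of the two-byte character, so the sketch moves the starting cut point back to offset 1.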
#
91279cfd |
| 18-Nov-2023 |
Niels Dossche <7771979+nielsdos@users.noreply.github.com> |
Use binary search for cp932ext*_ucs_table lookups (#12712) * Use binary search for cp932ext*_ucs_table lookups A large amount of time is spent doing a linear search through these tables in the CP932 encoding. Instead of that, we can add sorted versions of these tables that also store the index of the non-sorted version and perform a binary search on those sorted versions. This reduces the time spent from 1.54s to 0.91s for the reference benchmark [1]. [1] https://github.com/php/php-src/issues/12684#issuecomment-1813799924 * Fix search bounds
|
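The idea in the binary-search commit, a sorted copy of a table that also remembers each entry's position in the unsorted original, can be sketched as follows. This is an illustration only: the names `build_sorted_index` and `lookup` are hypothetical, the table contents in the usage note are made up, and the real change is implemented in C inside mbstring.

```python
from bisect import bisect_left


def build_sorted_index(table):
    """Return (code point, original index) pairs sorted by code point."""
    return sorted((cp, i) for i, cp in enumerate(table))


def lookup(sorted_pairs, cp):
    """Binary-search for cp; return its index in the unsorted table, or None."""
    # (cp, -1) sorts before every real entry for cp, so bisect_left
    # lands on the first pair whose code point is >= cp.
    pos = bisect_left(sorted_pairs, (cp, -1))
    if pos < len(sorted_pairs) and sorted_pairs[pos][0] == cp:
        return sorted_pairs[pos][1]
    return None
```

With a made-up table such as `[0x2252, 0x2261, 0x222B, 0x221A]`, `lookup(build_sorted_index(table), 0x222B)` returns the original position 2 after O(log n) comparisons, where a linear search would walk the table entry by entry.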
#
1f0cf133 |
| 30-Sep-2023 |
Alex Dowad |
Add fast mb_strcut implementation for UTF-8 The old implementation runs through the entire string to pick out the part which should be returned by mb_strcut. This creates significant performance overhead. The new specialized implementation of mb_strcut for UTF-8 usually only examines a few bytes around the starting and ending cut points, meaning it generally runs in constant time. For UTF-8 strings just a few bytes long, the new implementation is around 10% faster (according to microbenchmarks which I ran locally). For strings around 10,000 bytes in length, it is 50-300x faster. (Yes, that is 300x and not 300%.) The new implementation behaves identically to the old one on VALID UTF-8 strings; a fuzzer was used to help ensure this is the case. On invalid UTF-8 strings, there is a difference: in some cases, the old implementation will pass invalid byte sequences through unchanged, while in others it will remove them. The new implementation has behavior which is perhaps slightly more predictable: it simply backs up the starting and ending cut points to the preceding "starter byte" (one which is not a UTF-8 continuation byte).
|
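The "back up to the preceding starter byte" rule from the UTF-8 `mb_strcut` commit can be sketched in a few lines. This is a simplified illustration, not the C implementation: the names `back_up_to_starter` and `strcut_utf8` are hypothetical, and computing the ending cut point from the requested (unadjusted) start offset is an assumption of this sketch.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def back_up_to_starter(data: bytes, pos: int) -> int:
    """Move pos left until it no longer points at a continuation byte."""
    while 0 < pos < len(data) and is_continuation(data[pos]):
        pos -= 1
    return pos


def strcut_utf8(data: bytes, start: int, length: int) -> bytes:
    """Cut by byte offsets, examining only bytes near the two cut points."""
    n = len(data)
    start = min(max(start, 0), n)
    end = min(start + max(length, 0), n)
    # Each adjustment touches at most a few bytes for valid UTF-8,
    # which is why the running time does not grow with string length.
    start = back_up_to_starter(data, start)
    end = back_up_to_starter(data, end)
    return data[start:end]
```

For the bytes of `"aéz"` (`b"a\xc3\xa9z"`), a cut starting at offset 2 falls on the continuation byte `0xA9`, so the sketch moves the start back to offset 1 and returns the whole `é` rather than half of it.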
#
6930ef58 |
| 30-May-2023 |
Alex Dowad |
Merge branch 'PHP-8.2' * PHP-8.2: Fix mb_strlen is wrong length for CP932 when 0x80.
|
#
8e6be143 |
| 16-May-2023 |
Alex Dowad |
Fix problem with CP949 conversion when 0xC9 precedes byte lower than 0xA1 This bug was introduced in e837a8800b. In that commit, I increased the performance of CP949 text conversion, but accidentally broke the case where 0xC9 (illegal byte to start a character) is followed by a valid character with a first byte less than 0xA1. The 'broken' behavior is that both the 0xC9 byte and the following valid character would be converted to error markers.
|
#
245daedb |
| 22-Apr-2023 |
Alex Dowad |
Move kana translation tables to mbfilter_cjk.c These (static) tables were defined in a header file, which was included in two different .c files. That will result in two copies of the tables being included in the PHP binary. But the tables were only used in one of the two .c files. Move it where it is used to avoid needlessly bloating the binary. (I checked in a hex editor and confirmed that while the previous binary contained two copies of these tables, it now only contains one.)
|
#
175154db |
| 18-Apr-2023 |
Alex Dowad |
Optimize conversion of CP932 text to Unicode Conversion of CP932 text to UTF-8 using `mb_convert_encoding` is now about 20% faster than before. |
#
73633bf1 |
| 18-Apr-2023 |
Alex Dowad |
Optimize conversion of SJIS-2004 text to Unicode Conversion of SJIS-2004 text to UTF-8 using `mb_convert_encoding` is now about 60% faster than before. (Many other mbstring functions will also be faster now on SJIS-2004 text.)
|
#
c717c79a |
| 14-Apr-2023 |
Alex Dowad |
Combine CJK encoding conversion code in a single source file This will make it easier to combine duplicated code between all the CJK text encodings (a significant amount is already combined in this commit, such as the repeated definitions of SJIS_DECODE and SJIS_ENCODE), but I hope to remove even more redundancy in the future. The table used to implement mb_strlen for CP932 has been changed to the same table as "SJIS-win".
|