History log of /php-src/ext/mbstring/libmbfl/filters/mbfilter_cjk.c (Results 1 – 10 of 10)
Revision Date Author Comments
# 5fdb2724 22-Dec-2023 Alex Dowad

Add mbstring support for GB18030-2022 text encoding

The previous version of the GB-18030 standard was published in 2005.
This commit adds support for the updated (2022) version of this text
encoding. The existing GB18030 implementation has been left unchanged
for backwards compatibility; users who want to use the new standard
must explicitly indicate the desired text encoding is 'GB18030-2022'.

The document which defines GB18030-2022, published by the government
of the People's Republic of China, defines three levels of standards
compliance. This implementation is intended to achieve Implementation
Level 3, which is the highest level of compliance.

Experts in the GB18030 standard are requested to assess this
implementation and report any deviation from the standard.


# cffdeb81 14-Dec-2023 Alex Dowad

Add specialized implementation of mb_strcut for GB18030

For GB18030, it is not generally possible to identify character
boundaries without scanning through the entire string. Therefore,
implement mb_strcut using a strategy similar to the mblen_table-based
implementation in mbstring.c. The difference is that for GB18030, we
need to look at two leading bytes to determine the byte length of a
multi-byte character.
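
To illustrate the point about two leading bytes, here is a minimal sketch in C. It is not the code from this commit: the helper names `gb18030_char_len` and `gb18030_align_down` are hypothetical, and trail bytes are not validated. It relies only on the GB18030 byte layout (lead byte 0x81-0xFE, second byte 0x30-0x39 for four-byte characters) and scans forward from the start of the string to find the character boundary at or before a requested offset.

```c
#include <stddef.h>

/* Hypothetical helper: byte length of the GB18030 character starting at p.
 * A lead byte in 0x81..0xFE followed by a digit byte (0x30..0x39) starts a
 * four-byte character; any other second byte is treated as a two-byte
 * character. Trail bytes are NOT validated in this sketch. */
static size_t gb18030_char_len(const unsigned char *p, const unsigned char *end)
{
    size_t remaining = (size_t)(end - p);
    if (remaining == 0) {
        return 0;
    }
    if (p[0] < 0x81 || p[0] == 0xFF) {
        return 1; /* ASCII, or a byte which cannot start a multi-byte character */
    }
    if (remaining < 2) {
        return 1; /* truncated multi-byte character */
    }
    if (p[1] >= 0x30 && p[1] <= 0x39) {
        return remaining >= 4 ? 4 : remaining; /* four-byte form */
    }
    return 2; /* two-byte form */
}

/* Hypothetical helper: largest character boundary <= offset, found by
 * walking the string from its first byte. */
static size_t gb18030_align_down(const unsigned char *s, size_t len, size_t offset)
{
    size_t pos = 0;
    while (pos < offset && pos < len) {
        size_t n = gb18030_char_len(s + pos, s + len);
        if (pos + n > offset) {
            break; /* the next character would straddle the offset */
        }
        pos += n;
    }
    return pos;
}
```

The forward scan is needed because a byte such as 0x30 or 0x81 can be either a lead byte or a trail byte depending on what precedes it.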

The new implementation is 4-5x faster for short strings, and more than
10x faster for long strings. (Part of the reason this new code has
such a great performance advantage is that it is replacing code
based on the older text conversion filters provided by libmbfl, which
were quite slow.)

The behavior is the same as before for valid GB18030 strings; for
some invalid strings, mb_strcut will choose different 'cut' points
as compared to before. (Clang's libFuzzer was used to compare the
old and new implementations, searching for test cases where they had
different behavior; no such cases were found.)


# 91279cfd 18-Nov-2023 Niels Dossche <7771979+nielsdos@users.noreply.github.com>

Use binary search for cp932ext*_ucs_table lookups (#12712)

* Use binary search for cp932ext*_ucs_table lookups

A large amount of time is spent doing a linear search through these
tables in the CP932 encoding. Instead of that, we can add sorted
versions of these tables that also store the index of the non-sorted
version and perform a binary search on those sorted versions.

This reduces the time spent from 1.54s to 0.91s for the reference
benchmark [1].

[1] https://github.com/php/php-src/issues/12684#issuecomment-1813799924

* Fix search bounds
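
The sorted-table approach described above can be sketched in C as follows. This is an illustration under assumptions, not the code from the commit: the `sorted_entry` type and `lookup_index` function are hypothetical names, and the real table layout may differ. Each entry pairs a code point with its position in the original, unsorted table, so a binary search over the sorted copy recovers the same index a linear search would have found.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry type: a value from the original table plus the index
 * it had in that (unsorted) table. Entries are sorted by value. */
typedef struct {
    uint16_t value;
    uint16_t index;
} sorted_entry;

/* Binary search over the half-open range [lo, hi). Returns the index in the
 * original table, or -1 if the value is not present. */
static int lookup_index(const sorted_entry *tbl, size_t n, uint16_t value)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tbl[mid].value < value) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if (lo < n && tbl[lo].value == value) {
        return tbl[lo].index;
    }
    return -1;
}
```

Using a half-open range keeps the bounds arithmetic simple. If the original table contains duplicate values, sorting ties by original index makes this lower-bound search return the same entry a front-to-back linear scan would.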


# 1f0cf133 30-Sep-2023 Alex Dowad

Add fast mb_strcut implementation for UTF-8

The old implementation runs through the entire string to pick out the
part which should be returned by mb_strcut. This creates significant
performance overhead. The new specialized implementation of mb_strcut
for UTF-8 usually only examines a few bytes around the starting and
ending cut points, meaning it generally runs in constant time.

For UTF-8 strings just a few bytes long, the new implementation is
around 10% faster (according to microbenchmarks which I ran locally).
For strings around 10,000 bytes in length, it is 50-300x faster.
(Yes, that is 300x and not 300%.)

The new implementation behaves identically to the old one on VALID
UTF-8 strings; a fuzzer was used to help ensure this is the case.
On invalid UTF-8 strings, there is a difference: in some cases, the
old implementation will pass invalid byte sequences through unchanged,
while in others it will remove them. The new implementation has
behavior which is perhaps slightly more predictable: it simply backs
up the starting and ending cut points to the preceding "starter
byte" (one which is not a UTF-8 continuation byte).
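
The "back up to the preceding starter byte" rule can be sketched in C like this. It is a simplified illustration, not the commit's code: `utf8_back_up` is a hypothetical name, and the sketch does not cap how far it backs up or handle any other edge cases the real implementation may treat specially.

```c
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_utf8_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical helper: move a cut point backwards until it rests on a
 * "starter byte" (any byte which is not a continuation byte). A cut point
 * at or beyond the end of the string is clamped to the string length. */
static size_t utf8_back_up(const unsigned char *s, size_t len, size_t pos)
{
    if (pos >= len) {
        return len;
    }
    while (pos > 0 && is_utf8_continuation(s[pos])) {
        pos--;
    }
    return pos;
}
```

Because valid UTF-8 characters are at most four bytes long, this loop inspects only a few bytes around the cut point on valid input, which is why the work does not grow with the length of the string.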


# 6930ef58 30-May-2023 Alex Dowad

Merge branch 'PHP-8.2'

* PHP-8.2:
Fix mb_strlen is wrong length for CP932 when 0x80.


# 8e6be143 16-May-2023 Alex Dowad

Fix problem with CP949 conversion when 0xC9 precedes byte lower than 0xA1

This bug was introduced in e837a8800b. In that commit, I increased the
performance of CP949 text conversion, but accidentally broke the case
where 0xC9 (illegal byte to start a character) is followed by a valid
character with a first byte less than 0xA1. The 'broken' behavior is
that both the 0xC9 byte and the following valid character would be
converted to error markers.

# 245daedb 22-Apr-2023 Alex Dowad

Move kana translation tables to mbfilter_cjk.c

These (static) tables were defined in a header file, which was included
in two different .c files. That will result in two copies of the tables
being included in the PHP binary.

But the tables were only used in one of the two .c files. Move them to
where they are used to avoid needlessly bloating the binary. (I checked in a
hex editor and confirmed that while the previous binary contained two
copies of these tables, it now only contains one.)

# 175154db 18-Apr-2023 Alex Dowad

Optimize conversion of CP932 text to Unicode

Conversion of CP932 text to UTF-8 using `mb_convert_encoding` is
now about 20% faster than before.

# 73633bf1 18-Apr-2023 Alex Dowad

Optimize conversion of SJIS-2004 text to Unicode

Conversion of SJIS-2004 text to UTF-8 using `mb_convert_encoding` is
now about 60% faster than before. (Many other mbstring functions will
also be faster now on SJIS-2004 text.)

# c717c79a 14-Apr-2023 Alex Dowad

Combine CJK encoding conversion code in a single source file

This will make it easier to combine duplicated code between all the
CJK text encodings (a significant amount is already combined in this
commit, such as the repeated definitions of SJIS_DECODE and
SJIS_ENCODE), but I hope to remove even more redundancy in the future.

The table used to implement mb_strlen for CP932 has been changed to
the same table as "SJIS-win".