History log of /PHP-8.2/ext/mbstring/tests/iso2022kr_encoding.phpt (Results 1 – 6 of 6)
Revision Date Author Comments
# d9269bec 09-Aug-2022 Alex Dowad

Fix problems with ISO-2022-KR conversion

• The legacy conversion code did not emit an error marker if an
escape sequence was truncated.

• BOTH old and new conversion code would shift from KSC5601
(KS X 1001) mode to ASCII mode on an invalid escape sequence.
This doesn't make any sense.


# a7890885 26-Apr-2022 Alex Dowad

Add more tests for mbstring encoding conversion

When testing the preceding commits, I used a script to generate a large
number of random strings and try to find strings which would yield
different outputs from the new and old encoding conversion code.
Some were found. In most cases, analysis revealed that the new code
was correct and the old code was not.

In all cases where the new code was incorrect, regression tests were
added. However, there may be some value in adding regression tests
for cases where the old code was incorrect as well. That is done here.

This does not cover every case where the new and old code yielded
different results. Some of them were very obscure, and it is proving
difficult even to reproduce them (since I did not keep a record of
all the input strings which triggered the differing output).
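The differential approach described in this commit message can be sketched as follows. This is a minimal illustration, not the author's script (which the log does not show); `old_convert` and `new_convert` are hypothetical stand-ins for the two implementations under comparison. Each mismatch is stored together with its input, so it can later be reproduced and turned into a regression test.

```python
import random


def random_bytes(rng, max_len=8):
    """Return a random byte string of length 0..max_len."""
    length = rng.randrange(max_len + 1)
    return bytes(rng.randrange(256) for _ in range(length))


def find_mismatches(old_convert, new_convert, trials=1000, seed=0):
    """Feed identical random inputs to both converters and collect disagreements.

    `old_convert` and `new_convert` are hypothetical callables taking bytes.
    A fixed seed makes every run reproducible.
    """
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        data = random_bytes(rng)
        old_out = old_convert(data)
        new_out = new_convert(data)
        if old_out != new_out:
            # Keep the triggering input alongside both outputs.
            mismatches.append((data, old_out, new_out))
    return mismatches
```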


# c9479899 28-Dec-2021 Alex Dowad

Implement fast text conversion interface for ISO-2022-KR

When working on this, I read RFC 1557 again and realized that the
comment at the top of the file was totally mistaken. Further, the
legacy code did not obey the RFC. (It would emit the "ESC $ ) C"
sequence anywhere, not just at the beginning of a line as the RFC
requires.)

The new code obeys the RFC; one quirk is that it always emits the
escape sequence at the beginning of each output string, even if the
string is completely ASCII (in which case the escape sequence is
allowed, but not required).

The new code doesn't always generate the same number of error markers
for invalid escapes as the old code did.

The old code could not emit the special KDDI emoji for national flags.

Further, there was a bug in the test which the old code used to
determine whether a 0x0F (SI) byte should be emitted at the end of a string
(to switch back to ASCII mode). As a result, it would not always switch
back to ASCII mode, meaning that it was not always safe to concatenate
the resulting strings.
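The framing rules described in this commit message can be sketched in a few lines. This is an illustration under stated assumptions, not the mbstring implementation: `encode_iso2022kr` is a hypothetical name, Python's `euc_kr` codec stands in for the KS X 1001 conversion table, and control characters in the input are passed through without checks.

```python
ESC_HEADER = b"\x1b$)C"  # "ESC $ ) C": designates KS X 1001 (KS C 5601)
SO = b"\x0e"             # Shift Out: switch to KS X 1001 mode
SI = b"\x0f"             # Shift In: switch back to ASCII mode


def encode_iso2022kr(text):
    """Encode `text` as ISO-2022-KR (hypothetical helper, simplified).

    The header is always emitted first, even for pure-ASCII input.
    Raises UnicodeEncodeError for characters the euc_kr codec cannot encode.
    """
    out = bytearray(ESC_HEADER)
    in_ksc = False
    for ch in text:
        if ord(ch) < 0x80:
            if in_ksc:
                out += SI
                in_ksc = False
            out += ch.encode("ascii")
        else:
            if not in_ksc:
                out += SO
                in_ksc = True
            # EUC-KR sets the high bit on each KS X 1001 byte; clearing it
            # yields the 7-bit form used by ISO-2022-KR.
            out += bytes(b & 0x7F for b in ch.encode("euc_kr"))
    if in_ksc:
        # End in ASCII mode so that outputs can be concatenated safely.
        out += SI
    return bytes(out)
```

Because every output ends in ASCII mode, a string appended after it will not be misread as KS X 1001 data, which is the concatenation problem the old end-of-string check could cause.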


# 776296e1 30-Aug-2021 Alex Dowad

mbstring no longer provides 'long' substitutions for erroneous input bytes

Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.

However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyway.

More to the point, a search of publicly available PHP code indicates
that nobody is really using this feature anyway.

Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.

However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.

With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.

One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.

Therefore, amend all these functions to return 0 for success.
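The clash between the error value and the abort marker can be shown with a toy model. The sketch below is not the mbstring C API; all names are hypothetical, and only the values -1 and 0 come from the commit message.

```python
ERROR_MARKER = -1  # single error value, passed along as if it were a codepoint
ABORT = -1         # return value meaning "abort the conversion immediately"
SUCCESS = 0        # return value meaning "continue" under the new convention


def output_filter_old(codepoint, sink):
    """Old convention: echo the received value back to signal success."""
    sink.append(codepoint)
    return codepoint


def output_filter_new(codepoint, sink):
    """New convention: always return 0 on success."""
    sink.append(codepoint)
    return SUCCESS


def convert(codepoints, output_filter):
    """Push each codepoint through `output_filter`; stop if it returns ABORT."""
    sink = []
    for cp in codepoints:
        if output_filter(cp, sink) == ABORT:
            return sink, False  # conversion was cut short
    return sink, True
```

With the old convention, `convert([0x41, ERROR_MARKER, 0x42], output_filter_old)` stops after the error marker because the echoed -1 is read as an abort. With `output_filter_new`, all three values reach the sink.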


# 51b9d7a5 27-Jul-2021 Alex Dowad

Test behavior of 'long' illegal character markers

After mb_substitute_character("long"), mbstring will respond to
erroneous input by inserting 'long' error markers into the output.
Depending on the situation, these error markers will either look like
BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it
converts to Unicode codepoints which cannot be represented in the
output encoding), or an encoding-specific marker like JISX+XXXX or
W932+XXXX.

We have almost no tests for this feature. Add a bunch of tests to
ensure that all our legacy encoding handlers work in a reasonable
way when 'long' error markers are enabled.


# b626e893 03-Jul-2021 Alex Dowad

Fix conversion of ISO-2022-KR text (and add test suite)

- Truncated multi-byte characters are treated as an error
- Truncated or unrecognized escape sequences are treated as an error
- ASCII control characters are not allowed to appear in the middle
of a multi-byte character
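The three error rules above can be sketched as a small decoder. This is an illustration under stated assumptions, not the mbstring code: `decode_iso2022kr` is a hypothetical name, U+FFFD stands in for mbstring's substitution character, Python's `euc_kr` codec stands in for the KS X 1001 table, and the choice of how many bytes to skip after an error is a simplification. The header is accepted anywhere, not only at the start of a line. An invalid escape leaves the current mode unchanged.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F
HEADER_TAIL = b"$)C"  # the bytes that must follow ESC
BAD = "\ufffd"        # stand-in error marker (an assumption of this sketch)


def decode_iso2022kr(data):
    """Decode ISO-2022-KR bytes to str, emitting BAD for each error."""
    out = []
    in_ksc = False
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b == ESC:
            # Count how many bytes of "$)C" actually follow the ESC.
            j = 0
            while j < 3 and i + 1 + j < n and data[i + 1 + j] == HEADER_TAIL[j]:
                j += 1
            if j < 3:
                out.append(BAD)  # truncated or unrecognized escape
            i += 1 + j           # the shift mode is left as it was
        elif b == SO:
            in_ksc = True
            i += 1
        elif b == SI:
            in_ksc = False
            i += 1
        elif in_ksc and 0x21 <= b <= 0x7E:
            if i + 1 < n and 0x21 <= data[i + 1] <= 0x7E:
                try:
                    # Set the high bits to get the EUC-KR form of the pair.
                    out.append(bytes([b | 0x80, data[i + 1] | 0x80]).decode("euc_kr"))
                except UnicodeDecodeError:
                    out.append(BAD)  # unassigned KS X 1001 code
                i += 2
            else:
                # Truncated character, or a control byte in the middle of one.
                out.append(BAD)
                i += 1
        elif b < 0x80:
            out.append(chr(b))
            i += 1
        else:
            out.append(BAD)  # 8-bit bytes are not valid in ISO-2022-KR
            i += 1
    return "".join(out)
```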
