gh9535b.phpt - OpenGrok history log for /php-src/ext/mbstring/tests/gh9535b.phpt

Revision	Date	Author	Comments
# c211e67b	29-Mar-2023	Alex Dowad	Remove XFAIL from test cases for mb_strcut when used with JIS or ISO-2022-JP encoding
# fa0401b0	16-Sep-2022	NathanFreeman <1056159381@qq.com>	Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1)

# c211e67b

29-Mar-2023

Alex Dowad

Remove XFAIL from test cases for mb_strcut when used with JIS or ISO-2022-JP encoding

The documentation for mb_strcut states:

mb_strcut(
    string $string,
    int $start,
    ?int $length = null,
    ?string $encoding = null
): string

mb_strcut() extracts a substring from a string similarly to mb_substr(),
but operates on bytes instead of characters. If the cut position happens
to be between two bytes of a multi-byte character, the cut is performed
starting from the first byte of that character.
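
The boundary adjustment described above can be sketched in a few lines of Python. This is only an illustration for UTF-8, not mbstring's implementation: the helper name strcut_utf8 is hypothetical, and the sketch assumes that $length acts as a cap on the size of the returned string.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def strcut_utf8(data: bytes, start: int, length: int) -> bytes:
    """Byte-oriented cut that never splits a UTF-8 character (hypothetical helper)."""
    start = max(0, min(start, len(data)))
    # If the start offset falls inside a character, move back to its first byte.
    while 0 < start < len(data) and is_continuation(data[start]):
        start -= 1
    # Assumption: length caps the output, so the end is measured from the adjusted start.
    end = min(len(data), start + max(0, length))
    # If the end offset falls inside a character, drop that partial character.
    while start < end < len(data) and is_continuation(data[end]):
        end -= 1
    return data[start:end]


# "a" is 1 byte; each kana is 3 bytes, at offsets 1-3, 4-6 and 7-9.
data = "aあいう".encode("utf-8")
cut = strcut_utf8(data, 2, 5)  # start offset 2 is inside "あ"
```

Here the start offset is moved back from 2 to 1, and the 5-byte budget leaves room for only one complete 3-byte character, so cut holds the bytes of "あ".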

My understanding of the $length parameter for mb_strcut was that it
specifies the range of bytes to extract from $string, and that all
characters encoded by those bytes should be included in the returned
string, even if that means the returned string would be longer than
$length bytes. This can happen either if 1) there is more than one way
to encode the same character in $encoding, and one way requires more
bytes than the other, or 2) $encoding uses escape sequences.

However, discussion with users of mb_strcut indicates that many of them
interpret $length as the maximum length of the *returned* string.
This is also the historical behavior of the function.

Hence, there is no need to change the behavior of mb_strcut before
removing XFAIL from these test cases. We can keep the current
behavior.

# fa0401b0

16-Sep-2022

NathanFreeman <1056159381@qq.com>

Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1)

The existing implementation of mb_strcut extracts part of a
multi-byte encoded string by pulling out raw bytes and then running
them through a conversion filter to ensure that the output is valid
in the requested encoding.

If the conversion filter emits error markers when doing the final
'flush' operation which ends the conversion of the extracted bytes,
these error markers may (in some cases) be included in the output.
The conversion operation does not respect the value of
mb_substitute_character; rather, it always uses '?' as an error marker.
So this issue manifests itself as unwanted '?' characters being
inserted into the output.

This issue has existed for a long time, but became noticeable in PHP
8.1 because, for at least some of the supported text encodings, mbstring
is now stricter about emitting error markers when strings end in an
illegal state.

The simplest fix is to suppress error markers during the final flush
operation.
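
As a rough analogy only, the same effect can be reproduced with Python's incremental UTF-8 decoder. This is not mbstring's conversion filter, and the helper decode_with_final_flush is hypothetical. Python's marker is U+FFFD rather than '?', but the pattern matches the description above: the marker appears only when the decoder is flushed on input that ends mid-character, and choosing a policy that drops it gives clean output.

```python
import codecs

# "あ" (3 bytes) followed by only the first 2 bytes of "い": the input ends mid-character.
truncated = "あい".encode("utf-8")[:5]


def decode_with_final_flush(data: bytes, errors: str) -> tuple:
    """Return (text produced while consuming data, text produced by the final flush)."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    body = decoder.decode(data)             # the incomplete tail is buffered, not reported
    tail = decoder.decode(b"", final=True)  # the flush is where the error surfaces
    return body, tail


with_marker = decode_with_final_flush(truncated, "replace")
without_marker = decode_with_final_flush(truncated, "ignore")
```

With "replace" the flush step contributes a replacement marker after "あ"; with "ignore" the flush contributes nothing, which corresponds to suppressing error markers during the final flush.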

While working on a fix for this problem, another problem with mb_strcut
was discovered; since it decides when to stop consuming bytes from
the input by looking at the byte length of its OUTPUT, anything which
causes extra bytes to be emitted to the output may cause mb_strcut to
not consume all the bytes in the requested range.

The one case where we DO emit extra output bytes is for encodings
which have a selectable mode, like ISO-2022-JP; if a string in such
an encoding ends in a mode which is not the default, we emit an ending
escape sequence which changes back to the default mode. This is done
so that concatenating strings in such encodings is safe.

However, as mentioned, this can cause the output of mb_strcut to be
shorter than it logically should be. This bug has existed for a long
time, and fixing it now would be a BC break, so we may not fix it right
away.
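
To make the size of the effect concrete, here is a small Python sketch that treats the byte count as a limit on the complete output, escape sequences included. The function longest_prefix_within is a hypothetical helper built on Python's iso2022_jp codec; it is not the algorithm mb_strcut uses.

```python
def longest_prefix_within(text: str, max_bytes: int, encoding: str = "iso2022_jp") -> str:
    """Longest prefix of text whose complete encoding fits in max_bytes (hypothetical helper)."""
    best = ""
    for end in range(1, len(text) + 1):
        # The encoded prefix includes any escape sequences, also the closing one.
        if len(text[:end].encode(encoding)) > max_bytes:
            break
        best = text[:end]
    return best


text = "日本語"  # 3 kanji: 6 payload bytes plus two 3-byte escape sequences
full_length = len(text.encode("iso2022_jp"))

fits_in_6 = longest_prefix_within(text, 6)
fits_in_8 = longest_prefix_within(text, 8)
fits_in_10 = longest_prefix_within(text, 10)
```

A 6-byte budget matches the raw payload of all three kanji, yet not even one character fits once the escape sequences are counted: a single kanji already needs 8 bytes (3 + 2 + 3).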

Therefore, tests for THIS fix which don't pass because of that OTHER
bug have been split out into a separate test file (gh9535b.phpt), and
that file has been marked XFAIL.