History log of /php-src/ext/mbstring/tests/gh9535b.phpt (Results 1 – 2 of 2)
Revision Date Author Comments
# c211e67b 29-Mar-2023 Alex Dowad

Remove XFAIL from test cases for mb_strcut when used with JIS or ISO-2022-JP encoding

The documentation for mb_strcut states:

mb_strcut(
string $string,

Remove XFAIL from test cases for mb_strcut when used with JIS or ISO-2022-JP encoding

The documentation for mb_strcut states:

mb_strcut(
string $string,
int $start,
?int $length = null,
?string $encoding = null
): string

mb_strcut() extracts a substring from a string similarly to mb_substr(),
but operates on bytes instead of characters. If the cut position happens
to be between two bytes of a multi-byte character, the cut is performed
starting from the first byte of that character.

My understanding of the $length parameter for mb_strcut is that it
specified the range of bytes to extract from $string, and that all
characters encoded by those bytes should be included in the returned
string, even if that means the returned string would be longer than
$length bytes. This can happen either if 1) there is more than one way
to encode the same character in $encoding, and one way requires more
bytes than the other, or 2) $encoding uses escape sequences.

However, discussion with users of mb_strcut indicates that many of them
interpret $length as the maximum length of the *returned* string.
This is also the historical behavior of the function.

Hence, there is no need to modify the behavior of mb_strcut and then
remove XFAIL from these test cases afterwards. We can keep the current
behavior.

show more ...


# fa0401b0 16-Sep-2022 NathanFreeman <1056159381@qq.com>

Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1)

The existing implementation of mb_strcut extracts part of a
multi-byte encoded string by pulling out raw bytes and then

Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1)

The existing implementation of mb_strcut extracts part of a
multi-byte encoded string by pulling out raw bytes and then running
them through a conversion filter to ensure that the output is valid
in the requested encoding.

If the conversion filter emits error markers when doing the final
'flush' operation which ends the conversion of the extracted bytes,
these error markers may (in some cases) be included in the output.
The conversion operation does not respect the value of
mb_substitute_character; rather, it always uses '?' as an error marker.
So this issue manifests itself as unwanted '?' characters being
inserted into the output.

This issue has existed for a long time, but became noticeable in PHP
8.1 because for at least some of the supported text encodings, mbstring
is now more strict about emitting error markers when strings end in an
illegal state.

The simplest fix is to suppress error markers during the final flush
operation.

While working on a fix for this problem, another problem with mb_strcut
was discovered; since it decides when to stop consuming bytes from
the input by looking at the byte length of its OUTPUT, anything which
causes extra bytes to be emitted to the output may cause mb_strcut to
not consume all the bytes in the requested range.

The one case where we DO emit extra output bytes is for encodings
which have a selectable mode, like ISO-2022-JP; if a string in such
an encoding ends in a mode which is not the default, we emit an ending
escape sequence which changes back to the default mode. This is done
so that concatenating strings in such encodings is safe.

However, as mentioned, this can cause the output of mb_strcut to be
shorter than it logically should be. This bug has existed for a long
time, and fixing it now will be a BC break, so we may not fix it right
away.

Therefore, tests for THIS fix which don't pass because of that OTHER
bug have been split out into a separate test file (gh9535b.phpt), and
that file has been marked XFAIL.

show more ...