#
d3933e0b |
| 14-Nov-2022 |
Alex Dowad |
Fix regression test for GH-9535 on PHP-8.2+

Some of the legacy text encodings which were used in this regression test are deprecated in PHP-8.2+. The deprecation warnings break the expected output. Since using these encodings in mbstring is now deprecated, I think there is little point in keeping them in this test. So they are now removed from it.

Further, in 219fff376b, I made a change to avoid a situation where the legacy UTF7-IMAP conversion code gets stuck in a wrong state when its attempt to emit a character fails. When a Base64-encoded section of input ended with -, the previous code would FIRST emit a character if necessary (using the CK or "check" macro, which causes the function to return immediately if the downstream filter function returns an error code), and THEN update its own state to indicate that it is now in ASCII rather than Base64 mode. If the downstream filter function returned an error code, the CK macro would then cause the UTF7-IMAP filter function to return immediately WITHOUT setting its own state to indicate that the Base64-encoded section was done. I fixed this by updating the filter state as needed BEFORE calling CK... but I missed updating the filter state in the case where the Base64 section ends normally and there is no need to emit anything.

Again, in 6d525a425e, I modified the legacy conversion code for ISO-2022-KR to try to comply more closely with the RFC for this text encoding. The RFC states that before any occurrence of 'Shift In' or 'Shift Out' codes in an ISO-2022-KR string, a special escape sequence must appear at least ONCE, at the beginning of a line. The previous code did not comply with this requirement. I made it comply by always emitting this escape sequence at the beginning of the first line. Since mb_strcut (wrongly) determines when it has consumed enough of the input string by looking at the length of its output in bytes, this extra escape sequence makes mb_strcut consume 4 fewer bytes of an ISO-2022-KR string than would otherwise be the case. When this strange behavior of mb_strcut is fixed, this test will have to be adjusted to restore the previous expected outputs for ISO-2022-KR.
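
The UTF7-IMAP state-ordering problem described in this commit message can be sketched roughly as follows. This is only an illustration, not the actual mbstring code: `toy_filter`, `end_base64_buggy`, `end_base64_fixed`, and the two output callbacks are hypothetical names, and the `CK` macro here is a simplified stand-in that assumes a negative return value means failure.

```c
/* Simplified stand-in for the "check" macro: return at once if the
 * downstream call reports an error (assumed here to be a negative value). */
#define CK(statement) do { if ((statement) < 0) return -1; } while (0)

enum toy_mode { MODE_ASCII, MODE_BASE64 };

/* Hypothetical filter state; not the real mbstring structure. */
struct toy_filter {
    enum toy_mode mode;     /* which section of the input we are in */
    int pending;            /* nonzero if a buffered character must be emitted */
    int (*output)(int c);   /* downstream filter function */
};

/* Downstream functions used to exercise both paths. */
static int failing_output(int c) { (void)c; return -1; }
static int ok_output(int c) { (void)c; return 0; }

/* Problematic ordering: emit first, update state afterwards.
 * If output() fails, CK returns early and mode stays MODE_BASE64. */
static int end_base64_buggy(struct toy_filter *f)
{
    if (f->pending) {
        CK(f->output('x'));
        f->pending = 0;
    }
    f->mode = MODE_ASCII;
    return 0;
}

/* Safer ordering: update state on every path, then emit.
 * The state is correct whether or not anything needs to be emitted,
 * and whether or not output() fails. */
static int end_base64_fixed(struct toy_filter *f)
{
    int had_pending = f->pending;
    f->pending = 0;
    f->mode = MODE_ASCII;
    if (had_pending) {
        CK(f->output('x'));
    }
    return 0;
}
```

With a failing downstream function, the first variant leaves the filter in Base64 mode, while the second leaves it in ASCII mode, including on the path where nothing has to be emitted.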
|
#
fa0401b0 |
| 16-Sep-2022 |
NathanFreeman <1056159381@qq.com> |
Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1)

The existing implementation of mb_strcut extracts part of a multi-byte encoded string by pulling out raw bytes and then running them through a conversion filter to ensure that the output is valid in the requested encoding. If the conversion filter emits error markers when doing the final 'flush' operation which ends the conversion of the extracted bytes, these error markers may (in some cases) be included in the output.

The conversion operation does not respect the value of mb_substitute_character; rather, it always uses '?' as an error marker. So this issue manifests itself as unwanted '?' characters being inserted into the output.

This issue has existed for a long time, but became noticeable in PHP 8.1 because for at least some of the supported text encodings, mbstring is now more strict about emitting error markers when strings end in an illegal state. The simplest fix is to suppress error markers during the final flush operation.

While working on a fix for this problem, another problem with mb_strcut was discovered; since it decides when to stop consuming bytes from the input by looking at the byte length of its OUTPUT, anything which causes extra bytes to be emitted to the output may cause mb_strcut to not consume all the bytes in the requested range.

The one case where we DO emit extra output bytes is for encodings which have a selectable mode, like ISO-2022-JP; if a string in such an encoding ends in a mode which is not the default, we emit an ending escape sequence which changes back to the default mode. This is done so that concatenating strings in such encodings is safe. However, as mentioned, this can cause the output of mb_strcut to be shorter than it logically should be. This bug has existed for a long time, and fixing it now will be a BC break, so we may not fix it right away.

Therefore, tests for THIS fix which don't pass because of that OTHER bug have been split out into a separate test file (gh9535b.phpt), and that file has been marked XFAIL.
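
The second problem, stopping on OUTPUT length rather than input position, can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the mb_strcut implementation: `consumed_by_output_length` is a hypothetical function, every input byte is assumed to convert to exactly one output byte, and `extra` stands for any additional bytes (such as an escape sequence) that the conversion emits.

```c
#include <stddef.h>

/* Toy model: count how many input bytes are consumed when the loop stops
 * as soon as the OUTPUT has reached `want` bytes.
 *   input_len: bytes available in the input
 *   want:      requested length in bytes
 *   extra:     additional output bytes not backed by input (assumption:
 *              emitted up front, like an escape sequence) */
static size_t consumed_by_output_length(size_t input_len, size_t want, size_t extra)
{
    size_t out_len = extra;   /* extra bytes already count against the budget */
    size_t consumed = 0;

    while (consumed < input_len && out_len < want) {
        consumed++;           /* take one input byte */
        out_len++;            /* which yields one output byte in this model */
    }
    return consumed;
}
```

In this model, requesting 10 bytes from a 20-byte input consumes 10 input bytes when nothing extra is emitted, but only 6 when 4 extra bytes are emitted, which matches the kind of shortfall described above.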
|