#
bfccdbd8 |
| 08-Aug-2022 |
Alex Dowad |
SJIS-Mobile#SOFTBANK string can end immediately after special escape sequence SJIS-Mobile#SOFTBANK text encoding supports special escape sequences, which shift the decoder into a mode wh
SJIS-Mobile#SOFTBANK string can end immediately after special escape sequence SJIS-Mobile#SOFTBANK text encoding supports special escape sequences, which shift the decoder into a mode where each single byte represents an emoji. To get out of this mode, a 0xF (SHIFT OUT) byte can be used. After one of these special escape sequences, the new conversion code expected to see at least one more byte. However, there doesn't seem to be any particular reason why it should be treated as an error condition if a string ends abruptly after one of these escapes. Well, the escape sequence is useless in that case, but it is a complete and valid escape sequence. The legacy conversion code did allow a string to end immediately after one of these escape sequences. Amend the new code to allow the same.
show more ...
|
#
7559bf77 |
| 24-Jun-2022 |
Alex Dowad |
Fix new conversion filters for mobile SJIS variants ('0' at end of buffer) Previously, I had adjusted this code so that if a character which could be part of a special Docomo/Softbank/KD
Fix new conversion filters for mobile SJIS variants ('0' at end of buffer) Previously, I had adjusted this code so that if a character which could be part of a special Docomo/Softbank/KDDI 'keypad' emoji appeared at the end of one buffer, and the 'keypad' character appeared at the beginning of the next, they would still be combined. However, this broke the handling of such a character appearing at the end of one buffer, and a character which is NOT 'keypad' appearing at the beginning of the next. This was found while fuzzing the new implementation of mb_decode_numericentity.
show more ...
|
#
67d83f57 |
| 23-Jan-2022 |
Alex Dowad |
Implement fast text conversion interface for mobile SJIS variants
|
#
bf940a13 |
| 03-Sep-2021 |
Alex Dowad |
Add another test for SJIS-Mobile text conversion
|
#
776296e1 |
| 30-Aug-2021 |
Alex Dowad |
mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like
mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyways. More to the point, a search of publically available PHP code indicates that nobody is really using this feature anyways. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1. One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.
show more ...
|
#
e6f1a722 |
| 08-Aug-2021 |
Alex Dowad |
Add test suite for mobile variants of UTF-8 (and fix bugs)
|
#
51b9d7a5 |
| 27-Jul-2021 |
Alex Dowad |
Test behavior of 'long' illegal character markers After mb_substitute_character("long"), mbstring will respond to erroneous input by inserting 'long' error markers into the output. D
Test behavior of 'long' illegal character markers After mb_substitute_character("long"), mbstring will respond to erroneous input by inserting 'long' error markers into the output. Depending on the situation, these error markers will either look like BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it converts to Unicode codepoints which cannot be represented in the output encoding), or an encoding-specific marker like JISX+XXXX or W932+XXXX. We have almost no tests for this feature. Add a bunch of tests to ensure that all our legacy encoding handlers work in a reasonable way when 'long' error markers are enabled.
show more ...
|
#
39131219 |
| 11-Jun-2021 |
Nikita Popov |
Migrate more SKIPIF -> EXTENSIONS (#7139) This is a mix of more automated and manual migration. It should remove all applicable extension_loaded() checks outside of skipif.inc files.
|
#
beef5971 |
| 20-Oct-2020 |
Alex Dowad |
Fix mbstring support for SJIS-Mobile (DoCoMo, KDDI, and Softbank variants of Shift-JIS) Lots of problems here. - Don't pass 'control' characters through silently in the middle of a
Fix mbstring support for SJIS-Mobile (DoCoMo, KDDI, and Softbank variants of Shift-JIS) Lots of problems here. - Don't pass 'control' characters through silently in the middle of a multi-byte character. - Treat it as an error if a multi-byte character is truncated. - For ESC sequences used to encode emoji on earlier Softbank phones, if an invalid ESC sequence is found, don't pass it through. Rather, handle it as an error and respect `mb_substitute_character`. - In ranges used by mobile vendors for emoji, if a certain byte sequence doesn't map to any emoji, don't emit a mangled value (actually a raw (ku*94)+ten value, which may not even be a valid Unicode codepoint at all). - When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall in the 2nd range of MicroSoft vendor extensions. Some vendor-specific emoji have been mapped to standard Unicode codepoints now, rather than 'private use area' codepoints. When the legacy code was written, these codepoints may not have existed yet in the Unicode standard which was current at that time. Also do a major code cleanup -- remove dead code, rearrange what is left, use some new macros and helper functions to make the code clearer...
show more ...
|