#
b721d0f7 |
| 10-Mar-2023 |
pakutoma |
Fix GH-10648: add check function pointer into mbfl_encoding Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends with input required by the encoding still missing, or if it contains a character that is invalid in the encoding but can still be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS. (The same change has already been made to PHP 8.2 and 8.3; see 6fc8d014df. This commit is backporting the change to PHP 8.1.)
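The split between lenient conversion and strict validation described in this commit can be illustrated with a minimal sketch. Everything here is hypothetical: `demo_encoding`, the toy fixed-width encoding "TOY-16", and the function names are illustrative stand-ins, not mbstring's actual `struct mbfl_encoding` or `mb_check_fn` signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an encoding descriptor. */
typedef struct demo_encoding {
    const char *name;
    /* Lenient: converts what it can and returns the number of codepoints written. */
    size_t (*convert)(const unsigned char *in, size_t len, unsigned int *out);
    /* Strict: returns true only if the input is fully valid in this encoding. */
    bool (*check)(const unsigned char *in, size_t len);
} demo_encoding;

/* Toy encoding: every character is exactly two bytes, big-endian. */
static size_t toy_convert(const unsigned char *in, size_t len, unsigned int *out)
{
    size_t n = 0;
    size_t i = 0;
    for (; i + 1 < len; i += 2) {
        out[n++] = ((unsigned int)in[i] << 8) | in[i + 1];
    }
    if (i < len) {
        out[n++] = '?'; /* truncated trailing byte: substitute, but keep going */
    }
    return n;
}

static bool toy_check(const unsigned char *in, size_t len)
{
    (void)in;
    return len % 2 == 0; /* a dangling byte makes the string invalid */
}

const demo_encoding demo_toy = { "TOY-16", toy_convert, toy_check };

/* Validation goes through the dedicated check function, not through convert. */
bool demo_validate(const demo_encoding *enc, const unsigned char *in, size_t len)
{
    assert(enc != NULL && enc->check != NULL);
    return enc->check(in, len);
}
```

With a three-byte input, `demo_toy.convert` still produces output (one real codepoint plus a substitute), while `demo_validate` reports the string as invalid, which is the behaviour the commit message asks for.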
|
#
5812b4fe |
| 04-Aug-2022 |
Alex Dowad |
In legacy text conversion filters, reset filter state in 'flush' function Up until now, I believed that mbstring had been designed such that (legacy) text conversion filter objects should not be re-used after the 'flush' function is called to complete a text conversion operation. However, it turns out that the implementation of _php_mb_encoding_handler_ex DID re-use filter objects after flush. That means that functions which were based on _php_mb_encoding_handler_ex, including mb_parse_str and php_mb_post_handler, would break in some cases; state left over from converting one substring (perhaps a variable name) would affect the results of converting another substring (perhaps the value of the same variable), and could cause extraneous characters to get inserted into the output. All this code should be deleted soon, but fixing it helps me to avoid spurious failures when fuzzing the new/old code to look for differences in behavior. (This bug fix commit was originally applied to PHP-8.2 when fuzzing the new mbstring text conversion code to check for differences with the old code. Later, Kentaro Ohkouchi kindly reported a problem with mb_encode_mimeheader under PHP 8.1 which was caused by the same issue. Hence, this commit was backported to PHP-8.1.) Fixes GH-9683.
|
Revision tags: php-8.1.7RC1, php-8.1.4RC1, php-8.1.3, php-8.1.2RC1, php-8.1.0, php-7.3.33, php-7.3.32, php-7.3.31, php-7.3.30 |
|
#
e3f6a9fb |
| 13-Aug-2021 |
Alex Dowad |
CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119 mbstring has always had the conversion tables to support CP932 codes in ku 115-119, and the conversion code for CP5022x has an 'if' clause specifically to handle such characters... but that 'if' clause was dead code, since a guard clause earlier in the same function prevented it from accepting 2-byte characters with a starting byte of 0x93-0x97. Adjust the guard clause so that these characters can be converted as the original author apparently intended. The code which handles ku 115-119 is the part which reads: } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) { w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];
|
#
776296e1 |
| 30-Aug-2021 |
Alex Dowad |
mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyway. More to the point, a search of publicly available PHP code indicates that nobody is really using this feature anyway. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1. One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.
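The return-value clash described at the end of this commit message can be sketched as follows. The names `DEMO_BAD_INPUT`, `demo_sink`, `demo_output_old` and `demo_output_new` are hypothetical and only model the two conventions; they are not mbstring functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name for the single error marker passed between filters. */
#define DEMO_BAD_INPUT (-1)

typedef struct demo_sink {
    int chars;  /* codepoints accepted */
    int errors; /* error markers received */
    int full;   /* non-zero: the sink wants the conversion aborted */
} demo_sink;

/* Old convention (sketch): echo the received value on success, -1 to abort. */
int demo_output_old(int c, demo_sink *s)
{
    assert(s != NULL);
    if (s->full) {
        return -1;
    }
    if (c == DEMO_BAD_INPUT) {
        s->errors++;
    } else {
        s->chars++;
    }
    return c; /* echoing DEMO_BAD_INPUT looks exactly like 'abort' */
}

/* New convention (sketch): 0 for success, -1 to abort. */
int demo_output_new(int c, demo_sink *s)
{
    assert(s != NULL);
    if (s->full) {
        return -1;
    }
    if (c == DEMO_BAD_INPUT) {
        s->errors++;
    } else {
        s->chars++;
    }
    return 0;
}
```

Under the old convention a caller cannot tell "error marker handled" from "abort now" once the marker itself is -1; returning 0 for success removes the ambiguity.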
|
Revision tags: php-7.3.29 |
|
#
958ef47d |
| 08-May-2021 |
Alex Dowad |
When flushing CP5022x conversion filter, also flush next filter in chain All the mbstring encoding conversion filters do this. I missed it when adding a flush function for CP5022x. |
Revision tags: php-7.3.28, php-7.3.27 |
|
#
ebe6500a |
| 20-Jan-2021 |
Alex Dowad |
Fix error reporting bug for Unicode -> CP50220 conversion To detect errors in conversion from Unicode to another text encoding, each mbstring conversion filter object maintains a count of 'bad' characters. After a conversion operation finishes, this count is checked to see if there was any error. The problem with CP50220 was that mbstring used a chain of two conversion filter objects. The 'bad character count' would be incremented on the second object in the chain, but this didn't do anything, as only the count on the first such object is ever checked. Fix this by implementing the conversion using a single conversion filter object, rather than a chain of two. This is possible because of the recent refactoring, which pulled out the needed logic for CP50220 conversion into a helper function.
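A minimal sketch of why the chained design lost errors, assuming a simplified filter with a single counter. `demo_filter`, `bad_count`, `demo_report_bad_char` and `demo_had_error` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified conversion filter: a link to the next filter
 * in the chain plus a count of unconvertible characters. */
typedef struct demo_filter {
    struct demo_filter *next;
    size_t bad_count;
} demo_filter;

/* The filter at the END of the chain is the one that discovers it cannot
 * encode a character, so that is where the counter gets incremented. */
void demo_report_bad_char(demo_filter *first)
{
    assert(first != NULL);
    demo_filter *f = first;
    while (f->next != NULL) {
        f = f->next;
    }
    f->bad_count++;
}

/* The caller only ever inspects the FIRST filter in the chain. */
bool demo_had_error(const demo_filter *first)
{
    assert(first != NULL);
    return first->bad_count > 0;
}
```

With a two-filter chain the increment lands on the second object and `demo_had_error` stays false; with a single filter the same check sees the error, which is the shape of the fix the commit describes.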
|
#
319a3408 |
| 03-Jan-2021 |
Alex Dowad |
Simplify code for working with halfwidth/fullwidth kana conversion filter There's no need to dynamically allocate a struct to hold the 'mode' parameter; just store it directly in `filt->opaque`. Some other things were also being done in an unnecessarily roundabout way. Also, the 'copy' function for CP50220 conversion filters was *both* broken and unnecessary. Broken, because it malloc'd memory which was never freed by anything. Unnecessary, because the point of the copy is so that various algorithms can try running bytes through a conversion filter and see how many output bytes or characters result, and then back out by restoring the filters to their previous state. But here's the thing; CP50220 conversion filters don't hold cached bytes, which is the main thing which would need to be restored to a previous state.
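Storing a small integer directly in an opaque pointer field, instead of allocating a struct for it, might look like the sketch below. `demo_filter`, `demo_set_mode` and `demo_get_mode` are hypothetical; only the idea of keeping the mode in `opaque` comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified filter with an opaque per-filter slot. */
typedef struct demo_filter {
    void *opaque;
} demo_filter;

/* Store the mode value itself in the pointer-sized slot: nothing is
 * allocated, so there is nothing to free or deep-copy later. */
void demo_set_mode(demo_filter *f, int mode)
{
    assert(f != NULL);
    f->opaque = (void *)(intptr_t)mode;
}

int demo_get_mode(const demo_filter *f)
{
    assert(f != NULL);
    return (int)(intptr_t)f->opaque;
}
```

Because the slot holds a plain value, copying the filter struct by assignment also copies the mode, which is one way to see why a dedicated 'copy' function becomes unnecessary.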
|
#
636251a5 |
| 03-Jan-2021 |
Alex Dowad |
Remove useless function mbfl_filt_tl_jisx0201_jisx0208_init This constructor function doesn't do anything different than the generic one. There's no need to invoke it, either, when initializing a CP50220 conversion filter.
|
Revision tags: php-7.3.26, php-7.3.26RC1, php-7.3.25, php-7.3.25RC1, php-7.3.24 |
|
#
a06c20a1 |
| 18-Oct-2020 |
Alex Dowad |
Remove useless constant MBFL_ENCTYPE_MBCS This flag indicated that an encoding was 'multi-byte'; it can use a variable number of bytes to encode each character. As it turns out, we don't actually need to check this flag anywhere, so it's better to remove it.
|
#
34ece408 |
| 17-Oct-2020 |
Alex Dowad |
Remove useless mbstring encoding 'JIS-ms' Microsoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called CP50220, CP50221, and CP50222. All three are supported by mbstring. Since these encodings are very similar, some code can be shared. Actually, conversion of CP50220/1/2 to Unicode is exactly the same operation; it's when converting from Unicode to CP50220/1/2 that some small differences arise in how certain katakana are handled. The most important common code was a function called `mbfl_filt_wchar_jis_ms`. The `jis_ms` part doubtless refers to the fact that these encodings are modified versions of 'JIS' invented by 'MS'. mbstring also went a step further and exported 'JIS-ms' to userland as a separate encoding from CP50220/1/2. If users requested 'JIS-ms' conversion, they got something like CP50220/1/2, minus their special ways of handling half-width katakana when converting from Unicode. But... that 'encoding' is not something which actually exists in the world outside of mbstring. CP50220/1/2 do exist in Microsoft software, but not 'JIS-ms'. For a text encoding conversion library, inventing new variant encodings and implementing them is not very productive. Our interest is in handling text encodings which real people actually use for... you know, storing actual text and things like that.
|
Revision tags: php-7.3.24RC1 |
|
#
fcbe45de |
| 07-Oct-2020 |
Alex Dowad |
Remove useless mbstring encoding 'CP50220-raw' CP50220 is a variant of ISO-2022-JP invented by Microsoft, which handles some Unicode characters which are not representable in ISO-2022-JP by converting them to similar characters which are representable. What, then, is CP50220-raw? An Internet search turns up absolutely nothing. Reference works which I consulted don't say anything about it. Other text conversion libraries don't support it. From looking at the code: It's just the same as CP50220, but it accepts unmapped JIS X 0208 characters passed through from other Japanese encodings and silently encodes them using the usual ISO-2022-JP escape sequence and representation for JIS X 0208 characters. It's hard to see how this could be useful. OK, let me come out and say it: it's _not_ useful. We can confidently jettison this (mis)feature.
|
#
888f5d77 |
| 13-Jan-2021 |
Alex Dowad |
CP5022{0,1,2}: treat truncated multibyte characters as error |
#
0ec34da8 |
| 05-Jan-2021 |
Alex Dowad |
CP5022{0,1,2}: treat unrecognized escapes as error |
#
a50607d1 |
| 05-Jan-2021 |
Alex Dowad |
CP5022{0,1,2}: use JISX0201 for U+203E (overline) Same issue as d497c0e96f addressed for JIS7/JIS8, but for CP5022{0,1,2} this time. |
#
5e5243ab |
| 11-Oct-2020 |
Alex Dowad |
CP5022{0,1,2}: convert Unicode codepoints in 'user' area (0xE000-E757) correctly Unicode has a range of 'private' codepoints which individual applications can use for their own purposes. When they were inventing CP932, Microsoft mapped these 'private' or 'user' codepoints to twenty new rows added to the JIS X 0208 character table. (JIS X 0208 is based on a 94x94 table; MS used rows 95-114 for private characters.) `mbfl_filt_conv_wchar_jis_ms` converted these private codepoints to rows 85-94 rather than 95-114. The code included a link to a document on the OpenGroup web site, dating back to 1996 [1], which proposed mapping private codepoints to these rows. However, that is not consistent with what mbstring does when converting CP5022x to Unicode. There seems to be a dearth of information on CP5022x on the web. However, I did find one (Japanese-language) page on CP50221, which states that it maps kuten codes 0x7F21-0x927E to the 'private' Unicode codepoints [2]. As a side note, using rows higher than 95 does seem to defeat one purpose of using an ISO-2022-JP variant: ISO-2022-JP was specifically designed to be "7-bit clean", but once you go beyond row 95, the ku codes are 0x80 and up, so 8 bits are needed. [1] https://web.archive.org/web/20000229180004/http://www.opengroup.or.jp/jvc/cde/ucs-conv.html [2] https://www.wdic.org/w/WDIC/Microsoft%20Windows%20Codepage%20%3A%2050221
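The arithmetic behind the ranges quoted in this commit can be checked with a short sketch. It assumes a plain row-major mapping, where consecutive codepoints fill each 94-cell row in turn starting at row 95; the commit message gives only the endpoints, so the linear layout and the names `demo_kuten`, `demo_pua_to_kuten` and `demo_kuten_to_code` are assumptions for illustration.

```c
#include <assert.h>

/* Row (ku) and column (ten) in a 94-column table, both 1-based. */
typedef struct demo_kuten {
    int ku;
    int ten;
} demo_kuten;

/* Assumed row-major mapping of U+E000..U+E757 onto rows 95..114.
 * 0xE757 - 0xE000 + 1 = 1880 codepoints = 20 rows of 94 cells. */
demo_kuten demo_pua_to_kuten(unsigned int cp)
{
    assert(cp >= 0xE000 && cp <= 0xE757);
    unsigned int offset = cp - 0xE000;
    demo_kuten k;
    k.ku = 95 + (int)(offset / 94);
    k.ten = 1 + (int)(offset % 94);
    return k;
}

/* Two-byte code as used in ISO-2022-JP style encodings: ku and ten are
 * each offset by 0x20, so row 1 column 1 becomes 0x2121. */
unsigned int demo_kuten_to_code(demo_kuten k)
{
    return ((unsigned int)(k.ku + 0x20) << 8) | (unsigned int)(k.ten + 0x20);
}
```

Under this assumption U+E000 lands on 0x7F21 and U+E757 on 0x927E, matching the range cited from reference [2]; the first byte reaches 0x80 at row 96, which is the 7-bit-cleanliness concern raised in the side note.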
|
#
6e9c8386 |
| 11-Oct-2020 |
Alex Dowad |
CP5022{0,1,2}: convert characters in ku 0x2D (13th row) correctly Essentially, CP5022{0,1,2} are to CP932 as ISO-2022-JP is to Shift-JIS. As Shift-JIS and ISO-2022-JP both encode characters from the JIS X 0208 charset, CP932 and CP5022x both encode characters from JIS X 0208 _plus_ extra characters added as Microsoft vendor extensions. Among the added characters are a number of symbols which MS put in the 13th row of the 94x94 character table. (In JIS X 0208, that row is empty.) mbfilter_cp50220x.c had an `if` clause which was intended to handle the conversion of characters in that 13th row, but it was dead code, as the previous clause was always true in those cases. The solution is to reverse the order of those two clauses (just as they already appeared in mbfilter_cp932.c).
|
#
cdd07242 |
| 08-Oct-2020 |
Alex Dowad |
Stricter handling of erroneous input when converting CP5022{0,1,2} text encoding Don't allow escape sequences to start in the middle of a multibyte character. Also, don't silently pass through illegal bytes which appear where the 2nd byte of a multibyte character should be.
|
#
ecf71847 |
| 14-Nov-2020 |
Alex Dowad |
Convert U+FF5E (FULLWIDTH TILDE) to 0x8160 (WAVE DASH) in SJIS variants By entering this character in the JIS X 0208 conversion table, we can remove a bunch of explicit `if` clauses in different conversion filters. It also means that U+FF5E can be converted into SJIS-mac now; I don't know why this one SJIS variant rejected U+FF5E before, since 0x8160 means the same thing in SJIS-mac as the others.
|
#
5ffcf563 |
| 07-Oct-2020 |
Alex Dowad |
Don't pass invalid JIS X 0212, JIS X 0213, and Windows-CP932 characters through Similarly to JIS X 0208, mbstring would pass kuten codes which are not mapped in the JIS X 0212, JIS X 0213, or CP932 character sets through silently when converting to another Japanese encoding.
|
#
8ae04733 |
| 07-Oct-2020 |
Alex Dowad |
Don't pass invalid JIS X 0208 characters through Many Japanese encodings, such as JIS7/8, Shift JIS, ISO-2022-JP, EUC-JP, and so on encode characters from the JIS X 0208 character set. JIS X 0208 is based on the concept of a 94x94 table, with numbered rows and columns. However, more than a thousand of the cells in that table are empty; JIS X 0208 does not actually use all 94x94=8,836 possible kuten codes. mbstring had a dubious feature whereby, if a Japanese string contained one of these 'unmapped' kuten codes, and it was being converted to another Japanese encoding which was also based on JIS X 0208, the non-existent character would be silently passed through, and the unmapped kuten code would be re-encoded using the normal encoding method of the target text encoding. Again, this _only_ happened if converting the text with the funky kuten code to a Japanese encoding. If one tried converting it to Unicode, mbstring would treat that as an error. If somebody, somewhere, made their own private extension to JIS X 0208, and used the regular Japanese encodings like Shift JIS and EUC-JP to encode this private character set, then this feature might conceivably be useful. But how likely is that? If someone is using Shift JIS, EUC-JP, ISO-2022-JP, etc. to encode a funky version of JIS X 0208 with extra characters added, then that should be treated as a separate text encoding. The code which flags such characters with MBFL_WCSPLANE_JIS0208 is retained solely for error reporting in `mbfl_filt_conv_illegal_output`.
|
#
3e7acf90 |
| 04-Nov-2020 |
Alex Dowad |
Remove mbstring identify filters mbstring had an 'identify filter' for almost every supported text encoding which was used when auto-detecting the most likely encoding for a string. It would run over the string and set a 'flag' if it saw anything which did not appear likely to be the encoding in question. One problem with this scheme was that encodings which merely appeared less likely to be the correct one were completely rejected, even if there was no better candidate. Another problem was that the 'identify filters' had a huge amount of code duplication with the 'conversion filters'. Eliminate the identify filters. Instead, when auto-detecting text encoding, use conversion filters to see whether the input string is valid in candidate encodings or not. At the same time, watch the type of codepoints which the string decodes to and mark it as less likely if non-printable characters (ESC, form feed, bell, etc.) or 'private use area' codepoints are seen. Interestingly, one old test case in which JIS text was misidentified as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed' and the JIS string is now auto-detected as JIS.
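The detection approach described here (reject candidates in which the input is invalid, and merely penalise candidates that decode to unlikely codepoints) can be sketched roughly as below. The scoring weights, the exact set of penalised codepoints, and the names `demo_score` and `demo_pick` are assumptions made for illustration, not mbstring's actual heuristics.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker: the candidate encoding could not decode the input. */
#define DEMO_INVALID (-1)

/* Score one candidate from the codepoints its conversion filter produced.
 * A negative codepoint stands for a decoding error. Returns DEMO_INVALID,
 * or a demerit count where lower means 'more likely'. */
int demo_score(const int *cps, size_t n)
{
    int demerits = 0;
    for (size_t i = 0; i < n; i++) {
        int c = cps[i];
        if (c < 0) {
            return DEMO_INVALID;
        }
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
            demerits++; /* non-printable: ESC, form feed, bell, ... */
        } else if (c >= 0xE000 && c <= 0xF8FF) {
            demerits++; /* private use area */
        }
    }
    return demerits;
}

/* Pick the valid candidate with the fewest demerits; -1 if none is valid. */
int demo_pick(const int *scores, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (scores[i] == DEMO_INVALID) {
            continue;
        }
        if (best < 0 || scores[i] < scores[(size_t)best]) {
            best = (int)i;
        }
    }
    return best;
}
```

The key difference from the old identify filters is that a penalised candidate remains eligible: it loses only to a candidate with fewer demerits, and still wins if every alternative is invalid.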
|
Revision tags: php-7.3.23, php-7.3.23RC1, php-7.3.22, php-7.3.22RC1, php-7.3.21, php-7.3.21RC1 |
|
#
0ffc1f55 |
| 16-Jul-2020 |
Alex Dowad |
Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c
- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for things which will never be NULL)
- Don't return success/failure code from functions which can never fail
- For encoding structs, don't use pointers to pointers to pointers for the list of alias strings. Pointers to pointers (2 levels of indirection) is what actually makes sense. This gets rid of some extraneous dereference operations.
|
#
3f1851de |
| 19-Sep-2020 |
Alex Dowad |
Avoid compiler warnings related to mbstring flush functions |
#
b1c5532a |
| 15-Sep-2020 |
Remi Collet |
fix mbfl function prototypes; re-add mbfl_convert_filter_feed API; re-add pointer cast |
#
a2b40ee9 |
| 16-Jul-2020 |
Alex Dowad |
Remove unneeded function mbfl_filt_ident_common_dtor This was the default destructor for mbfl_identify_filter structs, but there's nothing we actually need to do to those structs before freeing them.
|