mbfilter_cp5022x.c - OpenGrok history log for /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter

Revision	Date	Author	Comments
# 6fc8d014	21-Mar-2023	pakutoma	Fix phpGH-10648: add check function pointer into mbfl_encoding Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
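The split between conversion and validation described in 6fc8d014 can be illustrated with a minimal sketch. Everything here is hypothetical: `toy_encoding`, `toy_convert`, `toy_check`, and the two-byte toy encoding are stand-ins invented for illustration, not the real `struct mbfl_encoding` or any actual mbstring function. The only idea taken from the commit message is that an encoding carries a separate function pointer for validation, so truncated input can convert successfully yet fail validation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical function pointer types; not the real mbstring signatures. */
typedef size_t (*toy_convert_fn)(const unsigned char *in, size_t len, unsigned int *out);
typedef bool (*toy_check_fn)(const unsigned char *in, size_t len);

/* Hypothetical, heavily simplified stand-in for struct mbfl_encoding. */
typedef struct {
    const char *name;
    toy_convert_fn convert; /* lenient: always produces some output */
    toy_check_fn check;     /* strict: used only for validation */
} toy_encoding;

/* Toy encoding: a byte < 0x80 is one character; a byte >= 0x80 is the
 * first byte of a two-byte character. */
static size_t toy_convert(const unsigned char *in, size_t len, unsigned int *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];
        } else if (i + 1 < len) {
            out[n++] = ((unsigned int)in[i] << 8) | in[i + 1];
            i++;
        }
        /* Otherwise a dangling lead byte at the end of the input is
         * dropped; the conversion still completes. */
    }
    return n;
}

/* Validation rejects the truncated input that conversion tolerates. */
static bool toy_check(const unsigned char *in, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (in[i] >= 0x80) {
            if (i + 1 >= len) {
                return false; /* missing second byte */
            }
            i++;
        }
    }
    return true;
}

static const toy_encoding toy = { "TOY", toy_convert, toy_check };

/* Validation goes through the dedicated pointer, not through convert. */
static bool toy_is_valid(const toy_encoding *enc, const unsigned char *in, size_t len)
{
    return enc->check(in, len);
}
```

With this sketch, the two-byte input `{'a', 0x81}` converts to a single character, while `toy_is_valid` reports it as invalid.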
# f3c8efd7	04-Aug-2022	Alex Dowad	In legacy text conversion filters, reset filter state in 'flush' function Up until now, I believed that mbstring had been designed such that (legacy) text conversion filter objects should not be re-used after the 'flush' function is called to complete a text conversion operation. However, it turns out that the implementation of _php_mb_encoding_handler_ex DID re-use filter objects after flush. That means that functions which were based on _php_mb_encoding_handler_ex, including mb_parse_str and php_mb_post_handler, would break in some cases; state left over from converting one substring (perhaps a variable name) would affect the results of converting another substring (perhaps the value of the same variable), and could cause extraneous characters to get inserted into the output. All this code should be deleted soon, but fixing it helps me to avoid spurious failures when fuzzing the new/old code to look for differences in behavior. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 3517a70f	04-Aug-2022	Alex Dowad	Fix legacy text conversion filter for CP50220 CP50220 converts some codepoints which represent kana (hiragana/katakana) to a different form. This is the only difference between CP50220 and CP50221 (which doesn't perform such conversion). In some cases, this conversion means collapsing two codepoints to a single output byte sequence. Since the legacy text conversion filters only worked a byte at a time, the legacy filter had to cache a byte, then wait until it was called again with the next byte to compare the cached byte with the following one. That was all fine, but it didn't work as intended when there were errors (invalid byte sequences) in the input. Our code (both old and new) for emitting error markers recursively calls the same conversion filter. When the old CP50220 filter was called recursively, the logic for managing cached bytes did not behave as intended. As a result, the error markers could be reordered with other characters in the output. I used an ugly hack to fix this in 6938e3512; when making a recursive call to emit an error marker, temporarily swap out `filter->filter_function` to bypass the byte-caching code, so the error marker immediately goes through to the output. This worked, but I overlooked the fact that the very same problem can occur if an invalid byte sequence is detected in the flush function. Apply the same (ugly) fix. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 88d13491	04-Aug-2022	Alex Dowad	Make control flow in mb_wchar_to_cp50220 a bit clearer /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 78ee1841	26-Jul-2022	Alex Dowad	Move kana conversion function to mbfilter_cp5022x.c ...To avoid a dependency from libmbfl to mbstring. Thanks to Nikita Popov for pointing this issue out. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 44b4fb2c	23-Jul-2022	Alex Dowad	Fix legacy text conversion filter for CP50220 In my recent commit which replaced the implementation of mb_convert_kana, the commit message noted that mb_convert_kana previously had a bug whereby null bytes would be 'swallowed' and not passed to the output. This was actually the reason. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 9ac49c0d	12-Jul-2022	Alex Dowad	New implementation of mb_convert_kana mb_convert_kana now uses the new text encoding conversion filters. Microbenchmarking shows speed gains of 50%-150% across various text encodings and input string lengths. The behavior is the same as the old mb_convert_kana except for one fix: if the 'zero codepoint' U+0000 appeared in the input, the old implementation would sometimes drop it, not passing it through to the output. This is now fixed. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 3cf43279	25-Jun-2022	Alex Dowad	Fix new conversion filter for CP50220 (multi-codepoint kana at end of buffer) If two codepoints which needed to be collapsed into a single kuten code were separated, with one at the end of one buffer and the other at the beginning of the next buffer, they were not converted correctly. This was discovered while fuzzing the new implementation of mb_decode_numericentity. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 6938e351	05-Jul-2022	Alex Dowad	Fix legacy conversion filter for CP50220 /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 8533fccd	06-Jun-2022	Alex Dowad	Assert minimum size of wchar buffer in text conversion filters In all text conversion filters which require the wchar buffer used for output to have some minimum size, it's better to include an assertion; this will help us to catch bugs, and will also help future readers to understand what we expect of the function arguments. For UTF-7 and UTF7-IMAP, these assertions were already there, but I have added comments explaining why the minimum size is what it is. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# e2c4fc57	23-May-2022	Alex Dowad	Fix buffer overflow bugs in CP50222 text conversion code /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 53ffba96	20-Dec-2021	Alex Dowad	Implement fast text conversion interface for CP5022{0,1,2} /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 3c732251	21-Jul-2021	Alex Dowad	New internal interface for fast text conversion in mbstring When converting text to/from wchars, mbstring makes one function call for each and every byte or wchar to be converted. Typically, each of these conversion functions contains a state machine, and its state has to be restored and then saved for every single one of these calls. It doesn't take much to see that this is grossly inefficient. Instead of converting one byte or wchar on each call, the new conversion functions will either fill up or drain a whole buffer of wchars on each call. In benchmarks, this is about 3-10× faster. Adding the new, faster conversion functions for all supported legacy text encodings still needs some work. Also, all the code which uses the old-style conversion functions needs to be converted to use the new ones. After that, the old code can be dropped. (The mailparse extension will also have to be fixed up so it will still compile.) /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
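The buffer-at-a-time approach described in 3c732251 can be sketched roughly as below. This is an illustration only: `toy_to_wchar`, its parameter list, and the two-byte toy encoding are assumptions made for this example and are not claimed to match the actual mbstring interface. The point it shows is that the decoder state is restored and saved once per call, while a whole buffer of wchars is filled in between.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy encoding: a byte < 0x80 is one character; a byte >= 0x80
 * is the first byte of a two-byte character.
 *
 * Fills `buf` with up to `buf_size` decoded values, advances *in, decrements
 * *in_len, and returns the number of values written. `*state` holds a pending
 * lead byte between calls (0 means none). */
static size_t toy_to_wchar(const unsigned char **in, size_t *in_len,
                           uint32_t *buf, size_t buf_size, unsigned int *state)
{
    const unsigned char *p = *in;
    const unsigned char *end = p + *in_len;
    uint32_t *out = buf;
    uint32_t *limit = buf + buf_size;
    unsigned int lead = *state; /* state restored once per call */

    while (p < end && out < limit) {
        unsigned char c = *p++;
        if (lead) {
            *out++ = ((uint32_t)lead << 8) | c; /* complete a 2-byte char */
            lead = 0;
        } else if (c < 0x80) {
            *out++ = c;
        } else {
            lead = c; /* remember the lead byte, emit nothing yet */
        }
    }

    *state = lead; /* state saved once per call */
    *in_len = (size_t)(end - p);
    *in = p;
    return (size_t)(out - buf);
}
```

A caller would loop, draining `buf` after each call, until `*in_len` reaches zero; a nonzero `*state` at that point would indicate a truncated final character.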
# dcaa010f	07-Aug-2020	Alex Dowad	Strict validation of conversion flags to mb_convert_kana mb_convert_kana is controlled by user-provided flags, which specify what it should convert and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine inverse flags. But, clever reader of commit logs, you will surely say: What if I want all my halfwidth numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and will never be used, and face up to another stark reality: mb_convert_kana does not work for that case, and never has. This was probably never noticed because nobody ever tried. Disallowing useless combinations of flags gives freedom to rearrange the kana conversion code without changing behavior. We can also reject unrecognized flags. This may help users to catch bugs. Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized at all). /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
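The flag validation described in dcaa010f (reject unrecognized flags, reject inverse pairs used together) could look something like the following sketch. The flag names, bit values, and the function `toy_flags_valid` are hypothetical and chosen only for illustration; they are not mbstring's actual constants.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical conversion flags; not the real mbstring constants. */
enum {
    TOY_FULL_TO_HALF_NUMERIC = 1 << 0,
    TOY_HALF_TO_FULL_NUMERIC = 1 << 1,
    TOY_FULL_TO_HALF_ALPHA   = 1 << 2,
    TOY_HALF_TO_FULL_ALPHA   = 1 << 3
};

#define TOY_ALL_FLAGS 0xFu

/* Each row lists two flags which undo each other. */
static const unsigned int toy_inverse_pairs[][2] = {
    { TOY_FULL_TO_HALF_NUMERIC, TOY_HALF_TO_FULL_NUMERIC },
    { TOY_FULL_TO_HALF_ALPHA,   TOY_HALF_TO_FULL_ALPHA   }
};

/* Returns true only if every bit is recognized and no inverse pair is
 * requested together. */
static bool toy_flags_valid(unsigned int mode)
{
    if (mode & ~TOY_ALL_FLAGS) {
        return false; /* unrecognized flag */
    }
    size_t n = sizeof toy_inverse_pairs / sizeof toy_inverse_pairs[0];
    for (size_t i = 0; i < n; i++) {
        if ((mode & toy_inverse_pairs[i][0]) && (mode & toy_inverse_pairs[i][1])) {
            return false; /* both directions of one conversion requested */
        }
    }
    return true;
}
```

Keeping the pairs in a table means a new flag pair only needs one extra row rather than another hand-written condition.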
# 0957f54e	31-Aug-2021	Alex Dowad	Treat truncated escape sequences for CP5022{0,1,2} as error /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 64e379d8	31-Aug-2021	Alex Dowad	Declare CP50222 flush function as 'static' /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 626f0fec	30-Jul-2021	Alex Dowad	Remove some dead code from mbstring mbstring has a great deal of dead code. Some common types are: - Default switch clauses which will never be taken - If clauses intended to convert codepoints which were not present in a conversion table... but the codepoint in question is in the table, so the if clause is not needed. - Bounds checks in places where it is not possible for a value to ever be out of bounds. - Checks to see if an unmatched Unicode codepoint is in CP932 extension range 3... but every codepoint in range 3 is also in range 2, so no codepoint will ever be matched and converted by that code. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# e3f6a9fb	13-Aug-2021	Alex Dowad	CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119 mbstring has always had the conversion tables to support CP932 codes in ku 115-119, and the conversion code for CP5022x has an 'if' clause specifically to handle such characters... but that 'if' clause was dead code, since a guard clause earlier in the same function prevented it from accepting 2-byte characters with a starting byte of 0x93-0x97. Adjust the guard clause so that these characters can be converted as the original author apparently intended. The code which handles ku 115-119 is the part which reads: } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) { w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min]; /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 776296e1	30-Aug-2021	Alex Dowad	mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyways. More to the point, a search of publicly available PHP code indicates that nobody is really using this feature anyways. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1.
One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
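The return-value convention described in 776296e1 can be sketched with a two-stage filter chain. All names here (`toy_filter`, `toy_input`, `toy_output`, `TOY_BAD_INPUT`) are hypothetical and exist only for this illustration. What the sketch takes from the commit message is that the error marker passed between filters is -1, that -1 as a return value means 'abort', and that filters therefore return 0 on success instead of echoing back what they received.

```c
#include <stddef.h>

/* Single error marker passed downstream as a "character" (hypothetical name). */
#define TOY_BAD_INPUT (-1)

typedef struct toy_filter toy_filter;
struct toy_filter {
    int (*fn)(int c, toy_filter *self); /* returns 0 = ok, -1 = abort */
    toy_filter *next;
    char out[16];
    size_t out_len;
    int num_errors;
};

/* Output stage: writes characters, substituting '?' for the error marker. */
static int toy_output(int c, toy_filter *f)
{
    if (f->out_len >= sizeof f->out) {
        return -1; /* buffer full: abort the whole conversion */
    }
    if (c == TOY_BAD_INPUT) {
        f->num_errors++;
        c = '?';
    }
    f->out[f->out_len++] = (char)c;
    /* Returning `c` here would turn a received -1 into a false 'abort'
     * under the old convention; returning 0 avoids that ambiguity. */
    return 0;
}

/* Input stage: bytes >= 0x80 are treated as invalid in this toy encoding. */
static int toy_input(int c, toy_filter *f)
{
    int wchar = (c < 0x80) ? c : TOY_BAD_INPUT;
    return f->next->fn(wchar, f->next);
}
```

Feeding the bytes `'a'`, `0xFF`, `'b'` through `toy_input` leaves `a?b` in the output stage with one recorded error, and every call returns 0.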
# 958ef47d	08-May-2021	Alex Dowad	When flushing CP5022x conversion filter, also flush next filter in chain All the mbstring encoding conversion filters do this. I missed it when adding a flush function for CP5022x. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# ebe6500a	20-Jan-2021	Alex Dowad	Fix error reporting bug for Unicode -> CP50220 conversion To detect errors in conversion from Unicode to another text encoding, each mbstring conversion filter object maintains a count of 'bad' characters. After a conversion operation finishes, this count is checked to see if there was any error. The problem with CP50220 was that mbstring used a chain of two conversion filter objects. The 'bad character count' would be incremented on the second object in the chain, but this didn't do anything, as only the count on the first such object is ever checked. Fix this by implementing the conversion using a single conversion filter object, rather than a chain of two. This is possible because of the recent refactoring, which pulled out the needed logic for CP50220 conversion into a helper function. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 319a3408	03-Jan-2021	Alex Dowad	Simplify code for working with halfwidth/fullwidth kana conversion filter There's no need to dynamically allocate a struct to hold the 'mode' parameter; just store it directly in `filt->opaque`. Some other things were also being done in an unnecessarily roundabout way. Also, the 'copy' function for CP50220 conversion filters was both broken and unnecessary. Broken, because it malloc'd memory which was never freed by anything. Unnecessary, because the point of the copy is so that various algorithms can try running bytes through a conversion filter and see how many output bytes or characters result, and then back out by restoring the filters to their previous state. But here's the thing; CP50220 conversion filters don't hold cached bytes, which is the main thing which would need to be restored to a previous state. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 636251a5	03-Jan-2021	Alex Dowad	Remove useless function mbfl_filt_tl_jisx0201_jisx0208_init This constructor function doesn't do anything different than the generic one. There's no need to invoke it, either, when initializing a CP50220 conversion filter. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# a06c20a1	18-Oct-2020	Alex Dowad	Remove useless constant MBFL_ENCTYPE_MBCS This flag indicated that an encoding was 'multi-byte'; it can use a variable number of bytes to encode each character. As it turns out, we don't actually need to check this flag anywhere, so it's better to remove it. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c
# 34ece408	17-Oct-2020	Alex Dowad	Remove useless mbstring encoding 'JIS-ms' Microsoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called CP50220, CP50221, and CP50222. All three are supported by mbstring. Since these encodings are very similar, some code can be shared. Actually, conversion of CP50220/1/2 to Unicode is exactly the same operation; it's when converting from Unicode to CP50220/1/2 that some small differences arise in how certain katakana are handled. The most important common code was a function called `mbfl_filt_wchar_jis_ms`. The `jis_ms` part doubtless refers to the fact that these encodings are modified versions of 'JIS' invented by 'MS'. mbstring also went a step further and exported 'JIS-ms' to userland as a separate encoding from CP50220/1/2. If users requested 'JIS-ms' conversion, they got something like CP50220/1/2, minus their special ways of handling half-width katakana when converting from Unicode. But... that 'encoding' is not something which actually exists in the world outside of mbstring. CP50220/1/2 do exist in Microsoft software, but not 'JIS-ms'. For a text encoding conversion library, inventing new variant encodings and implementing them is not very productive. Our interest is in handling text encodings which real people actually use for... you know, storing actual text and things like that. /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c