mbfl_convert.c - OpenGrok history log for /PHP-8.3/ext/mbstring/libmbfl/mbfl/mbfl

Revision	Date	Author	Comments
# c717c79a	14-Apr-2023	Alex Dowad	Combine CJK encoding conversion code in a single source file. This will make it easier to combine duplicated code between all the CJK text encodings (a significant amount is already combined in this commit, such as the repeated definitions of SJIS_DECODE and SJIS_ENCODE), but I hope to remove even more redundancy in the future. The table used to implement mb_strlen for CP932 has been changed to the same table as "SJIS-win".	/PHP-8.3/ext/mbstring/libmbfl/mbfl/mbfl_convert.c
# 4427b2e1	10-Jan-2023	Alex Dowad	Mark UTF-8 strings emitted by mbstring functions as valid UTF-8. We now have a couple of mbstring functions which have fast paths for strings marked as 'valid UTF-8'. Later, we may likely have more. So that these fast paths can be used more frequently, mark UTF-8 strings emitted by mbstring as 'valid UTF-8'. This is always a correct thing to do, because mbstring never returns invalid UTF-8 as the result of a conversion (or similar) operation. Internally, we do have a conversion mode which deliberately emits invalid UTF-8 in some cases. (This is done to prevent unwanted matches when we are converting strings to UTF-8 before performing matching operations on them.) For such strings, don't set the 'valid UTF-8' flag. It probably wouldn't hurt anything to set it, because strings generated using that special conversion mode should never be returned to userland, and I don't think we do anything with them which cares about the IS_STR_VALID_UTF8 flag... but still, it would likely cause confusion for developers.
# b9cd1cdb	04-Dec-2022	Alex Dowad	Implement mb_substr_count using fast text conversion filters. The performance gain from this change depends on the text encoding and input string size. For very small strings, other overheads tend to swamp the performance gains to some extent, such that the speedup is less than 2x. For medium-length strings (~100 bytes or so), the speedup is typically around 2.5x. The greatest performance gains are for UTF-8 strings which have already been marked as valid (using the GC flags on the zend_string object); for those, the speedup is more than 10x in many cases. The previous implementation first converted the haystack and needle to wchars, then searched for matches between the two sequences of wchars. Because we use -1 as an error marker when converting to wchars, error markers from invalid byte sequences in the haystack would match error markers from invalid byte sequences in the needle, even if the specific invalid byte sequence was different. I am not sure whether this behavior is really desirable or not, but anyways, this new implementation follows the same behavior so as not to cause BC breaks.
# 3ce888a8	04-Oct-2022	Alex Dowad	Use uint32_t for 'illegal_substchar' codepoint in mbstring. This value is a wchar, so the best type for it is uint32_t.
Revision tags: php-8.2.0RC1, php-8.1.10, php-8.0.23, php-8.0.23RC1, php-8.1.10RC1, php-8.2.0beta3
# 983a29d3	06-Aug-2022	Alex Dowad	Legacy conversion code for '7bit' to '8bit' inserts error markers. The use of a special 'vtbl' for converting between '7bit' and '8bit' text meant that '7bit' text would not be converted to wchars before going to '8bit'. This meant that the special value MBFL_BAD_INPUT, which we use to flag an erroneous byte sequence in input text (and which is required by functions like mb_check_encoding), would pass directly to the output, instead of being converted to the error marker specified by mb_substitute_character. This issue dates back to the time when I removed the mbfl 'identify filters' and made encoding validity checking and encoding detection rely only on the conversion filters.
# a4656895	04-Aug-2022	Alex Dowad	Imitate legacy behavior when converting non-encodings using mbstring. Fuzzing revealed that something was missed here when making the new encoding conversion code match the behavior of the old code. In the next major release of PHP, support for these non-encodings will be dropped, but in the meantime, it is better to match the legacy behavior.
Revision tags: php-8.2.0beta2, php-8.1.9, php-8.0.22, php-8.1.9RC1, php-8.2.0beta1, php-8.0.22RC1, php-8.0.21, php-8.1.8, php-8.2.0alpha3, php-8.1.8RC1, php-8.2.0alpha2, php-8.0.21RC1, php-8.0.20, php-8.1.7, php-8.2.0alpha1, php-7.4.30, php-8.1.7RC1, php-8.0.20RC1
# 0154a5ac	13-May-2022	Alex Dowad	Use fast text conversion filters to implement php_mb_convert_encoding_ex
Revision tags: php-8.1.6, php-8.0.19
# e4b9aa18	08-May-2022	Alex Dowad	Add assertions to help catch buffer overflows in mbstring text conversion code
Revision tags: php-8.1.6RC1, php-8.0.19RC1, php-8.0.18, php-8.1.5, php-7.4.29, php-8.1.5RC1, php-8.0.18RC1, php-8.1.4, php-8.0.17, php-8.1.4RC1, php-8.0.17RC1, php-8.1.3, php-8.0.16, php-7.4.28, php-8.1.3RC1, php-8.0.16RC1, php-8.1.2, php-8.0.15, php-8.1.2RC1, php-8.0.15RC1, php-8.0.14, php-8.1.1, php-7.4.27, php-8.1.1RC1, php-8.0.14RC1, php-7.4.27RC1, php-8.1.0, php-8.0.13, php-7.4.26, php-7.3.33, php-8.1.0RC6, php-7.4.26RC1, php-8.0.13RC1, php-8.1.0RC5, php-7.3.32, php-7.4.25, php-8.0.12, php-8.1.0RC4, php-8.0.12RC1, php-7.4.25RC1, php-8.1.0RC3, php-8.0.11, php-7.4.24, php-7.3.31, php-8.1.0RC2, php-7.4.24RC1, php-8.0.11RC1, php-8.1.0RC1, php-7.4.23, php-8.0.10, php-7.3.30, php-8.1.0beta3, php-8.0.10RC1, php-7.4.23RC1, php-8.1.0beta2, php-8.0.9, php-7.4.22
# 3c732251	21-Jul-2021	Alex Dowad	New internal interface for fast text conversion in mbstring. When converting text to/from wchars, mbstring makes one function call for each and every byte or wchar to be converted. Typically, each of these conversion functions contains a state machine, and its state has to be restored and then saved for every single one of these calls. It doesn't take much to see that this is grossly inefficient. Instead of converting one byte or wchar on each call, the new conversion functions will either fill up or drain a whole buffer of wchars on each call. In benchmarks, this is about 3-10× faster. Adding the new, faster conversion functions for all supported legacy text encodings still needs some work. Also, all the code which uses the old-style conversion functions needs to be converted to use the new ones. After that, the old code can be dropped. (The mailparse extension will also have to be fixed up so it will still compile.)
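The difference between the two calling styles described in 3c732251 can be sketched roughly as follows. This is a minimal illustration only, not the mbstring code: the names `sketch_sink`, `sketch_feed_byte` and `sketch_latin1_to_wchars` are hypothetical, and ISO-8859-1 is used as an assumed example because each of its bytes maps directly to one codepoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-unit style: the caller makes one call per input byte,
 * and the sink state is read and written back on every call. */
typedef struct {
    uint32_t *out;  /* destination for decoded wchars */
    size_t count;   /* number of wchars written so far */
} sketch_sink;

int sketch_feed_byte(int c, sketch_sink *sink)
{
    assert(sink != NULL && sink->out != NULL);
    sink->out[sink->count++] = (uint32_t)(unsigned char)c;
    return 0;
}

/* Hypothetical batch style: one call fills as much of a wchar buffer as it
 * can, advances the input cursor, and returns the number of wchars written.
 * The caller loops until *in_len reaches zero. */
size_t sketch_latin1_to_wchars(const unsigned char **in, size_t *in_len,
                               uint32_t *buf, size_t buf_size)
{
    assert(buf != NULL && buf_size > 0);
    size_t n = (*in_len < buf_size) ? *in_len : buf_size;
    for (size_t i = 0; i < n; i++) {
        buf[i] = (*in)[i];  /* each ISO-8859-1 byte is its own codepoint */
    }
    *in += n;
    *in_len -= n;
    return n;
}
```

In the batch style the per-call overhead is paid once per buffer rather than once per byte, which is the effect the commit message attributes the speedup to.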
# 929d8471	03-Dec-2021	Christoph M. Becker	Fix #81693: mb_check_encoding(7bit) segfaults. `php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`. Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be found. Since we don't actually need to convert to wchar, we encode to 8bit. Closes GH-7712.
# f303fc8a	30-Aug-2021	Alex Dowad	Use bool in mbfl_filt_conv_output_hex (rather than int)
# 776296e1	30-Aug-2021	Alex Dowad	mbstring no longer provides 'long' substitutions for erroneous input bytes. Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyways. More to the point, a search of publicly available PHP code indicates that nobody is really using this feature anyways. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1.
One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.
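The return-value clash described in 776296e1 can be illustrated with a small sketch. The names `SKETCH_BAD_INPUT`, `sketch_output`, `sketch_old_output` and `sketch_new_output` are hypothetical and only model the two conventions; they are not the functions touched by the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the single error value, equal to -1 as a wchar. */
#define SKETCH_BAD_INPUT ((uint32_t)-1)

typedef struct {
    uint32_t last;  /* last valid codepoint received */
    int errors;     /* number of error markers received */
} sketch_output;

/* Old convention: return the received value on success, -1 to abort.
 * When the received value is itself the -1 error marker, the caller cannot
 * tell "handled an error marker" apart from "abort". */
int sketch_old_output(int c, sketch_output *out)
{
    assert(out != NULL);
    out->last = (uint32_t)c;
    return c;
}

/* New convention: always return 0 on success, so -1 can only mean abort. */
int sketch_new_output(int c, sketch_output *out)
{
    assert(out != NULL);
    if ((uint32_t)c == SKETCH_BAD_INPUT) {
        out->errors++;
    } else {
        out->last = (uint32_t)c;
    }
    return 0;
}
```

With the old convention, feeding the error marker yields a return value of -1, which a caller would read as a request to abort; with the new one, the same input returns 0 and the error is simply counted.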
# 97b7fc89	24-Jul-2021	Alex Dowad	Output illegal character marker for 4-byte illegal characters > 0x7FFFFFFF. Some text encodings supported by mbstring (such as UCS-4) accept 4-byte characters. When mbstring encounters an illegal byte sequence for the encoding it is using, it should emit an 'illegal character' marker, which can either be a single character like '?', an HTML hexadecimal entity, or a marker string like 'BAD+XXXX'. Because of the use of signed integers to hold 4-byte characters, illegal 4-byte sequences with a 'negative' value (one with the high bit set) were not handled correctly when emitting the illegal char marker. The result is that such illegal sequences were just skipped over (and the marker was not emitted to the output). Fix that.
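One way a signed type can hide such values is sketched below. This is an assumed illustration of the general pitfall, not the code changed in 97b7fc89: the function names and the range check against 0x10FFFF are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_CODEPOINT 0x10FFFF

/* Signed version: a 4-byte value with the high bit set is negative, so the
 * "too large" check never fires and no marker is written.
 * Returns the number of markers written (0 or 1). */
size_t sketch_mark_signed(int32_t c, uint32_t marker, uint32_t *out)
{
    assert(out != NULL);
    if (c > SKETCH_MAX_CODEPOINT) {
        out[0] = marker;
        return 1;
    }
    return 0;
}

/* Unsigned version: the same bit pattern is a large positive number, so the
 * check fires and the marker is written. */
size_t sketch_mark_unsigned(uint32_t c, uint32_t marker, uint32_t *out)
{
    assert(out != NULL);
    if (c > SKETCH_MAX_CODEPOINT) {
        out[0] = marker;
        return 1;
    }
    return 0;
}
```

Passing the bit pattern 0x80000000 to the signed version (as `INT32_MIN`) writes nothing, while the unsigned version writes the marker.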
Revision tags: php-8.1.0beta1, php-7.4.22RC1, php-8.0.9RC1, php-8.1.0alpha3, php-7.4.21, php-7.3.29, php-8.0.8, php-8.1.0alpha2, php-7.4.21RC1, php-8.0.8RC1, php-8.1.0alpha1, php-8.0.7, php-7.4.20, php-8.0.7RC1, php-7.4.20RC1, php-8.0.6, php-7.4.19, php-7.4.18, php-7.3.28, php-8.0.5, php-8.0.5RC1, php-7.4.18RC1, php-8.0.4RC1, php-7.4.17RC1, php-8.0.3, php-7.4.16, php-8.0.3RC1, php-7.4.16RC1, php-8.0.2, php-7.4.15, php-7.3.27, php-8.0.2RC1, php-7.4.15RC2, php-7.4.15RC1, php-8.0.1, php-7.4.14, php-7.3.26, php-7.4.14RC1, php-8.0.1RC1, php-7.3.26RC1, php-8.0.0, php-7.3.25, php-7.4.13, php-8.0.0RC5, php-7.4.13RC1, php-8.0.0RC4, php-7.3.25RC1, php-7.4.12, php-8.0.0RC3, php-7.3.24
# e2459857	22-Oct-2020	Alex Dowad	Remove duplicate implementation of CP932 from mbstring. Sigh. Double sigh. After fruitlessly searching the Internet for information on this mysterious text encoding called "SJIS-open", I wrote a script to try converting every Unicode codepoint from 0-0xFFFF and compare the results from different variants of Shift-JIS, to see which one "SJIS-open" would be most similar to. The result? It's just CP932. There is no difference at all. So why do we have two implementations of CP932 in mbstring? In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or "SJIS-ms"), add these as aliases to CP932 so existing code will continue to work.
# a900ec33	03-Jan-2021	Alex Dowad	Remove unneeded 'filter_ctor' member from mbfl_convert_filter struct. This function pointer is only called when initializing the struct. After that nothing is done with it. Therefore, there is no need to keep it in the struct.
# e169ad3b	03-Nov-2020	Alex Dowad	Consolidate all single-byte encodings in one source file. We can squeeze out a lot of duplicated code in this way.
# b05ad511	09-Nov-2020	Alex Dowad	Don't redundantly flush mbstring filters multiple times. Each flush function in a chain of mbstring conversion filters always calls the next flush function in the chain. So it is not necessary to explicitly flush the second filter in a chain. (Due to this bug, in many cases, flush functions were actually being called three times.)
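The chaining behavior behind b05ad511 can be modelled in a few lines. The struct `sketch_filter` and the function `sketch_flush` are hypothetical and reduce a filter to a call counter; they only show why an explicit flush of a later filter is redundant.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filter node: a link to the next filter plus a counter that
 * records how many times this filter has been flushed. */
typedef struct sketch_filter {
    struct sketch_filter *next;
    int flush_calls;
} sketch_filter;

/* Flushing a filter also flushes every filter after it in the chain. */
void sketch_flush(sketch_filter *f)
{
    assert(f != NULL);
    f->flush_calls++;
    if (f->next != NULL) {
        sketch_flush(f->next);
    }
}
```

Flushing the head of a two-filter chain already flushes the second filter once; calling `sketch_flush` on the second filter afterwards raises its count to two without doing any new work.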
# 3e7acf90	04-Nov-2020	Alex Dowad	Remove mbstring identify filters. mbstring had an 'identify filter' for almost every supported text encoding which was used when auto-detecting the most likely encoding for a string. It would run over the string and set a 'flag' if it saw anything which did not appear likely to be the encoding in question. One problem with this scheme was that encodings which merely appeared less likely to be the correct one were completely rejected, even if there was no better candidate. Another problem was that the 'identify filters' had a huge amount of code duplication with the 'conversion filters'. Eliminate the identify filters. Instead, when auto-detecting text encoding, use conversion filters to see whether the input string is valid in candidate encodings or not. At the same time, watch the type of codepoints which the string decodes to and mark it as less likely if non-printable characters (ESC, form feed, bell, etc.) or 'private use area' codepoints are seen. Interestingly, one old test case in which JIS text was misidentified as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed' and the JIS string is now auto-detected as JIS.
# cc03c54c	04-Nov-2020	Alex Dowad	Remove useless byte{2,4}{be,le} encodings from mbstring. There is no meaningful difference between these and UCS-{2,4}. They are just a little bit more lax about passing errors silently. They also have no known use. Alias to UCS-{2,4} in case someone, somewhere is using them.
# fde77945	18-Oct-2020	Alex Dowad	Remove dead code from mbfilter_iso8859_{2,4,5,9,10,13,14,15,16}.c ...Plus some dead code related to ISO-8859-1.
Revision tags: php-8.0.0RC2, php-7.4.12RC1, php-7.3.24RC1, php-7.2.34, php-8.0.0rc1, php-7.4.11, php-7.3.23
# 3f1851de	19-Sep-2020	Alex Dowad	Avoid compiler warnings related to mbstring flush functions
Revision tags: php-8.0.0beta4
# b1c5532a	15-Sep-2020	Remi Collet	fix mbfl function prototypes; re-add mbfl_convert_filter_feed API; re-add pointer cast
Revision tags: php-7.4.11RC1, php-7.3.23RC1, php-8.0.0beta3, php-7.4.10, php-7.3.22, php-8.0.0beta2, php-7.3.22RC1, php-7.4.10RC1, php-8.0.0beta1, php-7.4.9, php-7.2.33, php-7.3.21, php-8.0.0alpha3, php-7.4.9RC1, php-7.3.21RC1
# dcd6c604	16-Jul-2020	Alex Dowad	Remove unneeded function mbfl_filt_conv_common_dtor. This is a default destructor for mbfl_convert_filter structs. The thing is: there isn't really anything that needs to be done to those structs before freeing them. The default destructor just zeroed out some fields, but there's no reason why we should actually do that.
# 409aa20a	15-Jul-2020	Alex Dowad	Refactor mbfl_convert.c
Revision tags: php-7.4.8, php-7.2.32, php-8.0.0alpha2, php-7.3.20
# 62317d59	04-Jul-2020	Alex Dowad	Remove redundant includes from mbstring (and make sure correct config.h is used). Very interesting... it turns out that when Valgrind support was enabled, `#include "config.h"` from within mbstring was actually including the file "config.h" from Valgrind, and not the one from mbstring!! This is because -I/usr/include/valgrind was added to the compiler invocation _before_ -Iext/mbstring/libmbfl. Make sure we actually include the file which was intended.