#
1f0cf133 |
| 30-Sep-2023 |
Alex Dowad |
Add fast mb_strcut implementation for UTF-8 The old implementation runs through the entire string to pick out the part which should be returned by mb_strcut. This creates significant
Add fast mb_strcut implementation for UTF-8 The old implementation runs through the entire string to pick out the part which should be returned by mb_strcut. This creates significant performance overhead. The new specialized implementation of mb_strcut for UTF-8 usually only examines a few bytes around the starting and ending cut points, meaning it generally runs in constant time. For UTF-8 strings just a few bytes long, the new implementation is around 10% faster (according to microbenchmarks which I ran locally). For strings around 10,000 bytes in length, it is 50-300x faster. (Yes, that is 300x and not 300%.) The new implementation behaves identically to the old one on VALID UTF-8 strings; a fuzzer was used to help ensure this is the case. On invalid UTF-8 strings, there is a difference: in some cases, the old implementation will pass invalid byte sequences through unchanged, while in others it will remove them. The new implementation has behavior which is perhaps slightly more predictable: it simply backs up the starting and ending cut points to the preceding "starter byte" (one which is not a UTF-8 continuation byte).
show more ...
|
#
b721d0f7 |
| 10-Mar-2023 |
pakutoma |
Fix phpGH-10648: add check function pointer into mbfl_encoding Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are ca
Fix phpGH-10648: add check function pointer into mbfl_encoding Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS. (The same change has already been made to PHP 8.2 and 8.3; see 6fc8d014df. This commit is backporting the change to PHP 8.1.)
show more ...
|
#
6fc8d014 |
| 21-Mar-2023 |
pakutoma |
Fix phpGH-10648: add check function pointer into mbfl_encoding Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are ca
Fix phpGH-10648: add check function pointer into mbfl_encoding Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS.
show more ...
|
#
a1a69c37 |
| 06-Dec-2022 |
Alex Dowad |
Support Microsoft's "Best Fit" mappings for Windows-1252 text encoding In b5ff87ca71, I made a number of adjustments to our conversion code for CP1252. One of the adjustments was to make
Support Microsoft's "Best Fit" mappings for Windows-1252 text encoding In b5ff87ca71, I made a number of adjustments to our conversion code for CP1252. One of the adjustments was to make the mappings match those published by the Unicode Consortium in the file CP1252.TXT. These do not include mappings for the CP1252 bytes 0x81, 0x8D, 0x8F, 0x90, and 0x9D. Rostyslav Gulka reported that this caused a problem. His application stores binary JPEG data in an MS-SQL database. When they SELECT the binary data out of the database, it is treated as CP1252 text and automatically converted to UTF-8. To recover the original binary data, they then do a conversion from UTF-8 to CP1252. Obviously, that does not work if certain CP1252 bytes do not map to any Unicode codepoint at all. While this is a very unusual application of text encoding conversion, and we might choose not to support it if there was no other basis for including those mappings, it seems that Microsoft does actually include them in the Win32 API as "best fit" mappings. These are extra mappings from Unicode to other text encodings, which the Win32 API function WideCharToMultiByte uses by default unless the WC_NO_BEST_FIT_CHARS flag was passed. A list of these "best fit" mappings for CP1252 can be found here: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
show more ...
|
Revision tags: php-8.2.0RC1, php-8.1.10, php-8.0.23, php-8.0.23RC1, php-8.1.10RC1, php-8.2.0beta3, php-8.2.0beta2, php-8.1.9, php-8.0.22, php-8.1.9RC1, php-8.2.0beta1, php-8.0.22RC1, php-8.0.21, php-8.1.8, php-8.2.0alpha3 |
|
#
cebb8009 |
| 29-Jun-2022 |
Alex Dowad |
Fix legacy conversion filters for... almost all 8-bit text encodings |
Revision tags: php-8.1.8RC1, php-8.2.0alpha2, php-8.0.21RC1, php-8.0.20, php-8.1.7, php-8.2.0alpha1, php-7.4.30, php-8.1.7RC1, php-8.0.20RC1, php-8.1.6, php-8.0.19, php-8.1.6RC1, php-8.0.19RC1, php-8.0.18, php-8.1.5, php-7.4.29, php-8.1.5RC1, php-8.0.18RC1, php-8.1.4, php-8.0.17, php-8.1.4RC1, php-8.0.17RC1, php-8.1.3, php-8.0.16, php-7.4.28, php-8.1.3RC1, php-8.0.16RC1, php-8.1.2, php-8.0.15, php-8.1.2RC1, php-8.0.15RC1, php-8.0.14, php-8.1.1, php-7.4.27, php-8.1.1RC1, php-8.0.14RC1, php-7.4.27RC1, php-8.1.0, php-8.0.13, php-7.4.26, php-7.3.33, php-8.1.0RC6, php-7.4.26RC1, php-8.0.13RC1, php-8.1.0RC5, php-7.3.32, php-7.4.25, php-8.0.12, php-8.1.0RC4, php-8.0.12RC1, php-7.4.25RC1, php-8.1.0RC3, php-8.0.11, php-7.4.24, php-7.3.31, php-8.1.0RC2, php-7.4.24RC1, php-8.0.11RC1, php-8.1.0RC1, php-7.4.23, php-8.0.10, php-7.3.30, php-8.1.0beta3, php-8.0.10RC1, php-7.4.23RC1, php-8.1.0beta2, php-8.0.9, php-7.4.22 |
|
#
3c732251 |
| 21-Jul-2021 |
Alex Dowad |
New internal interface for fast text conversion in mbstring When converting text to/from wchars, mbstring makes one function call for each and every byte or wchar to be converted. Typica
New internal interface for fast text conversion in mbstring When converting text to/from wchars, mbstring makes one function call for each and every byte or wchar to be converted. Typically, each of these conversion functions contains a state machine, and its state has to be restored and then saved for every single one of these calls. It doesn't take much to see that this is grossly inefficient. Instead of converting one byte or wchar on each call, the new conversion functions will either fill up or drain a whole buffer of wchars on each call. In benchmarks, this is about 3-10× faster. Adding the new, faster conversion functions for all supported legacy text encodings still needs some work. Also, all the code which uses the old-style conversion functions needs to be converted to use the new ones. After that, the old code can be dropped. (The mailparse extension will also have to be fixed up so it will still compile.)
show more ...
|
#
776296e1 |
| 30-Aug-2021 |
Alex Dowad |
mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like
mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyways. More to the point, a search of publically available PHP code indicates that nobody is really using this feature anyways. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1. One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.
show more ...
|
#
9363b0b5 |
| 24-Jul-2021 |
Alex Dowad |
Declare ARMSCII-8 conversion functions as 'static' |
#
28500fe4 |
| 11-Aug-2021 |
Nikita Popov |
Fixed bug #81349 The ascii to wchar was reporting errors using conv_illegal_output, while it should have been using WCSGROUP_THROUGH. Effectively that replaced illegal characters wit
Fixed bug #81349 The ascii to wchar was reporting errors using conv_illegal_output, while it should have been using WCSGROUP_THROUGH. Effectively that replaced illegal characters with '?' for the purpose of identification.
show more ...
|
Revision tags: php-8.1.0beta1, php-7.4.22RC1, php-8.0.9RC1, php-8.1.0alpha3, php-7.4.21, php-7.3.29, php-8.0.8, php-8.1.0alpha2, php-7.4.21RC1, php-8.0.8RC1, php-8.1.0alpha1, php-8.0.7, php-7.4.20, php-8.0.7RC1, php-7.4.20RC1 |
|
#
01b3fc03 |
| 06-May-2021 |
KsaR |
Update http->https in license (#6945) 1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https. 2. Update few license 3.0 to 3.01 as
Update http->https in license (#6945) 1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https. 2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier". 3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted. 4. fixed indentation in some files before |
show more ...
|
Revision tags: php-8.0.6, php-7.4.19, php-7.4.18, php-7.3.28, php-8.0.5, php-8.0.5RC1, php-7.4.18RC1, php-8.0.4RC1, php-7.4.17RC1, php-8.0.3, php-7.4.16, php-8.0.3RC1, php-7.4.16RC1, php-8.0.2, php-7.4.15, php-7.3.27, php-8.0.2RC1, php-7.4.15RC2, php-7.4.15RC1, php-8.0.1, php-7.4.14, php-7.3.26, php-7.4.14RC1, php-8.0.1RC1, php-7.3.26RC1, php-8.0.0, php-7.3.25, php-7.4.13, php-8.0.0RC5, php-7.4.13RC1, php-8.0.0RC4, php-7.3.25RC1 |
|
#
e169ad3b |
| 03-Nov-2020 |
Alex Dowad |
Consolidate all single-byte encodings in one source file We can squeeze out a lot of duplicated code in this way. |