#
fb9bf5b6 |
| 03-Aug-2017 |
Nikita Popov |
Revert/fix substitution character fallback

The introduced checks were not correct in two respects:
* It was checked whether the source encoding of the string matches the internal encoding, while the actually relevant encoding is the *target* encoding.
* Even if the correct encoding is used, the checks are still too conservative. Just because something is not a "Unicode-encoding" does not mean that it does not map any non-ASCII characters.

I've reverted the added checks and instead adjusted mbfl_convert to first try to use the provided substitution character and, if that fails, perform the fallback to '?' at that point. This means that any codepoint mapped in the target encoding should now be correctly supported and anything else should fall back to '?'.
|
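The two-stage fallback described in fb9bf5b6 can be illustrated with a minimal Python sketch. This is not the mbfl_convert code: `convert_with_fallback` is a hypothetical helper, Python codecs stand in for libmbfl's conversion filters, and the per-character loop assumes a stateless target encoding such as ASCII or Latin-1.

```python
def convert_with_fallback(text: str, target: str, substitute: str = "?") -> bytes:
    """Encode text into target, replacing unmappable characters.

    Hypothetical helper for illustration only; assumes a stateless
    target encoding so that characters can be encoded one at a time.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(target)
        except UnicodeEncodeError:
            try:
                # First try the caller-provided substitution character.
                out += substitute.encode(target)
            except UnicodeEncodeError:
                # The substitute is not mapped in the target encoding
                # either, so fall back to '?' at this point.
                out += b"?"
    return bytes(out)
```

The check happens against the target encoding at conversion time, so a substitute such as "¿" is used when converting to Latin-1 but silently degrades to '?' when converting to ASCII.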
#
a8a9e93e |
| 03-Aug-2017 |
Nikita Popov |
Revert/fix mb_substitute_character() codepoint checks

The introduced checks did not treat "non-Unicode" encodings correctly, because they treated the passed integer as encoded in the internal encoding in that case, while in actuality the substitute character is always a Unicode codepoint.

Additionally, checking the codepoint against the internal encoding is not correct in any case, because the substitution character must be mapped in the *target* encoding of the conversion, which does not necessarily coincide with the internal encoding (the internal encoding is the default *source* encoding, not *target* encoding).

This reverts the checks back to simple range checks, but in a way that still resolves #69079: characters outside the Basic Multilingual Plane are now accepted and surrogate codepoints are rejected. A distinction between UTF-8 and non-UTF-8 encodings is not made for surrogate checks (as in the original patch), as surrogates are always illegal on their own. Specifying a surrogate as substitution character would only make sense if you could specify a substitution string with more than one character, but we do not support that.
|
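The "simple range checks" described in a8a9e93e amount to roughly the following. This is a sketch in Python and not the C code from the commit; the function name `is_valid_substitute_codepoint` is made up for illustration.

```python
def is_valid_substitute_codepoint(cp: int) -> bool:
    """Range check for a substitution character given as a Unicode codepoint.

    Hypothetical helper: accepts the full Unicode range, including
    codepoints outside the Basic Multilingual Plane, and rejects surrogates.
    """
    if cp < 0 or cp > 0x10FFFF:
        return False  # outside the Unicode codespace
    if 0xD800 <= cp <= 0xDFFF:
        return False  # lone surrogates are never valid on their own
    return True
```

Note that the check is independent of any encoding. Whether the codepoint can actually be represented is decided later, against the target encoding of the conversion.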
#
2cc1cbf2 |
| 28-Jul-2017 |
Fabien Villepinte |
Fix Bug #75001: Wrong reflection on mb_eregi_replace
|
#
582a65b0 |
| 27-Jul-2017 |
Nikita Popov |
Implement full case mapping

Implement full case mapping according to SpecialCasing.txt and also full case folding according to CaseFolding.txt (F). There are a number of caveats:
* Only language-agnostic and unconditional full case mapping is implemented. The only language-agnostic conditional case mapping rule relates to Greek sigma in final position (Final_Sigma). Correctly handling this requires both arbitrary lookahead and lookbehind, which would require some larger changes to how the case mapping is implemented. This is a possible future extension.
* The only language-specific handling that is implemented is for Turkish dotted/undotted Is, if the ISO-8859-9 encoding is used. This matches the previous behavior and makes sure that no codepoints not supported by the encoding are produced. A future extension would be to also handle the Turkish mappings specified by SpecialCasing.txt based on the mbfl internal language.
* Full case folding is implemented, but case-insensitive mb_* operations continue to use simple case folding. The reason is that full case folding of the haystack string may change the position at which a match occurred. This would have to be mapped back into the position in the original string.
* mb_convert_case() exposes both the full and the simple case mapping / folding, where full is the default. The constants are:
  * MB_CASE_LOWER (used by mb_strtolower)
  * MB_CASE_UPPER (used by mb_strtoupper)
  * MB_CASE_TITLE
  * MB_CASE_FOLD
  * MB_CASE_LOWER_SIMPLE
  * MB_CASE_UPPER_SIMPLE
  * MB_CASE_TITLE_SIMPLE
  * MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)
|
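The caveat in 582a65b0 about match positions can be seen with any full case folding implementation. The sketch below uses Python's built-in `str.upper()` and `str.casefold()`, which apply full (possibly length-changing) mappings; it illustrates the general effect and says nothing about how mbstring implements it.

```python
word = "Straße"

# Full case mapping may change the length: ß uppercases to "SS".
upper_full = word.upper()

# Full case folding expands ß to "ss" as well.
folded = word.casefold()

# The same character now sits at a different offset in the folded string,
# so a match found there would have to be mapped back to the original.
pos_original = word.index("e")
pos_folded = folded.index("e")
```

Here `upper_full` is "STRASSE", `folded` is "strasse", and the "e" moves from offset 5 in the original to offset 6 in the folded string. Simple case folding maps each codepoint to exactly one codepoint, so offsets stay aligned, which is why the commit keeps it for case-insensitive operations.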
#
9ac7c1e7 |
| 27-Jul-2017 |
Nikita Popov |
Use case-folding for case insensitive comparisons

Instead of using lowercasing.
|
#
f56b0afe |
| 26-Jul-2017 |
Nikita Popov |
Avoid some unnecessary mbfl_strlen() calculations
|
#
13a26290 |
| 25-Jul-2017 |
Anatol Belski |
size_t fixes
|
#
445e13b1 |
| 23-Jul-2017 |
Nikita Popov |
Add MBFL_SUBSTR_TO_END mode to mbfl_substr

This takes the substr from the offset to the end of the string. This avoids pointless searching for the end position and also saves us a length calculation in the strstr family of functions.
|
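The saving described in 445e13b1 can be sketched for a variable-width encoding such as UTF-8. The Python below is an illustration only: `utf8_offset` and `substr` are hypothetical helpers, and `length=None` stands in for the MBFL_SUBSTR_TO_END mode. It assumes well-formed UTF-8 input.

```python
def utf8_offset(data: bytes, chars: int) -> int:
    """Return the byte offset reached after skipping `chars` characters."""
    i = 0
    while chars > 0 and i < len(data):
        i += 1
        # Skip continuation bytes (10xxxxxx) belonging to the same character.
        while i < len(data) and (data[i] & 0xC0) == 0x80:
            i += 1
        chars -= 1
    return i


def substr(data: bytes, start: int, length=None) -> bytes:
    """Character-based substring of UTF-8 bytes (hypothetical helper)."""
    begin = utf8_offset(data, start)
    if length is None:
        # "To end" mode: no scan for the end position and no need to
        # know the character length of the string beforehand.
        return data[begin:]
    end = begin + utf8_offset(data[begin:], length)
    return data[begin:end]
```

Without a "to end" mode, a caller that wants the tail of the string first has to compute the character length and then have the substring routine walk to that end position again.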
#
bff11c38 |
| 23-Jul-2017 |
Nikita Popov |
Remove more obsolete length checks
|
#
78944bdf |
| 23-Jul-2017 |
Anatol Belski |
remove cast
|
#
6809be20 |
| 23-Jul-2017 |
Anatol Belski |
fix warnings and datatype ident
|
#
bd63c0f5 |
| 23-Jul-2017 |
Nikita Popov |
Fix bug #73528
|
#
80463579 |
| 23-Jul-2017 |
Nikita Popov |
Remove confusing null checks in mb_send_mail

These are required parameters, they cannot be missing.
|
#
9af5b7f3 |
| 23-Jul-2017 |
Nikita Popov |
Fix use after free in mb_send_mail
|
#
4fbd7ccb |
| 22-Jul-2017 |
Anatol Belski |
touch yet more places for datatypes
|
#
61784bcb |
| 22-Jul-2017 |
Anatol Belski |
sync libmbfl allocator with the size_t changes
|
#
e0825ec6 |
| 22-Jul-2017 |
Anatol Belski |
Mitigation for ssize_t issue in 22a5f554a8766d63fd2c2ce91a90ebacb13c0f6a and some more
|
#
1388751f |
| 20-Jul-2017 |
Nikita Popov |
Use fast zpp in mb_strlen()

For short strings this function is now sufficiently fast for zpp to be a bottleneck.
|
#
b3c1d9d1 |
| 20-Jul-2017 |
Nikita Popov |
Directly use encodings instead of no_encoding in libmbfl

In particular strings now store encoding rather than the no_encoding. I've also pruned out libmbfl APIs that existed in two forms, one using no_encoding and the other using encoding. We were not actually using any of the former.
|
#
77cb7bd8 |
| 20-Jul-2017 |
Nikita Popov |
Free last_used_encoding_name in RSHUTDOWN

efree() cannot be used in GSHUTDOWN
|
#
ba383b82 |
| 20-Jul-2017 |
Nikita Popov |
Add basic mbstring encoding cache

Store the last used encoding and compare against it. It's quite likely that an application is going to be using the same encoding again and again. The actual mbfl_name2encoding() function could also be optimized to use a hash lookup rather than a linear scan, but we don't have a hashtable implementation in libmbfl...
|
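The caching idea in ba383b82 is a one-entry cache placed in front of a linear scan. The following Python sketch shows the pattern under stated assumptions: `EncodingLookup`, its table layout, and the `scans` counter are invented for illustration and do not mirror the mbstring data structures.

```python
class EncodingLookup:
    """One-entry cache in front of a linear name lookup (illustrative only)."""

    def __init__(self, table):
        self._table = table          # list of (name, encoding) pairs
        self._last_name = None       # name used in the previous lookup
        self._last_encoding = None   # encoding resolved for that name
        self.scans = 0               # number of full scans, for demonstration

    def get(self, name):
        # Fast path: the same name as last time, so no scan is needed.
        if self._last_name is not None and name == self._last_name:
            return self._last_encoding
        # Slow path: linear scan over all known encodings.
        self.scans += 1
        for candidate, encoding in self._table:
            if candidate.lower() == name.lower():
                self._last_name, self._last_encoding = name, encoding
                return encoding
        return None


lookup = EncodingLookup([("UTF-8", "utf8-encoding"), ("SJIS", "sjis-encoding")])
first = lookup.get("UTF-8")
second = lookup.get("UTF-8")   # served from the cache
scans_after_repeat = lookup.scans
```

A single cached entry is enough when an application passes the same encoding name on nearly every call. A hash table would help for alternating names, which the commit message notes is not available in libmbfl.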
#
264387e3 |
| 20-Jul-2017 |
Nikita Popov |
Add php_mb_get_no_encoding() helper function
|
#
adaea775 |
| 20-Jul-2017 |
Nikita Popov |
Switch libmbfl to use size_t

Switch mbfl_string and related structures to use size_t lengths. Quite likely that I broke some things along the way...
|
#
9c73be89 |
| 19-Jul-2017 |
Nikita Popov |
Directly accept encoding in php_unicode_convert_case()

As a side-effect mb_strtolower() and mb_strtoupper() now correctly handle a NULL encoding parameter by using the internal encoding. This is what caused the two test changes.
|
#
4128746b |
| 19-Jul-2017 |
Nikita Popov |
Add php_mb_get_encoding() convenience function
|