common_codepoints.txt - OpenGrok history log for /php-src/ext/mbstring/common

Revision	Date	Author	Comments
# 81faab92	22-Aug-2023	Alex Dowad	Improve mb_detect_encoding accuracy for text containing vowels with macrons Among other world languages, the Māori language commonly uses vowels with macrons.
# f40c3fca	29-Dec-2022	Alex Dowad	Improve mb_detect_encoding's recognition of Turkish text Add 4 codepoints commonly used to write Turkish text to our table of 'commonly used' Unicode codepoints. These are: • U+011F LATIN SMALL LETTER G WITH BREVE • U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE • U+0131 LATIN SMALL LETTER DOTLESS I • U+015F LATIN SMALL LETTER S WITH CEDILLA
Revision tags: php-8.2.0RC1, php-8.1.10, php-8.0.23, php-8.0.23RC1, php-8.1.10RC1, php-8.2.0beta3, php-8.2.0beta2, php-8.1.9, php-8.0.22, php-8.1.9RC1, php-8.2.0beta1, php-8.0.22RC1, php-8.0.21, php-8.1.8, php-8.2.0alpha3, php-8.1.8RC1, php-8.2.0alpha2, php-8.0.21RC1, php-8.0.20, php-8.1.7, php-8.2.0alpha1, php-7.4.30
# 58d0aad7	25-May-2022	Alex Dowad	mb_detect_encoding recognizes all letters in Hungarian alphabet /php-src/ext/mbstring/common_codepoints.txt
Revision tags: php-8.1.7RC1, php-8.0.20RC1
# 6a4b6d23	24-May-2022	Alex Dowad	mb_detect_encoding recognizes all letters in Czech alphabet /php-src/ext/mbstring/common_codepoints.txt
Revision tags: php-8.1.6, php-8.0.19, php-8.1.6RC1, php-8.0.19RC1
# 9bb97ee8	25-Apr-2022	Alex Dowad	Fix mb_detect_encoding's recognition of Slavic names Thanks to Côme Chilliet for reporting that mb_detect_encoding was not detecting the desired text encoding for strings containing š or Ž. These characters are used in Czech, Serbian, Croatian, Bosnian, Macedonian, etc. names. /php-src/ext/mbstring/common_codepoints.txt
Revision tags: php-8.0.18, php-8.1.5, php-7.4.29, php-8.1.5RC1, php-8.0.18RC1, php-8.1.4, php-8.0.17, php-8.1.4RC1, php-8.0.17RC1, php-8.1.3, php-8.0.16, php-7.4.28, php-8.1.3RC1, php-8.0.16RC1, php-8.1.2, php-8.0.15, php-8.1.2RC1, php-8.0.15RC1, php-8.0.14, php-8.1.1, php-7.4.27, php-8.1.1RC1, php-8.0.14RC1, php-7.4.27RC1
# d573054e	25-Nov-2021	Alex Dowad	Enable encoding detection for Polish text Previously, some accented letters commonly used to write Polish text were counted as 'rare' codepoints. Treat them as 'common' instead. Thanks to Alec for pointing this out. /php-src/ext/mbstring/common_codepoints.txt
Revision tags: php-8.1.0, php-8.0.13, php-7.4.26, php-7.3.33, php-8.1.0RC6, php-7.4.26RC1, php-8.0.13RC1, php-8.1.0RC5, php-7.3.32, php-7.4.25, php-8.0.12, php-8.1.0RC4
# 28b346bc	09-Oct-2021	Alex Dowad	Improve detection accuracy of mb_detect_encoding Originally, `mb_detect_encoding` essentially just checked all candidate encodings to see which ones the input string was valid in. However, it was only able to do this for a limited few of all the text encodings which are officially supported by mbstring. In 3e7acf901d, I modified it so it could 'detect' any text encoding supported by mbstring. While this is arguably an improvement, if the only text encodings one is interested in are those which `mb_detect_encoding` could originally handle, the old `mb_detect_encoding` may have been preferable. Because the new one has more possible encodings which it can guess, it also has more chances to get the answer wrong. This commit adjusts the detection heuristics to provide accurate detection in a wider variety of scenarios. While the previous detection code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with UTF-16LE, the adjusted code is extremely accurate in those cases. Detection for Chinese text in Chinese encodings like GB18030 or BIG5 and for Japanese text in Japanese encodings like EUC-JP or SJIS is greatly improved. Detection of UTF-7 is also greatly improved. An 8KB table, with one bit for each codepoint from U+0000 up to U+FFFF, is used to achieve this. One significant constraint is that the heuristics are completely based on looking at each codepoint in a string in isolation, treating some codepoints as 'likely' and others as 'unlikely'. It might still be possible to achieve great gains in detection accuracy by looking at sequences of codepoints rather than individual codepoints. However, this might require huge tables. Further, we might need a huge corpus of text in various languages to derive those tables.
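The 8KB figure follows from the arithmetic: 65,536 codepoints at one bit each is 8,192 bytes. A minimal sketch of such a bitmap is shown here, written in Python for illustration only. The names `table`, `mark_common`, `is_common`, and `count_rare` are hypothetical, the bit layout is an assumption, and the four entries marked are just the Turkish codepoints listed in f40c3fca rather than the real contents of common_codepoints.txt.

```python
# Hypothetical sketch of a one-bit-per-codepoint table for U+0000..U+FFFF.
# This is not the mbstring implementation; the layout is assumed.

TABLE_BITS = 0x10000                 # 65,536 codepoints, U+0000 through U+FFFF
table = bytearray(TABLE_BITS // 8)   # 8,192 bytes = 8 KB, all bits cleared


def mark_common(cp: int) -> None:
    """Set the bit for codepoint cp, flagging it as 'likely'."""
    table[cp >> 3] |= 1 << (cp & 7)


def is_common(cp: int) -> bool:
    """Return True if cp is flagged as 'likely'.

    Codepoints above U+FFFF fall outside the table; treating them as
    'unlikely' is an assumption made for this sketch.
    """
    if cp > 0xFFFF:
        return False
    return bool(table[cp >> 3] & (1 << (cp & 7)))


# Illustrative entries only: the four Turkish codepoints named in f40c3fca.
for cp in (0x011F, 0x0130, 0x0131, 0x015F):
    mark_common(cp)


def count_rare(text: str) -> int:
    """Count codepoints not flagged as 'likely', each judged in isolation."""
    return sum(1 for ch in text if not is_common(ord(ch)))
```

Because each codepoint is looked up on its own, a function like `count_rare` cannot see sequences of codepoints, which is the constraint the commit message describes.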
Accuracy is still dismal when trying to distinguish single-byte encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is because the valid bytes in these encodings are basically all the same, and all valid bytes decode to 'likely' codepoints, so our method of detection (which is based on rating codepoints as likely or unlikely) cannot tell any difference between the candidates at all. It just selects the first encoding in the provided list of candidates. Speaking of which, if one wants to get good results from `mb_detect_encoding`, it is important to order the list of candidate encodings according to your prior belief of which are more likely to be correct. When the function cannot tell any difference between two candidates, it returns whichever appeared earlier in the array. /php-src/ext/mbstring/common_codepoints.txt
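The single-byte ambiguity and the tie-breaking rule can be illustrated with a small sketch. It uses Python's built-in codecs rather than mbstring, and it models only validity checking plus "earliest candidate wins", not the actual scoring heuristics. The helper names `valid_candidates` and `first_valid` are hypothetical.

```python
# Hypothetical sketch: why candidate order decides the result for
# single-byte encodings. Uses Python codecs, not mbstring.

def valid_candidates(data, candidates):
    """Return the candidates under which data decodes without error,
    preserving the order in which they were given."""
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok


def first_valid(data, candidates):
    """Pick the earliest candidate that decodes cleanly, or None.

    This stands in for the case where every candidate scores the same
    and the earliest entry in the list is returned.
    """
    ok = valid_candidates(data, candidates)
    return ok[0] if ok else None


# Russian text encoded as KOI8-R: every byte is also a valid byte in
# ISO-8859-1 and ISO-8859-2, so validity alone cannot separate them.
sample = "Привет".encode("koi8-r")

print(valid_candidates(sample, ["iso-8859-1", "iso-8859-2", "koi8-r"]))
print(first_valid(sample, ["iso-8859-1", "iso-8859-2", "koi8-r"]))
print(first_valid(sample, ["koi8-r", "iso-8859-1", "iso-8859-2"]))
```

In this sketch the same bytes yield a different answer purely because the candidate list was reordered, which is why the list should reflect one's prior belief about which encodings are most likely.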