History log of /php-src/ext/mbstring/mbstring.h (Results 1 – 25 of 161)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 3ab10da7 30-Apr-2023 Alex Dowad

Take order of candidate encodings into account when guessing text encoding

The documentation for mb_detect_encoding says that this function
"Detects the most likely character encoding fo

Take order of candidate encodings into account when guessing text encoding

The documentation for mb_detect_encoding says that this function
"Detects the most likely character encoding for string `string` from an
ordered list of candidates".

Prior to 28b346bc06, mb_detect_encoding did not really attempt to
determine the "most likely" text encoding for the input string. It
would just return the first candidate encoding for which the string was
valid. In 28b346bc06, I amended this function so that it uses heuristics
to try to guess which candidate encoding is "most likely".

However, the caller did not have any way to indicate which candidate
text encoding(s) they consider to be more likely, in case the
heuristics applied are inconclusive. In the language of Bayesian
probability, there was no way for the caller to indicate their 'prior'
assignment of probabilities.

Further, the documentation for mb_detect_encoding also says that the
second parameter `encodings` is "a list of character encodings to try,
in order". The documentation clearly implies that the order of
the `encodings` argument should be significant.

Therefore, amend mb_detect_encoding so that while it still uses
heuristics to guess the most likely text encoding for the input string,
it favors those which are earlier in the list of candidate encodings.

One complication is that many callers of mb_detect_encoding use it
in this way:

mb_detect_encoding($string, mb_list_encodings());

In a majority of cases, this is bad code; mb_detect_encoding will both
be much slower and the results will be less reliable than if a smaller
list of candidates is used. However, since such code already exists and
people are using it in production, we should not unnecessarily break it.
The order of candidate encodings obviously does not express any prior
belief of which candidates are more likely in this case, and treating
it as if it did will degrade the accuracy of the result.

Since mb_list_encodings now returns a single, immutable array on each
call, we can avoid that problem by turning off the new behavior when
we receive the array of encodings returned by mb_list_encodings.
This implementation means that if the user does this:

$a = mb_list_encodings();
mb_detect_encoding($string, $a);

...then the order of candidate encodings will not be considered.
However, if the user explicitly initializes their own array of all
supported legacy text encodings, then the order *will* be considered.

The other functions which also follow this new behavior are:

• mb_convert_variables
• mb_convert_encoding (when multiple candidate input encodings are
listed)

Other places where "detection" (or really "guessing") of text encoding
may be performed include:

• mb_send_mail
• Zend engine, when determining the encoding of a PHP script
• mbstring processing of HTTP request contents, when http_input INI
parameter is set to a list

In these cases, the new logic based on order of candidate encodings
is *not* enabled. It *might* be logical to consider the order of
candidate encodings in some or all of these cases, but I'm not sure if
that is true, so it seems wiser to avoid more behavior changes than is
necessary. Further, ever since the new encoding detection heuristics
were implemented in 28b346bc06, we have not received any complaints of
user code being broken in these areas. So I am reluctant to "fix what
isn't broken".

Well, some might say that applying the new detection heuristics
to mb_send_mail, etc. in 28b346bc06 was "fixing what wasn't broken",
but (cough cough) I don't have any comment on that...

show more ...


# 97e29bed 22-Apr-2023 Alex Dowad

Use shared, immutable array for return value of mb_list_encodings

This will allow us to easily check in other mbstring functions if the
list of all supported encodings, returned by mb_li

Use shared, immutable array for return value of mb_list_encodings

This will allow us to easily check in other mbstring functions if the
list of all supported encodings, returned by mb_list_encodings, is
passed in as input to another function.

Co-authored-by: Ilija Tovilo <ilija.tovilo@me.com>

show more ...


# 6df7557e 02-Apr-2023 Alex Dowad

mb_parse_str, mb_http_input, and mb_convert_variables use fast text conversion code for automatic encoding detection

For mb_parse_str, when mbstring.http_input (INI parameter) is a list of

mb_parse_str, mb_http_input, and mb_convert_variables use fast text conversion code for automatic encoding detection

For mb_parse_str, when mbstring.http_input (INI parameter) is a list of
multiple possible text encodings (which is not the case by default),
this new implementation is about 25% faster.

When mbstring.http_input is a single value, then nothing is changed.
(No automatic encoding detection is done in that case.)

show more ...


# a9a67204 14-Dec-2022 Alex Dowad

Implement mb_output_handler using fast text conversion filters


# 0c0774f5 04-Oct-2022 Alex Dowad

Use fast text conversion filters for mb_strpos, mb_stripos, mb_substr, etc

This boosts the performance of mb_strpos, mb_stripos, mb_strrpos,
mb_strripos, mb_strstr, mb_stristr, mb_strrch

Use fast text conversion filters for mb_strpos, mb_stripos, mb_substr, etc

This boosts the performance of mb_strpos, mb_stripos, mb_strrpos,
mb_strripos, mb_strstr, mb_stristr, mb_strrchr, and mb_strrichr when
used on non-UTF-8 strings. mb_substr is also faster.

With UTF-8 input, there is no appreciable difference in performance for
mb_strpos, mb_stripos, mb_strrpos, etc. This is expected, since the only
real difference here (aside from shorter and simpler code) is that the
new text conversion code is used when converting non-UTF-8 input strings
to UTF-8. (This is done because internally, mb_strpos, etc. work only
on UTF-8 text.)

For ASCII, speed is boosted by 30-65%. For other legacy text encodings,
the degree of performance improvement will depend on how slow the
legacy conversion code was.

One other minor, but notable difference is that strings encoded using
UTF-8 variants from Japanese mobile vendors (SoftBank, KDDI, Docomo)
will not undergo encoding conversion but will be processed "as is". It
is expected that this will result in a large performance boost for
such input strings; but realistically, the number of users who work
with such strings is probably minute.

I was not originally planning to include mb_substr in this commit, but
fuzzing of the reimplemented mb_strstr revealed that mb_substr needed
to be reimplemented, too; using the old mbfl_substr, which was based
on the old text conversion filters, in combination with functions which
use the new text conversion filters caused bugs.

The performance boost for mb_substr varies from 10%-500%, depending
on the encoding and input string used.

show more ...


# d0d83442 05-Oct-2022 Alex Dowad

Cache UTF-8-validity status of strings in GC flags

The PCRE extension is already doing this. The flag is set when a string
is determined to be valid UTF-8, and cleared in
zend_string

Cache UTF-8-validity status of strings in GC flags

The PCRE extension is already doing this. The flag is set when a string
is determined to be valid UTF-8, and cleared in
zend_string_forget_hash_val.

We might as well make good use of it in mbstring as well.

This should result in a negligible slowdown for non-UTF-8 strings,
bad UTF-8 strings, and good UTF-8 strings which are checked only once.
However, when microbenchmarking this change using a variety of text
encodings and string lengths, I found that in most of these cases,
the 'new' code was a few percent faster. In a couple of cases, the 'old'
code was a few percent faster. This was not a result of sampling error,
since I could reproduce these test results repeatedly, and even when
running a large number of iterations. Both the new and old code
were compiled with -O3 -march=native. My (unproved) hypothesis is that
although the new code appears to only add one more conditional branch,
the compiler may emit slightly different code from before (perhaps
with different register allocation and so on), and this may cause some
cases to run slightly faster and others to run slightly slower. I have
not disassembled the old and new binaries to see if an examination of
the emitted assembly code would support this hypothesis.

For good UTF-8 strings which are checked repeatedly, the speedup is
about 40% even for strings 1-5 bytes in length. For ~100 byte strings,
it is ~700%, and for ~10000 byte strings, it is ~80000%.

I tried fuzzing MBString's php_mb_check_encoding function and
pcre2lib's valid_utf function to see if I could find any cases where
their output would be different. After running the fuzzer for a couple
of minutes, it had tried more than 1 million test cases without finding
any where the output was different. Therefore, it appears that
MBString's UTF-8 validation is compatible with PCRE's.

show more ...


# 3ce888a8 04-Oct-2022 Alex Dowad

Use uint32_t for 'illegal_substchar' codepoint in mbstring

This value is a wchar, so the best type for it is uint32_t.


Revision tags: php-8.2.0RC1, php-8.1.10, php-8.0.23, php-8.0.23RC1, php-8.1.10RC1, php-8.2.0beta3, php-8.2.0beta2, php-8.1.9, php-8.0.22, php-8.1.9RC1, php-8.2.0beta1, php-8.0.22RC1, php-8.0.21, php-8.1.8, php-8.2.0alpha3, php-8.1.8RC1, php-8.2.0alpha2, php-8.0.21RC1, php-8.0.20, php-8.1.7, php-8.2.0alpha1, php-7.4.30
# 49202116 28-May-2022 Alex Dowad

php_mb_convert_encoding{,_ex} returns zend_string

That's what all existing callers want anyways. This avoids 2
unnecessary copies of the converted string.

Revision tags: php-8.1.7RC1, php-8.0.20RC1, php-8.1.6, php-8.0.19, php-8.1.6RC1, php-8.0.19RC1, php-8.0.18, php-8.1.5, php-7.4.29, php-8.1.5RC1, php-8.0.18RC1, php-8.1.4, php-8.0.17, php-8.1.4RC1, php-8.0.17RC1, php-8.1.3, php-8.0.16, php-7.4.28, php-8.1.3RC1, php-8.0.16RC1, php-8.1.2, php-8.0.15, php-8.1.2RC1, php-8.0.15RC1, php-8.0.14, php-8.1.1, php-7.4.27, php-8.1.1RC1, php-8.0.14RC1, php-7.4.27RC1, php-8.1.0, php-8.0.13, php-7.4.26, php-7.3.33, php-8.1.0RC6, php-7.4.26RC1, php-8.0.13RC1, php-8.1.0RC5, php-7.3.32, php-7.4.25, php-8.0.12, php-8.1.0RC4, php-8.0.12RC1, php-7.4.25RC1, php-8.1.0RC3, php-8.0.11, php-7.4.24, php-7.3.31, php-8.1.0RC2, php-7.4.24RC1, php-8.0.11RC1, php-8.1.0RC1, php-7.4.23, php-8.0.10, php-7.3.30, php-8.1.0beta3, php-8.0.10RC1, php-7.4.23RC1, php-8.1.0beta2, php-8.0.9, php-7.4.22, php-8.1.0beta1, php-7.4.22RC1, php-8.0.9RC1, php-8.1.0alpha3, php-7.4.21, php-7.3.29, php-8.0.8, php-8.1.0alpha2, php-7.4.21RC1, php-8.0.8RC1, php-8.1.0alpha1, php-8.0.7, php-7.4.20, php-8.0.7RC1, php-7.4.20RC1, php-8.0.6, php-7.4.19, php-7.4.18, php-7.3.28, php-8.0.5, php-8.0.5RC1, php-7.4.18RC1, php-8.0.4RC1, php-7.4.17RC1, php-8.0.3, php-7.4.16, php-8.0.3RC1, php-7.4.16RC1, php-8.0.2, php-7.4.15, php-7.3.27, php-8.0.2RC1, php-7.4.15RC2, php-7.4.15RC1, php-8.0.1, php-7.4.14, php-7.3.26, php-7.4.14RC1, php-8.0.1RC1, php-7.3.26RC1, php-8.0.0, php-7.3.25, php-7.4.13, php-8.0.0RC5, php-7.4.13RC1, php-8.0.0RC4, php-7.3.25RC1, php-7.4.12, php-8.0.0RC3, php-7.3.24
# d3f56e5a 18-Oct-2020 Alex Dowad

Rename php_mb_mbchar_bytes_ex to php_mb_mbchar_bytes

...And remove the original php_mb_mbchar_bytes, which was not being
used.

# abf83e50 18-Oct-2020 Alex Dowad

Rename php_mb_safe_strrchr_ex to php_mb_safe_strrchr

...And remove the original php_mb_safe_strrchr, which was not being
used anywhere.

# 01b3fc03 06-May-2021 KsaR

Update http->https in license (#6945)

1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https.
2. Update few license 3.0 to 3.01 as

Update http->https in license (#6945)

1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https.
2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier".
3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted.
4. fixed indentation in some files before |

show more ...

# 3e01f5af 15-Jan-2021 Nikita Popov

Replace zend_bool uses with bool

We're starting to see a mix between uses of zend_bool and bool.
Replace all usages with the standard bool type everywhere.

Of course, zend_bool

Replace zend_bool uses with bool

We're starting to see a mix between uses of zend_bool and bool.
Replace all usages with the standard bool type everywhere.

Of course, zend_bool is retained as an alias.

show more ...

Revision tags: php-8.0.0RC2, php-7.4.12RC1, php-7.3.24RC1, php-7.2.34, php-8.0.0rc1, php-7.4.11, php-7.3.23, php-8.0.0beta4, php-7.4.11RC1, php-7.3.23RC1, php-8.0.0beta3, php-7.4.10, php-7.3.22, php-8.0.0beta2, php-7.3.22RC1, php-7.4.10RC1, php-8.0.0beta1, php-7.4.9, php-7.2.33, php-7.3.21, php-8.0.0alpha3, php-7.4.9RC1, php-7.3.21RC1, php-7.4.8, php-7.2.32, php-8.0.0alpha2, php-7.3.20
# 7eddcabe 05-Jul-2020 Alex Dowad

Don't guard mbstring code with #ifdef HAVE_MBSTRING

This is just a very silly feature of mbstring -- you can compile the source files with
HAVE_MBSTRING undefined, and it will all just c

Don't guard mbstring code with #ifdef HAVE_MBSTRING

This is just a very silly feature of mbstring -- you can compile the source files with
HAVE_MBSTRING undefined, and it will all just compile to (almost) nothing. What is the
use of this? Why compile the source files and link against them if you don't want the
mbstring extension? It doesn't make any kind of sense.

show more ...

# 62317d59 04-Jul-2020 Alex Dowad

Remove redundant includes from mbstring (and make sure correct config.h is used)

Very interesting... it turns out that when Valgrind support was enabled,
`#include "config.h"` from withi

Remove redundant includes from mbstring (and make sure correct config.h is used)

Very interesting... it turns out that when Valgrind support was enabled,
`#include "config.h"` from within mbstring was actually including the file "config.h"
from Valgrind, and not the one from mbstring!!

This is because -I/usr/include/valgrind was added to the compiler invocation _before_
-Iext/mbstring/libmbfl.

Make sure we actually include the file which was intended.

show more ...

Revision tags: php-8.0.0alpha1, php-7.4.8RC1, php-7.3.20RC1, php-7.4.7, php-7.3.19, php-7.4.7RC1, php-7.3.19RC1
# 68164f40 12-May-2020 George Peter Banyard

Fix [-Wundef] warning in MBString extension

Revision tags: php-7.4.6, php-7.2.31, php-7.4.6RC1, php-7.3.18RC1, php-7.2.30, php-7.4.5, php-7.3.17
# 21cfa03f 05-Apr-2020 Máté Kocsis

Generate function entries for another batch of extensions

Closes GH-5352

# 11f0e1d1 31-Mar-2020 Nikita Popov

Move encoding fetching out of php_mb_convert_encoding()

Revision tags: php-7.4.5RC1, php-7.3.17RC1
# 7cea789c 30-Mar-2020 Nikita Popov

Parse mb_convert_encoding() encodings only once

Instead of re-parsing them for every converted value. Also reuse
the generic parse_array() helper.

# ed850f27 30-Mar-2020 Nikita Popov

Move encoding fetching outside php_mb_stripos()

Revision tags: php-7.3.18, php-7.4.4, php-7.2.29, php-7.3.16, php-7.4.4RC1, php-7.3.16RC1, php-7.4.3, php-7.2.28, php-7.3.15RC1, php-7.4.3RC1
# 7db3a518 28-Jan-2020 Nikita Popov

Only fetch to_encoding once in mb_convert_encoding()

Instead of doing it on every conversion. This is both more efficient
and avoids generating multiple warnings.

Revision tags: php-7.3.15, php-7.2.27, php-7.4.2, php-7.3.14, php-7.3.14RC1, php-7.4.2RC1, php-7.4.1, php-7.2.26, php-7.3.13, php-7.4.1RC1, php-7.3.13RC1, php-7.2.26RC1, php-7.4.0, php-7.2.25, php-7.3.12, php-7.4.0RC6, php-7.3.12RC1, php-7.2.25RC1, php-7.4.0RC5, php-7.1.33, php-7.2.24, php-7.3.11, php-7.4.0RC4, php-7.3.11RC1, php-7.2.24RC1
# 21e631e4 06-Oct-2019 Nikita Popov

Merge branch 'PHP-7.4'


# 6623e7ac 02-Oct-2019 Nikita Popov

Add support for mbstring.regex_retry_limit

This is very similar to the existing mbstring.regex_stack_limit,
but for backtracking. The default value matches pcre.backtrack_limit.
Only

Add support for mbstring.regex_retry_limit

This is very similar to the existing mbstring.regex_stack_limit,
but for backtracking. The default value matches pcre.backtrack_limit.
Only used on libonig >= 2.8.0.

show more ...

Revision tags: php-7.4.0RC3
# 5d6e923d 24-Sep-2019 Gabriel Caruso

Remove mention of PHP major version in Copyright headers

Closes GH-4732.

Revision tags: php-7.2.23, php-7.3.10, php-7.4.0RC2, php-7.2.23RC1, php-7.3.10RC1, php-7.4.0RC1, php-7.1.32, php-7.2.22, php-7.3.9, php-7.4.0beta4, php-7.2.22RC1, php-7.3.9RC1, php-7.4.0beta2, php-7.1.31, php-7.2.21, php-7.3.8, php-7.4.0beta1, php-7.2.21RC1, php-7.3.8RC1, php-7.4.0alpha3, php-7.3.7, php-7.2.20, php-7.4.0alpha2, php-7.3.7RC3, php-7.3.7RC2, php-7.2.20RC2, php-7.4.0alpha1, php-7.3.7RC1, php-7.2.20RC1, php-7.2.19, php-7.3.6, php-7.1.30, php-7.2.19RC1, php-7.3.6RC1, php-7.1.29, php-7.2.18, php-7.3.5
# 1d53d6df 17-Apr-2019 Nikita Popov

Merge branch 'PHP-7.4'


# f73f190c 16-Apr-2019 Nikita Popov

Fix internal_encoding fallback in mbstring

By introducing a hook that is called whenever one of
internal_encoding / input_encoding / output_encoding changes, so
that mbstring can adj

Fix internal_encoding fallback in mbstring

By introducing a hook that is called whenever one of
internal_encoding / input_encoding / output_encoding changes, so
that mbstring can adjust it's internal state.

This also makes internal_encoding work with zend multibyte.

show more ...

1234567