History log of /PHP-8.2/ext/mbstring/mbstring.c (Results 1 – 25 of 832)
# 6fc8d014 21-Mar-2023 pakutoma

Fix phpGH-10648: add check function pointer into mbfl_encoding

Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
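
The split between conversion and validation described above can be sketched roughly as follows. This is a toy illustration only: `toy_encoding`, `convert_fn`, `check_fn` and the `toy_ascii_*` helpers are hypothetical names, not the real layout of struct mbfl_encoding or the real mb_check_fn signature. The toy conversion replaces bytes outside the 7-bit range and succeeds, while the separate check rejects the same input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct mbfl_encoding. */
typedef struct toy_encoding {
    const char *name;
    /* Lenient conversion: writes len bytes to out, returns bytes written. */
    size_t (*convert_fn)(const unsigned char *in, size_t len, unsigned char *out);
    /* Strict validation, kept separate from the conversion logic. */
    bool (*check_fn)(const unsigned char *in, size_t len);
} toy_encoding;

/* Conversion tolerates bad bytes by substituting '?'. */
static size_t toy_ascii_convert(const unsigned char *in, size_t len, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        out[i] = (in[i] < 0x80) ? in[i] : (unsigned char)'?';
    }
    return len;
}

/* Validation fails on the first byte outside the 7-bit range. */
static bool toy_ascii_check(const unsigned char *in, size_t len)
{
    assert(in != NULL);
    for (size_t i = 0; i < len; i++) {
        if (in[i] >= 0x80) {
            return false;
        }
    }
    return true;
}

static const toy_encoding toy_ascii = { "TOY-ASCII", toy_ascii_convert, toy_ascii_check };
```

For the three bytes 'a', 0xE9, 'b', calling `toy_ascii.convert_fn` yields "a?b", while `toy_ascii.check_fn` returns false: conversion succeeds and validation fails, as the commit message requires.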

# 805dafdd 07-Mar-2023 Ilija Tovilo

Merge branch 'PHP-8.1' into PHP-8.2

* PHP-8.1:
Enable GitHub actions cancel-in-progress for PRs
mb_encode_mimeheader does not crash if provided encoding has no MIME name set


# 7c1ee5a0 03-Mar-2023 Alex Dowad

mb_encode_mimeheader does not crash if provided encoding has no MIME name set

# 73f9ffc5 20-Feb-2023 George Peter Banyard

Merge branch 'PHP-8.1' into PHP-8.2

* PHP-8.1:
Fix GH-10627: mb_convert_encoding crashes PHP on Windows
ext/mbstring: fix new_value length check


# ed0c0df3 19-Feb-2023 Niels Dossche <7771979+nielsdos@users.noreply.github.com>

Fix GH-10627: mb_convert_encoding crashes PHP on Windows

Fixes GH-10627

The php_mb_convert_encoding() function can return NULL on error, but
this case was not handled, which led to a NULL pointer dereference and
hence a crash.

Closes GH-10628

Signed-off-by: George Peter Banyard <girgias@php.net>

# 243865ae 07-Feb-2023 Max Kellermann

ext/mbstring: fix new_value length check

Commit 8bbd0952e5bba88 added a check rejecting empty strings; in the
merge commit 379d9a1cfc6462, however, it was changed to a NULL check,
one that did not make sense because ZSTR_VAL() is guaranteed to never
be NULL; the length check was accidentally removed by that merge commit.

This bug was found by GCC's -Waddress warning:

ext/mbstring/mbstring.c:748:27: warning: the comparison will always evaluate as ‘true’ for the address of ‘val’ will never be NULL [-Waddress]
748 | if (!new_value || !ZSTR_VAL(new_value)) {
| ^

Closes GH-10532

Signed-off-by: George Peter Banyard <girgias@php.net>
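
Why the NULL check could never fire, and what a length check looks like instead, can be sketched with a toy string type. `toy_string`, `TOY_STR_VAL`, `TOY_STR_LEN` and `toy_value_is_usable` are hypothetical stand-ins chosen for illustration; they are not the zend_string API. The point is that the character data lives inside the struct itself, so its address is non-NULL whenever the struct pointer is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string whose bytes are stored inline. */
typedef struct toy_string {
    size_t len;
    char val[]; /* flexible array member: part of the same allocation */
} toy_string;

#define TOY_STR_VAL(s) ((s)->val)
#define TOY_STR_LEN(s) ((s)->len)

/* Allocates a toy_string holding a copy of src; returns NULL on failure. */
static toy_string *toy_string_new(const char *src)
{
    assert(src != NULL);
    size_t len = strlen(src);
    toy_string *s = malloc(sizeof(toy_string) + len + 1);
    if (s == NULL) {
        return NULL;
    }
    s->len = len;
    memcpy(TOY_STR_VAL(s), src, len + 1);
    return s;
}

/* Intended check: reject a missing value or an empty one.
 * Testing TOY_STR_VAL(new_value) against NULL would always pass. */
static bool toy_value_is_usable(const toy_string *new_value)
{
    return new_value != NULL && TOY_STR_LEN(new_value) > 0;
}
```

With this shape, an empty value is rejected by the length comparison, which is the behavior the original commit intended before the merge replaced it.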

# cc931af3 30-Dec-2022 Jakub Zelenka

Fix GH-8086: Introduce mail.mixed_lf_and_crlf INI

When this INI option is enabled, it reverts the line separator for
headers and message to LF which was a non conformant behavior in PHP 7.
It is done because some non conformant MTAs fail to parse CRLF line
separator for headers and body.

This is used for mail and mb_send_mail functions.

# f7a19181 28-Dec-2022 Alex Dowad

Allow 'h' and 'k' flags to be combined for mb_convert_kana

The 'h' flag makes mb_convert_kana convert zenkaku hiragana to hankaku
katakana; 'k' makes it convert zenkaku katakana to hankaku katakana.

When working on the implementation of mb_convert_kana, I added some
additional checks to catch combinations of flags which do not make
sense; but there is no conflict between 'h' and 'k' (they control
conversions for two disjoint ranges of codepoints) and this combination
should not have been restricted.

Thanks to the GitHub user 'akira345' for reporting this problem.

Closes GH-10174.

# 8df51555 03-Aug-2022 Alex Dowad

Remove unused 'to_language' and 'from_language' struct fields

# d013d949 08-Aug-2022 Christoph M. Becker

Fix GH-9248: Segmentation fault in mb_strimwidth()

We need to initialize the optional argument `trimmarker` with its
default value.

Closes GH-9273.

# 5370f344 30-Jul-2022 Alex Dowad

mb_strimwidth inserts error markers in invalid input string (for backwards compatibility)

The old implementation did this. It also did the same to the
trim marker, if the trim marker was invalid in the specified
encoding, but I have not imitated that behavior (for performance).

# 78ee1841 26-Jul-2022 Alex Dowad

Move kana conversion function to mbfilter_cp5022x.c

...To avoid a dependency from libmbfl to mbstring.

Thanks to Nikita Popov for pointing this issue out.

# 72990960 19-Jul-2022 Alex Dowad

New implementation of mb_strimwidth

This new implementation of mb_strimwidth uses the new text
encoding conversion filters. Changes from the previous
implementation:

• mb_strimwidth allows a negative 'from' argument, which
should count backwards from the end of the string. However,
the implementation of this feature was buggy (starting right
from when it was first implemented).

It used the following code:

    if ((from < 0) || (width < 0)) {
        swidth = mbfl_strwidth(&string);
    }
    if (from < 0) {
        from += swidth;
    }

Do you see the bug? 'from' is a count of CODEPOINTS, but
'swidth' is a count of TERMINAL COLUMNS. Adding those two
together does not make sense. If there were no fullwidth
characters in the input string, then the two counts coincide
and the feature would work correctly. However, each
fullwidth character would throw the result off by one,
causing more characters to be skipped than was requested.

• mb_strimwidth also allows a negative 'width' argument,
which again counts backwards from the end of the string;
in this case, it is not determining the START of the portion
which we want to extract, but rather, the END of that portion.
Perhaps unsurprisingly, this feature was also buggy.

Code:

    if (width < 0) {
        width = swidth + width - from;
    }

'swidth + width' is fine here; the problem is '- from'.
Again, that is subtracting a count of CODEPOINTS from a
count of TERMINAL COLUMNS. In this case, we really need
to count the terminal width of the string prefix skipped
over by 'from', and subtract that rather than the number
of codepoints which are being skipped.

As a result, if a 'from' count was passed along with a
negative 'width', for every fullwidth character in the
skipped prefix, the result of mb_strimwidth was one
terminal column wider than requested.

Since these situations were covered by unit tests, you
might wonder why the bugs were not caught. Well, as far as
I can see, it looks like the author of the 'tests' just
captured the actual output of mb_strimwidth and defined it
as 'correct'. The tests were written in such a way that it
was difficult to examine them and see whether they made
sense or not; but a careful examination of the inputs and
outputs clearly shows that the legacy tests did not conform
to the documented contract of mb_strimwidth.

• The old implementation would always pass the input string
through decoding/encoding filters before returning it to
the caller, even if it fit within the specified width. This
means that invalid byte sequences would be converted to
error markers. For performance, the new implementation
returns the very same string which was passed in if it
does not exceed the specified width. This means that
erroneous byte sequences are not converted to error markers
unless it is necessary to trim the string.

• The same applies to the 'trim marker' string.

• The old implementation was buggy in the (unusual)
case that the trim marker is wider than the requested
maximum width of the result. It did an unsigned subtraction
of the requested width and the width of the trim marker. If the
width of the trim marker was greater, that subtraction would
underflow and yield a huge number. As a result, mb_strimwidth
would then pass the input string through, even if it was
far wider than the requested maximum width.

In that case, since the input string is wider than the
requested width, and NONE of it will fit together with the
trim marker, the new implementation returns just the trim
marker. This is the one case where the output can be wider
than the requested width: when BOTH the input string and
also the trim marker are too wide.

• Since it passed the input string and trim marker through
decoding/encoding filters, when using "Quoted-Printable" as
the encoding, newlines could be inserted into the trim marker
to maintain the maximum line length for QP.

This is an extremely bizarre use case and I don't think there
is any point in worrying about it. QP will be removed from
mbstring in time, anyways.
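
The unit mismatches described in the list above can be reproduced with a small sketch. Everything here is an assumption made for illustration: the input is an array of already-decoded codepoints, `toy_char_width` uses a deliberately crude width rule, and none of the function names come from mbstring. Range checking of 'from' is left out. The `legacy_*` functions mirror the quoted arithmetic; the `fixed_*` functions keep codepoint counts and column counts separate, and `text_budget` avoids the unsigned wraparound in the trim-marker case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy width rule (assumption): a few CJK/fullwidth ranges take 2 columns. */
static long toy_char_width(uint32_t cp)
{
    if ((cp >= 0x3000 && cp <= 0x30FF) || (cp >= 0xFF01 && cp <= 0xFF60)) {
        return 2;
    }
    return 1;
}

/* Width in terminal columns of the first n codepoints. */
static long toy_strwidth(const uint32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    long width = 0;
    for (size_t i = 0; i < n; i++) {
        width += toy_char_width(cps[i]);
    }
    return width;
}

/* Legacy arithmetic: adds a column count to a codepoint index. */
static long legacy_from(long from, const uint32_t *cps, size_t n)
{
    return (from < 0) ? from + toy_strwidth(cps, n) : from;
}

/* Unit-consistent: a negative 'from' counts codepoints from the end. */
static long fixed_from(long from, size_t n)
{
    return (from < 0) ? from + (long)n : from;
}

/* Legacy arithmetic: subtracts a codepoint count from a column count. */
static long legacy_width(long width, long from, const uint32_t *cps, size_t n)
{
    return (width < 0) ? toy_strwidth(cps, n) + width - from : width;
}

/* Unit-consistent: subtract the columns occupied by the skipped prefix. */
static long fixed_width(long width, long from, const uint32_t *cps, size_t n)
{
    assert(from >= 0 && (size_t)from <= n);
    if (width >= 0) {
        return width;
    }
    return toy_strwidth(cps, n) + width - toy_strwidth(cps, (size_t)from);
}

/* Columns left for text beside the trim marker, without unsigned wraparound. */
static size_t text_budget(size_t width, size_t marker_width)
{
    return (marker_width > width) ? 0 : width - marker_width;
}
```

For the four codepoints U+3042, U+3044, 'a', 'b' (six columns wide), `legacy_from(-2, ...)` gives 4 where 2 was intended, and `legacy_width(-1, 2, ...)` gives 3 where the remaining text allows only 1. Both errors equal the number of fullwidth characters involved, which matches the description in the commit message.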

PERFORMANCE:

• From micro-benchmarking with various input string lengths and
text encodings, it appears that the new implementation is 2-3x
faster for UTF-8 and UTF-16. For legacy Japanese text encodings
like ISO-2022-JP or SJIS, the new implementation is perhaps 25%
faster.

• Note that correctly implementing negative 'from' and 'width'
arguments imposes a small performance burden in such cases; one
which the old implementation did not pay. This slightly skews
benchmarking results in favor of the old implementation. However,
even so, the new implementation is faster in all cases which I
tested.

# 94fde156 19-Jul-2022 Alex Dowad

Move implementation of mb_strlen to mbstring.c

mbfl_strlen (in mbfilter.c) is still being used in a couple
of places but will go away soon.

# 3e922bf0 20-Jul-2022 Christoph M. Becker

Merge branch 'PHP-8.1'

* PHP-8.1:
Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings


# c2bdaa48 20-Jul-2022 Christoph M. Becker

Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings

Passing `null` to `$encodings` is supposed to behave like passing the
result of `mb_detect_order()`. Therefore, we need to remove the non-
encodings from the `elist` in this case as well. Thus, we duplicate
the global `elist`, so we can modify it.

Closes GH-9063.

# 9ac49c0d 12-Jul-2022 Alex Dowad

New implementation of mb_convert_kana

mb_convert_kana now uses the new text encoding conversion
filters. Microbenchmarking shows speed gains of 50%-150%
across various text encodings and input string lengths.

The behavior is the same as the old mb_convert_kana
except for one fix: if the 'zero codepoint' U+0000 appeared
in the input, the old implementation would sometimes drop
it, not passing it through to the output. This is now
fixed.

# 76a92c26 18-Jul-2022 Alex Dowad

mb_decode_numericentity decodes valid entities which are truncated at end of string

Since mb_decode_numericentity does not require all HTML entities
to end with ';', but allows them to be terminated by ANY non-digit
character, it doesn't make sense that valid entities which butt
up against the end of the input string are not converted.

As it turned out, supporting this case also made it possible
to simplify the code nicely.
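
A minimal sketch of the termination rule described above, for decimal entities only. `toy_parse_decimal_entity` is a hypothetical helper, not the mbstring implementation, and it ignores the 'convmap' entirely. It treats ';', any other non-digit, and the end of the input as equally valid terminators, which is why no special case is needed for an entity that runs up to the end of the string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Tries to parse "&#" followed by decimal digits at the start of s[0..len).
 * On success, stores the entity value and the number of bytes consumed
 * (including a trailing ';' if one is present) and returns true. */
static bool toy_parse_decimal_entity(const char *s, size_t len,
                                     uint32_t *value, size_t *consumed)
{
    assert(s != NULL && value != NULL && consumed != NULL);
    if (len < 3 || s[0] != '&' || s[1] != '#') {
        return false;
    }

    size_t i = 2;
    uint32_t v = 0;
    /* The loop ends on a non-digit OR on end of input; both terminate the entity. */
    while (i < len && s[i] >= '0' && s[i] <= '9') {
        uint32_t digit = (uint32_t)(s[i] - '0');
        if (v > (UINT32_MAX - digit) / 10) {
            return false; /* value does not fit in 32 bits */
        }
        v = v * 10 + digit;
        i++;
    }
    if (i == 2) {
        return false; /* "&#" with no digits is not an entity */
    }
    if (i < len && s[i] == ';') {
        i++; /* an explicit terminator is consumed along with the entity */
    }

    *value = v;
    *consumed = i;
    return true;
}
```

Because the digit loop already stops at the end of the buffer, "&#65" at the end of a string parses the same way as "&#65;" or "&#65x", apart from how many bytes are consumed.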

# 5d6bd557 15-Jul-2022 Alex Dowad

mb_decode_numericentity converts entities which immediately follow a valid/invalid entity

Thanks to Kamil Tieleka for suggesting that some of the behaviors of
the legacy implementation which the new mb_decode_numericentity
implementation took care to maintain were actually bugs and should
be fixed. Thanks also to Trevor Rowbotham for providing a link to
the HTML specification, showing how HTML numeric entities should
be interpreted.

mb_decode_numericentity now processes numeric entities in the
following situations where the old implementation would not:

- &<ENTITY> (for example, &&#65;)
- &#<ENTITY>
- &#x<ENTITY>
- <VALID BUT UNTERMINATED DECIMAL ENTITY><ENTITY> (for example, &#65&#65;)
- <VALID BUT UNTERMINATED HEX ENTITY><ENTITY>
- <INVALID AND UNTERMINATED DECIMAL ENTITY><ENTITY> (it does not matter why
  the first entity is invalid; the value could be too big, it could have
  too many digits, or it could not match the 'convmap' parameter)
- <INVALID AND UNTERMINATED HEX ENTITY><ENTITY>

This is consistent with the way that web browsers process
HTML entities.

# 91969e90 14-May-2022 Alex Dowad

New implementation of mb_{de,en}code_numericentity

This new implementation uses the new encoding conversion filters.
Aside from fewer LOC and (hopefully) improved readability,
the differences are as follows:

BEHAVIOR CHANGES:

- The old implementation used signed arithmetic when operating
on the 'convmap'. This meant that results could be surprising when
using convmap entries with 1 in the MSB. Further, types like 'int'
were used rather than those with a specific bit width, such as
'int32_t'. This meant that results could also depend on the
platform width of an 'int'.

Now unsigned arithmetic is used, with explicit bit widths.

- Similarly, while converting decimal numeric entities, the
legacy implementation would ensure that the value never overflowed
INT_MAX, and if it did, the entity would be treated as invalid
and passed through unconverted.

However, that again means that results depend on the platform
size of an 'int'. So now, we use a value with explicit bit width
(32 bits) to hold the value of a deconverted decimal entity, and
ensure that the entity value does not overflow that.

Further, because we are using an UNSIGNED 32-bit value rather
than a signed one, the ceiling for how large a decimal entity
can be is higher now.

All of this will probably not affect anyone, since Unicode
codepoints above U+10FFFF are invalid anyways. To see the
difference, you need to be using a text encoding like UCS-4,
which allows huge 'codepoints'.

- If it saw something which looked like a hex entity, but
turned out not to be a valid numeric entity, the old
implementation would sometimes convert the hexadecimal
digits a-f to A-F (uppercase). The new implementation passes
invalid numeric entities through without performing case
conversion.

- The old implementation of mb_encode_numericentity was
limited in how many decimal/hex digits it could emit.
If a text encoding like UCS-4 was in use, where 'codepoints'
can have huge values (larger than the valid range
stipulated by the Unicode standard), it would not error
out on a 'codepoint' whose value was too large for it,
but would rather mangle the value and emit a numeric
entity which decoded to some other random codepoint.
The new implementation is able to emit enough digits to
express any value which fits in 32 bits.

PERFORMANCE:

Based on micro-benchmarks run on my development machine:

Decoding numeric HTML entities is about 4 times faster, for
both decimal and hexadecimal entities, across a variety of
input string lengths. Encoding is about 3 times faster.
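
The encoding side of the behavior changes above can be sketched with explicit 32-bit unsigned arithmetic. This is an illustration under assumptions, not the mbstring code: `toy_convmap_entry` and `toy_encode_entity` are hypothetical names, and the sketch assumes the usual convmap layout of start, end, offset and mask, with the emitted value computed as (codepoint + offset) & mask. The buffer is sized so that any 32-bit value can be written in decimal without truncation.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One convmap entry (assumed layout): inclusive range, offset, mask. */
typedef struct toy_convmap_entry {
    uint32_t start;
    uint32_t end;
    uint32_t offset;
    uint32_t mask;
} toy_convmap_entry;

/* "&#" + up to 10 decimal digits + ";" + terminating NUL. */
#define TOY_ENTITY_BUF 14

/* Writes a decimal numeric entity for cp into out if cp falls inside one of
 * the convmap ranges. Returns false if no entry matches (cp is left as-is). */
static bool toy_encode_entity(uint32_t cp, const toy_convmap_entry *map,
                              size_t n_entries, char out[TOY_ENTITY_BUF])
{
    assert(map != NULL && out != NULL);
    for (size_t i = 0; i < n_entries; i++) {
        if (cp >= map[i].start && cp <= map[i].end) {
            /* Unsigned arithmetic: wraps modulo 2^32, no sign surprises. */
            uint32_t v = (uint32_t)(cp + map[i].offset) & map[i].mask;
            snprintf(out, TOY_ENTITY_BUF, "&#%" PRIu32 ";", v);
            return true;
        }
    }
    return false;
}
```

Using `uint32_t` throughout means the result does not depend on the platform width of `int`, and the 14-byte buffer is enough for the largest 32-bit value, 4294967295.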

# d6fc1650 15-Jul-2022 Christoph M. Becker

Drop useless TODO comment

Cf. <https://github.com/php/php-src/pull/9018#issuecomment-1185481492>.

# 56137cd2 23-Jun-2022 Máté Kocsis

Declare ext/mbstring constants in stubs (#8798)

# 880803a2 13-May-2022 Alex Dowad

Use fast conversion filters to implement php_mb_ord

Even for single-character strings, this is about 50% faster for
ASCII, UTF-8, and UTF-16. For long strings, the performance gain is
enormous, since the old code would convert the ENTIRE string, just
to pick out the first codepoint.
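
The idea of decoding only the first codepoint, instead of converting the whole string, can be sketched for UTF-8 as follows. `toy_utf8_first_codepoint` is a hypothetical helper written for this illustration; it is not php_mb_ord and it does not use the mbstring conversion filters. It reads at most four bytes no matter how long the input is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes the first codepoint of a UTF-8 buffer. Returns false for empty,
 * truncated, overlong, surrogate or out-of-range input. */
static bool toy_utf8_first_codepoint(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0) {
        return false;
    }

    unsigned char b0 = s[0];
    size_t trailing;  /* number of continuation bytes expected */
    uint32_t v;       /* value accumulated so far */
    uint32_t min;     /* smallest value allowed for this sequence length */

    if (b0 < 0x80) {
        *cp = b0;
        return true;
    } else if ((b0 & 0xE0) == 0xC0) {
        trailing = 1; v = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        trailing = 2; v = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        trailing = 3; v = b0 & 0x07; min = 0x10000;
    } else {
        return false; /* stray continuation byte or invalid lead byte */
    }

    if (len < trailing + 1) {
        return false; /* sequence is cut off */
    }
    for (size_t i = 1; i <= trailing; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            return false;
        }
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) {
        return false;
    }

    *cp = v;
    return true;
}
```

The cost is bounded by the length of one encoded character, so a very long input string is handled as cheaply as a single-character one.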

# 950a7db9 02-May-2022 Alex Dowad

Use fast text conversion filters to implement mb_check_encoding

Benchmarking reveals that this is about 8% slower for UTF-8 strings
which have a bad codepoint at the very beginning of the string.
For good strings, or those where the first bad codepoint is much
later in the string, it is significantly faster (2-3 times faster
in many cases).

# 5b2e413e 03-Jun-2022 Remi Collet

Merge branch 'PHP-8.1'

* PHP-8.1:
NEWS for GH-8685
NEWS for GH-8685
Fix GH-8685 mbstring requires pcre

