History log of /php-src/ext/mbstring/libmbfl/filters/mbfilter_utf7imap.c (Results 1 – 25 of 31)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 1f0cf133 30-Sep-2023 Alex Dowad

Add fast mb_strcut implementation for UTF-8

The old implementation runs through the entire string to pick out the
part which should be returned by mb_strcut. This creates significant

Add fast mb_strcut implementation for UTF-8

The old implementation runs through the entire string to pick out the
part which should be returned by mb_strcut. This creates significant
performance overhead. The new specialized implementation of mb_strcut
for UTF-8 usually only examines a few bytes around the starting and
ending cut points, meaning it generally runs in constant time.

For UTF-8 strings just a few bytes long, the new implementation is
around 10% faster (according to microbenchmarks which I ran locally).
For strings around 10,000 bytes in length, it is 50-300x faster.
(Yes, that is 300x and not 300%.)

The new implementation behaves identically to the old one on VALID
UTF-8 strings; a fuzzer was used to help ensure this is the case.
On invalid UTF-8 strings, there is a difference: in some cases, the
old implementation will pass invalid byte sequences through unchanged,
while in others it will remove them. The new implementation has
behavior which is perhaps slightly more predictable: it simply backs
up the starting and ending cut points to the preceding "starter
byte" (one which is not a UTF-8 continuation byte).

show more ...


# a57fdea1 16-Jul-2023 Alex Dowad

Add assertion to mb_utf7imap_to_wchar to catch buffer overrun

I don't believe such a buffer overrun will ever occur, but just in
case the code is changed in the future, it will be good t

Add assertion to mb_utf7imap_to_wchar to catch buffer overrun

I don't believe such a buffer overrun will ever occur, but just in
case the code is changed in the future, it will be good to have an
assertion here to help catch bugs. (A similar assertion is already
used in the UTF-7 version of this function.)

show more ...


# 732d92c0 28-Apr-2023 Javier Eguiluz

[skip ci] Fix various typos and grammar issues (#11143)


# b721d0f7 10-Mar-2023 pakutoma

Fix phpGH-10648: add check function pointer into mbfl_encoding

Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are ca

Fix phpGH-10648: add check function pointer into mbfl_encoding

Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.

(The same change has already been made to PHP 8.2 and 8.3; see
6fc8d014df. This commit is backporting the change to PHP 8.1.)

show more ...


# 6fc8d014 21-Mar-2023 pakutoma

Fix phpGH-10648: add check function pointer into mbfl_encoding

Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are ca

Fix phpGH-10648: add check function pointer into mbfl_encoding

Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.

show more ...


# a6186823 19-Nov-2022 Alex Dowad

For UTF-7, flag unnecessary extra trailing byte in Base64 section as error

This bug was found when I was fuzzing a patch related to mb_strpos.
In some cases, the legacy text conversion c

For UTF-7, flag unnecessary extra trailing byte in Base64 section as error

This bug was found when I was fuzzing a patch related to mb_strpos.
In some cases, the legacy text conversion code for UTF-7 (and
UTF7-IMAP) would correctly recognize an error for a Base64-encoded
section which was not correctly padded with zero bits, but the new
(and faster) text conversion code would not.

Specifically, if the input string ended abruptly after the 4th or 7th
byte of a Base64-encoded section, the new conversion code would
confirm that the trailing padding bits from the previous byte (3rd or
6th) were zeroes, but would not check whether the 4th or 7th byte
itself encoded any non-zero bits. The legacy conversion code did
perform this check and would treat the input string as invalid.

Actually, even if the 4th or 7th byte does encode only (padding) zero
bits, this is still a problem, because there is no reason to have a
4th (or 7th) byte in that case. The UTF-7 string should have ended
on the previous byte instead.

Apply the same fix for both UTF-7 and UTF7-IMAP.

show more ...


# d3933e0b 14-Nov-2022 Alex Dowad

Fix regression test for GH-9535 on PHP-8.2+

Some of the legacy text encodings which were used in this regression
test are deprecated in PHP-8.2+. The deprecation warnings break the
e

Fix regression test for GH-9535 on PHP-8.2+

Some of the legacy text encodings which were used in this regression
test are deprecated in PHP-8.2+. The deprecation warnings break the
expected output. Since using these encodings in mbstring is now
deprecated, I think there is little point in keeping them in this test.
So they are now removed from it.

Further, in 219fff376b, I made a change to avoid a situation where the
legacy UTF7-IMAP conversion code gets stuck in a wrong state when its
attempt to emit a character fails. When a Base64-encoded section of
input ended with -, the previous code would FIRST emit a character if
necessary (using the CK or "check" macro, which causes the function to
return immediately if the downstream filter function returns an error
code), and THEN update its own state to indicate that it is now in
ASCII rather than Base64 mode.

If the downstream filter function returned an error code, the CK macro
would then cause the UTF7-IMAP filter function to return immediately
WITHOUT setting its own state to indicate that the Base64-encoded
section was done.

I fixed this by updating the filter state as needed BEFORE calling CK...
but I missed updating the filter state in the case where the Base64
section ends normally and there is no need to emit anything.

Again, in 6d525a425e, I modified the legacy conversion code for
ISO-2022-KR to try to comply more closely with the RFC for this
text encoding. The RFC states that before any occurrence of 'Shift In'
or 'Shift Out' codes in a ISO-2022-KR string, a special escape
sequence must appear at least ONCE, at the beginning of a line.
The previous code did not comply with this requirement. I made it
comply by always emitting this escape sequence at the beginning of
the first line.

Since mb_strcut (wrongly) determines when it has consumed enough of
the input string by looking at the length of its output in bytes, this
extra escape sequence makes mb_strcut consume 4 bytes less of an
ISO-2022-KR string than would otherwise be the case. When this
strange behavior of mb_strcut is fixed, this test will have to be
adjusted to restore the previous expected outputs for ISO-2022-KR.

show more ...


Revision tags: php-8.2.0RC1, php-8.1.10, php-8.0.23, php-8.0.23RC1, php-8.1.10RC1, php-8.2.0beta3
# 5812b4fe 04-Aug-2022 Alex Dowad

In legacy text conversion filters, reset filter state in 'flush' function

Up until now, I believed that mbstring had been designed such
that (legacy) text conversion filter objects shoul

In legacy text conversion filters, reset filter state in 'flush' function

Up until now, I believed that mbstring had been designed such
that (legacy) text conversion filter objects should not be
re-used after the 'flush' function is called to complete a
text conversion operation.

However, it turns out that the implementation of
_php_mb_encoding_handler_ex DID re-use filter objects
after flush. That means that functions which were based on
_php_mb_encoding_handler_ex, including mb_parse_str and
php_mb_post_handler, would break in some cases; state left
over from converting one substring (perhaps a variable name)
would affect the results of converting another substring
(perhaps the value of the same variable), and could cause
extraneous characters to get inserted into the output.

All this code should be deleted soon, but fixing it helps me
to avoid spurious failures when fuzzing the new/old code to
look for differences in behavior.

(This bug fix commit was originally applied to PHP-8.2 when fuzzing
the new mbstring text conversion code to check for differences with
the old code. Later, Kentaro Ohkouchi kindly reported a problem with
mb_encode_mimeheader under PHP 8.1 which was caused by the same issue.
Hence, this commit was backported to PHP-8.1.)

Fixes GH-9683.

show more ...


# f3c8efd7 04-Aug-2022 Alex Dowad

In legacy text conversion filters, reset filter state in 'flush' function

Up until now, I believed that mbstring had been designed such
that (legacy) text conversion filter objects shoul

In legacy text conversion filters, reset filter state in 'flush' function

Up until now, I believed that mbstring had been designed such
that (legacy) text conversion filter objects should not be
re-used after the 'flush' function is called to complete a
text conversion operation.

However, it turns out that the implementation of
_php_mb_encoding_handler_ex DID re-use filter objects
after flush. That means that functions which were based on
_php_mb_encoding_handler_ex, including mb_parse_str and
php_mb_post_handler, would break in some cases; state left
over from converting one substring (perhaps a variable name)
would affect the results of converting another substring
(perhaps the value of the same variable), and could cause
extraneous characters to get inserted into the output.

All this code should be deleted soon, but fixing it helps me
to avoid spurious failures when fuzzing the new/old code to
look for differences in behavior.

show more ...

Revision tags: php-8.2.0beta2, php-8.1.9, php-8.0.22
# 219fff37 25-Jul-2022 Alex Dowad

Fix legacy text conversion filter for UTF7-IMAP

Make necessary updates to filter state before using CK macro.

Revision tags: php-8.1.9RC1, php-8.2.0beta1, php-8.0.22RC1, php-8.0.21, php-8.1.8, php-8.2.0alpha3, php-8.1.8RC1, php-8.2.0alpha2, php-8.0.21RC1, php-8.0.20, php-8.1.7, php-8.2.0alpha1, php-7.4.30
# 8533fccd 06-Jun-2022 Alex Dowad

Assert minimum size of wchar buffer in text conversion filters

In all text conversion filters which require the wchar buffer used for output
to have some minimum size, it's better to inc

Assert minimum size of wchar buffer in text conversion filters

In all text conversion filters which require the wchar buffer used for output
to have some minimum size, it's better to include an assertion; this will
help us to catch bugs, and will also help future readers to understand what
we expect of the function arguments.

For UTF-7 and UTF7-IMAP, these assertions were already there, but I have
added comments explaining why the minimum size is what it is.

show more ...

Revision tags: php-8.1.7RC1, php-8.0.20RC1, php-8.1.6, php-8.0.19, php-8.1.6RC1, php-8.0.19RC1, php-8.0.18, php-8.1.5, php-7.4.29, php-8.1.5RC1, php-8.0.18RC1, php-8.1.4, php-8.0.17, php-8.1.4RC1, php-8.0.17RC1, php-8.1.3, php-8.0.16, php-7.4.28, php-8.1.3RC1, php-8.0.16RC1
# 0d635d93 21-Jan-2022 Alex Dowad

Implement fast text conversion interface for UTF7-IMAP

The old code would convert a 0x00 byte in the input to 0x00 in the
output, but this clearly violates the RFC which defines UTF7-IMA

Implement fast text conversion interface for UTF7-IMAP

The old code would convert a 0x00 byte in the input to 0x00 in the
output, but this clearly violates the RFC which defines UTF7-IMAP.

show more ...

Revision tags: php-8.1.2, php-8.0.15, php-8.1.2RC1, php-8.0.15RC1, php-8.0.14, php-8.1.1, php-7.4.27, php-8.1.1RC1, php-8.0.14RC1, php-7.4.27RC1, php-8.1.0, php-8.0.13, php-7.4.26, php-7.3.33, php-8.1.0RC6, php-7.4.26RC1, php-8.0.13RC1, php-8.1.0RC5, php-7.3.32, php-7.4.25, php-8.0.12, php-8.1.0RC4, php-8.0.12RC1, php-7.4.25RC1, php-8.1.0RC3, php-8.0.11, php-7.4.24, php-7.3.31, php-8.1.0RC2, php-7.4.24RC1, php-8.0.11RC1, php-8.1.0RC1, php-7.4.23, php-8.0.10, php-7.3.30, php-8.1.0beta3, php-8.0.10RC1, php-7.4.23RC1, php-8.1.0beta2, php-8.0.9, php-7.4.22
# 3c732251 21-Jul-2021 Alex Dowad

New internal interface for fast text conversion in mbstring

When converting text to/from wchars, mbstring makes one function call
for each and every byte or wchar to be converted. Typica

New internal interface for fast text conversion in mbstring

When converting text to/from wchars, mbstring makes one function call
for each and every byte or wchar to be converted. Typically, each of
these conversion functions contains a state machine, and its state has
to be restored and then saved for every single one of these calls.
It doesn't take much to see that this is grossly inefficient.

Instead of converting one byte or wchar on each call, the new
conversion functions will either fill up or drain a whole buffer of
wchars on each call. In benchmarks, this is about 3-10× faster.

Adding the new, faster conversion functions for all supported legacy
text encodings still needs some work. Also, all the code which uses
the old-style conversion functions needs to be converted to use the
new ones. After that, the old code can be dropped. (The mailparse
extension will also have to be fixed up so it will still compile.)

show more ...

# 626f0fec 30-Jul-2021 Alex Dowad

Remove some dead code from mbstring

mbstring has a great deal of dead code. Some common types are:

- Default switch clauses which will never be taken
- If clauses intended to co

Remove some dead code from mbstring

mbstring has a great deal of dead code. Some common types are:

- Default switch clauses which will never be taken
- If clauses intended to convert codepoints which were not present in
a conversion table... but the codepoint in question *is* in the table,
so the if clause is not needed.
- Bounds checks in places where it is not possible for a value to ever
be out of bounds.
- Checks to see if an unmatched Unicode codepoint is in CP932 extension
range 3... but every codepoint in range 3 is also in range 2, so no
codepoint will ever be matched and converted by that code.

show more ...

# 16a1e0a2 27-Aug-2021 Alex Dowad

In UTF7-IMAP, reject the 2nd part of surrogate pair if it appears unexpectedly

# 776296e1 30-Aug-2021 Alex Dowad

mbstring no longer provides 'long' substitutions for erroneous input bytes

Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like

mbstring no longer provides 'long' substitutions for erroneous input bytes

Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.

However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyways.

More to the point, a search of publically available PHP code indicates
that nobody is really using this feature anyways.

Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.

However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.

With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.

One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.

Therefore, amend all these functions to return 0 for success.

show more ...

Revision tags: php-8.1.0beta1, php-7.4.22RC1, php-8.0.9RC1, php-8.1.0alpha3, php-7.4.21, php-7.3.29, php-8.0.8, php-8.1.0alpha2, php-7.4.21RC1, php-8.0.8RC1, php-8.1.0alpha1, php-8.0.7, php-7.4.20, php-8.0.7RC1, php-7.4.20RC1, php-8.0.6, php-7.4.19, php-7.4.18, php-7.3.28, php-8.0.5, php-8.0.5RC1, php-7.4.18RC1
# 78dc160e 04-Apr-2021 Alex Dowad

Catch and handle errors in mUTF-7 (IMAP) conversion

# cef4b94e 04-Apr-2021 Alex Dowad

Code cleanup in mbfilter_utf7imap.c

Revision tags: php-8.0.4RC1, php-7.4.17RC1, php-8.0.3, php-7.4.16, php-8.0.3RC1, php-7.4.16RC1, php-8.0.2, php-7.4.15, php-7.3.27, php-8.0.2RC1, php-7.4.15RC2, php-7.4.15RC1, php-8.0.1, php-7.4.14, php-7.3.26, php-7.4.14RC1, php-8.0.1RC1, php-7.3.26RC1, php-8.0.0, php-7.3.25, php-7.4.13, php-8.0.0RC5, php-7.4.13RC1, php-8.0.0RC4, php-7.3.25RC1, php-7.4.12, php-8.0.0RC3, php-7.3.24
# a06c20a1 18-Oct-2020 Alex Dowad

Remove useless constant MBFL_ENCTYPE_MBCS

This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't

Remove useless constant MBFL_ENCTYPE_MBCS

This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't actually
need to check this flag anywhere, so it's better to remove it.

show more ...

Revision tags: php-8.0.0RC2, php-7.4.12RC1, php-7.3.24RC1, php-7.2.34, php-8.0.0rc1, php-7.4.11, php-7.3.23, php-8.0.0beta4, php-7.4.11RC1, php-7.3.23RC1
# 7bb5b435 11-Sep-2020 Alex Dowad

mUTF-7 (UTF7-IMAP) conversion: handle illegal (non-RFC-compliant) input correctly

Instead of looking the other way and letting things slide, report errors when
the input does not follow

mUTF-7 (UTF7-IMAP) conversion: handle illegal (non-RFC-compliant) input correctly

Instead of looking the other way and letting things slide, report errors when
the input does not follow the RFC.

show more ...

# b43a7dea 12-Sep-2020 Alex Dowad

Add 'mUTF-7' alias for UTF7-IMAP encoding

# b9758172 10-Sep-2020 Alex Dowad

Add comment explaining mUTF-7 to mbfilter_utf7imap.c

Revision tags: php-8.0.0beta3, php-7.4.10, php-7.3.22, php-8.0.0beta2, php-7.3.22RC1, php-7.4.10RC1, php-8.0.0beta1, php-7.4.9, php-7.2.33, php-7.3.21, php-8.0.0alpha3, php-7.4.9RC1, php-7.3.21RC1
# dcd6c604 16-Jul-2020 Alex Dowad

Remove unneeded function mbfl_filt_conv_common_dtor

This is a default destructor for mbfl_convert_filter structs. The thing is: there
isn't really anything that needs to be done to those

Remove unneeded function mbfl_filt_conv_common_dtor

This is a default destructor for mbfl_convert_filter structs. The thing is: there
isn't really anything that needs to be done to those structs before freeing them.
The default destructor just zeroed out some fields, but there's no reason why
we should actually do that.

show more ...

# cdc66404 11-Jul-2020 Alex Dowad

Comment constants in mbfl_consts.h, remove unused ones

These were unused, and almost certainly will never be used:

- MBFL_ENCTYPE_MWC4BE
- MBFL_ENCTYPE_MWC4LE
- MBFL_ENCTYPE

Comment constants in mbfl_consts.h, remove unused ones

These were unused, and almost certainly will never be used:

- MBFL_ENCTYPE_MWC4BE
- MBFL_ENCTYPE_MWC4LE
- MBFL_ENCTYPE_SHFTCODE
- MBFL_ENCTYPE_ENC_STRM

For the latter two, there were some encodings which were marked with these flags;
but nothing ever _checked_ these particular flags.

show more ...

Revision tags: php-7.4.8, php-7.2.32, php-8.0.0alpha2, php-7.3.20
# 62317d59 04-Jul-2020 Alex Dowad

Remove redundant includes from mbstring (and make sure correct config.h is used)

Very interesting... it turns out that when Valgrind support was enabled,
`#include "config.h"` from withi

Remove redundant includes from mbstring (and make sure correct config.h is used)

Very interesting... it turns out that when Valgrind support was enabled,
`#include "config.h"` from within mbstring was actually including the file "config.h"
from Valgrind, and not the one from mbstring!!

This is because -I/usr/include/valgrind was added to the compiler invocation _before_
-Iext/mbstring/libmbfl.

Make sure we actually include the file which was intended.

show more ...

12