History log of /PHP-8.2/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c (Results 1 – 25 of 57)
Revision Date Author Comments
# 6fc8d014 21-Mar-2023 pakutoma

Fix GH-10648: add check function pointer into mbfl_encoding

Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where validation and conversion need different
logic. For example, if a string ends before all the input required by the
encoding has arrived, or if it contains a character which is invalid in
the encoding but can still be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
This commit also adds validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
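
The split between lenient conversion and strict validation might be sketched roughly as follows. This is a toy illustration only: `toy_encoding`, `toy_convert`, `toy_check` and the invented two-byte encoding are assumptions made for the example, not the actual layout of struct mbfl_encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an encoding descriptor: one function
 * pointer for lenient conversion, a separate one for strict validation. */
typedef struct {
    const char *name;
    size_t (*convert)(const unsigned char *in, size_t len, unsigned int *out);
    bool (*check)(const unsigned char *in, size_t len);
} toy_encoding;

/* Toy 2-byte big-endian encoding. Conversion is lenient: a dangling final
 * byte becomes '?' and the conversion still succeeds.
 * `out` must have room for (len + 1) / 2 entries. */
static size_t toy_convert(const unsigned char *in, size_t len, unsigned int *out)
{
    assert(out != NULL);
    size_t n = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        out[n++] = ((unsigned int)in[i] << 8) | in[i + 1];
    if (len % 2)
        out[n++] = '?';
    return n;
}

/* Validation is strict: truncated input is rejected. */
static bool toy_check(const unsigned char *in, size_t len)
{
    (void)in;
    return len % 2 == 0;
}

static const toy_encoding toy_enc = { "TOY-16BE", toy_convert, toy_check };
```

With a three-byte input, `toy_enc.convert` still produces output while `toy_enc.check` reports the string as invalid, which is the behavior the commit message describes.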

# f3c8efd7 04-Aug-2022 Alex Dowad

In legacy text conversion filters, reset filter state in 'flush' function

Up until now, I believed that mbstring had been designed such
that (legacy) text conversion filter objects should not be
re-used after the 'flush' function is called to complete a
text conversion operation.

However, it turns out that the implementation of
_php_mb_encoding_handler_ex DID re-use filter objects
after flush. That means that functions which were based on
_php_mb_encoding_handler_ex, including mb_parse_str and
php_mb_post_handler, would break in some cases; state left
over from converting one substring (perhaps a variable name)
would affect the results of converting another substring
(perhaps the value of the same variable), and could cause
extraneous characters to get inserted into the output.

All this code should be deleted soon, but fixing it helps me
to avoid spurious failures when fuzzing the new/old code to
look for differences in behavior.
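
A minimal sketch of why resetting state in 'flush' matters when a filter object is re-used. The `pair_filter` type and its functions are hypothetical and far simpler than the real mbstring filters; the point is only the reset on the last line of `pair_flush`.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int cache;        /* pending first byte of a pair, or -1 if none */
    char out[16];
    size_t out_len;
} pair_filter;

static void pair_init(pair_filter *f)
{
    f->cache = -1;
    f->out_len = 0;
}

static void pair_emit(pair_filter *f, char c)
{
    assert(f->out_len < sizeof(f->out));
    f->out[f->out_len++] = c;
}

/* Toy filter which only emits bytes in pairs. */
static void pair_feed(pair_filter *f, unsigned char c)
{
    if (f->cache < 0) {
        f->cache = c;
    } else {
        pair_emit(f, (char)f->cache);
        pair_emit(f, (char)c);
        f->cache = -1;
    }
}

/* Complete one conversion: a dangling byte is reported as '?'. */
static void pair_flush(pair_filter *f)
{
    if (f->cache >= 0)
        pair_emit(f, '?');
    f->cache = -1;  /* reset, so the object can safely convert another string */
}
```

Without the reset, a byte left over from one substring would be paired with the first byte of the next substring converted by the same object.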

# 3517a70f 04-Aug-2022 Alex Dowad

Fix legacy text conversion filter for CP50220

CP50220 converts some codepoints which represent kana
(hiragana/katakana) to a different form. This is the only difference
between CP50220 and CP50221 (which doesn't perform such conversion).
In some cases, this conversion means collapsing two codepoints to
a single output byte sequence. Since the legacy text conversion
filters only worked a byte at a time, the legacy filter had to
cache a byte, then wait until it was called again with the next
byte to compare the cached byte with the following one.
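
The caching approach described above can be sketched like this. It is an illustration under stated assumptions, not the legacy filter itself: `kana_filter` and its functions are hypothetical, only one kana pair is handled, and the output is collected as codepoints rather than CP50220 bytes. The example assumes that halfwidth KA (U+FF76) followed by the halfwidth voiced sound mark (U+FF9E) collapses into fullwidth GA (U+30AC).

```c
#include <assert.h>
#include <stddef.h>

#define HW_KA      0xFF76u  /* HALFWIDTH KATAKANA LETTER KA */
#define HW_DAKUTEN 0xFF9Eu  /* HALFWIDTH KATAKANA VOICED SOUND MARK */
#define FW_KA      0x30ABu  /* KATAKANA LETTER KA */
#define FW_GA      0x30ACu  /* KATAKANA LETTER GA */

typedef struct {
    unsigned int cached;    /* pending halfwidth KA, or 0 if nothing is cached */
    unsigned int out[16];
    size_t out_len;
} kana_filter;

static void kana_emit(kana_filter *f, unsigned int w)
{
    assert(f->out_len < 16);
    f->out[f->out_len++] = w;
}

/* Called once per input codepoint, like the one-unit-at-a-time legacy filters. */
static void kana_feed(kana_filter *f, unsigned int w)
{
    if (f->cached) {
        f->cached = 0;
        if (w == HW_DAKUTEN) {      /* KA + voiced mark collapses to GA */
            kana_emit(f, FW_GA);
            return;
        }
        kana_emit(f, FW_KA);        /* no voiced mark followed: plain KA */
    }
    if (w == HW_KA)
        f->cached = w;              /* wait for the next call before deciding */
    else
        kana_emit(f, w);
}

/* At end of input, anything still cached must be written out. */
static void kana_flush(kana_filter *f)
{
    if (f->cached) {
        kana_emit(f, FW_KA);
        f->cached = 0;
    }
}
```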

That was all fine, but it didn't work as intended when there were
errors (invalid byte sequences) in the input. Our code (both old
and new) for emitting error markers recursively calls the same
conversion filter. When the old CP50220 filter was called
recursively, the logic for managing cached bytes did not behave
as intended. As a result, the error markers could be reordered
with other characters in the output.

I used an ugly hack to fix this in 6938e3512; when making a
recursive call to emit an error marker, temporarily swap out
`filter->filter_function` to bypass the byte-caching code,
so the error marker immediately goes through to the output.

This worked, but I overlooked the fact that the very same
problem can occur if an invalid byte sequence is detected
*in the flush function*. Apply the same (ugly) fix.

# 88d13491 04-Aug-2022 Alex Dowad

Make control flow in mb_wchar_to_cp50220 a bit clearer

# 78ee1841 26-Jul-2022 Alex Dowad

Move kana conversion function to mbfilter_cp5022x.c

...To avoid a dependency from libmbfl to mbstring.

Thanks to Nikita Popov for pointing this issue out.

# 44b4fb2c 23-Jul-2022 Alex Dowad

Fix legacy text conversion filter for CP50220

In my recent commit which replaced the implementation of
mb_convert_kana, the commit message noted that mb_convert_kana
previously had a bug whereby null bytes would be 'swallowed'
and not passed to the output.

The bug in the legacy CP50220 conversion filter which this commit fixes
was actually the reason.

# 9ac49c0d 12-Jul-2022 Alex Dowad

New implementation of mb_convert_kana

mb_convert_kana now uses the new text encoding conversion
filters. Microbenchmarking shows speed gains of 50%-150%
across various text encodings and input string lengths.

The behavior is the same as the old mb_convert_kana
except for one fix: if the 'zero codepoint' U+0000 appeared
in the input, the old implementation would sometimes drop
it, not passing it through to the output. This is now
fixed.

# 3cf43279 25-Jun-2022 Alex Dowad

Fix new conversion filter for CP50220 (multi-codepoint kana at end of buffer)

If two codepoints which needed to be collapsed into a single kuten code
were separated, with one at the end of one buffer and the other at the
beginning of the next buffer, they were not converted correctly.
This was discovered while fuzzing the new implementation of
mb_decode_numericentity.

# 6938e351 05-Jul-2022 Alex Dowad

Fix legacy conversion filter for CP50220

# 8533fccd 06-Jun-2022 Alex Dowad

Assert minimum size of wchar buffer in text conversion filters

In all text conversion filters which require the wchar buffer used for output
to have some minimum size, it's better to include an assertion; this will
help us to catch bugs, and will also help future readers to understand what
we expect of the function arguments.

For UTF-7 and UTF7-IMAP, these assertions were already there, but I have
added comments explaining why the minimum size is what it is.
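
As a rough illustration of such an assertion, consider a hypothetical decoder in which a single input byte may expand to two wchars. The function name, signature and expansion rule below are invented for the example and do not correspond to any real mbstring filter.

```c
#include <assert.h>
#include <stddef.h>

/* Decode bytes into `buf`, which holds up to `buf_size` wchars.
 * Returns the number of wchars written and stores the number of
 * input bytes consumed in `*consumed`. */
static size_t toy_to_wchar(const unsigned char *in, size_t in_len,
                           unsigned int *buf, size_t buf_size, size_t *consumed)
{
    /* One input byte can produce two wchars, so with fewer than two free
     * slots the loop below could never make progress. */
    assert(buf_size >= 2);

    unsigned int *out = buf;
    unsigned int *limit = buf + buf_size - 1;  /* always keep room for 2 wchars */
    const unsigned char *p = in, *e = in + in_len;

    while (p < e && out < limit) {
        unsigned char c = *p++;
        if (c == 0xFF) {        /* toy rule: this byte expands to two wchars */
            *out++ = 'f';
            *out++ = 'f';
        } else {
            *out++ = c;
        }
    }

    *consumed = (size_t)(p - in);
    return (size_t)(out - buf);
}
```

The assertion documents the caller's obligation and turns a silent lack of progress (or an overflow, in a less careful loop) into an immediate failure in debug builds.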

# e2c4fc57 23-May-2022 Alex Dowad

Fix buffer overflow bugs in CP50222 text conversion code

# 53ffba96 20-Dec-2021 Alex Dowad

Implement fast text conversion interface for CP5022{0,1,2}

# 3c732251 21-Jul-2021 Alex Dowad

New internal interface for fast text conversion in mbstring

When converting text to/from wchars, mbstring makes one function call
for each and every byte or wchar to be converted. Typically, each of
these conversion functions contains a state machine, and its state has
to be restored and then saved for every single one of these calls.
It doesn't take much to see that this is grossly inefficient.

Instead of converting one byte or wchar on each call, the new
conversion functions will either fill up or drain a whole buffer of
wchars on each call. In benchmarks, this is about 3-10× faster.

Adding the new, faster conversion functions for all supported legacy
text encodings still needs some work. Also, all the code which uses
the old-style conversion functions needs to be converted to use the
new ones. After that, the old code can be dropped. (The mailparse
extension will also have to be fixed up so it will still compile.)
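
The difference between the two calling styles can be sketched with a toy stateful decoder. Everything here is hypothetical (`shift_state`, `decode_byte`, `decode_buffer`, and the rule that shifted bytes are offset by 0x80); it only shows where the state is restored and saved in each style.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int shifted;  /* toy decoder state, toggled by SO (0x0E) and SI (0x0F) */
} shift_state;

/* Old style: one call per byte. State is restored and saved on every call.
 * Returns the number of wchars written to *out (0 or 1). */
static int decode_byte(shift_state *st, unsigned char c, unsigned int *out)
{
    int shifted = st->shifted;              /* restore */
    int n = 0;
    if (c == 0x0E) {
        shifted = 1;
    } else if (c == 0x0F) {
        shifted = 0;
    } else {
        *out = shifted ? c + 0x80u : c;
        n = 1;
    }
    st->shifted = shifted;                  /* save */
    return n;
}

/* New style: one call per buffer. State is restored once, kept in a local
 * variable for the whole loop, and saved once at the end.
 * `out` must have room for at least `len` wchars. */
static size_t decode_buffer(shift_state *st, const unsigned char *in,
                            size_t len, unsigned int *out)
{
    assert(out != NULL);
    int shifted = st->shifted;              /* restore once */
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c == 0x0E)
            shifted = 1;
        else if (c == 0x0F)
            shifted = 0;
        else
            out[n++] = shifted ? c + 0x80u : c;
    }
    st->shifted = shifted;                  /* save once */
    return n;
}
```

Both functions produce the same wchars for the same input; the buffer version simply avoids the per-unit function call and state traffic.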

# dcaa010f 07-Aug-2020 Alex Dowad

Strict validation of conversion flags to mb_convert_kana

mb_convert_kana is controlled by user-provided flags, which specify what it should convert
and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth
numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine
inverse flags.

But, clever reader of commit logs, you will surely say: What if I want all my halfwidth
numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too
clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and
will never be used, and face up to another stark reality: mb_convert_kana does not work
for that case, and never has. This was probably never noticed because nobody ever tried.

Disallowing useless combinations of flags gives freedom to rearrange the kana conversion
code without changing behavior.

We can also reject unrecognized flags. This may help users to catch bugs.

Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized
at all).
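
One possible shape for this kind of validation is sketched below. The flag names and bit values are hypothetical, chosen for the example; they are not mbstring's real constants, and the real code parses a string of option letters first.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for illustration only. */
enum {
    TOY_FW_NUM_TO_HW   = 1 << 0,  /* fullwidth numerals -> halfwidth */
    TOY_HW_NUM_TO_FW   = 1 << 1,  /* halfwidth numerals -> fullwidth */
    TOY_FW_ALPHA_TO_HW = 1 << 2,  /* fullwidth letters  -> halfwidth */
    TOY_HW_ALPHA_TO_FW = 1 << 3,  /* halfwidth letters  -> fullwidth */
    TOY_ALL_FLAGS      = (1 << 4) - 1
};

static bool toy_flags_valid(unsigned int flags)
{
    static const unsigned int inverse_pairs[][2] = {
        { TOY_FW_NUM_TO_HW,   TOY_HW_NUM_TO_FW   },
        { TOY_FW_ALPHA_TO_HW, TOY_HW_ALPHA_TO_FW },
    };

    /* Reject bits which are not recognized at all. */
    if (flags & ~(unsigned int)TOY_ALL_FLAGS)
        return false;

    /* Reject any flag combined with its own inverse. */
    for (size_t i = 0; i < sizeof(inverse_pairs) / sizeof(inverse_pairs[0]); i++) {
        if ((flags & inverse_pairs[i][0]) && (flags & inverse_pairs[i][1]))
            return false;
    }
    return true;
}
```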

# 0957f54e 31-Aug-2021 Alex Dowad

Treat truncated escape sequences for CP5022{0,1,2} as error

# 64e379d8 31-Aug-2021 Alex Dowad

Declare CP50222 flush function as 'static'

# 626f0fec 30-Jul-2021 Alex Dowad

Remove some dead code from mbstring

mbstring has a great deal of dead code. Some common types are:

- Default switch clauses which will never be taken
- If clauses intended to convert codepoints which were not present in
  a conversion table... but the codepoint in question *is* in the table,
  so the if clause is not needed.
- Bounds checks in places where it is not possible for a value to ever
  be out of bounds.
- Checks to see if an unmatched Unicode codepoint is in CP932 extension
  range 3... but every codepoint in range 3 is also in range 2, so no
  codepoint will ever be matched and converted by that code.

# e3f6a9fb 13-Aug-2021 Alex Dowad

CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119

mbstring has always had the conversion tables to support CP932 codes
in ku 115-119, and the conversion code for CP5022x has an 'if' clause
specifically to handle such characters... but that 'if' clause was dead
code, since a guard clause earlier in the same function prevented it
from accepting 2-byte characters with a starting byte of 0x93-0x97.

Adjust the guard clause so that these characters can be converted as
the original author apparently intended.

The code which handles ku 115-119 is the part which reads:

    } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) {
        w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];

# 776296e1 30-Aug-2021 Alex Dowad

mbstring no longer provides 'long' substitutions for erroneous input bytes

Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.

However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyways.

More to the point, a search of publicly available PHP code indicates
that nobody is really using this feature anyways.

Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.

However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.

With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.

One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.

Therefore, amend all these functions to return 0 for success.
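
The clash between "return the received value" and "-1 means abort" can be reproduced in a few lines. This is a toy model with hypothetical names (`TOY_BAD_INPUT`, `toy_sink`, `toy_feed`), assuming an ASCII-only input stage in which any byte of 0x80 or above is an error.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BAD_INPUT (-1)   /* the single error marker */

typedef struct {
    int out[8];
    size_t len;
} toy_sink;

typedef int (*toy_output_fn)(int c, toy_sink *s);

/* Old convention: return the received value to indicate success. */
static int old_style_output(int c, toy_sink *s)
{
    assert(s->len < 8);
    s->out[s->len++] = c;
    return c;
}

/* New convention: return 0 to indicate success. */
static int new_style_output(int c, toy_sink *s)
{
    assert(s->len < 8);
    s->out[s->len++] = c;
    return 0;
}

/* Input stage: passes each byte on, or the error marker for invalid bytes.
 * A return value of -1 from the output stage means 'abort immediately'. */
static int toy_feed(const unsigned char *in, size_t n,
                    toy_output_fn output, toy_sink *s)
{
    for (size_t i = 0; i < n; i++) {
        int c = in[i] < 0x80 ? in[i] : TOY_BAD_INPUT;
        if (output(c, s) == -1)
            return -1;
    }
    return 0;
}
```

With `old_style_output`, the first invalid byte makes the output stage return -1, so the conversion is cut short; with `new_style_output`, the error marker is recorded and the remaining input is still processed.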

# 958ef47d 08-May-2021 Alex Dowad

When flushing CP5022x conversion filter, also flush next filter in chain

All the mbstring encoding conversion filters do this. I missed it when
adding a flush function for CP5022x.

# ebe6500a 20-Jan-2021 Alex Dowad

Fix error reporting bug for Unicode -> CP50220 conversion

To detect errors in conversion from Unicode to another text encoding, each
mbstring conversion filter object maintains a count of 'bad' characters. After
a conversion operation finishes, this count is checked to see if there was any
error.

The problem with CP50220 was that mbstring used a chain of two conversion filter
objects. The 'bad character count' would be incremented on the second object in
the chain, but this didn't do anything, as only the count on the first such
object is ever checked.

Fix this by implementing the conversion using a single conversion filter object,
rather than a chain of two. This is possible because of the recent refactoring,
which pulled out the needed logic for CP50220 conversion into a helper function.
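
A stripped-down model of the problem, with hypothetical names (`chain_filter`, `bad_count`) and the kana transformation itself left out. It assumes a toy ASCII-only output encoding, so any codepoint above 0x7F counts as 'bad'.

```c
#include <assert.h>
#include <stddef.h>

typedef struct chain_filter {
    int bad_count;              /* count of 'bad' characters seen by this object */
    struct chain_filter *next;  /* next filter in the chain, or NULL */
} chain_filter;

/* Final encoding stage: codepoints above 0x7F cannot be represented. */
static void encode_stage(chain_filter *f, int c)
{
    if (c > 0x7F)
        f->bad_count++;
}

/* Chained design: the first object hands every codepoint to the second,
 * so errors are counted on the second object. */
static void chained_feed(chain_filter *first, int c)
{
    assert(first->next != NULL);
    encode_stage(first->next, c);
}

/* Single-object design: errors are counted on the object the caller holds. */
static void single_feed(chain_filter *f, int c)
{
    encode_stage(f, c);
}

/* After conversion, only the first object's count is inspected. */
static int conversion_had_errors(const chain_filter *first)
{
    return first->bad_count > 0;
}
```

In the chained design the error is recorded but never seen by `conversion_had_errors`; in the single-object design it is.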

# 319a3408 03-Jan-2021 Alex Dowad

Simplify code for working with halfwidth/fullwidth kana conversion filter

There's no need to dynamically allocate a struct to hold the 'mode' parameter;
just store it directly in `filt->opaque`. Some other things were also being done
in an unnecessarily roundabout way.
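
Storing a small integer directly in a `void *` field, instead of allocating a struct for it, typically looks something like the sketch below. The `mode_filter` type and accessor names are hypothetical; only the `opaque` field name comes from the commit message.

```c
#include <stdint.h>

typedef struct {
    void *opaque;  /* generic per-filter data slot */
} mode_filter;

/* Store the mode in the pointer itself: nothing to allocate, nothing to free. */
static void mode_filter_set(mode_filter *f, int mode)
{
    f->opaque = (void *)(intptr_t)mode;
}

static int mode_filter_get(const mode_filter *f)
{
    return (int)(intptr_t)f->opaque;
}
```

Because no memory is allocated, copying such a filter needs no special handling for this field.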

Also, the 'copy' function for CP50220 conversion filters was *both* broken
and unnecessary. Broken, because it malloc'd memory which was never freed by
anything. Unnecessary, because the point of the copy is so that various
algorithms can try running bytes through a conversion filter and see how many
output bytes or characters result, and then back out by restoring the filters
to their previous state. But here's the thing; CP50220 conversion filters don't
hold cached bytes, which is the main thing which would need to be restored to a
previous state.

# 636251a5 03-Jan-2021 Alex Dowad

Remove useless function mbfl_filt_tl_jisx0201_jisx0208_init

This constructor function doesn't do anything different than the generic one.
There's no need to invoke it, either, when initializing a CP50220 conversion
filter.

# a06c20a1 18-Oct-2020 Alex Dowad

Remove useless constant MBFL_ENCTYPE_MBCS

This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't actually
need to check this flag anywhere, so it's better to remove it.

# 34ece408 17-Oct-2020 Alex Dowad

Remove useless mbstring encoding 'JIS-ms'

Microsoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called
CP50220, CP50221, and CP50222. All three are supported by mbstring.

Since these encodings are very similar, some code can be shared. Actually,
conversion of CP50220/1/2 to Unicode is exactly the same operation; it's when
converting from Unicode to CP50220/1/2 that some small differences arise in how
certain katakana are handled.

The most important common code was a function called `mbfl_filt_wchar_jis_ms`.
The `jis_ms` part doubtless refers to the fact that these encodings are modified
versions of 'JIS' invented by 'MS'. mbstring also went a step further and exported
'JIS-ms' to userland as a separate encoding from CP50220/1/2. If users requested
'JIS-ms' conversion, they got something like CP50220/1/2, minus their special
ways of handling half-width katakana when converting from Unicode.

But... that 'encoding' is not something which actually exists in the world outside
of mbstring. CP50220/1/2 do exist in Microsoft software, but not 'JIS-ms'.

For a text encoding conversion library, inventing new variant encodings and
implementing them is not very productive. Our interest is in handling text
encodings which real people actually use for... you know, storing actual text
and things like that.
