History log of /PHP-8.1/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c (Results 1 – 25 of 42)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# b721d0f7 10-Mar-2023 pakutoma

Fix phpGH-10648: add check function pointer into mbfl_encoding

Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are ca

Fix phpGH-10648: add check function pointer into mbfl_encoding

Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.

(The same change has already been made to PHP 8.2 and 8.3; see
6fc8d014df. This commit is backporting the change to PHP 8.1.)

show more ...


# 5812b4fe 04-Aug-2022 Alex Dowad

In legacy text conversion filters, reset filter state in 'flush' function

Up until now, I believed that mbstring had been designed such
that (legacy) text conversion filter objects shoul

In legacy text conversion filters, reset filter state in 'flush' function

Up until now, I believed that mbstring had been designed such
that (legacy) text conversion filter objects should not be
re-used after the 'flush' function is called to complete a
text conversion operation.

However, it turns out that the implementation of
_php_mb_encoding_handler_ex DID re-use filter objects
after flush. That means that functions which were based on
_php_mb_encoding_handler_ex, including mb_parse_str and
php_mb_post_handler, would break in some cases; state left
over from converting one substring (perhaps a variable name)
would affect the results of converting another substring
(perhaps the value of the same variable), and could cause
extraneous characters to get inserted into the output.

All this code should be deleted soon, but fixing it helps me
to avoid spurious failures when fuzzing the new/old code to
look for differences in behavior.

(This bug fix commit was originally applied to PHP-8.2 when fuzzing
the new mbstring text conversion code to check for differences with
the old code. Later, Kentaro Ohkouchi kindly reported a problem with
mb_encode_mimeheader under PHP 8.1 which was caused by the same issue.
Hence, this commit was backported to PHP-8.1.)

Fixes GH-9683.

show more ...


Revision tags: php-8.1.7RC1, php-8.1.4RC1, php-8.1.3, php-8.1.2RC1, php-8.1.0, php-7.3.33, php-7.3.32, php-7.3.31, php-7.3.30
# e3f6a9fb 13-Aug-2021 Alex Dowad

CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119

mbstring has always had the conversion tables to support CP932 codes
in ku 115-119, and the conversion code for CP5022x has a

CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119

mbstring has always had the conversion tables to support CP932 codes
in ku 115-119, and the conversion code for CP5022x has an 'if' clause
specifically to handle such characters... but that 'if' clause was dead
code, since a guard clause earlier in the same function prevented it
from accepting 2-byte characters with a starting byte of 0x93-0x97.

Adjust the guard clause so that these characters can be converted as
the original author apparently intended.

The code which handles ku 115-119 is the part which reads:

} else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) {
w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];

show more ...

# 776296e1 30-Aug-2021 Alex Dowad

mbstring no longer provides 'long' substitutions for erroneous input bytes

Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like

mbstring no longer provides 'long' substitutions for erroneous input bytes

Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.

However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyways.

More to the point, a search of publically available PHP code indicates
that nobody is really using this feature anyways.

Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.

However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.

With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.

One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.

Therefore, amend all these functions to return 0 for success.

show more ...

Revision tags: php-7.3.29
# 958ef47d 08-May-2021 Alex Dowad

When flushing CP5022x conversion filter, also flush next filter in chain

All the mbstring encoding conversion filters do this. I missed it when
adding a flush function for CP5022x.

Revision tags: php-7.3.28, php-7.3.27
# ebe6500a 20-Jan-2021 Alex Dowad

Fix error reporting bug for Unicode -> CP50220 conversion

To detect errors in conversion from Unicode to another text encoding, each
mbstring conversion filter object maintains a count o

Fix error reporting bug for Unicode -> CP50220 conversion

To detect errors in conversion from Unicode to another text encoding, each
mbstring conversion filter object maintains a count of 'bad' characters. After
a conversion operation finishes, this count is checked to see if there was any
error.

The problem with CP50220 was that mbstring used a chain of two conversion filter
objects. The 'bad character count' would be incremented on the second object in
the chain, but this didn't do anything, as only the count on the first such
object is ever checked.

Fix this by implementing the conversion using a single conversion filter object,
rather than a chain of two. This is possible because of the recent refactoring,
which pulled out the needed logic for CP50220 conversion into a helper function.

show more ...

# 319a3408 03-Jan-2021 Alex Dowad

Simplify code for working with halfwidth/fullwidth kana conversion filter

There's no need to dynamically allocate a struct to hold the 'mode' parameter;
just store it directly in `filt->

Simplify code for working with halfwidth/fullwidth kana conversion filter

There's no need to dynamically allocate a struct to hold the 'mode' parameter;
just store it directly in `filt->opaque`. Some other things were also being done
in an unnecessarily roundabout way.

Also, the 'copy' function for CP50220 conversion filters was *both* broken
and unnecessary. Broken, because it malloc'd memory which was never freed by
anything. Unnecessary, because the point of the copy is so that various
algorithms can try running bytes through a conversion filter and see how many
output bytes or characters result, and then back out by restoring the filters
to their previous state. But here's the thing; CP50220 conversion filters don't
hold cached bytes, which is the main thing which would need to be restored to a
previous state.

show more ...

# 636251a5 03-Jan-2021 Alex Dowad

Remove useless function mbfl_filt_tl_jisx0201_jisx0208_init

This constructor function doesn't do anything different than the generic one.
There's no need to invoke it, either, when initi

Remove useless function mbfl_filt_tl_jisx0201_jisx0208_init

This constructor function doesn't do anything different than the generic one.
There's no need to invoke it, either, when initializing a CP50220 conversion
filter.

show more ...

Revision tags: php-7.3.26, php-7.3.26RC1, php-7.3.25, php-7.3.25RC1, php-7.3.24
# a06c20a1 18-Oct-2020 Alex Dowad

Remove useless constant MBFL_ENCTYPE_MBCS

This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't

Remove useless constant MBFL_ENCTYPE_MBCS

This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't actually
need to check this flag anywhere, so it's better to remove it.

show more ...

# 34ece408 17-Oct-2020 Alex Dowad

Remove useless mbstring encoding 'JIS-ms'

MicroSoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called
CP50220, CP50221, and CP50222. All three are supported by mbstr

Remove useless mbstring encoding 'JIS-ms'

MicroSoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called
CP50220, CP50221, and CP50222. All three are supported by mbstring.

Since these encodings are very similar, some code can be shared. Actually,
conversion of CP50220/1/2 to Unicode is exactly the same operation; it's when
converting from Unicode to CP50220/1/2 that some small differences arise in how
certain katakana are handled.

The most important common code was a function called `mbfl_filt_wchar_jis_ms`.
The `jis_ms` part doubtless refers to the fact that these encodings are modified
versions of 'JIS' invented by 'MS'. mbstring also went a step further and exported
'JIS-ms' to userland as a separate encoding from CP50220/1/2. If users requested
'JIS-ms' conversion, they got something like CP50220/1/2, minus their special
ways of handling half-width katakana when converting from Unicode.

But... that 'encoding' is not something which actually exists in the world outside
of mbstring. CP50220/1/2 do exist in MicroSoft software, but not 'JIS-ms'.

For a text encoding conversion library, inventing new variant encodings and
implementing them is not very productive. Our interest is in handling text
encodings which real people actually use for... you know, storing actual text
and things like that.

show more ...

Revision tags: php-7.3.24RC1
# fcbe45de 07-Oct-2020 Alex Dowad

Remove useless mbstring encoding 'CP50220-raw'

CP50220 is a variant of ISO-2022-JP invented by MicroSoft, which handles some
Unicode characters which are not representable in ISO-2022-JP

Remove useless mbstring encoding 'CP50220-raw'

CP50220 is a variant of ISO-2022-JP invented by MicroSoft, which handles some
Unicode characters which are not representable in ISO-2022-JP by converting
them to similar characters which are representable.

What, then, is CP50220-raw? An Internet search turns up absolutely nothing.
Reference works which I consulted don't say anything about it. Other text
conversion libraries don't support it.

From looking at the code: It's just the same as CP50220, but it accepts
unmapped JIS X 0208 characters passed through from other Japanese encodings
and silently encodes them using the usual ISO-2022-JP escape sequence and
representation for JIS X 0208 characters.

It's hard to see how this could be useful. OK, let me come out and say it:
it's _not_ useful. We can confidently jettison this (mis)feature.

show more ...

# 888f5d77 13-Jan-2021 Alex Dowad

CP5022{0,1,2}: treat truncated multibyte characters as error

# 0ec34da8 05-Jan-2021 Alex Dowad

CP5022{0,1,2}: treat unrecognized escapes as error

# a50607d1 05-Jan-2021 Alex Dowad

CP5022{0,1,2}: use JISX0201 for U+203E (overline)

Same issue as d497c0e96f addressed for JIS7/JIS8, but for CP5022{0,1,2} this time.

# 5e5243ab 11-Oct-2020 Alex Dowad

CP5022{0,1,2}: convert Unicode codepoints in 'user' area (0xE000-E757) correctly

Unicode has a range of 'private' codepoints which individual applications can
use for their own purposes.

CP5022{0,1,2}: convert Unicode codepoints in 'user' area (0xE000-E757) correctly

Unicode has a range of 'private' codepoints which individual applications can
use for their own purposes. When they were inventing CP932, MicroSoft mapped
these 'private' or 'user' codepoints to ten new rows added to the JIS X 0208
character table. (JIS X 0208 is based on a 94x94 table; MS used rows 95-114
for private characters.)

`mbfl_filt_conv_wchar_jis_ms` converted these private codepoints to rows 85-94
rather than 95-114. The code included a link to a document on the OpenGroup
web site, dating back to 1996 [1], which proposed mapping private codepoints to
these rows. However, that is not consistent with what mbstring does when
converting CP5022x to Unicode.

There seems to be a dearth of information on CP5022x on the web. However, I
did find one (Japanese-language) page on CP50221, which states that it maps
kuten codes 0x7F21-0x927E to the 'private' Unicode codepoints [2].

As a side note, using rows higher than 95 does seem to defeat one purpose of
using an ISO-2022-JP variant: ISO-2022-JP was specifically designed to be
"7-bit clean", but once you go beyond row 95, the ku codes are 0x80 and up,
so 8 bits are needed.

[1] https://web.archive.org/web/20000229180004/http://www.opengroup.or.jp/jvc/cde/ucs-conv.html
[2] https://www.wdic.org/w/WDIC/Microsoft%20Windows%20Codepage%20%3A%2050221

show more ...

# 6e9c8386 11-Oct-2020 Alex Dowad

CP5022{0,1,2}: convert characters in ku 0x2D (13th row) correctly

Essentially, CP5022{0,1,2} are to CP932 as ISO-2022-JP is to Shift-JIS.
As Shift-JIS and ISO-2022-JP both encode charact

CP5022{0,1,2}: convert characters in ku 0x2D (13th row) correctly

Essentially, CP5022{0,1,2} are to CP932 as ISO-2022-JP is to Shift-JIS.
As Shift-JIS and ISO-2022-JP both encode characters from the JIS X 0208 charset,
CP932 and CP5022x both encode characters from JIS X 0208 _plus_ extra characters
added as MicroSoft vendor extensions.

Among the added characters are a number of symbols which MS put in the 13th row
of the 94x94 character table. (In JIS X 0208, that row is empty.)

mbfilter_cp50220x.c had an `if` clause which was intended to handle the
conversion of characters in that 13th row, but it was dead code, as the previous
clause was always true in those cases. The solution is to reverse the order of
those two clauses (just as they already appeared in mbfilter_cp932.c).

show more ...

# cdd07242 08-Oct-2020 Alex Dowad

Stricter handling of erroneous input when converting CP5022{0,1,2} text encoding

Don't allow escape sequences to start in the middle of a multibyte character.
Also, don't silently pass t

Stricter handling of erroneous input when converting CP5022{0,1,2} text encoding

Don't allow escape sequences to start in the middle of a multibyte character.
Also, don't silently pass through illegal bytes which appear where the 2nd
byte of a multibyte character should be.

show more ...

# ecf71847 14-Nov-2020 Alex Dowad

Convert U+FF5E (FULLWIDTH TILDE) to 0x8160 (WAVE DASH) in SJIS variants

By entering this character in the JIS X 0208 conversion table, we can
remove a bunch of explicit `if` clauses in d

Convert U+FF5E (FULLWIDTH TILDE) to 0x8160 (WAVE DASH) in SJIS variants

By entering this character in the JIS X 0208 conversion table, we can
remove a bunch of explicit `if` clauses in different conversion filters.
It also means that U+FF5E can be converted into SJIS-mac now; I don't
know why this one SJIS variant rejected U+FF5E before, since 0x8160
means the same thing in SJIS-mac as the others.

show more ...

# 5ffcf563 07-Oct-2020 Alex Dowad

Don't pass invalid JIS X 0212, JIS X 0213, and Windows-CP932 characters through

Similarly to JIS X 0208, mbstring would pass kuten codes which are not mapped
in the JIS X 0212, JIS X 021

Don't pass invalid JIS X 0212, JIS X 0213, and Windows-CP932 characters through

Similarly to JIS X 0208, mbstring would pass kuten codes which are not mapped
in the JIS X 0212, JIS X 0213, or CP932 character sets through silently when
converting to another Japanese encoding.

show more ...

# 8ae04733 07-Oct-2020 Alex Dowad

Don't pass invalid JIS X 0208 characters through

Many Japanese encodings, such as JIS7/8, Shift JIS, ISO-2022-JP, EUC-JP, and
so on encode characters from the JIS X 0208 character set. J

Don't pass invalid JIS X 0208 characters through

Many Japanese encodings, such as JIS7/8, Shift JIS, ISO-2022-JP, EUC-JP, and
so on encode characters from the JIS X 0208 character set. JIS X 0208 is based
on the concept of a 94x94 table, with numbered rows and columns. However,
more than a thousand of the cells in that table are empty; JIS X 0208 does not
actually use all 94x94=8,836 possible kuten codes.

mbstring had a dubious feature whereby, if a Japanese string contained one of
these 'unmapped' kuten codes, and it was being converted to another Japanese
encoding which was also based on JIS X 0208, the non-existent character would
be silently passed through, and the unmapped kuten code would be re-encoded
using the normal encoding method of the target text encoding.

Again, this _only_ happened if converting the text with the funky kuten code
to a Japanese encoding. If one tried converting it to Unicode, mbstring would
treat that as an error.

If somebody, somewhere, made their own private extension to JIS X 0208, and
used the regular Japanese encodings like Shift JIS and EUC-JP to encode this
private character set, then this feature might conceivably be useful. But how
likely is that? If someone is using Shift JIS, EUC-JP, ISO-2022-JP, etc. to
encode a funky version of JIS X 0208 with extra characters added, then that
should be treated as a separate text encoding.

The code which flags such characters with MBFL_WCSPLANE_JIS0208 is retained
solely for error reporting in `mbfl_filt_conv_illegal_output`.

show more ...

# 3e7acf90 04-Nov-2020 Alex Dowad

Remove mbstring identify filters

mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.

Remove mbstring identify filters

mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same type, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.

show more ...

Revision tags: php-7.3.23, php-7.3.23RC1, php-7.3.22, php-7.3.22RC1, php-7.3.21, php-7.3.21RC1
# 0ffc1f55 16-Jul-2020 Alex Dowad

Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c

- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for thi

Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c

- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for things which
will never be NULL)
- Don't return success/failure code from functions which can never fail
- For encoding structs, don't use pointers to pointers to pointers for the
list of alias strings. Pointers to pointers (2 levels of indirection)
is what actually makes sense. This gets rid of some extraneous
dereference operations.

show more ...

# 3f1851de 19-Sep-2020 Alex Dowad

Avoid compiler warnings related to mbstring flush functions

# b1c5532a 15-Sep-2020 Remi Collet

fix mbfl function prototypes
re-add mbfl_convert_filter_feed API
re-add pointer cast

# a2b40ee9 16-Jul-2020 Alex Dowad

Remove unneeded function mbfl_filt_ident_common_dtor

This was the default destructor for mbfl_identify_filter structs, but there's nothing
we actually need to do to those structs before

Remove unneeded function mbfl_filt_ident_common_dtor

This was the default destructor for mbfl_identify_filter structs, but there's nothing
we actually need to do to those structs before freeing them.

show more ...

12