==========================================
  README for I18N Package
==========================================

o Name and location of package

Name:           php-3.0.18-i18n-ja-2
Location:       http://www.happysize.co.jp/techie/php-ja-jp/
                ftp://ftp.happysize.co.jp/php-ja-jp/
                http://php.vdomains.org/
                ftp://ftp.vdomains.org/pub/php-ja-jp/
                http://php.jpnnet.com/

Currently, this I18N version of PHP only adds Japanese support to base
PHP.  It allows you to use Japanese in scripts and to convert between
various Japanese encodings.  With the i18n option enabled, it still
works fine with plain ASCII.  (Note: the executable is a bit larger due
to the Unicode table.)  The basic design approach is to allow other
languages to be added in the future.  Developers are encouraged to join
us!

For more information on Japanese encodings, please refer to the
section "Additional Notes."


o What is this package?

This package allows you to handle multiple Japanese encodings (SJIS, EUC,
UTF-8, JIS) in PHP.  If you find any bugs in this package, please report
them to the appropriate mailing list.  For now, the PHP-jp mailing list
is the best place for this.

PHP-jp ML       mailto:PHP-jp@sidecar.ics.es.osaka-u.ac.jp
                http://sidecar.ics.es.osaka-u.ac.jp/php-jp/
                (discussions are in Japanese)


o Who should use this

Due to the lack of documentation, this package is not intended for
beginners.  If something goes wrong, be prepared to fix it on your own.


o Warranty and Copyright

There is no warranty with this package.  Use it at your own risk.

Please refer to the source code for the copyrights.  In general, each
program's copyright is owned by its programmer.  You may not use this
package in any form unless you comply with the copyright holders'
restrictions.


o Redistribution

As described in the source code, this package and its components may be
redistributed with certain restrictions.

Because this package is still in beta, please redistribute it as an
entire package rather than as a patch.  We would prefer to have it
distributed as one single package (not a patch of a patch of a patch),
so please avoid releasing patches against this package.


o Who made this

A team of volunteers, PHP3 Internationalization, has been contributing
their free time to produce it.  Although we are not related to the core
PHP programmers, we hope to have our modifications merged into the
core distribution in the near future.  Thus, we did not call this a
"Japanese Patch" (or distribution).  Our final goal is a truly
internationalized PHP!

If you are interested in this project, please drop us a line.

Contact Address:
        phpj-dev@kage.net
        (Discussions are in Japanese, but feel free to write us in English)

Webpage (English and Japanese):
        http://php.jpnnet.com/

Project Outline (Japanese):
        http://www.happysize.co.jp/techie/php-ja-jp/spec.htm

Developers:
        Hironori Sato <satoh@jpnnet.com>
        Shigeru Kanemoto <sgk@happysize.co.jp>
        Tsukada Takuya <tsukada@fminn.nagano.nagano.jp>
        U. Kenkichi <kenkichi@axes.co.jp>
        Tateyama  <tateyan@amy.hi-ho.ne.jp>
        Other gracious contributors


o Future plans

- fulfill what is written in the project outline
- support languages other than Japanese
- make the character conversion available as a library (?)
- more testing


o Special Thanks to

PHP Japanese webpage maintainer, Hirokawa-san
        http://www.cityfujisawa.ne.jp/%7Elouis/apps/phpfi/
PHP-JP ML's Yamamoto-san
        http://sidecar.ics.es.osaka-u.ac.jp/php-jp/
Previous jp-patch developers



==========================================
  Advantages of using I18N package
==========================================

- allows you to use various character encodings for script files and
  http output
- detects the character encoding of POST/GET/COOKIE data
- proper mail output using JIS as body and MIME/Base64/JIS subject
- sets the proper charset if the http output's Content-Type is text/html
- stable character encoding conversion
- multibyte regex



==========================================
  Installation
==========================================

o Summary

Add the --enable-i18n option when running configure.  Add any other
options appropriate for your own setup as well.

Don't forget to copy php3.ini-dist to the desired location.
(ex. /usr/local/lib/php3.ini)

If you have already installed PHP3, copy all the entries in php3.ini-dist
which start with "i18n.xxxx" to php3.ini.


o configure option
    --enable-i18n
      include i18n features

    --enable-mbregex
      include multibyte regex library
      (without i18n enabled, mbregex functions will not function)


o creating cgi version

    % tar xvzf php-3.0.18-i18n-ja-2.tar.gz
    % cd php-3.0.18-i18n-ja-2
    % ./configure --enable-i18n --enable-mbregex
    % make


o creating Apache version (regular module)

    % tar xvzf php-3.0.18-i18n-ja-2.tar.gz
    % tar xvzf apache_1.3.x.tar.gz
    % cd apache_1.3.x
    % ./configure
    % cd ../php-3.0.18-i18n-ja-2
    % ./configure --with-apache=../apache_1.3.x --enable-i18n --enable-mbregex
    % make
    % make install
    % cd ../apache_1.3.x
    % ./configure --activate-module=src/modules/php3/libphp3.a
    % make
    % make install


o creating Apache DSO version

    create DSO capable Apache first
    % tar xvzf apache_1.3.x.tar.gz
    % cd apache_1.3.x
    % ./configure --enable-shared=max
    % make
    % make install

    now create php3
    % cd ..
    % tar xvzf php-3.0.18-i18n-ja-2.tar.gz
    % cd php-3.0.18-i18n-ja-2
    % ./configure --with-apxs=/usr/local/apache/bin/apxs --enable-i18n \
        --enable-mbregex
    % make
    % make install


==========================================
  Additional Notes
==========================================

o Multibyte regex library

Since beta4, we have included the multibyte (mb) regex library which comes
with Ruby.  With this addition, you can now use regex in EUC, SJIS and
UTF-8 encodings.  To avoid conflicts with the HSREGEX library included
with Apache, each function has been renamed.  Therefore, the mb regex
functions are named differently from the original ereg functions in PHP.
The character encoding used by mb regex is the one configured in
i18n.internal_encoding.


o Binary Output

If the http output encoding is set to anything other than 'pass', output
is automatically converted from the internal encoding to the http output
encoding.  Thus, if you output raw binary data, it may be corrupted.
In that case, set http_output to 'pass'.

ex.
        <?
            i18n_http_output("pass");
            ...
            echo $the_binary_data_string;
        ?>


o Content-Type

Depending on the setting of http_output, PHP will output the proper charset.
ex. Content-Type: text/html; charset="..."

Be aware of the following:

- If you set the Content-Type header using the header() function, that
  will override the automatic addition of charset.
- Be cautious about when you call i18n_http_output(): if any output is
  made prior to the call, the header may already have been sent to the
  client.


o In the event of trouble

If you find any bugs or trouble, please contact us at the above address.
It helps us track down the problem if you send us the script as well.

If you encounter a memory related error such as a segmentation violation,
add --enable-debug when you run configure.  This gives you more detailed
information on where the error occurred.  The error is written to the
server log, or to the regular http output in CGI mode.


o About Japanese encodings

For historical reasons, there are multiple character encodings used
for Japanese.  The most common encodings are: SJIS, EUC, JIS, and UTF-8.
Here is a (very) brief description of each:

EUC
  commonly used in UNIX environments
  8bit-8bit combo
  always >=0x80

SJIS
  commonly used on Macs and PCs
  similar to EUC
  mostly 8bit-8bit (some 8bit-7bit)
  mostly >=0x80
  there are some halfwidth (size of ASCII) katakana, each encoded as a
  single 8bit byte

JIS
  commonly used in 7bit environments (nntp and smtp)
  multibyte text starts with an escape character, \033, and a few more
  characters

UTF-8
  variable-length 8bit encoding of Unicode (ASCII takes 1 byte, most
  Japanese characters take 3 bytes)
  covers most languages existing in this world
  see http://www.unicode.org/ for more detail

Because of all these character encodings, PHP needs to translate
between them on the fly.  Also, the addition of the mb regex
library allows you to handle mb strings without fear of getting a mb
char chopped in half.

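As a rough illustration of the byte patterns described above, the
following sketch encodes the same three-character text in each of the
four encodings and prints the resulting bytes.  It is written in Python
rather than PHP, and the Python codec names (euc_jp, shift_jis,
iso2022_jp) are only stand-ins for the encoding names used in this
README; none of this is part of the package.

```python
# Python codec names standing in for the encoding names used in this README.
CODECS = {
    "EUC-JP": "euc_jp",
    "SJIS": "shift_jis",
    "JIS": "iso2022_jp",
    "UTF-8": "utf-8",
}

def encode_all(text):
    """Return {README encoding name: encoded bytes} for the given text."""
    return {name: text.encode(codec) for name, codec in CODECS.items()}

encoded = encode_all("日本語")
for name, data in encoded.items():
    # Show the byte count and the bytes in hex for each encoding.
    print(f"{name:6} {len(data):2d} bytes: {data.hex(' ')}")
```

For this text, the EUC-JP form uses only bytes >=0x80, the SJIS form
contains one byte below 0x80 (7Bh), the JIS form is pure 7bit and starts
with the escape sequence ESC $ B, and the UTF-8 form takes three bytes
per character.
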
Since Japanese is not the only language with multiple encodings, we
encourage other developers to modify our code to suit their needs.  We
definitely need people to work on Korean, Chinese (both traditional
and simplified), and Russian.  Let us know if you are interested in
this project!



==========================================
  php3.ini setting
==========================================

The following ini options allow you to change the default settings.
Define these settings in the global section of php3.ini.

All keywords are case-insensitive.

o Encoding naming

    For each encoding, there are three names: standardized, alias, MIME

    - UTF-8
         standard: UTF-8
         alias: N/A
         mime: UTF-8

    - ASCII
         standard: ASCII
         alias: N/A
         mime: US-ASCII

    - Japanese EUC
         standard: EUC-JP
         alias: EUC, EUC_JP, eucJP, x-euc-jp
         mime: EUC-JP

    - Shift JIS
         standard: SJIS
         alias: x-sjis, MS_Kanji
         mime: Shift_JIS

    - JIS
         standard: JIS
         alias: N/A
         mime: ISO-2022-JP

    - Quoted-Printable
         standard: Quoted-Printable
         alias: qprint
         mime: N/A

    - BASE64
         standard: BASE64
         alias: N/A
         mime: N/A

    - no conversion
         standard: pass
         alias: none
         mime: N/A

    - auto encoding detection
         standard: auto
         alias: unknown
         mime: N/A

    * N/A - Not Applicable

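The naming table can be read as a case-insensitive lookup from any of
the three names to the standard name and the MIME name.  The sketch
below (Python, for illustration only) transcribes the table; the
function name resolve() is hypothetical, and accepting the MIME name as
an input is an assumption, not something this README states.

```python
# (standard name, aliases, MIME name) transcribed from the table above.
ENCODING_TABLE = [
    ("UTF-8", [], "UTF-8"),
    ("ASCII", [], "US-ASCII"),
    ("EUC-JP", ["EUC", "EUC_JP", "eucJP", "x-euc-jp"], "EUC-JP"),
    ("SJIS", ["x-sjis", "MS_Kanji"], "Shift_JIS"),
    ("JIS", [], "ISO-2022-JP"),
    ("Quoted-Printable", ["qprint"], None),
    ("BASE64", [], None),
    ("pass", ["none"], None),
    ("auto", ["unknown"], None),
]

# Build a case-insensitive lookup from every known name to its table row.
LOOKUP = {}
for standard, aliases, mime in ENCODING_TABLE:
    for name in [standard, *aliases] + ([mime] if mime else []):
        LOOKUP[name.lower()] = (standard, mime)

def resolve(name):
    """Return (standard name, MIME name or None); raises KeyError if unknown."""
    return LOOKUP[name.lower()]

print(resolve("eucJP"))
print(resolve("MS_KANJI"))
```
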
o i18n.http_output - default http output encoding

    i18n.http_output = EUC-JP|SJIS|JIS|UTF-8|pass
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8
        pass: no conversion

    The default is pass (the internal encoding is used as is).
    It can be changed on the fly using i18n_http_output().


o i18n.internal_encoding - internal encoding

    i18n.internal_encoding = EUC-JP|SJIS|UTF-8
        EUC-JP : EUC
        SJIS: SJIS
        UTF-8: UTF-8

    The default is EUC-JP.

    The PHP parser is designed around ISO-8859-1.  Other encodings
    have to satisfy the following conditions in order to be used:
       - per byte encoding
       - single byte characters in the range 00h-7fh, compatible
         with ASCII
       - multibyte characters that do not use bytes in 00h-7fh
    In the case of Japanese, EUC-JP and UTF-8 are the only encodings
    that meet these criteria.

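The third condition can be checked mechanically.  The following sketch
(Python, for illustration only; the function name is hypothetical)
lists the characters of a sample text whose multibyte form contains a
byte in 00h-7fh under a given encoding.

```python
def low_bytes_in_multibyte(text, codec):
    """Return (character, hex bytes) for multibyte characters using 00h-7Fh."""
    offenders = []
    for ch in text:
        data = ch.encode(codec)
        if len(data) > 1 and any(b < 0x80 for b in data):
            offenders.append((ch, data.hex(" ")))
    return offenders

sample = "表示ソフト"
for codec in ("euc_jp", "utf-8", "shift_jis"):
    print(codec, low_bytes_in_multibyte(sample, codec))
```

EUC-JP and UTF-8 report nothing for this sample, while SJIS reports
several characters, among them two whose second byte is 5Ch, the
backslash.
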
    If i18n.internal_encoding and i18n.http_output differ, conversion
    takes place at the time of output.  If you convert any data within
    PHP scripts to URL encoding, BASE64 or Quoted-Printable, the data
    stays in the encoding defined in i18n.internal_encoding.  Thus, if
    you would prefer to encode in accordance with i18n.http_output, you
    need to convert the encoding manually.

    ex. $str = urlencode( i18n_convert($str, i18n_http_output()) );

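To see why the manual conversion matters, the sketch below (Python, for
illustration only; urlencode_as() is a hypothetical helper) URL-encodes
the same text after converting it to two different byte encodings.  The
results differ, so the receiving side has to know which encoding was
used.

```python
from urllib.parse import quote

def urlencode_as(text, codec):
    """Percent-encode text after converting it to the given byte encoding."""
    return quote(text.encode(codec), safe="")

print(urlencode_as("日本語", "euc_jp"))     # percent-encoded EUC-JP bytes
print(urlencode_as("日本語", "shift_jis"))  # percent-encoded SJIS bytes
```
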
    Encodings which use escape sequences, such as ISO-2022-** and HZ,
    cannot be used as the internal encoding.  If used, they result in
    the following problems:
       - the parser reports spurious errors
       - magic_quotes_*** breaks the encoding (SJIS may have a similar
         problem)
       - string manipulation and regex will malfunction


o i18n.script_encoding - script encoding

    i18n.script_encoding = auto|EUC-JP|SJIS|JIS|UTF-8
        auto: automatic
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8

    The default is auto.
    The script is converted from this encoding to i18n.internal_encoding
    before it enters the script parser.

    Be aware that auto detection may fail under some conditions.
    For best auto detection, add a multibyte character at the beginning
    of the script.


o i18n.http_input - handling of http input (GET/POST/COOKIE)

    i18n.http_input = pass|auto
        auto: auto conversion
        pass: no conversion

    The default is auto.
    If set to pass, no conversion will take place.
    If set to auto, the encoding is detected automatically.  If
    detection is successful, the input is converted to the internal
    encoding.  If not, the input is assumed to be in the encoding
    defined in i18n.http_input_default.

o i18n.http_input_default - default http input encoding

    i18n.http_input_default = pass|EUC-JP|SJIS|JIS|UTF-8
        pass: no conversion
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8

    The default is pass.
    This option is only effective while i18n.http_input is set to
    auto.  If the auto detection fails, the http input is assumed to be
    in this encoding and is converted to the internal encoding.
    If set to pass, no conversion will take place.

o sample settings

    1) For most flexibility, we recommend the following settings:

         i18n.http_output = SJIS
         i18n.internal_encoding = EUC-JP
         i18n.script_encoding = auto
         i18n.http_input = auto
         i18n.http_input_default = SJIS

    2) To avoid unexpected encoding problems, try these:

         i18n.http_output = pass
         i18n.internal_encoding = EUC-JP
         i18n.script_encoding = pass
         i18n.http_input = pass
         i18n.http_input_default = pass



==========================================
  PHP functions
==========================================

The following describes the additional PHP functions.

All keywords are case-insensitive.

o i18n_http_output(encoding)
o encoding = i18n_http_output()

    Sets the http output encoding.  Any output following the call is
    converted to this encoding.  If no argument is given, the current
    http output encoding setting is returned.

    encodings
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8
        pass: no conversion

    NONE is not allowed


o encoding = i18n_internal_encoding()

    Returns the current internal encoding as a string.

    internal encoding
        EUC-JP : EUC
        SJIS: SJIS
        UTF-8: UTF-8


o encoding = i18n_http_input()

    Returns http input encoding.

    encodings
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8
        pass: no conversion (only if i18n.http_input is set to pass)


o string = i18n_convert(string, encoding)
o string = i18n_convert(string, encoding, pre-conversion-encoding)

    Returns the string converted to the desired encoding.  If
    pre-conversion-encoding is not given, the string is assumed
    to be in the internal encoding.

    encoding
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8
        pass: no conversion

    pre-conversion-encoding
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8
        pass: no conversion
        auto: auto detection


o encoding = i18n_discover_encoding(string)

    Returns the encoding of the given string (as a string).

    encoding
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8
        ASCII: ASCII (only 09h, 0Ah, 0Dh, 20h-7Eh)
        pass: unable to determine (text is too short to determine)
        unknown: unknown or possible error

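The sketch below shows one naive way to approach detection and why short
inputs can stay undetermined.  It is written in Python for illustration
only and is NOT the algorithm used by this package: it simply tries to
decode the bytes with each candidate codec and reports every encoding
that fits.  The function name and the decode-and-see strategy are
assumptions.

```python
# Bytes that count as plain ASCII text: 09h, 0Ah, 0Dh, 20h-7Eh.
TEXT_BYTES = {0x09, 0x0A, 0x0D, *range(0x20, 0x7F)}

def candidate_encodings(data):
    """Return the README encoding names under which `data` decodes cleanly."""
    if all(b in TEXT_BYTES for b in data):
        return ["ASCII"]
    if all(b < 0x80 for b in data):
        # 7bit data with control bytes: only an escape-sequence encoding fits.
        codecs = [("JIS", "iso2022_jp")]
    else:
        codecs = [("UTF-8", "utf-8"), ("EUC-JP", "euc_jp"), ("SJIS", "shift_jis")]
    found = []
    for name, codec in codecs:
        try:
            data.decode(codec)
            found.append(name)
        except UnicodeDecodeError:
            pass  # not valid in this encoding
    return found

print(candidate_encodings("日本語".encode("iso2022_jp")))
print(candidate_encodings("あい".encode("euc_jp")))
```

The two hiragana encoded in EUC-JP are also a valid SJIS byte sequence
(four halfwidth characters), so both names are returned.  This is the
kind of ambiguity that makes short inputs hard to detect.
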

o int = mbstrlen(string)
o int = mbstrlen(string, encoding)

    Returns the length of the given string in characters.  If no encoding
    is given, the string is assumed to be in the internal encoding.

    encoding
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8
        auto: automatic

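The difference between byte length and character length can be sketched
as follows (Python, for illustration only; char_length() is a
hypothetical stand-in, not this package's implementation).

```python
def char_length(data, codec):
    """Number of characters in `data`, given the codec it is encoded in."""
    return len(data.decode(codec))

raw = "日本語abc".encode("euc_jp")
# Byte count first, then character count.
print(len(raw), char_length(raw, "euc_jp"))
```
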

o int = mbstrpos(string1, string2)
o int = mbstrpos(string1, string2, start)
o int = mbstrpos(string1, string2, start, encoding)

    Same as strpos.  If no encoding is given, the string is assumed
    to be in the internal encoding.

    encoding
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8


o int = mbstrrpos(string1, string2)
o int = mbstrrpos(string1, string2, encoding)

    Same as strrpos.  If no encoding is given, the string is assumed
    to be in the internal encoding.

    encoding
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8


o string = mbsubstr(string, position)
o string = mbsubstr(string, position, length)
o string = mbsubstr(string, position, length, encoding)

    Same as substr.  If no encoding is given, the string is assumed
    to be in the internal encoding.

    encoding
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8


o string = mbstrcut(string, position)
o string = mbstrcut(string, position, length)
o string = mbstrcut(string, position, length, encoding)

    Similar to substr, but the string is cut without chopping a mb
    character in half.  If position points at the 2nd byte of a mb
    character, the cut starts from the first byte of that character.
    In other words, if you set length to 5, you will only get two
    (2-byte) mb characters.  If no encoding is given, the string is
    assumed to be in the internal encoding.

    encoding
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8

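The cutting rule can be sketched as below (Python, for illustration
only).  byte_cut() is a hypothetical stand-in: it assumes that length is
a non-negative byte budget counted from the adjusted start, and it only
handles stateless encodings (EUC-JP, SJIS, UTF-8), not JIS.  The real
function may differ in these details.

```python
def byte_cut(data, start, length, codec="euc_jp"):
    """Cut up to `length` bytes from `data` without splitting a character."""
    # Byte offsets at which a character starts (plus the end of the string).
    bounds = [0]
    for ch in data.decode(codec):
        bounds.append(bounds[-1] + len(ch.encode(codec)))
    start = max(0, min(start, len(data)))
    # Move the start back to the first byte of the character it falls in.
    begin = max(b for b in bounds if b <= start)
    # Move the end back to the last character boundary within the budget.
    end = max(b for b in bounds if b <= begin + length)
    return data[begin:end]

raw = "日本語".encode("euc_jp")
print(byte_cut(raw, 0, 5).decode("euc_jp"))  # a 5 byte budget fits two characters
```
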

o string = i18n_mime_header_encode(string)
    MIME encodes the string in the format =?ISO-2022-JP?B?[string]?=.


o string = i18n_mime_header_decode(string)
    MIME decodes the string.

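The header format can be reproduced with a few lines of Python (for
illustration only; both helper functions are hypothetical and are not
this package's implementation).  The text is converted to JIS
(ISO-2022-JP), Base64 encoded, and wrapped as one encoded-word.  The
sketch does not split long text into several encoded-words.

```python
import base64
from email.header import decode_header

def mime_header_encode(text):
    """Wrap text as a single =?ISO-2022-JP?B?...?= encoded-word."""
    payload = base64.b64encode(text.encode("iso2022_jp")).decode("ascii")
    return f"=?ISO-2022-JP?B?{payload}?="

def mime_header_decode(header):
    """Decode encoded-words back to text using the standard library parser."""
    parts = []
    for data, charset in decode_header(header):
        if isinstance(data, bytes):
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)  # plain, unencoded header text
    return "".join(parts)

subject = mime_header_encode("日本語のメール")
print(subject)
print(mime_header_decode(subject))
```
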

o string = i18n_ja_jp_hantozen(string)
o string = i18n_ja_jp_hantozen(string, option)
o string = i18n_ja_jp_hantozen(string, option, encoding)

    Converts between fullwidth and halfwidth characters.

    option
    The following options are allowed.  The default is "KV".
    Acronym: FW = fullwidth, HW = halfwidth

    "r" :  FW alphabet -> HW alphabet

    "R" :  HW alphabet -> FW alphabet

    "n" :  FW number -> HW number

    "N" :  HW number -> FW number

    "a" :  FW alpha numeric (21h-7Eh) -> HW alpha numeric

    "A" :  HW alpha numeric (21h-7Eh) -> FW alpha numeric

    "k" :  FW katakana -> HW katakana

    "K" :  HW katakana -> FW katakana

    "h" :  FW hiragana -> HW katakana

    "H" :  HW katakana -> FW hiragana

    "c" :  FW katakana -> FW hiragana

    "C" :  FW hiragana -> FW katakana

    "V" :  merge dakuon characters.  Only works with the "K" and "H"
           options

    encoding
    If no encoding is given, the string is assumed to be in the
    internal encoding.
        EUC-JP : EUC
        SJIS: SJIS
        JIS : JIS
        UTF-8: UTF-8

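The effect of the default "KV" option can be approximated on Unicode
text as follows (Python, for illustration only).  This is not how the
package implements it: the sketch works on decoded text rather than on
EUC/SJIS bytes, and it borrows Unicode NFKC normalization, applied only
to runs of halfwidth katakana, as a stand-in for the conversion table.

```python
import re
import unicodedata

# Halfwidth katakana block, including the halfwidth dakuon marks.
HALFWIDTH_KATAKANA = re.compile("[\uFF61-\uFF9F]+")

def hw_katakana_to_fw(text):
    """Convert halfwidth katakana to fullwidth, merging dakuon marks."""
    return HALFWIDTH_KATAKANA.sub(
        lambda m: unicodedata.normalize("NFKC", m.group(0)), text)

print(hw_katakana_to_fw("ｶﾞｲﾄﾞ ABC"))
```

Characters outside the halfwidth katakana range, such as the ASCII
letters in the example, are left untouched.
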

o int = mbereg(regex_pattern, string, string)
o int = mberegi(regex_pattern, string, string)
    mb version of ereg() and eregi()


o string = mbereg_replace(regex_pattern, string, string)
o string = mberegi_replace(regex_pattern, string, string)
    mb version of ereg_replace() and eregi_replace()


o string_array = mbsplit(regex, string, limit)
    mb version of split()



==========================================
  FAQ
==========================================

Here, we have gathered some questions commonly asked on the PHP-jp
mailing list.

o To use Japanese in GET method

If you need to pass Japanese text as an argument in a GET request, such as
xxxx.php?data=<Japanese text>, use the urlencode function in PHP.
Otherwise, the text may not be passed to the target PHP script properly.

ex: <a href="hoge.php?data=<? echo urlencode($data) ?>">Link</a>


o When passing data via GET/POST/COOKIE, \ character sneaks in

When the passed-on data includes ', " or \ (with SJIS as the internal
encoding, this also happens inside multibyte characters whose second
byte is \), PHP automatically inserts the escaping character, \.  To
prevent this, set magic_quotes_gpc in php3.ini from On to Off.  An
alternative workaround is to use StripSlashes().

If $quote_str is in SJIS and you would like to recover the original
Japanese text, use ereg_replace as follows:

ereg_replace(sprintf("([%c-%c%c-%c]\\\\)\\\\",0x81,0x9f,0xe0,0xfc),
        "\\1",$quote_str);

This removes the \ that was inserted after a multibyte character whose
second byte is \ (5Ch), restoring the Japanese text in $quote_str.

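The same repair can be traced at the byte level (Python, for
illustration only).  add_slashes() below is a simplified, hypothetical
model of the automatic escaping (it only handles ', " and \), and the
regular expression mirrors the ereg_replace pattern above.

```python
import re

def add_slashes(data):
    """Byte-wise escaping of ' " and backslash, unaware of mb characters."""
    out = bytearray()
    for b in data:
        if b in b"'\"\\":
            out.append(0x5C)  # insert the escaping backslash
        out.append(b)
    return bytes(out)

# SJIS lead byte, a 5Ch trail byte, then the backslash that was inserted.
EXTRA_SLASH = re.compile(rb"([\x81-\x9f\xe0-\xfc]\\)\\")

def strip_sjis_slashes(data):
    """Drop the backslash inserted after a character ending in 5Ch."""
    return EXTRA_SLASH.sub(rb"\1", data)

original = "ソフト".encode("shift_jis")
quoted = add_slashes(original)
print(original.hex(" "), "->", quoted.hex(" "))
print(strip_sjis_slashes(quoted) == original)
```
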

o Sometimes, encoding detection fails

If i18n_http_input() returns 'pass', it's likely that PHP failed to
detect whether the input is SJIS or EUC.  In such a case, add
<input type=hidden value="some Japanese text"> to the form so that the
incoming text's encoding can be detected properly.



==========================================
  Japanese Manual
==========================================
Translated manual by the "PHP Japanese Manual Project":

http://www.php.net/manual/ja/manual.php

Starting with 3.0.18-i18n-ja, we have removed doc-jp from the tarball
package.


==========================================
  Change Logs
==========================================

o 2000-10-28, Rui Hirokawa <hirokawa@php.net>

This patch is derived from php-3.0.15-i18n-ja as well as php-3.0.16 by
Kuwamura, applied to the original php-3.0.18.  It also includes the
following fixes:

1) allows you to set charset in mail().
2) fixed mbregex definitions to avoid conflicts with system regex
3) php3.ini-dist now uses PASS for http_output instead of SJIS

o 2000-11-24, Hironori Sato <satoh@yyplanet.com>

Applied the above patch and added detection for gdImageStringTTF in
configure.  The following setups are known to work:

gd-1.3-6, gd-devel-1.3-6, freetype-1.3.1-5, freetype-devel-1.3.1-5
    ImageTTFText($im,$size,$angle,$x1,$y1,$color,"/path/to/font.ttf",
        i18n_convert("日本語", "UTF-8"));
    ImageGif($im);

gd-1.7.3-1k1, gd-devel-1.7.3-1k1, freetype-1.3.1-5, freetype-devel-1.3.1-5
    ImageTTFText($im,$size,$angle,$x1,$y1,$color,"/path/to/font.ttf","日本語");
    ImagePng($im);
    * i18n_internal_encoding = EUC or SJIS

For any gd libraries before 1.6.2, you need to use i18n_convert.  For
gd-1.5.2/3, upgrade to anything above 1.7 to use ImageTTFText without
using i18n_convert.  As long as you have internal_encoding set to EUC or
SJIS, ImageTTFText should work without mojibake.  Again, make sure you
call i18n_http_output("pass") before calling ImageGif, ImagePng, ImageJpeg!

o 2000-12-09, Rui Hirokawa <hirokawa@php.net>

Fixed mail() which was causing a segmentation fault when the header was null.

775