1==========================================
2 README for I18N Package
3==========================================
4
5o Name and location of package
6
7Name: php-3.0.18-i18n-ja-2
8Location: http://www.happysize.co.jp/techie/php-ja-jp/
9 ftp://ftp.happysize.co.jp/php-ja-jp/
10 http://php.vdomains.org/
11 ftp://ftp.vdomains.org/pub/php-ja-jp/
12 http://php.jpnnet.com/
13
Currently, this I18N version of PHP only adds Japanese support to base
PHP. It lets you use Japanese in scripts and convert between the
various Japanese encodings. Plain ASCII scripts keep working with the
i18n option enabled. (Note: the executable is a bit larger because of
the Unicode table.) The basic design approach is to allow other
languages to be added in the future. Developers are encouraged to join
us!
21
22For more information on Japanese encodings, please refer to the
23section "Additional Notes."
24
25
26o What is this package?
27
28This package allows you to handle multiple Japanese encodings (SJIS, EUC,
29UTF-8, JIS) in PHP. If you find any bugs in this package, please report
30them to the appropriate mailing list. For now, the PHP-jp mailing list
31is the best place for this.
32
33PHP-jp ML mailto:PHP-jp@sidecar.ics.es.osaka-u.ac.jp
34 http://sidecar.ics.es.osaka-u.ac.jp/php-jp/
35 (discussions are in Japanese)
36
37
38o Who should use this
39
40Due to lack of documentation, it's not intended for beginners. If
41something goes wrong, be prepared to fix it on your own.
42
43
44o Warranty and Copyright
45
46There is no warranty with this package. Use it at your own risk.
47
Please refer to the source code for the copyrights. In general, each
program's copyright is owned by its programmer. You may not use this
package in any form unless you obey the copyright holders'
restrictions.
52
53
54o Redistribution
55
As described in the source code, this package and its components may
be redistributed under certain restrictions.

Because this package is still in beta, please redistribute it as a
complete package rather than in the form of a patch. We would prefer
to have it distributed as one single package (not a patch of a patch
of a patch), so please avoid releasing patches against it.
64
65
66o Who made this
67
A team of volunteers, PHP3 Internationalization, has been contributing
its free time to produce it. Although we are not related to the core
PHP programmers, we hope to have our modifications merged into the
core distribution in the near future. That is why we did not call this
a "Japanese Patch" (or distribution). Our final goal is a truly
internationalized PHP!
74
75For anyone interested in this project, please drop us a line.
76
77Contact Address:
78 phpj-dev@kage.net
79 (Discussions are in Japanese, but feel free to write us in English)
80
81Webpage (English and Japanese):
82 http://php.jpnnet.com/
83
84Project Outline (Japanese):
85 http://www.happysize.co.jp/techie/php-ja-jp/spec.htm
86
87Developers:
88 Hironori Sato <satoh@jpnnet.com>
89 Shigeru Kanemoto <sgk@happysize.co.jp>
90 Tsukada Takuya <tsukada@fminn.nagano.nagano.jp>
91 U. Kenkichi <kenkichi@axes.co.jp>
92 Tateyama <tateyan@amy.hi-ho.ne.jp>
93 Other gracious contributors
94
95
96o Future plans
97
- fulfilling what's written in the outline
- support for languages other than Japanese
- turn the character conversion into a library (?)
- more testing
102
103
104o Special Thanks to
105
106PHP Japanese webpage maintainer, Hirokawa-san
107 http://www.cityfujisawa.ne.jp/%7Elouis/apps/phpfi/
108PHP-JP ML's Yamamoto-san
109 http://sidecar.ics.es.osaka-u.ac.jp/php-jp/
110Previous jp-patch developers
111
112
113
114==========================================
115 Advantages of using I18N package
116==========================================
117
- allows you to use various character encodings for script files and
  http output
- detects the character encoding of POST/GET/COOKIE data
- proper mail output using JIS for the body and MIME/Base64/JIS for
  the subject
- if the http output's Content-Type is text/html, the proper charset
  is set
- stable character encoding conversion
- multibyte regex
125
126
127
128==========================================
129 Installation
130==========================================
131
132o Summary
133
Add the --enable-i18n option when running configure. Add any other
options that your own setup needs as well.

Don't forget to copy php3.ini-dist to the desired location
(e.g. /usr/local/lib/php3.ini).

If you have already installed PHP3, copy all the entries in php3.ini-dist
whose names start with "i18n." to your php3.ini.
142
143
144o configure option
145 --enable-i18n
146 include i18n features
147
148 --enable-mbregex
149 include multibyte regex library
150 (without i18n enabled, mbregex functions will not function)
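
The check below is a minimal sketch and is not part of the package. It
assumes only that a build configured with --enable-i18n defines
i18n_convert() and that --enable-mbregex defines mbereg(), both of which
are listed in the "PHP functions" section of this README. It reports
which of the two options the running binary was built with.

```php
<?php
// Sketch: detect the optional features by looking for one function
// that each configure option is expected to provide.
$has_i18n    = function_exists("i18n_convert");
$has_mbregex = function_exists("mbereg");

echo "i18n:    ", ($has_i18n ? "yes" : "no"), "\n";
echo "mbregex: ", ($has_mbregex ? "yes" : "no"), "\n";
?>
```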
151
152
153o creating cgi version
154
155 % tar xvzf php-3.0.18-i18n-ja-2.tar.gz
156 % cd php-3.0.18-i18n-ja-2
157 % ./configure --enable-i18n --enable-mbregex
158 % make
159
160
161o creating Apache version (regular module)
162
163 % tar xvzf php-3.0.18-i18n-ja-2.tar.gz
164 % tar xvzf apache_1.3.x.tar.gz
165 % cd apache_1.3.x
166 % ./configure
167 % cd ../php-3.0.18-i18n-ja-2
168 % ./configure --with-apache=../apache_1.3.x --enable-i18n --enable-mbregex
169 % make
170 % make install
171 % cd ../apache_1.3.x
172 % ./configure --activate-module=src/modules/php3/libphp3.a
173 % make
174 % make install
175
176
177o creating Apache DSO version
178
    create a DSO capable Apache first
    % tar xvzf apache_1.3.x.tar.gz
    % cd apache_1.3.x
    % ./configure --enable-shared=max
    % make
    % make install

    now create php3
    % cd ../php-3.0.18-i18n-ja-2
    % ./configure --with-apxs=/usr/local/apache/bin/apxs --enable-i18n \
        --enable-mbregex
    % make
    % make install
192
193
194==========================================
195 Additional Notes
196==========================================
197
198o Multibyte regex library
199
Since beta4, we have included the multibyte (mb) regex library that comes
with Ruby. With this addition, you can use regex on EUC, SJIS and UTF-8
text. To avoid conflicts with the HSREGEX library included with Apache,
each function name has been changed. As a result, the mb regex functions
are named differently from the original ereg functions in PHP. The
character encoding used by mb regex is taken from i18n.internal_encoding.
206
207
208o Binary Output
209
If the http output encoding is set to anything other than 'pass', output
is converted automatically from the internal encoding to the http output
encoding. Raw binary data that passes through this conversion may be
corrupted. Before sending binary data, set http_output to 'pass'.

ex.
    <?
    i18n_http_output("pass");    // turn off output conversion
    ...
    echo $the_binary_data_string;
    ?>
221
222
223o Content-Type
224
Depending on the setting of http_output, PHP adds the proper charset to
the Content-Type header.
ex. Content-Type: text/html; charset="..."

Be aware of the following:

- If you set the Content-Type header yourself with the header()
  function, it overrides the automatic addition of the charset.
- Call i18n_http_output() before producing any output. If output has
  already been made, the header may already have been sent out to the
  client.
235
236
237o In the event of trouble
238
If you find any bugs or run into trouble, please contact us at the above
address. It helps us track down the problem if you send us the script as
well.

If you encounter a memory related error such as a segmentation violation,
add --enable-debug when you run configure. This gives you more
detailed information on where the error occurred. The error is written
to the server log, or to the regular http output in CGI mode.
246
247
248o About Japanese encodings
249
For historical reasons, there are multiple character encodings in use
for Japanese. The most common ones are SJIS, EUC, JIS, and UTF-8.
Here is a (very) brief description of each:

EUC
    commonly used in UNIX environments
    two 8bit bytes per character
    both bytes are always >=0x80

SJIS
    commonly used on Macs and PCs
    similar to EUC
    mostly 8bit-8bit (some 8bit-7bit)
    first byte is always >=0x80, second byte may be <0x80
    halfwidth (size of ASCII) katakana are single 8bit bytes

JIS
    commonly used in 7bit environments (nntp and smtp)
    Japanese text starts with an escape sequence: \033 followed by a
    few more characters

UTF-8
    variable-length 8bit encoding of Unicode (Japanese characters
    usually take 3 bytes)
    Unicode covers most of the languages existing in this world
    see http://www.unicode.org/ for more detail
274
Because all of these encodings are in use, PHP needs to translate
between them on the fly. In addition, the mb regex library lets you
handle mb strings without fear of getting a mb character chopped in
half.
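
To make the differences concrete, the sketch below spells out the
three-character word 日本語 ("Japanese") as raw bytes in each encoding.
The helper name bytes() is made up for this example. It uses only stock
PHP functions and does not need the i18n package.

```php
<?php
// Hypothetical helper: turn a hex string such as "c6fc" into raw bytes.
function bytes($hex) {
    $s = "";
    for ($i = 0; $i < strlen($hex); $i += 2) {
        $s .= chr(hexdec(substr($hex, $i, 2)));
    }
    return $s;
}

$euc  = bytes("c6fccbdcb8ec");              // EUC-JP: 2 bytes each, all >= 0x80
$sjis = bytes("93fa967b8cea");              // SJIS: 0x7b is a 7bit second byte
$jis  = bytes("1b2442467c4b5c386c1b2842");  // JIS: ESC $ B ... ESC ( B, all 7bit
$utf8 = bytes("e697a5e69cace8aa9e");        // UTF-8: 3 bytes each

// Byte length of the same three characters in each encoding.
echo strlen($euc), " ", strlen($sjis), " ", strlen($jis), " ", strlen($utf8), "\n";
// prints: 6 6 12 9
?>
```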
279
Since Japanese is not the only language with multiple encodings, we
encourage other developers to adapt our code to their needs. We
definitely need people to work on Korean, Chinese (both traditional
and simplified), and Russian. Let us know if you are interested in
this project!
285
286
287
288==========================================
289 php3.ini setting
290==========================================
291
292The following init options will allow you to change the default settings.
293Define these settings in the global section of php3.ini.
294
295All keywords are case-insensitive.
296
297o Encoding naming
298
 For each encoding, there are three names: standardized, alias, MIME
300
301 - UTF-8
302 standard: UTF-8
303 alias: N/A
304 mime: UTF-8
305
306 - ASCII
307 standard: ASCII
308 alias: N/A
309 mime: US-ASCII
310
311 - Japanese EUC
312 standard: EUC-JP
313 alias: EUC, EUC_JP, eucJP, x-euc-jp
314 mime: EUC-JP
315
316 - Shift JIS
317 standard: SJIS
318 alias: x-sjis, MS_Kanji
319 mime: Shift_JIS
320
321 - JIS
322 standard: JIS
323 alias: N/A
324 mime: ISO-2022-JP
325
326 - Quoted-Printable
327 standard: Quoted-Printable
328 alias: qprint
329 mime: N/A
330
331 - BASE64
332 standard: BASE64
333 alias: N/A
334 mime: N/A
335
336 - no conversion
337 standard: pass
338 alias: none
339 mime: N/A
340
341 - auto encoding detection
342 standard: auto
343 alias: unknown
344 mime: N/A
345
 * N/A - Not Applicable
347
348o i18n.http_output - default http output encoding
349
350 i18n.http_output = EUC-JP|SJIS|JIS|UTF-8|pass
351 EUC-JP : EUC
352 SJIS: SJIS
353 JIS : JIS
354 UTF-8: UTF-8
355 pass: no conversion
356
357 The default is pass (internal encoding is used)
358 It can be re-configured on the fly using i18n_http_output().
359
360
361o i18n.internal_encoding - internal encoding
362
363 i18n.internal_encoding = EUC-JP|SJIS|UTF-8
364 EUC-JP : EUC
365 SJIS: SJIS
366 UTF-8: UTF-8
367
368 The default is EUC-JP.
369
 The PHP parser was designed around ISO-8859-1. Any other encoding
 has to satisfy the following conditions to be used as the internal
 encoding:
 - it is a byte-oriented encoding
 - single byte characters are in the range 00h-7fh and are compatible
   with ASCII
 - multibyte characters contain no bytes in the range 00h-7fh
 For Japanese, EUC-JP and UTF-8 are the only encodings that meet
 these criteria.

 If i18n.internal_encoding and i18n.http_output differ, conversion
 takes place at the time of output. If you convert data within a
 PHP script to URL encoding, BASE64 or Quoted-Printable, the data
 stays in the encoding defined by i18n.internal_encoding. If you
 want it encoded according to i18n.http_output instead, you need
 to convert the encoding manually first.

 ex. $str = urlencode( i18n_convert($str, i18n_http_output()) );

 Encodings that use escape sequences, such as ISO-2022-** and HZ,
 cannot be used as the internal encoding. If they are used, the
 following problems occur:
 - the parser reports unexpected errors
 - magic_quotes_*** breaks the encoding (SJIS may have a similar problem)
 - string manipulation and regex malfunction
395
396
397o i18n.script_encoding - script encoding
398
399 i18n.script_encoding = auto|EUC-JP|SJIS|JIS|UTF-8
400 auto: automatic
401 EUC-JP : EUC
402 SJIS: SJIS
403 JIS : JIS
404 UTF-8: UTF-8
405
406 The default is auto.
407 The script's encoding is converted to i18n.internal_encoding before
408 entering the script parser.
409
 Be aware that auto detection may fail under some conditions.
 For the best auto detection, put a multibyte character at the
 beginning of the script.
413
414
415o i18n.http_input - handling of http input (GET/POST/COOKIE)
416
417 i18n.http_input = pass|auto
418 auto: auto conversion
419 pass: no conversion
420
421 The default is auto.
422 If set to pass, no conversion will take place.
423 If set to auto, it will automatically detect the encoding. If
424 detection is successful, it will convert to the proper internal
425 encoding. If not, it will assume the input as defined in
426 i18n.http_input_default.
427
428o i18n.http_input_default - default http input encoding
429
430 i18n.http_input_default = pass|EUC-JP|SJIS|JIS|UTF-8
431 pass: no conversion
432 EUC-JP : EUC
433 SJIS: SJIS
434 JIS : JIS
435 UTF-8: UTF-8
436
437 The default is pass.
438 This option is only effective as long as i18n.http_input is set to
439 auto. If the auto detection fails, this encoding is used as an
440 assumption to convert the http input to the internal encoding.
441 If set to pass, no conversion will take place.
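
The way i18n.http_input and i18n.http_input_default work together can be
written down as a small decision function. This is only a model of the
rules described above: the function name and its arguments are
hypothetical, and the real decision is made inside PHP before the script
runs.

```php
<?php
// Hypothetical model of the rules above. $detected is the encoding found
// by auto detection, or "" when detection failed. The return value is the
// encoding the input is converted from; "pass" means no conversion.
function input_source_encoding($http_input, $detected, $http_input_default) {
    if ($http_input == "pass") {
        return "pass";                // conversion is switched off
    }
    if ($detected != "") {
        return $detected;             // detection succeeded
    }
    return $http_input_default;       // fall back to the configured default
}

echo input_source_encoding("auto", "SJIS", "EUC-JP"), "\n";   // prints: SJIS
echo input_source_encoding("auto", "", "EUC-JP"), "\n";       // prints: EUC-JP
echo input_source_encoding("pass", "SJIS", "EUC-JP"), "\n";   // prints: pass
?>
```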
442
443o sample settings
444
445 1) For most flexibility, we recommend using following example.
446 i18n.http_output = SJIS
447 i18n.internal_encoding = EUC-JP
448 i18n.script_encoding = auto
449 i18n.http_input = auto
450 i18n.http_input_default = SJIS
451
452 2) To avoid unexpected encoding problems, try these:
453
454 i18n.http_output = pass
455 i18n.internal_encoding = EUC-JP
456 i18n.script_encoding = pass
457 i18n.http_input = pass
458 i18n.http_input_default = pass
459
460
461
462==========================================
463 PHP functions
464==========================================
465
466The following describes the additional PHP functions.
467
468All keywords are case-insensitive.
469
470o i18n_http_output(encoding)
471o encoding = i18n_http_output()
472
473 This will set the http output encoding. Any output following this
474 function will be controlled by this function. If no argument is given,
475 the current http output encode setting is returned.
476
477 encodings
478 EUC-JP : EUC
479 SJIS: SJIS
480 JIS : JIS
481 UTF-8: UTF-8
482 pass: no conversion
483
484 NONE is not allowed
485
486
487o encoding = i18n_internal_encoding()
488
489 Returns the current internal encoding as a string.
490
491 internal encoding
492 EUC-JP : EUC
493 SJIS: SJIS
494 UTF-8: UTF-8
495
496
497o encoding = i18n_http_input()
498
499 Returns http input encoding.
500
501 encodings
502 EUC-JP : EUC
503 SJIS: SJIS
504 JIS : JIS
505 UTF-8: UTF-8
506 pass: no conversion (only if i18n.http_input is set to pass)
507
508
509o string = i18n_convert(string, encoding)
510 string = i18n_convert(string, encoding, pre-conversion-encoding)
511
512 Returns converted string in desired encoding. If
513 pre-conversion-encoding is not defined, the given
514 string is assumed to be in internal encoding.
515
516 encoding
517 EUC-JP : EUC
518 SJIS: SJIS
519 JIS : JIS
520 UTF-8: UTF-8
521 pass: no conversion
522
523 pre-conversion-encoding
524 EUC-JP : EUC
525 SJIS: SJIS
526 JIS : JIS
527 UTF-8: UTF-8
528 pass: no conversion
529 auto: auto detection
530
531
532o encoding = i18n_discover_encoding(string)
533
534 Encoding of the given string is returned (as a string).
535
536 encoding
537 EUC-JP : EUC
538 SJIS: SJIS
539 JIS : JIS
540 UTF-8: UTF-8
541 ASCII: ASCII (only 09h, 0Ah, 0Dh, 20h-7Eh)
542 pass: unable to determine (text is too short to determine)
543 unknown: unknown or possible error
544
545
546o int = mbstrlen(string)
547o int = mbstrlen(string, encoding)
548
549 Returns character length of a given string. If no encoding is defined,
550 the encoding of string is assumed to be the internal encoding.
551
552 encoding
553 EUC-JP : EUC
554 SJIS: SJIS
555 JIS : JIS
556 UTF-8: UTF-8
557 auto: automatic
558
559
560o int = mbstrpos(string1, string2)
561o int = mbstrpos(string1, string2, start)
562o int = mbstrpos(string1, string2, start, encoding)
563
564 Same as strpos. If no encoding is defined, the encoding of string
565 is assumed to be the internal encoding.
566
567 encoding
568 EUC-JP : EUC
569 SJIS: SJIS
570 JIS : JIS
571 UTF-8: UTF-8
572
573
574o int = mbstrrpos(string1, string2)
575o int = mbstrrpos(string1, string2, encoding)
576
577 Same as strrpos. If no encoding is defined, the encoding of string
578 is assumed to be the internal encoding.
579
580 encoding
581 EUC-JP : EUC
582 SJIS: SJIS
583 JIS : JIS
584 UTF-8: UTF-8
585
586
587o string = mbsubstr(string, position)
588o string = mbsubstr(string, position, length)
589o string = mbsubstr(string, position, length, encoding)
590
591 Same as substr. If no encoding is defined, the encoding of string
592 is assumed to be the internal encoding.
593
594 encoding
595 EUC-JP : EUC
596 SJIS: SJIS
597 JIS : JIS
598 UTF-8: UTF-8
599
600
601o string = mbstrcut(string, position)
602o string = mbstrcut(string, position, length)
603o string = mbstrcut(string, position, length, encoding)
604
 Similar to substr, but position and length are counted in bytes and a
 mb character is never chopped in half. If position points at the 2nd
 byte of a mb character, the cut starts from the first byte of that
 character. In other words, if you set length to 5, you will only get
 two mb characters (when each is two bytes long). If no encoding is
 defined, the encoding of string is assumed to be the internal encoding.
610
611 encoding
612 EUC-JP : EUC
613 SJIS: SJIS
614 JIS : JIS
615 UTF-8: UTF-8
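
The byte-oriented cutting rule of mbstrcut can be sketched in plain PHP.
The function euc_cut() below is hypothetical and is not how the package
implements it. It assumes the string holds only ASCII and two-byte
EUC-JP characters.

```php
<?php
// Hypothetical helper: cut at most $length bytes starting at byte offset
// $start, without splitting a two-byte character.
function euc_cut($s, $start, $length) {
    $n   = strlen($s);
    $out = "";
    $i   = 0;
    while ($i < $n) {
        // In this simplified model a byte >= 0x80 starts a 2-byte character.
        $w = (ord(substr($s, $i, 1)) >= 0x80) ? 2 : 1;
        if ($i + $w > $start) {
            // The character covers or follows $start, so it is a candidate.
            if (strlen($out) + $w > $length) {
                break;                // it would not fit completely
            }
            $out .= substr($s, $i, $w);
        }
        $i += $w;
    }
    return $out;
}

// Three two-byte EUC-JP characters (6 bytes).
$euc = chr(0xc6) . chr(0xfc) . chr(0xcb) . chr(0xdc) . chr(0xb8) . chr(0xec);

echo bin2hex(euc_cut($euc, 0, 5)), "\n";   // prints: c6fccbdc (two characters)
echo bin2hex(euc_cut($euc, 1, 2)), "\n";   // prints: c6fc (starts at 1st byte)
?>
```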
616
617
618o string = i18n_mime_header_encode(string)
619 MIME encode the string in the format of =?ISO-2022-JP?B?[string]?=.
620
621
622o string = i18n_mime_header_decode(string)
623 MIME decodes the string.
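
The header format named above can be reproduced by hand with stock PHP,
which shows what the encoded string contains. This is only a sketch: it
assumes the text is already in JIS (ISO-2022-JP), and the real
i18n_mime_header_encode may differ in details such as line folding.

```php
<?php
// "日本語" in JIS: ESC $ B, then three 2-byte codes, then ESC ( B.
// Single quotes keep "$" and "\" literal.
$jis = chr(27) . '$BF|K\8l' . chr(27) . '(B';

// Base64-encode the JIS bytes and wrap them in the MIME header form.
$subject = "=?ISO-2022-JP?B?" . base64_encode($jis) . "?=";

echo "Subject: ", $subject, "\n";
// prints: Subject: =?ISO-2022-JP?B?GyRCRnxLXDhsGyhC?=
?>
```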
624
625
626o string = i18n_ja_jp_hantozen(string)
627o string = i18n_ja_jp_hantozen(string, option)
628o string = i18n_ja_jp_hantozen(string, option, encoding)
629
630 Conversion between full width character and halfwidth character.
631
632 option
633 The following options are allowed. The default is "KV".
634 Acronym: FW = fullwidth, HW = halfwidth
635
636 "r" : FW alphabet -> HW alphabet
637
638 "R" : HW alphabet -> FW alphabet
639
640 "n" : FW number -> HW number
641
642 "N" : HW number -> FW number
643
644 "a" : FW alpha numeric (21h-7Eh) -> HW alpha numeric
645
646 "A" : HW alpha numeric (21h-7Eh) -> FW alpha numeric
647
648 "k" : FW katakana -> HW katakana
649
650 "K" : HW katakana -> FW katakana
651
 "h" : FW hiragana -> HW katakana

 "H" : HW katakana -> FW hiragana

 "c" : FW katakana -> FW hiragana

 "C" : FW hiragana -> FW katakana

 "V" : combine a kana and its voiced sound mark (dakuon) into one
       character. only works with the "K" and "H" options
661
662 encoding
663 If no encoding is defined, the encoding of string is assumed to be
664 the internal encoding.
665 EUC-JP : EUC
666 SJIS: SJIS
667 JIS : JIS
668 UTF-8: UTF-8
669
670
671int = mbereg(regex_pattern, string, string)
672int = mberegi(regex_pattern, string, string)
673 mb version of ereg() and eregi()
674
675
676string = mbereg_replace(regex_pattern, string, string)
677string = mberegi_replace(regex_pattern, string, string)
678 mb version of ereg_replace() and eregi_replace()
679
680
681string_array = mbsplit(regex, string, limit)
682 mb version of split()
683
684
685
686==========================================
687 FAQ
688==========================================
689
Here we have gathered some questions commonly asked on the PHP-jp
mailing list.

o To use Japanese in GET method

If you need to pass Japanese text as an argument of a GET request, as in
xxxx.php?data=<Japanese text>, encode it with PHP's urlencode function.
Otherwise the text may not reach the receiving script intact.

ex: <a href="hoge.php?data=<? echo urlencode($data) ?>">Link</a>
700
701
702o When passing data via GET/POST/COOKIE, \ character sneaks in
703
When SJIS is used as the internal encoding, or when the passed data
contains ', " or \, PHP automatically inserts the escape character \.
To prevent this, change magic_quotes_gpc in php3.ini from On to Off.
An alternative workaround is to use StripSlashes().

If $quote_str is in SJIS, you can remove just the backslashes that were
inserted after a multibyte character with ereg_replace, as follows:

ereg_replace(sprintf("([%c-%c%c-%c]\\\\)\\\\",0x81,0x9f,0xe0,0xfc),
    "\\1",$quote_str);

The pattern matches an SJIS first byte (81h-9Fh or E0h-FCh) followed by
two backslashes and keeps only the first one, which restores the
original Japanese text in $quote_str.
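
Why the backslash appears can be seen with just two bytes. The sketch
below uses stock PHP only. addslashes() stands in for what
magic_quotes_gpc does to incoming data, which is an assumption made for
the demonstration.

```php
<?php
// In SJIS the character "表" is the byte pair 95h 5Ch, and 5Ch is also "\".
$sjis = chr(0x95) . chr(0x5c);

// Quoting escapes every "\" byte, including the one inside the character.
$quoted = addslashes($sjis);
echo bin2hex($sjis), " -> ", bin2hex($quoted), "\n";   // prints: 955c -> 955c5c

// StripSlashes() removes the inserted byte again.
$restored = stripslashes($quoted);
echo bin2hex($restored), "\n";                          // prints: 955c
?>
```

Note that StripSlashes() is only safe on data that really was quoted. On
unquoted SJIS text it would remove the 5Ch byte that belongs to the
character.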
716
717
718o Sometimes, encoding detection fails
719
If i18n_http_input() returns 'pass', it's likely that PHP failed to
detect whether the input is SJIS or EUC. In that case, add a hidden
field such as <input type=hidden value="some Japanese text"> to the
form so that the incoming text's encoding can be detected properly.
724
725
726
727==========================================
728 Japanese Manual
729==========================================
The manual has been translated by the "PHP Japanese Manual Project":

http://www.php.net/manual/ja/manual.php

Starting with 3.0.18-i18n-ja, doc-jp is no longer included in the
tarball package.
735
736
737==========================================
738 Change Logs
739==========================================
740
741o 2000-10-28, Rui Hirokawa <hirokawa@php.net>
742
This patch is derived from php-3.0.15-i18n-ja and from the php-3.0.16
patch by Kuwamura, applied to the original php-3.0.18. It also includes
the following fixes:
745
7461) allows you to set charset in mail().
7472) fixed mbregex definitions to avoid conflicts with system regex
7483) php3.ini-dist now uses PASS for http_output instead of SJIS
749
750o 2000-11-24, Hironori Sato <satoh@yyplanet.com>
751
Applied the above patch and added detection for gdImageStringTTF in
configure. The following setups are known to work:

gd-1.3-6, gd-devel-1.3-6, freetype-1.3.1-5, freetype-devel-1.3.1-5
    ImageTTFText($im,$size,$angle,$x1,$y1,$color,"/path/to/font.ttf",
        i18n_convert("日本語", "UTF-8"));
    ImageGif($im);

gd-1.7.3-1k1, gd-devel-1.7.3-1k1, freetype-1.3.1-5, freetype-devel-1.3.1-5
    ImageTTFText($im,$size,$angle,$x1,$y1,$color,"/path/to/font.ttf","日本語");
    ImagePng($im);
    * i18n_internal_encoding = EUC or SJIS

For any gd library before 1.6.2, you need to use i18n_convert. If you
have gd-1.5.2/3, upgrade to anything above 1.7 to use ImageTTFText
without i18n_convert. As long as internal_encoding is set to EUC or
SJIS, ImageTTFText should work without mojibake. Again, make sure you
call i18n_http_output("pass") before calling ImageGif, ImagePng or
ImageJpeg!
770
771o 2000-12-09, Rui Hirokawa <hirokawa@php.net>
772
773Fixed mail() which was causing segmentation fault when header was null.
774