1. Collator::getAvailableLocales().
Return the locales available at the time of the call, including registered locales.
If a severe error occurs (such as an out-of-memory condition), this will return null.
If there is no locale data, an empty enumeration will be returned.
The returned list contains strings in the format of the RFC 4646 standard (see http://www.rfc-editor.org/rfc/rfc4646.txt).
Examples of the locale format: 'en_US', 'ru_UA', 'uk_UA' (see http://demo.icu-project.org/icu-bin/locexp).


2. Collator::getDisplayName( $obj_locale, $disp_locale ).
Get the name of the object for the desired locale, in the desired language. Both arguments
must come from the getAvailableLocales method.

     @param  string  $obj_locale   Locale to get display name for.
     @param  string  $disp_locale  Specifies the desired locale for output

Both parameters are case insensitive.
For the locale format, see the RFC 4647 standard at ftp://ftp.rfc-editor.org/in-notes/rfc4647.txt

3. Collator::getLocaleByType( $type ).
Allows the user to select whether she wants information on the requested, valid, or actual locale.
The returned locale tag is a string formatted according to the RFC 4646 standard and normalized to normal form -
value is a string from
For example, suppose a collator for "en_US_CALIFORNIA" was requested. In the current state of ICU (2.0),
the requested locale is "en_US_CALIFORNIA", the valid locale is "en_US" (the most specific locale
supported by ICU), and the actual locale is "root" (the collation data comes unmodified from the UCA).
The locale is considered supported by ICU if there is a core ICU bundle for that locale (although
it may be empty).
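
A minimal sketch of querying the valid and actual locales. It assumes the method is
exposed as Collator::getLocale() with the Locale::VALID_LOCALE and Locale::ACTUAL_LOCALE
type constants, as in the released intl extension, rather than under the getLocaleByType
name used above. The tags printed depend on the ICU data installed, so no output is shown.

```php
<?php
// Assumption: the intl extension is loaded and exposes Collator::getLocale().
$coll = new Collator('en_US');

// Ask the collator which locale it validated and which one actually
// supplied the collation data.
$valid  = $coll->getLocale(Locale::VALID_LOCALE);
$actual = $coll->getLocale(Locale::ACTUAL_LOCALE);

echo "valid:  ", $valid, "\n";
echo "actual: ", $actual, "\n";
```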


4. VariableTop
The Variable_Top attribute is only meaningful if the Alternate attribute is not set to NonIgnorable.
In such a case, it controls which characters count as ignorable. The string value specifies
the "highest" character weight (in UCA order) that is to be considered ignorable.
Thus, for example, if a user wanted whitespace to be ignorable, but not any visible characters,
then s/he would use the value Variable_Top="\u0020" (space). The string should only be a
single character. All characters of the same primary weight are equivalent, so
Variable_Top="\u3000" (ideographic space) has the same effect as Variable_Top="\u0020".
This setting (alone) has little impact on string comparison performance; setting it lower or higher
will make sort keys slightly shorter or longer respectively.


5. Strength
The ICU Collation Service supports many levels of comparison (named "Levels", but also
known as "Strengths"). Having these categories enables ICU to sort strings precisely
according to local conventions. However, by allowing the levels to be selectively
employed, searching for a string in text can be performed with various matching
conditions.
Performance optimizations have been made for ICU collation with the default level
settings. Performance-specific impacts are discussed in the Performance section below.
Following is a list of the names for each level and an example usage:

1. Primary Level: Typically, this is used to denote differences between base characters
(for example, "a" < "b"). It is the strongest difference. For example, dictionaries are
divided into different sections by base character. This is also called the level 1
strength.

2. Secondary Level: Accents in the characters are considered secondary differences (for
example, "as" < "às" < "at"). Other differences between letters can also be considered
secondary differences, depending on the language. A secondary difference is ignored
when there is a primary difference anywhere in the strings. This is also called the
level 2 strength.
Note: In some languages (such as Danish), certain accented letters are considered to
be separate base characters. In most languages, however, an accented letter only has a
secondary difference from the unaccented version of that letter.

3. Tertiary Level: Upper and lower case differences in characters are distinguished at the
tertiary level (for example, "ao" < "Ao" < "aò"). In addition, a variant of a letter differs
from the base form on the tertiary level (such as "A" and "Ⓐ"). Another example is the
difference between large and small Kana. A tertiary difference is ignored when there is
a primary or secondary difference anywhere in the strings. This is also called the level 3
strength.

4. Quaternary Level: When punctuation is ignored (see Ignoring Punctuations) at levels
1-3, an additional level can be used to distinguish words with and without punctuation
(for example, "ab" < "a-b" < "aB"). This difference is ignored when there is a primary,
secondary or tertiary difference. This is also known as the level 4 strength. The
quaternary level should only be used if ignoring punctuation is required or when
processing Japanese text (see Hiragana processing).

5. Identical Level: When all other levels are equal, the identical level is used as a
tiebreaker. The Unicode code point values of the NFD form of each string are
compared at this level, just in case there is no difference at levels 1-4.
For example, Hebrew cantillation marks are only distinguished at this level. This level should be
used sparingly, because two strings that differ only in code point values are an
extremely rare occurrence. Using this level substantially decreases the performance for
both incremental comparison and sort key generation (as well as increasing the sort
key length). It is also known as level 5 strength.

For example, people may choose to ignore accents or ignore accents and case when searching
for text. Almost all characters are distinguished by the first three levels, and in most
locales the default value is thus Tertiary. However, if Alternate is set to be Shifted,
then the Quaternary strength can be used to break ties among whitespace, punctuation, and
symbols that would otherwise be ignored. If very fine distinctions among characters are required,
then the Identical strength can be used (for example, Identical Strength distinguishes
between the Mathematical Bold Small A and the Mathematical Italic Small A). However, using
levels higher than Tertiary, in particular the Identical strength, results in significantly
longer sort keys and slower string comparison performance for equal strings.
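
The effect of the first three levels can be sketched with Collator::setStrength() and
Collator::compare(). This is an illustration only; it assumes the intl extension is
loaded, that the file is saved as UTF-8, and that 'en_US' uses the default tailoring.

```php
<?php
// Assumption: intl extension loaded, source file encoded as UTF-8.
$coll = new Collator('en_US');

// Primary: only base letters matter, so case and accents compare equal.
$coll->setStrength(Collator::PRIMARY);
$primaryCase   = $coll->compare('a', 'A');
$primaryAccent = $coll->compare('as', 'às');

// Secondary: accents now count, case still does not.
$coll->setStrength(Collator::SECONDARY);
$secondaryAccent = $coll->compare('as', 'às');
$secondaryCase   = $coll->compare('a', 'A');

// Tertiary (the usual default): case differences count as well.
$coll->setStrength(Collator::TERTIARY);
$tertiaryCase = $coll->compare('a', 'A');

var_dump($primaryCase, $primaryAccent, $secondaryAccent, $secondaryCase, $tertiaryCase);
```

compare() returns 0 for "equal", a negative value when the first string sorts first,
and a positive value otherwise.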



6. Collator::__construct( $locale ).
The Locale attribute is typically the most important attribute for correct sorting and matching,
according to the user expectations in different countries and regions. The default UCA
ordering will only sort a few languages such as Dutch and Portuguese correctly ("correctly"
meaning according to the normal expectations for users of the languages).
Otherwise, you need to supply the locale to UCA in order to properly collate text for a
given language. Thus a locale needs to be supplied so as to choose a collator that is correctly
tailored for that locale. The choice of a locale will automatically preset the values for
all of the attributes to something that is reasonable for that locale. Thus most of the time the
other attributes do not need to be explicitly set. In some cases, the choice of locale will make a
difference in string comparison performance and/or sort key length.
The locale is written in the form <language>_<script>_<region>_<keyword>.
Not all the elements are required. Valid values for locale elements are the generally valid values
for RFC 4646 locale naming and the RFC 4647 lookup algorithm.
Example:
Locale="sv" (Swedish) "Kypper" < "Köpfe"
Locale="de" (German) "Köpfe" < "Kypper"
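
The Swedish/German example above can be reproduced with a short sketch. It assumes the
intl extension is loaded with collation data for both locales and that the file is
saved as UTF-8; the variable names are illustrative.

```php
<?php
// Assumption: intl extension loaded, source file encoded as UTF-8.
$swedish = new Collator('sv');
$german  = new Collator('de');

// Swedish treats "ö" as a separate letter near the end of the alphabet,
// German treats it as an accented "o".
$svResult = $swedish->compare('Kypper', 'Köpfe');
$deResult = $german->compare('Kypper', 'Köpfe');

// Sort copies of the same list under each locale.
$svOrder = ['Köpfe', 'Kypper'];
$deOrder = ['Köpfe', 'Kypper'];
$swedish->sort($svOrder);
$german->sort($deOrder);

echo 'sv: ', implode(' < ', $svOrder), "\n";
echo 'de: ', implode(' < ', $deOrder), "\n";
```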


7. Collator::get/setAttribute.
ICU uses UCA as a default starting point for ordering. Not all languages have sorting sequences
that correspond with the UCA because UCA cannot simultaneously encompass the specifics of all
the languages currently in use. Therefore, ICU provides a data-driven, flexible, and run-time
customizable mechanism called "tailoring". Tailoring overrides the default order of code points
and the values of the ICU Collation Service attributes.
A collator has the following attributes:
   - FRENCH_COLLATION, possible values are:
	ON
	OFF (default)
	DEFAULT

   - CASE_FIRST, possible values are:
	OFF (default)
	LOWER_FIRST
	UPPER_FIRST
	DEFAULT

   - CASE_LEVEL, possible values are:
	OFF (default)
	ON
	DEFAULT

   - NORMALIZATION_MODE, possible values are:
	OFF (default)
	ON
	DEFAULT

   - STRENGTH, possible values are:
	PRIMARY
	SECONDARY
	TERTIARY (default)
	QUATERNARY
	IDENTICAL
	DEFAULT

   - ALTERNATE_HANDLING, possible values are:
	NON_IGNORABLE (default)
	SHIFTED
	DEFAULT

   - HIRAGANA_QUATERNARY_MODE, possible values are:
	ON
	OFF (default)
	DEFAULT

   - NUMERIC_COLLATION, possible values are:
	ON
	OFF (default)
	DEFAULT

Descriptions of these attributes:

FRENCH_COLLATION - Sort strings with different accents from the back of the string. This attribute
is automatically set to On for the French locales and a few others. Users normally would
not need to explicitly set this attribute. There is a string comparison performance cost when
it is set On, but sort key length is unaffected.
Example:
F=X cote < coté < côte < côté
F=O cote < côte < coté < côté
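
A small sketch of toggling this attribute through Collator::setAttribute(). It assumes
the intl extension is loaded and the file is saved as UTF-8; the 'en_US' locale is
chosen only because it does not enable the attribute on its own.

```php
<?php
// Assumption: intl extension loaded, source file encoded as UTF-8.
$coll = new Collator('en_US');

// Accents compared from the front: the first accent difference decides.
$coll->setAttribute(Collator::FRENCH_COLLATION, Collator::OFF);
$forward = $coll->compare('coté', 'côte');

// Accents compared from the back: the last accent difference decides.
$coll->setAttribute(Collator::FRENCH_COLLATION, Collator::ON);
$backward = $coll->compare('coté', 'côte');

var_dump($forward, $backward);
```

With the attribute off, "coté" sorts before "côte"; with it on, the order of the two
words is reversed, matching the F=X and F=O rows above.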

CASE_FIRST - The Case_First attribute is used to control whether uppercase letters come before
lowercase letters or vice versa, in the absence of other differences in the strings. The possible
values are Uppercase_First (U) and Lowercase_First (L), plus the standard Default and Off (X).
There is almost no difference between the Off and Lowercase_First options in terms of results,
so typically users will not use Lowercase_First: only Off or Uppercase_First. (People interested
in the detailed differences between X and L should consult the Collation Customization).
Specifying either L or U won't affect string comparison performance, but will affect the sort key
length.
Example:
C=X or C=L "china" < "China" < "denmark" < "Denmark"
C=U "China" < "china" < "Denmark" < "denmark"
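
As an illustration, the following sketch switches Case_First to Uppercase_First and
compares the same pair before and after. It assumes the intl extension is loaded and
that 'en_US' leaves Case_First at Off by default.

```php
<?php
// Assumption: intl extension loaded; 'en_US' keeps CASE_FIRST at its default (Off).
$coll = new Collator('en_US');

// Default: lowercase sorts before uppercase when nothing else differs.
$defaultOrder = $coll->compare('china', 'China');

// Uppercase_First: the same pair now sorts the other way round.
$coll->setAttribute(Collator::CASE_FIRST, Collator::UPPER_FIRST);
$upperFirstOrder = $coll->compare('china', 'China');

var_dump($defaultOrder, $upperFirstOrder);
```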

CASE_LEVEL - The Case_Level attribute is used when ignoring accents but not case. In such a situation,
set Strength to be Primary, and Case_Level to be On. In most locales, this setting is Off by default.
There is a small string comparison performance and sort key impact if this attribute is set to be On.
Example:
S=1, E=X role = Role = rôle
S=1, E=O role = rôle < Role
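
The "ignore accents but not case" combination could look like this in PHP. The sketch
assumes the intl extension is loaded and the file is saved as UTF-8.

```php
<?php
// Assumption: intl extension loaded, source file encoded as UTF-8.
$coll = new Collator('en_US');
$coll->setStrength(Collator::PRIMARY);

// Primary strength alone ignores both accents and case.
$caseIgnored = $coll->compare('role', 'Role');

// Adding the case level keeps accents ignored but distinguishes case again.
$coll->setAttribute(Collator::CASE_LEVEL, Collator::ON);
$accentIgnored = $coll->compare('role', 'rôle');
$caseCompared  = $coll->compare('role', 'Role');

var_dump($caseIgnored, $accentIgnored, $caseCompared);
```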

NORMALIZATION_MODE - The Normalization setting determines whether text is thoroughly normalized
or not in comparison. Even if the setting is off (which is the default for many locales), text as
represented in common usage will compare correctly (for details, see UTN #5). Only if the accent
marks are in non-canonical order will there be a problem. If the setting is On, then the best
results are guaranteed for all possible text input. There is a medium string comparison performance
cost if this attribute is On, depending on the frequency of sequences that require normalization.
There is no significant effect on sort key length. If the input text is known to be in NFD or NFKD
normalization forms, there is no need to enable this Normalization option.

STRENGTH - see the Collator::setStrength chapter.

ALTERNATE_HANDLING - The Alternate attribute is used to control the handling of the so-called
variable characters in the UCA: whitespace, punctuation and symbols. If Alternate is set to
NonIgnorable (N), then differences among these characters are of the same importance as
differences among letters. If Alternate is set to Shifted (S), then these characters are of only
minor importance. The Shifted value is often used in combination with Strength set to Quaternary.
In such a case, whitespace, punctuation, and symbols are considered when comparing strings,
but only if all other aspects of the strings (base letters, accents, and case) are identical.
If Alternate is not set to Shifted, then there is no difference between a Strength of 3 and
a Strength of 4. For more information and examples, see
Variable_Weighting in the UCA (http://www.unicode.org/reports/tr10/#Variable_Weighting).
The reason the Alternate values are not simply On and Off is that additional Alternate values
may be added in the future. The UCA option Blanked is expressed with Strength set to 3,
and Alternate set to Shifted. The default for most locales is NonIgnorable. If Shifted is selected,
it may be slower if there are many strings that are the same except for punctuation;
sort key length will not be affected unless the strength level is also increased.
Example:
S=3, A=N di Silva < Di Silva < diSilva < U.S.A. < USA
S=3, A=S di Silva = diSilva < Di Silva < U.S.A. = USA
S=4, A=S di Silva < diSilva < Di Silva < U.S.A. < USA
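
The three rows of the example can be sketched for the "U.S.A." / "USA" pair as follows.
This assumes the intl extension is loaded and that 'en_US' defaults to NonIgnorable.

```php
<?php
// Assumption: intl extension loaded; 'en_US' defaults to NonIgnorable.
$coll = new Collator('en_US');

// S=3, A=N: punctuation is as significant as letters.
$nonIgnorable = $coll->compare('U.S.A.', 'USA');

// S=3, A=S: punctuation is ignored at the first three levels.
$coll->setAttribute(Collator::ALTERNATE_HANDLING, Collator::SHIFTED);
$coll->setStrength(Collator::TERTIARY);
$shiftedTertiary = $coll->compare('U.S.A.', 'USA');

// S=4, A=S: punctuation acts as a tiebreaker on the fourth level.
$coll->setStrength(Collator::QUATERNARY);
$shiftedQuaternary = $coll->compare('U.S.A.', 'USA');

var_dump($nonIgnorable, $shiftedTertiary, $shiftedQuaternary);
```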

HIRAGANA_QUATERNARY_MODE - Compatibility with JIS X 4061 requires the introduction of an additional
level to distinguish Hiragana and Katakana characters. If compatibility with that standard is required,
then this attribute should be set On, and the strength set to Quaternary. This will affect sort key
length and string comparison performance.

NUMERIC_COLLATION - When turned on, this attribute generates a collation key for the
numeric value of substrings of digits. This is a way to get '100' to sort AFTER '2'.
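
A short sketch of numeric collation, assuming the intl extension is loaded. The file
names are made up for the illustration.

```php
<?php
// Assumption: intl extension loaded. The file names are hypothetical.
$coll = new Collator('en_US');

// Without numeric collation digits are compared one by one, so '100' < '2'.
$plain = $coll->compare('2', '100');

// With numeric collation digit runs are compared by their numeric value.
$coll->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
$numeric = $coll->compare('2', '100');

$files = ['file100', 'file2', 'file10'];
$coll->sort($files);

var_dump($plain, $numeric);
echo implode(', ', $files), "\n";
```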