1. Collator::getAvailableLocales().
Returns the locales available at the time of the call, including registered locales.
If a severe error occurs (such as an out-of-memory condition), this will return null.
If there is no locale data, an empty enumeration will be returned.
The returned list consists of locale strings in the format of the RFC 4646 standard (see http://www.rfc-editor.org/rfc/rfc4646.txt).
Example of the locale format: 'en_US', 'ru_UA', 'uk_UA' (see http://demo.icu-project.org/icu-bin/locexp).


2. Collator::getDisplayName( $obj_locale, $disp_locale ).
Gets the name of the object for the desired locale, in the desired language. Both arguments
must come from the getAvailableLocales method.

 @param string $obj_locale   Locale to get the display name for.
 @param string $disp_locale  Specifies the desired locale for output.

Both parameters are case insensitive.
For the locale format, see the RFC 4647 standard at ftp://ftp.rfc-editor.org/in-notes/rfc4647.txt

3. Collator::getLocaleByType( $type ).
Allows the caller to select whether information on the requested, valid, or actual locale is returned.
The returned locale tag is a string formatted according to the RFC 4646 standard and normalized to normal form -
value is a string from
For example, a collator for "en_US_CALIFORNIA" was requested. In the current state of ICU (2.0),
the requested locale is "en_US_CALIFORNIA", the valid locale is "en_US" (the most specific locale
supported by ICU) and the actual locale is "root" (the collation data comes unmodified from the UCA).
A locale is considered supported by ICU if there is a core ICU bundle for that locale (although
it may be empty).


4. VariableTop
The Variable_Top attribute is only meaningful if the Alternate attribute is not set to NonIgnorable.
In such a case, it controls which characters count as ignorable. The string value specifies
the "highest" character weight (in UCA order) that is to be considered ignorable.
Thus, for example, if a user wanted whitespace to be ignorable, but not any visible characters,
then they would use the value Variable_Top="\u0020" (space). The string should only be a
single character. All characters of the same primary weight are equivalent, so
Variable_Top="\u3000" (ideographic space) has the same effect as Variable_Top="\u0020".
This setting (alone) has little impact on string comparison performance; setting it lower or higher
will make sort keys slightly shorter or longer respectively.


5. Strength
The ICU Collation Service supports many levels of comparison (named "Levels", but also
known as "Strengths"). Having these categories enables ICU to sort strings precisely
according to local conventions. However, by allowing the levels to be selectively
employed, searching for a string in text can be performed with various matching
conditions.
Performance optimizations have been made for ICU collation with the default level
settings. Performance-specific impacts are discussed in the Performance section below.
Following is a list of the names for each level and an example usage:

1. Primary Level: Typically, this is used to denote differences between base characters
(for example, "a" < "b"). It is the strongest difference. For example, dictionaries are
divided into different sections by base character. This is also called the level 1
strength.

2. Secondary Level: Accents in the characters are considered secondary differences (for
example, "as" < "às" < "at"). Other differences between letters can also be considered
secondary differences, depending on the language. A secondary difference is ignored
when there is a primary difference anywhere in the strings. This is also called the
level 2 strength.
Note: In some languages (such as Danish), certain accented letters are considered to
be separate base characters.
In most languages, however, an accented letter only has a
secondary difference from the unaccented version of that letter.

3. Tertiary Level: Upper and lower case differences in characters are distinguished at the
tertiary level (for example, "ao" < "Ao" < "aò"). In addition, a variant of a letter differs
from the base form on the tertiary level (such as "A" and "Ⓐ"). Another example is the
difference between large and small Kana. A tertiary difference is ignored when there is
a primary or secondary difference anywhere in the strings. This is also called the level 3
strength.

4. Quaternary Level: When punctuation is ignored (see Ignoring Punctuation) at levels
1-3, an additional level can be used to distinguish words with and without punctuation
(for example, "ab" < "a-b" < "aB"). This difference is ignored when there is a primary,
secondary or tertiary difference. This is also known as the level 4 strength. The
quaternary level should only be used if ignoring punctuation is required or when
processing Japanese text (see Hiragana processing).

5. Identical Level: When all other levels are equal, the identical level is used as a
tiebreaker. The Unicode code point values of the NFD form of each string are
compared at this level, just in case there is no difference at levels 1-4.
For example, Hebrew cantillation marks are only distinguished at this level. This level should be
used sparingly, because strings that differ only in their code point values are
extremely rare. Using this level substantially decreases the performance for
both incremental comparison and sort key generation (as well as increasing the sort
key length). It is also known as level 5 strength.

For example, people may choose to ignore accents, or to ignore accents and case, when searching
for text. Almost all characters are distinguished by the first three levels, and in most
locales the default value is thus Tertiary.
However, if Alternate is set to be Shifted,
then the Quaternary strength can be used to break ties among whitespace, punctuation, and
symbols that would otherwise be ignored. If very fine distinctions among characters are required,
then the Identical strength can be used (for example, Identical strength distinguishes
between the Mathematical Bold Small A and the Mathematical Italic Small A). However, using
levels higher than Tertiary, such as the Identical strength, results in significantly longer sort
keys and slower string comparison performance for equal strings.


6. Collator::__construct( $locale ).
The Locale attribute is typically the most important attribute for correct sorting and matching,
according to the user expectations in different countries and regions. The default UCA
ordering will only sort a few languages such as Dutch and Portuguese correctly ("correctly"
meaning according to the normal expectations for users of the languages).
Otherwise, you need to supply the locale to UCA in order to properly collate text for a
given language. Thus a locale needs to be supplied so as to choose a collator that is correctly
tailored for that locale. The choice of a locale will automatically preset the values for
all of the attributes to something that is reasonable for that locale. Thus most of the time the
other attributes do not need to be explicitly set. In some cases, the choice of locale will make a
difference in string comparison performance and/or sort key length.
In short attribute names, the locale is written as <language>_<script>_<region>_<keyword>.
Not all the elements are required. Valid values for locale elements are the generally valid values
for RFC 4646 locale naming and the RFC 4647 lookup algorithm.
Example:
Locale="sv" (Swedish) "Kypper" < "Köpfe"
Locale="de" (German)  "Köpfe" < "Kypper"


7. Collator::get/setAttribute.
ICU uses UCA as a default starting point for ordering.
Not all languages have sorting sequences
that correspond with the UCA because UCA cannot simultaneously encompass the specifics of all
the languages currently in use. Therefore, ICU provides a data-driven, flexible, and run-time
customizable mechanism called "tailoring". Tailoring overrides the default order of code points
and the values of the ICU Collation Service attributes.
A Collator has the following attributes:
 - FRENCH_COLLATION, possible values are:
     ON
     OFF (default)
     DEFAULT

 - CASE_FIRST, possible values are:
     OFF (default)
     LOWER_FIRST
     UPPER_FIRST
     DEFAULT

 - CASE_LEVEL, possible values are:
     OFF (default)
     ON
     DEFAULT

 - NORMALIZATION_MODE, possible values are:
     OFF (default)
     ON
     DEFAULT

 - STRENGTH, possible values are:
     PRIMARY
     SECONDARY
     TERTIARY (default)
     QUATERNARY
     IDENTICAL
     DEFAULT

 - ALTERNATE_HANDLING, possible values are:
     NON_IGNORABLE (default)
     SHIFTED
     DEFAULT

 - HIRAGANA_QUATERNARY_MODE, possible values are:
     ON
     OFF (default)
     DEFAULT

 - NUMERIC_COLLATION, possible values are:
     ON
     OFF (default)
     DEFAULT

Description of these attributes:

FRENCH_COLLATION - Sorts strings with different accents from the back of the string. This attribute
is automatically set to On for the French locales and a few others. Users normally would
not need to explicitly set this attribute. There is a string comparison performance cost when
it is set On, but sort key length is unaffected.
Example:
F=X cote < coté < côte < côté
F=O cote < côte < coté < côté

CASE_FIRST - The Case_First attribute is used to control whether uppercase letters come before
lowercase letters or vice versa, in the absence of other differences in the strings. The possible
values are Uppercase_First (U) and Lowercase_First (L), plus the standard Default and Off.
There is almost no difference between the Off and Lowercase_First options in terms of results,
so typically users will not use Lowercase_First: only Off or Uppercase_First. (People interested
in the detailed differences between X and L should consult the Collation Customization.)
Specifying either L or U won't affect string comparison performance, but will affect the sort key
length.
Example:
C=X or C=L "china" < "China" < "denmark" < "Denmark"
C=U        "China" < "china" < "Denmark" < "denmark"

CASE_LEVEL - The Case_Level attribute is used when ignoring accents but not case. In such a situation,
set Strength to be Primary, and Case_Level to be On. In most locales, this setting is Off by default.
There is a small string comparison performance and sort key impact if this attribute is set to be On.
Example:
S=1, E=X role = Role = rôle
S=1, E=O role = rôle < Role

NORMALIZATION_MODE - The Normalization setting determines whether text is thoroughly normalized
or not in comparison. Even if the setting is Off (which is the default for many locales), text as
represented in common usage will compare correctly (for details, see UTN #5). Only if the accent
marks are in noncanonical order will there be a problem. If the setting is On, then the best
results are guaranteed for all possible text input. There is a medium string comparison performance
cost if this attribute is On, depending on the frequency of sequences that require normalization.
There is no significant effect on sort key length. If the input text is known to be in NFD or NFKD
normalization forms, there is no need to enable this Normalization option.

STRENGTH - see the Collator::setStrength chapter.

ALTERNATE_HANDLING - The Alternate attribute is used to control the handling of the so-called
variable characters in the UCA: whitespace, punctuation and symbols.
If Alternate is set to
NonIgnorable (N), then differences among these characters are of the same importance as
differences among letters. If Alternate is set to Shifted (S), then these characters are of only
minor importance. The Shifted value is often used in combination with Strength set to Quaternary.
In such a case, whitespace, punctuation, and symbols are considered when comparing strings,
but only if all other aspects of the strings (base letters, accents, and case) are identical.
If Alternate is not set to Shifted, then there is no difference between a Strength of 3 and
a Strength of 4. For more information and examples, see
Variable_Weighting in the UCA (http://www.unicode.org/reports/tr10/#Variable_Weighting).
The reason the Alternate values are not simply On and Off is that additional Alternate values
may be added in the future. The UCA option Blanked is expressed with Strength set to 3,
and Alternate set to Shifted. The default for most locales is NonIgnorable. If Shifted is selected,
comparison may be slower if there are many strings that are the same except for punctuation;
sort key length will not be affected unless the strength level is also increased.
Example:
S=3, A=N di Silva < Di Silva < diSilva < U.S.A. < USA
S=3, A=S di Silva = diSilva < Di Silva < U.S.A. = USA
S=4, A=S di Silva < diSilva < Di Silva < U.S.A. < USA

HIRAGANA_QUATERNARY_MODE - Compatibility with JIS X 4061 requires the introduction of an additional
level to distinguish Hiragana and Katakana characters. If compatibility with that standard is required,
then this attribute should be set On, and the strength set to Quaternary. This will affect sort key
length and string comparison performance.

NUMERIC_COLLATION - When turned on, this attribute generates a collation key for the
numeric value of substrings of digits. This is a way to get '100' to sort AFTER '2'.
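
The ordering that NUMERIC_COLLATION aims for can be illustrated without ICU. The following is a minimal Python sketch, not ICU's implementation and not part of the Collator API: numeric_sort_key is a hypothetical helper that splits a string into runs of ASCII digits and runs of other characters, and compares each digit run by its numeric value. It ignores locale tailoring and every other attribute described above.

```python
import re

# Matches either a run of ASCII digits or a run of non-digit characters.
_RUNS = re.compile(r"[0-9]+|[^0-9]+")


def numeric_sort_key(text):
    """Build a sort key in which digit runs compare by numeric value.

    Hypothetical helper for illustration only. Each run becomes a tuple
    whose first element tags its kind (0 = number, 1 = text), so integers
    and strings are never compared with each other directly.
    """
    key = []
    for run in _RUNS.findall(text):
        if run[0] in "0123456789":
            key.append((0, int(run), ""))
        else:
            key.append((1, 0, run))
    return key


items = ["100", "2", "item10", "item9"]

# Plain code point ordering: '100' sorts before '2'.
plain = sorted(items)

# Numeric-aware ordering: '100' sorts after '2', 'item10' after 'item9'.
numeric = sorted(items, key=numeric_sort_key)

print(plain)    # ['100', '2', 'item10', 'item9']
print(numeric)  # ['2', '100', 'item9', 'item10']
```

With the Collator itself, the corresponding step is to set the NUMERIC_COLLATION attribute to ON through setAttribute; the sketch only mirrors the resulting order for simple ASCII input.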