I need a collator that sorts all European languages correctly (including, e.g., Czech characters).
I've tried various approaches, and they all still get the accented U wrong.
I'd rather not go for a custom solution.
The order of the Czech letters is: a, á, b, c, č, d, ď, e, é, ě, f, g, h, ch, i, í, j, k, l, m, n, ň, o, ó, p, (q), r, ř, s, š, t, ť, u, ú, ů, v, (w), (x), y, ý, z, ž
As you can see, the accented U's are actually distinct letters:
u, ú, ů
Currently I look up the locale for the language code and the EU country, and use:
collator = Collator.getInstance(selectedLocale);
// Set appropriate strength (PRIMARY or SECONDARY)
collator.setStrength(Collator.PRIMARY);
collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
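For reference, here is a minimal sketch of what the strength setting changes, assuming the JDK's built-in collator for "cs". The class name StrengthSketch and the helper compareAt are made up for illustration. If accents are only a secondary difference for this collator, they can only break ties between strings that are otherwise equal at the primary level, which would be consistent with every strength giving the same order for words that already differ in their second letter.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthSketch {
    /** Compares a and b with the JDK's collator for "cs" at the given strength. */
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("cs"));
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String plain = "u";
        String acute = "\u00FA"; // u with acute accent

        // PRIMARY looks at base letters only, so the accent is ignored here.
        System.out.println("PRIMARY   u vs u-acute: " + compareAt(Collator.PRIMARY, plain, acute));
        // SECONDARY also looks at accents, but only after the primary level ties.
        System.out.println("SECONDARY u vs u-acute: " + compareAt(Collator.SECONDARY, plain, acute));
        // Case is a tertiary difference.
        System.out.println("SECONDARY u vs U:       " + compareAt(Collator.SECONDARY, plain, "U"));
        System.out.println("TERTIARY  u vs U:       " + compareAt(Collator.TERTIARY, plain, "U"));
    }
}
```

A result of 0 means the two strings are considered equal at that strength; a negative result means the first one sorts earlier.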
Also, I am trying the additional library:
<dependency>
    <groupId>com.ibm.icu</groupId>
    <artifactId>icu4j</artifactId>
    <version>74.2</version>
</dependency>
Unfortunately, I get wrong results...
E.g. I am getting:
"Účetnictví" , "Udržitelná",
And I should be getting the reverse...
Using Java 17...
Update - Jdoodle
import java.text.Collator;
import java.util.*;

public class CzechCollationTest {
    public static void main(String[] args) {
        Locale locale = new Locale("cs");
        List<String> words = Arrays.asList(
            "Účetnictví", "Udržitelná", "Uhlovodíky"
        );
        test(words, locale, null, "Without strength");
        test(words, locale, Collator.PRIMARY, "With PRIMARY");
        test(words, locale, Collator.SECONDARY, "With SECONDARY");
        test(words, locale, Collator.TERTIARY, "With TERTIARY");
        test(words, locale, Collator.IDENTICAL, "With IDENTICAL");
    }

    private static void test(List<String> list, Locale locale, Integer strength, String description) {
        // Avoid mutating the existing list.
        List<String> clone = new ArrayList<>(list);
        Collator collator = Collator.getInstance(locale);
        if (strength != null) {
            collator.setStrength(strength);
        }
        clone.sort(collator);
        System.out.println(description + ":" + String.join(", ", clone));
    }
}
Without strength:Účetnictví, Udržitelná, Uhlovodíky
With PRIMARY:Účetnictví, Udržitelná, Uhlovodíky
With SECONDARY:Účetnictví, Udržitelná, Uhlovodíky
With TERTIARY:Účetnictví, Udržitelná, Uhlovodíky
With IDENTICAL:Účetnictví, Udržitelná, Uhlovodíky
https://www.jdoodle/embed/v1/45a4a4a323255661
asked Mar 10 at 9:56 by Menelaos, edited Mar 10 at 15:51
1 Answer
As per the javadoc of RuleBasedCollator (which is the implementation of Collator that the JDK uses), you can make your own:
// Note: the RuleBasedCollator constructor throws java.text.ParseException.
String Norwegian = "< a, A < b, B < c, C < d, D < e, E < f, F < g, G < h, H < i, I" +
                   "< j, J < k, K < l, L < m, M < n, N < o, O < p, P < q, Q < r, R" +
                   "< s, S < t, T < u, U < v, V < w, W < x, X < y, Y < z, Z" +
                   "< \u00E6, \u00C6" +     // Latin letter ae & AE
                   "< \u00F8, \u00D8" +     // Latin letter o & O with stroke
                   "< \u00E5 = a\u030A," +  // Latin letter a with ring above
                   "  \u00C5 = A\u030A;" +  // Latin letter A with ring above
                   "  aa, AA";
RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);
(You can make these read a lot nicer by using the text blocks feature in Java 15+: with triple quotes you can just hit enter inside your string literal without having to close it, add a +, start a new line, open a new string, and so on. Backslash escapes are still applied in text blocks; it's not a 'raw' mode. And even if it were, \u is JDK magic that is always applied regardless of where you use it.)
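As a rough illustration of the text block remark, here are the same rules from the javadoc example written as a single Java 15+ text block. The class name TextBlockRulesSketch and the helper norwegian() are made up for this sketch. Line breaks inside the rule string are treated as whitespace by the rule parser, and the per-line comments from the concatenated version have to move outside the string.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class TextBlockRulesSketch {
    // Same rules as the javadoc example. The Unicode escapes are still
    // translated by the compiler, exactly as in an ordinary string literal.
    // After z come: ae, o with stroke, a with ring above, then aa.
    static final String NORWEGIAN_RULES = """
            < a, A < b, B < c, C < d, D < e, E < f, F < g, G < h, H < i, I
            < j, J < k, K < l, L < m, M < n, N < o, O < p, P < q, Q < r, R
            < s, S < t, T < u, U < v, V < w, W < x, X < y, Y < z, Z
            < \u00E6, \u00C6
            < \u00F8, \u00D8
            < \u00E5 = a\u030A,
              \u00C5 = A\u030A;
              aa, AA
            """;

    /** Builds the collator, turning the checked ParseException into an unchecked one. */
    static RuleBasedCollator norwegian() {
        try {
            return new RuleBasedCollator(NORWEGIAN_RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("Rules did not parse", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = norwegian();
        // Negative if z sorts before the ae ligature under these rules.
        System.out.println(collator.compare("z", "\u00E6"));
    }
}
```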
This still involves writing the actual rules; the code you end up with is barely shorter than what you came up with yourself in your own answer to this question.
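If you only need to change how the accented U's sort, a smaller variant is to append a tailoring to the rules the JDK already has, in the style of the getRules() example in the RuleBasedCollator javadoc. This is only a sketch under assumptions: the class name, the EXTRA_RULES string and the helper methods are hypothetical, the cast to RuleBasedCollator relies on what the JDK returns in practice, and the rule assumes you really want ú and ů treated as letters of their own after u, as the question describes. The words are written with Unicode escapes so the source does not depend on the file encoding.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CzechTailoringSketch {
    // Hypothetical tailoring, not taken from the JDK: place u-acute and u-ring
    // after U as separate letters (primary difference), with the upper-case
    // form directly after the lower-case one (tertiary difference).
    static final String EXTRA_RULES =
            "& U < \u00FA , \u00DA"      // u with acute, lower then upper
            + " < \u016F , \u016E";      // u with ring above, lower then upper

    static RuleBasedCollator tailoredCzech() {
        // Assumption: the JDK's built-in collator is a RuleBasedCollator.
        RuleBasedCollator base =
                (RuleBasedCollator) Collator.getInstance(Locale.forLanguageTag("cs"));
        try {
            return new RuleBasedCollator(base.getRules() + EXTRA_RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("Tailoring rules did not parse", e);
        }
    }

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<String> sorted(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(tailoredCzech());
        return copy;
    }

    public static void main(String[] args) {
        // The three words from the question.
        List<String> words = List.of(
                "\u00DA\u010Detnictv\u00ED",
                "Udr\u017Eiteln\u00E1",
                "Uhlovod\u00EDky");
        System.out.println(String.join(", ", sorted(words)));
    }
}
```

The reset point is the upper-case U rather than u, so that the new letters land after the whole existing u group instead of between u and U.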
I'm currently searching for where the JDK gets its 'rule' string for the Czech locale. Presumably it doesn't exist, or piggybacks on a different language (which would be incorrect), or contains a bug. I'll edit this answer if I find it.
UPDATE: My search ends at sun.util.locale.provider.ResourceBundleBasedAdapter; that's an interface, so no further info can be gleaned from there: an implementation of it is the source of the 'rules' string used by the JDK. But the name kinda yells 'there is a resource bundle file that contains it', so our answer is in locale-cz.resource or whatnot.
These seem to be the locale resource files used by the JDK: jdk/src/jdk.localedata/share/classes/sun/text/resources/ext/
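It does not appear to contain a _cz entry at all. That is expected, though: CZ is the ISO 3166 country code, while the ISO 639-1 language code for Czech is cs. It does contain a _cs entry; cs here should not be confused with CS, the former ISO 3166 country code for Serbia and Montenegro, so this is presumably the file that Locale("cs") picks up. From a quick look at that file, it does not appear to produce the collation order listed in the question.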
The GitHub repo is just a convenient clone. Possibly we're not looking at the right stuff. But it sure feels like this is a JDK bug that should be reported.
com.ibm.icu.text.Collator for example - did you try those. – greg-449 Commented Mar 10 at 10:43