Correct sorting in Java for all languages (including e.g. CS)

I need correct sorting: a collator that sorts all European languages correctly (including, e.g., Czech characters).

I've tried various approaches, and they all still mess up the accented U.

I'd rather not go for a custom solution.

The order of the Czech letters is: a, á, b, c, č, d, ď, e, é, ě, f, g, h, ch, i, í, j, k, l, m, n, ň, o, ó, p, (q), r, ř, s, š, t, ť, u, ú, ů, v, (w), (x), y, ý, z, ž

As you can see, the accented U's are actually distinct letters:

u, ú, ů

Currently I am finding the locale for the language code and the EU country, and using:

    collator = Collator.getInstance(selectedLocale);
    // Set appropriate strength (PRIMARY or SECONDARY)
    collator.setStrength(Collator.PRIMARY);
    collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

Also, I am trying the additional library:

    <dependency>
        <groupId>com.ibm.icu</groupId>
        <artifactId>icu4j</artifactId>
        <version>74.2</version>
    </dependency>

Unfortunately, I get wrong results. E.g. I am getting:

"Účetnictví", "Udržitelná"

and I should be getting the reverse.

Using Java 17.

Update - Jdoodle: https://www.jdoodle/embed/v1/45a4a4a323255661

import java.text.Collator;
import java.util.*;

public class CzechCollationTest {
    public static void main(String[] args) {
        Locale locale = new Locale("cs");
        List<String> words = Arrays.asList(
            "Účetnictví", "Udržitelná", "Uhlovodíky"
        );
        test(words, locale, null, "Without strength");
        test(words, locale, Collator.PRIMARY, "With PRIMARY");
        test(words, locale, Collator.SECONDARY, "With SECONDARY");
        test(words, locale, Collator.TERTIARY, "With TERTIARY");
        test(words, locale, Collator.IDENTICAL, "With IDENTICAL");
    }

    private static void test(List<String> list, Locale locale, Integer strength, String description) {
        // Avoid mutating the existing list.
        List<String> clone = new ArrayList<>(list);
        Collator collator = Collator.getInstance(locale);
        if (strength != null) {
            collator.setStrength(strength);
        }
        clone.sort(collator);
        System.out.println(description + ":" + String.join(", ", clone));
    }
}

Without strength:Účetnictví, Udržitelná, Uhlovodíky
With PRIMARY:Účetnictví, Udržitelná, Uhlovodíky
With SECONDARY:Účetnictví, Udržitelná, Uhlovodíky
With TERTIARY:Účetnictví, Udržitelná, Uhlovodíky
With IDENTICAL:Účetnictví, Udržitelná, Uhlovodíky

edited Mar 10 at 15:51, asked Mar 10 at 9:56 by Menelaos
  • You're explicitly setting the collator strength to "primary". I strongly suspect that accent differences are considered to be secondary differences (or maybe even tertiary). Why are you explicitly setting the strength to primary? – Jon Skeet Commented Mar 10 at 9:58
  • The ICU library provides new classes in a different package, com.ibm.icu.text.Collator for example. Did you try those? – greg-449 Commented Mar 10 at 10:43
  • Please update the question - that's where a minimal reproducible example belongs. I'd also note that you're finding just "some arbitrary locale with cs as the language" - there could be multiple variants. I think it would be better to just call the Locale constructor with the language/country/variant you want to test. – Jon Skeet Commented Mar 10 at 11:01
  • Here's a rather simpler (IMO) repro: gist.github/jskeet/41b9b6194b087f6194d6db18aaa2b9ec (I'd suggest you might want to reduce the number of words to just the three beginning with U as well...) – Jon Skeet Commented Mar 10 at 11:10
  • That I don't know, but at least now with a minimal reproducible example (which I've reformatted - please use the preview in future to make sure the post looks appropriate when you post/edit) the post is hopefully more appealing to those who might be able to help. – Jon Skeet Commented Mar 10 at 14:22
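
The point about collator strength in the first comment can be checked directly. The following is a minimal sketch, not part of the original thread: it compares a plain "u" with "ú" at two strengths using the JDK's own collator for the cs language tag. The class and method names (StrengthSketch, compareAt) are made up for illustration, and the result depends on the collation data shipped with the JDK in use.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthSketch {

    // Compares two strings with the JDK collator for Czech at the given strength.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("cs"));
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String plain = "u";
        String acute = "\u00FA"; // u with acute accent

        // A result of 0 means the collator treats the two strings as equal
        // at that strength; any other value means it distinguishes them.
        System.out.println("PRIMARY:   " + compareAt(Collator.PRIMARY, plain, acute));
        System.out.println("SECONDARY: " + compareAt(Collator.SECONDARY, plain, acute));
    }
}
```

If the PRIMARY comparison reports equality while SECONDARY does not, the JDK is treating the accent as a secondary difference, which would be consistent with the question's output being the same at every strength: the first primary difference between the words then falls on the second letter, not on the U.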

1 Answer

As per the javadoc of RuleBasedCollator (which is the implementation of collator that the JDK uses), you can make your own:

String Norwegian = "< a, A < b, B < c, C < d, D < e, E < f, F < g, G < h, H < i, I" +
                    "< j, J < k, K < l, L < m, M < n, N < o, O < p, P < q, Q < r, R" +
                    "< s, S < t, T < u, U < v, V < w, W < x, X < y, Y < z, Z" +
                    "< \u00E6, \u00C6" +     // Latin letter ae & AE
                    "< \u00F8, \u00D8" +     // Latin letter o & O with stroke
                    "< \u00E5 = a\u030A," +  // Latin letter a with ring above
                    "  \u00C5 = A\u030A;" +  // Latin letter A with ring above
                    "  aa, AA";
 RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);

(You can make these read a lot nicer with the text blocks feature in Java 15+: inside triple quotes you can simply break lines in a string literal without having to close it, add a +, and open a new one on the next line. Backslash escapes still apply inside text blocks; it is not a 'raw' mode, and even if it were, \u escapes are processed by the compiler everywhere in the source file.)

This still involves writing the actual rules; the code you end up with is barely shorter than what you came up with yourself in your own answer to this question.
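
For illustration, here is a rough sketch of what such a rule string could look like for the letter order quoted in the question, written in the same style as the javadoc's Norwegian example. It is an assumption-laden sketch, not the JDK's Czech rules: the rule string, the class name CzechRuleSketch and the helper sortCzech are all made up here, only the letters from the question's alphabet are covered, and any other character falls back to whatever RuleBasedCollator does with unlisted characters. The source is kept ASCII-only by using \u escapes.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CzechRuleSketch {

    // Hypothetical rules: "<" starts a new primary (letter-level) position,
    // "," adds a tertiary (case-level) variant of the preceding entry.
    static final String CZECH_RULES =
          "< a, A < \u00E1, \u00C1 "                        // a, a-acute
        + "< b, B < c, C < \u010D, \u010C "                 // b, c, c-caron
        + "< d, D < \u010F, \u010E "                        // d, d-caron
        + "< e, E < \u00E9, \u00C9 < \u011B, \u011A "       // e, e-acute, e-caron
        + "< f, F < g, G < h, H < ch, cH, Ch, CH "          // f, g, h, digraph ch
        + "< i, I < \u00ED, \u00CD "                        // i, i-acute
        + "< j, J < k, K < l, L < m, M "                    // j, k, l, m
        + "< n, N < \u0148, \u0147 "                        // n, n-caron
        + "< o, O < \u00F3, \u00D3 "                        // o, o-acute
        + "< p, P < q, Q < r, R < \u0159, \u0158 "          // p, q, r, r-caron
        + "< s, S < \u0161, \u0160 "                        // s, s-caron
        + "< t, T < \u0165, \u0164 "                        // t, t-caron
        + "< u, U < \u00FA, \u00DA < \u016F, \u016E "       // u, u-acute, u-ring
        + "< v, V < w, W < x, X "                           // v, w, x
        + "< y, Y < \u00FD, \u00DD "                        // y, y-acute
        + "< z, Z < \u017E, \u017D";                        // z, z-caron

    // Returns a sorted copy of the input, leaving the original list untouched.
    static List<String> sortCzech(List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(CZECH_RULES);
            List<String> copy = new ArrayList<>(words);
            copy.sort(collator);
            return copy;
        } catch (ParseException e) {
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        // The three words from the question, written with \u escapes.
        List<String> words = Arrays.asList(
            "\u00DA\u010Detnictv\u00ED", "Udr\u017Eiteln\u00E1", "Uhlovod\u00EDky");
        System.out.println(String.join(", ", sortCzech(words)));
    }
}
```

Because u-acute gets its own primary position after u in these rules, the three words from the question should come out as Udržitelná, Uhlovodíky, Účetnictví, which is the order the question asks for.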

I'm currently searching for where JDK gets its 'rule' string for the cz locale. Presumably, that doesn't exist, or piggybacks on a different language (which is incorrect), or contains a bug. I'll edit this answer if I find it.

UPDATE: My search ends at sun.util.locale.provider.ResourceBundleBasedAdapter; that's an interface so no further info can be gleaned from there: an implementation of this is the source of the 'rules' string used by the JDK. But, the name kinda yells 'there is a resource bundle file that contains it', thus, our answer is in locale-cz.resource or whatnot.

These seem to be the locale resource files used by the JDK: jdk/src/jdk.localedata/share/classes/sun/text/resources/ext/

It does not appear to contain a _cz entry at all. It does contain a _cs, which looked bizarre to me because CS as a country code is the retired code for Serbia and Montenegro; note, though, that cs is also the ISO 639-1 language code for Czech (CZ is the country code), and it is the code the question passes to the Locale constructor. I had a quick look at that file and it did not appear to me to be the Czech collation order.

The github repo is just a convenient clone. Possibly we're not looking at the right stuff. But it sure feels like this is a JDK bug issue that should be reported.
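
Rather than reading the JDK sources, the rule string in effect can also be inspected at runtime. This is a small sketch under the assumption that Collator.getInstance returns a RuleBasedCollator (the code checks rather than relies on it) and that locale-specific rules, if any, are appended to the root rules; the class and method names (InspectCollationRules, rulesFor, escape) are made up for illustration.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class InspectCollationRules {

    // Returns the rule string behind the JDK collator for the locale,
    // or null if the collator is not a RuleBasedCollator.
    static String rulesFor(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        if (collator instanceof RuleBasedCollator) {
            return ((RuleBasedCollator) collator).getRules();
        }
        return null;
    }

    // Renders non-ASCII characters as \\uXXXX so the rules are readable on any console.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String root = rulesFor(Locale.ROOT);
        String czech = rulesFor(Locale.forLanguageTag("cs"));
        if (root == null || czech == null) {
            System.out.println("Collator is not rule based; nothing to inspect.");
            return;
        }
        if (czech.startsWith(root)) {
            // Whatever follows the root rules is specific to the cs locale.
            System.out.println("Rules added for cs: " + escape(czech.substring(root.length())));
        } else {
            System.out.println("cs rules (" + czech.length() + " chars) do not extend the root rules.");
        }
    }
}
```

An empty "Rules added for cs" line would suggest that no Czech-specific data is being picked up in that runtime; a non-empty one shows exactly which letters the JDK tailors.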
