Day 43: Language Sorting Notes

« Day 42: Deep into language sorting Day 44 and 45: More Language!! »

⚠️ Language Sorting in Progress

I’m still digging deep into this issue where some Slovak letters are not sorted properly. I know I am not going to solve it by the end of the day, so I am going to try to write up my ridiculous amount of notes so that I will be able to pick it up again on Monday.

🐛 The Problem

The pkg text/collate lets you pick a language set and then sorts words based on that language. It is basically a (very complex) wrapper on top of the go std library’s sort methods.

Each language has a character set defined. Here is English’s and Slovak’s:

"en": {
  0: "a b c d e f g h i j k l m n o p q r s t u v w x y z",
  2: "- ‐ – — , ; : ! ? . … ' ‘ ’ \" “ ” ( ) [ ] § @ * / & # † ‡ ′ ″",
  3: "á à ă â å ä ã ā æ ç é è ĕ ê ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ú ù ŭ û ü ū ÿ",
  5: "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z",
},
...
"sk": {
  0: "a á ä b c č d ď e é f g h ch i í j k l ĺ ľ m n ň o ó ô p q r ŕ s š t ť u ú v w x y ý z ž",
  2: "- ‐ – , ; : ! ? . … ‘ ‚ “ „ ( ) [ ] § @ * / &",
},

In the “main” English alphabet there are no accented letters. In English, letters with accents are sorted after ALL of the normal 26 a-z letters. That is not the case in Slovak. In Slovak the letters with accents are part of the alphabet and fall immediately after the non-accented version of the letter.

I decided to focus on one problematic letter at a time. To start with I chose: ď. The correct ordering should be d before ď. Here are some words in correct order:

d
da
dat
dr
ď
ďat

But go has other ideas…

package main

import (
	"fmt"

	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {
  // these words are sorted correctly
	words := []string{"d", "da", "dat", "dr", "ď", "ďat"}

	c := collate.New(language.Slovak)
	c.SortStrings(words)

	fmt.Println(words)
}

Output

[d ď da dat ďat dr]

ಠ_ಠ It’s not even consistently wrong!

Time to start adding a lot of print statements to try to figure out wtf is going on.

🔑 Collation Keys

In case I was thinking that this would be easy (ha!) I quickly came across this Unicode Document about how to how to compare to Unicode strings. One website gave it an estimated read time of 137 minutes to understand it. That seems to be underestimating it to me.

My very tldr for how the algorithm works is…

Each character in a word gets some weights.
Character weights are put together to form a long array of ints that represents the weight of the word. This is called the “collation key” for the word.
Then the keys for all the words are passed to a traditional sorting algorithm (quick/heap/etc) to sort.

Here is the code where it gets the key for each word. By adding lots of print statements I was able to get the collation key for each string.

d   -  [22 49 0 0 0 32 0 0 2]
ď   -  [22 49 0 0 0 32 0 65 0 0 2 2]
da  -  [22 49 21 239 0 0 0 32 0 32 0 0 2 2]
dat -  [22 49 21 239 24 22 0 0 0 32 0 32 0 32 0 0 2 2 2]
ďat -  [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2 2 2 2]
dr  -  [22 49 23 189 0 0 0 32 0 32 0 0 2 2]

✨ So now I know that they are sorted in this order, because that is the order of their collation keys.

❓ But how are these keys made?

🗝 Making Collation Keys

As I said above each character gets a weight. There are 4 levels of weights: primary, secondary, tertiary, and quaternary. I recorded these weights for each character:

Char	Primary	Secondary	Tertiary	Quaternary
d	5681	32	2	-
’ (the accent for ď)	0	65	2	-
a	5615	32	2	-
t	6166	32	2	-
r	6077	32	2	-

❓ Okay those are the weights, but how to they form the keys?

Algorithm for turning character weights into a collation key:

First add the primary value for each character, iff it is not 0.
Then add the secondary value for each character, iff it is not 0. (Plus some buffer zeros.)
Then add the tertiary value for each character, iff it is not 0. (Plus some buffer zeros.)
Then add the quaternary value for each character, iff it is not 0. (Plus some buffer zeros.)

Here’s a walk through of making the collation key for for the word ďat:

Add primary for d = [22 49] (I’m not sure how this conversion happens. It’s not hex. I’m confused, but ignored it for now.)
Skip primary for ' because it is 0.
Add primary for a = [22 49 21 239]
Add primary for t = [22 49 21 239 24 22]
Add secondary for d = [22 49 21 239 24 22 0 0 0 32]
Add secondary for ' = [22 49 21 239 24 22 0 0 0 32 0 65]
Add secondary for a = [22 49 21 239 24 22 0 0 0 32 0 65 0 32]
Add secondary for t = [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32]
Add tertiary for d = [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2]
Add tertiary for ' = [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2 2]
Add tertiary for a = [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2 2 2]
Add tertiary for t = [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2 2 2 2]

After after doing this walk through for all of my examples, I realized what the bug was!

🐞 Because the accent ' does not have a primary weight, the next weight added to the collation key is the primary weight for the NEXT letter. It’s basically treating the accent as if it is not there and that d should be sorted the same as ď (except for the slightly different secondary sorting amounts, but they are all the way at the end and only come into affect if the rest of the word is the same).

❓ My next question is: what is supposed to happen? Is ' supposed to have a primary weight?

What do collation keys look like for words that sort properly?

Next I found a letter with an accent that was sorting properly. I chose č.

package main

import (
	"fmt"

	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {
  // these words are sorted correctly
	words := []string{"cib", "čaj"}

	c := collate.New(language.Slovak)
	c.SortStrings(words)

	fmt.Println(words)
}

Output

[cib čaj]

Yay! It orders correctly!

So I mapped out the collation key for čaj and the first thing I noticed is that č is treated as one character.

Char	Primary	Secondary	Tertiary	Quaternary
č	5662	32	2	-
a	5615	32	2	-
j	5862	32	2	-

✨ I learned that accented characters should be treated as one character and should not be split up. This is what happens with č. This is not what happens with ď, which is split up into d and '.

✂️ Why is ď split into two characters with different weights?

Before the collation key is made, each character is turned into an “element” here and then the elements are passed into to an algorithm to make the weights and collation key.

char	element
c	0x402c3a20
č	0x402c3c20
d	0x402c6220
’	0xae604102

Turns out that d' is split into two elements. So the fault is not with the weighting. It comes before that.

❓ How are these elements made and why is ď split into two?

⚛️ Why is ď split into 2 elements?

The function getColElemsString is called to make the elements.
It calls Next which iterates over everything and “appends elems to the internal array”.
Next calls appendNext.
appendNext calls AppendNextString.
AppendNextString calls t.appendNext
t.appendNext calls src.lookup and this looks up the character in some table and returns the element.
BUT!!! When you call src.lookup on ď it returns back a different element than the ones I listed above: 0xe0000a83!! And it returns just one element!
Later in src.lookup it calls ctype on the element. This function looks at the element and returns whether it is normal or if it needs to be contracted OR EXPANDED.
ď is marked for expansion.
Then t.appendNext calls appendExpantion, which does just that. It expands 0xe0000a83 into 0x402c6220 and 0xae604102.
💥 boom! Now there are two elements for ď.

✨ I learned that some ranges of elements get expanded into 2 elements.

❓ Why is ď given an element that needs to be expanded? How are those tables made?

😴 Stopping point

I’m still trying to figure out that last question.

« Day 42: Deep into language sorting Day 44 and 45: More Language!! »