Day 43: Language Sorting Notes
⚠️ Language Sorting in Progress
I’m still digging deep into this issue where some Slovak letters are not sorted properly. I know I am not going to solve it by the end of the day, so I am going to try to write up my ridiculous amount of notes so that I will be able to pick it up again on Monday.
🐛 The Problem
The pkg text/collate lets you pick a language set and then sorts words based on that language. It is basically a (very complex) wrapper on top of the go std library’s sort methods.
Each language has a character set defined. Here is English’s and Slovak’s:
"en": {
0: "a b c d e f g h i j k l m n o p q r s t u v w x y z",
2: "- ‐ – — , ; : ! ? . … ' ‘ ’ \" “ ” ( ) [ ] § @ * / & # † ‡ ′ ″",
3: "á à ă â å ä ã ā æ ç é è ĕ ê ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ú ù ŭ û ü ū ÿ",
5: "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z",
},
...
"sk": {
0: "a á ä b c č d ď e é f g h ch i í j k l ĺ ľ m n ň o ó ô p q r ŕ s š t ť u ú v w x y ý z ž",
2: "- ‐ – , ; : ! ? . … ‘ ‚ “ „ ( ) [ ] § @ * / &",
},
In the “main” English alphabet there are no accented letters. In English, letters with accents are sorted after ALL of the normal 26 a-z letters. That is not the case in Slovak. In Slovak the letters with accents are part of the alphabet and fall immediately after the non-accented version of the letter.
I decided to focus on one problematic letter at a time. To start with I chose:
ď
. The correct ordering should be d
before ď
. Here are some words in
correct order:
- d
- da
- dat
- dr
- ď
- ďat
But go has other ideas…
package main
import (
"fmt"
"golang.org/x/text/collate"
"golang.org/x/text/language"
)
func main() {
// these words are sorted correctly
words := []string{"d", "da", "dat", "dr", "ď", "ďat"}
c := collate.New(language.Slovak)
c.SortStrings(words)
fmt.Println(words)
}
Output
[d ď da dat ďat dr]
ಠ_ಠ It’s not even consistently wrong!
Time to start adding a lot of print statements to try to figure out wtf is going on.
🔑 Collation Keys
In case I was thinking that this would be easy (ha!) I quickly came across this Unicode Document about how to how to compare to Unicode strings. One website gave it an estimated read time of 137 minutes to understand it. That seems to be underestimating it to me.
My very tldr for how the algorithm works is…
- Each character in a word gets some weights.
- Character weights are put together to form a long array of ints that represents the weight of the word. This is called the “collation key” for the word.
- Then the keys for all the words are passed to a traditional sorting algorithm (quick/heap/etc) to sort.
Here is the code where it gets the key for each word. By adding lots of print statements I was able to get the collation key for each string.
d - [22 49 0 0 0 32 0 0 2]
ď - [22 49 0 0 0 32 0 65 0 0 2 2]
da - [22 49 21 239 0 0 0 32 0 32 0 0 2 2]
dat - [22 49 21 239 24 22 0 0 0 32 0 32 0 32 0 0 2 2 2]
ďat - [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2 2 2 2]
dr - [22 49 23 189 0 0 0 32 0 32 0 0 2 2]
✨ So now I know that they are sorted in this order, because that is the order of their collation keys.
❓ But how are these keys made?
🗝 Making Collation Keys
As I said above each character gets a weight. There are 4 levels of weights: primary, secondary, tertiary, and quaternary. I recorded these weights for each character:
Char | Primary | Secondary | Tertiary | Quaternary |
---|---|---|---|---|
d | 5681 | 32 | 2 | - |
’ (the accent for ď) | 0 | 65 | 2 | - |
a | 5615 | 32 | 2 | - |
t | 6166 | 32 | 2 | - |
r | 6077 | 32 | 2 | - |
❓ Okay those are the weights, but how to they form the keys?
Algorithm for turning character weights into a collation key:
- First add the primary value for each character, iff it is not 0.
- Then add the secondary value for each character, iff it is not 0. (Plus some buffer zeros.)
- Then add the tertiary value for each character, iff it is not 0. (Plus some buffer zeros.)
- Then add the quaternary value for each character, iff it is not 0. (Plus some buffer zeros.)
Here’s a walk through of making the collation key for for the word ďat:
- Add primary for
d
= [22 49] (I’m not sure how this conversion happens. It’s not hex. I’m confused, but ignored it for now.) - Skip primary for
'
because it is 0. - Add primary for
a
= [22 49 21 239] - Add primary for
t
= [22 49 21 239 24 22] - Add secondary for
d
= [22 49 21 239 24 22 0 0 0 32] - Add secondary for
'
= [22 49 21 239 24 22 0 0 0 32 0 65] - Add secondary for
a
= [22 49 21 239 24 22 0 0 0 32 0 65 0 32] - Add secondary for
t
= [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32] - Add tertiary for
d
= [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2] - Add tertiary for
'
= [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2 2] - Add tertiary for
a
= [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2 2 2] - Add tertiary for
t
= [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2 2 2 2]
After after doing this walk through for all of my examples, I realized what the bug was!
🐞 Because the accent '
does not have a primary weight, the next weight added
to the collation key is the primary weight for the NEXT letter. It’s basically
treating the accent as if it is not there and that d
should be sorted the same
as ď
(except for the slightly different secondary sorting amounts, but they
are all the way at the end and only come into affect if the rest of the word is
the same).
❓ My next question is: what is supposed to happen? Is '
supposed to have a
primary weight?
What do collation keys look like for words that sort properly?
Next I found a letter with an accent that was sorting properly. I chose č
.
package main
import (
"fmt"
"golang.org/x/text/collate"
"golang.org/x/text/language"
)
func main() {
// these words are sorted correctly
words := []string{"cib", "čaj"}
c := collate.New(language.Slovak)
c.SortStrings(words)
fmt.Println(words)
}
Output
[cib čaj]
Yay! It orders correctly!
So I mapped out the collation key for čaj
and the first thing I noticed is
that č
is treated as one character.
Char | Primary | Secondary | Tertiary | Quaternary |
---|---|---|---|---|
č | 5662 | 32 | 2 | - |
a | 5615 | 32 | 2 | - |
j | 5862 | 32 | 2 | - |
✨ I learned that accented characters should be treated as one character and
should not be split up. This is what happens with č
. This is not what
happens with ď
, which is split up into d
and '
.
✂️ Why is ď split into two characters with different weights?
Before the collation key is made, each character is turned into an “element” here and then the elements are passed into to an algorithm to make the weights and collation key.
char | element |
---|---|
c | 0x402c3a20 |
č | 0x402c3c20 |
d | 0x402c6220 |
’ | 0xae604102 |
Turns out that d'
is split into two elements. So the fault is not with the
weighting. It comes before that.
❓ How are these elements made and why is ď split into two?
⚛️ Why is ď split into 2 elements?
- The function
getColElemsString
is called to make the elements. - It calls
Next
which iterates over everything and “appends elems to the internal array”. Next
callsappendNext
.appendNext
callsAppendNextString
.AppendNextString
callst.appendNext
t.appendNext
callssrc.lookup
and this looks up the character in some table and returns the element.- BUT!!! When you call
src.lookup
onď
it returns back a different element than the ones I listed above:0xe0000a83
!! And it returns just one element! - Later in
src.lookup
it callsctype
on the element. This function looks at the element and returns whether it is normal or if it needs to be contracted OR EXPANDED. ď
is marked for expansion.- Then
t.appendNext
callsappendExpantion
, which does just that. It expands0xe0000a83
into0x402c6220
and0xae604102
. - 💥 boom! Now there are two elements for
ď
.
✨ I learned that some ranges of elements get expanded into 2 elements.
❓ Why is ď
given an element that needs to be expanded? How are those tables
made?
😴 Stopping point
I’m still trying to figure out that last question.