👾 Artificial Languages

I spent my morning digging into this issue about artificial languages not being sorted correctly.

It turns out that it is not just about artificial languages, it is about all language tags that have a private tag (the x thing).

Going back to my new favorite rfc: rfc4647 “Matching of Language Tags” section 3.4 “Lookup” it turns out that language matchers should take the x private-use sequences into account when matching:

In the lookup scheme, the language range is progressively truncated from the end until a matching language tag is located. Single letter or digit subtags (including both the letter ‘x’, which introduces private-use sequences, and the subtags that introduce extensions) are removed at the same time as their closest trailing subtag. For example, starting with the range “zh-Hant-CN-x-private1-private2” (Chinese, Traditional script, China, two private-use tags) the lookup progressively searches for content as shown below:

Example of a Lookup Fallback Pattern

Range to match: zh-Hant-CN-x-private1-private2

  1. zh-Hant-CN-x-private1-private2
  2. zh-Hant-CN-x-private1
  3. zh-Hant-CN
  4. zh-Hant
  5. zh
  6. (default)

However, these private tags aren’t being taken into account. At first I thought it might be a weird artificial language thing, but it turns out to be a bug with EVERY language. Here is an example without artificial languages.

package main

import (
	"fmt"
	"golang.org/x/text/language"
)

func main() {
	a, _ := language.Parse("en-x-a")
	b, _ := language.Parse("en-x-b")
	matcher := language.NewMatcher([]language.Tag{a, b})

	m, _, _ := matcher.Match(b) // Asks for the best match for "en-x-b"

	fmt.Println(m) // en-x-a :(
}

This ends up happening because the algorithm considers art-x-a to be an exact match with art-x-b. It doesn’t even check other options.

Strangely enough the code looks like it checks the private use tags when it calls the method VariantOrPrivateUseTags here. But that function always returns an empty string for me! So it always reports that the private use tags are identical!

I added a hacky fix for that, but it still didn’t fix the issue. Even if they aren’t equal the code doesn’t check it again to lower the confidence level like it does here for when there is a different in script or region. So in the end, art-x-a is STILL listed as an exact match for art-x-b.

Then I added a fix for THAT! If the private tags were different I lowered the confidence level to “high” so that the algorithm would keep searching through the options and would eventually find the “exact” match. This worked… for my example at least. It broke a lot of tests because it lowered their confidence levels from exact to high. Seems like an unwanted breaking change… maybe?

🔮 What’s Next

I only have 4 more working days until the end of my sabbatical. Where did the time go?

My goals for the next few days are:

  • Write a comment with my findings for the artificial languages bug
  • Write a blog post wrap up of my time in this experiment
  • Write a presentation to give to my company about how my sabbatical went
  • Write a proposal for what I want to work on next

😭 I miss this already.