The Impact of Collations in Databases

Reading time: 6 min

A while ago, shortly after I joined my team at one of my employers, a colleague showed me the extremely slow production environment and explained that it was caused by a large query used to retrieve most of the necessary information. As my colleagues noted, the query had become increasingly slow over time, but they hadn't had the time to look into the matter. Instead, they scaled up the database resources in AWS one tier at a time. With the support of my colleague, I began to investigate the matter in more detail.

Checking out the issue

Upon reviewing the query, I noticed that each execution took longer than 15 seconds to complete in production. To get a hint of what was happening, I checked the database dashboard in AWS. From my observations, the database seemed to be idle most of the time, except when the query was executed - then there was a spike in memory usage. This led me to conclude that the issue wasn't with the allocated resources, but with the query itself. To pinpoint the problem, I analyzed the query with the following syntax:

SQL
ANALYZE FORMAT=JSON explainable_statement;
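
As a concrete example, this is roughly what such a call looks like for a lookup joining two tables on a key column (the query shown here is illustrative, not the actual production query):

SQL
ANALYZE FORMAT=JSON
SELECT c.first_name, a.address
FROM customer c
JOIN address a ON a.address_id = c.address_id
WHERE c.customer_id = 5;

Unlike EXPLAIN, ANALYZE actually executes the statement, so the output contains measured values such as r_rows and r_total_time_ms next to the optimizer's estimates.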

From the output it became clear that something was going wrong on an indexed search, which made me raise my eyebrows. From my time at university, I remembered that SQL databases usually store their indexes as B-trees, and an indexed search on a B-tree is extremely fast: Θ(log(n)). So why did this query take so much time to finish? Well, spoiler alert: it was because of the collation - or should I say collations.

Apparently, a colleague had introduced a new collation next to the default collation MariaDB provides (latin1_swedish_ci). This new collation gave the database different instructions on how data is sorted and compared. Because of the mismatch between collations, the database couldn't use the index, which resulted in it scanning and converting all the rows to find the matching value: Θ(n). With thousands of rows and multiple joins involved, that gets slow.

As an example, I have downloaded the Sakila sample database and modified the primary key and corresponding foreign key to show the difference between using the same collation and different collations. When both keys were set to the same collation, the indexed search was almost instantaneous. However, when different collations were used, the search speed slowed down significantly. In addition, the database had to loop through all rows of the table once to convert those columns to the correct collation.
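
The Sakila key columns are integers by default, so to demonstrate the collation effect they have to become character columns first. A sketch of the kind of modification involved (you may need to drop the foreign key constraint before these statements succeed, and the column length is arbitrary):

SQL
-- Turn both key columns into character columns, but with mismatched collations
ALTER TABLE customer
    MODIFY address_id VARCHAR(10)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

ALTER TABLE address
    MODIFY address_id VARCHAR(10)
    CHARACTER SET utf32 COLLATE utf32_general_ci;

With identical collations on both sides the join stays index-backed; with the mismatch above, the optimizer has to convert one side row by row, as the second ANALYZE output below shows.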

Analyze output, keys using the same collation:

JSON
"query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 0.005056578, <-- fast execution time! 😁
    "table": {
        "table_name": "customer",
        "access_type": "const",
        "possible_keys": ["PRIMARY", "idx_fk_address_id"],
        "key": "PRIMARY",
        "key_length": "2",
        "used_key_parts": ["customer_id"],
        "ref": ["const"],
        "r_loops": 0,
        "rows": 1,
        "r_rows": null,
        "filtered": 100,
        "r_filtered": null
    },
    "table": {
        "table_name": "address",
        "access_type": "const",
        "possible_keys": ["PRIMARY"],
        "key": "PRIMARY",
        "key_length": "1022",
        "used_key_parts": ["address_id"],
        "ref": ["const"],
        "r_loops": 0,
        "rows": 1, <-- Using the index key
        "r_rows": null,
        "filtered": 100,
        "r_filtered": null
    }
}

Analyze output, keys using different collations:

JSON
"query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 0.211677541, <-- slow execution time! 😥
    "table": {
        "table_name": "customer",
        "access_type": "const",
        "possible_keys": ["PRIMARY", "idx_fk_address_id"],
        "key": "PRIMARY",
        "key_length": "2",
        "used_key_parts": ["customer_id"],
        "ref": ["const"],
        "r_loops": 0,
        "rows": 1,
        "r_rows": null,
        "filtered": 100,
        "r_filtered": null
    },
    "table": {
        "table_name": "address",
        "access_type": "ALL",
        "r_loops": 1, <-- Looped through all the rows of the table
        "rows": 603,
        "r_rows": 603, <-- How many rows were actually read by the execution of the query
        "r_table_time_ms": 0.152315624,
        "r_other_time_ms": 0.055721654,
        "filtered": 100,
        "r_filtered": 0.165837479,
        "attached_condition": "convert(address.address_id using utf32) = '\0\0\05'"
    }
}

What exactly is a collation?

A collation defines the rules used to sort and compare data. It is tied to a character set (code page), which specifies the bit patterns that represent each character. For character data types, such as char and varchar, this means the collation dictates both which characters can be represented and how they are ordered and matched.

So, collations are pretty important when it comes to your queries and linguistic rules, as they can affect both performance and output. For instance, in Swedish the letters å, ä, and ö are treated as separate letters and are sorted after “z” in the dictionary, whereas German treats the umlauts (ä, ö, ü) as variants of their base letters. This means that “Müller” appears directly after “Muller” under a German collation, but considerably further away under a Swedish one. To understand this in depth you need to know the format of a collation. A collation consists of a name and options, e.g., Latin1_Swedish_100_CI_AS_SC or Japanese_Bushu_Kakusu_100_CS_AS_KS_WS.

The collation name

The collation name “Latin1_Swedish” specifies the alphabet, language, or culture. In this case it refers to the Latin1 character set (also known as ISO-8859-1), with Swedish language attributes like å, ä, and ö. For scripts and characters not used in that particular language, explicit rules may not exist; for example, such a collation does not define the ordering of characters like Ж, ש, ♫, ∞, ◊, or ⌂. Other examples may include:

Language   Swedish: z < ö
           German: ö < z
Culture    German dictionary: of < öf
           German phone book: öf < of
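
You can observe these language differences directly with a COLLATE clause. A small sketch using MariaDB's utf8mb4 collations (the exact set of available collations depends on your server version):

SQL
-- Swedish rules: ö sorts after z
SELECT name FROM (SELECT 'zebra' AS name UNION SELECT 'öl') t
ORDER BY name COLLATE utf8mb4_swedish_ci;

-- German rules: ö sorts with o, so before z
SELECT name FROM (SELECT 'zebra' AS name UNION SELECT 'öl') t
ORDER BY name COLLATE utf8mb4_german2_ci;

The same two rows come back in a different order depending solely on the collation applied to the ORDER BY.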

Occasionally you will also see a number like “100” after the collation name. This number refers to the version of the sorting rules used by this collation. In this case, it refers to version 100 of the Windows collation rules.

The collation options

After the collation name and version, the options are appended. These options can vary by database engine; however, the most important and common ones are:

Syntax   Option               Sorting rule
CS       Case-sensitive       a < A
CI       Case-insensitive     a == A
AS       Accent-sensitive     a != ấ
AI       Accent-insensitive   a == ấ
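
These options are easiest to see in direct comparisons. Note that MariaDB encodes them as lowercase suffixes (_ci, _cs, _bin) rather than the CS/AS style above; a sketch, assuming a utf8mb4 connection character set:

SQL
SELECT 'a' = 'A' COLLATE utf8mb4_bin;        -- 0: case-sensitive (binary)
SELECT 'a' = 'A' COLLATE utf8mb4_general_ci; -- 1: case-insensitive
SELECT 'a' = 'á' COLLATE utf8mb4_general_ci; -- 1: accent-insensitive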

What collation should I use?

In general, I would recommend using utf8mb4_unicode_ci for most use cases. This collation covers a wide range of languages and, because it belongs to the utf8mb4 character set, supports characters beyond the Basic Multilingual Plane (BMP), such as emoji. In exceptional cases I would deviate from this collation, and even then, I would most likely apply it to a specific column instead of the entire database or table.
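
Scoping the collation to a single column can look like this (table, column, and length are illustrative):

SQL
-- Only this column gets the full Unicode comparison rules;
-- the rest of the table keeps its existing collation
ALTER TABLE customer
    MODIFY last_name VARCHAR(45)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Keep in mind that joining or comparing such a column against columns with a different collation reintroduces exactly the index problem described above, so keep columns that are compared to each other on the same collation.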