MySQL Collation & Case Sensitivity or Insensitivity

When using MySQL (or most any DB really), it’s a good idea to understand about collation: the implications, the defaults, and how to change it.

Collation determines how data is sorted and compared in the database. Its essentially a set of rules for the DB to follow when making decisions on said sort and comparison. It answers questions the DB needs to know, like: “how should upper / lower case or character accents be considered when I compare or sort something?”

By default, MySQL will use a case-insensitive collation. You might see this as: latin1_ci, utf8mb4_unicode_ci, etc, depending on the character set you have. The “_ci” stands for “case insensitive”.

While this may be desirable for certain applications, and can be overridden on a per-query basis with the “COLLATE” clause, it can also be very detrimental if you are not aware of it; if you are using an ORM, it may not be desirable to create direct SQL statements.

I recently ran into an issue with a client where they were using the default case-insensitive collation on a project and this was causing undesired collisions. This client was using Django ORM, which has no ability to use the COLLATE clause without direct SQL statements, which were not desirable.

The problem was that – in this particular application – Microsoft Graph API email ids where being used with a Unique constraint. While these email IDs are indeed unique, they can become non-unique if you are using a DB that is collating without regard to case, as was the case here (pun could possibly have been intended). The email id, in effect, require you to use case-sensitive collation if you are going to key off of them or otherwise declare them to be unique.

Luckily with MySQL, we can dynamically alter the collation at multiple levels: DB, TABLE, or COLUMN. In the following example I show the collation problem and then repair the collation problem at the table level in MySQL.

Let’s start off by viewing the actual problem:

mysql> select * from a_table where email_id='AAMkADg5YTA3MDI4LTU4OTktNDU0Mi1hNDg2LWFhMWJlNDIxMTcxYQBGAAAAAABGS97sl180Soe3IIehe0cIBwBJjaJhSZ5dQrvfHER4hyffAAAAEpopAADzbgOkaOB-RrhbS1H09mzgAACsctTzAAA=';
+----+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+------------+----------------------------+------------------+----------------------------+---------+------------+---------------------+---------------------+
| id | email_id                                                                                                                                                 | dt_created                 | event_id   | dt_received                | state            | dt_completed               | case_id | error      | dt_last_error_email | ticket_num_received |
+----+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+------------+----------------------------+------------------+----------------------------+---------+------------+---------------------+---------------------+
| 38 | AAMkADg5YTA3MDI4LTU4OTktNDU0Mi1hNDg2LWFhMWJlNDIxMTcxYQBGAAAAAABGS97sl180Soe3IIehe0cIBwBJjaJhSZ5dQrvfHER4hyffAAAAEpopAADzbgOkaOB-RrhbS1H09mzgAACsctTZAAA= | 2020-06-14 14:15:53.000000 | 7480568530 | 2020-06-14 14:16:00.054631 | COMPLETE_SUCCESS | 2020-06-14 14:25:15.280664 |       7 | ERROR_NONE | NULL                | NULL                |
+----+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+------------+----------------------------+------------------+----------------------------+---------+------------+---------------------+---------------------+
1 row in set (0.00 sec)

So, can you spot the problem? If you look at the 4th from the last character – the “z”, you will notice that the one in the query is lower-case while the one MySQL declared a match is upper-case. This actually prevents insertion of the one in the query, if I were to try to insert it, which the application was trying to do. Of course, in reality these are completely different emails and the application needed to be able to make this insertion.

To fix this, we first need to understand the DB’s current overall collation and then view the current collation where the collision is occurring:

mysql> SELECT @@character_set_database, @@collation_database; 
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| utf8mb4                  | utf8mb4_unicode_ci   |
+--------------------------+----------------------+

mysql> show full columns from a_table;
+---------------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
| Field               | Type         | Collation          | Null | Key | Default | Extra          | Privileges                      | Comment |
+---------------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
| id                  | int(11)      | NULL               | NO   | PRI | NULL    | auto_increment | select,insert,update,references |         |
| email_id            | varchar(191) | utf8mb4_unicode_ci | NO   | UNI | NULL    |                | select,insert,update,references |         |
| dt_created          | datetime(6)  | NULL               | NO   |     | NULL    |                | select,insert,update,references |         |
| event_id            | varchar(128) | utf8mb4_unicode_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| dt_received         | datetime(6)  | NULL               | NO   |     | NULL    |                | select,insert,update,references |         |
| state               | varchar(63)  | utf8mb4_unicode_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| dt_completed        | datetime(6)  | NULL               | YES  |     | NULL    |                | select,insert,update,references |         |
| case_id             | int(11)      | NULL               | YES  | MUL | NULL    |                | select,insert,update,references |         |
| error               | varchar(63)  | utf8mb4_unicode_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| dt_last_error_email | datetime(6)  | NULL               | YES  |     | NULL    |                | select,insert,update,references |         |
| ticket_num_received | varchar(31)  | utf8mb4_unicode_ci | YES  |     | NULL    |                | select,insert,update,references |         |
+---------------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
11 rows in set (0.00 sec)

We can see all varchar fields are collating with utf8mb4_unicode_ci, which as we now know, is case insensitive.

The best first step is to confirm our theory by running the same query and adding the COLLATE clause with the collation we think will address the issue:

mysql> select * from a_table where email_id='AAMkADg5YTA3MDI4LTU4OTktNDU0Mi1hNDg2LWFhMWJlNDIxMTcxYQBGAAAAAABGS97sl180Soe3IIehe0cIBwBJjaJhSZ5dQrvfHER4hyffAAAAEpopAADzbgOkaOB-RrhbS1H09mzgAACsctTzAAA=' COLLATE utf8mb4_bin;
Empty set (0.00 sec)

So, no results from the query tells us that – under this new collations utf8mb4_bin – a ‘z’ and ‘Z’ are not considered to be the same thing, thus confirming our theory that changing to this new collation would solve the problem.

So we could change the collation for the entire DB like so:

ALTER DATABASE a_database CHARACTER SET = utf8mb4 COLLATE = utf8mb4_bin;

But that could be risky unless we are absolutely certain that case-insensitivity was not needed anywhere. If we are not the original (“greenfield”) developer for this project, it is entirely possible that some assumptions have been made somewhere by another developer with regards to the collation, and would those assumptions be documented? Probably not. If we go whole-hog and change the collation for the whole DB, that could break something.

At the opposite end of the spectrum, we could change it at the column level like so:

ALTER TABLE a_table
MODIFY a_table.email_id VARCHAR(191) 
CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;

Changing at the column or table level is the safest course which will achieve the objective. I chose to do it at the table level and run regressions against the functionality which used the table.

So next, we issue the ALTER TABLE SQL command to set the desired collation:

mysql> alter table a_table convert to character set utf8mb4 collate utf8mb4_bin;
Query OK, 9 rows affected (0.07 sec)
Records: 9  Duplicates: 0  Warnings: 0

Next, we view the table’s collation again to verify it is as desired:

mysql> show full columns from a_table;
+---------------------+--------------+-------------+------+-----+---------+----------------+---------------------------------+---------+
| Field               | Type         | Collation   | Null | Key | Default | Extra          | Privileges                      | Comment |
+---------------------+--------------+-------------+------+-----+---------+----------------+---------------------------------+---------+
| id                  | int(11)      | NULL        | NO   | PRI | NULL    | auto_increment | select,insert,update,references |         |
| email_id            | varchar(191) | utf8mb4_bin | NO   | UNI | NULL    |                | select,insert,update,references |         |
| dt_created          | datetime(6)  | NULL        | NO   |     | NULL    |                | select,insert,update,references |         |
| event_id            | varchar(128) | utf8mb4_bin | NO   |     | NULL    |                | select,insert,update,references |         |
| dt_received         | datetime(6)  | NULL        | NO   |     | NULL    |                | select,insert,update,references |         |
| state               | varchar(63)  | utf8mb4_bin | NO   |     | NULL    |                | select,insert,update,references |         |
| dt_completed        | datetime(6)  | NULL        | YES  |     | NULL    |                | select,insert,update,references |         |
| case_id             | int(11)      | NULL        | YES  | MUL | NULL    |                | select,insert,update,references |         |
| error               | varchar(63)  | utf8mb4_bin | NO   |     | NULL    |                | select,insert,update,references |         |
| dt_last_error_email | datetime(6)  | NULL        | YES  |     | NULL    |                | select,insert,update,references |         |
| ticket_num_received | varchar(31)  | utf8mb4_bin | YES  |     | NULL    |                | select,insert,update,references |         |
+---------------------+--------------+-------------+------+-----+---------+----------------+---------------------------------+---------+
11 rows in set (0.00 sec)

Lastly, we run the same query as before to verify the results:

mysql> select * from a_table where email_id='AAMkADg5YTA3MDI4LTU4OTktNDU0Mi1hNDg2LWFhMWJlNDIxMTcxYQBGAAAAAABGS97sl180Soe3IIehe0cIBwBJjaJhSZ5dQrvfHER4hyffAAAAEpopAADzbgOkaOB-RrhbS1H09mzgAACsctTzAAA=';
Empty set (0.00 sec)

So, now we see that there is no match because the collation is no longer considering ‘z’ and ‘Z’ to be the same thing. Subsequent testing through Django’s ORM was also successful.

In the future, for this project, I will likely set the collation at the DB level in my DEV & TEST envs to see if anything breaks. If all that goes well, then I will know I can just set it collation at DB creation time, without the need to do any ALTER commands later:

CREATE DATABASE a_database CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;

So here I switched to utf8mb4_bin to get the case sensitivity I wanted. Be aware that as MySQL version’s progress, different collations may come and go. You simply have to see what collations are available for your particular MySQL version. Newer versions of MySQL are moving to using the *_cs extension to denote case sensitivity. This post is not really about advising you on any particular collation. You will need to do some legwork to figure out what collations are possible with your chosen character set, and then which of those suite the needs of your project best.

Here is a great command to list the possible collations for a give character set:


mysql> SHOW COLLATION WHERE Charset = 'utf8mb4';
+------------------------+---------+-----+---------+----------+---------+
| Collation              | Charset | Id  | Default | Compiled | Sortlen |
+------------------------+---------+-----+---------+----------+---------+
| utf8mb4_general_ci     | utf8mb4 |  45 | Yes     | Yes      |       1 |
| utf8mb4_bin            | utf8mb4 |  46 |         | Yes      |       1 |
| utf8mb4_unicode_ci     | utf8mb4 | 224 |         | Yes      |       8 |
| utf8mb4_icelandic_ci   | utf8mb4 | 225 |         | Yes      |       8 |
| utf8mb4_latvian_ci     | utf8mb4 | 226 |         | Yes      |       8 |
| utf8mb4_romanian_ci    | utf8mb4 | 227 |         | Yes      |       8 |
| utf8mb4_slovenian_ci   | utf8mb4 | 228 |         | Yes      |       8 |
| utf8mb4_polish_ci      | utf8mb4 | 229 |         | Yes      |       8 |
| utf8mb4_estonian_ci    | utf8mb4 | 230 |         | Yes      |       8 |
| utf8mb4_spanish_ci     | utf8mb4 | 231 |         | Yes      |       8 |
| utf8mb4_swedish_ci     | utf8mb4 | 232 |         | Yes      |       8 |
| utf8mb4_turkish_ci     | utf8mb4 | 233 |         | Yes      |       8 |
| utf8mb4_czech_ci       | utf8mb4 | 234 |         | Yes      |       8 |
| utf8mb4_danish_ci      | utf8mb4 | 235 |         | Yes      |       8 |
| utf8mb4_lithuanian_ci  | utf8mb4 | 236 |         | Yes      |       8 |
| utf8mb4_slovak_ci      | utf8mb4 | 237 |         | Yes      |       8 |
| utf8mb4_spanish2_ci    | utf8mb4 | 238 |         | Yes      |       8 |
| utf8mb4_roman_ci       | utf8mb4 | 239 |         | Yes      |       8 |
| utf8mb4_persian_ci     | utf8mb4 | 240 |         | Yes      |       8 |
| utf8mb4_esperanto_ci   | utf8mb4 | 241 |         | Yes      |       8 |
| utf8mb4_hungarian_ci   | utf8mb4 | 242 |         | Yes      |       8 |
| utf8mb4_sinhala_ci     | utf8mb4 | 243 |         | Yes      |       8 |
| utf8mb4_german2_ci     | utf8mb4 | 244 |         | Yes      |       8 |
| utf8mb4_croatian_ci    | utf8mb4 | 245 |         | Yes      |       8 |
| utf8mb4_unicode_520_ci | utf8mb4 | 246 |         | Yes      |       8 |
| utf8mb4_vietnamese_ci  | utf8mb4 | 247 |         | Yes      |       8 |
+------------------------+---------+-----+---------+----------+---------+
26 rows in set (0.01 sec)

See the following MySQL doc page for more information on this topic:

Character Sets and Collations in MySQL

Leave a Comment Cancel Reply