Storing special characters in MySQL

A while back (maybe 5 years or so?) I was trying to get a whole load of data in to a MySQL database, but kept finding that characters such as ć were being displayed incorrectly.

The character falls within the UTF-8 Extended Latin character set and my database was set up for UTF8, so why was I seeing strange multi character results in my data?

It turns out that MySQL utf8 is an alias for utf8mb3, allowing only up to three bytes per character. This means that the basic multilingual plane (BMP) is supported out the box, however unicode characters in other planes are not supported.

To support characters outside the basic multilingual plane, you’ll want to set your MySQL server to use utf8mb4. This allows MySQL to store up to 4 bytes per character, and means you’ll now be able to store characters such as ć and emojis correctly.

How to set up your database with utf8mb4

Update /etc/my.cnf

[client]
port=3306
default-character-set=utf8mb4

[mysql]
default-character-set=utf8mb4

[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_bin

Set the database to:

ALTER SCHEMA `<DATABASE>`  DEFAULT COLLATE utf8mb4_bin ;

Check output of the below which should now mention utf8mb4:

show variables like '%colla%';

If this looks good, you should now be able to insert special characters with no issue.

utf8mb4 to become default

The good news is that the MySQL docs suggest that the utf8 alias will soon point to utf8mb4 and no longer utf8mb3. You can read more on that here.