A while back (maybe 5 years or so?) I was trying to get a whole load of data in to a MySQL database, but kept finding that characters such as ć
were being displayed incorrectly.
The character falls within the UTF-8 Extended Latin character set and my database was set up for UTF8
, so why was I seeing strange multi character results in my data?
It turns out that MySQL utf8
is an alias for utf8mb3
, allowing only up to three bytes per character. This means that the basic multilingual plane (BMP) is supported out the box,
however unicode characters in other planes are not supported.
To support characters outside the basic multilingual plane, you’ll want to set your MySQL server to use utf8mb4
. This allows MySQL to store up to 4 bytes per character, and means you’ll
now be able to store characters such as ć
and emojis correctly.
Update /etc/my.cnf
[client]
port=3306
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_bin
Set the database to:
ALTER SCHEMA `<DATABASE>` DEFAULT COLLATE utf8mb4_bin ;
Check output of the below which should now mention utf8mb4
:
show variables like '%colla%';
If this looks good, you should now be able to insert special characters with no issue.
The good news is that the MySQL docs suggest that the utf8
alias will soon point to utf8mb4
and no longer utf8mb3
. You can read more on that here.