A while back (maybe 5 years or so?) I was trying to get a whole load of data in to a MySQL database, but kept finding that characters such as
ć were being displayed incorrectly.
The character falls within the UTF-8 Extended Latin character set and my database was set up for
UTF8, so why was I seeing strange multi character results in my data?
It turns out that MySQL
utf8 is an alias for
utf8mb3, allowing only up to three bytes per character. This means that the basic multilingual plane (BMP) is supported out the box,
however unicode characters in other planes are not supported.
To support characters outside the basic multilingual plane, you’ll want to set your MySQL server to use
utf8mb4. This allows MySQL to store up to 4 bytes per character, and means you’ll
now be able to store characters such as
ć and emojis correctly.
Set the database to:
ALTER SCHEMA `<DATABASE>` DEFAULT COLLATE utf8mb4_bin ;
Check output of the below which should now mention
show variables like '%colla%';
If this looks good, you should now be able to insert special characters with no issue.
The good news is that the MySQL docs suggest that the
utf8 alias will soon point to
utf8mb4 and no longer
utf8mb3. You can read more on that here.