The basic rule is that setting charset defaults applies to future Create and Alter statements that rely on such defaults, but don't change existing tables or columns. See https://dev.mysql.com/doc/refman/8.0/en/charset-conversion.html.
There's an
Alter Table...Convert To...
command
that does change column defs, but it does not fix charset inconsistencies and
mismatches---for that, see here.
If you are using the mysql client program, you can prevent many
charset problems by issuing, right off the top ...
SET NAMES "charset_name";... which is shorthand for ...
SET character_set_client=charset_name; SET character_set_results=charset_name; SET character_set_connection=charset_name;For most purposes,
charset_name
should be utf8
or since MySQL 8.0 utf8mb4
.
If you are using PHP, instead of issuing a SET NAMES
query at
the top, call mysqli_set_charset(), which also sets the charset for
PHP functions like mysqli_escape_string().
Note that these practices do not prevent problems arising from charset mismatches, for example between table or column charset settings and the client query ing the table, or between table or column and the data coming in.
For webapps there's another set of problems---some browsers, when they detect a character set mismatch with what's been specified, do a silent conversion, so looking OK in a browser doesn't ensure that the character encoding is sound, so forms accepting user data should include ...
<form accept "charset_name">... for example ...
<form accepts "UTF-8">
LENGTH()
of the character against CHAR_LENGTH()
:
set names utf8; set @utf8q = "’"; select @utf8q, hex(@utf8q) as hexOfQ, convert(@utf8q using utf8) as convToUtf8, convert(@utf8q using latin1) as convToLatin1, hex( convert(@utf8q using latin1) ) as HexOflatin1, length(@utf8q) as len, char_length(@utf8q) as charLen; +----------+--------+------------+--------------+-------------+------+---------+ | @utf8q | hexOfQ | convToUtf8 | convToLatin1 | HexOflatin1 | len | charLen | +----------+--------+------------+--------------+-------------+------+---------+ | ’ | E28099 | ’ | ’ | 92 | 3 | 1 | +----------+--------+------------+--------------+-------------+------+---------+Inserting the character into a table introduces more problem possibilitiess. Here's a small test table to store our char as latin1 or utf8, either as i s, or specified as latin1, or specified as utf8:
drop table if exists c; create table c( id smallint unsigned primary key auto_increment, slatin varchar(64) character set latin1, sutf8 varchar(64) character set utf8, comment varchar(64) ); insert into c set slatin="’", sutf8="’", comment="as is"; insert into c set slatin=_latin1"’", sutf8=_utf8"’", comment="slatin forced latin1, sutf8 forced utf8"; insert into c set slatin=_utf8"’", sutf8=_latin1"’", comment="slatin forced utf8, sutf8 forced latin1"; insert into c set slatin=_latin1"’", sutf8=_latin1"’", comment="both forced latin1"; insert into c set slatin=_utf8"’", sutf8=_utf8"’", comment="both forced utf8"; select id, slatin,hex(slatin),sutf8,hex(sutf8), comment from c\G *************************** 1. row *************************** id: 1 slatin: ’ hex(slatin): C3A2E282ACE284A2 sutf8: ’ hex(sutf8): C383C2A2C3A2E2809AC2ACC3A2E2809EC2A2 comment: as is *************************** 2. row *************************** id: 2 slatin: ’ hex(slatin): C383C2A2C3A2E2809AC2ACC3A2E2809EC2A2 sutf8: ’ hex(sutf8): C383C2A2C3A2E2809AC2ACC3A2E2809EC2A2 comment: slatin forced latin1, sutf8 forced utf8 *************************** 3. row *************************** id: 3 slatin: ’ hex(slatin): C3A2E282ACE284A2 sutf8: ’ hex(sutf8): C383C692C382C2A2C383C2A2C3A2E282ACC5A1C382C2ACC383C2A2C3A2E282ACC5BEC382C2A2 comment: slatin forced utf8, sutf8 forced latin1 *************************** 4. row *************************** id: 4 slatin: ’ hex(slatin): C383C2A2C3A2E2809AC2ACC3A2E2809EC2A2 sutf8: ’ hex(sutf8): C383C692C382C2A2C383C2A2C3A2E282ACC5A1C382C2ACC383C2A2C3A2E282ACC5BEC382C2A2 comment: both forced latin1 *************************** 5. row *************************** id: 5 slatin: ’ hex(slatin): C3A2E282ACE284A2 sutf8: ’ hex(sutf8): C383C2A2C3A2E2809AC2ACC3A2E2809EC2A2 comment: both forced utf8
Stored as is or with a specification matching the target column, everything's fine. But when the charset of what goes into the table fails to match the column charset, as in rows 2,3 and 4, we see charset mangling.
Here's a more elaborate example inserting the same Greek text into latin1, latin7 (which is "Baltic Rim", which includes Greek), and utf8 columns of a table, under each of those three Set Names settings, then Selecting each of the nine results under those three Set Names settings, giving 27 readings. Blank cells indicate unreadable results:
Inserting and Selecting Greek Text in MySQL +--------------+------------+-----------------------------------------+ | | Set Names for Insert | +--------------+------------+------------+--------------+-------------+ | Set Names | Column | | | | | for Select | Charset | latin1 | latin7 | utf8 | +--------------+------------+------------+--------------+-------------+ | | latin1 | Correct | | | | latin1 | latin7 | Correct | | | | | utf8 | Correct | | | +--------------+------------+------------+--------------+-------------+ | | latin1 | | | | | latin7 | latin7 | | Correct | | | | utf8 | | | | +--------------+------------+------------+--------------+-------------+ | | latin1 | | | Correct | | utf8 | latin7 | | | Correct | | | utf8 | | | Correct | +--------------+------------+------------+--------------+-------------+
When the charset of incoming data matches the column charset, and the charset of the client doing the Select matches both, the result is correct. In 20 of the remaining 24 combinations, the result is mangled. Latin1 and utf8 are a little forgiving of each other, but all remaining mismatches produce mangling!
The moral of the story: take care to match incoming data with column specifications and current client, server and table settings.
Must-read character set and collation resources ...