Charset & Collation Issues

Often in database management, character set and collation settiongs need careful management: Charsets are ideally uniform across all tables & columns in a database except where specs require otherwise. Over even a short period of time, though, charset & collate defaults change, so do programming and data entry habits, so differences mismatches and incompatibilities accumulate.

The basic rule is that setting charset defaults applies to future Create and Alter statements that rely on such defaults, but don't change existing tables or columns. See https://dev.mysql.com/doc/refman/8.0/en/charset-conversion.html.

There's an Alter Table...Convert To... command that does change column defs, but it does not fix charset inconsistencies and mismatches---for that, see here. If you are using the mysql client program, you can prevent many charset problems by issuing, right off the top ...

SET NAMES "charset_name"; 
... which is shorthand for ...
SET character_set_client=charset_name;
SET character_set_results=charset_name; 
SET character_set_connection=charset_name; 
For most purposes, charset_name should be utf8 or since MySQL 8.0 utf8mb4.

If you are using PHP, instead of issuing a SET NAMES query at the top, call mysqli_set_charset(), which also sets the charset for PHP functions like mysqli_escape_string().

Note that these practices do not prevent problems arising from charset mismatches, for example between table or column charset settings and the client query ing the table, or between table or column and the data coming in.

For webapps there's another set of problems---some browsers, when they detect a character set mismatch with what's been specified, do a silent conversion, so looking OK in a browser doesn't ensure that the character encoding is sound, so forms accepting user data should include ...

                                                                                                                                                       
<form accept "charset_name">                                                                                                                                
... for example ...
                                                                                                                                                       
<form accepts "UTF-8">                                                                                                                                       

Charset mismatches

Character set mismatches can go unnoticed. To diagnose a suspicious character, compare hex values under relevant charsets, and compare LENGTH() of the character against CHAR_LENGTH():
                                                                                                                                                       
set names utf8;                                                                                                                                             
set @utf8q = "’";                                                                                                                                    
select                                                                                                                                                      
  @utf8q,                                                                                                                                                   
  hex(@utf8q) as hexOfQ,                                                                                                                                    
  convert(@utf8q using utf8) as convToUtf8,                                                                                                                 
  convert(@utf8q using latin1) as convToLatin1,                                                                                                             
  hex( convert(@utf8q using latin1) ) as HexOflatin1,                                                                                                       
  length(@utf8q) as len,                                                                                                                                    
  char_length(@utf8q) as charLen;                                                                                                                           
+----------+--------+------------+--------------+-------------+------+---------+                                                                              
| @utf8q   | hexOfQ | convToUtf8 | convToLatin1 | HexOflatin1 | len  | charLen |                                                                              
+----------+--------+------------+--------------+-------------+------+---------+                                                                              
| ’ | E28099 | ’   | ’     | 92          |    3 |       1 |                                                         
+----------+--------+------------+--------------+-------------+------+---------+                                                                              
Inserting the character into a table introduces more problem possibilitiess. Here's a small test table to store our char as latin1 or utf8, either as i s, or specified as latin1, or specified as utf8:
                                                                                                                                                       
drop table if exists c;                                                                                                                                     
create table c(                                                                                                                                             
  id smallint unsigned primary key auto_increment,                                                                                                          
  slatin varchar(64) character set latin1,                                                                                                                      
  sutf8  varchar(64) character set utf8,                                                                                                                        
  comment varchar(64)                                                                                                                                       
);                                                                                                                                                          
insert into c 
  set slatin="’", sutf8="’", comment="as is";                                                                                     
insert into c 
  set slatin=_latin1"’", sutf8=_utf8"’", 
  comment="slatin forced latin1, sutf8 forced utf8";                                       
insert into c 
  set slatin=_utf8"’", sutf8=_latin1"’", 
  comment="slatin forced utf8, sutf8 forced latin1";                                       
insert into c 
  set slatin=_latin1"’", sutf8=_latin1"’", 
  comment="both forced latin1";                                                          
insert into c 
set slatin=_utf8"’", sutf8=_utf8"’", 
comment="both forced utf8";                                                                
                                                                                                                                                            
select id, slatin,hex(slatin),sutf8,hex(sutf8), comment from c\G
*************************** 1. row ***************************
         id: 1
     slatin: ’
hex(slatin): C3A2E282ACE284A2
      sutf8: ’
 hex(sutf8): C383C2A2C3A2E2809AC2ACC3A2E2809EC2A2
    comment: as is
*************************** 2. row ***************************
         id: 2
     slatin: ’
hex(slatin): C383C2A2C3A2E2809AC2ACC3A2E2809EC2A2
      sutf8: ’
 hex(sutf8): C383C2A2C3A2E2809AC2ACC3A2E2809EC2A2
    comment: slatin forced latin1, sutf8 forced utf8
*************************** 3. row ***************************
         id: 3
     slatin: ’
hex(slatin): C3A2E282ACE284A2
      sutf8: ’
 hex(sutf8): C383C692C382C2A2C383C2A2C3A2E282ACC5A1C382C2ACC383C2A2C3A2E282ACC5BEC382C2A2
    comment: slatin forced utf8, sutf8 forced latin1
*************************** 4. row ***************************
         id: 4
     slatin: ’
hex(slatin): C383C2A2C3A2E2809AC2ACC3A2E2809EC2A2
      sutf8: ’
 hex(sutf8): C383C692C382C2A2C383C2A2C3A2E282ACC5A1C382C2ACC383C2A2C3A2E282ACC5BEC382C2A2
    comment: both forced latin1
*************************** 5. row ***************************
         id: 5
     slatin: ’
hex(slatin): C3A2E282ACE284A2
      sutf8: ’
 hex(sutf8): C383C2A2C3A2E2809AC2ACC3A2E2809EC2A2
    comment: both forced utf8                                              

Stored as is or with a specification matching the target column, everything's fine. But when the charset of what goes into the table fails to match the column charset, as in rows 2,3 and 4, we see charset mangling.

Here's a more elaborate example inserting the same Greek text into latin1, latin7 (which is "Baltic Rim", which includes Greek), and utf8 columns of a table, under each of those three Set Names settings, then Selecting each of the nine results under those three Set Names settings, giving 27 readings. Blank cells indicate unreadable results:

                                                                                                                                                       
             Inserting and Selecting Greek Text in MySQL                                                                                                    
+--------------+------------+-----------------------------------------+                                                                                     
|                           |          Set Names for Insert           |                                                                                     
+--------------+------------+------------+--------------+-------------+                                                                                     
|  Set Names   | Column     |            |              |             |                                                                                     
|  for Select  | Charset    |  latin1    |   latin7     |   utf8      |                                                                                     
+--------------+------------+------------+--------------+-------------+                                                                                     
|              | latin1     |  Correct   |              |             |                                                                                     
| latin1       | latin7     |  Correct   |              |             |                                                                                     
|              | utf8       |  Correct   |              |             |                                                                                     
+--------------+------------+------------+--------------+-------------+                                                                                     
|              | latin1     |            |              |             |                                                                                     
| latin7       | latin7     |            |   Correct    |             |                                                                                     
|              | utf8       |            |              |             |                                                                                     
+--------------+------------+------------+--------------+-------------+                                                                                     
|              | latin1     |            |              |  Correct    |                                                                                     
| utf8         | latin7     |            |              |  Correct    |                                                                                     
|              | utf8       |            |              |  Correct    |                                                                                     
+--------------+------------+------------+--------------+-------------+                                                                                     

When the charset of incoming data matches the column charset, and the charset of the client doing the Select matches both, the result is correct. In 20 of the remaining 24 combinations, the result is mangled. Latin1 and utf8 are a little forgiving of each other, but all remaining mismatches produce mangling!

The moral of the story: take care to match incoming data with column specifications and current client, server and table settings.

Must-read character set and collation resources ...

Character set tutorial

Prevent charset problems

Collations

Analyse charset problems

Fix charset problems.