Question

What is the difference between utf8mb3_general_ci and utf8_general_ci?

Answer and Explanation

In MySQL there is no real difference between utf8mb3_general_ci and utf8_general_ci: `utf8` is an alias for the `utf8mb3` character set, so both names refer to the same collation and have the same storage capabilities. The comparison that matters in practice is between the 3-byte `utf8mb3` (alias `utf8`) and the 4-byte `utf8mb4`, which the points below cover:

1. `utf8mb3_general_ci`:

- This was the original UTF-8 implementation in MySQL (and some other databases). The mb3 stands for "multi-byte 3," which indicates that it uses a maximum of 3 bytes to store each character.

- The 3-byte limit means it can only encode characters in the Basic Multilingual Plane (code points U+0000 to U+FFFF). It cannot represent supplementary characters, which need 4 bytes in UTF-8: most emojis, some rarer CJK (Chinese, Japanese, Korean) ideographs, and a number of historic scripts and symbols.

- Although its name implies UTF-8, it supports only a subset of the UTF-8 standard, hence the "mb3" designation. Inserting a character that needs 4 bytes either fails with an "Incorrect string value" error (in strict SQL mode) or, in non-strict mode, is typically stored truncated with only a warning, which amounts to data loss.
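The 3-byte boundary can be checked outside the database. The following is a minimal Python sketch, not MySQL code: the helper names `utf8_byte_length` and `fits_in_utf8mb3` are made up for illustration, and the check assumes that "fits in utf8mb3" means "every code point is at or below U+FFFF".

```python
def utf8_byte_length(ch: str) -> int:
    """Return how many bytes one character occupies in standard UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(text: str) -> bool:
    """True if every code point lies in the Basic Multilingual Plane.

    Hypothetical helper: code points above U+FFFF need 4 bytes in UTF-8,
    which a 3-byte character set cannot store.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


# Sample characters, written as escapes so the source stays ASCII-only.
samples = {
    "A": "Latin letter",
    "\u00e9": "e with acute accent",
    "\u4e2d": "common CJK ideograph",
    "\U0001F600": "emoji (grinning face)",
}

for ch, label in samples.items():
    print(
        f"U+{ord(ch):04X} {label}: {utf8_byte_length(ch)} byte(s), "
        f"fits in utf8mb3: {fits_in_utf8mb3(ch)}"
    )
```

Running a check like this over incoming text before inserting it is one way to spot values that a utf8mb3 column would reject or truncate.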

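2. `utf8_general_ci` and `utf8mb4_general_ci`:

- Despite the name, `utf8_general_ci` is not the 4-byte collation. In MySQL, `utf8` has historically been an alias for `utf8mb3`, so `utf8_general_ci` is simply another name for `utf8mb3_general_ci` and has the same 3-byte limit. MySQL 8.0 deprecates the `utf8` alias, and its documentation warns that the alias may point to utf8mb4 in a future release, so it is safer to spell out utf8mb3 or utf8mb4 explicitly.

- `utf8mb4_general_ci` belongs to the utf8mb4 character set, the more modern UTF-8 support. It implements the full UTF-8 standard and uses a maximum of 4 bytes per character (mb4), which lets it encode every Unicode character, including all CJK characters and emojis.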

3. `general_ci` Part:

- The _general_ci suffix is shared by all of these collations. The _ci part means "case-insensitive": strings are compared without regard to case, so 'a' and 'A' are treated as equal. The "general" part names a simplified set of comparison rules. Other collations such as _unicode_ci follow the Unicode Collation Algorithm and give more accurate sorting and comparisons for some languages, but they are more complex and may be somewhat slower.

4. Practical Implications:

- If you are working with modern web content which often contains emojis and a wider range of international characters, it is important to use `utf8mb4_general_ci` to avoid data loss or errors when characters are not supported.

- For applications that are only dealing with ASCII or Latin-based characters and basic punctuation, utf8mb3_general_ci would likely be adequate. However, due to its limited nature and the potential for data loss, it’s generally recommended to switch to utf8mb4_general_ci to handle any situation smoothly.

5. Migration:

- Migrating from utf8mb3_general_ci to utf8mb4_general_ci is generally straightforward, but first verify that the database server, client libraries and connection settings, and application code all support utf8mb4. Because each character can now take up to 4 bytes, the conversion may require schema changes, for example shortening indexed string columns that would otherwise exceed the index key length limit.
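As a rough illustration of the migration step, the Python sketch below builds the MySQL `ALTER TABLE ... CONVERT TO CHARACTER SET` statement for a table and works out how many characters fit in an index under a byte limit. The helper names and the table name `posts` are hypothetical. The 767-byte figure is the index prefix limit of the older InnoDB row formats (COMPACT and REDUNDANT); newer configurations allow larger keys, so treat it as an assumption to check against your own server.

```python
# Assumption: index prefix limit for older InnoDB row formats.
INDEX_PREFIX_LIMIT_BYTES = 767


def max_indexable_chars(bytes_per_char: int,
                        limit: int = INDEX_PREFIX_LIMIT_BYTES) -> int:
    """Characters that fit in an index when each may need bytes_per_char bytes."""
    return limit // bytes_per_char


def build_convert_statement(table: str) -> str:
    """Build the conversion statement for one table (hypothetical helper).

    The table name is wrapped in backticks, so a name that itself
    contains a backtick is rejected instead of being escaped.
    """
    if "`" in table:
        raise ValueError("table name must not contain a backtick")
    return (
        f"ALTER TABLE `{table}` "
        "CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;"
    )


print(max_indexable_chars(3))  # utf8mb3: 255 characters
print(max_indexable_chars(4))  # utf8mb4: 191 characters
print(build_convert_statement("posts"))
```

The arithmetic shows why an indexed VARCHAR(255) column that was fine under utf8mb3 can exceed the limit after conversion, and why such columns are often shortened to 191 characters on older setups. Take a backup and try the generated statement on a copy of the data before running it in production.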

In summary, `utf8_general_ci` and `utf8mb3_general_ci` are two names for the same collation, which uses a maximum of 3 bytes per character and covers only part of the UTF-8 standard, while `utf8mb4_general_ci` uses up to 4 bytes per character and covers it completely. It is advisable to use utf8mb4_general_ci so that your database can store the full range of modern characters, including emojis.
