
Text Encoding: UTF-8, ASCII, Unicode - What Every Developer Must Know

2025-05-28 7 min read

Character encoding bugs cause garbled text, broken APIs, and mysterious errors. This guide explains ASCII, Unicode, UTF-8, UTF-16 and how encoding works in practice.

Encoding bugs produce the dreaded "garbled text" problem: question marks, boxes, or random symbols where letters should be. Understanding how character encoding works prevents these issues and helps you debug them when they appear.

The Fundamental Problem

Computers store text as numbers. The encoding defines which number represents which character. If you write text with one encoding and read it back with another, the bytes are interpreted as the wrong characters and you get garbled output. This garbled output is called mojibake.
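A minimal sketch of how mojibake arises, in Python. The sample string and the choice of Latin-1 as the "wrong" decoder are illustrative assumptions; other mismatched pairs of encodings misbehave in a similar way.

```python
# Illustrative sample text (an assumption, not taken from a real system).
original = "café"

# Encode with UTF-8: "é" becomes the two bytes 0xC3 0xA9.
data = original.encode("utf-8")

# Decode with the wrong encoding: Latin-1 turns each byte into one character.
garbled = data.decode("latin-1")
print(garbled)  # cafÃ©

# Because Latin-1 decoding loses no bytes, reversing the mistake recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```

The repair step only works when the wrong decoding preserved every byte. If a decoder replaced unknown bytes with "?" or a replacement character, the original text may not be recoverable.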

ASCII: The Ancestor

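ASCII, the American Standard Code for Information Interchange (1963), maps 128 characters (A-Z, a-z, 0-9, punctuation, control codes) to 7-bit numbers (0-127). It covers only unaccented English text. Many later encodings, including UTF-8, Latin-1, and Windows-1252, are backwards-compatible with ASCII, but not all of them: UTF-16 and UTF-32 are not.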
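To make the 0-127 limit concrete, here is a small Python sketch. The helper name `is_ascii_encodable` is hypothetical and exists only for illustration; recent Python versions also offer `str.isascii()` for the same check.

```python
# Every ASCII character maps to a number from 0 to 127.
print(ord("A"))              # 65
print("Hi".encode("ascii"))  # b'Hi'


def is_ascii_encodable(text: str) -> bool:
    """Return True if every character in text fits in 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # Raised when a character falls outside the 0-127 range.
        return False


print(is_ascii_encodable("hello"))  # True
print(is_ascii_encodable("héllo"))  # False
```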

Unicode: The Universal Standard

Unicode assigns a unique "code point" to every character in every writing system: over 149,000 characters covering 161 scripts plus emoji, as of Unicode 15. A code point looks like U+0041 (the letter A) or U+1F600 (😀). Unicode is the standard; how you store these numbers on disk or on the wire is the encoding.
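Code points can be inspected directly in Python, where `ord()` and `chr()` convert between a character and its number. The helper name `code_point_label` below is an illustrative assumption.

```python
def code_point_label(char: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(char):04X}"


print(code_point_label("A"))           # U+0041
print(code_point_label("\U0001F600"))  # U+1F600

# chr() goes the other way: from a code point number to a character.
print(chr(0x41))  # A
```

Note that nothing here involves bytes yet. A code point is an abstract number, and it only becomes bytes once an encoding such as UTF-8 is applied.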

UTF-8: The Universal Encoding

UTF-8 is a variable-length encoding for Unicode:

  • ASCII characters (U+0000 to U+007F): 1 byte, backwards compatible
  • Latin extended, Greek, Cyrillic (U+0080 to U+07FF): 2 bytes
  • Most other scripts including CJK (U+0800 to U+FFFF): 3 bytes
  • Emoji and rare characters (U+10000 to U+10FFFF): 4 bytes
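The byte counts in the list above can be checked with a short Python sketch. The sample characters and the helper name `utf8_length` are assumptions picked to hit one code point in each range.

```python
def utf8_length(char: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(char.encode("utf-8"))


samples = [
    "A",           # U+0041, ASCII range
    "é",           # U+00E9, inside U+0080 to U+07FF
    "中",          # U+4E2D, CJK, inside U+0800 to U+FFFF
    "\U0001F600",  # U+1F600, emoji, above U+FFFF
]

for char in samples:
    encoded = char.encode("utf-8")
    print(f"U+{ord(char):04X}: {utf8_length(char)} byte(s), hex {encoded.hex()}")
```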

UTF-8 is used by more than 98% of websites. Use UTF-8 for web pages, APIs, and databases, and always declare <meta charset="UTF-8"> in HTML.
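For APIs, one way to apply this advice is to encode and decode request bodies as UTF-8 explicitly rather than relying on defaults. The sketch below uses only Python's standard `json` module; the payload and the header value are illustrative assumptions, not taken from any particular service.

```python
import json

# Illustrative payload containing non-ASCII text (an assumption).
payload = {"name": "Zoë", "city": "東京"}

# ensure_ascii=False keeps the characters as-is instead of \uXXXX escapes;
# the explicit .encode("utf-8") fixes the byte representation.
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# A header value a server might send alongside the body (assumed for illustration).
content_type = "application/json; charset=utf-8"

# The receiving side decodes with the same encoding before parsing.
decoded = json.loads(body.decode("utf-8"))
print(decoded == payload)  # True
```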

UTF-16 and UTF-32

UTF-16 uses 2 bytes for most characters and 4 bytes (a surrogate pair) for code points above U+FFFF, such as emoji. It is used internally by Windows, Java, and JavaScript strings. UTF-32 uses exactly 4 bytes per code point: simple but wasteful. Neither is commonly used for data exchange or web content.
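A quick Python comparison of the three encodings, as a sketch. The helper name `encoded_sizes` is hypothetical, and the little-endian variants are used so that no byte order mark is counted.

```python
def encoded_sizes(char: str) -> dict:
    """Byte counts for one character in UTF-8, UTF-16, and UTF-32."""
    return {
        "utf-8": len(char.encode("utf-8")),
        # The "-le" codecs omit the byte order mark, so only the
        # character's own bytes are counted.
        "utf-16": len(char.encode("utf-16-le")),
        "utf-32": len(char.encode("utf-32-le")),
    }


print(encoded_sizes("A"))           # {'utf-8': 1, 'utf-16': 2, 'utf-32': 4}
print(encoded_sizes("中"))          # {'utf-8': 3, 'utf-16': 2, 'utf-32': 4}
print(encoded_sizes("\U0001F600"))  # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

The emoji row shows why string length in UTF-16 based languages can be surprising: one visible character occupies two 16-bit code units.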

