Counting Characters and UTF-8 — Characters, Bytes, and Graphemes

"How many characters is this text?" is a deceptively simple question that has no single answer. For the same text, the result changes depending on whether you count bytes, code units, code points, or graphemes (the characters you see). This article walks through the basics of Unicode and UTF-8, all the way to the common programming pitfall of "an emoji counted as two characters," with verified worked examples.

The bottom line up front: the "one character" a person sees = a grapheme cluster, the unit a program counts internally = a code unit / code point, and the amount needed to store or send it = bytes. Remembering that these three are different things will save you from a lot of confusion.

1. The "character count" depends on how you count

For example, "a" is one character and one byte by anyone's count. But "あ" is one character yet three bytes in UTF-8, and although "😀" looks like a single character, JavaScript's "😀".length returns 2. None of these is wrong; they are simply counting different units. The starting point is to distinguish these four ways of counting.

2. Unicode basics — code points and planes

Unicode is a standard that assigns a unique number to every character in the world. That number is called a code point and is written as a hexadecimal value following U+. For example, "A" is U+0041, "あ" is U+3042, and "😀" is U+1F600.
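As a small illustration of the U+ notation, the sketch below reads the code point of a one-character string and formats it. The helper name toUPlus is hypothetical and exists only for this example; codePointAt and String.fromCodePoint are standard JavaScript.

```javascript
// Format the first code point of a string in U+XXXX notation.
// toUPlus is a hypothetical helper name used only for this example.
function toUPlus(str) {
  const cp = str.codePointAt(0);
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toUPlus("A"));         // U+0041
console.log(toUPlus("\u3042"));    // U+3042 (あ)
console.log(toUPlus("\u{1F600}")); // U+1F600 (😀)

// The reverse direction: build a string from a code point number.
console.log(String.fromCodePoint(0x3042)); // あ
```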

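Code points range from U+0000 to U+10FFFF and are divided into 17 "planes" of 65,536 (64K) code points each. Plane 0 (U+0000 to U+FFFF) is the Basic Multilingual Plane (BMP); planes 1 through 16 (U+10000 to U+10FFFF) are the supplementary planes.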

Whether a character "fits in the BMP or is in a supplementary plane" is an important boundary that directly affects UTF-16 surrogate pairs and the str.length discrepancy discussed below.
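One way to check which side of that boundary a character falls on is to compute its plane number, which is the code point divided by 0x10000 and rounded down. The following is a minimal sketch; planeOf and isBmp are hypothetical helper names, not part of any library.

```javascript
// Plane number = code point divided by 0x10000 (65,536), rounded down.
// planeOf and isBmp are hypothetical helper names for this sketch.
function planeOf(codePoint) {
  return codePoint >> 16;
}

// Plane 0 is the BMP; planes 1 to 16 are supplementary planes.
function isBmp(codePoint) {
  return planeOf(codePoint) === 0;
}

console.log(planeOf(0x3042), isBmp(0x3042));   // 0 true  (あ)
console.log(planeOf(0x1F600), isBmp(0x1F600)); // 1 false (😀)
console.log(planeOf(0x10FFFF));                // 16 (the last plane)
```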

3. UTF-8 — a variable-length encoding

UTF-8 encodes a code point into a byte sequence using a variable length of 1 to 4 bytes, and it is the de facto standard of today's web. The number of bytes is determined by the size of the code point.

Code point range | UTF-8 byte length | Examples
--- | --- | ---
U+0000 to U+007F | 1 byte | ASCII (A, digits, symbols)
U+0080 to U+07FF | 2 bytes | Latin Extended, Greek, Cyrillic, etc.
U+0800 to U+FFFF | 3 bytes | Most Japanese such as hiragana and kanji (e.g., あ)
U+10000 to U+10FFFF | 4 bytes | Many emoji (😀), some kanji

The advantages of UTF-8 are that it is backward compatible with ASCII (ASCII characters stay one byte) and that it is endianness independent. On the other hand, because the number of bytes per character is not constant, the character count and the byte count differ for any text that contains non-ASCII characters.
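The range table can be cross-checked against a real encoder. The sketch below predicts the byte length from the code point range and compares it with what TextEncoder (which always produces UTF-8) actually emits. The helper name utf8LengthOf is hypothetical, and the sketch assumes valid code points rather than lone surrogates.

```javascript
// Byte length predicted from the code point range.
// utf8LengthOf is a hypothetical helper name for this sketch.
function utf8LengthOf(codePoint) {
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4;
}

const encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

// "A", precomposed é (U+00E9), あ (U+3042), 😀 (U+1F600)
for (const ch of ["A", "\u00e9", "\u3042", "\u{1F600}"]) {
  const predicted = utf8LengthOf(ch.codePointAt(0));
  const actual = encoder.encode(ch).length;
  console.log(ch, predicted, actual);
}
// A 1 1
// é 2 2
// あ 3 3
// 😀 4 4
```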

4. Code units vs code points vs grapheme clusters

This is the part most often misunderstood in programming. JavaScript strings are represented internally as UTF-16, and str.length returns the number of UTF-16 code units. It is not necessarily the number of code points or graphemes.

Surrogate pairs (UTF-16's supplementary-plane representation)

UTF-16 represents BMP characters with one code unit (2 bytes), but characters in the supplementary planes (U+10000 and above) are represented by two code units — a surrogate pair. That is why "😀".length === 2. When you want to count by code point, use the spread syntax or for...of.
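These points can be checked directly. The sketch below compares .length with the code point count and then derives the two surrogate code units from a code point using the standard UTF-16 arithmetic. toSurrogatePair is a hypothetical helper name used only for illustration.

```javascript
const s = "\u{1F600}"; // 😀

console.log(s.length);      // 2 (UTF-16 code units)
console.log([...s].length); // 1 (code points, via spread)

// for...of also iterates by code point, not by code unit.
let count = 0;
for (const ch of "a\u{1F600}\u3042") count++;
console.log(count);         // 3

// Derive the surrogate pair for a supplementary-plane code point.
// toSurrogatePair is a hypothetical helper name for this sketch.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16)); // d83d de00
console.log(s.charCodeAt(0) === high, s.charCodeAt(1) === low); // true true
```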

Combining characters, ZWJ, and grapheme clusters

Furthermore, several code points can combine into a single glyph. Examples include combining characters (e.g., representing é as e + the combining acute accent U+0301) and family emoji that join emoji with a ZWJ (zero-width joiner, U+200D). To count these as "one character a person sees," use Intl.Segmenter.

```javascript
// Count by grapheme
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment("😀")].length;          // 1
[...seg.segment("👨‍👩‍👧")].length; // 1
```

👨‍👩‍👧 (a family emoji) is made of 👨 + ZWJ + 👩 + ZWJ + 👧, five code points in total (its UTF-16 .length is 8), yet it appears as one grapheme. For character limits and validation on social platforms, which unit you count by has a big impact on the user experience.

5. Verifying with worked examples

Here are representative characters lined up under the four ways of counting (the UTF-8 byte counts and the JavaScript values have been verified by actually running the code).

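String | Graphemes | Code points | UTF-16 length | UTF-8 bytes
--- | --- | --- | --- | ---
a | 1 | 1 | 1 | 1
あ | 1 | 1 | 1 | 3
 | 1 | 1 | 1 | 3
😀 | 1 | 1 | 2 | 4
é (e + U+0301) | 1 | 2 | 2 | 3
👨‍👩‍👧 | 1 | 5 | 8 | 18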

For example, "あ" is U+3042, which falls in the 3-byte range, so it is 1 character = 3 bytes. "😀" is U+1F600 in a supplementary plane, so it is 1 grapheme and 1 code point, but its .length is 2 and it takes 4 bytes in UTF-8. "👨‍👩‍👧" is three emoji plus two ZWJs, five code points in total. The three emoji take 4 bytes each (4 × 3 = 12) and the two ZWJs (U+200D) take 3 bytes each (3 × 2 = 6), for a total of 18 bytes.
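The figures in this section can be reproduced with a few lines of JavaScript. The following is a minimal sketch, assuming a runtime that provides Intl.Segmenter and TextEncoder (current browsers and recent Node.js versions); countAll and the sample labels are hypothetical names chosen for this example. The strings are written with escapes so that the ZWJ and the combining accent are visible in the source.

```javascript
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const encoder = new TextEncoder();

// Count one string in all four units.
// countAll is a hypothetical helper name for this sketch.
function countAll(str) {
  return {
    graphemes: [...segmenter.segment(str)].length,
    codePoints: [...str].length,
    utf16Length: str.length,
    utf8Bytes: encoder.encode(str).length,
  };
}

const samples = {
  "a": "a",
  "hiragana a": "\u3042",
  "grinning face": "\u{1F600}",
  "e + U+0301": "e\u0301",
  "family": "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}",
};

for (const [label, text] of Object.entries(samples)) {
  const c = countAll(text);
  console.log(`${label}: ${c.graphemes} ${c.codePoints} ${c.utf16Length} ${c.utf8Bytes}`);
}
// a: 1 1 1 1
// hiragana a: 1 1 1 3
// grinning face: 1 1 2 4
// e + U+0301: 1 2 2 3
// family: 1 5 8 18
```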

6. Line breaks, whitespace, and when you need byte counts

In practice, it is not only "what counts as one character" but also how you treat whitespace and line breaks that matters.

Which unit is required, and in particular when byte counts are needed, depends on the situation:

Situation | Unit to count | Why
--- | --- | ---
Character limit on an input field (UI) | Graphemes | To match a person's "one character" by sight
Database column length (VARCHAR, etc.) | Bytes or characters | The unit of the limit depends on the DB and character set
Data transfer / file size | Bytes | It is the amount actually transferred and stored
Input to crypto / hashing | Bytes | The processing target is always a byte sequence
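To make the distinction concrete, the sketch below checks one piece of text against a grapheme limit (for the UI) and a byte limit (for storage), and trims text to a byte budget without cutting a grapheme in half. The function names checkLimits and truncateToBytes and the limits 10 and 6 are hypothetical values chosen for illustration; a real limit depends on the field and the database in question.

```javascript
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const encoder = new TextEncoder();

// Check a UI limit in graphemes and a storage limit in UTF-8 bytes.
// checkLimits is a hypothetical helper name for this sketch.
function checkLimits(text, maxGraphemes, maxBytes) {
  const graphemes = [...segmenter.segment(text)].length;
  const bytes = encoder.encode(text).length;
  return {
    graphemes,
    bytes,
    fitsUi: graphemes <= maxGraphemes,
    fitsStorage: bytes <= maxBytes,
  };
}

// Keep whole graphemes until the next one would exceed the byte budget.
// truncateToBytes is a hypothetical helper name for this sketch.
function truncateToBytes(text, maxBytes) {
  let out = "";
  let used = 0;
  for (const { segment } of segmenter.segment(text)) {
    const size = encoder.encode(segment).length;
    if (used + size > maxBytes) break;
    out += segment;
    used += size;
  }
  return out;
}

const text = "\u3042\u{1F600}"; // あ😀: 2 graphemes, 3 + 4 = 7 bytes
const result = checkLimits(text, 10, 6);
console.log(result.graphemes, result.bytes);     // 2 7
console.log(result.fitsUi, result.fitsStorage);  // true false
console.log(truncateToBytes(text, 6));           // あ
```

Truncating on grapheme boundaries avoids storing half of a surrogate pair or an emoji sequence with its joiner cut off.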

Frequently Asked Questions (FAQ)

Why is an emoji sometimes counted as two characters?

JavaScript's str.length returns the number of UTF-16 code units. A character in the supplementary planes (U+10000 and above), such as "😀", is represented by two code units (a surrogate pair), so length counts it as 2. To count code points, use [...str].length; to match the way people count characters by sight (graphemes), use Intl.Segmenter.

What is the difference between a character count and a byte count?

A character count is how many characters there are; a byte count is how many bytes are needed to store or transmit them. In UTF-8, ASCII is 1 byte, most Japanese characters are 3 bytes, and many emoji are 4 bytes per code point, so the byte count differs from one character to another. Byte counts matter for database column sizing and estimating data transfer.

What is a grapheme cluster?

It is the unit a person perceives as a single character. Combining characters or a ZWJ (zero-width joiner) can make several code points appear as one glyph; for example "👨‍👩‍👧" is made of several code points but is a single grapheme. In JavaScript you can split text into graphemes with Intl.Segmenter.
