Unicode Converter
Convert between text and Unicode encoding with multiple output formats
What is Unicode?
Unicode is an industry standard in computing that encodes most of the world's writing systems. Each Unicode character has a unique numeric identifier called a Code Point, typically represented in hexadecimal with a U+ prefix, such as U+4E2D for the Chinese character '中'. The Unicode Converter tool can convert text to various Unicode representation formats and also restore Unicode encodings back to text. Unicode describes characters by code points, while encodings such as UTF-8 define how those code points are stored as bytes. This matters when analyzing text with emoji, accents, CJK characters, symbols, control characters, or mixed scripts. One visible character may consist of multiple code points, especially with combining marks, variation selectors, or emoji sequences. The tool helps debug mojibake, normalization, escape sequences, invisible characters, and mismatches between character count and byte length.
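The gap between code points, UTF-16 code units, and UTF-8 bytes can be seen with a few lines of JavaScript. This is a minimal sketch, not the tool's own code; `describeText` is a hypothetical helper name, and it relies only on the standard `TextEncoder` and `codePointAt()` APIs.

```javascript
// Hypothetical helper (not taken from the tool): describe a string at three levels.
function describeText(text) {
  // Array.from iterates by code point, so a surrogate pair arrives as one item.
  const codePoints = Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
  return {
    codePoints,                                       // one entry per code point
    utf16Units: text.length,                          // what String.length reports
    utf8Bytes: new TextEncoder().encode(text).length, // size when stored as UTF-8
  };
}

console.log(describeText("中"));       // one code point, 1 code unit, 3 bytes
console.log(describeText("e\u0301")); // one visible character, two code points
console.log(describeText("😀"));      // one code point, 2 code units, 4 bytes
```

The second call is the case that most often surprises people: the text renders as a single "é" but is made of a base letter plus a combining mark.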
How to Use
Steps
- Enter or paste the text to convert in the input box
- Select conversion direction: Text to Unicode or Unicode to Text
- Choose output format: \uXXXX or U+XXXX (available when encoding)
- Results appear automatically; copy them with one click or swap input and output
Encoding Notes
- Unicode escape formats are useful for code and debugging, but they can reduce readability for normal copywriting.
- When decoding, verify surrogate pairs and emoji output; splitting a pair can produce broken characters.
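To illustrate the warning about split pairs, the sketch below cuts an emoji in half and then detects the damage. `hasLoneSurrogate` is a hypothetical helper written for this example and is not a feature of the tool.

```javascript
const grin = "\uD83D\uDE00"; // 😀 written as a surrogate pair

// slice() works on UTF-16 code units, so it can cut a pair in half.
const broken = grin.slice(0, 1);

// Hypothetical check: true if the string contains an unpaired surrogate.
function hasLoneSurrogate(str) {
  // The string iterator yields a well-formed pair as one item, so any item
  // whose code point is still in the surrogate range must be unpaired.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xd800 && cp <= 0xdfff) return true;
  }
  return false;
}

console.log(hasLoneSurrogate(grin));   // false
console.log(hasLoneSurrogate(broken)); // true
```

A check like this is useful after truncating or splitting escaped text, before the result is stored or displayed.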
Use Cases
Technical Principle
The Unicode Standard, which is kept synchronized with ISO/IEC 10646, assigns a unique numeric code point to every character across 17 planes (U+0000 to U+10FFFF). The Basic Multilingual Plane (BMP, Plane 0) covers U+0000–U+FFFF and contains the characters of nearly all modern writing systems, including the main CJK Unified Ideographs block. Supplementary planes (1–16) hold historic scripts, rare CJK ideographs, emoji, and special-purpose characters. The tool converts between readable text and two machine-oriented representations: JavaScript-style \uXXXX escape sequences and the standard U+XXXX notation.
JavaScript strings are sequences of UTF-16 code units. Every BMP character is stored as a single 16-bit code unit equal to its code point, while supplementary characters (U+10000 and above) are encoded as surrogate pairs: 0x10000 is subtracted from the code point to leave a 20-bit value, whose upper 10 bits form the high surrogate (0xD800 + ((cp - 0x10000) >> 10)) and whose lower 10 bits form the low surrogate (0xDC00 + ((cp - 0x10000) & 0x3FF)). The tool's encode mode detects supplementary code points via String.prototype.codePointAt() and emits two \u escapes for them in the \uXXXX format. In the U+XXXX format, the full code point is displayed directly.
The decode mode parses three syntaxes: \uXXXX (exactly four hex digits, a single UTF-16 code unit), \u{XXXXX} (ES6 curly-brace syntax supporting the full Unicode range), and U+XXXX (the standard notation with four to six hex digits). The regex /\\u\{([0-9a-fA-F]+)\}/g handles curly-brace escapes and feeds them to String.fromCodePoint(), while /\\u([0-9a-fA-F]{4})/g handles traditional escapes via String.fromCharCode(). Because fromCharCode() emits raw code units, two consecutive \u escapes that form a surrogate pair are joined back into a single supplementary character.
UTF-8 matters because it determines byte length: a BMP character like '中' (U+4E2D) encodes as 3 UTF-8 bytes (E4 B8 AD), while an emoji like '😀' (U+1F600) requires 4 bytes (F0 9F 98 80). The tool's character counter distinguishes between code point count and UTF-16 code unit count, a useful distinction when debugging length limits in databases, APIs, or form fields that count code units rather than characters.
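The surrogate-pair arithmetic described above can be sketched as a small encoder. This is an illustration under stated assumptions, not the tool's source: the names `toSurrogatePair`, `hex4`, and `encodeText` are hypothetical, and the uppercase hex digits and the space separator in U+ mode are choices made for this example.

```javascript
// Split a supplementary code point (> 0xFFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  const offset = cp - 0x10000;           // 20-bit value
  const high = 0xd800 + (offset >> 10);  // upper 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // lower 10 bits
  return [high, low];
}

// Format a number as at least four uppercase hex digits.
const hex4 = (n) => n.toString(16).toUpperCase().padStart(4, "0");

// format is "\\u" (escape sequences) or "U+" (code point notation).
function encodeText(text, format = "\\u") {
  const parts = [];
  for (const ch of text) {               // iterates by code point
    const cp = ch.codePointAt(0);
    if (format === "U+") {
      parts.push("U+" + hex4(cp));       // full code point, 4 to 6 digits
    } else if (cp > 0xffff) {
      const [high, low] = toSurrogatePair(cp);
      parts.push("\\u" + hex4(high) + "\\u" + hex4(low));
    } else {
      parts.push("\\u" + hex4(cp));
    }
  }
  return parts.join(format === "U+" ? " " : "");
}

console.log(encodeText("中😀"));       // \u4E2D\uD83D\uDE00
console.log(encodeText("中😀", "U+")); // U+4E2D U+1F600
```

The bit-shift form and the Math.floor / modulo form of the formula give the same result; the shift version is used here because the offset always fits in 20 bits.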
- Code point iteration: String.prototype.codePointAt(pos) correctly returns the full code point for supplementary characters, unlike charCodeAt() which returns only the high surrogate — the tool uses the spread operator [...str] to iterate by code point, which internally calls the string iterator protocol.
- Surrogate pair math: For a supplementary code point CP > 0xFFFF, the high surrogate is Math.floor((CP - 0x10000) / 0x400) + 0xD800 and the low surrogate is ((CP - 0x10000) % 0x400) + 0xDC00 — the encode mode applies this formula to produce valid \uD800\uDC00 pairs.
- Decode regex pipeline: Three patterns run sequentially — \u{XXXXX} (ES6 curly braces) → \uXXXX (four-digit hex) → U+XXXX (standard notation) — with fromCodePoint() handling the curly-brace and U+ paths and fromCharCode() handling the traditional four-digit path.
- UTF-8 byte structure: BMP characters use 1–3 UTF-8 bytes (ASCII = 1 byte, Latin supplement = 2 bytes, CJK = 3 bytes); supplementary characters use 4 bytes following the pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx — the tool's byte counter uses new Blob([str]).size for accurate measurement.
- Code point vs code unit: A supplementary character is one code point but two UTF-16 code units, so the tool reports both charCount (UTF-16 code units) and codepointCount (code points) to surface this discrepancy. Neither number counts visible characters: é can be U+00E9 or U+0065 + U+0301 (combining acute), and the decomposed form counts as two in both.
- Unicode planes overview: Plane 0 (BMP) = modern scripts, Plane 1 (SMP) = historic + emoji + math, Plane 2 (SIP) = rare CJK, Plane 14 (SSP) = tags + variation selectors, Planes 15–16 = private use — the encode mode correctly handles all planes via codePointAt().
- \u vs U+ notation: \uXXXX is the escape sequence used by JavaScript, Java, and C (exactly four hex digits, so a single escape covers only the BMP; JavaScript adds \u{...} for the full range); U+XXXX is the Unicode Consortium's canonical notation (four to six hex digits, up to U+10FFFF). The tool's format toggle switches between these representations.
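The three-pass decode pipeline from the list above might look roughly like this. It is a sketch built from the regexes quoted in the text; the U+ pattern, the `fromHex` helper, and the choice to leave out-of-range values untouched are assumptions of this example rather than documented tool behavior.

```javascript
// Convert a hex string to a character; leave the original text alone if the
// value is beyond U+10FFFF (String.fromCodePoint would throw a RangeError).
function fromHex(hex, original) {
  const cp = parseInt(hex, 16);
  return cp <= 0x10ffff ? String.fromCodePoint(cp) : original;
}

function decodeText(input) {
  return input
    // 1. \u{XXXXX}: curly-brace escapes, full Unicode range
    .replace(/\\u\{([0-9a-fA-F]+)\}/g, (match, hex) => fromHex(hex, match))
    // 2. \uXXXX: one UTF-16 code unit each; adjacent surrogates recombine
    .replace(/\\u([0-9a-fA-F]{4})/g, (_, hex) =>
      String.fromCharCode(parseInt(hex, 16))
    )
    // 3. U+XXXX: standard notation, four to six hex digits (assumed pattern)
    .replace(/U\+([0-9a-fA-F]{4,6})/g, (match, hex) => fromHex(hex, match));
}

console.log(decodeText("\\u4e2d\\u6587\\u6d4b\\u8bd5")); // 中文测试
console.log(decodeText("\\uD83D\\uDE00"));               // 😀
console.log(decodeText("\\u{1F600} U+4E2D"));            // 😀 中
```

Running the curly-brace pass first matters: the four-digit pattern requires a hex digit immediately after \u, so it never matches the brace form, but ordering the passes this way keeps each pattern simple and independent.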
Examples
Chinese to Unicode escape
Input: 你好世界 (4 CJK characters, 12 UTF-8 bytes)
Output: \u4f60\u597d\u4e16\u754c
Note: BMP code points only; useful in JSON strings, JavaScript literals, and log files
Emoji to surrogate pair escape
Input: 😀🎉 (2 emoji, each above U+FFFF)
Output: \uD83D\uDE00\uD83C\uDF89
Note: non-BMP characters encode as a UTF-16 surrogate pair; the pair form works in every JS engine, whereas the \u{1F600} form and String.fromCodePoint require ES2015 or later
Decoding Unicode escapes
Input: \u4e2d\u6587\u6d4b\u8bd5
Output: 中文测试
Note: paste the escaped string and the tool reverses the encoding; verify the output bytes match the source when debugging CJK payload issues
FAQ
What does this tool show for each character?
Code point (decimal and hex), block name (e.g. Basic Latin, CJK Unified Ideographs), category (Letter, Number, Symbol, Punctuation, etc.), Unicode name, plus UTF-8 / UTF-16 / UTF-32 byte representations. Useful for debugging encoding bugs and for picking the right character.
What's the difference between UTF-8, UTF-16, and UTF-32?
All three encode the same Unicode characters. UTF-8 uses 1–4 bytes per code point, is byte-compatible with ASCII, and is the dominant encoding on the web. UTF-16 uses 2 or 4 bytes per code point (used internally by JavaScript and Windows). UTF-32 always uses 4 bytes (rarely seen on the wire, common in memory).
Why does '𝓗' show as two UTF-16 code units?
Code points above U+FFFF (outside the Basic Multilingual Plane) are encoded as a 'surrogate pair' in UTF-16: two 16-bit code units. JavaScript's string.length counts these as 2; Array.from(str) treats them as one element. The page shows both views so you can debug surprises in length counting.
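A short sketch of the two views, plus one practical consequence: truncating by code units can cut a character in half, while truncating by code points cannot. `lengths` and `truncateCodePoints` are hypothetical names used only for this illustration.

```javascript
// Report both ways of counting a string.
function lengths(str) {
  return {
    codeUnits: str.length,              // UTF-16 code units
    codePoints: Array.from(str).length, // code points
  };
}

// Keep at most `max` code points, never splitting a surrogate pair.
function truncateCodePoints(str, max) {
  return Array.from(str).slice(0, max).join("");
}

console.log(lengths("𝓗"));                   // { codeUnits: 2, codePoints: 1 }
console.log(truncateCodePoints("a𝓗b", 2));   // a𝓗
console.log("a𝓗b".slice(0, 2).length);       // 2, but the second unit is half of 𝓗
```

Note that this still counts code points, not visible characters, so a combining sequence or ZWJ emoji sequence can still be split by `truncateCodePoints`.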
What is a normalisation form (NFC/NFD/NFKC/NFKD)?
Unicode allows multiple representations of the same visible text: é can be one code point (U+00E9) or e + combining acute accent (U+0065 U+0301). NFC composes them; NFD decomposes. NFKC/NFKD additionally replace compatibility characters with their plain equivalents (½ becomes 1⁄2, written with U+2044 FRACTION SLASH). Normalise both sides to the same form before string comparison or hashing.
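In JavaScript the four forms are available through the built-in String.prototype.normalize(). The comparison helper below, `equalNormalized`, is a hypothetical name for this sketch; defaulting to NFC is an assumption, and either form works as long as both sides use the same one.

```javascript
const composed = "\u00E9";   // é as a single code point
const decomposed = "e\u0301"; // e + combining acute accent

// Compare two strings after bringing both to the same normalization form.
function equalNormalized(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

console.log(composed === decomposed);                 // false: different code points
console.log(equalNormalized(composed, decomposed));   // true
console.log(decomposed.normalize("NFC").length);      // 1
console.log(composed.normalize("NFD").length);        // 2
console.log("\u00BD".normalize("NFKC") === "1\u20442"); // true: ½ becomes 1⁄2
```

The compatibility forms (NFKC/NFKD) discard distinctions such as the single-character fraction, so they suit search and matching better than storage of user-visible text.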
Why do emoji sometimes show as boxes?
Your browser's font doesn't have that glyph. Modern emoji use ZWJ sequences (e.g. 👨👩👧 = man + ZWJ + woman + ZWJ + girl) which need specific fonts to render as a single image; older fonts show three separate emoji or boxes.
How do I look up a character by name?
Type the name (or part of it) in the search box. Names follow the official Unicode names list (LATIN SMALL LETTER A, GREEK CAPITAL LETTER OMEGA, MUSICAL SYMBOL G CLEF). Common emoji also have a 'CLDR short name' that the page recognises.
Is my input uploaded?
No. Lookups use an in-browser Unicode database. Nothing is uploaded.