ToolAct

Unicode Converter

Convert between text and Unicode encoding with multiple output formats

What is Unicode?

Unicode is an industry standard in computing that encodes most of the world's writing systems. Each Unicode character has a unique numeric identifier called a Code Point, typically represented in hexadecimal with a U+ prefix, such as U+4E2D for the Chinese character '中'. The Unicode Converter tool can convert text to various Unicode representation formats and also restore Unicode encodings back to text. Unicode describes characters by code points, while encodings such as UTF-8 define how those code points are stored as bytes. This matters when analyzing text with emoji, accents, CJK characters, symbols, control characters, or mixed scripts. One visible character may consist of multiple code points, especially with combining marks, variation selectors, or emoji sequences. The tool helps debug mojibake, normalization, escape sequences, invisible characters, and mismatches between character count and byte length.
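The difference between a visible character and its code points can be sketched in a few lines. This is a minimal illustration, not the tool's actual source; the helper name toCodePoints is hypothetical.

```javascript
// Hypothetical helper (not taken from the tool): list each code point in U+XXXX form.
function toCodePoints(text) {
  // Array.from walks a string by code point, not by UTF-16 code unit.
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

toCodePoints("\u4E2D");    // → ["U+4E2D"]            (中)
toCodePoints("e\u0301");   // → ["U+0065", "U+0301"]  (one visible é, two code points)
toCodePoints("\u{1F600}"); // → ["U+1F600"]           (😀, above U+FFFF)
```

The second call shows why a string that renders as a single character can report a code point count of two.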

How to Use

Steps

  1. Enter or paste the text to convert in the input box
  2. Select conversion direction: Text to Unicode or Unicode to Text
  3. Choose output format: \uXXXX or U+XXXX (available when encoding)
  4. Results appear automatically; copy them with one click or swap input and output

Encoding Notes

  • Unicode escape formats are useful for code and debugging, but they can reduce readability for normal copywriting.
  • When decoding, verify surrogate pairs and emoji output; splitting a pair can produce broken characters.
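To check decoded output for split pairs, a scan like the following can help. It is a sketch under the assumption that a "broken character" means a lone surrogate code unit; findLoneSurrogates is a hypothetical name, not part of the tool.

```javascript
// Sketch: return the indexes of surrogate code units that are not part of a valid pair.
function findLoneSurrogates(text) {
  const problems = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh) {
      const next = text.charCodeAt(i + 1); // NaN when i + 1 is past the end
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // well-formed pair: skip the low half
      } else {
        problems.push(i); // high surrogate with no low surrogate after it
      }
    } else if (isLow) {
      problems.push(i); // low surrogate with no high surrogate before it
    }
  }
  return problems;
}

const emoji = "\uD83D\uDE00";             // 😀 as a surrogate pair
findLoneSurrogates(emoji);                // → []
findLoneSurrogates(emoji.slice(0, 1));    // → [0]
findLoneSurrogates("a" + emoji.slice(1)); // → [1]
```

Recent engines also provide String.prototype.isWellFormed() for a yes/no answer; the loop above additionally reports where the damage is.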

Use Cases

Encoding text into Unicode escape formats

Convert text into JavaScript-style \uXXXX sequences or U+XXXX code point notation. Characters above U+FFFF are represented as surrogate pairs in escape mode while the detail grid still shows their full code point. The conversion is a local String.prototype.codePointAt and fromCodePoint pass, so the character data stays local until you copy the escaped output yourself.

Decoding Unicode escapes back to readable text

Decode \uXXXX, \u{...}, and U+XXXX patterns into characters, then copy the restored text. This helps inspect logs, JSON strings, localization files, and escaped snippets from code or APIs. Decoding happens through fromCharCode and fromCodePoint inside the page, so the source string and the rebuilt characters never leave the browser tab - useful when the snippet contains internal copy or unreleased product names.

Inspecting individual code points

Every input character is listed with its visible character and U+ value, and clicking a tile copies that code point. Character count and code point count are shown separately, which matters for emoji and other multi-unit characters. Because the grid is built by iterating over the input string in memory, the entire code-point map is rendered locally without sending the text to a third-party Unicode lookup service.

Spotting mojibake and double-encoded strings

Paste strings that look like 'Ã©' where 'é' was intended (or similar garbage in place of '中文'), decode them, then re-encode with the suspected original encoding (UTF-8 vs Latin-1) to confirm whether the source was double-encoded rather than garbled at capture. The inspection is purely browser-side: TextDecoder, code-point iteration, and re-encoding all happen in the page, so production strings pulled from logs can be analyzed without uploading them.

Counting emoji and combining marks as code points

Use the code point grid to flag family emoji, flag sequences, and combining diacritics that occupy multiple slots. The separate character count and code point count clarify why a '1 char' string may consume 7 UTF-16 units. Because the counting is a local iteration over the input, even strings with sensitive content can be sized for database columns or wire-format budgets without leaving the browser.
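The mojibake check can be sketched as follows. This assumes the simplest failure mode, UTF-8 bytes that were read as Latin-1 (ISO-8859-1); strings damaged through Windows-1252 need an extra mapping step that is not shown. The function name undoLatin1Mojibake is hypothetical.

```javascript
// Sketch: undo one round of "UTF-8 bytes read as Latin-1".
// Returns null when the input cannot be explained by that hypothesis.
function undoLatin1Mojibake(garbled) {
  const bytes = new Uint8Array(garbled.length);
  for (let i = 0; i < garbled.length; i++) {
    const unit = garbled.charCodeAt(i);
    if (unit > 0xff) return null; // not representable as a single Latin-1 byte
    bytes[i] = unit;
  }
  try {
    // fatal: true makes invalid UTF-8 throw instead of silently yielding U+FFFD.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return null; // the bytes are not valid UTF-8
  }
}

undoLatin1Mojibake("\u00C3\u00A9"); // "Ã©" → "é"
undoLatin1Mojibake("\u00E9");       // → null (a lone 0xE9 byte is not valid UTF-8)

// A double-encoded string needs two passes before it reads correctly.
const once = undoLatin1Mojibake("\u00C3\u0083\u00C2\u00A9"); // → "Ã©"
undoLatin1Mojibake(once);                                    // → "é"
```

If one pass yields readable text the source was encoded once and mis-decoded; if it takes two passes, the string was double-encoded.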

Technical Principle

The Unicode Standard, which is kept synchronized with ISO/IEC 10646, assigns a unique numeric code point to every character across 17 planes (U+0000 to U+10FFFF). The Basic Multilingual Plane (BMP, Plane 0) covers U+0000–U+FFFF and contains nearly all modern writing systems, including the core CJK Unified Ideographs. Supplementary planes (1–16) hold historic scripts, rare CJK, emoji, and special-purpose characters. The tool converts between readable text and two machine-oriented representations: JavaScript-style \uXXXX escape sequences and the standard U+XXXX notation.

UTF-16 is the internal encoding of JavaScript strings. Every BMP character is stored as a single 16-bit code unit equal to its code point, while supplementary characters (U+10000 and above) are encoded as surrogate pairs: 0x10000 is subtracted from the code point to leave a 20-bit value, whose upper 10 bits form the high surrogate (0xD800 + ((cp - 0x10000) >> 10)) and whose lower 10 bits form the low surrogate (0xDC00 + ((cp - 0x10000) & 0x3FF)). The tool's encode mode detects supplementary code points via String.prototype.codePointAt() and emits two \u escapes, one per surrogate, in the \uXXXX format. In the U+XXXX format, the full code point is displayed directly.

The decode mode parses three syntaxes: \uXXXX (four hex digits, BMP only), \u{XXXXX} (ES6 curly-brace syntax supporting the full Unicode range), and U+XXXX (the standard notation with variable-length hex). The regex /\\u\{([0-9a-fA-F]+)\}/g handles curly-brace escapes and feeds them to String.fromCodePoint(), while /\\u([0-9a-fA-F]{4})/g handles traditional escapes via String.fromCharCode(). Because the decoded code units are concatenated in order, a supplementary character that was encoded as two consecutive \u escapes is reassembled into a valid surrogate pair.

UTF-8 encoding is relevant because it determines byte length: a BMP character like '中' (U+4E2D) encodes as 3 UTF-8 bytes (E4 B8 AD), while an emoji like '😀' (U+1F600) requires 4 bytes (F0 9F 98 80). The tool's character counter distinguishes between code point count and UTF-16 code unit count, a useful distinction when debugging length limits in databases, APIs, or form fields that count code units rather than characters.
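The encode direction described above can be sketched as follows. The function names, the uppercase hex output, and the space separator between U+ values are assumptions made for illustration; the tool's own output may differ in those details.

```javascript
// Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  const offset = cp - 0x10000;           // 20-bit value
  const high = 0xd800 + (offset >> 10);  // upper 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // lower 10 bits
  return [high, low];
}

// Format a number as at least four uppercase hex digits.
function hex4(n) {
  return n.toString(16).toUpperCase().padStart(4, "0");
}

// format: "escape" for \uXXXX output, "codepoint" for U+XXXX output (assumed names).
function encodeText(text, format = "escape") {
  const parts = [];
  for (const ch of text) { // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (format === "codepoint") {
      parts.push("U+" + hex4(cp));
    } else if (cp > 0xffff) {
      const [high, low] = toSurrogatePair(cp);
      parts.push("\\u" + hex4(high) + "\\u" + hex4(low));
    } else {
      parts.push("\\u" + hex4(cp));
    }
  }
  return parts.join(format === "codepoint" ? " " : "");
}

encodeText("\u4E2D");                 // → \u4E2D
encodeText("\u{1F600}");              // → \uD83D\uDE00
encodeText("\u{1F600}", "codepoint"); // → U+1F600
```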

  • Code point iteration: String.prototype.codePointAt(pos) correctly returns the full code point for supplementary characters, unlike charCodeAt() which returns only the high surrogate — the tool uses the spread operator [...str] to iterate by code point, which internally calls the string iterator protocol.
  • Surrogate pair math: For a supplementary code point CP > 0xFFFF, the high surrogate is Math.floor((CP - 0x10000) / 0x400) + 0xD800 and the low surrogate is ((CP - 0x10000) % 0x400) + 0xDC00 — the encode mode applies this formula to produce valid \uD800\uDC00 pairs.
  • Decode regex pipeline: Three patterns run sequentially — \u{XXXXX} (ES6 curly braces) → \uXXXX (four-digit hex) → U+XXXX (standard notation) — with fromCodePoint() handling the curly-brace and U+ paths and fromCharCode() handling the traditional four-digit path.
  • UTF-8 byte structure: BMP characters use 1–3 UTF-8 bytes (ASCII = 1 byte, Latin supplement = 2 bytes, CJK = 3 bytes); supplementary characters use 4 bytes following the pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx — the tool's byte counter uses new Blob([str]).size for accurate measurement.
  • Code point vs code unit: A supplementary character such as 😀 is one code point but two UTF-16 code units, so the tool reports both charCount (UTF-16 code units) and codepointCount (code points). Neither count is the number of visible characters: é can be U+00E9 alone or U+0065 + U+0301 (combining acute), and the second form counts as two in both.
  • Unicode planes overview: Plane 0 (BMP) = modern scripts, Plane 1 (SMP) = historic + emoji + math, Plane 2 (SIP) = rare CJK, Plane 14 (SSP) = tags + variation selectors, Planes 15–16 = private use — the encode mode correctly handles all planes via codePointAt().
  • \u vs U+ notation: \uXXXX is the escape sequence used in JavaScript, Java, and C (BMP only without curly braces); U+XXXX is the Unicode Consortium's standard notation, written with four to six hex digits (up to U+10FFFF). The tool's format toggle switches between these representations.
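The three-step decode pipeline from the list above might look like the sketch below. The first two regexes are the ones quoted in this section; the U+ pattern and the range guard are assumptions, since the page does not show them.

```javascript
// Replace callback: convert a hex code point, leaving out-of-range input untouched
// (String.fromCodePoint throws a RangeError above 0x10FFFF).
function safeFromCodePoint(match, hex) {
  const cp = parseInt(hex, 16);
  return cp <= 0x10ffff ? String.fromCodePoint(cp) : match;
}

function decodeEscapes(input) {
  return input
    // 1. ES6 curly-brace escapes: \u{1F600}
    .replace(/\\u\{([0-9a-fA-F]+)\}/g, safeFromCodePoint)
    // 2. Four-digit escapes: \u4E2D. Adjacent surrogate halves rejoin into one character.
    .replace(/\\u([0-9a-fA-F]{4})/g, (match, hex) =>
      String.fromCharCode(parseInt(hex, 16))
    )
    // 3. U+ notation with four to six hex digits (assumed pattern)
    .replace(/U\+([0-9a-fA-F]{4,6})/g, safeFromCodePoint);
}

decodeEscapes("\\u4e2d\\u6587"); // → "中文"
decodeEscapes("\\uD83D\\uDE00"); // → "😀"
decodeEscapes("\\u{1F600}");     // → "😀"
decodeEscapes("U+1F600");        // → "😀"
```

Because the passes run one after another, text produced by an earlier pass is scanned again by the later ones; a single regex with alternation would avoid that if it matters for the input at hand.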

Examples

Chinese to Unicode escape

Input:  你好世界 (4 CJK characters, 12 UTF-8 bytes)
Output: \u4f60\u597d\u4e16\u754c
Note:   BMP code points only; useful in JSON strings, JavaScript literals, and log files

Emoji to surrogate pair escape

Input:  😀🎉 (2 emoji, each above U+FFFF)
Output: \uD83D\uDE00\uD83C\uDF89
Note:   non-BMP characters encode as a UTF-16 surrogate pair, which every JS engine can parse; the shorter \u{1F600} form and String.fromCodePoint require ES2015 or later

Decoding Unicode escapes

Input:  \u4e2d\u6587\u6d4b\u8bd5
Output: 中文测试
Note:   paste the escaped string and the tool reverses the encoding; verify the output bytes match the source when debugging CJK payload issues

FAQ

What does this tool show for each character?

Code point (decimal and hex), block name (e.g. Basic Latin, CJK Unified Ideographs), category (Letter, Number, Symbol, Punctuation, etc.), Unicode name, plus UTF-8 / UTF-16 / UTF-32 byte representations. Useful for debugging encoding bugs and for picking the right character.

What's the difference between UTF-8, UTF-16, and UTF-32?

All three encode the same Unicode code points. UTF-8, the dominant encoding on the web, uses 1-4 bytes per code point and is byte-compatible with ASCII. UTF-16 uses 2 or 4 bytes per code point (used internally by JavaScript and Windows). UTF-32 always uses 4 bytes (rarely seen on the wire, more common in memory).
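The size difference is easy to measure. The sketch below assumes no byte order mark is written; encodedSizes is a hypothetical helper name.

```javascript
// Byte length of the same text in each encoding form (no BOM assumed).
function encodedSizes(text) {
  return {
    utf8: new TextEncoder().encode(text).length, // TextEncoder always produces UTF-8
    utf16: text.length * 2,                      // 2 bytes per UTF-16 code unit
    utf32: Array.from(text).length * 4,          // 4 bytes per code point
  };
}

encodedSizes("A");         // → { utf8: 1, utf16: 2, utf32: 4 }
encodedSizes("\u4E2D");    // → { utf8: 3, utf16: 2, utf32: 4 }  (中)
encodedSizes("\u{1F600}"); // → { utf8: 4, utf16: 4, utf32: 4 }  (😀)
```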

Why does '𝓗' show as two UTF-16 code units?

Code points above U+FFFF lie outside the Basic Multilingual Plane and are encoded in UTF-16 as a 'surrogate pair': two 16-bit code units. JavaScript's string.length counts these as 2; Array.from(str).length counts them as 1. The page shows both views so you can debug surprises in length counting.

What is a normalisation form (NFC/NFD/NFKC/NFKD)?

Unicode allows multiple representations of the same visible text - é can be one code point (U+00E9) or e + combining acute accent (U+0065 U+0301). NFC composes them; NFD decomposes. NFKC/NFKD additionally fold compatibility characters (½ → 1⁄2, where the slash is U+2044 FRACTION SLASH). Normalise both strings to the same form before comparing or hashing them.
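In JavaScript the built-in String.prototype.normalize() covers all four forms. A small sketch, where sameText is a hypothetical helper and NFC is chosen arbitrarily as the common form:

```javascript
// Compare two strings after bringing both to the same normalisation form.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

const composed = "\u00E9";    // é as one code point
const decomposed = "e\u0301"; // e + combining acute accent

composed === decomposed;            // → false
sameText(composed, decomposed);     // → true
composed.normalize("NFD").length;   // → 2
decomposed.normalize("NFC").length; // → 1
"\u00BD".normalize("NFKC");         // → "1\u20442" (½ folded to 1, fraction slash, 2)
```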

Why do emoji sometimes show as boxes?

Your browser's font doesn't have that glyph. Modern emoji use ZWJ sequences (e.g. 👨‍👩‍👧 = man + ZWJ + woman + ZWJ + girl) which need specific fonts to render as a single image; older fonts show three separate emoji or boxes.
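The parts of a ZWJ sequence can be inspected directly. This sketch assumes an environment that provides Intl.Segmenter (current browsers and recent Node.js releases); countViews is a hypothetical helper name.

```javascript
// Count one string three ways: user-perceived characters, code points, UTF-16 code units.
function countViews(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    graphemes: Array.from(segmenter.segment(text)).length,
    codePoints: Array.from(text).length,
    codeUnits: text.length,
  };
}

// man + ZWJ + woman + ZWJ + girl
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
countViews(family); // → { graphemes: 1, codePoints: 5, codeUnits: 8 }
```

A font without the combined glyph draws the three people separately, but the underlying string is the same five code points either way.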

How do I look up a character by name?

Type the name (or part of it) in the search box. Names follow the official Unicode names list (LATIN SMALL LETTER A, GREEK CAPITAL LETTER OMEGA, MUSICAL SYMBOL G CLEF). Common emoji also have a 'CLDR short name' that the page recognises.

Is my input uploaded?

No. Lookups use an in-browser Unicode database. Nothing is uploaded.