HTML Entity Encoder
Convert text to and from HTML entities online. Supports both encoding and decoding, which helps prevent XSS attacks.
What is HTML Entity Encoding?
HTML entity encoding converts special characters into HTML character references. Certain characters have special meaning in HTML (such as <, >, and &), so to display them literally on a page you must write them as entities. Entities come in two forms: named (like &lt;) and numeric (like &#60;). Named entities are more readable, while numeric entities can represent any Unicode character.
HTML encoding matters whenever text must be inserted into HTML without being interpreted as markup. Characters such as <, >, &, double quotes, and apostrophes can otherwise change tags, attributes, or entities. The tool helps with examples, templates, CMS content, and debugging XSS-related issues. Context still matters: HTML body text, attributes, URLs, JavaScript, and CSS all require different escaping rules, so encoded output should be used in the correct place.
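As a rough illustration of the idea above, here is a minimal TypeScript sketch of a five-character encoder. The names `HTML_ESCAPES` and `escapeHtml` are made up for this example, and the code is not taken from the tool itself; it only assumes the apostrophe is emitted in numeric form.

```typescript
// Lookup table for the five characters with syntactic meaning in HTML.
// (Illustrative names; this is not the tool's actual source.)
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Replace every special character in one pass over the input.
function escapeHtml(text: string): string {
  return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

console.log(escapeHtml('<a href="x">Tom & Jerry</a>'));
// → &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```

Because each character is looked up once and the output is never rescanned, this form cannot double-encode its own output within a single call.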
How to Use
- Enter or paste text to convert in the left input box
- Click the corresponding conversion button to select encoding or decoding method
- The result will automatically display on the right
- Click the "Copy" button to copy the result to clipboard
Conversion Methods
Keyboard Shortcuts
- Ctrl + E: HTML Entity Encode
- Ctrl + D: HTML Entity Decode
Encoding Tips
- Encode user-visible text before inserting it into HTML source, especially when the text may contain angle brackets, quotes, or ampersands.
- HTML entity encoding helps prevent markup from being interpreted, but it is only one part of XSS defense and does not replace contextual output escaping.
Use Cases
Technical Principle
HTML uses two kinds of character references, both defined by the WHATWG HTML Living Standard. Named character references begin with & and end with ;, and are drawn from the entities.json table maintained by WHATWG (about 2,231 names as of the current spec, including legacy aliases without a trailing semicolon, such as &amp without the ;). Numeric character references give a Unicode code point in decimal (&#60;) or hexadecimal (&#x3C;) form and can encode any character up to U+10FFFF except U+0000 and the surrogate range U+D800-U+DFFF.
The five characters that must be escaped to preserve HTML syntactic safety are & (&amp;), < (&lt;), > (&gt;), " (&quot;), and ' (&apos;). Note that &apos; is part of XML and HTML5 but is not valid in HTML 4.01, so OWASP recommends the numeric form (&#39; or &#x27;) for attribute values that must round-trip through legacy parsers.
Encoding in this tool is a single-pass replacement. Order matters whenever the replacement is written as a chain of substitutions: & must be escaped first, otherwise the & inserted for &lt; and &gt; would itself be re-escaped, producing &amp;lt;. Decoding leverages the browser's HTML parser by assigning the input to a detached element's innerHTML and reading back textContent. This dispatches to the tokenizer state machine in the HTML spec (sections 13.2.5.72 Character reference state through 13.2.5.80), which resolves named, decimal, and hex forms, including malformed inputs such as missing semicolons. Numeric encoding in the full-encode mode walks the string code point by code point using String.prototype.codePointAt, so astral characters that occupy a UTF-16 surrogate pair are handled correctly (for example, the emoji U+1F600 becomes &#128512;, not the two-surrogate fallback &#55357;&#56832;).
XSS prevention requires context-aware escaping, not just HTML entity encoding. The OWASP Cross-Site Scripting Prevention Cheat Sheet defines five distinct contexts: HTML body, HTML attribute (quoted vs unquoted), JavaScript data (inside <script>), CSS, and URL. HTML entity escaping covers only the first two, and attribute values only when they are quoted. JavaScript contexts should use \xHH or \uHHHH escapes, typically starting from JSON.stringify output; URL contexts need encodeURIComponent (RFC 3986 percent-encoding); and inline event handlers compound the rules because their values pass through both the HTML and JavaScript parsers. A Content-Security-Policy header with script-src 'self' and without 'unsafe-inline' is the modern defense-in-depth layer that limits the damage of escaping mistakes. DOM sinks such as innerHTML, document.write, and setAttribute('on*', ...) should be replaced with textContent or framework-managed bindings (React's JSX, Vue's mustache) that escape by default.
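The effect of escape order can be seen with two chained-replace variants. This is only a sketch to make the failure mode concrete; `escapeAmpLast` and `escapeAmpFirst` are hypothetical names, not functions from the tool.

```typescript
// Wrong order: "&" is handled last, so it re-escapes the "&" that the
// earlier steps just inserted.
function escapeAmpLast(text: string): string {
  return text
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/&/g, "&amp;");
}

// Right order: "&" first, then the remaining characters.
function escapeAmpFirst(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

console.log(escapeAmpLast("a < b"));  // → a &amp;lt; b
console.log(escapeAmpFirst("a < b")); // → a &lt; b
```

A browser would render the first result as the literal text "a &lt; b" instead of "a < b", which is the double-encoding symptom described above.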
- Named references: about 2,231 entries in WHATWG entities.json; the five must-escape names are &amp; &lt; &gt; &quot; &apos; (&apos; is HTML5/XML-only, not HTML 4.01)
- Numeric references: decimal &#DDDDD; and hexadecimal &#xHHHH; forms cover code points up to U+10FFFF; the surrogates U+D800-U+DFFF and U+0000 (NULL) are invalid per the HTML spec
- Escape order: & must be replaced first, otherwise the inserted & prefix of subsequent escapes is double-encoded; encoding is O(n) with a 5-entry lookup table
- Decoding via the browser's HTML parser: assigning to a detached element's innerHTML and reading back textContent invokes the HTML spec tokenizer (Character reference state, sections 13.2.5.72-80), which handles legacy entities without trailing semicolons
- Astral character handling: use String.prototype.codePointAt and for...of iteration so emoji and CJK extension B characters (U+10000+) produce a single &#NNNNN; rather than two surrogate references
- Context-aware escaping (OWASP XSS Prevention Cheat Sheet rule #0): HTML body, HTML attribute, JavaScript, CSS, and URL each need different escaping; HTML entities alone do not stop XSS in JS or URL sinks
- Defense in depth: a Content-Security-Policy with script-src 'self', DOMPurify allowlist sanitization for rich-text input, and preferring textContent/innerText over innerHTML in vanilla DOM code
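The astral-character point can be sketched as follows. This is an illustrative TypeScript function, not the tool's code: the name `encodeNonAscii` is hypothetical, it emits decimal references only, it leaves the five special ASCII characters alone, and replacing unpaired surrogates with U+FFFD is an assumption made for the example.

```typescript
// Encode every non-ASCII character as a decimal numeric reference.
// Walks by code point so a surrogate pair yields ONE reference.
function encodeNonAscii(text: string): string {
  let out = "";
  let i = 0;
  while (i < text.length) {
    const cp = text.codePointAt(i);
    if (cp === undefined) break; // cannot happen while i < text.length
    if (cp >= 0xd800 && cp <= 0xdfff) {
      // Unpaired surrogate: not a valid character reference, so this
      // sketch substitutes U+FFFD (an assumption, not the tool's rule).
      out += "&#65533;";
    } else if (cp > 0x7f) {
      out += `&#${cp};`;
    } else {
      out += text.charAt(i);
    }
    i += cp > 0xffff ? 2 : 1; // astral code points occupy two UTF-16 units
  }
  return out;
}

console.log(encodeNonAscii("中文"));         // → &#20013;&#25991;
console.log(encodeNonAscii("hi \u{1F600}")); // → hi &#128512;
```

Iterating with for...of gives the same code-point stepping in an ES2015+ environment; the explicit index loop is used here only so the stepping is visible.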
Examples
Basic element encoding
Input: <script>alert(1)</script>
Output: &lt;script&gt;alert(1)&lt;/script&gt;
Use: prevent the browser from interpreting the text as a real tag when rendering user-supplied content
Attribute value encoding
Input: <div title="Hello & world">
Output: &lt;div title=&quot;Hello &amp; world&quot;&gt;
Note: quotes and the ampersand inside the attribute are entity-encoded so the value cannot break out of the quotes
URL display in page
Input: search?q=hello&lang=en
Output: search?q=hello&amp;lang=en
Use: the page should encode the & before inserting the URL into HTML, otherwise the parser may treat the rest as a malformed entity
Non-ASCII characters (full encode)
Input: CJK characters like 中文
Output: numeric character references, &#20013;&#25991; (decimal) or &#x4E2D;&#x6587; (hexadecimal); CJK characters have no named entities
Use: safe embedding of arbitrary Unicode into legacy HTML; modern pages usually rely on UTF-8 instead
FAQ
Which characters does HTML encoding convert?
The five SGML reserved characters: & → &amp;, < → &lt;, > → &gt;, " → &quot;, ' → &#39; (or &apos;). Optionally, non-ASCII characters can be converted to numeric entities (&#xNN;) for legacy systems that don't handle UTF-8.
When do I need HTML encoding?
Any time user-supplied text is inserted into HTML content. Failing to encode is a common root cause of XSS vulnerabilities. Encode user content for the context it lands in: HTML body, attribute values, JavaScript, CSS, and URLs each have different rules.
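To make the per-context difference concrete, here is a hedged TypeScript sketch with one helper per context. The function names are hypothetical, the HTML helper uses numeric references purely for brevity, and the CSS context is left out; a real application would normally rely on its framework's escaping rather than hand-rolled helpers.

```typescript
// HTML body or quoted attribute: write the five special characters as
// decimal numeric references (e.g. "<" becomes "&#60;").
function escapeForHtml(text: string): string {
  return text.replace(/[&<>"']/g, (ch) => `&#${ch.charCodeAt(0)};`);
}

// URL query component: percent-encode with the built-in function.
function escapeForUrlComponent(text: string): string {
  return encodeURIComponent(text);
}

// JavaScript data inside <script>: serialize as a JSON string literal,
// then hide "<" so the literal cannot close the script element early.
function escapeForScript(text: string): string {
  return JSON.stringify(text).replace(/</g, "\\u003c");
}

const untrusted = `</script><img src=x onerror="alert(1)">`;
console.log(escapeForHtml(untrusted));
console.log(escapeForUrlComponent(untrusted));
console.log(escapeForScript(untrusted));
```

The same input produces three different outputs, and none of them is safe in a context other than its own.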
What's the difference between &apos; and &#39;?
Both produce a single quote. &apos; was added in HTML5 but is not valid in HTML4 or in older email clients, so if your output is read by old systems, use &#39;. The page emits &#39; by default for maximum compatibility.
Why does my output still contain &amp;?
If the input already contains an HTML entity like &amp;, encoding it produces &amp;amp; - which is correct because the input ampersand was a literal character, not an entity. Decode first if your source is already entity-encoded.
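The round trip can be sketched in TypeScript. `encodeEntities` and `decodeBasicEntities` are hypothetical names, and the decoder is a deliberately simplified stand-in that knows only five named references plus numeric forms; it does not reproduce the browser parser's handling of the full entity table or of missing semicolons.

```typescript
const ENCODE_TABLE: Record<string, string> = {
  "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;",
};
const DECODE_TABLE: Record<string, string> = {
  amp: "&", lt: "<", gt: ">", quot: '"', apos: "'",
};

function encodeEntities(text: string): string {
  return text.replace(/[&<>"']/g, (ch) => ENCODE_TABLE[ch] ?? ch);
}

// Single-pass decoder: each reference is resolved once, so "&amp;lt;"
// decodes to "&lt;" and not all the way to "<".
function decodeBasicEntities(text: string): string {
  return text.replace(
    /&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|([a-z]+));/g,
    (match: string, dec?: string, hex?: string, name?: string): string => {
      if (name !== undefined) return DECODE_TABLE[name] ?? match; // unknown names stay as-is
      const cp = dec !== undefined ? parseInt(dec, 10) : parseInt(hex ?? "", 16);
      const invalid = cp === 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff);
      return invalid ? "\uFFFD" : String.fromCodePoint(cp);
    },
  );
}

const once = encodeEntities("Tom & Jerry"); // Tom &amp; Jerry
const twice = encodeEntities(once);         // Tom &amp;amp; Jerry
console.log(decodeBasicEntities(twice) === once);          // → true
console.log(decodeBasicEntities(once) === "Tom & Jerry");  // → true
```

Each encode call adds exactly one layer and each decode call removes exactly one, which is why already-encoded input should be decoded before it is encoded again.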
Are emojis converted?
Emojis are valid Unicode and modern HTML handles them as ordinary characters - no encoding needed unless your target system insists on ASCII-only. Toggle 'numeric entity for non-ASCII' to convert them to &#xNNNN; form.
Is encoding the same as URL encoding?
No. URL encoding (percent-encoding) replaces unsafe characters with %NN sequences for use in URLs. HTML encoding replaces them with named or numeric entities for use in HTML. Use the right tool for the right context - mixing them creates double-encoding bugs.
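When a URL is placed inside HTML, both encodings apply, each at its own layer. The sketch below is an assumption-laden illustration: `escapeAttr` and `buildSearchLink` are hypothetical helpers, and the `search?q=...&lang=en` shape is borrowed from the example earlier on this page.

```typescript
// HTML layer: escape the five special characters for a quoted attribute.
function escapeAttr(text: string): string {
  const table: Record<string, string> = {
    "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;",
  };
  return text.replace(/[&<>"']/g, (ch) => table[ch] ?? ch);
}

// Build a link whose query value comes from untrusted text.
function buildSearchLink(query: string): string {
  // URL layer first: percent-encode the value so it stays one parameter.
  const url = `search?q=${encodeURIComponent(query)}&lang=en`;
  // HTML layer second: entity-encode the finished URL for the attribute.
  return `<a href="${escapeAttr(url)}">results</a>`;
}

console.log(buildSearchLink("a&b <c>"));
// → <a href="search?q=a%26b%20%3Cc%3E&amp;lang=en">results</a>
```

The & inside the query value becomes %26 (URL layer), while the & that separates parameters becomes &amp; (HTML layer); swapping or repeating either step is what produces double-encoding bugs.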
Is the conversion done locally?
Yes. Encoding and decoding happen in your browser. Pasted text is not uploaded.