Character Frequency Analyzer
Count the frequency of each character in your text
Options
Sort
What is Character Frequency Analysis?
Character frequency analysis counts how often each character appears in a piece of text and what share it represents. It can cover letters, digits, punctuation, whitespace, CJK characters, symbols, and unusual control characters depending on the input. Typical work includes cryptography exercises, simple substitution analysis, text feature extraction, data cleaning, encoding checks, compression research, and quality control on imported content. Unusual frequencies can reveal hidden formatting characters, mojibake, repeated separators, unexpected languages, or copy-paste artifacts. The tool does not explain every cause by itself, but it gives a measurable view of the actual character distribution so the next inspection step is clearer.
How to Use
- Enter or paste your text in the input area
- The system automatically counts each character's occurrences and frequency
- Use sort options to arrange results by character or count
- Adjust options (case sensitivity, spaces, newlines)
Counting Scope
- Decide whether spaces, punctuation, case, and line breaks should count before comparing texts.
- For multilingual text, remember that emoji, composed characters, and CJK characters may not behave like single Latin letters.
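The scope decisions above can be applied as a normalization step before counting. The following is a minimal sketch; the function name normalizeForCounting and the option names ignoreCase and ignoreWhitespace are hypothetical and are not taken from the page.

```javascript
// Hypothetical helper: apply counting-scope options before any counting happens.
// The option names are illustrative assumptions, not the page's real settings.
function normalizeForCounting(text, { ignoreCase = false, ignoreWhitespace = false } = {}) {
  let result = text;
  if (ignoreCase) {
    // toLowerCase() folds cased scripts such as Latin; CJK ideographs are unchanged.
    result = result.toLowerCase();
  }
  if (ignoreWhitespace) {
    // \s matches spaces, tabs, newlines, and other Unicode whitespace.
    result = result.replace(/\s/g, "");
  }
  return result;
}

console.log(normalizeForCounting("Hello World\n", { ignoreCase: true, ignoreWhitespace: true }));
// prints "helloworld"
```

Applying the same options to both texts before comparing them keeps the resulting percentages comparable.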
Use Cases
Technical Principle
Character frequency analysis counts the occurrences of each character in a text. The implementation uses a hash table (typically a Map in JavaScript): 1) traverse the text character by character, 2) increment that character's count in the map, 3) sort the entries by count in descending order, 4) render the histogram (the top 10-20 characters, with the rest in a tail bucket). Counting is O(N), where N is the number of characters in the text. The dominant cost is the map lookup and increment, which is cheap in V8 and SpiderMonkey for the common case.
What counts as a character is a useful subtlety. In JavaScript, .length returns the number of UTF-16 code units, not Unicode code points. A character outside the Basic Multilingual Plane (emoji, rare CJK ideographs) is encoded as a surrogate pair (two code units), so a naive per-code-unit count over-counts by treating each surrogate half as a separate character. The page uses Array.from(text) to iterate code points or Intl.Segmenter to iterate grapheme clusters, which is the right definition of a character for most use cases.
A useful application: frequency analysis is the classical technique for breaking monoalphabetic substitution ciphers (Caesar, simple substitution). A polyalphabetic cipher such as Vigenère flattens the overall distribution, so frequency analysis applies to it only after the key length is known and the ciphertext is split into one group per key letter. The English letter frequency distribution is well known (E, T, A, O, I, N, S, H, R are the most common, in approximately that order), and a monoalphabetic substitution preserves the frequency distribution, so with enough ciphertext the most common letter likely maps to E, the second to T, and so on. The page is a teaching tool, not a cryptanalysis tool, but the technique is the same.
A caveat for CJK languages: the character frequency distribution depends on the corpus (modern novel, classical poetry, technical text) and the level of analysis (character, word, bigram, trigram). A frequency analysis of a modern Chinese novel gives one distribution; a frequency analysis of the Analects gives a different one. The page is corpus-agnostic, so the user can run it on any text.
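The counting steps described above can be sketched as follows. This is a minimal illustration, not the page's actual source; the function name countCharacters and the row shape are assumptions made for the example.

```javascript
// Count code points (not UTF-16 code units) and return rows sorted by count, descending.
// countCharacters and the { char, count, percent } row shape are illustrative assumptions.
function countCharacters(text) {
  const counts = new Map();
  let total = 0;
  // for...of iterates a string by code point, so a surrogate pair stays together.
  for (const ch of text) {
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
    total += 1;
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // stable sort: ties keep first-seen order
    .map(([char, count]) => ({ char, count, percent: (count / total) * 100 }));
}

const rows = countCharacters("hello world");
console.log(rows[0].char, rows[0].count, rows[0].percent.toFixed(1)); // prints "l 3 27.3"
```

A Map keeps insertion order, so characters with equal counts appear in the order they were first seen.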
- Hash table (Map in JS): traverse text, increment count, sort by count desc, render histogram; O(N) in text length with very fast per-character cost.
- Surrogate pairs: a character outside the BMP (emoji, rare CJK) is two UTF-16 code units; naive per-code-unit count over-counts; use Array.from or Intl.Segmenter.
- Substitution cipher cryptanalysis: the most common English letters are E, T, A, O, I, N, S, H, R (in approximately that order); a monoalphabetic substitution cipher preserves the distribution.
- CJK frequency: depends on the corpus (modern novel, classical poetry, technical text) and the level of analysis (character, word, bigram); the page is corpus-agnostic.
- Top-N and tail: the page renders top 10-20 characters by default, with a tail bucket for the rest; useful for spotting patterns in dense text.
- Case sensitivity: the page exposes a case-sensitive/insensitive toggle; English text is usually analyzed case-insensitively (E and e are the same letter in different forms), while CJK ideographs have no case, so the toggle does not affect them.
- Performance: O(N) for counting, O(K log K) for sorting K distinct characters; the page handles 1M-character texts in well under a second.
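The top-N view with a tail bucket mentioned in the list above could look like the sketch below. The function name topNWithTail, the label "other", and the default of 10 are assumptions for illustration; the page's own naming is not given.

```javascript
// Split already-sorted [char, count] rows into the top N plus one aggregated tail bucket.
// The "other" label and the default N of 10 are illustrative assumptions.
function topNWithTail(sortedRows, n = 10) {
  const top = sortedRows.slice(0, n);
  const tailCount = sortedRows.slice(n).reduce((sum, [, count]) => sum + count, 0);
  // Only add the tail bucket when something was actually cut off.
  return tailCount > 0 ? [...top, ["other", tailCount]] : top;
}

const sortedRows = [["l", 3], ["o", 2], ["h", 1], ["e", 1], [" ", 1]];
const view = topNWithTail(sortedRows, 2);
console.log(JSON.stringify(view)); // prints [["l",3],["o",2],["other",3]]
```

Keeping the tail as a single row preserves the total count, so the percentages still add up across the displayed rows.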
Examples
English Text Analysis
Input "hello world" → Result: 'l' appears 3 times (27.3%), 'o' appears 2 times (18.2%)
Chinese Text Analysis
Input "我爱中国我爱北京" → Result: '我' and '爱' each appear 2 times (25%)
Frequency Distribution Comparison
English text: 'e' highest at ~12.7%; Chinese: '的' highest at ~4%
FAQ
What does the analyzer count?
Each unique character and how often it appears, sorted by frequency. Useful for cipher analysis (English text has predictable letter frequencies starting with E, T, A, O), text profiling, and finding unexpected characters in a copy-pasted document.
Is whitespace counted?
Yes by default - spaces, tabs, and newlines each get their own row. Toggle 'ignore whitespace' if you only want visible characters. The space is typically the most common character in English prose (often around 15-20% of all characters).
Is the count case-sensitive?
By default, yes - 'A' and 'a' are separate. Toggle 'ignore case' to fold them together, which is common for natural-language analysis. For code or hash inspection, case sensitivity matters.
Does it work for Chinese, Japanese, Korean?
Yes. Each CJK character is counted individually. The frequency table for a Chinese paragraph naturally shows hundreds of distinct characters, since Chinese is written with thousands of characters rather than a small alphabet. The tool handles Unicode properly via grapheme-cluster counting.
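To see why the unit of counting matters, the sketch below compares three ways to measure the same string using standard JavaScript features. The function name countUnits is hypothetical, and Intl.Segmenter requires a reasonably recent browser or Node.js version.

```javascript
// Compare three definitions of "character" for the same string.
// countUnits is a hypothetical helper written for this illustration.
function countUnits(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: text.length,                         // UTF-16 code units
    codePoints: Array.from(text).length,            // Unicode code points
    graphemes: [...segmenter.segment(text)].length, // user-perceived characters
  };
}

console.log(countUnits("我爱😀"));   // { codeUnits: 4, codePoints: 3, graphemes: 3 }
console.log(countUnits("e\u0301")); // { codeUnits: 2, codePoints: 2, graphemes: 1 }
```

The emoji is one code point stored as two code units, while "e" followed by a combining accent is two code points that display as a single character.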
Can I see the frequency in percent?
Yes, the page typically shows count and percentage of total. Useful for comparing letter distributions to known reference distributions when breaking simple substitution ciphers.
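As a rough illustration of comparing against a reference distribution, the sketch below ranks the letters of a ciphertext and pairs each rank with the approximate English order E, T, A, O, I, N, S, H, R. The names ENGLISH_ORDER and guessSubstitution are hypothetical, and the result is only a first guess that is often wrong beyond the top few letters, especially for short texts.

```javascript
// Approximate reference order of the most common English letters (corpus-dependent).
const ENGLISH_ORDER = ["E", "T", "A", "O", "I", "N", "S", "H", "R"];

// Hypothetical helper: rank the letters A-Z of a ciphertext by frequency and pair
// each rank with the reference letter at the same rank. This is a naive first guess.
function guessSubstitution(ciphertext) {
  const counts = new Map();
  for (const ch of ciphertext.toUpperCase()) {
    if (ch >= "A" && ch <= "Z") counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  const ranked = [...counts.entries()].sort((a, b) => b[1] - a[1]).map(([ch]) => ch);
  const guess = new Map();
  ranked.slice(0, ENGLISH_ORDER.length).forEach((cipherLetter, i) => {
    guess.set(cipherLetter, ENGLISH_ORDER[i]);
  });
  return guess;
}

// "SEE THE TREE" shifted by 3 (Caesar) becomes "VHH WKH WUHH".
const guess = guessSubstitution("VHH WKH WUHH");
console.log(guess.get("H"), guess.get("W")); // prints "E T"
```

In this example the two most frequent ciphertext letters map correctly, but the remaining guesses do not, which is why the ranking is a starting point to refine by hand.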
Why is the percentage not adding up to 100?
Rounding. Each cell is rounded to a fixed number of decimals, so the sum of the rounded percentages can drift from 100% by a few tenths. Sum the raw counts for an exact total.
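The drift can be reproduced with the "hello world" counts. The sketch below assumes one decimal place per cell and tracks values in integer tenths of a percent; the function name roundedShareTenths is hypothetical.

```javascript
// Round each share to one decimal place, tracked as integer tenths of a percent
// so that summing does not add floating-point noise.
// roundedShareTenths is a hypothetical helper for this illustration.
function roundedShareTenths(counts) {
  const total = counts.reduce((a, b) => a + b, 0);
  return counts.map((c) => Math.round((c / total) * 1000));
}

// Character counts for "hello world": l=3, o=2, and six characters appearing once.
const tenths = roundedShareTenths([3, 2, 1, 1, 1, 1, 1, 1]);
const sumTenths = tenths.reduce((a, b) => a + b, 0);
console.log(sumTenths / 10); // prints 100.1
```

Each of the six single-occurrence characters rounds up from 9.09% to 9.1%, which is enough to push the displayed total to 100.1%.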
Is my text uploaded?
No. Frequency analysis runs in your browser. Pasted text is not transmitted.