ToolAct

Character Frequency Analyzer

Count the frequency of each character in your text

What is Character Frequency Analysis?

Character frequency analysis counts how often each character appears in a piece of text and what share of the total it represents. Depending on the input, it can cover letters, digits, punctuation, whitespace, CJK characters, symbols, and unusual control characters. Typical uses include cryptography exercises, simple substitution analysis, text feature extraction, data cleaning, encoding checks, compression research, and quality control on imported content. Unusual frequencies can reveal hidden formatting characters, mojibake, repeated separators, unexpected languages, or copy-paste artifacts. The tool does not explain every cause by itself, but it gives a measurable view of the actual character distribution so the next inspection step is clearer.

How to Use

  1. Enter or paste your text in the input area
  2. The system automatically counts each character's occurrences and frequency
  3. Use sort options to arrange results by character or count
  4. Adjust options (case sensitivity, spaces, newlines)

Counting Scope

  • Decide whether spaces, punctuation, case, and line breaks should count before comparing texts.
  • For multilingual text, remember that emoji, composed characters, and CJK characters may not behave like single Latin letters.
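The scope decisions above can be applied as a preprocessing step before counting. The following is a minimal sketch, not the page's actual code: the function name `prepareText` and the option names `caseSensitive`, `ignoreSpaces`, and `ignoreNewlines` are assumptions chosen to mirror the settings listed under How to Use.

```javascript
// Hypothetical helper: option names are illustrative, not taken from the page.
function prepareText(text, options = {}) {
  const {
    caseSensitive = true,
    ignoreSpaces = false,
    ignoreNewlines = false,
  } = options;

  let result = text;
  // Fold case first so "A" and "a" land in the same bucket.
  if (!caseSensitive) result = result.toLowerCase();
  // Remove plain spaces only; tabs and non-breaking spaces stay visible.
  if (ignoreSpaces) result = result.replace(/ /g, "");
  // Remove CRLF, CR, and LF line breaks.
  if (ignoreNewlines) result = result.replace(/\r\n|\r|\n/g, "");
  return result;
}

const sample = "Aa Bb\nAa";
console.log(Array.from(prepareText(sample)).length); // 8
console.log(
  prepareText(sample, {
    caseSensitive: false,
    ignoreSpaces: true,
    ignoreNewlines: true,
  })
); // "aabbaa"
```

Running the same text through both settings and comparing the totals shows how much of the difference comes from separators and casing rather than content.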

Use Cases

Profile messy logs, OCR output, and keyword lists: Paste copied logs, OCR output, keyword lists, or encoded strings and see total characters, unique characters, counts, percentages, and visual bars sorted by frequency. Sort by count to spot dominant symbols and sort by character to scan the alphabet in order. This kind of distribution snapshot is the first step before deciding whether to scrub, classify, or discard the input.
Expose invisible whitespace and casing problems: Toggle case sensitivity, ignore spaces, and ignore newlines to see whether duplicates, parser failures, or layout issues are caused by invisible separators or letter-case differences. Comparing a run with an option on against a run with it off shows which characters account for the difference in totals. A single non-breaking space in thousands of rows still shows up as its own row with a small but nonzero percentage.
Prepare evidence for cleanup rules and validators: Sort by character or count before writing replacement rules, validation logic, language-learning material, puzzle checks, or corpus-cleaning notes that depend on exact character distribution. The frequency table gives cleanup discussions a measurable basis instead of relying on visual inspection alone. Share the table with reviewers so the rules and the data they target stay aligned.
Spot mojibake, smart quotes, and stray BOM markers: Run the analysis on suspect strings and look for unusual clusters of replacement characters, combining marks, or non-printable characters that often signal wrong encoding, smart-quote substitution, or stray BOM markers in copied text. A sudden spike in U+FFFD (the replacement character), or a single character with a strange frequency in a short sample, usually means a charset mismatch upstream.
Distinguish code points, surrogate pairs, and ZWJ emoji: Beyond raw counts, the analyzer exposes the difference between code points and grapheme clusters. A flag emoji like 🏳️‍🌈 is a sequence of several code points joined by a zero-width joiner (ZWJ), while a math symbol outside the Basic Multilingual Plane is a single code point that UTF-16 stores as a surrogate pair. Treat the totals as code-point totals: the parts of a ZWJ sequence then appear as separate rows in the table, and a surrogate pair counts once rather than twice.
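To make the three possible definitions of "character" concrete, the sketch below reports UTF-16 code units, code points, and grapheme clusters for the same string. It uses only standard JavaScript (`String.prototype.length`, `Array.from`, `Intl.Segmenter`); the function name `countUnits` is an assumption for illustration, and grapheme results depend on the Unicode version shipped with the runtime.

```javascript
// Hypothetical helper comparing three ways to measure the same string.
function countUnits(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: text.length,                                // UTF-16 code units
    codePoints: Array.from(text).length,                   // Unicode code points
    graphemes: Array.from(segmenter.segment(text)).length, // user-perceived characters
  };
}

// "e" followed by a combining acute accent: two code points, one grapheme.
console.log(countUnits("e\u0301"));

// A math symbol outside the BMP: one code point stored as a surrogate pair.
console.log(countUnits("\u{1D49C}"));

// Rainbow flag: white flag + variation selector + ZWJ + rainbow.
console.log(countUnits("\u{1F3F3}\uFE0F\u200D\u{1F308}"));
```

A table built on code points lists the ZWJ and the variation selector as their own rows, which is useful for debugging; a table built on grapheme clusters matches what a reader sees on screen.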

Technical Principle

Character frequency analysis counts the occurrences of each character in a text. The implementation uses a hash table (typically a Map in JavaScript): 1) traverse the text character by character, 2) for each character, increment its count in the map, 3) sort the entries by count in descending order, 4) render the histogram (the top 10-20 characters, with the rest in a tail bucket). Counting is O(N), where N is the number of characters in the text; each step is a single map lookup and increment, which takes amortized constant time in engines such as V8 and SpiderMonkey.

A useful subtlety: what counts as a character? In JavaScript, .length returns the number of UTF-16 code units, not Unicode code points. A character outside the Basic Multilingual Plane (emoji, rare CJK ideographs) is encoded as a surrogate pair (two code units), and a naive per-code-unit count over-counts by treating each surrogate half as a separate character. The page uses Array.from(text) to iterate code points or Intl.Segmenter to iterate grapheme clusters, either of which is a better definition of a character for most use cases.

A useful application: frequency analysis is the classical technique for breaking monoalphabetic substitution ciphers (Caesar, simple substitution). The English letter frequency distribution is well known (E, T, A, O, I, N, S, H, R are the most common, in approximately that order), and a monoalphabetic cipher only relabels the letters, so the shape of the distribution survives: the most common ciphertext letter likely maps to E, the second to T, and so on. A polyalphabetic cipher such as Vigenere flattens the overall distribution, so the same technique applies only after the ciphertext has been split by key position. The page is a teaching tool, not a cryptanalysis tool, but the technique is the same.

A caveat on CJK text: the character frequency distribution depends on the corpus (modern novel, classical poetry, technical text) and the level of analysis (character, word, bigram, trigram). A frequency analysis of a modern Chinese novel gives one distribution; a frequency analysis of the Analects gives a different one. The page is corpus-agnostic, so the user can run it on any text.
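The counting steps described in this section can be sketched in a few lines. This is an illustrative version under stated assumptions, not the page's source: the function name `countCharacters` and the row shape `{ char, count, percent }` are hypothetical, and the loop iterates by code point via `for...of`.

```javascript
// Hypothetical sketch of the Map-based counting pass.
function countCharacters(text) {
  const counts = new Map();
  let total = 0;
  // for...of walks the string by code point, so surrogate pairs stay whole.
  for (const ch of text) {
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
    total += 1;
  }
  // Turn the map into rows and sort by count, highest first.
  return Array.from(counts, ([char, count]) => ({
    char,
    count,
    percent: total === 0 ? 0 : (count / total) * 100,
  })).sort((a, b) => b.count - a.count);
}

const rows = countCharacters("hello world");
console.log(rows[0].char, rows[0].count, rows[0].percent.toFixed(1)); // l 3 27.3
console.log(rows[1].char, rows[1].count, rows[1].percent.toFixed(1)); // o 2 18.2
```

Because `Array.prototype.sort` is stable in current engines, characters with equal counts keep the order in which they first appeared in the text.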

  • Hash table (Map in JS): traverse text, increment count, sort by count desc, render histogram; O(N) in text length with very fast per-character cost.
  • Surrogate pairs: a character outside the BMP (emoji, rare CJK) is two UTF-16 code units; naive per-code-unit count over-counts; use Array.from or Intl.Segmenter.
  • Substitution cipher cryptanalysis: English letter frequency E, T, A, O, I, N, S, H, R (most common, in approximately that order); a monoalphabetic substitution cipher preserves the shape of the distribution.
  • CJK frequency: depends on the corpus (modern novel, classical poetry, technical text) and the level of analysis (character, word, bigram); the page is corpus-agnostic.
  • Top-N and tail: the page renders top 10-20 characters by default, with a tail bucket for the rest; useful for spotting patterns in dense text.
  • Case sensitivity: the page exposes a case-sensitive/insensitive toggle; English text is usually analyzed case-insensitively (E and e are the same letter in different shapes), while CJK scripts have no concept of case, so the toggle does not change their counts.
  • Performance: O(N) for counting, O(K log K) for sorting K distinct characters; the page handles 1M-character texts in well under a second.
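The top-N and tail behavior from the list above could look like the following sketch. The function name `topWithTail`, the `limit` parameter, and the `"(other)"` label are assumptions for illustration; the page may group and label the tail differently.

```javascript
// Hypothetical sketch: keep the `limit` most frequent rows and fold the rest
// into a single tail bucket.
function topWithTail(rows, limit = 10) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...rows].sort((a, b) => b.count - a.count);
  const head = sorted.slice(0, limit);
  const tailCount = sorted
    .slice(limit)
    .reduce((sum, row) => sum + row.count, 0);
  // Only add the bucket when something was actually folded into it.
  return tailCount > 0
    ? [...head, { char: "(other)", count: tailCount }]
    : head;
}

const rows = [
  { char: "a", count: 5 },
  { char: "b", count: 3 },
  { char: "c", count: 2 },
  { char: "d", count: 1 },
];
console.log(topWithTail(rows, 2));
```

The tail bucket keeps the total unchanged, so percentages computed from the reduced rows still refer to the whole text.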

Examples

English Text Analysis

Input "hello world" → Result: 'l' appears 3 times (27.3%), 'o' appears 2 times (18.2%)

Chinese Text Analysis

Input "我爱中国我爱北京" → Result: '我' and '爱' each appear 2 times (25%)

Frequency Distribution Comparison

English text: 'e' is the most frequent letter at ~12.7% of letters; Chinese text: '的' is the most frequent character at ~4%

FAQ

What does the analyzer count?

Each unique character and how often it appears, sorted by frequency. Useful for cipher analysis (English text has predictable letter frequencies starting with E, T, A, O), text profiling, and finding unexpected characters in a copy-pasted document.

Is whitespace counted?

Yes by default: spaces, tabs, and newlines each get their own row. Use the ignore-spaces and ignore-newlines options if you only want visible characters. Whitespace is often the most common 'character' in natural-language text; in English prose the space alone is typically around 15-20% of all characters.
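Whitespace rows are hard to read when the character itself is invisible. One way to make them legible is to label each row with its code point and, for common invisible characters, a name. The sketch below is an assumption about how such labels could be produced, not a description of the page: `describeChar` and the `NAMED` table are hypothetical, and the table covers only a handful of characters.

```javascript
// Hypothetical lookup of names for a few common invisible characters.
const NAMED = new Map([
  [" ", "SPACE"],
  ["\t", "TAB"],
  ["\n", "LINE FEED"],
  ["\r", "CARRIAGE RETURN"],
  ["\u00A0", "NO-BREAK SPACE"],
  ["\u200B", "ZERO WIDTH SPACE"],
  ["\u200D", "ZERO WIDTH JOINER"],
  ["\uFEFF", "BOM / ZERO WIDTH NO-BREAK SPACE"],
  ["\uFFFD", "REPLACEMENT CHARACTER"],
]);

// Build a label such as "U+00A0 NO-BREAK SPACE" or "U+0061 a".
function describeChar(ch) {
  const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  const name = NAMED.get(ch);
  return name ? `U+${hex} ${name}` : `U+${hex} ${ch}`;
}

console.log(describeChar("\u00A0")); // U+00A0 NO-BREAK SPACE
console.log(describeChar("a"));      // U+0061 a
```

With labels like these, a non-breaking space and an ordinary space appear as two clearly different rows instead of two blank ones.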

Is the count case-sensitive?

By default, yes: 'A' and 'a' are counted separately. Toggle 'ignore case' to fold them together, which is common for natural-language analysis. For code or hash inspection, case sensitivity matters, so leave the default on.

Does it work for Chinese, Japanese, Korean?

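Yes. Each CJK character is counted individually. The frequency table for a Chinese paragraph naturally shows hundreds of distinct characters, because Chinese is written with thousands of characters rather than a small alphabet. The tool iterates by code point or grapheme cluster rather than by UTF-16 code unit, so rare ideographs outside the Basic Multilingual Plane are not split in two.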

Can I see the frequency in percent?

Yes, each row shows the count and its percentage of the total. This is useful for comparing letter distributions to known reference distributions when breaking simple substitution ciphers.
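As a small illustration of comparing a distribution to a reference, the sketch below guesses the shift of a Caesar cipher by assuming the most frequent ciphertext letter stands for E. This is a teaching example under that single assumption, not a feature of the page; the function names `guessCaesarShift` and `shiftLetters` are hypothetical, and the guess can be wrong on short or atypical texts.

```javascript
// Hypothetical helper: guess the Caesar shift from the most frequent letter.
function guessCaesarShift(ciphertext) {
  const counts = new Array(26).fill(0);
  for (const ch of ciphertext.toLowerCase()) {
    const index = ch.charCodeAt(0) - 97; // 97 is the code of "a"
    if (index >= 0 && index < 26) counts[index] += 1;
  }
  const mostCommon = counts.indexOf(Math.max(...counts));
  // Assume the most frequent letter is an encrypted "e" (index 4).
  return (mostCommon - 4 + 26) % 26;
}

// Hypothetical helper: shift lowercase a-z by `shift`, leave everything else.
function shiftLetters(text, shift) {
  return text.replace(/[a-z]/g, (ch) => {
    const moved = (((ch.charCodeAt(0) - 97 + shift) % 26) + 26) % 26;
    return String.fromCharCode(moved + 97);
  });
}

const plaintext = "meet me near the green tree before seven";
const ciphertext = shiftLetters(plaintext, 3);
const shift = guessCaesarShift(ciphertext);
console.log(shift);                            // 3
console.log(shiftLetters(ciphertext, -shift)); // the original plaintext
```

For a general simple substitution cipher, the same idea is applied letter by letter against the whole reference ranking, usually with manual correction.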

Why is the percentage not adding up to 100?

Rounding. Each cell is rounded to a fixed number of decimals, so the sum of the rounded percentages can drift from 100% by a few tenths. Sum the raw counts for an exact total.
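A three-character example shows the drift. The sketch below is illustrative only: `roundedPercentages` is a hypothetical helper, and one decimal place is an assumed display precision.

```javascript
// Hypothetical helper: round each share to a fixed number of decimals.
function roundedPercentages(counts, decimals = 1) {
  const total = counts.reduce((sum, n) => sum + n, 0);
  if (total === 0) return counts.map(() => 0);
  return counts.map((n) => Number(((n / total) * 100).toFixed(decimals)));
}

// Text "abc": three characters, one occurrence each.
const rounded = roundedPercentages([1, 1, 1]);
const sum = rounded.reduce((a, b) => a + b, 0);
console.log(rounded);        // [ 33.3, 33.3, 33.3 ]
console.log(sum.toFixed(1)); // "99.9"
```

Each share is exactly one third, but every cell rounds down to 33.3, so the displayed column totals 99.9 while the raw counts still sum to the full text length.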

Is my text uploaded?

No. Frequency analysis runs in your browser. Pasted text is not transmitted.