What is the difference between NFC and NFD?

NFC composes base characters and combining marks into precomposed forms where they exist (e.g. 'é' as one code point). NFD decomposes them into base + combining marks (e.g. 'e' + combining acute). NFC is the recommended default for app, web, and database text; NFD is useful for inspection and for pipelines that want to strip or analyse diacritics.
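A minimal sketch of the difference, using only the standard `String.prototype.normalize` method. The sample strings are illustrative, not taken from the tool.

```javascript
// "é" written two ways, with escapes so the difference is visible in source.
const composed = "\u00E9";    // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const decomposed = "e\u0301"; // "e" + U+0301 COMBINING ACUTE ACCENT

const rawEqual = composed === decomposed; // different code points
const nfcEqual = composed.normalize("NFC") === decomposed.normalize("NFC");
const nfdLength = composed.normalize("NFD").length; // base + combining mark
const nfcLength = decomposed.normalize("NFC").length; // one precomposed code point

// NFD makes diacritics easy to strip: decompose, then drop combining marks.
const stripped = "caf\u00E9".normalize("NFD").replace(/\p{M}/gu, "");

console.log({ rawEqual, nfcEqual, nfdLength, nfcLength, stripped });
```

The diacritic-stripping line is the kind of pipeline where NFD is the convenient intermediate form; the result would normally be stored as NFC again.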

What is the difference between NFC and NFKC?

NFC applies only canonical equivalences — characters that are strictly, officially the same. NFKC also applies compatibility mappings, which fold characters that merely look equivalent: ligatures become separate letters, full-width ASCII becomes normal ASCII, superscript digits become regular digits, Hebrew presentation forms become their base forms. NFKC is useful for search indexes but dangerous for display text.
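The folding can be seen with a few sample characters. This is a small sketch with illustrative inputs; it shows that NFC leaves compatibility characters alone while NFKC rewrites them.

```javascript
const samples = {
  ligature: "\uFB01",        // LATIN SMALL LIGATURE FI
  fullWidth: "\uFF21\uFF22", // full-width "AB"
  superscript: "x\u00B2",    // "x" + SUPERSCRIPT TWO
};

const nfc = {};
const nfkc = {};
for (const [name, value] of Object.entries(samples)) {
  nfc[name] = value.normalize("NFC");   // canonical only: unchanged here
  nfkc[name] = value.normalize("NFKC"); // compatibility folding applied
}

console.log(nfkc); // ligature, full-width and superscript forms are folded
```

The superscript case shows why NFKC is risky for display text: "x squared" silently becomes "x2".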

What is a zero-width space?

U+200B (Zero Width Space) is an invisible code point. It exists as a soft word-break hint, but in practice it is injected by chat apps, CMS editors, and WYSIWYG exports and silently breaks regex, exact-match search, password comparisons, and joins. It is not the same as ZWJ or ZWNJ, which are legitimate joiners used in many scripts.
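A short sketch of the failure mode, assuming a value that was pasted with a ZWSP in the middle. The variable names are illustrative.

```javascript
const pasted = "pass\u200Bword"; // looks like "password" on screen
const typed = "password";

const equalBefore = pasted === typed;           // the ZWSP makes them differ
const regexBefore = /^password$/.test(pasted);  // exact-match regex fails too

// Remove only ZWSP; ZWJ (U+200D) and ZWNJ (U+200C) are left untouched.
const cleaned = pasted.replace(/\u200B/g, "");
const equalAfter = cleaned === typed;

console.log({ equalBefore, regexBefore, equalAfter });
```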

What is a non-breaking space?

U+00A0 (No-Break Space, NBSP) looks like a regular space but is a different code point. It prevents line breaks between its neighbours and commonly appears in content pasted from Word, Google Docs, or web pages. It breaks word-splitting, exact-match search, CSV parsing, and CSS whitespace handling. Replacing it with a regular ASCII space is safe in almost all app pipelines.
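As an illustration, splitting on a literal space misses an NBSP. The sketch below uses a made-up input; note that JavaScript's `\s` class does match U+00A0, so the problem mostly shows up with literal-space splits and exact comparisons.

```javascript
const pastedText = "hello\u00A0world"; // NBSP between the words

const wordsBefore = pastedText.split(" ");     // one element: no ASCII space found
const wordsViaRegex = pastedText.split(/\s+/); // \s includes U+00A0 in JavaScript

// Replace NBSP with a regular ASCII space, then split as usual.
const fixed = pastedText.replace(/\u00A0/g, " ");
const wordsAfter = fixed.split(" ");

console.log({ wordsBefore, wordsViaRegex, wordsAfter });
```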

Why are smart quotes breaking my code?

Curly quotes (U+2018, U+2019, U+201C, U+201D) are valid typography in prose but are different code points from ASCII ' and ". JSON, JavaScript, SQL, shell, and CSV parsers do not accept them as string delimiters. Applying the smart-quotes-to-ASCII rule (included in the developer-safe profile) makes pasted content parseable as code.
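The JSON case is easy to reproduce. The following is a sketch with an invented payload; the replacement step is a plain regex, not the tool's actual rule implementation.

```javascript
// A JSON-looking snippet pasted from a word processor, with curly double quotes.
const pastedJson = "{\u201Cname\u201D: \u201CAda\u201D}";

let parseError = null;
try {
  JSON.parse(pastedJson);
} catch (err) {
  parseError = err; // SyntaxError: curly quotes are not string delimiters
}

// Map curly single and double quotes to their ASCII counterparts.
const asciiJson = pastedJson
  .replace(/[\u2018\u2019]/g, "'")
  .replace(/[\u201C\u201D]/g, '"');
const parsed = JSON.parse(asciiJson);

console.log(parseError instanceof SyntaxError, parsed.name);
```

A blanket replacement like this also rewrites curly quotes that were intended as content inside string values, so it suits code-like input better than prose.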

What are bidi controls and why are they dangerous in source code?

Bidi controls are invisible Unicode code points that change the visual order of text without changing its byte order. In source code, that means the file can look one way to a reviewer and parse a different way to the compiler — the basis of Trojan Source attacks. They are also used in filename spoofing. Bidi controls should be surfaced in code review and stripped from identifiers, hostnames, and configuration files.

Should I strip all invisible characters?

No. ZWJ (U+200D) and ZWNJ (U+200C) are legitimate joiners in Persian, Urdu, Hindi, Bengali, and many other scripts, and ZWJ is also what holds emoji ZWJ sequences together (families, professions, and some flags such as the rainbow flag). Stripping them blindly breaks real text in real languages. The developer-safe profile strips ZWSP, WJ, and MVS but preserves ZWJ and ZWNJ; removing those requires an explicit opt-in rule.
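To see what blind stripping does, here is a sketch using a family emoji built from three people joined by ZWJ. It assumes a runtime with `Intl.Segmenter` (recent browsers, Node 16 or later); the two regexes are illustrative stand-ins, not the tool's rules.

```javascript
// MAN + ZWJ + WOMAN + ZWJ + GIRL renders as a single family emoji.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const countGraphemes = (s) => [...segmenter.segment(s)].length;

const before = countGraphemes(family); // one visible glyph

// Over-aggressive: removes ZWSP, ZWNJ, ZWJ and WJ alike.
const strippedAll = family.replace(/[\u200B-\u200D\u2060]/g, "");
const after = countGraphemes(strippedAll); // falls apart into separate people

// Conservative: removes ZWSP, WJ and MVS only, so the sequence survives.
const strippedSafe = family.replace(/[\u200B\u2060\u180E]/g, "");

console.log({ before, after, intact: strippedSafe === family });
```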

Does this tool upload my text anywhere?

No. In v1 all analysis, normalization, comparison, and sanitization run entirely in your browser. No text is sent to a server, logged, or stored between sessions. This is intentional so that the tool is safe for inspecting sensitive code, credentials, and internal content. You should still follow your own policy before pasting secrets into any browser tool.


Invisible Character Detector & Unicode Normalizer

Q: Why do two identical strings not match?

The usual causes are hidden characters (zero-width spaces, soft hyphens, NBSP, BOM), different Unicode normalization forms (one string is precomposed, the other is decomposed), compatibility variants (full-width vs ASCII, ligatures, superscripts), or bidi controls that change the byte sequence without changing the visible glyphs. The tool identifies which of these applies and shows the first code-point difference.

Reveal zero-width spaces, smart quotes, bidi controls, and Unicode normalization mismatches. Clean and normalize text safely in your browser.


All analysis, normalization, comparison, and cleanup run entirely in your browser. No text is uploaded, logged, or stored between sessions. Safe for debugging credentials, private queries, or internal content — but always follow your own policy before pasting secrets into any browser tool.


Overview

Unicode is the reason modern software can handle every language, emoji, and symbol in the same string. It is also the reason two visually identical strings sometimes fail equality checks, regex patterns break on content that looks clean, JSON parsers reject pasted input, passwords do not match, and CSV/SQL joins silently return empty results. A tiny fraction of characters — many of them invisible — cause a disproportionate amount of real-world bugs.

This tool reveals those characters. It highlights every zero-width space, non-breaking space, smart quote, soft hyphen, bidi control, variation selector, and combining mark in your text, and explains what each one commonly breaks. It normalizes text to NFC, NFD, NFKC, or NFKD with clear before/after views, and it compares two strings under every normalization form so you can see exactly why they differ.

Cleanup is deliberately separate from inspection. Profiles (developer-safe, content-safe, security-hardened, and minimal) apply curated rule sets that avoid destroying legitimate non-Latin text. Destructive rules are explicitly labelled, and zero-width joiners used in emoji and Indic/Perso-Arabic scripts are preserved unless you deliberately opt into stripping them.

Use cases

When to use it

Fixing copy/paste bugs from Slack, PDFs, Word, Google Docs, or CMS editors: those sources routinely inject NBSPs, smart quotes, soft hyphens, and occasionally zero-width spaces. This tool surfaces them and lets you clean the result safely.
Debugging string equality issues in tests, databases, and APIs: compare two strings and see whether they differ at the byte level, only under NFC, or because one contains hidden characters the other does not.
Cleaning JSON, regex, SQL, HTML, and CSV inputs: smart quotes break JSON and SQL, soft hyphens break exact-match search, NBSPs break whitespace parsing. A developer-safe pass clears the common offenders without altering legitimate content.
Spotting bidi controls in code and configuration files: Trojan Source-style attacks hide directional overrides in source code. The security review rendering makes every bidi control visible as a token.
Preparing text for storage, comparison, or export: normalize to NFC for predictable equality, strip BOMs and line separators for parser compatibility, and replace NBSPs with ASCII spaces when consistency matters.
Inspecting CMS output before publishing: the content-safe profile removes BOM and soft hyphens and normalizes to NFC without touching smart quotes, dashes, or legitimate typography.

When it's not enough

Assuming every invisible character is a bug: ZWJ and ZWNJ are legitimate and often required in Persian, Urdu, Hindi, Bengali, and in emoji sequences. The tool treats them as informational by default.
Using NFKC or NFKD as a general-purpose normalization: compatibility normalization folds ligatures, full-width forms, superscripts, and other characters that merely look equivalent. Apply only when you specifically want that folding (for example a search index).
Treating ASCII-only as transliteration: the ASCII-only rule removes non-ASCII code points. It does not transliterate and does not preserve meaning. Use it only when an ASCII-only target is a hard requirement.
Stripping all default-ignorable characters globally: the tool deliberately never strips ZWJ/ZWNJ without an explicit opt-in because that would break emoji and many non-English scripts.
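The point about ASCII-only can be made concrete. In this sketch `asciiOnly` is a hypothetical stand-in for such a rule (a one-line regex), not the tool's implementation.

```javascript
// Hypothetical ASCII-only rule: drop every code unit outside U+0000..U+007F.
const asciiOnly = (s) => s.replace(/[^\x00-\x7F]/g, "");

const latin = asciiOnly("na\u00EFve caf\u00E9"); // accented letters vanish entirely
const cjk = asciiOnly("\u6771\u4EAC");           // "Tokyo" in kanji: nothing survives

// Decomposing and dropping combining marks first keeps Latin base letters,
// but this is still not transliteration: non-Latin text is still removed.
const stripMarks = (s) => s.normalize("NFD").replace(/\p{M}/gu, "");
const latinKept = asciiOnly(stripMarks("na\u00EFve caf\u00E9"));

console.log({ latin, cjk, latinKept });
```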

How to use it

1. Paste text or pick an example
The inspector shows per-character counts (code points, UTF-16, UTF-8 bytes, graphemes) and a highlighted preview with per-category colour coding.
2. Review the highlighted issues
Open the Characters tab. Every non-plain code point appears with its Unicode code point, name, category, cause of trouble, and suggested fix. Use the category filters to isolate bidi controls, invisibles, or any other single type.
3. Inspect alternate views if needed
The Views tab shows an escaped rendering, a JSON-safe rendering, a flat code-point report, and a security-review rendering that replaces bidi controls with visible tokens.
4. Compare two strings if equality is the question
Switch to Compare mode. The result includes raw, NFC, NFD, NFKC, and NFKD equality, a natural-language explanation, the first differing code point on each side, and any hidden characters that appear in only one of the two strings.
5. Normalize if needed
Open Normalize mode, pick NFC for most app/web/database text, or NFD/NFKC/NFKD for specific workflows. Compatibility forms are clearly flagged as lossy.
6. Clean with a profile or custom rules
Choose developer-safe, content-safe, security-hardened, minimal, or build a custom rule set. The change log shows exactly how many characters each rule removed or replaced, and destructive rules are labelled LOSSY.
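The four counts shown in step 1 can be approximated with standard APIs. This is a sketch under the assumption that `TextEncoder` and `Intl.Segmenter` are available (recent browsers, Node 16 or later); `countUnits` is an illustrative name, and the tool's own counting may differ in detail.

```javascript
function countUnits(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codePoints: [...text].length,                      // string iteration is per code point
    utf16Units: text.length,                           // .length counts UTF-16 code units
    utf8Bytes: new TextEncoder().encode(text).length,  // encoded byte length
    graphemes: [...segmenter.segment(text)].length,    // user-perceived characters
  };
}

// "e" + combining acute accent, followed by an emoji outside the BMP.
const counts = countUnits("e\u0301\u{1F600}");
console.log(counts);
```

The four numbers disagree for this input, which is exactly the situation where length limits and truncation logic tend to go wrong.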

Common errors and fixes

Zero-width space breaking regex or a password comparison

Run a developer-safe or security-hardened sanitization pass. Both strip zero-width spaces (ZWSP, WJ, MVS) while preserving ZWJ/ZWNJ used in legitimate scripts and emoji.

Smart quotes breaking JSON or JavaScript string literals

Apply the smart-quotes-to-ASCII rule (included in the developer-safe profile). Curly quotes become straight ' and " so the string parses as code.

NBSPs breaking layout, joins, or exact-match searches

Replace NBSP (U+00A0) and narrow NBSP (U+202F) with a regular space. Developer-safe does this. Content-safe also does it because it rarely changes visible output.

Soft hyphens breaking exact-match search or database lookups

Soft hyphen (U+00AD) is invisible and safe to strip in almost all text pipelines. Included in developer-safe, security-hardened, and content-safe profiles.

Bidi controls making code look different from what the compiler sees

Use Inspect → Views → Security review to render every bidi control as a visible token. Security-hardened sanitization strips the full bidi-control set. Never accept invisible bidi controls silently in source review.

NFC/NFD mismatches breaking equality in databases or joins

Normalize both sides to NFC before comparing or storing. This is the NFC rule in every profile. NFC is the recommended default unless you specifically need another form.
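A join that silently returns nothing can look like this. The records and field names below are invented for illustration; the fix is simply to normalize the key on both sides.

```javascript
const customers = [{ id: 1, name: "Zo\u00EB" }];          // precomposed "ë"
const orders = [{ customer: "Zoe\u0308", total: 40 }];    // "e" + combining diaeresis

// Raw keys: the lookup misses because the code points differ.
const rawIndex = new Map(customers.map((c) => [c.name, c]));
const rawJoin = orders.filter((o) => rawIndex.has(o.customer));

// NFC keys on both sides: the lookup succeeds.
const toKey = (s) => s.normalize("NFC");
const nfcIndex = new Map(customers.map((c) => [toKey(c.name), c]));
const nfcJoin = orders.filter((o) => nfcIndex.has(toKey(o.customer)));

console.log(rawJoin.length, nfcJoin.length);
```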

Over-aggressive cleanup destroying legitimate non-Latin text

Stop using a one-size-fits-all strip. Switch to the content-safe profile, which normalizes to NFC and removes BOM/soft hyphens without touching smart quotes, dashes, joiners, or combining marks used by many languages.

Frequently asked questions


Why do two identical strings not match in code or databases?

Two strings that look identical on screen can still fail equality for four common reasons:

Hidden characters. One string contains a zero-width space, soft hyphen, non-breaking space, BOM, or variation selector that the other does not. Invisible on screen, different at the byte level.
Normalization form. One string uses precomposed characters (café as a single é), the other uses a base letter plus a combining accent (cafe + ́). Normalize both sides to NFC and they compare equal.
Compatibility equivalents. Full-width letters, ligatures, superscripts, or Hebrew presentation forms look like ASCII counterparts but are different code points. They fold under NFKC / NFKD, not NFC.
Bidi or directional differences. Invisible bidi controls can appear in logs, filenames, and copy/paste text. They do not change the rendered glyphs but they change the bytes.
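A rough way to tell these causes apart in code is to compare the two strings raw and under each normalization form, then locate the first differing code point. The sketch below is an approximation of that idea; `compareStrings` and its result shape are assumptions, not the tool's API.

```javascript
const FORMS = ["NFC", "NFD", "NFKC", "NFKD"];

// Format one code point as U+XXXX, or "(end)" when a string ran out.
const toHex = (ch) =>
  ch === undefined
    ? "(end)"
    : "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");

function compareStrings(a, b) {
  const result = { raw: a === b, firstDiff: null };
  for (const form of FORMS) {
    result[form] = a.normalize(form) === b.normalize(form);
  }
  const left = [...a];  // arrays of code points, not UTF-16 units
  const right = [...b];
  const max = Math.max(left.length, right.length);
  for (let i = 0; i < max; i++) {
    if (left[i] !== right[i]) {
      result.firstDiff = { index: i, left: toHex(left[i]), right: toHex(right[i]) };
      break;
    }
  }
  return result;
}

const accentReport = compareStrings("caf\u00E9", "cafe\u0301"); // canonical difference
const ligatureReport = compareStrings("\uFB01le", "file");      // compatibility difference
console.log(accentReport, ligatureReport);
```

If the strings match under NFC, the cause is a normalization form; if they only match under NFKC or NFKD, it is a compatibility variant; if they match under none, look for hidden characters or bidi controls.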

What is Unicode normalization and which form should I use?

Unicode normalization rewrites equivalent representations into a single agreed sequence, composing or decomposing characters and putting combining marks into a fixed order, so that two logically equal strings compare equal byte-for-byte. There are four forms:

NFC (Canonical Composition) — recommended default for most app, web, and database text. Base characters and combining marks are composed into precomposed forms where possible.
NFD (Canonical Decomposition) — precomposed characters are split into base + combining marks. Useful for inspection, diacritic-stripping pipelines, and some text analysis. Not recommended for storage.
NFKC / NFKD (Compatibility forms) — more aggressive. Compatibility mappings fold characters that merely look equivalent: ligatures become separate letters, full-width ASCII becomes normal ASCII, superscript digits become regular digits. Useful for search-index preprocessing; dangerous for display text.

The practical rule: apply NFC everywhere you are storing, indexing, or comparing general text. Only apply NFKC or NFKD to specific pipelines where compatibility folding is actually what you want.
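One way to apply that rule is to keep two derived values per string: an NFC value for storage and display, and a separate NFKC key used only for search. The helper names and the lowercasing step below are assumptions for illustration.

```javascript
// Stored and displayed value: canonical composition only, nothing is folded.
const toStored = (s) => s.normalize("NFC");

// Search key (assumption: case-insensitive search is wanted, hence toLowerCase).
const toSearchKey = (s) => s.normalize("NFKC").toLowerCase();

const title = "\uFF28\uFF45llo"; // full-width "H" and "e" followed by ASCII "llo"

const stored = toStored(title);        // full-width letters are preserved
const key = toSearchKey(title);        // folded to plain ASCII for matching
const queryKey = toSearchKey("Hello"); // a plain ASCII query produces the same key

console.log({ unchanged: stored === title, key, matches: key === queryKey });
```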

Invisible gremlins cheat sheet

Character	Unicode	What it breaks	Suggested fix
Zero-Width Space (ZWSP)	U+200B	Regex, exact-match search, password equality, joins. Injected by chat apps and WYSIWYG editors.	Strip. Safe for almost all app text.
Zero-Width Joiner (ZWJ)	U+200D	Legitimate in Indic scripts and emoji ZWJ sequences (families, professions, some flags such as the rainbow flag). Still invisible.	Do not strip blindly. Keep for emoji and many languages.
Zero-Width Non-Joiner (ZWNJ)	U+200C	Legitimate in Persian, Urdu, Hindi, Bengali, and more. Invisible.	Do not strip blindly. Required in many Perso-Arabic and Indic strings.
Non-Breaking Space (NBSP)	U+00A0	Layout, word splitting, exact-match search, CSV. Common when pasting from Word/Google Docs/web pages.	Replace with a regular ASCII space.
Smart Quotes	U+2018–U+201D	JSON, JavaScript, SQL, shell, CSV when pasted into code.	Replace with ASCII `'` and `"` in code contexts.
Soft Hyphen	U+00AD	Invisible conditional line-break. Breaks exact-match search, password equality, DB lookups.	Strip. Safe for almost all general text.
LRO / RLO (Directional Overrides)	U+202D / U+202E	Reverses visual glyph order. Core building blocks of Trojan Source attacks and filename spoofing.	Strip in code and configuration. Never accept silently in source review.
BOM / ZWNBSP	U+FEFF	JSON parsers, shebang lines, first-line parsers. Added by Windows editors and Excel CSV export.	Strip from stored/parsed text.

This cheat sheet is deliberately short and practical. ZWJ and ZWNJ are listed because they are invisible and can surprise you — not because they are bugs.
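The cheat sheet translates fairly directly into a lookup table. The scanner below is a sketch: `findGremlins`, the short labels, and the result shape are illustrative choices, and it reports characters without judging whether they are bugs.

```javascript
// Code points from the cheat sheet, with short labels.
const GREMLINS = new Map([
  [0x200B, "ZWSP"], [0x200D, "ZWJ"], [0x200C, "ZWNJ"], [0x00A0, "NBSP"],
  [0x00AD, "SOFT HYPHEN"], [0x202D, "LRO"], [0x202E, "RLO"], [0xFEFF, "BOM"],
]);
// Smart quotes: the cheat sheet lists the range U+2018 to U+201D.
for (let cp = 0x2018; cp <= 0x201D; cp++) GREMLINS.set(cp, "SMART QUOTE");

function findGremlins(text) {
  const hits = [];
  let offset = 0; // UTF-16 offset, usable with String.prototype.slice
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (GREMLINS.has(cp)) {
      const hex = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
      hits.push({ offset, codePoint: hex, label: GREMLINS.get(cp) });
    }
    offset += ch.length;
  }
  return hits;
}

const hits = findGremlins("\uFEFFa\u00A0b\u200B");
console.log(hits);
```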

Why bidi controls are a real security issue

Unicode includes a set of bidirectional formatting controls that let mixed left-to-right and right-to-left text render correctly. They are also the building blocks of Trojan Source attacks: the visual order of a line of code can be forced to diverge from its logical byte order. A reviewer reading the file sees one thing; the compiler parses something else.

The same technique appears in filenames (an executable named doc\u202Egpj.exe can render as docexe.jpg), in configuration files, and in text pasted into ticketing systems. Detection is not optional — directional controls are invisible. The tool flags every bidi control as a high-severity finding and offers a security review rendering that replaces them with visible tokens so code review can actually see them.

The security-hardened sanitization profile strips the full bidi-control set (LRE, RLE, LRO, RLO, PDF, LRI, RLI, FSI, PDI) and also the bidi marks (LRM, RLM, ALM). Use it for identifiers, hostnames, filenames, and any source reviewable by humans.
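Both ideas, rendering bidi controls as visible tokens and stripping them, fit in a few lines. This is a sketch: the `<RLO>`-style token format and the function names are assumptions, not the tool's actual rendering.

```javascript
// The bidi controls and marks named above, keyed by character.
const BIDI_NAMES = {
  "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF", "\u202D": "LRO", "\u202E": "RLO",
  "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
  "\u200E": "LRM", "\u200F": "RLM", "\u061C": "ALM",
};
const BIDI_RE = /[\u202A-\u202E\u2066-\u2069\u200E\u200F\u061C]/g;

// Replace each control with a visible token so a reviewer can see it.
const revealBidi = (s) => s.replace(BIDI_RE, (ch) => `<${BIDI_NAMES[ch]}>`);

// Remove the controls entirely, e.g. for identifiers or filenames.
const stripBidi = (s) => s.replace(BIDI_RE, "");

const spoofed = "doc\u202Egpj.exe"; // the filename example from the text
const revealed = revealBidi(spoofed);
const stripped = stripBidi(spoofed);
console.log(revealed, stripped);
```

Stripping shows the logical name the operating system actually sees, which still ends in `.exe`.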

How to fix this in JavaScript

The same rules the tool applies can be used directly in JavaScript with standard APIs — no library required.

// Assumes `userInput` holds the string to clean.

// 1. Normalize to NFC before storage or comparison
const a = userInput.normalize("NFC");

// 2. Strip safe zero-width characters (keep ZWJ/ZWNJ):
//    ZWSP U+200B, WJ U+2060, MVS U+180E, CGJ U+034F, BOM U+FEFF
const ZW_SAFE = /[\u200B\u2060\u180E\u034F\uFEFF]/g;
const noZw = a.replace(ZW_SAFE, "");

// 3. Replace curly quotes with ASCII
const noSmart = noZw
  .replace(/[\u2018\u2019\u201A\u201B]/g, "'")
  .replace(/[\u201C\u201D\u201E\u201F]/g, '"');

// 4. Replace NBSP and narrow NBSP with a regular space
const cleanSpaces = noSmart.replace(/[\u00A0\u202F]/g, " ");

// 5. Detect bidi controls (high-severity for code review)
const BIDI = /[\u202A-\u202E\u2066-\u2069\u200E\u200F\u061C]/;
if (BIDI.test(cleanSpaces)) {
  // flag, do not accept silently
}

// 6. Re-normalize: stripping can leave a base letter directly before a
//    combining mark that a removed character used to separate
const cleaned = cleanSpaces.normalize("NFC");

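Note the separation: normalize first, then clean. NFC only rewrites canonically equivalent sequences, so it does not change what the text means; stripping is destructive and must be done with care. If a stripped character sat between a base letter and a combining mark, apply NFC once more after cleaning so the pair can compose. ZWJ and ZWNJ are deliberately left alone because they are needed for emoji and many non-Latin scripts.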

Local, private, and zero-upload in v1

Classification, normalization, comparison, and sanitization all run in your browser. No text is uploaded to a server, logged, or stored between sessions. That makes this tool safe for debugging sensitive code, credentials, internal queries, and private content. You should still follow your own organization's policy before pasting secrets into any browser-based tool.