Invisible Character Detector & Unicode Normalizer

Reveal zero-width spaces, smart quotes, bidi controls, and Unicode normalization mismatches. Clean and normalize text safely in your browser.

All analysis, normalization, comparison, and cleanup run entirely in your browser. No text is uploaded, logged, or stored between sessions. Safe for debugging credentials, private queries, or internal content — but always follow your own policy before pasting secrets into any browser tool.

Overview

Unicode is the reason modern software can handle every language, emoji, and symbol in the same string. It is also the reason two visually identical strings sometimes fail equality checks, regex patterns break on content that looks clean, JSON parsers reject pasted input, passwords do not match, and CSV/SQL joins silently return empty results. A tiny fraction of characters — many of them invisible — cause a disproportionate amount of real-world bugs.

This tool reveals those characters. It highlights every zero-width space, non-breaking space, smart quote, soft hyphen, bidi control, variation selector, and combining mark in your text, and explains what each one commonly breaks. It normalizes text to NFC, NFD, NFKC, or NFKD with clear before/after views, and it compares two strings under every normalization form so you can see exactly why they differ.
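As a rough illustration of what "revealing" a character involves, the sketch below walks a string by code point and reports every match against a small lookup table. The `KNOWN` table and the `findHidden` helper are hypothetical names invented for this example and cover only a handful of characters; they are not the tool's actual implementation.

```javascript
// Hypothetical lookup table: a small subset of characters, not the tool's real list.
const KNOWN = new Map([
  [0x200b, "ZERO WIDTH SPACE"],
  [0x200c, "ZERO WIDTH NON-JOINER"],
  [0x200d, "ZERO WIDTH JOINER"],
  [0x00a0, "NO-BREAK SPACE"],
  [0x00ad, "SOFT HYPHEN"],
  [0x202e, "RIGHT-TO-LEFT OVERRIDE"],
  [0xfeff, "ZERO WIDTH NO-BREAK SPACE (BOM)"],
]);

// Walk the string by code point and report every character found in the table.
function findHidden(text) {
  const findings = [];
  let index = 0; // UTF-16 offset of the current code point
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (KNOWN.has(cp)) {
      findings.push({
        index,
        codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
        name: KNOWN.get(cp),
      });
    }
    index += ch.length; // astral characters occupy two UTF-16 units
  }
  return findings;
}

// "password" with a zero-width space in the middle, followed by an NBSP
const report = findHidden("pass\u200Bword\u00A0ok");
console.log(report);
```

The report carries the UTF-16 offset of each finding, so a caller could highlight the exact position in an editor.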

Cleanup is deliberately separate from inspection. Profiles (developer-safe, content-safe, security-hardened, and minimal) apply curated rule sets that avoid destroying legitimate non-Latin text. Destructive rules are explicitly labelled, and zero-width joiners used in emoji and Indic/Perso-Arabic scripts are preserved unless you deliberately opt into stripping them.

Use cases

When to use it

  • Fixing copy/paste bugs from Slack, PDFs, Word, Google Docs, or CMS editors: Those sources routinely inject NBSPs, smart quotes, soft hyphens, and occasionally zero-width spaces. This tool surfaces them and lets you clean the result safely.
  • Debugging string equality issues in tests, databases, and APIs: Compare two strings and see whether they differ at the byte level, only under NFC, or because one contains hidden characters the other does not.
  • Cleaning JSON, regex, SQL, HTML, and CSV inputs: Smart quotes break JSON and SQL, soft hyphens break exact-match search, NBSPs break whitespace parsing. A developer-safe pass clears the common offenders without altering legitimate content.
  • Spotting bidi controls in code and configuration files: Trojan Source-style attacks hide directional overrides in source code. The security review rendering makes every bidi control visible as a token.
  • Preparing text for storage, comparison, or export: Normalize to NFC for predictable equality, strip BOMs and line separators for parser compatibility, and replace NBSPs with ASCII spaces when consistency matters.
  • Inspecting CMS output before publishing: The content-safe profile removes BOM and soft hyphens and normalizes to NFC without touching smart quotes, dashes, or legitimate typography.

When it's not enough

  • Assuming every invisible character is a bug: ZWJ and ZWNJ are legitimate and often required in Persian, Urdu, Hindi, Bengali, and in emoji sequences. The tool treats them as informational by default.
  • Using NFKC or NFKD as a general-purpose normalization: Compatibility normalization folds ligatures, full-width forms, superscripts, and other characters that merely look equivalent. Apply only when you specifically want that folding (for example a search index).
  • Treating ASCII-only as transliteration: The ASCII-only rule removes non-ASCII code points. It does not transliterate and does not preserve meaning. Use it only when an ASCII-only target is a hard requirement.
  • Stripping all default-ignorable characters globally: The tool deliberately never strips ZWJ/ZWNJ without an explicit opt-in because that would break emoji and many non-English scripts.
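Two of the pitfalls above are easy to demonstrate. The sketch below assumes a runtime that provides `Intl.Segmenter` (recent browsers and Node.js); the `graphemes` helper is a hypothetical name used only for this example.

```javascript
// Count user-perceived characters (grapheme clusters).
const graphemes = (s) =>
  [...new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(s)].length;

// Man + ZWJ + woman + ZWJ + girl renders as a single family emoji.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

// Stripping ZWJ blindly splits the sequence into three separate emoji.
const stripped = family.replace(/\u200D/g, "");

console.log(graphemes(family));   // 1
console.log(graphemes(stripped)); // 3

// An ASCII-only rule deletes characters; it does not transliterate them.
const asciiOnly = "caf\u00E9".replace(/[^\x00-\x7F]/g, "");
console.log(asciiOnly); // "caf", not "cafe"
```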

How to use it

  1. Paste text or pick an example

    The inspector shows per-character counts (code points, UTF-16, UTF-8 bytes, graphemes) and a highlighted preview with per-category colour coding.

  2. Review the highlighted issues

    Open the Characters tab. Every non-plain code point appears with its Unicode code point, name, category, cause of trouble, and suggested fix. Use the category filters to isolate bidi controls, invisibles, or any other single type.

  3. Inspect alternate views if needed

    The Views tab shows an escaped rendering, a JSON-safe rendering, a flat code-point report, and a security-review rendering that replaces bidi controls with visible tokens.

  4. Compare two strings if equality is the question

    Switch to Compare mode. The result includes raw, NFC, NFD, NFKC, and NFKD equality, a natural-language explanation, the first differing code point on each side, and any hidden characters that appear in only one of the two strings.

  5. Normalize if needed

    Open Normalize mode, pick NFC for most app/web/database text, or NFD/NFKC/NFKD for specific workflows. Compatibility forms are clearly flagged as lossy.

  6. Clean with a profile or custom rules

    Choose developer-safe, content-safe, security-hardened, minimal, or build a custom rule set. The change log shows exactly how many characters each rule removed or replaced, and destructive rules are labelled LOSSY.
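The four counts mentioned in step 1 can be approximated with standard APIs. This is a minimal sketch, assuming `TextEncoder` and `Intl.Segmenter` are available in the runtime; `countUnits` is a hypothetical helper name, and the tool's own counting may differ in detail.

```javascript
// Measure one string four ways: code points, UTF-16 units, UTF-8 bytes, graphemes.
function countUnits(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codePoints: [...text].length,                     // string iteration yields code points
    utf16: text.length,                               // .length counts UTF-16 code units
    utf8Bytes: new TextEncoder().encode(text).length, // TextEncoder always emits UTF-8
    graphemes: [...segmenter.segment(text)].length,   // user-perceived characters
  };
}

// "e" + combining acute accent + grinning face emoji
const counts = countUnits("e\u0301\u{1F600}");
console.log(counts);
// { codePoints: 3, utf16: 4, utf8Bytes: 7, graphemes: 2 }
```

The four numbers diverge as soon as the text contains combining marks or characters outside the Basic Multilingual Plane, which is why the inspector reports them separately.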

Common errors and fixes

Zero-width space breaking regex or a password comparison

Run a developer-safe or security-hardened sanitization pass. Both strip the zero-width space (ZWSP) along with the word joiner (WJ) and the Mongolian vowel separator (MVS), while preserving ZWJ/ZWNJ used in legitimate scripts and emoji.

Smart quotes breaking JSON or JavaScript string literals

Apply the smart-quotes-to-ASCII rule (included in the developer-safe profile). Curly quotes become straight ' and " so the string parses as code.
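A short sketch of the failure and the fix, using only `JSON.parse` and `String.prototype.replace`. The `parses` helper is a hypothetical name for this example. Note that a blanket replacement also rewrites curly quotes inside string values, so it suits code-like input rather than prose.

```javascript
// Returns true when the text is valid JSON.
function parses(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// JSON pasted from a word processor: the quotes are U+201C and U+201D.
const pasted = "{\u201Cname\u201D: \u201CAda\u201D}";

// Map curly single and double quotes to their ASCII counterparts.
const fixed = pasted
  .replace(/[\u2018\u2019]/g, "'")
  .replace(/[\u201C\u201D]/g, '"');

console.log(parses(pasted)); // false
console.log(parses(fixed));  // true
```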

NBSPs breaking layout, joins, or exact-match searches

Replace NBSP (U+00A0) and narrow NBSP (U+202F) with a regular space. Developer-safe does this. Content-safe also does it because it rarely changes visible output.

Soft hyphens breaking exact-match search or database lookups

Soft hyphen (U+00AD) is invisible and safe to strip in almost all text pipelines. Included in developer-safe, security-hardened, and content-safe profiles.

Bidi controls making code look different from what the compiler sees

Use Inspect → Views → Security review to render every bidi control as a visible token. Security-hardened sanitization strips the full bidi-control set. Never accept invisible bidi controls silently in source review.

NFC/NFD mismatches breaking equality in databases or joins

Normalize both sides to NFC before comparing or storing. Every profile includes the NFC rule, and NFC is the recommended default unless you specifically need another form.

Over-aggressive cleanup destroying legitimate non-Latin text

Stop using a one-size-fits-all strip. Switch to the content-safe profile, which normalizes to NFC and removes BOM/soft hyphens without touching smart quotes, dashes, joiners, or combining marks used by many languages.

Frequently asked questions

Why do two identical strings not match in code or databases?

Two strings that look identical on screen can still fail equality for four common reasons:

  • Hidden characters. One string contains a zero-width space, soft hyphen, non-breaking space, BOM, or variation selector that the other does not. Invisible on screen, different at the byte level.
  • Normalization form. One string uses precomposed characters (café with é as the single code point U+00E9), the other uses a base letter plus a combining accent (cafe followed by the combining acute accent U+0301). Normalize both sides to NFC and they compare equal.
  • Compatibility equivalents. Full-width letters, ligatures, superscripts, or Hebrew presentation forms look like ASCII counterparts but are different code points. They fold under NFKC / NFKD, not NFC.
  • Bidi or directional differences. Invisible bidi controls can appear in logs, filenames, and copy/paste text. They do not change the rendered glyphs but they change the bytes.
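The first three reasons can be told apart by comparing the two strings raw and under each normalization form. The sketch below uses the standard `String.prototype.normalize`; `compareStrings` is a hypothetical helper, not the tool's Compare mode.

```javascript
const FORMS = ["NFC", "NFD", "NFKC", "NFKD"];

// Report equality of two strings as-is and under every normalization form.
function compareStrings(a, b) {
  const result = { raw: a === b };
  for (const form of FORMS) {
    result[form] = a.normalize(form) === b.normalize(form);
  }
  return result;
}

// Precomposed é versus e + combining acute: equal under every form, unequal raw.
const accent = compareStrings("caf\u00E9", "cafe\u0301");

// The "fi" ligature (U+FB01) versus plain "fi": equal only under compatibility forms.
const ligature = compareStrings("\uFB01le", "file");

// A hidden zero-width space: no normalization form removes it.
const hidden = compareStrings("user\u200Bname", "username");

console.log(accent, ligature, hidden);
```

If every form reports a mismatch, as in the last case, the difference is not a normalization issue and a hidden character is the likelier cause.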

What is Unicode normalization and which form should I use?

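Unicode normalization converts equivalent representations of the same text into one agreed sequence of code points, composing or decomposing characters and putting combining marks in a fixed order, so that two logically equal strings compare equal byte-for-byte. There are four forms: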

  • NFC (Canonical Composition) — recommended default for most app, web, and database text. Base characters and combining marks are composed into precomposed forms where possible.
  • NFD (Canonical Decomposition) — precomposed characters are split into base + combining marks. Useful for inspection, diacritic-stripping pipelines, and some text analysis. Not recommended for storage.
  • NFKC / NFKD (Compatibility forms) — more aggressive. Compatibility mappings fold characters that merely look equivalent: ligatures become separate letters, full-width ASCII becomes normal ASCII, superscript digits become regular digits. Useful for search-index preprocessing; dangerous for display text.

The practical rule: apply NFC everywhere you are storing, indexing, or comparing general text. Only apply NFKC or NFKD to specific pipelines where compatibility folding is actually what you want.
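To see what each form does to a given character, it helps to print the resulting code points. This is a small sketch using `String.prototype.normalize`; `toCodePoints` and the `samples` object are hypothetical names chosen for the example.

```javascript
// Render a string as a space-separated list of U+XXXX code points.
const toCodePoints = (s) =>
  [...s]
    .map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"))
    .join(" ");

const samples = {
  precomposedE: "\u00E9",   // é as one code point
  fullwidthA: "\uFF21",     // full-width Latin capital A
  superscriptTwo: "\u00B2", // superscript 2
};

for (const [label, text] of Object.entries(samples)) {
  for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
    console.log(label, form, toCodePoints(text.normalize(form)));
  }
}
// precomposedE decomposes to "U+0065 U+0301" under NFD and NFKD.
// fullwidthA and superscriptTwo change only under NFKC and NFKD.
```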

Invisible gremlins cheat sheet

| Character | Unicode | What it breaks | Suggested fix |
| --- | --- | --- | --- |
| Zero-Width Space (ZWSP) | U+200B | Regex, exact-match search, password equality, joins. Injected by chat apps and WYSIWYG editors. | Strip. Safe for almost all app text. |
| Zero-Width Joiner (ZWJ) | U+200D | Legitimate in Indic scripts and emoji ZWJ sequences (family, profession, rainbow flag). Still invisible. | Do not strip blindly. Keep for emoji and many languages. |
| Zero-Width Non-Joiner (ZWNJ) | U+200C | Legitimate in Persian, Urdu, Hindi, Bengali, and more. Invisible. | Do not strip blindly. Required in many Perso-Arabic and Indic strings. |
| Non-Breaking Space (NBSP) | U+00A0 | Layout, word splitting, exact-match search, CSV. Common when pasting from Word/Google Docs/web pages. | Replace with a regular ASCII space. |
| Smart Quotes | U+2018–U+201D | JSON, JavaScript, SQL, shell, CSV when pasted into code. | Replace with ASCII ' and " in code contexts. |
| Soft Hyphen | U+00AD | Invisible conditional line-break. Breaks exact-match search, password equality, DB lookups. | Strip. Safe for almost all general text. |
| LRO / RLO (Directional Overrides) | U+202D / U+202E | Reverses visual glyph order. Core building blocks of Trojan Source attacks and filename spoofing. | Strip in code and configuration. Never accept silently in source review. |
| BOM / ZWNBSP | U+FEFF | JSON parsers, shebang lines, first-line parsers. Added by Windows editors and Excel CSV export. | Strip from stored/parsed text. |

This cheat sheet is deliberately short and practical. ZWJ and ZWNJ are listed because they are invisible and can surprise you — not because they are bugs.

Why bidi controls are a real security issue

Unicode includes a set of bidirectional formatting controls that let mixed left-to-right and right-to-left text render correctly. They are also the building blocks of Trojan Source attacks: the visual order of a line of code can be forced to diverge from its logical byte order. A reviewer reading the file sees one thing; the compiler parses something else.

The same technique appears in filenames (an executable named doc\u202Egpj.exe can render as docexe.jpg), in configuration files, and in text pasted into ticketing systems. Detection is not optional — directional controls are invisible. The tool flags every bidi control as a high-severity finding and offers a security review rendering that replaces them with visible tokens so code review can actually see them.

The security-hardened sanitization profile strips the full bidi-control set (LRE, RLE, LRO, RLO, PDF, LRI, RLI, FSI, PDI) and also the bidi marks (LRM, RLM, ALM). Use it for identifiers, hostnames, filenames, and any source reviewable by humans.
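The idea of a security review rendering can be sketched in a few lines: replace each control with a bracketed abbreviation, or strip the whole set. The bracketed token format and the `showBidi` and `stripBidi` helpers are assumptions made for this example; the tool's actual tokens may look different.

```javascript
// Abbreviations for the bidi controls and marks listed above.
const BIDI_NAMES = {
  "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF", "\u202D": "LRO", "\u202E": "RLO",
  "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
  "\u200E": "LRM", "\u200F": "RLM", "\u061C": "ALM",
};
const BIDI_PATTERN = /[\u202A-\u202E\u2066-\u2069\u200E\u200F\u061C]/g;

// Review rendering: make every control visible as a token such as [RLO].
function showBidi(text) {
  return text.replace(BIDI_PATTERN, (ch) => `[${BIDI_NAMES[ch]}]`);
}

// Hardened cleanup: remove the full set.
function stripBidi(text) {
  return text.replace(BIDI_PATTERN, "");
}

const filename = "doc\u202Egpj.exe";
console.log(showBidi(filename));  // doc[RLO]gpj.exe
console.log(stripBidi(filename)); // docgpj.exe
```

Rendering first and stripping second keeps the reviewer informed: the token shows that an override was present before it is removed.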

How to fix this in JavaScript

The same rules the tool applies can be used directly in JavaScript with standard APIs — no library required.

// 1. Normalize to NFC before storage or comparison
const a = userInput.normalize("NFC");

// 2. Strip zero-width characters that are safe to remove: ZWSP, WJ, MVS, CGJ, BOM (keep ZWJ/ZWNJ)
const ZW_SAFE = /[\u200B\u2060\u180E\u034F\uFEFF]/g;
const noZw = a.replace(ZW_SAFE, "");

// 3. Replace curly quotes with ASCII
const noSmart = noZw
  .replace(/[\u2018\u2019\u201A\u201B]/g, "'")
  .replace(/[\u201C\u201D\u201E\u201F]/g, '"');

// 4. Replace NBSP and narrow NBSP with a regular space
const cleanSpaces = noSmart.replace(/[\u00A0\u202F]/g, " ");

// 5. Detect bidi controls (high-severity for code review)
const BIDI = /[\u202A-\u202E\u2066-\u2069\u200E\u200F\u061C]/;
if (BIDI.test(cleanSpaces)) {
  // flag, do not accept silently
}

Note the separation: normalize first, then clean. NFC normalization only swaps canonically equivalent sequences, so the text still renders and reads the same; stripping is destructive and must be done with care. ZWJ and ZWNJ are deliberately left alone because they are needed for emoji and many non-Latin scripts.
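The same steps can be wrapped in a function that also records how many characters each rule touched, similar in spirit to a change log. This is a sketch under stated assumptions: the rule names, the `RULES` array, and the `clean` function are hypothetical and do not reproduce any of the tool's profiles.

```javascript
// Hypothetical rule set: each rule has a name, a global pattern, and a replacement.
const RULES = [
  { name: "strip-zero-width", pattern: /[\u200B\u2060\u180E\uFEFF]/g, replacement: "" },
  { name: "strip-soft-hyphen", pattern: /\u00AD/g, replacement: "" },
  { name: "nbsp-to-space", pattern: /[\u00A0\u202F]/g, replacement: " " },
];

// Normalize to NFC, then apply each rule and log how many characters it changed.
function clean(text) {
  let out = text.normalize("NFC");
  const log = [];
  for (const rule of RULES) {
    const hits = (out.match(rule.pattern) || []).length;
    if (hits > 0) {
      out = out.replace(rule.pattern, rule.replacement);
      log.push({ rule: rule.name, count: hits });
    }
  }
  return { text: out, log };
}

const result = clean("a\u200Bb\u00A0c\u00ADd\u200B");
console.log(JSON.stringify(result.text)); // "ab cd"
console.log(result.log);
```

Keeping the rules as data makes it straightforward to assemble different sets for different contexts without touching the loop.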

Local, private, and zero-upload in v1

Classification, normalization, comparison, and sanitization all run in your browser. No text is uploaded to a server, logged, or stored between sessions. That makes this tool safe for debugging sensitive code, credentials, internal queries, and private content. You should still follow your own organization's policy before pasting secrets into any browser-based tool.