What counts as a special character in text?

It depends on context. Generally, special characters are anything outside basic English letters (A-Z), digits (0-9), and common whitespace. This includes accented letters (é, ñ, ü), currency symbols (€, £, ¥), emoji, mathematical symbols, control characters (tab, null byte), zero-width spaces, and non-Latin scripts (Chinese, Arabic, Cyrillic characters). What's 'special' depends on what your target system can handle.

How do I remove accents from letters without deleting the whole character?

Use Unicode NFD normalization to decompose accented characters into their base letter plus a combining accent mark, then strip the combining marks. For example, 'é' decomposes into 'e' + combining acute accent. Remove the combining mark and you're left with 'e'. This works for café to cafe, München to Munchen, São Paulo to Sao Paulo.

Why does my text have invisible characters?

Invisible characters sneak into text from many sources. Copy-pasting from websites can bring zero-width spaces (U+200B), non-breaking spaces (U+00A0), and zero-width joiners. Word processors add smart quotes and soft hyphens. PDFs embed control characters. Some of these are intentional (like non-breaking spaces to keep words together) but most are unwanted and can break searches, comparisons, and data imports.

How do I remove emoji from text?

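In regex, you can match emoji using Unicode ranges or Unicode emoji properties. In JavaScript with the /u flag: text.replace(/\p{Extended_Pictographic}/gu, ''). In Python with the regex library: regex.sub(r'\p{Extended_Pictographic}', '', text). Avoid \p{Emoji} for this job: it also matches the digits 0-9, #, and *, so it would strip numbers out of your text. Be aware that many emoji are actually sequences of multiple code points (skin-tone modifiers, flags, families joined by zero-width joiners), so a regex that matches one code point at a time can leave fragments behind.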

What is the difference between Unicode NFC and NFD normalization?

NFC (Canonical Composition) combines base letters and accents into single precomposed characters where possible. 'e' + combining accent becomes 'é'. NFD (Canonical Decomposition) does the opposite: it splits characters into base letter plus combining marks. NFD is used when you want to strip accents because you can decompose first, then remove the combining marks. NFC is the standard form for storage and comparison.

How to Remove Special Characters from Text (Accents, Symbols, Emoji, Unicode)

“Special characters” is one of those terms that means completely different things depending on who’s saying it. A web developer means something different from a database admin, who means something different from a copywriter.

But the end result is usually the same: you’ve got text with characters in it that shouldn’t be there, and you need them gone. Maybe your database rejects them. Maybe your filename has an umlaut that breaks on Windows. Maybe someone pasted text from a webpage and it brought along 14 invisible Unicode gremlins.

Let’s sort this out.

What Are “Special Characters” Anyway?

The answer genuinely depends on context. So here’s a breakdown by what people usually mean:

For general text cleanup: Special characters = anything that isn’t a basic English letter (A-Z, a-z), a digit (0-9), or standard whitespace (space, newline). Everything else is “special.”

For database and file systems: Special characters = anything that causes problems in your specific system. SQL might choke on single quotes. Windows filenames can’t have colons or asterisks. URLs need special characters to be percent-encoded.

For internationalization: Special characters = accented letters (é, ñ, ü, ç), characters from non-Latin scripts (Chinese, Arabic, Cyrillic, Korean), and locale-specific punctuation (¿, «, »).

For data cleaning: Special characters = invisible characters like zero-width spaces, non-breaking spaces, control characters, byte order marks, and other Unicode artifacts that look like nothing but break everything.

The Invisible Character Nightmare

Let me tell you about the worst kind of special character. The ones you literally cannot see.

You paste text from a website into your spreadsheet. It looks perfectly normal. But when you try to VLOOKUP against it, nothing matches. When you search for a specific value, it’s “not found.” When you compare two strings that look identical, they come back as “different.”

Congratulations, you’ve encountered invisible Unicode characters. Here are the usual suspects:

Zero-width space (U+200B). Takes up zero pixels of width but is a real character. Websites use them to suggest word-break opportunities. Copy-paste brings them along.

Non-breaking space (U+00A0). Looks exactly like a regular space but has a different character code. Used to prevent word wrap between specific words. Your eyes see a space. Your computer sees a completely different character than the regular space (U+0020).

Zero-width joiner (U+200D) and zero-width non-joiner (U+200C). Used in some writing systems and emoji sequences. The family emoji is actually individual people connected by ZWJs.

Soft hyphen (U+00AD). An invisible hyphen that only appears when a word needs to break at the end of a line. Word processors insert these. They’re invisible but present in the string.

Byte order mark (U+FEFF). Sometimes appears at the very beginning of a file when it’s saved as UTF-8 with BOM. It’s invisible in most editors but can cause “phantom first character” issues.

Right-to-left mark (U+200F) and left-to-right mark (U+200E). Control the direction of text rendering. They’re invisible but present in text that mixes English with Arabic or Hebrew.

TextSorter’s Clean Text tool strips all of these. It’s actually one of the most underrated features of the tool, because most text cleaners only handle visible characters.
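If you would rather script it, here is a minimal sketch that covers the suspects listed above. The names describeCodePoints and stripInvisible are made up for illustration, and the character list is only the one from this article, not an exhaustive set. Note that it also removes zero-width joiners, which will split emoji sequences and can change rendering in scripts that rely on joiners.

```javascript
// List each character as a U+XXXX code point so hidden ones become visible.
function describeCodePoints(text) {
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

// Soft hyphen, zero-width space/non-joiner/joiner, LRM, RLM, and BOM.
const INVISIBLE = /[\u00AD\u200B-\u200F\uFEFF]/g;

function stripInvisible(text) {
  return text
    .replace(/\u00A0/g, ' ')   // non-breaking space -> regular space
    .replace(INVISIBLE, '');   // drop the rest entirely
}
```

Running describeCodePoints on a value that refuses to match is usually the fastest way to find out which of these characters you are dealing with before you strip anything.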

Removing Accents (Without Deleting the Letter)

This is the most common “special character” task and also the most commonly done wrong.

You want to turn “café” into “cafe.” Simple, right? But if you just do “delete everything that’s not ASCII,” you’ll turn “José García” into “Jos Garca” because the accented letters get completely removed instead of converted to their base equivalents.

The correct approach uses Unicode normalization. Here’s how it works:

The letter “é” can exist in two forms in Unicode:

Precomposed (NFC): A single code point U+00E9
Decomposed (NFD): Two code points: U+0065 (plain “e”) + U+0301 (combining acute accent)
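
The two forms are easy to inspect directly. A small sketch, where composed, decomposed, and sameText are illustrative names rather than anything standard:

```javascript
const composed = '\u00E9';     // é as one precomposed code point
const decomposed = 'e\u0301';  // plain e followed by a combining acute accent

// Compare two strings after bringing both to the same normalization form.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(composed === decomposed);                  // false
console.log(composed.normalize('NFD') === decomposed); // true
console.log(sameText(composed, decomposed));           // true
```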

These render identically on screen. But in decomposed form, the base letter and the accent are separate characters. So you can:

Convert to NFD (decompose the accented characters)
Remove all combining marks (Unicode category Mn)
What’s left is the base letters without accents

In practice:

JavaScript:

function removeAccents(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

removeAccents('café résumé naïve');
// "cafe resume naive"

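The normalize('NFD') call decomposes the characters. The regex [\u0300-\u036f] matches the Combining Diacritical Marks block, which covers the accents used in most European languages, and removes them. To match every nonspacing mark (category Mn) instead, use /\p{Mn}/gu.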

Python:

import unicodedata

def remove_accents(text):
    nfkd = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in nfkd if unicodedata.category(c) != 'Mn')

remove_accents('café résumé naïve')
# "cafe resume naive"

Using NFKD instead of NFD also handles compatibility decompositions, converting things like ligatures (ﬁ becomes “fi”) and fullwidth characters to their standard forms.

The result:

“café” becomes “cafe”
“München” becomes “Munchen”
“São Paulo” becomes “Sao Paulo”
“naïve” becomes “naive”
“résumé” becomes “resume”

The base letters survive. Only the accent marks get removed.
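
One caveat: a few letters have no decomposition at all, so normalization leaves them untouched. Examples include ø, ł, ß, æ, and đ. A possible workaround is a small lookup table applied after stripping the marks. This is only a sketch; the FALLBACKS table and the function name are assumptions for illustration, and the right replacements depend on the language (German often prefers “ue” for “ü”, for instance).

```javascript
// Hypothetical fallback table for letters that NFD does not decompose.
const FALLBACKS = {
  'ø': 'o', 'Ø': 'O', 'ł': 'l', 'Ł': 'L', 'đ': 'd', 'Đ': 'D',
  'ß': 'ss', 'æ': 'ae', 'Æ': 'AE',
};

function removeAccentsWithFallback(text) {
  return text
    .normalize('NFD')
    .replace(/\p{Mn}/gu, '')                          // strip all nonspacing marks
    .replace(/[øØłŁđĐßæÆ]/g, (ch) => FALLBACKS[ch]);  // map the leftovers
}

removeAccentsWithFallback('Łódź, Straße, Søren');
// "Lodz, Strasse, Soren"
```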

Removing Emoji

Emoji removal has gotten more complex over the years because emoji themselves have gotten more complex.

Early emoji were single code points. Easy to match. Modern emoji can be sequences of multiple code points connected by zero-width joiners. A flag emoji is two “regional indicator” code points. A skin-tone emoji is a base emoji plus a modifier. A family emoji is multiple person emoji joined by ZWJ characters.

JavaScript (modern):

// Remove emoji using Unicode property (requires /u flag)
const clean = text.replace(/\p{Extended_Pictographic}/gu, '');

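The Extended_Pictographic property matches individual pictographic code points, which covers the visible part of most emoji. It does not match the glue inside multi-code-point sequences: zero-width joiners, variation selectors (U+FE0F), skin-tone modifiers, and the regional indicators that make up flags are left behind and need to be removed separately.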

Python (with regex library):

import regex
clean = regex.sub(r'\p{Extended_Pictographic}', '', text)

Note: Python’s built-in re module doesn’t support Unicode properties. You need the regex package (pip install regex) for \p{...} patterns.
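
To remove whole sequences rather than single code points, one option is a pattern that follows the structure described earlier: a pictographic base, then any number of modifiers, variation selectors, or ZWJ-joined pictographs, plus pairs of regional indicators for flags. The sketch below is an approximation, not a full implementation of the Unicode emoji grammar. It does not handle keycap sequences (like 1️⃣) or tag-based subdivision flags, and because Extended_Pictographic includes some symbols such as © and ®, those get removed too. The names EMOJI_SEQUENCE and removeEmojiSequences are illustrative.

```javascript
// Flag (two regional indicators), or a pictograph followed by any mix of
// skin-tone modifiers, U+FE0F, and ZWJ-joined pictographs.
const EMOJI_SEQUENCE =
  /\p{Regional_Indicator}{2}|\p{Extended_Pictographic}(?:\p{Emoji_Modifier}|\uFE0F|\u200D\p{Extended_Pictographic})*/gu;

function removeEmojiSequences(text) {
  return text.replace(EMOJI_SEQUENCE, '');
}
```

Because the joiner is only consumed when it sits between two pictographs, zero-width joiners used in ordinary text are left alone.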

Context-Specific Cleaning

For URLs and Slugs

When creating URL slugs from titles, the standard process is:

Convert to NFD and strip accents
Convert to lowercase
Replace spaces with hyphens
Remove everything that’s not a letter, digit, or hyphen
Collapse multiple hyphens into one

function slugify(text) {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, '')
    .replace(/\s+/g, '-')
    .replace(/-+/g, '-')
    .replace(/^-|-$/g, '');
}

slugify('How to Make Crème Brûlée (Easy Recipe!)');
// "how-to-make-creme-brulee-easy-recipe"

For Filenames

Different operating systems have different rules:

Windows (NTFS) forbids: \ / : * ? " < > |
macOS/Linux forbids: / (and \0 null byte)
Cross-platform safe characters: letters, digits, hyphens, underscores, periods

A safe filename sanitizer:

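function sanitizeFilename(name) {
  return name
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')        // strip accents
    .replace(/\s+/g, '_')                   // whitespace runs become underscores
    .replace(/[\\/:*?"<>|\x00-\x1f]/g, '')  // drop forbidden and control characters
    .replace(/^\.+/, '')                    // no leading dots (hidden files)
    .substring(0, 255);                     // common per-name length limit
}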
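
Windows has two more rules that character stripping alone does not cover: names cannot end in a dot or a space, and device names such as CON, PRN, AUX, NUL, COM1 through COM9, and LPT1 through LPT9 are reserved, with or without an extension. A follow-up pass might look like the sketch below. The function name, the underscore prefix, and the 'untitled' fallback are arbitrary choices for illustration.

```javascript
// Reserved Windows device names, optionally followed by an extension.
const WINDOWS_RESERVED = /^(con|prn|aux|nul|com[1-9]|lpt[1-9])(\..*)?$/i;

function finalizeFilename(name, fallback = 'untitled') {
  let safe = name.replace(/[. ]+$/, '');              // no trailing dots or spaces
  if (safe === '') safe = fallback;                   // everything was stripped
  if (WINDOWS_RESERVED.test(safe)) safe = '_' + safe; // sidestep device names
  return safe;
}
```

Run it on the output of your sanitizer, so the empty-name check catches inputs that consisted entirely of forbidden characters.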

For Database Fields

The main concerns are:

SQL injection characters (single quotes, semicolons) but honestly you should be using parameterized queries, not sanitizing input strings
Characters outside your database’s character set (if it’s not UTF-8)
Invisible characters that cause silent deduplication failures

For database normalization, the safest approach is: store everything as NFC-normalized UTF-8, and handle display/search normalization at the application layer.
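
As a rough illustration of that application-layer step, the sketch below builds a comparison key so that values differing only in normalization form or invisible characters collapse into one. dedupKey and dedupe are hypothetical helpers, and whether you also want case folding or accent stripping in the key depends on your data.

```javascript
// Build a key for comparison only; store the original (NFC) value separately.
function dedupKey(text) {
  return text
    .normalize('NFC')
    .replace(/[\u00AD\u200B-\u200F\uFEFF]/g, '') // invisible characters
    .replace(/\s+/g, ' ')                        // \s in JavaScript includes U+00A0
    .trim();
}

// Keep the first value seen for each key.
function dedupe(values) {
  const seen = new Map();
  for (const value of values) {
    const key = dedupKey(value);
    if (!seen.has(key)) seen.set(key, value);
  }
  return [...seen.values()];
}
```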

The Nuclear Option: ASCII Only

Sometimes you just need everything to be plain ASCII. Legacy systems, old databases, specific file format requirements. For these cases:

function toAscii(text) {
  return text
    .normalize('NFKD')
    .replace(/[^\x00-\x7F]/g, '');
}

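This nukes everything outside the 7-bit ASCII range. It’s aggressive, but the output is guaranteed to be pure ASCII, which is what legacy systems expect. ASCII control characters still pass through, so strip those separately if your target system rejects them.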

The downside: non-Latin text gets completely destroyed. “東京” (Tokyo in Japanese) becomes an empty string. “Москва” (Moscow in Russian) becomes nothing. Only use this when you’re certain the text is primarily English or Latin-script.

The Smart Approach: Clean Without Destroying

For most real-world text cleanup, you don’t want to remove everything. You want to:

Replace accented letters with their ASCII equivalents
Strip invisible characters and control characters
Keep regular punctuation (or remove it separately if needed)
Preserve the actual content

TextSorter’s Clean Text tool takes this balanced approach. It handles the invisible characters, normalizes whitespace, fixes smart quotes, and gives you clean text without going nuclear on the content.

For more targeted operations, combine it with Find and Replace to strip specific character patterns using regex.
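
If you need the same balance in code, the four goals above can be approximated in a single pass. This is a sketch under my own assumptions, not a description of how TextSorter implements it, and cleanText is an illustrative name. It limits accent stripping to the U+0300 to U+036F block so that marks in Arabic, Hebrew, and Indic scripts survive, though it will still turn Cyrillic “й” into “и”.

```javascript
function cleanText(text) {
  return text
    .replace(/[\u2018\u2019]/g, "'")              // smart single quotes
    .replace(/[\u201C\u201D]/g, '"')              // smart double quotes
    .replace(/\u00A0/g, ' ')                      // non-breaking space
    .replace(/[\u00AD\u200B-\u200F\uFEFF]/g, '')  // invisible characters
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, '') // controls except tab, LF, CR
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')              // accents on Latin letters
    .normalize('NFC')                             // recompose anything left decomposed
    .replace(/[ \t]+/g, ' ')                      // collapse spaces and tabs
    .trim();
}
```

The final NFC step matters for scripts like Korean, where NFD splits syllables into separate jamo that should be put back together.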

Common Mistakes

Mistake 1: Deleting instead of converting. Removing “é” entirely instead of converting it to “e” loses information. Always decompose and strip marks rather than deleting the whole character.

Mistake 2: Assuming ASCII is enough. If you serve users who speak Spanish, French, German, Portuguese, or basically any non-English language, their names and content will have characters outside ASCII. “José” is not an edge case, it’s a very common name.

Mistake 3: Not handling zero-width characters. You clean all the visible junk but miss the invisible characters. Then your string comparisons still fail and you can’t figure out why. Use a tool that specifically targets invisible Unicode characters.

Mistake 4: Breaking emoji sequences. Removing individual code points from an emoji sequence can leave orphaned combining characters or modifiers that render as blank squares or question marks. Remove complete emoji or leave them alone.

Mistake 5: Forgetting about encoding. If your text file is saved as UTF-8 but your tool reads it as Latin-1, you’ll get mojibake (garbled characters like Ã© where é should be). The reverse mismatch usually shows up as replacement characters (�) or decoding errors. Make sure the encoding matches throughout your pipeline.

Mistake 6: Sanitizing for security instead of using parameterized queries. If you’re removing special characters to prevent SQL injection or XSS, you’re doing it wrong. Use parameterized database queries and HTML escaping/sanitization libraries. Character stripping is not a security measure.
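
Mistake 5 is easy to reproduce, which helps when you are trying to confirm that mojibake is what you are looking at. The sketch below assumes Node.js, since it relies on Buffer, and the function names are made up for illustration. The repair only works when no bytes were lost along the way, and real-world data is often Windows-1252 rather than strict Latin-1, which differs in a handful of characters.

```javascript
// Encode as UTF-8, then decode those bytes as if they were Latin-1.
function misreadUtf8AsLatin1(text) {
  return Buffer.from(text, 'utf8').toString('latin1');
}

// Reverse the mistake: recover the original bytes and decode them as UTF-8.
function repairMojibake(garbled) {
  return Buffer.from(garbled, 'latin1').toString('utf8');
}

misreadUtf8AsLatin1('café');   // "cafÃ©"
repairMojibake('cafÃ©');       // "café"
```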

Quick Reference: What to Remove and When

Scenario	What to Remove	What to Keep
URL slugs	Accents, symbols, spaces (replace with -)	Letters, digits, hyphens
Filenames	OS-forbidden chars, accents	Letters, digits, hyphens, underscores, dots
Database cleanup	Invisible chars, control chars	All visible characters
NLP preprocessing	Punctuation, symbols	Letters, digits, spaces
ASCII conversion	Everything non-ASCII	A-Z, a-z, 0-9, basic punctuation
Deduplication prep	Invisible chars, normalize accents	All visible text (normalized)
