TextSorter

How to Remove Special Characters from Text (Accents, Symbols, Emoji, Unicode)

“Special characters” is one of those terms that means completely different things depending on who’s saying it. A web developer means something different from a database admin, who means something different from a copywriter.

But the end result is usually the same: you’ve got text with characters in it that shouldn’t be there, and you need them gone. Maybe your database rejects them. Maybe your filename has an umlaut that breaks on Windows. Maybe someone pasted text from a webpage and it brought along 14 invisible Unicode gremlins.

Let’s sort this out.

What Are “Special Characters” Anyway?

The answer genuinely depends on context. So here’s a breakdown by what people usually mean:

For general text cleanup: Special characters = anything that isn’t a basic English letter (A-Z, a-z), a digit (0-9), or standard whitespace (space, newline). Everything else is “special.”

For database and file systems: Special characters = anything that causes problems in your specific system. SQL might choke on single quotes. Windows filenames can’t have colons or asterisks. URLs need special characters to be percent-encoded.

For internationalization: Special characters = accented letters (é, ñ, ü, ç), characters from non-Latin scripts (Chinese, Arabic, Cyrillic, Korean), and locale-specific punctuation (¿, «, »).

For data cleaning: Special characters = invisible characters like zero-width spaces, non-breaking spaces, control characters, byte order marks, and other Unicode artifacts that look like nothing but break everything.

The Invisible Character Nightmare

Let me tell you about the worst kind of special character. The ones you literally cannot see.

You paste text from a website into your spreadsheet. It looks perfectly normal. But when you try to VLOOKUP against it, nothing matches. When you search for a specific value, it’s “not found.” When you compare two strings that look identical, they come back as “different.”

Congratulations, you’ve encountered invisible Unicode characters. Here are the usual suspects:

Zero-width space (U+200B). Takes up zero pixels of width but is a real character. Websites use them to suggest word-break opportunities. Copy-paste brings them along.

Non-breaking space (U+00A0). Looks exactly like a regular space but has a different character code. Used to prevent word wrap between specific words. Your eyes see a space. Your computer sees a completely different character than the regular space (U+0020).

Zero-width joiner (U+200D) and zero-width non-joiner (U+200C). Used in some writing systems and emoji sequences. The family emoji is actually individual people connected by ZWJs.

Soft hyphen (U+00AD). An invisible hyphen that only appears when a word needs to break at the end of a line. Word processors insert these. They’re invisible but present in the string.

Byte order mark (U+FEFF). Sometimes appears at the very beginning of a file when it’s saved as UTF-8 with BOM. It’s invisible in most editors but can cause “phantom first character” issues.

Right-to-left mark (U+200F) and left-to-right mark (U+200E). Control the direction of text rendering. They’re invisible but present in text that mixes English with Arabic or Hebrew.

TextSorter’s Clean Text tool strips all of these. It’s actually one of the most underrated features of the tool, because most text cleaners only handle visible characters.
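If you would rather do this in code, the suspects listed above can be found and stripped with a small lookup table. This is a minimal Python sketch covering only those eight characters; the names `find_invisible` and `strip_invisible` are illustrative, and removing joiners can change how some scripts and emoji sequences render, so check your data first.

```python
import unicodedata

# Replacement for each invisible character discussed above.
# NBSP becomes a regular space so neighbouring words stay separated.
INVISIBLE = {
    "\u200b": "",   # zero-width space
    "\u00a0": " ",  # non-breaking space
    "\u200d": "",   # zero-width joiner
    "\u200c": "",   # zero-width non-joiner
    "\u00ad": "",   # soft hyphen
    "\ufeff": "",   # byte order mark
    "\u200f": "",   # right-to-left mark
    "\u200e": "",   # left-to-right mark
}
_TABLE = str.maketrans(INVISIBLE)

def find_invisible(text):
    """Return (index, code point, Unicode name) for each listed character."""
    return [
        (i, f"U+{ord(c):04X}", unicodedata.name(c, "UNKNOWN"))
        for i, c in enumerate(text)
        if c in INVISIBLE
    ]

def strip_invisible(text):
    """Delete the listed characters, mapping NBSP to a plain space."""
    return text.translate(_TABLE)

pasted = "\ufeffunit\u200bprice\u00a0total"
print(find_invisible(pasted))
print(strip_invisible(pasted))  # unitprice total
```

Running `find_invisible` first is useful when two strings look identical but refuse to match, because it tells you exactly which character is hiding and where.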

Removing Accents (Without Deleting the Letter)

This is the most common “special character” task and also the most commonly done wrong.

You want to turn “café” into “cafe.” Simple, right? But if you just do “delete everything that’s not ASCII,” you’ll turn “José García” into “Jos Garca” because the accented letters get completely removed instead of converted to their base equivalents.

The correct approach uses Unicode normalization. Here’s how it works:

The letter “é” can exist in two forms in Unicode:

  • Precomposed (NFC): A single code point U+00E9
  • Decomposed (NFD): Two code points: U+0065 (plain “e”) + U+0301 (combining acute accent)

These render identically on screen. But in decomposed form, the base letter and the accent are separate characters. So you can:

  1. Convert to NFD (decompose the accented characters)
  2. Remove all combining marks (Unicode category Mn)
  3. What’s left is the base letters without accents
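The two forms are easy to inspect for yourself. A small Python sketch (the helper name `code_points` is just for illustration):

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point (NFC form)
decomposed = "e\u0301"   # plain e followed by a combining acute accent (NFD form)

def code_points(text):
    """List the code points in a string as U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in text]

# They look the same on screen but are different strings.
print(precomposed == decomposed)                                 # False
print(code_points(unicodedata.normalize("NFD", precomposed)))    # ['U+0065', 'U+0301']
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True

# The accent is a combining mark, category Mn, which is what step 2 removes.
print(unicodedata.category("\u0301"))                            # Mn
```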

In practice:

JavaScript:

function removeAccents(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

removeAccents('café résumé naïve');
// "cafe resume naive"

The normalize('NFD') decomposes the characters. The regex [\u0300-\u036f] matches the combining diacritical marks range and removes them.

Python:

import unicodedata

def remove_accents(text):
    nfkd = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in nfkd if unicodedata.category(c) != 'Mn')

remove_accents('café résumé naïve')
# "cafe resume naive"

Using NFKD instead of NFD also applies compatibility decompositions, converting things like ligatures (“ﬁ” becomes “fi”) and fullwidth characters to their standard forms.

The result:

  • “café” becomes “cafe”
  • “München” becomes “Munchen”
  • “São Paulo” becomes “Sao Paulo”
  • “naïve” becomes “naive”
  • “résumé” becomes “resume”

The base letters survive. Only the accent marks get removed.
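One thing worth checking in your own data: decomposition only helps when the accent is a separate combining mark. Letters such as “ø”, “ł” and “ß” have no Unicode decomposition, so they pass through untouched. A minimal Python sketch of one way to handle them, assuming a small hand-written fallback table (the `FALLBACKS` mapping is hypothetical and far from complete):

```python
import unicodedata

# Hypothetical fallback table for letters that do not decompose; extend as needed.
FALLBACKS = str.maketrans({
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
    "æ": "ae", "Æ": "AE",
    "ß": "ss",
})

def remove_accents(text):
    """Decompose, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def remove_accents_with_fallbacks(text):
    """Same as remove_accents, then map the letters normalization cannot split."""
    return remove_accents(text).translate(FALLBACKS)

print(remove_accents("Søren Łukasz Straße"))                 # Søren Łukasz Straße
print(remove_accents_with_fallbacks("Søren Łukasz Straße"))  # Soren Lukasz Strasse
```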

Removing Emoji

Emoji removal has gotten more complex over the years because emoji themselves have gotten more complex.

Early emoji were single code points. Easy to match. Modern emoji can be sequences of multiple code points connected by zero-width joiners. A flag emoji is two “regional indicator” code points. A skin-tone emoji is a base emoji plus a modifier. A family emoji is multiple person emoji joined by ZWJ characters.

JavaScript (modern):

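// Remove emoji using Unicode property escapes (requires the u flag).
// Extended_Pictographic matches the pictographs themselves; the other entries
// match the pieces that emoji sequences are built from.
const clean = text.replace(
  /[\p{Extended_Pictographic}\p{Regional_Indicator}\p{Emoji_Modifier}\u200D\uFE0F\u20E3]/gu,
  ''
);

The Extended_Pictographic property on its own matches individual pictographic code points. It does not match flags (regional indicators), skin-tone modifiers, zero-width joiners, or variation selectors, so the character class above adds them explicitly. Without them, a stripped family or skin-tone emoji leaves invisible leftovers in the string. Keep in mind that this also removes zero-width joiners used by non-emoji text.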

Python (with regex library):

import regex
clean = regex.sub(r'\p{Extended_Pictographic}', '', text)

Note: Python’s built-in re module doesn’t support Unicode properties. You need the regex package (pip install regex) for \p{...} patterns.
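If adding a dependency is not an option, whole emoji sequences can be matched with the standard `re` module and explicit code point ranges. The sketch below is an approximation: the ranges were picked for illustration, they are not the full Unicode emoji list, and they also catch some non-emoji symbols such as dingbats.

```python
import re

# Approximate ranges, chosen for illustration only.
PICTO = "\U0001F300-\U0001FAFF\u2600-\u27BF"   # main emoji blocks, misc symbols, dingbats
FLAG_LETTER = "\U0001F1E6-\U0001F1FF"           # regional indicators (flags are pairs)
SUFFIX = "\uFE0F\U0001F3FB-\U0001F3FF"          # variation selector-16, skin-tone modifiers
ZWJ = "\u200D"                                  # joins emoji into one sequence

# Match a flag pair, or a pictograph with its suffixes and any ZWJ-joined followers.
EMOJI_SEQUENCE = re.compile(
    f"[{FLAG_LETTER}]{{2}}"
    f"|[{PICTO}][{SUFFIX}]*(?:{ZWJ}[{PICTO}][{SUFFIX}]*)*"
)

def remove_emoji(text):
    """Remove complete emoji sequences so no joiners or modifiers are orphaned."""
    return EMOJI_SEQUENCE.sub("", text)

# Waving hand with skin tone, a ZWJ family, and a flag.
sample = ("Hi \U0001F44B\U0001F3FD team "
          "\U0001F468\u200D\U0001F469\u200D\U0001F467 from \U0001F1EB\U0001F1F7!")
print(repr(remove_emoji(sample)))  # 'Hi  team  from !'
```

Matching the whole sequence in one go is the point: a pattern that removes one code point at a time can leave joiners and modifiers behind.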

Context-Specific Cleaning

For URLs and Slugs

When creating URL slugs from titles, the standard process is:

  1. Convert to NFD and strip accents
  2. Convert to lowercase
  3. Replace spaces with hyphens
  4. Remove everything that’s not a letter, digit, or hyphen
  5. Collapse multiple hyphens into one
  6. Trim any leading or trailing hyphen

function slugify(text) {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, '')
    .replace(/\s+/g, '-')
    .replace(/-+/g, '-')
    .replace(/^-|-$/g, '');
}

slugify('How to Make Crème Brûlée (Easy Recipe!)');
// "how-to-make-creme-brulee-easy-recipe"
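The same steps translate fairly directly to Python. A minimal sketch mirroring the JavaScript above, not a replacement for a dedicated slug library:

```python
import re
import unicodedata

def slugify(text):
    """Build a lowercase ASCII slug from a title."""
    # Decompose and drop combining marks so accented letters keep their base.
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = text.lower()
    # Drop everything except a-z, digits, whitespace and hyphens.
    text = re.sub(r"[^a-z0-9\s-]", "", text)
    # Turn runs of whitespace and hyphens into a single hyphen.
    text = re.sub(r"[\s-]+", "-", text)
    return text.strip("-")

print(slugify("How to Make Crème Brûlée (Easy Recipe!)"))
# how-to-make-creme-brulee-easy-recipe
```

Note that a title written entirely in a non-Latin script comes back as an empty string, so callers usually need a fallback such as an ID.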

For Filenames

Different operating systems have different rules:

  • Windows (NTFS) forbids: \ / : * ? " < > |
  • macOS/Linux forbids: / (and the \0 null byte)
  • Cross-platform safe characters: letters, digits, hyphens, underscores, periods

A safe filename sanitizer:

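function sanitizeFilename(name) {
  return name
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')   // strip accents
    .replace(/[\\/:*?"<>|]/g, '')      // remove characters Windows forbids
    .replace(/\s+/g, '_')              // whitespace runs become underscores
    .replace(/^\.+/, '')               // no leading dots (hidden files)
    .substring(0, 255);                // 255 UTF-16 code units, not bytes
}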
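Windows has a few more rules that a character filter alone misses: it reserves device names such as CON and NUL, and it does not allow names ending in a period. A Python sketch that extends the idea above to cover those cases; the `fallback` parameter and the underscore prefix for reserved names are arbitrary choices made for this example.

```python
import re
import unicodedata

# Device names that Windows reserves, with or without an extension.
WINDOWS_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}

def sanitize_filename(name, max_length=255, fallback="untitled"):
    """Return a filename that should be acceptable on Windows, macOS and Linux."""
    # Strip accents by decomposing and dropping combining marks.
    name = unicodedata.normalize("NFD", name)
    name = "".join(c for c in name if unicodedata.category(c) != "Mn")
    # Whitespace runs become underscores.
    name = re.sub(r"\s+", "_", name.strip())
    # Remove characters Windows forbids, plus ASCII control characters.
    name = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "", name)
    # No leading dots (hidden files) and no trailing dots (invalid on Windows).
    name = name.lstrip(".").rstrip(".")
    # Prefix reserved device names so they stop being special.
    if name.split(".")[0].upper() in WINDOWS_RESERVED:
        name = "_" + name
    # max_length counts characters; many filesystems limit bytes instead.
    return name[:max_length] or fallback

print(sanitize_filename("Résumé: final?.pdf"))  # Resume_final.pdf
print(sanitize_filename("con.txt"))             # _con.txt
```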

For Database Fields

The main concerns are:

  • SQL injection characters (single quotes, semicolons), though honestly you should be using parameterized queries rather than sanitizing input strings
  • Characters outside your database’s character set (if it’s not UTF-8)
  • Invisible characters that cause silent deduplication failures

For database normalization, the safest approach is: store everything as NFC-normalized UTF-8, and handle display/search normalization at the application layer.
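As a rough illustration, here is a Python sketch using the standard `sqlite3` module. The `customers` table and the `normalize_for_storage` helper are hypothetical; the point is that values are NFC-normalized before storage and passed through placeholders rather than spliced into the SQL.

```python
import sqlite3
import unicodedata

def normalize_for_storage(text):
    """Store one canonical form so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT PRIMARY KEY)")

# Precomposed é, decomposed é, and a name containing a single quote.
incoming = ["Jos\u00e9", "Jose\u0301", "O'Brien"]
for raw in incoming:
    # The ? placeholder sends the value separately from the SQL text,
    # so the quote in O'Brien needs no stripping or escaping.
    conn.execute(
        "INSERT OR IGNORE INTO customers (name) VALUES (?)",
        (normalize_for_storage(raw),),
    )

names = [row[0] for row in conn.execute("SELECT name FROM customers ORDER BY name")]
print(names)  # ['José', "O'Brien"]
```

Without the normalization step, both spellings of “José” would be stored as separate rows even though they look identical.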

The Nuclear Option: ASCII Only

Sometimes you just need everything to be plain ASCII. Legacy systems, old databases, specific file format requirements. For these cases:

function toAscii(text) {
  return text
    .normalize('NFKD')
    .replace(/[^\x00-\x7F]/g, '');
}

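This nukes everything outside the 7-bit ASCII range. It’s aggressive, but the output is guaranteed to be pure ASCII, which is what legacy systems expect. Note that the range \x00-\x7F still includes control characters, so strip those separately if your target system rejects them.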

The downside: non-Latin text gets completely destroyed. “東京” (Tokyo in Japanese) becomes an empty string. “Москва” (Moscow in Russian) becomes nothing. Only use this when you’re certain the text is primarily English or Latin-script.
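A Python version of the same idea, with a small check for how much text survived. The `survival_ratio` helper is a hypothetical guard added for this sketch, and where to set the threshold is up to you.

```python
import unicodedata

def to_ascii(text):
    """Decompose, then drop every character outside 7-bit ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def survival_ratio(text):
    """Rough share of the original characters that survive to_ascii (0.0 to 1.0)."""
    original = unicodedata.normalize("NFC", text)
    if not original:
        return 1.0
    return min(1.0, len(to_ascii(text)) / len(original))

print(to_ascii("José García"))     # Jose Garcia
print(repr(to_ascii("Москва")))    # ''
print(survival_ratio("Москва"))    # 0.0
```

Checking the ratio before overwriting a value gives you a chance to skip or flag text that would otherwise be wiped out.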

The Smart Approach: Clean Without Destroying

For most real-world text cleanup, you don’t want to remove everything. You want to:

  1. Replace accented letters with their ASCII equivalents
  2. Strip invisible characters and control characters
  3. Keep regular punctuation (or remove it separately if needed)
  4. Preserve the actual content

TextSorter’s Clean Text tool takes this balanced approach. It handles the invisible characters, normalizes whitespace, fixes smart quotes, and gives you clean text without going nuclear on the content.

For more targeted operations, combine it with Find and Replace to strip specific character patterns using regex.
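If you want the same kind of balanced cleanup in your own code, the four steps can be chained in a few lines. This is not TextSorter’s implementation, just one possible Python sketch; the function name `clean_text` and the `strip_accents` switch are assumptions made for the example.

```python
import re
import unicodedata

# Curly quotes mapped to their plain ASCII equivalents.
SMART_PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

def clean_text(text, strip_accents=True):
    """Tidy text without discarding visible content."""
    text = text.translate(SMART_PUNCTUATION)
    text = text.replace("\u00a0", " ")  # non-breaking space becomes a space
    if strip_accents:
        # Decompose and drop combining marks; base letters remain.
        text = unicodedata.normalize("NFD", text)
        text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # Drop format (Cf) and control (Cc) characters, keeping newlines and tabs.
    # Cf includes zero-width joiners, so emoji sequences and some scripts are affected.
    text = "".join(
        c for c in text
        if c in "\n\t" or unicodedata.category(c) not in ("Cf", "Cc")
    )
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
    return unicodedata.normalize("NFC", text).strip()

print(clean_text("\ufeff\u201cCaf\u00e9\u201d\u00a0 na\u00efve\u200b text"))
# "Cafe" naive text
```

Punctuation and non-Latin text pass through unchanged, which is the main difference from the ASCII-only approach.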

Common Mistakes

Mistake 1: Deleting instead of converting. Removing “é” entirely instead of converting it to “e” loses information. Always decompose and strip marks rather than deleting the whole character.

Mistake 2: Assuming ASCII is enough. If you serve users who speak Spanish, French, German, Portuguese, or basically any non-English language, their names and content will have characters outside ASCII. “José” is not an edge case, it’s a very common name.

Mistake 3: Not handling zero-width characters. You clean all the visible junk but miss the invisible characters. Then your string comparisons still fail and you can’t figure out why. Use a tool that specifically targets invisible Unicode characters.

Mistake 4: Breaking emoji sequences. Removing individual code points from an emoji sequence can leave orphaned combining characters or modifiers that render as blank squares or question marks. Remove complete emoji or leave them alone.

Mistake 5: Forgetting about encoding. If your text file is saved in Latin-1 but your tool reads it as UTF-8 (or vice versa), you’ll get mojibake (garbled characters like “Ã©” where “é” should be). Make sure the encoding matches throughout your pipeline.

Mistake 6: Sanitizing for security instead of using parameterized queries. If you’re removing special characters to prevent SQL injection or XSS, you’re doing it wrong. Use parameterized database queries and HTML escaping/sanitization libraries. Character stripping is not a security measure.
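Of these, Mistake 5 is the easiest to reproduce. A short Python sketch of how mojibake arises when UTF-8 bytes are read as Latin-1, and how that particular mix-up can be reversed as long as nothing else has altered the string:

```python
original = "café"

# UTF-8 stores é as two bytes; reading them as Latin-1 yields two characters.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(garbled)    # cafÃ©

# Reversing the wrong step recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)   # café

# The opposite mix-up (Latin-1 bytes read as UTF-8) is not valid UTF-8,
# so a lenient decoder substitutes the replacement character U+FFFD.
lossy = original.encode("latin-1").decode("utf-8", errors="replace")
print(lossy == "caf\ufffd")   # True
```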

Quick Reference: What to Remove and When

| Scenario | What to Remove | What to Keep |
| --- | --- | --- |
| URL slugs | Accents, symbols, spaces (replace with -) | Letters, digits, hyphens |
| Filenames | OS-forbidden chars, accents | Letters, digits, hyphens, underscores, dots |
| Database cleanup | Invisible chars, control chars | All visible characters |
| NLP preprocessing | Punctuation, symbols | Letters, digits, spaces |
| ASCII conversion | Everything non-ASCII | A-Z, a-z, 0-9, basic punctuation |
| Deduplication prep | Invisible chars, normalize accents | All visible text (normalized) |
