Search for Foreign Language Characters in Text
Franck wrote to us, asking whether there is a good way to find just the Japanese characters in an InDesign story. How can you find just those characters? Or just Russian (Cyrillic), or just ornaments, or, for that matter, only Latin characters? The answer, of course, is the GREP tab of the Find/Change dialog box (CS3 and later).
Finding Only Latin Characters
To find just the Latin characters, ignoring any special characters, punctuation, numbers, and so on, you could type
[a-zA-Z]+
into the Find What field of the GREP tab of the Find/Change dialog box. The square brackets act as an “or” command: they match any one character from the set listed inside them, so this means “any character between a and z or between A and Z” (then the plus symbol means a string of one or more of them). Don’t put a vertical pipe inside the brackets; there it is just a literal character, so [a-z|A-Z] would also match a pipe in your text.
It will not find accented characters because those characters don’t fall between a and z in the Unicode lists. It’s all based on Unicode numbers, as we’ll see later on.
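If you want to check this behavior outside InDesign, here is a minimal sketch in Python. It assumes that Python's re module treats character classes the same way InDesign's GREP engine does in this respect, which I have not verified for every version; the sample string is made up for illustration.

```python
import re

# Assumption: Python's re module stands in for InDesign's GREP engine here.
latin = re.compile(r"[a-zA-Z]+")       # letters a-z and A-Z only
with_pipe = re.compile(r"[a-z|A-Z]+")  # same, plus a literal pipe character

# Hypothetical sample: "cafe" with an accented e (U+00E9), a pipe, then "bar".
sample = "caf\u00e9|bar"

latin_runs = latin.findall(sample)
pipe_runs = with_pipe.findall(sample)

print(latin_runs)  # ['caf', 'bar']  (the accented e breaks the run)
print(pipe_runs)   # ['caf', '|bar'] (the pipe is matched as a character)
```

The accented letter ends the first match in both cases, and the version with a pipe inside the brackets picks up the pipe as if it were a letter.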
If you want to find longer strings, including spaces, most punctuation, numbers, and so on, you might use this longer list, which offers even more characters inside the “or” square brackets:
[.,;:?!\d a-zA-Z]+
You can also turn this code around and say “any character that is not in this list” by adding a ^ (caret) symbol at the beginning:
[^.,;:?!\d a-zA-Z]+
which would find all the non-English characters, such as accented letters, ornaments, Cyrillic, and so on. (It also finds any punctuation you left out of the list, such as quotation marks and hyphens.)
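Here is a rough way to try the “not in this list” idea in Python. Treat it as a sketch: Python's re module is only a stand-in for InDesign's GREP, the sample text is invented, and the pattern is written without a pipe inside the brackets because Python would treat a pipe there as a literal character.

```python
import re

# Assumption: Python's re module stands in for InDesign's GREP engine.
# The caret right after the opening bracket negates the whole list.
non_english = re.compile(r"[^.,;:?!\d a-zA-Z]+")

# Hypothetical sample: English text, "naive" with a dieresis (U+00EF),
# the Russian word for "world" in Cyrillic, and a number.
sample = "Hello, world! na\u00efve \u043c\u0438\u0440 42"

found = non_english.findall(sample)
print(found)  # the accented letter and the Cyrillic word
```

Everything in the list (letters, digits, spaces, and the listed punctuation) is skipped, so only the accented letter and the Cyrillic word are returned.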
Searching for Unicode Ranges
As I mentioned earlier, Find/Change is all based on Unicode values. For example, capital A is 0041, capital Z is 005A, and so on. (Unicode values are usually written as four hexadecimal digits, which is a fancy way of saying that each digit can be 0-9 or A-F.) So if you know the range of the Unicode values, you can really dial the GREP in to exactly what you’re looking for.
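If you don't have a Unicode chart handy, a few lines of Python can show the value of any character. This is just an illustrative sketch; code_point is a helper name I made up, not anything built into InDesign or Python.

```python
def code_point(ch: str) -> str:
    """Return the Unicode value of one character as at least four hex digits."""
    return format(ord(ch), "04X")

a_value = code_point("A")
z_value = code_point("Z")
print(a_value, z_value)  # 0041 005A

# A range such as A-Z is simply "every value from 0041 through 005A".
q_in_range = ord("A") <= ord("Q") <= ord("Z")

# An accented e (U+00E9) sits well past z (U+007A), so a-z misses it.
accent_in_range = ord("a") <= ord("\u00e9") <= ord("z")
print(q_in_range, accent_in_range)  # True False
```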
For example, here’s some GREP to find all Japanese characters:
([\x{4E00}-\x{9FBF}]|[\x{3040}-\x{309F}]|[\x{30A0}-\x{30FF}])+
It seems complex at first, but after a moment you’ll notice that this is simply a list of three ranges, separated by vertical pipes. Technically, it means all the characters that fall into the kanji section of Unicode (4E00 to 9FBF) or all the hiragana (3040 to 309F) or all the katakana (30A0 to 30FF). (I found those on the Japanese writing system page at Wikipedia.)
You could use a similar method to find characters in Bengali (0980-09FF) or just math operators (2200-22FF) or whatever you’d like. I found these and many other Unicode ranges at unicode.org/charts.
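The same range idea can be tried in Python, with two caveats. Python's re module does not accept the \x{4E00} notation that InDesign uses, so the sketch below writes each value as \u4E00 instead, and the sample strings are invented for illustration. It is an approximation of the InDesign search, not the same engine.

```python
import re

# Assumption: Python's re module stands in for InDesign's GREP engine.
# Kanji (4E00-9FBF), hiragana (3040-309F), or katakana (30A0-30FF),
# kept as three alternatives joined by pipes, like the InDesign version.
japanese = re.compile(r"(?:[\u4E00-\u9FBF]|[\u3040-\u309F]|[\u30A0-\u30FF])+")

# Math operators block (2200-22FF), using the same approach.
math_ops = re.compile(r"[\u2200-\u22FF]+")

# Hypothetical sample: one hiragana, three kanji, and four katakana
# characters sitting between Latin words.
text = "InDesign \u306e\u65e5\u672c\u8a9e\u30c6\u30ad\u30b9\u30c8 (text)"
japanese_runs = japanese.findall(text)

# Hypothetical sample containing a less-than-or-equal sign (U+2264).
math_runs = math_ops.findall("x \u2264 y")

print(japanese_runs)  # one run of eight Japanese characters
print(math_runs)      # the single math operator
```

Because the three scripts are adjacent in the sample, they come back as one continuous run, which is what the plus sign at the end of the pattern asks for.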
Applying a Different Font to These Characters
Of course, once you have found the characters, you probably want to do something with them — such as change their font or apply a character style. You can also do that in the Find/Change dialog box:

Note that the Change To field is empty. This indicates that you want to leave the text alone — whatever InDesign finds — and only apply the formatting to it.
The nice thing (in CS4) is that you can use these Unicode ranges also in GREP styles, for instance when your default font doesn’t have the characters. Define a character style for the other font and make a GREP style. No need for local formatting then.
I saw this one coming.
Thanks, David!
Hello David,
No need to separate ranges in character classes with pipes: you can just use one range after the other. And the parentheses create (in this case) just an unnecessary reference. This works fine:
[\x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}]+
Finally, the last two ranges follow each other with no intervening character (9F + 1 = A0) so that they can be combined:
[\x{4E00}-\x{9FBF}\x{3040}-\x{30FF}]+
Regards,
Peter
Great points and good analysis, Peter. Thank you. I often don’t take the time to finesse the grep codes, once I find something that works.
BTW, I should point out that it was Peter K’s GREP ebook (listed here) in which I first learned the trick of specifying Unicode characters.
Hello,
About [a-zA-Z]: it works as described on both operating systems with InDesign CS4, but not on Mac OS X with InDesign CS3.
As I recently wrote on my blog, [a-z] matches accented characters in Latin-1, Latin-A, Latin-B or Latin add.
Best
I ran into a problem recently in CS4: whenever I typed ‘ (apostrophe) or “, an accented O (for example, O with a tilde) was inserted in the text.
Is this a problem of text variables or inadvertently i have activated some secret keyboard short-cut for foreign (accented) characters
@Mayoor: I’m not sure what the problem was that you are describing. InDesign should handle those characters just fine. Of course, there are some characters that do not exist inside fonts.
@Mayoor:
“inadvertently i have activated some secret keyboard short-cut for foreign (accented) characters..”
Yes — the one on your system, not a hidden feature of InDesign
If you are using Windows, you might have pressed and released Ctrl+Left Alt (from memory) — it toggles the international keyboard def. Go to your Control Panel, Regional & Language settings, tab Languages, button “Details”, tab Settings, and finally button Key Settings.
Oh, and then hit the button “Change Key Sequence”. Disable both options, if you aren’t using them consciously.
And that was not typed off the top of my head. I s’pose the equivalent on OS X would be a bit easier.
Great post David!
I just wanted to point out that Multilingual Tools as well as World Tools have what I call “Language Styles” which can apply styles based on writing scripts. Besides working in CS2, the extra trick that Language Styles does is that it applies the character styles to punctuation between that language as well (using a trick that I learned from Peter Kahrel quite some time ago).
It currently has Hebrew and Latin (Roman) styles. I expected to add many more (Arabic, Cyrillic, Indic, CJK, etc.), but I have yet to receive any requests to do so!