October 26 2009 • 5:04 AM

Search for Foreign Language Characters in Text

Franck wrote to us asking whether there is a good way to find just the Japanese characters in an InDesign story. How can you find just those characters? Or just Russian (Cyrillic), or just ornaments, or, for that matter, only Latin characters? The answer, of course, is the GREP tab of the Find/Change dialog box (CS3 and later).

Finding only Latin Characters

To find just the Latin characters, ignoring any special characters, punctuation, numbers, and so on, you could type

[a-zA-Z]+

into the Find What field of the GREP tab of the Find/Change dialog box. The square brackets act as an “or” command, so this means “any character between a and z or between A and Z” (and the plus symbol means a string of one or more of them). Note that you don’t need a vertical pipe between the two ranges: inside square brackets, a pipe is just another literal character to match.

It will not find accented characters, because those characters don’t fall between a and z in the Unicode tables. It’s all based on Unicode numbers, as we’ll see later on.
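A quick way to check this behavior outside InDesign is a few lines of Python. This is only a sketch: Python’s re module is not InDesign’s GREP engine, and the sample text is made up. It shows the character class stopping at accented letters, and a pipe inside square brackets being matched as a literal character.

```python
import re

sample = "Café naïve 123 | résumé"

# Only unaccented a-z and A-Z match; é and ï break each run.
latin_runs = re.findall(r"[a-zA-Z]+", sample)
print(latin_runs)  # ['Caf', 'na', 've', 'r', 'sum']

# Inside square brackets the pipe is an ordinary character,
# so this class also matches "|".
with_pipe = re.findall(r"[a-z|A-Z]+", "a|b")
print(with_pipe)  # ['a|b']
```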

If you want to find longer strings, including spaces, most punctuation, numbers, and so on, you might use this long list, which offers even more characters inside the “or” square brackets:

[.,;:?!\d a-zA-Z]+

You can also turn this code around and say “any character that is not in this list” by adding a ^ (caret) symbol at the beginning:

[^.,;:?!\d a-zA-Z]+

which would find everything else: accented characters, ornaments, Cyrillic, and so on.
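Here is a small sketch of the negated list, again in Python rather than InDesign (an assumption for illustration; the sample string is invented). The pattern is written without a pipe, since a pipe is not needed inside the brackets.

```python
import re

sample = "Hello, мир! 日本語 ok 42"

# The leading ^ inverts the class: match runs of anything NOT listed
# (punctuation, digits, spaces, and unaccented Latin letters are skipped).
others = re.findall(r"[^.,;:?!\d a-zA-Z]+", sample)
print(others)  # ['мир', '日本語']
```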

Searching for Unicode Ranges

As I mentioned earlier, Find/Change is all based on Unicode values. For example, capital A is 0041, capital Z is 005A, and so on. (Unicode values are usually written as four hexadecimal digits, which is a fancy way of saying that each digit can be 0-9 or A-F.) So if you know the range of the Unicode values, you can really dial the GREP in to exactly what you’re looking for.
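If you want to look up the value of a particular character yourself, a short Python sketch will do it. The helper name code_point is made up for this example; ord and format are standard Python built-ins.

```python
def code_point(ch):
    """Return the Unicode value of one character as at least four hex digits."""
    return format(ord(ch), "04X")

print(code_point("A"))   # 0041
print(code_point("Z"))   # 005A
print(code_point("あ"))  # 3042 (a hiragana character)
```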

For example, here’s some GREP to find all Japanese characters:

([\x{4E00}-\x{9FBF}]|[\x{3040}-\x{309F}]|[\x{30A0}-\x{30FF}])+

It seems complex at first, but after a moment you’ll notice that this is simply a list of three ranges, separated by vertical pipes. Technically, it means all the characters that fall into the kanji section of Unicode (4E00 to 9FBF) or all the hiragana (3040 to 309F) or all the katakana (30A0 to 30FF). (I found those on the Japanese writing system page at Wikipedia.)

You could use a similar method to find characters in Bengali (0980-09FF) or just math operators (2200-22FF) or whatever you’d like. I found these and many other Unicode ranges at unicode.org/charts.
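The same idea generalizes to any block in the charts. As a hedged sketch in Python (not InDesign), the hypothetical helper range_class below builds a character class from whatever start and end values you give it; the Bengali and math operator ranges are the ones quoted above.

```python
import re

def range_class(*ranges):
    """Build a pattern matching one or more characters from the given
    (start, end) code point ranges. Helper name is illustrative only."""
    # \UXXXXXXXX is Python's eight-digit escape, so any code point fits.
    parts = "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges)
    return re.compile(f"[{parts}]+")

bengali = range_class((0x0980, 0x09FF))
math_ops = range_class((0x2200, 0x22FF))

print(math_ops.findall("x ∈ A ∀ y ≤ 3"))  # ['∈', '∀', '≤']
print(bengali.findall("Bangla: বাংলা"))
```

Passing several ranges at once, such as range_class((0x4E00, 0x9FBF), (0x3040, 0x30FF)), gives one class covering all of them.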

Applying a Different Font to These Characters

Of course, once you have found the characters, you probably want to do something with them — such as change their font or apply a character style. You can also do that in the Find/Change dialog box:

Applying a character style to Japanese text with GREP

Note that the Change To field is empty. This indicates that you want to leave the text alone — whatever InDesign finds — and only apply the formatting to it.
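For plain text outside InDesign, a loose analogy to “leave the text alone and only apply formatting” is a replacement that re-inserts the match unchanged and adds a wrapper around it. The sketch below is Python, not InDesign; the jp tag and the function name tag_japanese are made up for illustration.

```python
import re

japanese = re.compile(r"[\u4E00-\u9FBF\u3040-\u309F\u30A0-\u30FF]+")

def tag_japanese(text):
    # \g<0> puts the matched text back unchanged; only the wrapper is new.
    return japanese.sub(r"<jp>\g<0></jp>", text)

tagged = tag_japanese("Tokyo (東京) and Osaka (おおさか)")
print(tagged)  # Tokyo (<jp>東京</jp>) and Osaka (<jp>おおさか</jp>)
```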

9 Responses discussing this post.

  1. Jan Krans
    October 26th, 2009 • 6:24 am

    The nice thing (in CS4) is that you can use these Unicode ranges also in GREP styles, for instance when your default font doesn’t have the characters. Define a character style for the other font and make a GREP style. No need for local formatting then.

  2. fr
    October 26th, 2009 • 6:51 am

    i saw this one coming :) thanx david !

  3. October 26th, 2009 • 8:15 am

    Hello David,

    No need to separate ranges in character classes with pipes: you can just use one range after the other. And the parentheses create (in this case) just an unnecessary reference. This works fine:

    [\x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}]+

    Finally, the last two ranges follow each other with no intervening character (9F + 1 = A0) so that they can be combined:

    [\x{4E00}-\x{9FBF}\x{3040}-\x{30FF}]+

    Regards,

    Peter

  4. David Blatner
    October 26th, 2009 • 8:27 am

    Great points and good analysis, Peter. Thank you. I often don’t take the time to finesse the grep codes, once I find something that works.

    BTW, I should point out that it was Peter K’s GREP ebook (listed here) in which I first learned the trick to speccing unicode characters.

  5. October 26th, 2009 • 10:59 am

    Hello,

    About [a-zA-Z]: it behaves as described on both operating systems with InDesign CS4, but not on Mac OS X with InDesign CS3.

    There, as I recently wrote on my blog, [a-z] also matches accented characters in Latin-1, Latin-A, Latin-B, or Latin Additional.

    Best

  6. Mayoor
    October 27th, 2009 • 8:37 am

    I recently ran into a problem in CS4: whenever I typed ‘ (apostrophe) or “, an accented O (for example, an O with a tilde) was inserted in the text instead.
    Is this a problem with text variables, or have I inadvertently activated some secret keyboard shortcut for foreign (accented) characters?

  7. David Blatner
    October 28th, 2009 • 6:39 am

    @Mayoor: I’m not sure what the problem was that you are describing. InDesign should handle those characters just fine. Of course, there are some characters that do not exist inside fonts.

  8. Jongware
    October 29th, 2009 • 1:59 pm

    @Mayoor:
    “inadvertently i have activated some secret keyboard short-cut for foreign (accented) characters..”

    Yes — the one on your system, not a hidden feature of InDesign :-)

    If you are using Windows, you might have pressed and released Ctrl+Left Alt (from memory) — it toggles the international keyboard def. Go to your Control Panel, Regional & Language settings, tab Languages, button “Details”, tab Settings, and finally button Key Settings.
    Oh, and then hit the button “Change Key Sequence”. Disable both options, if you aren’t using them consciously.

    And that was not typed off the top of my head. I s’pose the equivalent on OS X would be a bit easier.

  9. November 5th, 2009 • 3:48 am

    Great post David!

    I just wanted to point out that Multilingual Tools as well as World Tools have what I call “Language Styles” which can apply styles based on writing scripts. Besides working in CS2, the extra trick that Language Styles does is that it applies the character styles to punctuation between that language as well (using a trick that I learned from Peter Kahrel quite some time ago).

    It currently has Hebrew and Latin (Roman) styles. I expected to add many more (Arabic, Cyrillic, Indic, CJK, etc.), but I have yet to receive any requests to do so!
