Grep Pattern Searching

In Episode 8, I mentioned how I used Grep Pattern Searching in BBEdit to search for patterns in my text, rather than searching for specific pieces of text. The value of this is that, even though the actual text varies throughout my text file, if the patterns are consistent, I can do very complex and powerful search-and-replace operations that keep that variable text intact, while changing elements of the pattern around it.

Here’s an example from real life: A monthly magazine column of new products. Each write-up starts with a company name on one line, followed by a paragraph starting with the words “What’s New” followed by a colon, then a sentence or two of descriptive text about the product. The next paragraph starts with “The Value” followed by a colon, then a few sentences describing the positive attributes of the product for prospective buyers. After that, there’s a line for the company’s web address and another line for their phone number. This is a consistent pattern. But the text itself is not consistent. Every company name is different. Everything following “What’s New” is different, as is everything following “The Value” and all of the URLs and phone numbers.

So how do you search this whole document? The thing to understand is that you don’t search for specific text. Rather, you search the pattern within which the text exists. What you’re searching for is any string of text (the company name), followed by a return, followed by the specific text “What’s New” and a colon, followed by any string of text (the product details), followed by a return, followed by the specific text “The Value” followed by a colon, followed by any string of text (the description of the product’s benefits), followed by a return, followed by any string of text (the web address) followed by a return, followed by any string of text (the phone number).

In BBEdit, that search instruction translates to this:

(.+)\rWHAT’S NEW:(.+)\rTHE VALUE:(.+)\r(.+)\r(.+)

The (.+) means any run of one or more characters. The period matches any single character except a return, and the plus sign extends that to one or more of them in a row. The parentheses around them make them a sub-pattern. In other words, the whole line above is the pattern, and the items in parentheses are sub-patterns within that pattern. In the above example, there are 5 sub-patterns, each representing variable text. The \r elements refer to returns in the original text.
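For readers who want to try the idea outside BBEdit, here is a minimal sketch in Python's re module. It is not the BBEdit workflow itself: the sample write-up is invented, it uses a straight apostrophe, and \n stands in for BBEdit's \r because that is how Python represents a line break in this string.

```python
import re

# Hypothetical sample write-up; the company and product are invented.
sample = (
    "Acme Widgets\n"
    "WHAT'S NEW: A self-sharpening pencil.\n"
    "THE VALUE: Saves time at the drafting table.\n"
    "www.example.com\n"
    "555-0100"
)

# Same shape as the BBEdit pattern, with \n in place of \r.
# Each (.+) stops at the end of its line because "." does not match a line break.
pattern = re.compile(r"(.+)\nWHAT'S NEW:(.+)\nTHE VALUE:(.+)\n(.+)\n(.+)")

match = pattern.search(sample)
groups = match.groups()

# Print each sub-pattern next to the number it would have in a replace instruction.
for number, text in enumerate(groups, start=1):
    print(f"\\{number} -> {text!r}")
```

Note that the second and third sub-patterns begin with the space that follows each colon, since the pattern captures everything after the colon up to the return.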

Now…let’s say I wanted to do a replace operation based on this pattern. I have style sheets in InDesign for each element: Company Name, What’s New paragraph, Value paragraph, URL and Phone. On top of that, I have a character from a dingbat font that I put before the URL and another that I put before the phone number to serve as little icons in the layout. All of this is handled expertly by InDesign’s nested style sheets, but I need to put the text characters (in this case, a lower case “u” for the URL icon and an ampersand for the phone icon) first for the nested style sheet in InDesign to work.

To replace this pattern so that all of my style sheet references are included in the right places and have my icon characters added, I use the following replace instruction:

\1\r WHAT’S NEW:\2\r THE VALUE:\3\ru \4\r & \5

The bracketed “pstyle” elements are tags that InDesign will use to format this text automatically when placed in a document with the corresponding styles, the names of which follow the colon in the bracketed tag. The combinations of backslashes and numbers (\1, \2, \3 and so on) refer to the sub-patterns in the original search pattern. They’re numbered by their order in the search instruction. The text of the What’s New paragraph is \2 and the text of the phone number is \5, because they’re the second and fifth sub-patterns in the search. By putting in these backslash-number combinations, you’re telling BBEdit to replace the original sub-pattern with itself. So every word of text that follows “What’s New:” until BBEdit finds a return will be replaced by…ITSELF. It remains exactly the same. Same with the other sub-patterns. The Company Name is replaced with itself, but now it has a tag around it. Similarly, the URL is replaced by itself, but now it is preceded by both the tag for its InDesign paragraph style and the lower case “u” that will appear in a dingbat font in InDesign, thanks to nested style sheets.

In my magazine, I have pages and pages of these product write-ups, so I make sure our editors put two returns between each one when they’re writing in Word, so that I can have BBEdit search for the pattern in every write-up. It’s as simple as searching for the same pattern shown above, but with two returns (\r\r) added at the end, like so:

(.+)\rWHAT’S NEW:(.+)\rTHE VALUE:(.+)\r(.+)\r(.+)\r\r

Likewise, my replace pattern would also include those two returns. It would look like this:

\1\r WHAT’S NEW:\2\r THE VALUE:\3\r u \4\r & \5\r\r
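The same search-and-replace can be sketched in Python to show how the numbered sub-patterns carry the variable text through untouched. This is an illustration under assumptions, not the exact BBEdit instruction: the two write-ups are invented, \n replaces \r, and the style names inside the pstyle tags (Company, WhatsNew, Value, URL, Phone) are placeholders rather than the real style sheet names.

```python
import re

# Two hypothetical write-ups, each followed by two returns.
text = (
    "Acme Widgets\n"
    "WHAT'S NEW: A self-sharpening pencil.\n"
    "THE VALUE: Saves time.\n"
    "www.example.com\n"
    "555-0100\n\n"
    "Bolt Brothers\n"
    "WHAT'S NEW: A quieter hinge.\n"
    "THE VALUE: Fewer squeaks.\n"
    "www.example.org\n"
    "555-0199\n\n"
)

# Search pattern: five sub-patterns, ending with the two returns.
search = r"(.+)\nWHAT'S NEW:(.+)\nTHE VALUE:(.+)\n(.+)\n(.+)\n\n"

# Replace pattern: placeholder tags plus the icon characters ("u" and "&"),
# with \1 through \5 putting each sub-pattern back exactly as it was found.
replace = (
    r"<pstyle:Company>\1\n"
    r"<pstyle:WhatsNew>WHAT'S NEW:\2\n"
    r"<pstyle:Value>THE VALUE:\3\n"
    r"<pstyle:URL>u \4\n"
    r"<pstyle:Phone>& \5\n\n"
)

# subn returns the new text and the number of write-ups it replaced.
tagged, count = re.subn(search, replace, text)
print(tagged)
```

Every write-up in the file is handled in one pass, which is the point of requiring the two returns between entries.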

The only thing left is to add one little bit of information to the very first line of this text file:

This is all InDesign needs in addition to the bracketed “pstyle” tags to completely format an unlimited number of these product write-ups the instant they’re placed (using the Place function, not copy-and-paste) in an InDesign file. All of this text will come in using your established style sheets, 100% formatted without you having to select any text or apply any style sheets in InDesign.

One more thing…if this sort of thing is something that you do over and over, like I do, for a magazine or other regular publication…you can save your search-and-replace patterns in BBEdit for future use.

Makes you want to go out and start finding patterns in all of your text, doesn’t it?
