Grep duplicate-select first instance


This topic contains 15 replies, has 4 voices, and was last updated by  Obi-wan Kenobi 1 week ago.

  • #97708

    Hi,

    Hoping for some help: I have copied dates into indd & have this (Jan-Dec but e.g.): January 22 29 January 12 24. I’ve used this string & it isolates/retains only the 2nd instance of months & their dates: (\u\l+ )(\d+ )*\1 [the big spaces here are all tabs, incl. in the Grep string] but how would I then isolate/retain just the first instance of dates?

  • #98073

    Graham Park
    Member

    I don’t have an elegant solution for you, but this will work: it removes the second number following the month, tab, number sequence. It will not remove punctuation following the second date.
    There is no real rule to find the months without finding text in other places, so they need to be explicitly defined.

    GREP Find and replace
    FIND
    ((January|February|March|April|May|June|July|August|September|October|November|December)\t\d+)(\t\d+)

    REPLACE
    $1

    How it works is to find two groups and then only add back the first item found.
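For anyone who wants to try the idea outside InDesign, here is a minimal Python sketch of the same find/replace. It assumes Python's built-in `re` module, where the replacement is written `\1` instead of InDesign's `$1`. The sample string is made up from the question, with real tabs between the items.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Group 1: month, tab, first number. Group 3: the tab and number that follow.
pattern = re.compile(r"((" + MONTHS + r")\t\d+)(\t\d+)")

# Hypothetical sample: two months, each followed by two tab-separated dates.
sample = "January\t22\t29\tJanuary\t12\t24"

# Put back only group 1, which drops the second number after each month.
result = pattern.sub(r"\1", sample)
print(repr(result))
```

Only one number is dropped per month, so a month followed by three or more dates would keep all but the second.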

  • #98081

    Thanks for your reply, Graham. Upon re-reading, I realise I didn’t make it very clear (you know, I know what I mean!!). It is 2 years of dates that, when copied from the PDF, actually come in as Jan\r\d, \d, \d+\rJan\r\d, \d, \d+\r, etc. for the 2 years (so Jan 2017 followed by Jan 2018, followed by Feb 2017, then Feb 2018, and so on). I had changed the returns & comma-space to \t, but originally it was as above: a return after each date & then each set of numbers. I needed to extract the set of 2017 months from the 2018. I got the 2018 by keeping only the duplicate set (using .*? for the numbers). But in the end, as time was a-wasting, I just changed it as necessary to become a table, cut that column & converted it back to text. It was the simplest solution, but I’ve only picked up Grep in the last 2 or so years, so it was the (as someone else has put it) brain acrobatics to see if it would work should it keep coming up again.

    In another similar situation, I applied numbering to each line, changed the numbers to text & used that to distinguish which lines I wanted where. It was a bit round-about, but I got there!

    But thank you for taking the time!

  • #98082

    Graham Park
    Member

    It’s a bit hard to tell from that. If you would like to post some sample text, I can have a shot at it.

  • #98083

    January
    3, 8,10,15, 22, 24, 26, 29
    January
    5,10,12,17, 24, 26, 29, 31
    February
    5, 12,16, 19, 21, 26
    February
    7, 14, 19, 21, 23, 28
    March
    2, 5, 12,19, 26
    March
    5, 7, 14, 21, 28
    April
    2, 9, 16, 23, 30
    April
    4, 11,18, 25
    May
    7, 14, 21, 28
    May
    2, 9, 16, 23, 30
    June
    4, 11, 18, 25
    June
    6, 13, 20, 27
    July
    2, 9, 16, 23, 30
    July
    4, 11, 18, 25
    August
    6, 13, 20, 27
    August
    1, 8, 15, 22, 29
    September
    3,10,17,24
    September
    5,12,19,26
    October
    1, 8, 15, 22, 24, 29
    October
    3, 10, 17, 24, 26, 31
    November
    5, 12, 19, 26
    November
    7, 14, 21, 28
    December
    3, 10, 17, 19
    December
    5, 12, 19, 21

    • #98157

      Hi,

      Based on this list (and just don’t forget to have a carriage return at the end of the list):

      Find:
      ((January|February|March|April|May|June|July|August|September|October|November|December)\r)((\d+(,\h?)?)+\r)\1((?3))

      Replace1:
      $1$3
      to get the first date, not the second

      Replace2:
      $1$6
      to get the second date, not the first

      (^/)

      Note “,\h?” takes into account typing errors in days (no space after some commas)!
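The two constructs in that Find do different jobs. `\1` is a backreference: it must match the same text that group 1 captured, that is, the same month name plus its return. `(?3)` is a subroutine call: it runs the sub-pattern of group 3 again, so it matches another line of dates even when the numbers differ. The sketch below is a rough Python translation under two assumptions: Python's built-in `re` module has no `(?3)` and no `\h`, so group 3's pattern is spelled out a second time and `[ \t]` stands in for `\h`. The sample text is a shortened version of the posted list.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# One line of dates: digits, each optionally followed by a comma and a space.
DATES = r"(?:\d+(?:,[ \t]?)?)+\r"

pattern = re.compile(
    r"((" + MONTHS + r")\r)"      # group 1: month + return (group 2: month)
    r"((\d+(,[ \t]?)?)+\r)"       # group 3: first line of dates
    r"\1"                         # the same month + return again
    r"(" + DATES + r")"           # group 6: second line of dates, in place of ((?3))
)

text = ("January\r3, 8,10,15\rJanuary\r5,10,12,17\r"
        "February\r5, 12,16\rFebruary\r7, 14, 19\r")

first_year = pattern.sub(r"\1\3", text)    # like Replace1: $1$3
second_year = pattern.sub(r"\1\6", text)   # like Replace2: $1$6
print(repr(first_year))
print(repr(second_year))
```

The second copy of the dates pattern uses non-capturing groups so that the group numbers stay the same as in the InDesign version.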

    • #98192

      Thanks, Obi-Wan. What does the ‘((?3))’ do? I tried using alternatives to the ‘/1’ – ‘/2’, ‘/3’, etc but they don’t work! Is it like that?
      (Ha, like the ‘^/’ – just realised what it was after seeing it on many posts!)

    • #98193

      (^/) ==> the cape, the hood and the light-saber: the signature of the Great Jedi-Masters, as on my avatar photo!

      MTFBWY!

      (^/)

  • #98084

    Graham Park
    Member

    Sorry, I can’t help with GREP here, as it works almost exclusively on one line, with a few exceptions.

    To achieve this I think you would need to write a script. That is not my area, so maybe someone else can help.

  • #98085

    No worries, thanks for looking, anyway :)

  • #98088

    David Blatner
    Keymaster

    Your list is kind of like a list… I wonder if Peter Kahrel’s Update Index script would help? http://www.kahrel.plus.com/indesign/lists_indexes.html

  • #98100

    Thanks, David – I tried the script (changing the page_span to 0) & it just seemed to delete the first line (January) but I’m not sure what I’m doing there, exactly, so I’ll have another good look but I did work out a solution, based on Graham’s suggested Find. I just need to run one Grep, then run a 2nd one on the original text again:

    Find what:
    (January|February|March|April|May|June|July|August|September|October|November|December)\r(.*?)\r(January|February|March|April|May|June|July|August|September|October|November|December)\r(.*?)$

    Change to:
    $1\t$2\r

    & then again with the Change to:
    $3\t$4\r

    Perfectly seps out 1 year at a time. (I’ve been unsure about the ‘or’ symbol – whether it needs square brackets around it or not but this shows me when I do & don’t, too.)

    Appreciate you both taking the time – it’s another ‘happily ever after’ in Grep land!
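A minimal Python sketch of this two-pass idea, run twice against the same original text. It rests on a few assumptions: paragraph returns are written as `\n` rather than `\r`, because Python's `.` would otherwise match a carriage return, and `re.MULTILINE` is set so that `$` can match at the end of each line. The sketch also leaves the trailing return out of the replacement, since `$` is zero-width and the return after each match stays in place.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Month, its dates, the repeated month, and that month's dates.
PAIR = re.compile(
    r"(" + MONTHS + r")\n(.*?)\n(" + MONTHS + r")\n(.*?)$",
    re.MULTILINE,
)

# Shortened, hypothetical version of the posted list.
text = "January\n3, 8\nJanuary\n5, 10\nFebruary\n5, 12\nFebruary\n7, 14"

first_year = PAIR.sub(r"\1\t\2", text)    # like Change to: $1\t$2
second_year = PAIR.sub(r"\3\t\4", text)   # like Change to: $3\t$4
print(first_year)
print(second_year)
```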

  • #98103

    Graham Park
    Member

    Love a good brain teaser.
    I finally worked this one out using your GREP as a starting point.
    Now two steps and it is done.

    FIND
    (January|February|March|April|May|June|July|August|September|October|November|December)\r(.*?)\r

    CHANGE TO
    $1\t$2\r

    Then

    FIND
    ((January|February|March|April|May|June|July|August|September|October|November|December)\t.+$)\r((January|February|March|April|May|June|July|August|September|October|November|December)\t.+$)

    CHANGE TO
    $1
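The two steps can be sketched in Python as follows. The assumptions are the same kind as before: `\n` stands in for the paragraph return, `re.MULTILINE` lets `$` match at each line end, and the sample is a shortened version of the posted list with a return after the last line. The last substitution, which keeps group 3 instead of group 1, is not from the post; it is only a guess at how the second year could be kept with the same Find.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

text = "January\n3, 8\nJanuary\n5, 10\nFebruary\n5, 12\nFebruary\n7, 14\n"

# Step 1: join each month with its line of dates, separated by a tab.
step1 = re.compile(r"(" + MONTHS + r")\n(.*?)\n")
joined = step1.sub(r"\1\t\2\n", text)

# Step 2: find two consecutive month lines and keep only one of them.
step2 = re.compile(
    r"((" + MONTHS + r")\t.+$)\n((" + MONTHS + r")\t.+$)",
    re.MULTILINE,
)
first_year = step2.sub(r"\1", joined)    # like CHANGE TO: $1
second_year = step2.sub(r"\3", joined)   # assumption: $3 would keep the second line
print(first_year)
print(second_year)
```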

  • #98106

    Graham Park
    Member

    You could shorten the months to make the query easier to read.

    FIND
    (Ja.+|Fe.+|Mar.+|Ap.+|May|June|July|Au.+|Se.+|Oc.+|No.+|De.+)\r(.*?)\r
    Replace
    $1\t$2\r

    Then

    FIND
    ((Ja.+|Fe.+|Mar.+|Ap.+|May|June|July|Au.+|Se.+|Oc.+|No.+|De.+)\t.+$)\r((Ja.+|Fe.+|Mar.+|Ap.+|May|June|July|Au.+|Se.+|Oc.+|No.+|De.+)\t.+$)
    REPLACE
    $1
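One thing to keep in mind with the abbreviated version: `Ja.+`, `No.+` and so on match any run of text starting with those letters, not only month names. A small Python sketch of the difference, assuming `\n` in place of the paragraph return and a made-up paragraph that starts with "Notes":

```python
import re

FULL = re.compile(
    r"(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\n(.*?)\n"
)
SHORT = re.compile(
    r"(Ja.+|Fe.+|Mar.+|Ap.+|May|June|July|Au.+|Se.+|Oc.+|No.+|De.+)\n(.*?)\n"
)

month_block = "January\n3, 8\n"
other_block = "Notes\nsee page 4\n"   # hypothetical non-month paragraph

# Both patterns find the real month paragraph.
print(bool(FULL.search(month_block)), bool(SHORT.search(month_block)))

# Only the abbreviated pattern also catches the "Notes" paragraph.
print(bool(FULL.search(other_block)), bool(SHORT.search(other_block)))
```

On a story that contains nothing but the date list this makes no difference, but on mixed text the full month names are the safer choice.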

  • #98107

    Before your last reply, I had shortened the 2nd query like this:
    ((January|February|March|April|May|June|July|August|September|October|November|December)\t.+?\r?){2}

    putting a ‘?’ after return to catch the last set of numbers at the ‘end of story’. But further shortening never hurts!

    I was initially able to extract the 2nd set of numbers (for 2018) using the ‘\1’ find duplicate query:

    Find:
    (\u\l+\r)(\d+.*\r)*\1

    Change to:
    $1

    Shorter, still :)

    I was just struggling to find a way to extract only the 1st set of year’s months (for 2017).

    But with all of these refinements, it’s finally worked so thanks so much for your help with that – the query with the ‘or’ months did the trick with just a few adjustments & abbreviations along the way & now I’m a ‘happy Grep-er’!
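The short `\1` duplicate query quoted in this post can be tried in Python as well. This is only a sketch under stated assumptions: Python's `re` has no `\u` or `\l`, so `[A-Z]` and `[a-z]` stand in for them, and `\n` replaces `\r` so that `.` stops at the end of each line. The sample is a shortened version of the posted list.

```python
import re

# Capitalised word + return, any number of date lines, then the same word + return.
duplicate = re.compile(r"([A-Z][a-z]+\n)(\d+.*\n)*\1")

text = "January\n3, 8\nJanuary\n5, 10\nFebruary\n5, 12\nFebruary\n7, 14\n"

# Keeping only group 1 removes the first month's dates and the repeated month
# name, so each month is left with its second set of dates.
second_year = duplicate.sub(r"\1", text)
print(second_year)
```

Because the match ends at the repeated month name, the second set of dates is never part of the match, which is why this query keeps the second set and cannot be turned around to keep the first.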

  • #98108

    Graham Park
    Member

    That will work but I think it will find every second paragraph.
    As such you will need to be more careful when you use it.
    The one I did specifies the first word of the line in the find, so it is a bit safer. Still, use them all with care.
