09-04-2021, 10:34 PM | #1 |
Groupie
Posts: 196
Karma: 1003498
Join Date: Jun 2010
Device: none
pdf regex question - regex that wraps to a new line
I'm trying to eliminate chapter titles that show up as headers in a PDF.

The PDF text looks something like this:

Blah blah blah blah Blah blah blah blah <br>
Blah blah blah blah Blah blah blah blah blah blah <br>
Blah blah blah blah Blah blah blah blah <br>
Blah blah blah blah Blah blah blah blah blah <br>
<hr/>
<a id="p55"></a>Some Chapter Title<br>
55<br>
blah blah blah blah <br>
Blah blah blah blah blah.<br>

Using the following regex, I'm able to select this text:

<a id="p55"></a>Some Chapter Title<br>

regex: <a id="p[0-9]*"></a>[A-Z][^<]*<br>

But what I really want to match is the same text as above AND the page number on the next line:

<a id="p55"></a>Some Chapter Title<br>
55<br>

The reason I want to do this is not just to get rid of the page numbers. Sometimes actual sentences of the book get captured by this regex too, but those sentences are not followed by page numbers; the page numbers only follow the chapter title headers in this particular sequence.

The problem is that the regex won't wrap to the next line, so if I try:

regex: <a id="p[0-9]*"></a>[A-Z][^<]*<br>[0-9]*

I get zero matches. Any ideas?
09-05-2021, 09:00 AM | #2 |
Figured it out.
regex: <a id="p[0-9]*"></a>[^<]*<br>[\r\n]*[0-9]*<br>

Will match:

<a id="p55"></a>Some Chapter Title<br>
55<br>
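
For anyone who wants to check the pattern outside the converter, here is a minimal sketch using Python's `re` module. The sample text is modelled on the snippet in post #1; the `\n` line endings, the second anchor (`p56`) and the variable names are assumptions made for illustration. The `strict` pattern is only a suggested tightening (`\s*` to tolerate trailing spaces, `+` to require at least one digit), not something from the original posts.

```python
import re

# Sample modelled on post #1; "\n" line endings and the p56 line are assumptions.
sample = (
    'Blah blah blah blah <br>\n'
    '<hr/>\n'
    '<a id="p55"></a>Some Chapter Title<br>\n'
    '55<br>\n'
    'blah blah blah blah <br>\n'
    '<a id="p56"></a>An actual sentence of the book<br>\n'
    'that continues here.<br>\n'
)

# Pattern from post #2: [\r\n]* consumes the line break between the
# title line and the page-number line, so the match can span both.
header = re.compile(r'<a id="p[0-9]*"></a>[^<]*<br>[\r\n]*[0-9]*<br>')

# Hypothetical stricter variant: \s* also allows spaces around the
# line break, and + requires an actual page number to be present.
strict = re.compile(r'<a id="p[0-9]+"></a>[^<]*<br>\s*[0-9]+<br>')

matches = header.findall(sample)   # header blocks found in the sample
cleaned = header.sub('', sample)   # sample with those header blocks removed

print(matches)
print(cleaned)
```

In this sample only the title followed by a page number is matched; the sentence after the `p56` anchor is left alone because no digits and `<br>` follow it on the next line.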