pdf regex question - regex that wraps to a new line

flyash · 09-04-2021, 10:34 PM

I'm trying to eliminate chapter titles that show up as headers in a pdf.

The pdf text looks something like this:

Blah blah blah blah Blah blah blah blah <br>
Blah blah blah blah Blah blah blah blah blah blah <br>
Blah blah blah blah Blah blah blah blah <br>
Blah blah blah blah Blah blah blah blah blah <br>
<hr/>
<a id="p55"></a>Some Chapter Title<br>
55<br>
blah blah blah blah <br>
rBlah blah blah blah blah.<br>

Using the following regex, I'm able to select this text:
<a id="p55"></a>Some Chapter Title<br>

regex: <a id="p[0-9]*"></a>[A-Z][^<]*<br>

But what I really want to match is the same text as above AND the page number on the next row:
<a id="p55"></a>Some Chapter Title<br>
55<br>

I want this not just to get rid of the page numbers. The regex above sometimes also captures actual sentences of the book, but those sentences are never followed by a page number. Page numbers only follow the chapter title headers, in this particular sequence.

Problem is the regex won't wrap to the next line, so if I try:
regex: <a id="p[0-9]*"></a>[A-Z][^<]*<br>[0-9]*

I get zero matches.

Any ideas?

flyash · 09-05-2021, 09:00 AM

Figured it out. The line break after the first <br> has to be matched explicitly with [\r\n]*.

regex: <a id="p[0-9]*"></a>[^<]*<br>[\r\n]*[0-9]*<br>

Will match:
<a id="p55"></a>Some Chapter Title<br>
55<br>
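
For anyone who wants to try this outside the conversion tool, here is a minimal sketch in Python's `re` module. It assumes each line of the page source ends in a literal `<br>` followed by a newline, as in the sample above. The sample text, the `HEADER` name and the `strip_headers` helper are made up for illustration, and other regex engines may treat line endings differently.

```python
import re

# Hypothetical sample modelled on the post: one real chapter header
# (title + page number) and one ordinary sentence that also happens
# to follow a page anchor.
SAMPLE = (
    'Blah blah blah blah Blah blah blah blah <br>\n'
    '<hr/>\n'
    '<a id="p55"></a>Some Chapter Title<br>\n'
    '55<br>\n'
    'blah blah blah blah <br>\n'
    '<hr/>\n'
    '<a id="p56"></a>An actual sentence of the book<br>\n'
    'that continues on the next line.<br>\n'
)

# The pattern from the post: [\r\n]* consumes the line break between
# the title's <br> and the page number on the following line.
HEADER = re.compile(r'<a id="p[0-9]*"></a>[^<]*<br>[\r\n]*[0-9]*<br>')


def strip_headers(text: str) -> str:
    """Remove every chapter-title header together with its page number."""
    return HEADER.sub("", text)


matches = HEADER.findall(SAMPLE)
cleaned = strip_headers(SAMPLE)

print(matches)
print(cleaned)
```

In this sketch only the p55 header is matched. The p56 line is left alone because the text after its `<br>` is neither digits nor another `<br>`. Note that `[0-9]*` also accepts zero digits, so a title followed by an empty `<br>` line would still match; `[0-9]+` would be stricter if that ever turns up.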

