06-21-2022, 07:20 PM   #16
Tex2002ans
Quote:
Originally Posted by jordy1955 View Post
I have some eBooks that were clearly produced by less than spectacular OCR software.

[...]

One of the main problems is line breaks in the wrong places (eg in the middle of a sentence), making the text very difficult to follow.

I've written about this many times over the years. Here are 2 of the topics:

Also, you may be interested in this thread:

where I broke down 5 different Regexes + color-coordinated them + explained them step-by-step.

Quote:
Originally Posted by jordy1955 View Post
Awesome stuff guys. Just ran it on a book and - once I got my head around it properly - I completed the editing and re-formatting in about 1hr - about 4 hours less than it usually takes me.
I'll get much quicker with practice but this is great.

Regular Expressions are amazing.

When you learn to search (and replace) via patterns, you can save SO MUCH TIME compared to the old way of doing searches one-by-one.

A few helpful ones I've used are:

Regex #1 (Full Month + Day)

Search: (January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}),

It looks for:
  • "January" OR "February" OR "March" OR [...] "December"
    • Tosses it into Group 1.
  • + a space
  • + 1 or 2 digits in a row
    • Tosses it into Group 2.
  • + a comma

which matches:
  • January 17,
  • February 20,
  • December 15,
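
To see the two groups in action outside the editor, here is a minimal sketch using Python's built-in re module. The pattern is the Search above; the sample sentence and the names MONTH_DAY, sample and matches are made up for illustration.

```python
import re

# Regex #1: full month name (Group 1), a space, 1-2 digits (Group 2), a comma.
MONTH_DAY = re.compile(
    r"(January|February|March|April|May|June|July|August"
    r"|September|October|November|December) (\d{1,2}),"
)

# Made-up sample text, not taken from any real book.
sample = "Letters dated January 17, 1801 and December 15, 1802."

# With two capture groups, findall returns one (Group 1, Group 2) tuple per match.
matches = MONTH_DAY.findall(sample)
print(matches)
```

Each tuple holds exactly what Group 1 and Group 2 captured, which is what \1 and \2 refer to in a Replace.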

* * * * *

Side Note #1: You could easily replace that with a:

Replace: \2 \1

to change it into a "flip the date from American -> British" regex:
  • March 16, 1999 -> 16 March 1999
  • October 1, 1776 -> 1 October 1776
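
A rough sketch of the same flip in Python's re module, in case you want to try it on a few strings first. The helper name flip_date is hypothetical; the Search and Replace are the ones from this post.

```python
import re

# Same Search as Regex #1.
MONTH_DAY = re.compile(
    r"(January|February|March|April|May|June|July|August"
    r"|September|October|November|December) (\d{1,2}),"
)


def flip_date(text):
    """Swap 'Month Day,' into 'Day Month' (hypothetical helper name)."""
    # \2 is the day, \1 is the month. The matched comma is dropped because
    # it is part of the match but not part of the replacement.
    return MONTH_DAY.sub(r"\2 \1", text)


print(flip_date("March 16, 1999"))
print(flip_date("October 1, 1776"))
```

One thing to watch: the comma disappears even when no year follows, so "On January 17, he wrote" becomes "On 17 January he wrote". Step through the matches one at a time if your book has dates like that.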

* * * * *

Regex #2 (Shortened Month + Comma) (Typo)

Search: (Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec),
Replace: \1.

It looks for:
  • "Jan" OR "Feb" OR [...] "Dec"
    • Captures it in Group 1.
  • + a comma

and Replaces with:
  • Whatever month got captured in Group 1.
  • + a period.

which changes:
  • Jan, 17 -> Jan. 17
  • Feb, 20 -> Feb. 20
  • Dec, 15 -> Dec. 15
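
Here is a small sketch of Regex #2 in Python's re module, handy for checking it against a few test strings. The helper name fix_month_comma is hypothetical; the Search and Replace are unchanged from above.

```python
import re

# Regex #2: an abbreviated month (Group 1) followed by a comma.
SHORT_MONTH_COMMA = re.compile(r"(Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec),")


def fix_month_comma(text):
    """Turn 'Jan, 17' into 'Jan. 17' (hypothetical helper name)."""
    # \1 puts the captured month back, then a period replaces the comma.
    return SHORT_MONTH_COMMA.sub(r"\1.", text)


print(fix_month_comma("Jan, 17"))
print(fix_month_comma("See letter of Dec, 15 and Feb, 20."))
```

Full month names are left alone: in "March, 1999" the comma does not come directly after "Mar", so nothing matches.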

Quite common in OCR, where a speck of dust can easily change a period into a comma, and it's even a common error found in tables/footnotes.

(One of the books I worked on was a multi-volume Thomas Jefferson book which cited dates of every written letter... SO many references had that typo in there!)

Last edited by Tex2002ans; 06-21-2022 at 07:34 PM.