Old 05-18-2024, 09:38 AM   #16
Quoth
the rook, bossing Never.
Posts: 12,388
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
Quote:
Originally Posted by mikapanja View Post
That works in case you know what you are looking for. But what if you don't, which was my original thought? E.g. if you want to know if any 5-word group is repeated in the text.
Tricky.
Proof read with your Deja Vu set to 11?

The concordance tool sounds good for detecting repetitive writing, which really annoys people if it slips past proofreading.
Old 05-18-2024, 09:48 AM   #17
Quoth
the rook, bossing Never.
Quote:
Originally Posted by Doitsu View Post
For those kinds of searches you'll need to use a concordance tool. For example, Laurence Anthony's AntConc (freeware).
  • Unzip the epub file.
  • If the folder contains .xhtml files change their file extensions to .html.
  • Open AntConc, select Open file(s) as 'Quick Corpus', select .html as the file type and then select the extracted .html files.
  • Click the N-Gram tab, select the desired number of words and click Start.
I've attached a sample screenshot of the output.
Obviously, copy and paste all the chapters/files into one mega file*, as the repeat in a novel (likely unwanted if you are the author or official editor) is probably in a separate file.

Indeed any search tool is useless unless you know the exact repeating text.


[* Unless it has a project / session mode that remembers previous files?]
Old 05-18-2024, 10:15 AM   #18
JSWolf
Resident Curmudgeon
Posts: 76,535
Karma: 136565488
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by Quoth View Post
Obviously, copy and paste all the chapters/files into one mega file*, as the repeat in a novel (likely unwanted if you are the author or official editor) is probably in a separate file.

Indeed any search tool is useless unless you know the exact repeating text.


[* Unless it has a project / session mode that remembers previous files?]
It would be best to convert the ePub to text and then run it through AntConc.
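For anyone who would rather script the "convert to text" step, here is a minimal sketch using only the Python standard library. It is not how Calibre or AntConc do it; the names `epub_to_text`, `TextExtractor`, `SKIP_TAGS` and `BLOCK_TAGS` are made up for this illustration, and it assumes the ePub is not DRM-protected and that sorting the file names roughly matches the reading order (the real order lives in the OPF spine, which this sketch ignores).

```python
import zipfile
from html.parser import HTMLParser

# Tags whose content is not book text (assumption: <head> holds only metadata).
SKIP_TAGS = {"head", "script", "style"}
# Tags after which a line break is added so paragraphs do not run together.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect the visible text of one (X)HTML file."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def epub_to_text(epub_path):
    """Return the text of every (X)HTML file in the ePub as one string."""
    chunks = []
    with zipfile.ZipFile(epub_path) as zf:
        # Assumption: alphabetical file order approximates reading order.
        for name in sorted(zf.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = TextExtractor()
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chunks.append("".join(parser.parts))
    return "\n".join(chunks)
```

Writing the result of `epub_to_text("book.epub")` to a .txt file gives a single "mega file" that an n-gram tool can read in one go, so repeats that sit in different chapters end up in the same corpus.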
Old 05-18-2024, 11:00 AM   #19
mikapanja
Perfectionist
Posts: 72
Karma: 12802
Join Date: Apr 2014
Device: none
Quote:
Originally Posted by Doitsu View Post
For those kinds of searches you'll need to use a concordance tool. For example, Laurence Anthony's AntConc (freeware).
Tried it following your instructions - while setting the frequency to 2 - and it found an erroneous duplication of an unknown word group in no time. I knew it was somewhere in the text, but had no idea what it was and where.

And double-clicking a hit gives you a short context. And double-clicking the context brings up the whole text, with the repetition highlighted. Perfect.
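The "short context" view is what concordance tools usually call keyword-in-context. A rough sketch of the idea in Python follows; this is not AntConc's code, and the function name `contexts`, the `width` parameter and the word-matching regex are assumptions made for the illustration.

```python
import re

# Assumption: a "word" is a run of letters/digits, optionally joined by
# apostrophes or hyphens (so "don't" and "well-known" count as one word).
WORD_RE = re.compile(r"\w+(?:['’-]\w+)*")


def contexts(text, phrase, width=5):
    """Return every occurrence of `phrase` with up to `width` words either side."""
    words = WORD_RE.findall(text)
    lowered = [w.lower() for w in words]
    target = [w.lower() for w in WORD_RE.findall(phrase)]
    n = len(target)
    hits = []
    for i in range(len(words) - n + 1):
        if lowered[i:i + n] == target:
            left = " ".join(words[max(0, i - width):i])
            middle = " ".join(words[i:i + n])
            right = " ".join(words[i + n:i + n + width])
            hits.append(f"{left} [{middle}] {right}".strip())
    return hits


sample = "The cat sat on the mat. Later the cat sat on the sofa."
for line in contexts(sample, "cat sat on", width=2):
    print(line)
```

Matching is case-insensitive and ignores punctuation, which is a choice made here so that a repeat still shows up when it is split by a comma or starts a sentence.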

Doitsu, thank you so much!

Quote:
Originally Posted by Quoth View Post
Indeed any search tool is useless unless you know the exact repeating text.
See above.
Old 10-14-2024, 11:17 PM   #20
Tex2002ans
Wizard
Posts: 2,304
Karma: 12587727
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by mikapanja View Post
Is there a way to find (and possibly highlight) repeated word groups in ePubs?

[...]

I'd like it to find non-adjacent repeated word groups, i.e., scattered throughout the text.
Not directly...

But like Doitsu pointed out, what you want to look for is called:
  • n-grams
    • These are "X number of words in a row".

So:
  • 4-grams = 4 words in a row
  • 3-grams = 3 words in a row
  • 2-grams = 2 words in a row
  • 1-gram = 1 word in a row
    • This is just Spellcheck Lists! A list of every single word (+ its # of hits) in the book!
      • In Calibre: Tools > Check Spelling (Alt+F7)
      • In Sigil: Tools > Spellcheck > Spellcheck (Ctrl+Alt+Q)
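To make the "1-gram = word list with hit counts" idea concrete, here is a small Python sketch. It only approximates what the Calibre and Sigil spellcheck lists show (they use their own word-splitting rules); `word_counts` and the regex are assumptions for this example.

```python
import re
from collections import Counter


def word_counts(text):
    """Count every distinct word, keeping case, accents and hyphens as written."""
    # Assumption: hyphens and apostrophes inside a word belong to that word.
    words = re.findall(r"\w+(?:['’-]\w+)*", text)
    return Counter(words)


sample = "The well-known cafe, the café, and the well known Cafe."
for word, hits in sorted(word_counts(sample).items()):
    print(hits, word)
```

Keeping case and accents intact is deliberate in this sketch: near-duplicates such as "cafe" / "café" / "Cafe" or "well-known" / "well known" then show up as separate low-count entries in the sorted list, which is how mismatching accents and inconsistent hyphenation become visible.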

You can also use Calibre to temporarily convert your book to a TXT, and then there are plenty of "n-gram" tools out there to try and test out.

- - -

Side Note: I've written about "List-Based Spellchecking" + n-grams in detail, and have been using this to rip apart + edit books... for over 10 years now.

For some of my recent posts, see:

I cover stuff like how I use Spellcheck Lists to catch:
  • Typos
  • All "Foreign Words"
  • Mismatching Accents
  • Misspelled Names
  • Inconsistent Hyphenation

then how I use n-grams to catch repetitious repetitions throughout the books!

I also use Regular Expressions to quickly catch/refine/clean up a lot of this repetitious crap too!

- - -

Side Note #2: I even gave a talk about this last year in the:

- - -

Side Note #3: If you're interested, just last week I wrote an "article" on how I use n-grams.

This past month, I've been working on (conversion+proofing of) a 450k word beast of an ebook...

The author wanted me to copyedit/proofread, so I:
  • generated an n-grams spreadsheet
  • + wrote up a breakdown of how I use n-grams (with real-life examples from the book).

Here's a little sample:

- - - - - - - - - -

N-grams

These show you how many times you "repeat a phrase"/"chunk of words".

So a list of "3-grams" would show you every "chunk of 3 words in a row".

So if you took:
  • Show an example sentence with an example sentence.

and ran 3-grams on it, the output would show:
  • 2 an example sentence
  • 1 Show an example
  • 1 example sentence with
  • 1 sentence with an
  • 1 with an example

You repeated "an example sentence" twice!

When you run this across the entire book, these "repetitive patterns" pop right out!
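The 3-gram example above can be reproduced with a few lines of Python. This is only a sketch of the counting idea, not the tool used for the spreadsheet; `ngram_counts` and its word-matching regex are assumptions for the illustration.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words in the text."""
    # Assumption: punctuation is dropped, so n-grams may span sentence breaks.
    words = re.findall(r"\w+(?:['’-]\w+)*", text)
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


counts = ngram_counts("Show an example sentence with an example sentence.", 3)
for gram, hits in counts.most_common():
    print(hits, gram)
```

Case is kept as written here (so "Show" stays capitalised, as in the list above). For a whole book you may prefer to lowercase the text first so that a phrase at the start of a sentence is counted together with the same phrase mid-sentence.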

How I Use Them

1. I start with the biggest n-grams first...
• Then work my way down.
• 6-grams, 5-grams, 4-grams, ...
2. When I find an interesting phrase + high number...
• I search the entire book for it.
3. I read the sentence...
• Use this to chop/refine!
• Fix/reword sentences as needed.
4. Repeat Step 2 in passes.
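The "biggest n-grams first, then work down" pass can be sketched in Python as below. The function name `repeated_ngrams` and the parameters `max_n`, `min_n` and `min_hits` are hypothetical; `min_hits=2` plays the role of a minimum frequency of 2, and the text is lowercased as a simplifying assumption.

```python
import re
from collections import Counter


def repeated_ngrams(text, max_n=6, min_n=2, min_hits=2):
    """Yield (n, phrase, hits) for repeated phrases, longest phrases first."""
    words = re.findall(r"\w+(?:['’-]\w+)*", text.lower())
    for n in range(max_n, min_n - 1, -1):
        counts = Counter(
            " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
        )
        # most_common() is sorted by count, so stop at the first rare phrase.
        for phrase, hits in counts.most_common():
            if hits < min_hits:
                break
            yield n, phrase, hits


sample = "At the end of the day he slept. At the end of the day she left."
for n, phrase, hits in repeated_ngrams(sample):
    print(f"{n}-gram x{hits}: {phrase}")
```

Shorter n-grams contained in a longer repeat are reported again on the later passes (the 5-grams inside a repeated 6-gram, for instance). That overlap is left in on purpose in this sketch, since deciding which hits are interesting is the manual part of the workflow.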

In your case, we can skip the 7-grams and 6-grams (it's mostly just these super-long titles like "Chairman of the Joint Chiefs of Staff").

5-grams is where we start seeing really interesting patterns.

[... It then goes through 5-grams, 4-grams, 3-grams, 2-grams... showing the types of things/patterns that can be found with each. ...]

Last edited by Tex2002ans; 10-15-2024 at 12:04 AM.