11-21-2021, 11:18 PM | #1 |
Wizard
Posts: 1,571
Karma: 11380098
Join Date: Aug 2010
Location: NE Oregon
Device: Kobo Sage, Pocketbook Era, Kobo Forma, Kindle Oasis 2
|
Hopefully eligible/wanted for MobileRead library?
Yesterday I received a book in amongst an eBay lot of children's books I'd purchased, and I believe it happens to be public domain for both the U.S.A. and MobileRead.
But I want to check and be sure that it is OK and also that MobileRead actually might like a copy. There is a Kindle version for 99 cents. I didn't find a copy here.
Title: Under the Lilacs
Author: Louisa May Alcott (1832–1888)
Illustrator: Alice Barber Stephens (1858–1932)
Genre: Children's
I believe the book was first published in 1878. The copy I possess is a 1910 hardcover published by Little, Brown, and Company. 302 pages and 8 full page illustrations plus a small one on the title page.
Gutenberg has a text, edition unknown, but the Gutenberg text has more than a few errors from what I'm finding so far. It also has no curly quotes or italics, ugh! I'm planning to use it though, because the book I have is um, NOT in the best of shape. The illustrations are fine, but the spine is broken, pages have tears and it would be a fair amount of work to cut it, trim it, patch it as needed, and feed it through my document scanner a page at a time. My scanner does not handle old, thick pages well.
I'm currently about half-way through getting the blamed curly quotes in. After which, it's going to get a computer screen read against the hardcover to get the italics. It'll almost certainly need an additional final read through or TWO against the hardcover before the text is done. There's language mark up, poetry, abbreviations and heaven knows what else. So it won't be coming soon, but hopefully eventually?
I checked the Hathi Trust and their version has a different illustrator and fewer images, so I thought MobileRead might like this one since the illustrator is identified.
You may have to put up with a lot of questions in the Workshop or ePub forum before I'm done! But I *think* I'm up for it.... |
11-22-2021, 01:08 AM | #2 |
Running with scissors
Posts: 1,557
Karma: 14325282
Join Date: Nov 2019
Device: none
|
archive.org has several different copies.
|
11-22-2021, 02:27 AM | #3 |
Wizard
Posts: 1,571
Karma: 11380098
Join Date: Aug 2010
Location: NE Oregon
Device: Kobo Sage, Pocketbook Era, Kobo Forma, Kindle Oasis 2
|
Thanks for the suggestion! I did find one of theirs that is similar to the book I have, only 1906 instead. I could OCR the PDF, but I do my markup for curly quotes/italics by hand regardless, so kinda moot. And not being a super quality PDF in the first place, it might be worse than cutting, trimming, patching, scanning for a better quality PDF. It's not my first rodeo, I know well the difference in how a PDF out of my document scanner OCRs vs. an IA PDF...
I looked at the IA epub, but, um, NOPE. Too much absolute gibberish. The Gutenberg text, while not proofread to my standards, is not entirely awful. It's just still got errors which'll get winkled out through my process. Buying the Kindle book and stripping out the text would probably be the smart option, but wouldn't eliminate me having to read it against the 1910 text, so I might as well just have at it with the materials I've got. Payback to the MR library for good reading material I've enjoyed over the years! |
11-22-2021, 11:58 AM | #4 |
the rook, bossing Never.
Posts: 12,367
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
Quotes are easily fixed by Calibre and other tools. Italics need either fancy OCR from an Archive Org scan, or DIY.
I thought I'd seen Gutenberg texts with italics, though I haven't checked that one. Usually I download the mobi + images (if there are any) from Gutenberg and convert in Calibre to epub with automatic smart quotes, a 1.4em first-line paragraph indent, and the spaces between paragraphs removed. Sometimes I also auto fully justify. Occasionally I edit the epub.
Usually the only worthwhile Archive Org file is the scan/PDF. The epub/mobi are "simple OCR", hence full of gibberish. I can do better than that with Tesseract on Linux and my 2002 scanner. Last edited by Quoth; 11-22-2021 at 12:01 PM. |
11-22-2021, 03:30 PM | #5 | |
Wizard
Posts: 1,571
Karma: 11380098
Join Date: Aug 2010
Location: NE Oregon
Device: Kobo Sage, Pocketbook Era, Kobo Forma, Kindle Oasis 2
|
Quote:
Usually, I think Gutenberg does do italics. Not *this* text though. There doesn't appear to be tons of italic use, just moderate, so I'll just go with DIY.
As for automating the smart quotes, I think that would still have a fair few errors. There's fairly heavy apostrophe use for missing letters in dialogue. I'm halfway on curly quotes by hand, so I'll just continue. It's giving me opportunities to pick up other stuff as I go.
The entire book puts spaces in contracted words for instance: could n't, would n't, sha n't, is n't, etc... Gutenberg corrected a lot of that to modern use, but didn't get them all by any means, strays keep popping up. I'm also planning to go with modern use for the contractions, makes more sense for an ebook to have it easier for folks to read and not have to insert tons of non-breaking spaces.
There's also stray capitalization here and there. And special characters, which Gutenberg also missed. I'll get it, I'm a "noticing sort." Text should be very nice when done.
What I really, truly dread is running it through spell check. That's part of my process at the end of my proofreading, and usually finds a small handful of things I've missed, but it's gonna be hell with this text, because there are a lot of deliberately misspelled words in the children's dialog. So I'm glad to have a PDF for searching and checking.
Otherwise, I'm enjoying what I'm seeing of the book, so that's a plus. |
|
11-22-2021, 08:04 PM | #6 | |
Wizard
Posts: 1,571
Karma: 11380098
Join Date: Aug 2010
Location: NE Oregon
Device: Kobo Sage, Pocketbook Era, Kobo Forma, Kindle Oasis 2
|
QUESTION: Should I report this text to Gutenberg for them to revisit? (Assuming there is a mechanism for that, I've not yet looked.)
I'm glad I'm doing the curlies by hand, as I'm finding tons of extraneous spaces that I'm correcting as I go. "There 's," "it 's," "I 'm," etc... A lot introduced by the printing, I'm sure, as I'm seeing it in the hardcover. Still, the Distributed Proofreader guidelines say: Quote:
Anyway, I've got the single quote marks done and the text properly divided into chapters in Sigil. Enough progress for the moment, other things to do. |
|
11-30-2021, 11:46 AM | #7 | |||
Wizard
Posts: 2,304
Karma: 12587727
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Sigil
1. Install the "PunctuationSmarten" plugin.
2. In Sigil, you can then press Plugins > Edit > PunctuationSmarten. That will open up a menu where you can convert all quotations to their smart versions.
Calibre
1. Install the "Diap's Editing Toolbag" plugin.
2. In Calibre's Editor, it's easier if you enable the "Smarten Punctuation" button in your toolbar. To do this, go into: Edit > Preferences, then Toolbars. In the "Toolbar to Customize" dropdown, choose: "Book wide tools from third party plugins".
2.5. You should see 2 columns: Left-hand side = "Available actions" + Right-hand side = "Current actions". On the left-hand column, find "Smarten Punctuation (the sequel)" + move it to the right using the middle arrows. This will put a little "Einstein's face" button on your main Calibre Editor window. You can press that button when working on a book, and it will smarten all the quotes.
Note: Calibre now has a built-in Tools > Smarten punctuation (works best for English), but I don't like it as much. Diap's tool lets you customize a lot more (like not messing with ellipses or dashes).
- - -
Side Note: These algorithms get 99% left/right quotes correct, but there are many edge cases it gets wrong. Especially around:
I went into more detail on fixing quotes many times over the years. For example, see my in-depth posts from:
Quote:
Once you notice the pattern, you can do a mass search/replace to try to correct those in one fell swoop. After Smarten Punctuation... this is one regular expression I use:
Find: ‘(Em|em|Til|til|Tis|tis|Twas|twas)
Replace: ’\1
which finds common words like ‘em, ‘tis, ‘twas and flips them to the correct apostrophe. Boom... now that 1% of smarten errors turned into .1%.
Then I just search for all LEFT SINGLE QUOTES (usually there are < a few dozen), and manually correct any of those leftovers.
Side Note: If working on British books, with ‘single quotes’ being used for dialogue instead of “double quotes”... then things get quite a bit more complicated.
* * *
From there, I'd recommend using a Regex to search for a SPACE + apostrophe + SINGLE CHARACTER by itself. To do this, use this regex:
Find: \s(’\w)\b
Replace: \1
This will catch things like:
and convert to:
Of course, that regex can be adjusted for more complicated patterns:
Find: \s(\w’\w)\b
Replace: \1
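For anyone who prefers to run these passes outside Sigil or Calibre, here is a minimal Python sketch of the three search/replace patterns in this post, plus a helper that lists whatever left single quotes are still left for manual review. The function names are made up for illustration, and the trailing `\b` on the first pattern is an added tweak (not part of the regex as posted) so that a quoted word such as ‘emerald’ is left alone. It assumes Smarten Punctuation has already been run, so the text contains real curly quotes.

```python
import re

LSQ = "\u2018"  # LEFT SINGLE QUOTATION MARK
RSQ = "\u2019"  # RIGHT SINGLE QUOTATION MARK (the apostrophe)

# Words with an elided first letter (word list taken from the post).
# The trailing \b is an assumption added here, not in the posted regex.
ELIDED = re.compile(LSQ + r"(Em|em|Til|til|Tis|tis|Twas|twas)\b")

# SPACE + apostrophe + single character: "it ’s", "I ’m".
SPACED_SUFFIX = re.compile(r"\s(" + RSQ + r"\w)\b")

# SPACE + character + apostrophe + character: "could n’t".
SPACED_CONTRACTION = re.compile(r"\s(\w" + RSQ + r"\w)\b")


def fix_apostrophes(text: str) -> str:
    """Apply the three search/replace passes in order."""
    text = ELIDED.sub(RSQ + r"\1", text)
    text = SPACED_SUFFIX.sub(r"\1", text)
    text = SPACED_CONTRACTION.sub(r"\1", text)
    return text


def remaining_left_quotes(text: str, width: int = 20) -> list:
    """Return a context snippet for every left single quote still present."""
    return [
        text[max(0, m.start() - width): m.end() + width]
        for m in re.finditer(LSQ, text)
    ]


sample = "\u2018Tis true, it \u2019s late and I could n\u2019t get \u2018em."
fixed = fix_apostrophes(sample)
```

The leftovers from `remaining_left_quotes` still need a human eye, especially in British-style books where single quotes open dialogue.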
Quote:
Sigil: Tools > Spellcheck > Spellcheck
Calibre: Tools > Check spelling.
This will let you mass check/correct/Ignore all the words in a book. You can even use it for tricks, like listing all hyphenated words or catching common OCR errors like 'o' -> '0' or 'l' -> '1'. Last edited by Tex2002ans; 11-30-2021 at 12:56 PM. |
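The word-list tricks mentioned here can also be roughed out in plain Python. This is only a sketch of the idea, not how Sigil or Calibre implement their spellcheck dialogs, and every function name below is hypothetical. It builds a word count, lists hyphenated words, shows how often the unhyphenated form of each also appears, and flags words that mix letters and digits (a typical sign of 'o' -> '0' or 'l' -> '1' OCR slips).

```python
import re
from collections import Counter

# A "word" is a run of letters/digits, optionally joined by hyphens or apostrophes.
WORD = re.compile(r"[^\W_]+(?:[-\u2019'][^\W_]+)*")

HAS_LETTER = re.compile(r"[^\W\d_]")
HAS_DIGIT = re.compile(r"\d")


def word_counts(text: str) -> Counter:
    """Count every distinct word in the text (case-sensitive)."""
    return Counter(WORD.findall(text))


def hyphenated(counts: Counter) -> list:
    """All distinct words containing a hyphen, sorted."""
    return sorted(w for w in counts if "-" in w)


def hyphen_variants(counts: Counter) -> dict:
    """Map each hyphenated word to how often its unhyphenated form occurs."""
    return {w: counts[w.replace("-", "")] for w in hyphenated(counts)}


def digit_letter_mixes(counts: Counter) -> list:
    """Words containing both a letter and a digit, e.g. 'g0od' or 'ta1l'."""
    return sorted(
        w for w in counts if HAS_LETTER.search(w) and HAS_DIGIT.search(w)
    )


sample = "The bean-stalk grew; the beanstalk was ta1l and g0od in 1878."
counts = word_counts(sample)
```

A hyphenated word whose unhyphenated form also shows up in the same book is worth checking against the print edition; pure numbers such as years are deliberately not flagged.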
|||
11-30-2021, 04:30 PM | #8 |
the rook, bossing Never.
Posts: 12,367
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
Most word processors get ’tis, ’90, etc. wrong.
|
12-01-2021, 03:16 AM | #9 |
Wizard
Posts: 1,571
Karma: 11380098
Join Date: Aug 2010
Location: NE Oregon
Device: Kobo Sage, Pocketbook Era, Kobo Forma, Kindle Oasis 2
|
Thanks! I am definitely going to get on my computer and print this to PDF for future reference. But I feel sort of bad that you typed all that out, as I've already gotten the curly quotes dealt with, finished that several days ago. But I picked up a zillion other small errors in the process.
Currently getting the italics in and also making sure paragraphs are correct. I've also created a custom spelling dictionary for this book.
The Gutenberg text is much more of a trainwreck than I'd initially thought. I've found missing punctuation, including some of the quotes, but also em-dashes, hyphens, and some entire words! And a great deal of paragraph problems! Oh, and accented characters missing as well.
This text is a case of: "The eyes of the one outweigh/outproofread the eyes of the many." 😁
It's also a case of the right book met the right person, as I've just been delighted by it! Enough so that I'm giving "Little Women" a re-read. |
12-01-2021, 12:15 PM | #10 | ||
Wizard
Posts: 2,304
Karma: 12587727
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Now something that took you days will take seconds, and the rest of the free time can be spent on hunting down typos or more important issues. Same with the common patterns of errors. Once you notice one, regex can find them all.
That's another regex I use:
Find: ‘([0-9])
Replace: ’\1
That finds shortened years like:
and flips it to the correct RIGHT SINGLE QUOTE. Quote:
https://www.gutenberg.org/ebooks/3795 Now they're at book ~67k. The quality of that stuff was not so good back then, but I'm still surprised such italics/typo errors snuck through. The one frustration I have with Gutenberg books is they don't offer the original scan (PDF) they worked off of. This would allow you to go in there and re-correct based on the same source + bring it up to today's standards. Modern PG books (done with Distributed Proofreaders) go through lots more rounds of proofing. If they redid this book now, it would definitely be much higher quality. Last edited by Tex2002ans; 12-01-2021 at 02:56 PM. |
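For scripting outside the editors, the shortened-year pattern from this post (Find: ‘([0-9]), Replace: ’\1) might look like this in Python. It is a small sketch with a made-up function name, and it assumes American-style text where a left single quote directly before a digit is always a mis-smartened apostrophe. In a British-style book with single-quoted dialogue that starts with a number, it would flip a genuine opening quote.

```python
import re

# LEFT SINGLE QUOTATION MARK followed directly by a digit, as in a
# mis-smartened shortened year.
LEFT_QUOTE_BEFORE_DIGIT = re.compile("\u2018([0-9])")


def fix_year_apostrophes(text: str) -> str:
    """Flip a left single quote before a digit to a right single quote."""
    return LEFT_QUOTE_BEFORE_DIGIT.sub("\u2019\\1", text)


fixed_years = fix_year_apostrophes("the winter of \u201878 and the class of \u201890")
```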
||
12-01-2021, 08:25 PM | #11 |
Wizard
Posts: 1,571
Karma: 11380098
Join Date: Aug 2010
Location: NE Oregon
Device: Kobo Sage, Pocketbook Era, Kobo Forma, Kindle Oasis 2
|
Well thanks, those tips will help, especially for the 'em, 'tis, 'twas, etc... common words!
I did suspect that Under the Lilacs had to be an early Gutenberg effort. They do MUCH better usually! I agree it would be good if they'd state the edition and provide a PDF of the scan they used. |
12-11-2021, 06:27 PM | #12 |
Wizard
Posts: 1,571
Karma: 11380098
Join Date: Aug 2010
Location: NE Oregon
Device: Kobo Sage, Pocketbook Era, Kobo Forma, Kindle Oasis 2
|
I'm through getting italics back in and checking paragraph structure, did Sigil spellcheck this AM, now I'm on to reading against the print. My eyes may not be able to do more than a chapter a day...
I've also found a PDF of the 1878 edition, which is paginated slightly differently, but may be of use in determining whether some words are hyphenated or not. In the very first chapter, I've got "bean-stalk", hyphenated between lines, but Google search indicated that "beanstalk" had come into use by the time the book was published. And lo and behold, it's "beanstalk" in the first edition PDF! So that's what I'll use. I don't expect to do this stage quickly, this is the stage where I hope to find any further punctuation errors from the Gutenberg text I might have missed, likewise for special characters, stray capitalizations, italics, or paragraph structure. |
|