Extra "<p>" tags when converting to AZW3 from pdf

MrTanquery · 12-17-2012, 11:23 PM

It's a simple question, but there are hours and hours of mind-numbing work on the line for me if I can't solve it. I'm getting extra "<p>" tags (HTML paragraph tags) when I convert from PDF to AZW3. I'm sure the extra p tags would be there in the other formats too; it seems to be something in how Calibre handles converting PDF formatting to HTML.

Is there a way to modify Calibre to only use p tags when there is an actual paragraph? Simple line wrapping should be handled automatically by the reader, as per usual.

As a matter of interest, Adobe Acrobat Pro handles this properly when you ask it to save a PDF as an HTML file. That is, it only uses paragraph tags when there is an actual paragraph, and lets the reader handle line wrapping within each paragraph...

Your help is greatly appreciated!
C

fidvo · 12-18-2012, 03:12 PM

You've just discovered the frustration of trying to convert from PDFs. I feel your pain.

First, read the sticky, especially the section titled "Some of my paragraphs are split into multiple paragraphs".

Short answer: PDFs don't have paragraphs; they have lines of text. The information about where one paragraph ends and another begins gets lost in the conversion to PDF, so it's not available for Calibre or any other conversion program to use. Some PDFs use workarounds that preserve that information (e.g. by putting blank lines between paragraphs), and in those cases Calibre is able to guess where to break paragraphs. The one you're working with apparently does not.
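To make the blank-line workaround concrete, here is a minimal Python sketch. It is not Calibre's code; the function names `paragraphs_from_blank_lines` and `to_html` are made up for illustration, and it assumes the text has already been extracted from the PDF as plain lines.

```python
import html


def paragraphs_from_blank_lines(text):
    """Group extracted lines into paragraphs, treating blank lines as breaks.

    Hypothetical helper for illustration only, not part of Calibre.
    """
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            # Non-blank line: it belongs to the paragraph being built.
            current.append(stripped)
        elif current:
            # Blank line: close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def to_html(paragraphs):
    """Wrap each paragraph in a single <p> element, escaping special characters."""
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
```

If the extracted text has no blank lines at all, everything collapses into one paragraph, which is exactly the case where the break information is simply not there to recover.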

Possible solutions include converting and cleaning up manually afterward (a lot of work), using Calibre's heuristic processing to try to guess where the paragraph breaks are (good, but not perfect), or trying to obtain the original in a different format, like EPUB, MOBI, or HTML. If that last option is possible, I recommend it as the best solution.
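For a rough idea of what a heuristic guess can look like, here is a simplified Python sketch. It is an assumption-laden illustration, not Calibre's actual algorithm: the function `unwrap_lines` and its `short_factor` parameter are hypothetical. It ends a paragraph only when a line both finishes a sentence and is noticeably shorter than the widest line.

```python
import re

# A line "ends a sentence" if it finishes with . ! or ?, optionally followed
# by a closing quote or bracket.
SENTENCE_END = re.compile(r"[.!?][\"')]*$")


def unwrap_lines(lines, short_factor=0.8):
    """Merge wrapped lines into paragraphs using a simple heuristic.

    Illustrative only: `short_factor` is an assumed tuning knob, not a
    Calibre setting.
    """
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return []
    full_width = max(len(ln) for ln in lines)
    paragraphs, current = [], []
    for ln in lines:
        current.append(ln)
        ends_sentence = bool(SENTENCE_END.search(ln))
        is_short = len(ln) < short_factor * full_width
        # Break only when both signals agree; a full-width line that happens
        # to end with a period is treated as an ordinary wrapped line.
        if ends_sentence and is_short:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

A rule like this misses a paragraph whose last line happens to fill the full width, and it wrongly splits at a short line that ends in an abbreviation, which is why the result usually still needs some manual cleanup.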


