Copying text from pdf files

donncha77
Posts: 20
Joined: Tue May 13, 2008 12:32 pm

Tue May 13, 2008 12:39 pm Post

Hi
When I copy text from a pdf file (within Scrivener) many if not most of the spaces between words are lost. Does anyone know why this happens or if it can be sorted out?

Donncha

KB
Site Admin
Posts: 20729
Joined: Tue Jun 13, 2006 11:23 pm
Platform: Mac
Location: Truro, Cornwall

Tue May 13, 2008 5:00 pm Post

It's because the text system is trying its best to read text that has already been rendered to PDF, so a lot of the information that the text system requires is no longer there. There's nothing that can be done about it, I'm afraid.

All the best,
Keith

xiamenese
Posts: 4327
Joined: Mon Jan 29, 2007 1:32 am
Platform: Mac
Location: London or Exeter, UK.

Wed May 14, 2008 12:32 am Post

That's odd, because I've just tried it to see what happens ... I imported a PDF into the research folder, then copied a couple of paragraphs from that and pasted it into a file in the Draft folder and it copied with all the spaces no problem.

It was a PDF I had created myself by printing to PDF from Nisus, so in case that made any difference, I imported a PDF I'd downloaded into the Research folder and then copied and pasted from that ... again no problem, all the spaces there as required.

Perhaps, if the problem persists you could try choosing "Open in alternate editor" from the contextual menu — I use Skim — and see if that does better for you.

I'm using 10.5.2 on Intel, in case that also makes a difference.

Mark
The Scrivenato sometimes known as Mr X.
iMac 27" (late 2015) 10.15.4, 24GB RAM, 512GB SSID
MBP17" (late 2011) 10.13.6, 16GB RAM, 2TB SSID
2017 iPad, iPadOS 13.3, 128GB, Apple Pencil
Scrivener, Scapple, Nisus Writer Pro, Bookends …

cyberbryce
Posts: 74
Joined: Tue Jan 09, 2007 4:27 am

Wed May 14, 2008 3:11 am Post

donncha77 wrote:Hi
When I copy text from a pdf file (within Scrivener) many if not most of the spaces between words are lost. Does anyone know why this happens or if it can be sorted out?


KB wrote:It's because the text system is trying its best to read text that has already been rendered to PDF, so a lot of the information that the text system requires is no longer there. There's nothing that can be done about it, I'm afraid.


Specifically, I think as primarily a page layout format, PDF (often?) does not represent spaces between words, but instead just draws the text on the page. Selection and copying such as is done by the OS X routines Scrivener uses requires it to guess where the spaces belong, and hence it works well for some PDFs or parts of PDFs, but not others.
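To make the guessing concrete, here is a minimal sketch in Python of one way a text extractor might decide where spaces go. It is only an illustration, not what the OS X routines actually do: the (char, x, width) glyph tuples, the guess_spaces name, and the gap_ratio threshold are all made up for the example.

```python
def guess_spaces(glyphs, gap_ratio=0.3):
    """Rebuild a line of text from positioned glyphs.

    glyphs: list of (char, x, width) tuples in reading order (hypothetical format).
    A space is inserted when the horizontal gap between two glyphs is larger
    than gap_ratio times the average glyph width on the line.
    """
    if not glyphs:
        return ""
    avg_width = sum(width for _, _, width in glyphs) / len(glyphs)
    out = [glyphs[0][0]]
    for (_, prev_x, prev_width), (char, x, _) in zip(glyphs, glyphs[1:]):
        gap = x - (prev_x + prev_width)  # empty space between the two glyphs
        if gap > gap_ratio * avg_width:
            out.append(" ")
        out.append(char)
    return "".join(out)


# Same two words, drawn with a generous gap and with a tight one.
loose = [("t", 0, 5), ("o", 5, 5), ("b", 13, 5), ("e", 18, 5)]
tight = [("t", 0, 5), ("o", 5, 5), ("b", 11, 5), ("e", 16, 5)]
print(guess_spaces(loose))  # to be
print(guess_spaces(tight))  # tobe
```

With a tightly set PDF the gap falls under the threshold and the space disappears, which would explain why the same routine copes with some files and not others.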

Depending on how important this is to you, you have some options. There might be some tool that happens to work better in capturing text than the OS X routines for your particular PDF. Some alternative PDF reading and conversion engines are: Adobe Acrobat, xpdf/pdftotext (http://www.foolabs.com/xpdf/), and Intaglio (a drawing program for the Mac). (Since Skim, like Scrivener, uses the built-in routines, I suspect it won't be of much help...)

Finally, another option is OCR: OCR programs are designed to detect word boundaries, so you could load your PDF into an OCR program and scan it. Adobe offers an online service with a trial membership that does this, if you choose the "make documents searchable" option, http://createpdf.adobe.com/cgi-feeder.p ... percapture .

I have exactly the same problem for a large reference document collection (an electronic textbook), and while I'm sure one of these methods would work, it hasn't been worth the time...

donncha77
Posts: 20
Joined: Tue May 13, 2008 12:32 pm

Wed May 14, 2008 11:45 am Post

Thank you all for your help. Opening in an external editor (I seem to never think of contextual menus) is a fine workaround.

And thanks to one and all (or perhaps just one!) for a very fine piece of software.

Donncha

d_a_friedman
Posts: 30
Joined: Wed Feb 27, 2008 8:28 pm

Thu Jun 26, 2008 5:24 pm Post

if you already have pdfs that have been ocr'ed (in my case by scanning them in using a canon mx-700 or by downloading them from an online journals server such as jstor), you can use Skim plus a special template to export the Skim highlighted notes into a multimarkdown file that can be imported into scrivener.

i like to have separate note cards/snippets of text for each highlighted quote from Skim. fortunately, Skim already exports the highlighted passages this way, but not in a format you can use for importing into scriv.

the solution is to write your own Skim template for exporting the notes.

the file "notesTemplate.txt" needs to be put in the directory library/application support/skim, which you need to create if you don't have one already (don't think Skim creates one).

here is a shot of the template file:
[image: capture2.png, notesTemplate.txt for mmd/Scriv compatible notes]


the Skim notes will come out in the format:
---
# <filename>, p. <page number> #

text
---

this is an mmd compatible format which you can import into Scriv. the result is a lot of snippets of text which serve as notecards for me.
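if you want to check or post-process the export before importing, here is a rough python sketch that splits a notes file in the format above into separate snippets. it is not part of Skim or Scrivener: the split_notes name and the regular expression are my own assumptions, and it treats any "---" lines as separators only.

```python
import re

# Matches heading lines of the form "# <filename>, p. <page number> #".
HEADING = re.compile(r"^#\s*(?P<source>.+),\s*p\.\s*(?P<page>\S+)\s*#\s*$")


def split_notes(exported):
    """Return a list of (source, page, text) tuples, one per exported note."""
    snippets = []
    current = None
    for line in exported.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading starts a new snippet.
            current = {"source": match["source"], "page": match["page"], "lines": []}
            snippets.append(current)
        elif current is not None and line.strip() != "---":
            current["lines"].append(line)
    return [
        (s["source"], s["page"], "\n".join(s["lines"]).strip())
        for s in snippets
    ]


sample = """# paper.pdf, p. 3 #

first quote

# paper.pdf, p. 12 #

second quote
"""
for source, page, text in split_notes(sample):
    print(source, page, text)
```

each tuple corresponds to one notecard, so you can see at a glance whether every highlight came through with its page number.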

of course, you could read the Skim wiki and mess around with the template to get the text in a different format. i really wanted lots of snippets in skim.

regards,

df

ps thanks to signinstranger on this forum, Christiaan (Skim) and Keith

flow
Posts: 66
Joined: Tue Jul 24, 2007 12:16 am

Mon Jul 07, 2008 9:14 am Post

pdftotext is a mac app to convert pdf documents to plain text.
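
For anyone who would rather script it, here is a minimal sketch in Python that calls the command-line pdftotext from xpdf mentioned earlier in the thread. It assumes that tool is installed and on your PATH; the helper names and the example file names are hypothetical. The -layout option asks pdftotext to keep the original physical layout, which may or may not help with lost spaces in your particular PDF.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for the xpdf pdftotext command-line tool."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the page's physical layout
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path, keep_layout=False):
    """Run pdftotext, failing early with a clear message if it is not installed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)


# Example call (file names are placeholders):
# extract_text("textbook.pdf", "textbook.txt", keep_layout=True)
```

The resulting plain-text file can then be imported into Scrivener like any other text document.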