OCR to recover an old MS?

PJS
Posts: 1185
Joined: Sun Jul 22, 2007 5:05 pm
Platform: Mac + Windows
Location: Upstate New York

Sun Jan 31, 2016 8:49 pm Post

Any suggestions for an OCR program?

I have an old print ms — 250 pages — which I'd like to load into the iMac so I can work on it, and page-by-page re-typing promises to be arduous and long.

It's double-spaced TNR, probably not the easiest to scan but so common (back then) it might have been worked out early in the game.

Thanks for any input.

Phil
You can't conquer stupid — or cure it — with more stupid.

xiamenese
Posts: 4621
Joined: Mon Jan 29, 2007 1:32 am
Platform: Mac
Location: London or Exeter, UK.

Mon Feb 01, 2016 11:15 am Post

Hi Phil,

Unless your original pages are very foxed, breaking up, creased, discoloured, etc. OCR should work well, and I don't think TNR would constitute a problem. But if there are any markings of any kind, they are likely to degrade the output at that point. There are lots of OCR apps out there; I have 2.

1 PDFpen Pro. This is expensive, but it does everything you might need with PDFs. It will also import directly from a scanner and automatically run OCR on the scan. Generally, its results are good, though there are of course problems around the inner margins when it's a book that has been scanned.

2 I also have PDF OCR X, the "Community Edition", which is free in the Mac App Store. The "Community Edition" is limited to working on one page at a time, though there is an "Enterprise Edition"—paid for or upgradable to—which removes this limit. I have only used it occasionally. I have it because, interestingly, it can OCR Chinese, where the makers of PDFpen (Pro) say it would be far too expensive to add Chinese to their list of languages.

HTH

Mr X
The Scrivenato sometimes known as Mr X.
iMac 27" (late 2015) 10.15.7, 24GB RAM, 512GB SSD
2017 iPad, iPadOS 14.3, 128GB, Apple Pencil
Scrivener, Scapple, Nisus Writer Pro, Bookends …

Hugh
Posts: 2444
Joined: Thu Mar 08, 2007 12:05 pm
Platform: Mac
Location: UK

Mon Feb 01, 2016 2:49 pm Post

Phil, success with OCR depends not only on the OCR software but also on the quality of the scan, and therefore the quality of the scanner. For "home" users, I think it's fair to say that the standard has been set by the Fujitsu ScanSnap iX500, which can prepare clear, precise scans quickly (assuming the original consists of single sheets of A4, foolscap or similar). This machine comes bundled with (as well as other useful applications) a "free" version of ABBYY FineReader for the Mac OCR software, which has worked well for me for several years, including with pieces written in TNR. But at a smidgen under £500 the iX500 is an expensive choice, and definitely not worth it for a single project, although if you were to borrow an iX500 attached to a computer loaded with ScanSnap and ABBYY software you could probably complete the task of scanning and OCR'ing your 250 pages within 120 minutes. A standalone version of ABBYY FineReader Pro for the Mac costs between £80 and £90 (as far as I remember). ExactScan Pro, which also has OCR software, costs around £80. (If the MS is already bound, you'll either have to guillotine off the binding or use a book scanner, a type of which I have no experience, but which I believe involves a higher level of cost and complexity.)

P.S. If you can't get your hands on an iX500 or similar sheet-fed scanner, or the MS is in book form, you could use an iPad with scanning software (of which there are numerous examples). I've had recent experience of this; for 250 pages it would be a long and time-consuming process, and the images would still need OCR'ing, probably on a Mac or PC. One tip I found useful: when scanning, place a piece of thick, heavy glass on the MS to keep the pages as flat as possible. Some iOS apps can adjust for page curvature, but not in all cases.

HTH
H
'Listen, some quiet night, when you've shirked your work that day. Do you hear
that distant, almost inaudible clicking sound? That's one of your
competitors, working away in the night in
Paris or London or Erie, PA.'

PJS
Posts: 1185
Joined: Sun Jul 22, 2007 5:05 pm
Platform: Mac + Windows
Location: Upstate New York

Mon Feb 01, 2016 5:06 pm Post

H and X —

Thank you both for your (as always) thoughtful and reasonable suggestions. Much appreciated, but a complete re-do seems inevitable.

I did scan a couple of test pages, with good clean images, and convert them. Then I re-read the original work. Conclusion: balance the cost of good scanning and the time involved against the nuisance of re-typing, then toss onto one side of the scale the considerable work the old (30 years +/-) story needs, and re-typing seems the better course.

It was a short novel at about 55,000 words. It probably will boil down to a 40,000 word novella. I flung words about quite recklessly back in my innocent middle age.

And again, thank you. It did help. (IDH?)

Phil
You can't conquer stupid — or cure it — with more stupid.

Jaysen
Posts: 6302
Joined: Mon Dec 17, 2007 4:00 am
Platform: Mac + Windows
Location: East-Be-Jesus-Nowhere SC, USA

Mon Feb 01, 2016 5:45 pm Post

umm... why not dictate?

I use the OSX feature with not too much trouble. You just speak punctuation...
Jaysen

I have a wife and 2 kids that I can only attribute to a wiggle, a giggle, and the realization that she was out of my league so I might as well be happy with her as a friend. 26 years marriage later, I can't imagine life without her. -Me 10/7/09


PJS
Posts: 1185
Joined: Sun Jul 22, 2007 5:05 pm
Platform: Mac + Windows
Location: Upstate New York

Mon Feb 01, 2016 7:22 pm Post

Jaysen... funny you should mention it.

Lady of the House made a similar suggestion about an hour ago, which is why I have an answer ready.

First, the reason I didn't go right away to dictation: I've had bad luck with it in the past, finding, in the finished piece, enough confusions and mistakes to outweigh the advantages. By the time I isolated and corrected the problems, I've put in more time than if I'd typed in the first place.

Second, what I'm doing now at the urging of LOTH: I'm trying dictation again, being a bit more careful with elocution, and taking trouble to interject punctuation marks accurately. Considering that the output still needs cleaning up and editing, it looks so far like a toss-up.

I'll keep at it a while. Maybe I'll be converted. There's an uncomfortable sense of talking to myself about dictating, but it might be argued that typing is the same thing from a different part of the brain. Big difference seems to be that my fingers — long accustomed to the chore — make no complaints about overuse, but so much talking — guarded and precise — wears out the vocal cords pretty fast.

Still, thanks for the idea.

Phil
You can't conquer stupid — or cure it — with more stupid.

Jaysen
Posts: 6302
Joined: Mon Dec 17, 2007 4:00 am
Platform: Mac + Windows
Location: East-Be-Jesus-Nowhere SC, USA

Mon Feb 01, 2016 8:28 pm Post

it took me about 2 hours to get dictation habits figured out.
1. don't speak too slowly.
2. speak punctuation.
3. don't speak edits. Pain and suffering, the world of Mr K, results.
4. do NOT wait to edit. (read to end, review)

I figure for re-input I save about 30% of the time of typing.
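To put rough numbers on that, here is a back-of-envelope sketch in Python. The 55,000-word count comes from Phil's post and the 30% saving is the figure quoted above; the typing speed of 40 words per minute is purely an assumption, and the function names are made up for illustration, so plug in your own values.

```python
def retyping_hours(words, typing_wpm):
    """Hours needed to re-type `words` at a steady `typing_wpm` words per minute."""
    return words / typing_wpm / 60


def dictation_hours(words, typing_wpm, saving=0.30):
    """Hours for dictation, assuming it saves the fraction `saving` of typing time."""
    return retyping_hours(words, typing_wpm) * (1 - saving)


WORDS = 55_000    # manuscript length mentioned in the thread
TYPING_WPM = 40   # assumed typing speed, not a figure from the thread

typing = retyping_hours(WORDS, TYPING_WPM)
dictating = dictation_hours(WORDS, TYPING_WPM)
print(f"typing: {typing:.1f} h, dictating: {dictating:.1f} h")
# prints "typing: 22.9 h, dictating: 16.0 h"
```

Neither figure includes the clean-up pass afterwards, which is the part that varies most from writer to writer.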
Jaysen

I have a wife and 2 kids that I can only attribute to a wiggle, a giggle, and the realization that she was out of my league so I might as well be happy with her as a friend. 26 years marriage later, I can't imagine life without her. -Me 10/7/09


rontarrant
Posts: 115
Joined: Tue Mar 11, 2014 1:30 pm
Platform: Windows

Mon Feb 01, 2016 9:04 pm Post

I've done this.

I scanned all the pages, then used Acrobat to export them as a Word file. Worked as well as any other OCR software out there and it's dirt cheap. Rent it for a month @ $15 U.S. or it might work with the evaluation version, IDK.

But do all the scanning before you get Acrobat. Save the images in PNG format for best results (but JPEG will do in a pinch).
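With 250 pages it is easy to skip one or get them out of order, so it may be worth checking the folder of scans before starting the paid month. A minimal Python sketch, assuming the scanner names its files something like page_1.png, page_2.png and so on; that naming scheme and the function names are assumptions, not anything Acrobat or a particular scanner requires.

```python
import re

# Matches a trailing page number in names like "page_12.png" (assumed naming scheme).
PAGE_NUMBER = re.compile(r"(\d+)\.(?:png|jpe?g)$", re.IGNORECASE)


def page_number(filename):
    """Pull the trailing page number out of a scan's filename."""
    match = PAGE_NUMBER.search(filename)
    if match is None:
        raise ValueError(f"no page number in {filename!r}")
    return int(match.group(1))


def check_scans(filenames, expected_pages):
    """Return the files in page order, plus any page numbers that are missing."""
    ordered = sorted(filenames, key=page_number)
    present = {page_number(name) for name in filenames}
    missing = [n for n in range(1, expected_pages + 1) if n not in present]
    return ordered, missing


scans = ["page_10.png", "page_2.png", "page_1.png", "page_4.jpg"]
ordered, missing = check_scans(scans, expected_pages=5)
print(ordered)  # ['page_1.png', 'page_2.png', 'page_4.jpg', 'page_10.png']
print(missing)  # [3, 5]
```

Sorting by the number rather than by the raw name keeps page 10 from landing ahead of page 2. For a real folder, the list of names could come from `os.listdir`.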

gr
Posts: 2263
Joined: Wed Feb 14, 2007 3:57 am
Platform: Mac + iOS
Location: Florida

Tue Feb 02, 2016 5:17 am Post

Put that old draft up in the attic. I and your agent want to find and publish it later when you are deep in your dotage. Tentative title: 'Go Set a Typescript'. But that could be tweaked a bit when we learn what your story is actually about.

gr

scokar
Posts: 24
Joined: Tue Dec 01, 2015 10:41 pm
Platform: Windows

Tue Feb 02, 2016 7:07 am Post

Find a local copyshop that will scan to PDF. Or possibly do it for free at your library. Then use one of the many OCR apps suggested.

AndreasE
Posts: 737
Joined: Wed Apr 11, 2007 5:33 pm
Location: France

Tue Feb 02, 2016 8:42 am Post

It should be mentioned that Mr. Ken Follett, one of the most successful authors on this planet (and fully equipped computer-wise) swears by retyping manually the first version of his manuscript as a means to improve it. In fact, he considers this to be one of the key factors of his success, along with profound research.

Of course, every author is different. But a successful OCR solution might be counter-productive.

Hugh
Posts: 2444
Joined: Thu Mar 08, 2007 12:05 pm
Platform: Mac
Location: UK

Tue Feb 02, 2016 1:00 pm Post

AndreasE wrote: It should be mentioned that Mr. Ken Follett, one of the most successful authors on this planet (and fully equipped computer-wise) swears by retyping manually the first version of his manuscript as a means to improve it. In fact, he considers this to be one of the key factors of his success, along with profound research.

Of course, every author is different. But a successful OCR solution might be counter-productive.


Yes, as E.B. White, author of "Charlotte's Web" and "Stuart Little", and co-author of the bible of American writers "The Elements of Style", stated: "The best writing is re-writing". (Lots of folk, including me in the past, have attributed this to Ernest Hemingway. But apparently he wrote "The only kind of writing is re-writing," which it seems to me is a slightly different proposition.)
'Listen, some quiet night, when you've shirked your work that day. Do you hear
that distant, almost inaudible clicking sound? That's one of your
competitors, working away in the night in
Paris or London or Erie, PA.'

xiamenese
Posts: 4621
Joined: Mon Jan 29, 2007 1:32 am
Platform: Mac
Location: London or Exeter, UK.

Tue Feb 02, 2016 3:12 pm Post

On re-typing vs OCRing, I can see how retyping would prompt the writer to edit and (hopefully) improve the text on the way, but I can see that causing focus problems resulting from editing while entering. Or, if you type in the whole thing without editing, you may remember some of the changes you want to make, but most of the details will be lost by the time you get to the end.

The thing about OCR is that, at least in my experience, it is never perfect, and needs very careful reading through and editing anyway—I'm driven loco by the appalling editing of published books that have been scanned and OCR'd to turn them into e-books, with resulting misspellings, substitutions of upper case 'I' for lower case "l", garbled ligatures, commas read as full-stops … you name it! So when you've OCR'd the text, you need to read through it very closely, and you can do your edits to improve it as you go.
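As a rough illustration of what that close read-through is hunting for, here is a minimal Python sketch that expands ligature characters and flags two of the error types mentioned above for a human to check. The function names and the two patterns are illustrative assumptions; they will throw up false positives (a name like McIntyre, or "e.g. this") and will miss plenty, so they are an aid to proofreading and not a replacement for it.

```python
import re

# Ligature characters that OCR output sometimes contains in place of plain letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# A capital I directly after a lowercase letter is probably a misread lowercase l.
SUSPECT_CAPITAL_I = re.compile(r"\b[A-Za-z]*[a-z]I[A-Za-z]*\b")
# A full stop followed by a lowercase word may be a misread comma.
SUSPECT_FULL_STOP = re.compile(r"\w+\. [a-z]\w*")


def expand_ligatures(text):
    """Replace single-character ligatures with their plain-letter spellings."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def flag_suspects(text):
    """Return (kind, snippet) pairs to check by eye; nothing is auto-corrected."""
    flags = [("capital I inside word", m.group())
             for m in SUSPECT_CAPITAL_I.finditer(text)]
    flags += [("full stop before lowercase", m.group())
              for m in SUSPECT_FULL_STOP.finditer(text)]
    return flags


sample = "He wouId not \ufb01nd it. and so he left."
cleaned = expand_ligatures(sample)
print(cleaned)  # He wouId not find it. and so he left.
for kind, snippet in flag_suspects(cleaned):
    print(f"{kind}: {snippet}")
# capital I inside word: wouId
# full stop before lowercase: it. and
```

The sketch only reports suspects and leaves the decision to the reader, since a wrong automatic fix is harder to spot later than the original OCR slip.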

That's my view anyway.

:)

Mr X
The Scrivenato sometimes known as Mr X.
iMac 27" (late 2015) 10.15.7, 24GB RAM, 512GB SSD
2017 iPad, iPadOS 14.3, 128GB, Apple Pencil
Scrivener, Scapple, Nisus Writer Pro, Bookends …

PJS
Posts: 1185
Joined: Sun Jul 22, 2007 5:05 pm
Platform: Mac + Windows
Location: Upstate New York

Mon Feb 08, 2016 7:29 pm Post

Thanks to all for your comments and suggestions. I tried each of the proposed systems. The only conclusion, for me, is that there is no one best way to do this. Best will vary with situation and writer.

The one working best for me in this instance is OSX speech-to-text. It's faster than re-typing, and cheaper and less-complex than OCR conversions with their necessary scanning, and slow enough that I can correct some of the more immediate mistakes en route.

It also, alas, can produce fascinating reconstructions of ordinary English prose, but I'm able to sort them out along the way.

Phil
You can't conquer stupid — or cure it — with more stupid.

Jaysen
Posts: 6302
Joined: Mon Dec 17, 2007 4:00 am
Platform: Mac + Windows
Location: East-Be-Jesus-Nowhere SC, USA

Mon Feb 08, 2016 7:39 pm Post

Wait. That sounds oddly like a suggestion I made.

Dang it all! Now I'm being useful. Vic-k may throw me out of the club again.
Jaysen

I have a wife and 2 kids that I can only attribute to a wiggle, a giggle, and the realization that she was out of my league so I might as well be happy with her as a friend. 26 years marriage later, I can't imagine life without her. -Me 10/7/09
