Poster: R Pal Date: Aug 30, 2016 4:36pm
Forum: opensource Subject: Re: I would like to remaster a book with many OCR errors

maven-raven,

Would you consider downloading the original PDF and re-uploading it as a new item?
If you then wait for the other file formats to be derived, re-download them, correct the errors and bad text, and re-upload the corrected files to the new item's directory, it may work.
Does someone here have any experience with correcting previously uploaded and OCRed files?

Poster: maven-raven Date: Aug 31, 2016 11:33am
Forum: opensource Subject: Re: I would like to remaster a book with many OCR errors

Hi R Pal,

It's probably better if we still keep the current version of the book around. It might confuse people if we replace the files. I will try to put together a new version of the book and then upload that independently. I will leave a comment on the original book page so that others know where to look for the remastered version.

I have already downloaded the scanned images. I will work with the txt version of the book, and whenever I encounter an OCR error I will correct it. Where there is any ambiguity, I will consult the scans.

I still don't know which file format would be most beneficial to the Internet Archive. I could certainly create a PDF file, but I would not mind using something that gives the document a more explicit structure, like DocBook XML.

And I am a programmer. I know how to automate repetitive corrections.
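
As a rough illustration of automating repetitive corrections on the txt version, here is a minimal Python sketch. The rules in `RULES` are hypothetical examples, not errors taken from the actual book; a real list would be built up from the mistakes found while proofreading against the scans.

```python
import re

# Hypothetical correction rules: (compiled pattern, replacement).
# They are illustrative only; real rules would come from the book itself.
RULES = [
    (re.compile(r"\btbe\b"), "the"),        # 'h' misread as 'b'
    (re.compile(r"\bwbich\b"), "which"),    # same misreading in another word
    # Rejoin words split by a hyphen at a line break. Caution: this also
    # joins genuine hyphenated compounds that happen to break at a line end.
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),
]

def apply_rules(text):
    """Apply every rule in order; return (corrected_text, replacement_count)."""
    total = 0
    for pattern, replacement in RULES:
        text, n = pattern.subn(replacement, text)
        total += n
    return text, total

sample = "This is tbe book wbich was scan-\nned badly."
fixed, count = apply_rules(sample)
print(fixed)   # This is the book which was scanned badly.
print(count)   # 3
```

Keeping the rules in one list makes it easy to review each substitution before running it over the whole text, which matters because a too-greedy pattern can introduce new errors.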

Poster: Jeff Kaplan Date: Aug 31, 2016 2:18pm
Forum: opensource Subject: Re: I would like to remaster a book with many OCR errors

to understand what is required of an OCR-corrected file, i'd suggest you find a recently scanned, downloadable book on the archive, then download and open the abbyy.gz file. in it you will see that there is data regarding, among other things, the x-y coordinates for each letter, such as (tags have been removed so it would render):

charParams l="1415" t="1994" r="1439" b="2022" wordStart="false" wordFromDictionary="true" wordNormal="true" wordNumeric="false" wordIdentifier="false" wordPenalty="92" meanStrokeWidth="48" charConfidence="83" serifProbability="255" O charParams

that is the data just for the letter "O" in a word. so unless you're going to duplicate this schema or track down each letter that needs correcting, your file will not function. And that file needs to function properly for other formats such as epub and mobi to be created. that is why i said earlier in the thread there is currently no, well no reasonable, way to do what you want.
This post was modified by Jeff Kaplan on 2016-08-31 21:18:01
This post was modified by Jeff Kaplan on 2016-08-31 21:18:25
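
The per-character data quoted in the post above can be read with a short script. The following is a sketch only: it assumes the stripped tags were `<charParams ...>O</charParams>`, uses a made-up enclosing `<line>` element, and builds a tiny in-memory gzip sample instead of reading a real abbyy.gz file. The threshold of 90 is an arbitrary illustrative value, not a documented cutoff.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Assumed sample: some of the attributes quoted in the post, wrapped back
# into a <charParams> element. The enclosing <line> element is a guess.
SAMPLE = (
    '<line>'
    '<charParams l="1415" t="1994" r="1439" b="2022" wordStart="false" '
    'wordFromDictionary="true" charConfidence="83">O</charParams>'
    '</line>'
)

def low_confidence_chars(fileobj, threshold):
    """Yield (char, (l, t, r, b), confidence) for characters below threshold."""
    for _, elem in ET.iterparse(fileobj, events=("end",)):
        # Strip any XML namespace so the check works with or without one.
        if elem.tag.rsplit("}", 1)[-1] != "charParams":
            continue
        # Assumes every charParams element carries these attributes.
        conf = int(elem.get("charConfidence", "0"))
        if conf < threshold:
            box = tuple(int(elem.get(k)) for k in ("l", "t", "r", "b"))
            yield elem.text, box, conf
        elem.clear()  # free memory; full files hold one element per character

# Build a gzip stream in memory to stand in for an abbyy.gz file.
compressed = io.BytesIO(gzip.compress(SAMPLE.encode("utf-8")))
with gzip.open(compressed, "rb") as f:
    hits = list(low_confidence_chars(f, threshold=90))
print(hits)  # [('O', (1415, 1994, 1439, 2022), 83)]
```

Listing low-confidence characters with their bounding boxes is one way to find the spots worth checking against the scans; it does not by itself produce a corrected file that the derive process would accept.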

Poster: maven-raven Date: Sep 1, 2016 12:03am
Forum: opensource Subject: Re: I would like to remaster a book with many OCR errors

Thanks for the detailed explanation! I already looked at the abbyy.gz file but I did not know that this is *the* source file that the internet archive uses.

Is there a section on the internet archive where you have non-OCRed books? I could just post the corrected book there instead.

Poster: Jeff Kaplan Date: Sep 1, 2016 8:46am
Forum: opensource Subject: Re: I would like to remaster a book with many OCR errors

all books uploaded automatically go through OCR unless the language is not OCRable.

Poster: R Pal Date: Aug 31, 2016 4:13pm
Forum: opensource Subject: Re: I would like to remaster a book with many OCR errors

I think the original page of an item here can only be changed by the original uploader or the administrators. But in Old Time Radio there are many versions of the same show, each one placed here by a different person using different source material. Having several versions of a book would be acceptable and perhaps a boon to Archive users. Good luck.