Java tool + automatic translations of PDFs (Google Translate etc)

posted on Mar, 16 2016 @ 07:39 PM

I don't know if they have a way of translating files, but have you tried the Bing Translator? In some cases/languages it's better than Google.

ArMaP

posted on Mar, 16 2016 @ 07:42 PM

I forgot to say that I have been using Tesseract to convert scanned books to text. After training it, and using 400 DPI images, I get very good results. Without training the results are not that good, and the quality drops sharply at lower image resolutions.

DexterRiley

posted on Mar, 16 2016 @ 07:46 PM

a reply to: ArMaP

How do you train Tesseract to improve its OCR performance? Do you have to retrain it for each book?

-dex

ArMaP

posted on Mar, 16 2016 @ 09:12 PM

a reply to: DexterRiley

I have the details at work, I will post them tomorrow.

And you have to train Tesseract for each typeface. In my case, although there were several books, they were all written by the same man with the same typewriter, so I only had to train it once. But if I had one book set in Arial, for example, and another in Times Roman, I would need one training session for each.

PS: If I'm not mistaken, a training session produces some files that you then have to tell Tesseract to use. So if a new book uses the same characters as one you have already trained on, you only have to tell Tesseract to use the earlier training.
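
As a rough sketch of what "telling Tesseract to use" a training result can look like: the Tesseract command line takes a `-l` option naming the trained data set, and recent versions take `--tessdata-dir` to point at the folder holding the matching `.traineddata` file. The name `typewriter`, the folder and the page file names below are made up for illustration, and the helper only builds the command without running it.

```python
from pathlib import Path


def build_tesseract_command(image, output_base, lang, tessdata_dir=None):
    """Return the argument list for one Tesseract run.

    lang is the name of the trained data set, i.e. the file
    <lang>.traineddata produced by a training session.
    """
    cmd = ["tesseract", str(image), str(output_base)]
    if tessdata_dir is not None:
        # Folder that contains <lang>.traineddata (hypothetical path below).
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    cmd += ["-l", lang]
    return cmd


if __name__ == "__main__":
    # "typewriter" is a hypothetical name for the one training session
    # shared by all the books typed on the same machine.
    for page in ["book1_page001.tif", "book2_page001.tif"]:
        cmd = build_tesseract_command(
            page, Path(page).stem, "typewriter", "training/tessdata"
        )
        print(" ".join(cmd))
```

The same training name is reused for every page of every book typed on that machine; only the image and output names change.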

ArMaP

posted on Mar, 17 2016 @ 02:35 PM

I was able to find the original blog post from where I got the information to train Tesseract.

It worked for me, so it should work for you too. If you have any questions just ask, maybe I can help.

DexterRiley

posted on Mar, 17 2016 @ 06:55 PM

a reply to: ArMaP

Thanks for the link. That's an excellent "Reader's Digest" synopsis of training Tesseract.

Now the problem with using something like that with the document set presented by the OP is that the vast number of different sources probably means there are a non-trivial number of fonts and typefaces in use. Is it necessary to specify a single set of font training files for each OCR session? Or can you create a "catalog" of training file sets that Tesseract can choose from? I'll spend a bit more time looking at the documentation later. I was just curious if you knew the answer off the top of your head.

In an OCR project that I just completed, I used OmniPage. It worked quite well with the relatively clean hardcopy that I was working with. However, I tried a small sample of the OP's source material and it didn't do so well. It also exhibited the same problem with recognizing diacritical marks. One nice feature though is that OmniPage is able to recognize the font and typeface. So, it seems to me that something like that program could be used to recognize the typeface, then specific training files could be created and deployed based on that analysis.

Sounds like an interesting project to experiment with. Thanks for the feedback.

-dex

ArMaP

posted on Mar, 17 2016 @ 07:15 PM

originally posted by: DexterRiley
Now the problem with using something like that with the document set presented by the OP is that the vast number of different sources probably means there are a non-trivial number of fonts and typefaces in use. Is it necessary to specify a single set of font training files for each OCR session? Or can you create a "catalog" of training file sets that Tesseract can choose from?

You can have several training files, but you have to tell Tesseract which one to use.
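
One way to get the "catalog" idea on top of that is to keep the choice outside Tesseract: a small lookup table from typeface to training name, consulted before each run. This is only a minimal sketch, assuming the typeface of each document is already known (for example from a tool that reports it); the typeface names and training names are hypothetical. Tesseract's `-l` option also accepts several names joined with `+`, which the fallback below uses, although how well that works with custom training is something to test rather than assume.

```python
# Hypothetical catalog: typeface reported for a document -> name of the
# training result (<name>.traineddata) made for that typeface.
TRAINING_CATALOG = {
    "arial": "arial_books",
    "times roman": "times_books",
    "typewriter": "typewriter",
}


def choose_training(typeface, catalog=TRAINING_CATALOG):
    """Return the value to pass to Tesseract's -l option."""
    key = typeface.strip().lower()
    if key in catalog:
        return catalog[key]
    # Unknown typeface: fall back to every training in the catalog,
    # joined with "+", in a fixed (sorted) order.
    return "+".join(sorted(set(catalog.values())))


if __name__ == "__main__":
    for face in ["Arial", "Times Roman", "Courier"]:
        print(face, "->", choose_training(face))
```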

IsaacKoi

posted on Mar, 29 2016 @ 04:07 AM

originally posted by: DexterRiley
When you are able to get some better scans let me know. I'd like to work on this a bit more to see if we can establish a process to streamline and enhance the digitization effort.

I've now heard back from UFO-Sweden. Unfortunately, they are not aware of any better scans being available than the PDFs I collated from their website. The scans were apparently prepared by the national archives in Sweden, at the quality which UFO-Sweden made available. UFO-Sweden had tried OCR software on those scans themselves, with similar results to us.

Unfortunately, that is probably the end of that...

On a more positive note, UFO-Sweden is organising an effort for manual translation of the entire archive. (I had hoped to get some modest automated results without the huge amount of work that manual translation presumably involves...).

So, I'll just finish up a (fairly long) thread about ghost rockets and move on to another collection of material.

DexterRiley

posted on Mar, 29 2016 @ 04:42 PM

a reply to: IsaacKoi

Thanks for the update. I've been experimenting a bit more and have achieved a slightly better result. But it's still not good enough.

I found a program that allows me to extract the graphic images from the PDF at the original resolution, which I believe is about 200 DPI. I then used ScanSoft PaperPort to perform the OCR, and achieved a slightly better result than I had in previous experiments.
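
A quick way to sanity-check a figure like that 200 DPI is to divide the pixel width of an extracted image by the physical width of the page. A minimal sketch, assuming the originals were A4 sheets (210 mm wide) scanned edge to edge; both the page size and the example pixel width are assumptions, not measurements from these PDFs.

```python
MM_PER_INCH = 25.4


def estimate_dpi(pixel_width, page_width_mm=210.0):
    """Estimate scan resolution from image width and physical page width."""
    if pixel_width <= 0 or page_width_mm <= 0:
        raise ValueError("widths must be positive")
    page_width_inches = page_width_mm / MM_PER_INCH
    return pixel_width / page_width_inches


if __name__ == "__main__":
    # 1654 pixels is a hypothetical width for one extracted page image.
    print(round(estimate_dpi(1654)))
```

If the estimate comes out well below the 400 DPI mentioned earlier in the thread, that alone may explain part of the weak OCR results.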

I'm still looking into this a bit more. I'll let you know if I make any progress.

-dex

Java tool + automatic translations of PDFs (Google Translate etc) - Any help? Foreign UFO documents