OCR Tweaking: Converting Low-Quality Scanned PDF Files
Converting a scanned PDF file can be tricky if the scan quality is poor. You may find that your conversion is messy, contains odd characters, and confuses similar-looking letters and numbers.
While we highly recommend that you try to get a good-quality scan, we understand that this isn't always possible. For that reason, PDF2XL OCR and Enterprise include OCR Tweaking options to help you improve character recognition.
First and foremost, if you map out your tables and fields and notice that no data is showing up in the Page Preview, go to the OCR tab and ensure that the Start button is enabled.
Even with the OCR running, you might see that your preview isn't very accurate. To adjust the OCR Tweaking settings:
- On the OCR tab, click OCR Options.
- Under the OCR Tweaking section, you'll see the following features:
  - Threshold: You can either let PDF2XL select an automatic monochrome threshold or set it manually. Change this setting if the scanned page is either very light or very dark.
  - Despeckle: This option makes the OCR process ignore small dots and imperfections in the scanned image. If the scanned document has a lot of 'noise', it can help enormously. To use it, check the Despeckle box and select the maximum size of dot to remove. Moving the bar to the right makes PDF2XL remove larger and larger 'dots', up to quite sizable chunks.
  - Remove lines: If this option is set, PDF2XL will try to remove vertical and horizontal lines before processing the image. This is mostly useful when processing an image scanned from old computer printout paper that has pre-printed lines on it.
  - Force DPI: This sets the dots per inch (DPI) used for the page to try to give the OCR a clearer image. The higher you move the slider, the clearer the page looks to the OCR.
There is no single correct level for any of these settings because every scanned PDF file is different, so experiment with the sliders until you get the best possible result.
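To get a feel for what the Threshold and Despeckle sliders are trading off, the short Python sketch below applies both ideas to a tiny grayscale grid. It only illustrates the general techniques. It is not PDF2XL's own code, and the function names `binarize` and `despeckle` are hypothetical. It assumes pixel values from 0 (black) to 255 (white) and treats a 'dot' as a group of touching dark pixels.

```python
from collections import deque


def binarize(gray, threshold):
    """Turn grayscale pixels into 1 (ink) or 0 (background).

    Pixels darker than `threshold` count as ink. This is the general idea
    behind a monochrome threshold, not PDF2XL's actual algorithm.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary, max_size):
    """Erase connected ink blobs made of at most `max_size` pixels."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill (up, down, left, right) to collect this blob.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Small blobs are treated as scanner noise and removed.
            if len(blob) <= max_size:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned


# Made-up 4x5 "scan": one isolated speck and a 2x2 block standing in for text.
page = [
    [255, 255, 255, 255, 255],
    [255,  40, 255, 255, 255],
    [255, 255, 255,  30,  35],
    [255, 255, 255,  20,  25],
]

ink = binarize(page, threshold=128)
clean = despeckle(ink, max_size=1)   # removes the speck, keeps the 2x2 block
too_much = despeckle(ink, max_size=4)  # also removes the 2x2 block

for row in clean:
    print(row)
```

In this toy example a `max_size` of 1 clears the speck without touching the block, while a `max_size` of 4 wipes out the block as well, which mirrors the warning above about moving the Despeckle bar too far to the right. Likewise, calling `binarize` with a very low threshold drops most of the faint ink, which is why a very light page needs a different setting from a very dark one.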
If you've set the OCR Tweaking to the best possible result and you still have some errors, there are a couple more things you can try.
- You can try changing the column/field format. For example, selecting Text forces the OCR to recognize a character as the letter 'O', not the digit '0'.
- Finally, you can run the Validation. This will allow the OCR to run through the words that it has difficulty recognizing, and allow you to correct them before converting your file. Note that it will look for all instances of poor character recognition, which means it could be a lengthy process if you have a very large, poorly scanned file.
- If you have the PDF2XL Enterprise edition, you can use the Replace List to correct repeating instances of wrongly recognized words. For example, if your document has the word "auto" several times, and the OCR recognizes it as "aut0" every time, the Replace List will make correcting them all at once quick and easy.
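The idea behind a replace list can be sketched in a few lines of Python. This is only an illustration of the concept applied to text outside PDF2XL. It does not show how the Replace List feature is implemented, and the function name `apply_replace_list` and the dictionary layout are assumptions made for the example.

```python
import re


def apply_replace_list(text, replace_list):
    """Replace each misread word with its correction, whole words only."""
    for wrong, right in replace_list.items():
        # Lookarounds keep "aut0" from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(wrong) + r"(?!\w)"
        # A function replacement inserts `right` literally, backslashes included.
        text = re.sub(pattern, lambda _match, r=right: r, text)
    return text


# Hypothetical replace list: misread word -> intended word.
replace_list = {"aut0": "auto"}

converted = "aut0 parts, aut0 repair, aut0matic"
print(apply_replace_list(converted, replace_list))
# prints: auto parts, auto repair, aut0matic
```

Matching whole words only is a deliberate choice in this sketch: it fixes every standalone "aut0" in one pass but leaves "aut0matic" alone, so a longer misread word would need its own entry in the list.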