How to downsample and convert from grayscale to black & white?

Posted By
ramon
Jun 5, 2004
I have a bunch of TIFF images that were scanned in grayscale mode at 600 dpi. Each one takes about 32 MB of disk space, and the images are typical office documents (mostly text with a few logos) that are being processed by OCR.

My main concern is: what is the best way to get as much text recognized as possible? I chose 600 dpi in order to capture even the smallest type. The grayscale scan leaves a lot of "gray dust" in areas where the original paper page was the purest white. Is there a Photoshop filter that will leave the white background really white? If such a filter exists and I apply it, will it affect OCR recognition, positively or negatively?

Since I won’t have access to the documents forever, I am trying to get the most complete file at scan time, but I may be overdoing it.

Should I reduce the sampling to 300 dpi? Or perhaps I should stick with 600 dpi but scan in black and white?

Finally, how do I change a 600dpi TIFF to 300 dpi?
How do I change a grayscale to B&W? (both with Acrobat)

My OCR software (ABBYY FineReader) takes the original file that I provide and makes a working copy, which is the one that actually gets OCR’d. The copy that I provide is 32 MB and the working copy is 100 KB. They achieve that by (1) converting from grayscale to B&W and (2) doing some compression (lossy or lossless? I don’t know).

Thanks in advance,

-Ramon F. Herrera
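
For a sense of scale, the file sizes in question can be estimated with simple arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches), which the post does not state, and ignores TIFF headers and compression; `raw_scan_bytes` is a hypothetical helper name.

```python
import math

def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed image size in bytes; each row is padded to a whole byte."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return bytes_per_row * height_px

# Assumption: US Letter page, 8.5 x 11 inches (not stated in the thread).
PAGE = (8.5, 11.0)

for label, dpi, bpp in [
    ("600 dpi, 8-bit grayscale", 600, 8),
    ("300 dpi, 8-bit grayscale", 300, 8),
    ("600 dpi, 1-bit black & white", 600, 1),
]:
    size = raw_scan_bytes(*PAGE, dpi, bpp)
    print(f"{label}: {size / 2**20:.1f} MiB")
```

Under that assumption the 600 dpi grayscale page comes to roughly 32 MiB, which is consistent with the figure quoted above. Halving the resolution cuts the size to a quarter, and going to 1-bit at 600 dpi cuts it to about an eighth, before any compression.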

arrooke1
Jun 5, 2004
> Should I reduce the sampling to 300 dpi? Or perhaps I should stick with 600 dpi but scan in black and white?

Scan as line copy (black & white) at 600 ppi. Adjust your exposure to get a suitable balance between background noise and image quality. If some pages contain images (fancy colour logos), you can scan just the image as greyscale and place it into your line copy.
Keith.
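
The trade-off between background noise and image quality can be illustrated in software with a plain fixed threshold. This is only a sketch of the idea, not what any particular scanner driver does; the pixel values and the function name `to_black_and_white` are made up for illustration.

```python
def to_black_and_white(gray_rows, threshold=128):
    """Map 8-bit gray values (0 = black, 255 = white) to 1-bit: 1 = ink, 0 = paper."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]

# Tiny hypothetical patch: faint background dust (230-245), one dark stroke
# (about 40) and one lighter, thin stroke (about 150).
patch = [
    [245, 238,  40, 242, 150, 231],
    [240, 230,  38, 244, 155, 236],
]

for threshold in (100, 180, 240):
    print(threshold, to_black_and_white(patch, threshold))
```

With the low threshold the thin stroke disappears, with the high one some of the dust turns into black specks, and the middle value keeps both strokes on a clean background. Finding that middle ground is what the exposure adjustment is for.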
xalinai_Two
Jun 5, 2004
On 4 Jun 2004 21:35:17 -0700, Ramon F Herrera wrote:

> Should I reduce the sampling to 300 dpi? Or perhaps I should stick with 600 dpi but scan in black and white?

It depends on your OCR software. Older software needed clean black-and-white scans at as high a resolution as possible. Modern software works better on grayscale scans whose dynamic range is not too large.
If you try to clean up the images before OCR, the software may assume a perfect scan and try to interpret every stray pixel as text.
If you feed the software the raw scan, it corrects the contrast itself, makes a better guess at distinguishing paper texture from real text, and produces better results.

> Finally, how do I change a 600dpi TIFF to 300 dpi?
> How do I change a grayscale to B&W? (both with Acrobat)
>
> My OCR software (ABBYY FineReader) takes the original file that I provide and makes a working copy, which is the one that actually gets OCR’d. The copy that I provide is 32 MB and the working copy is 100 KB. They achieve that by (1) converting from grayscale to B&W and (2) doing some compression (lossy or lossless? I don’t know).

FineReader works even with moderately compressed greyscale JPEGs, which saves a lot of disk space and scanning time.

Michael
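
If a 300 dpi copy is wanted anyway, going from 600 to 300 dpi means halving the pixel count in each direction. One simple way to do that, sketched below, is to average every 2x2 block of pixels. Image editors typically offer fancier resampling filters, so treat this as an illustration of the principle only; `halve_resolution` is a hypothetical helper and the input is assumed to be a non-empty grid of 8-bit gray values.

```python
def halve_resolution(gray_rows):
    """Downsample by 2 in each direction by averaging each 2x2 block of pixels.

    An odd trailing row or column is dropped to keep the sketch short.
    """
    height = len(gray_rows) - len(gray_rows) % 2
    width = len(gray_rows[0]) - len(gray_rows[0]) % 2
    result = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            block_sum = (gray_rows[y][x] + gray_rows[y][x + 1]
                         + gray_rows[y + 1][x] + gray_rows[y + 1][x + 1])
            row.append((block_sum + 2) // 4)  # rounded integer mean
        result.append(row)
    return result

# Hypothetical 4x4 patch standing in for a 600 dpi scan.
scan_600 = [
    [255, 255,   0,   0],
    [255, 251,   0,   8],
    [200, 100,  50, 150],
    [100, 200, 150,  50],
]
scan_300 = halve_resolution(scan_600)
print(scan_300)
```

Averaging on the grayscale data before any conversion to black and white keeps edge information as intermediate gray values, which is one reason to downsample first and threshold afterwards rather than the other way round.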

tacitr
Jun 5, 2004
> The grayscale leaves a lot of "gray dust" in areas where the original paper page was the purest white. Is there a Photoshop filter that will leave the white background really white?

Don’t use a filter for this. Use the Levels command.

Once you’ve created a good, crisp image, leave it at 600 pixels per inch and convert it to a bitmap (1-bit black and white); this is usually what OCR software performs best with.
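
What a Levels adjustment does to the "gray dust" can be sketched as a linear remap of the input range. The version below ignores the midtone (gamma) slider, and the black and white points are arbitrary values picked for the made-up patch, not recommended settings; `apply_levels` is a hypothetical name.

```python
def apply_levels(gray_rows, black_point=0, white_point=255):
    """Linear Levels-style remap: black_point -> 0, white_point -> 255, clipped."""
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    span = white_point - black_point
    # Precompute a 256-entry lookup table for 8-bit values.
    table = [
        min(255, max(0, round((value - black_point) * 255 / span)))
        for value in range(256)
    ]
    return [[table[value] for value in row] for row in gray_rows]

# Hypothetical patch: background dust (230-245), a dark stroke, a lighter stroke.
patch = [
    [245, 238,  40, 242, 150, 231],
    [240, 230,  38, 244, 155, 236],
]
cleaned = apply_levels(patch, black_point=40, white_point=225)
print(cleaned)
```

Every value at or above the white point is pushed to pure white, so the dust vanishes while the strokes keep their relative darkness. The result can then be converted to a bitmap as described.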

Brian
Jun 7, 2004
When I have to scan something that will be run through OCR software, I scan at 1200 ppi in linework (1-bit) mode.
