Digital Document Image Clean-Up

Posted by Rip Rapalski, Mar 7, 2004 (Views: 3640; Replies: 3; Status: Closed)

I am in the process of scanning a large amount of imagery – photos and printouts of microfilmed records, copies of newspaper articles, actual newspaper clippings, etc., – to digitally preserve them for genealogical purposes.

In the process of second-generation copying (generally from microfilming the original), the yellow age tinge of the original document and the bleed-through of text from the reverse side greatly affect the readability of the documents. With further copying, this tinge is reproduced as grey shading. Specifically, this is black ink script or text on (once, more or less) white paper. There is no color – it’s a monochrome world (until the 20th century, for photos).

I note that ancestry.com, on their 1930 census data, offers a "clean-up image" option which does provide some improvement to the images they provide. This seems to be a removal of some of the background noise. I cannot determine whether there is any enhancement of the text/script.

I am seeking suggestions from experienced digital artists, or others who have experimented with image restoration, for procedural STEPS (ideally in Photoshop) to enhance the readability of scanned documents.

There are three key areas that I suggest need to be considered:

– removal/reduction of the background shading (background noise)
– enhancement of the (desired) script text
– reduction of the reverse page image bleed-through

I believe a sequence of measures available within Photoshop and/or plug-ins could be applied to achieve a much better image than the original without a significant loss of the essential information.

I realize the last requirement (bleed-through removal) may be impossible to achieve in Photoshop, because the only difference between the text to be enhanced and the undesired text is a degree of brightness/contrast. Unfortunately, for 90% of the documents the original is not available, so one is usually dealing with third-generation images (a print copy from microfilm). Some improvement may be possible inasmuch as the reverse text is more consistent with the background noise, so enhancement of the desired script text may provide a partial solution. But do you enhance the text before or after background removal? Or does one apply an alternative, iterative process? Indeed, a solution may require a much more sophisticated approach, such as analyzing the direction of the script strokes in the image (beyond the capabilities of Photoshop).

I await some suggestions.

TIA
Rip Rapalski

john
Mar 7, 2004
In article , wrote:

I am in the process of scanning a large amount of imagery – photos and printouts of microfilmed records, copies of newspaper articles, actual newspaper clippings, etc., – to digitally preserve them for genealogical purposes. […]

Okay, correct me if I’m wrong but the summary is – your images are from very high contrast microfilm documents and have some extra tones or colors caused by copying or transforming the microfilm into conventional prints. There is also some bleed-through from the microfilm process. Correct?

What we _really need_ here is a post of an example image of a bad case, but moving on regardless: first, to restore the images to what the original microfilm would have produced, get rid of the cast caused by photographic copying by using "Filter – Other – High Pass" (adjust the slider as necessary). That will take care of a lot of the noise (color, gray cast) caused by additional copying, etc.
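
For readers who want to see what the High Pass step does to a gray cast, here is a minimal sketch in plain Python. It is not Photoshop's implementation: it assumes a grayscale image stored as a list of rows of 0-255 values, and it uses a simple box blur where Photoshop uses a Gaussian blur. The names `box_blur` and `high_pass` are made up for this illustration. The idea is to subtract a blurred copy (the slowly varying cast) from the image and re-center the result on mid-gray.

```python
def box_blur(img, radius):
    """Mean of the square neighbourhood around each pixel, clipped at the edges.

    Assumption: a box blur stands in for the Gaussian blur Photoshop uses.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            count = 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def high_pass(img, radius):
    """Subtract the blurred image (the cast) and re-center on mid-gray (128)."""
    blurred = box_blur(img, radius)
    return [
        [max(0, min(255, round(p - b + 128))) for p, b in zip(row, blurred_row)]
        for row, blurred_row in zip(img, blurred)
    ]


# Hypothetical 5x5 patch: gray-tinted paper (200) with one ink pixel (50).
page = [[200] * 5 for _ in range(5)]
page[2][2] = 50
for row in high_pass(page, 2):
    print(row)
```

A uniformly tinted page comes out as flat mid-gray (128) whatever its original tone, while ink stays darker than its surroundings. A Levels or Threshold step is still needed afterwards to bring the paper back to white.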

To get text-only from the remainder, and presuming the remainder of the image has an exceedingly short bandwidth (close to pure black, pure white), then the difficult task is before you. If the text you want is relatively conventional mechanical type (typewriter, for example), then OCR is your friend. A _good_ OCR program will recover a lot of that text, even when there is bleed-through. However, it won’t be an image any longer. It will be digital type. If you want the ‘image’ type, then IMHO you can resort to "Filter – Other – Minimum", play with the slider to see if the bleed is diminished, then touch-out the rest.
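
As a rough picture of what a minimum filter does, here is a sketch in plain Python. The assumptions: a grayscale image stored as a list of rows of 0-255 values, and a square window (Photoshop's exact window shape may differ). `minimum_filter` is a hypothetical name. Each pixel is replaced by the darkest value found within `radius` pixels of it.

```python
def minimum_filter(img, radius):
    """Replace each pixel with the darkest value in its square neighbourhood.

    Assumption: a square window of side 2*radius + 1, clipped at the edges.
    """
    h, w = len(img), len(img[0])
    return [
        [
            min(
                img[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


# Hypothetical 3x3 patch: white paper (255) with a single ink pixel (0).
patch = [
    [255, 255, 255],
    [255, 0, 255],
    [255, 255, 255],
]
for row in minimum_filter(patch, 1):
    print(row)
```

Because the darkest neighbour wins, dark areas grow as the radius increases. Photoshop's companion filter, "Filter – Other – Maximum", does the reverse, so it is worth trying both on a scan to see which one helps with the bleed.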

I look forward to tips others have.

JP Kabala
Mar 7, 2004
Have you tried working with the image threshold?
Threshold looks at an image and says
"anything darker than this is black, anything whiter than this is white"
This may help you clean up gray tones on text docs.
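
A minimal sketch of that rule in plain Python, assuming a grayscale image stored as a list of rows of 0-255 values. The function name `threshold` and the sample values are invented for illustration, and sending a pixel that sits exactly at the level to white is an assumption about tie-handling.

```python
def threshold(img, level):
    """Map every pixel darker than `level` to black (0), the rest to white (255)."""
    return [[0 if p < level else 255 for p in row] for row in img]


# Hypothetical scan line: strong ink (30), faint bleed-through (170), paper (220).
scan = [[30, 170, 220, 30, 220]]
print(threshold(scan, 128))
```

With the level set between the ink and the bleed-through, the bleed drops out along with the gray paper tone. If the bleed is as dark as the ink, no single level will separate them.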

"Rip Rapalski" wrote in message
I am in the process of scanning a large amount of imagery – photos and printouts of microfilmed records, copies of newspaper articles, actual newspaper clippings, etc., – to digitally preserve them for genealogical purposes. […]

Rip Rapalski
Mar 7, 2004
(jjs) wrote:

In article , wrote:

I am in the process of scanning a large amount of imagery – photos and printouts of microfilmed records, copies of newspaper articles, actual newspaper clippings, etc., – to digitally preserve them for genealogical purposes. […]

Okay, correct me if I’m wrong but the summary is – your images are from very high contrast microfilm documents and have some extra tones or colors caused by copying or transforming the microfilm into conventional prints. There is also some bleed-through from the microfilm process. Correct?

Yes

What we _really need_ here is a post of an example image of a bad case, but moving on regardless: first, to restore the images to what the original microfilm would have produced, get rid of the cast caused by photographic copying by using "Filter – Other – High Pass" (adjust the slider as necessary). That will take care of a lot of the noise (color, gray cast) caused by additional copying, etc.

I will forward you an example. Is that your correct e-mail address above, or does one lose the "xyzzy"?

To get text-only from the remainder, and presuming the remainder of the image has an exceedingly short bandwidth (close to pure black, pure white), then the difficult task is before you. If the text you want is relatively conventional mechanical type (typewriter, for example), then OCR is your friend. A _good_ OCR program will recover a lot of that text, even when there is bleed-through. However, it won’t be an image any longer. It will be digital type. If you want the ‘image’ type, then IMHO you can resort to "Filter – Other – Minimum", play with the slider to see if the bleed is diminished, then touch-out the rest.

A significant majority of the documents are script based, i.e., pre-typewriter, so OCR (at least the current art; I have not yet opened the door into handwriting recognition {OHR?}) is applicable only to part of the documents. Yes, for historical context, retention of the original "image type" is desired.

"touch-out the rest"... sounds like a labor of love will be required.

Thank you for your input.

Rip

I look forward to tips others have.
