A Caveat About Online Digital Newspapers

Over on the FamilySearch TechTips Blog, James Tanner has written a great article about the growing number of newspapers available online. I absolutely love his opening statement:

Throughout the world, local and national organizations, including governments, are realizing that much of their national heritage, culture and history has been chronicled in newspapers.

This is certainly true, and it is the reason I absolutely love historical papers – so much history is embedded within them! James then discusses a number of online sources where you can access digitized versions, including Chronicling America, Google News Archive, and more. Anyone pursuing family history should definitely read this article and become familiar with these online resources. The links he embeds to online lists are particularly helpful.

But I caution the reader against becoming too persuaded by the notion, as James states, that:

Today, most of the online newspaper archives are completely searchable by any word in the newspaper. Searching for an ancestor’s name is no more difficult than it is searching in Google or any other online search engine.

Lo and behold, it is not that easy! Sure, many papers do offer some level of keyword searching, but results can be spotty in many cases because the indexes being searched are created with Optical Character Recognition (i.e., computers trying to determine what is on the page). In other situations, CAPTCHA technologies are used to improve the record of what is on the page. But much still goes missing when you rely upon a keyword search, because these processes are far from perfect.

Here is an example – on the Chronicling America website, the Columbia Herald’s May 6, 1866 issue has a notice on page 2 (column 6) of several suicides/attempted suicides by people in the community – one of those being a Mr. Fountain Cleveland.

I found this notice by browsing – the page-by-page method to which James refers. Now that I know it is there, will this article come up if I search his name using the Chronicling America search interface? They even offer the ability to limit by state and year, and to do phrase searching.

And, what are the results?

Nope! This search does not work because the word “Cleveland” is split with a hyphen, and OCR has not put the whole word together. This is just one of many ways that a keyword search can fail. If you spend some time comparing known information against what you can access via a search, you will quickly see other types of discrepancies as well.
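For the technically curious, here is a minimal sketch of how a hyphen split can hide a name from a keyword search. It assumes the hyphen falls at the end of a line in the OCR text. The sample text and the function names (naive_tokens, rejoin_hyphenated) are made up for illustration; this is not the actual OCR output of the notice, nor how Chronicling America builds its index.

```python
import re

def naive_tokens(text):
    """Split text into lowercase words, roughly as a simple keyword index might."""
    return set(re.findall(r"[a-z]+", text.lower()))

def rejoin_hyphenated(text):
    """Join a word split by a hyphen at a line end: 'Cleve-\\nland' -> 'Cleveland'."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

# Hypothetical OCR text (an assumption), not the real Columbia Herald notice.
ocr_text = "attempted suicide by Mr. Fountain Cleve-\nland of this county"

# The raw text is indexed as "cleve" and "land", so the surname is never found.
print("cleveland" in naive_tokens(ocr_text))                     # False

# After rejoining the line-end hyphen, the surname becomes searchable.
print("cleveland" in naive_tokens(rejoin_hyphenated(ocr_text)))  # True
```

A fix like this is not free either: it would also glue together words that are legitimately hyphenated when they happen to break across a line, which is one reason no automated process gets everything right.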

These types of search challenges are inherent in many other repositories as well. It happens with Google News Archive, with the Internet Archive, with smaller collections of newspapers maintained by universities and/or other interest groups, and with the Australian TROVE collection; it happens in all digital collections. This is the reason I dedicate time to indexing from online newspapers – to help make the content even more useful for all of us looking. Not that human intervention is 100% accurate either, but it can certainly enhance the digital access.

Access to these collections is indeed wonderful and I wouldn’t trade them for the world. But it is important that we are all acutely aware of their limitations. Thank you, James, for the inspiration – this is actually a blog post I’ve been working on off and on for a while 🙂

Comments (11)

  1. Denise Coughlin

    Very good point. I usually start going page by page in a newspaper where I know something should be there but is not coming up with the OCR. Of course I end up reading the paper and laugh at the total lack of bias in some of the articles! But love the local gossip columns with the comings & goings of everyone!!

  2. taneya (Post author)

    the bias in the articles is something indeed. makes you really wonder sometimes :-). Glad you liked the post!

  3. Rorey Cathcart

    This is something I have run into repeatedly in Chronicling, GenealogyBank, Google Archives et al. I’ve come to the conclusion, for the time being at least, that these newspaper repositories are best for searching for something you expect to be there. For example, a wedding or death announcement.

    Common surnames are almost impossible. Try doing a search of any of these repositories for the name Smith, as I did recently for a client. Even specifying a time frame and geographic region produces more hits than I could sort through. And, as you mentioned in your post, doesn’t account for all the missed indexes as well.

    On the upside though, at least now so many papers are digitized that later and greater OCR can be applied to them in the future. For those with a unique name or a narrow focus, good results can be found. And it’s more than we used to have.

    Thanks for the article.

  4. taneya (Post author)

    great point about applying better OCR in the future Rorey. I do hope that the technology continues to get better and make these wonderful resources even more useful than they already are. thanks!

  5. Shelley

    Great post! The statement about it being easy to search newspapers jarred with me as well. More strategies than just typing in a name are needed to find useful content.

  6. taneya (Post author)

    Thx Shelley. Yes, newspapers are a great asset but do require time to search through 🙂

  7. WRich

    Thanks for the great info. I am grateful just to have the newspapers online as I have no way to go to any repositories.

  8. taneya (Post author)

    oh, i definitely agree they are wonderful to have online! no argument there :-). Newspapers have such rare tidbits of information and really give color to local communities!

  9. Lisa Gorrell

    I love your post. The digital newspapers online have been wonderful and I wish more small town newspapers were available. One of the things that makes it hard for the index, and that hasn’t been mentioned, is the way the OCR “reads” letters. My very uncommon maiden name “HORK” is “read” by OCR many times as “York” or “Work” because of the not-so-clear quality of the microfilm or newspaper itself. So I echo the earlier sentiment of a better OCR process.

  10. taneya (Post author)

    good point Lisa. I too have seen many letters misread. oh the day when that better OCR does come along! 🙂

  11. Pingback: TNGenWeb Historical News Portal Featured at NEH Meeting | TNGenWeb Project, Inc.
