Over on the FamilySearch TechTips Blog, James Tanner has written a great article about the growing number of newspapers that are available online. I absolutely love his opening statement:
Throughout the world, local and national organizations, including governments, are realizing that much of their national heritage, culture and history has been chronicled in newspapers.
This is certainly true, and it is the reason I absolutely love historical papers – so much history is embedded within them! James then discusses a number of online sources where you can access digitized versions – including Chronicling America, Google News Archive, and more. All those in the pursuit of family history should definitely read this article and become familiar with these online resources. The links he embeds to online lists are particularly helpful.
But I caution the reader against being too persuaded by the notion, as James states, that:
Today, most of the online newspaper archives are completely searchable by any word in the newspaper. Searching for an ancestor’s name is no more difficult than it is searching in Google or any other online search engine.
Lo and behold, it is not that easy! Sure, many papers do offer some level of keyword searching, but results can be spotty in many cases because the technology used to create the indexes being searched relies on Optical Character Recognition (i.e. computers trying to determine what is on the page). In other situations, Captcha technologies are used to improve the record of what is on the page. But much still goes missing when you rely upon a keyword search, because these processes are far from perfect.
Here is an example – on the Chronicling America website, the Columbia Herald’s May 6, 1866 issue has a notice on page 2 (column 6) of several suicides/attempted suicides by people in the community – one of those being a Mr. Fountain Cleveland.
I found this notice by browsing – the page-by-page method to which James refers. Yet, now that I know it is there, will this article come up if I search his name using the Chronicling America search interface? They even offer the ability to limit by state and year, and to do phrase searching.
And, what are the results?
Nope! The reason this search does not work is that the word “Cleveland” is split with a hyphen – OCR has not put the whole word together, so the index never contains the full surname. This is just one of many ways that a keyword search can fail. If you spend some time comparing known information against what you can access via a search, you will quickly see other types of discrepancies as well.
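For the technically curious, here is a minimal Python sketch of why a hyphen-split word defeats a phrase search, and how rejoining the pieces can recover it. The sample text is made up for illustration (it is not the actual OCR from the Columbia Herald), and the function names `naive_search` and `dehyphenate` are my own. I am also assuming the split falls at the end of a line. This is not how Chronicling America or any other archive actually builds its index.

```python
import re

# Hypothetical OCR output, invented for illustration only. The surname is
# split by a hyphen at the end of a line, as OCR often leaves it.
ocr_text = "attempted suicide by Mr. Fountain Cleve-\nland of this county"


def naive_search(text, phrase):
    """Exact, case-insensitive phrase match against the text as given."""
    return phrase.lower() in text.lower()


def dehyphenate(text):
    """Rejoin words split by a hyphen at a line end, then flatten line breaks."""
    # "Cleve-\nland" -> "Cleveland"
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse any remaining runs of whitespace (including newlines) to one space.
    return re.sub(r"\s+", " ", joined)


# Searching the raw OCR text misses the name; searching the rejoined text finds it.
print(naive_search(ocr_text, "Fountain Cleveland"))
print(naive_search(dehyphenate(ocr_text), "Fountain Cleveland"))
```

Even this fix is imperfect: it would also glue together words that are genuinely hyphenated and happen to break at a line end (turning "well-" and "known" into "wellknown"), which is one more reason automated indexes can never be fully trusted.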
These types of search challenges are inherent in many other repositories as well – it happens with Google News Archive, it happens with the Internet Archive, it happens with smaller collections of newspapers that are maintained by universities and/or other interest groups, it happens in the Australian TROVE collection, it happens in all digital collections. This is the reason I dedicate time to indexing from online newspapers – to help make the content even more useful for all of us looking. Not that human intervention is 100% accurate either, but it can certainly enhance the digital access.
Access to these collections is indeed wonderful and I wouldn’t trade it for the world. But it is important that we are all acutely aware of their limitations. Thank you, James, for the inspiration – this is actually a blog post I’ve been working on, off and on, for a while.