Statistics and the 1940 Census US Community Project Society Dashboard

This morning I was quite happy to see that the US Community Project has shared information from societies participating in the indexing on their Society Dashboard.

I am pleased that the group I’ve coordinated, the TNGenWeb Project, has placed 10th in the list of “large” societies! Our group currently has 36 members and they are all doing an awesome job. However, my pleasure is seriously hampered by what appear to be methodological problems in how these numbers were calculated and posted.

1) The first list on the page reports the Top 10 Societies for the number of records indexed “per capita.” Later on the page, there is another table showing the top societies for the highest number of records indexed on average. Per capita is itself a measure of the average (total records divided by the number of members), so it is not necessary to have both tables. This also holds true for the arbitration tables on the page.

2) FamilySearch is categorizing societies into “small” (fewer than 16 members) and “large” (16 or more members). Thus, their tables showing the highest numbers of records indexed on average are presented as two tables: one for the small societies, and one for the large societies. However, the table shown for small societies is the exact same table as the per capita list (the first one on the page). This does not make sense, since the “per capita” list at the top (even if they really meant to have a per capita list) should include all societies, not just the small ones. Essentially, that first per capita list is not needed; not only does it repeat a later table, but it omits the large societies.

3) Reporting the “average” number of records indexed as the typical value assumes that when you plot the data in a histogram it has a roughly normal distribution (which means it looks like a bell-shaped curve). Without getting too technical, telling someone what the “average” of the group is assumes that most people in the group are working at about the same level, within a specific range, and that this range sits around the middle of the data set values. I would be willing to bet that, of all the thousands of indexers participating in this effort, we are not all working at the same productivity level. There are probably a few indexers who are transcribing very high numbers of names, and many, many more who are indexing far fewer. This would produce a data set that is skewed (and therefore NOT a bell-shaped curve).

Here is the curve for the 35 indexers from our group who have indexed records (one person has not) as of 4pm CST today:

What this graph shows is that there are many indexers who have transcribed fewer than about 1,800 records and very few indexers who have transcribed more than 6,000 records. The high point is off to the left, which means this data set is skewed. Therefore, to better understand the “middle” of the data set (which is what an “average” is meant to report), it is more accurate to report our median instead of our average. Our group’s “average” is about 1,648 records indexed; our median is 1,016 records indexed. That is a big difference. I would love to know whether the numbers of records done by all the indexers for the 1940 census are skewed or not. I would be willing to bet that they are, just given the nature of the work we are doing. If the data set is not following a bell-shaped curve, then FamilySearch should be reporting the medians.
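To make the average-versus-median point concrete, here is a small sketch in Python. The record counts in it are made up purely for illustration (they are NOT our group's actual numbers); they simply mimic the shape I described, with most indexers at the low end and a couple of very heavy contributors.

```python
from statistics import mean, median

# Hypothetical record counts for ten indexers (illustrative only, not real data):
# most have indexed a modest number, and two have indexed a great deal.
records_indexed = [150, 300, 450, 600, 800, 1000, 1200, 1500, 6500, 9000]

average = mean(records_indexed)    # pulled upward by the two heavy contributors
middle = median(records_indexed)   # the value that splits the group in half

print(f"average: {average:,.0f}")
print(f"median:  {middle:,.0f}")
```

With numbers shaped like these, the average lands well above what most of the individual indexers actually did, while the median stays close to the typical indexer. That is the same pattern I see in our group's real figures.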

4) FamilySearch is reporting these values as values for April 2012, but the month of April is not even over yet.  What was the cutoff date for this data set? They should have reported the dates covered by this report.

5) Do the “averages” reported also include the non-contributors in a group? If the numbers reported do not include the non-contributors, then I question the need to divide the contest between small and large societies. Even with a median value reported, if the data set is limited only to those contributing, then it is entirely possible that a small society could be far more productive than a larger one, so why make the division?
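As a rough sketch of why this question matters, the Python snippet below compares the two ways of computing a group average using our own head counts (36 members, 35 of whom have indexed). It assumes that my figure of about 1,648 is the average over the 35 contributors only, and the total it works from is therefore an approximation rather than a reported number.

```python
members = 36        # everyone signed up with our group
contributors = 35   # members who have indexed at least one record

# Assumption: ~1,648 is the average over contributors only,
# so the group total is approximated by multiplying back out.
avg_over_contributors = 1648
approx_total = avg_over_contributors * contributors

# The same total spread over every member, including the non-contributor.
avg_over_all_members = approx_total / members

print(f"approximate total:        {approx_total:,}")
print(f"average per contributor:  {avg_over_contributors:,}")
print(f"average per member:       {avg_over_all_members:,.0f}")
```

For our group the gap is small because only one person has not indexed, but a society with many inactive members would see a much larger difference, which is why the report should say which denominator was used.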

I would love to know more about how the data was analyzed and perhaps learn I am incorrect in some of my points, but from what I’ve seen today, I can’t trust the data shown. I understand that we are all in this to contribute to a worthwhile cause and I am thrilled to do so. However, if this is going to be a contest, then FamilySearch should at the least report the data accurately. Ideally, I would love to speak to whoever generated this posting so I can better understand how the report was derived.

More to come as I learn it! :-)
