Handwriting recognition software for indexing?

spencerschumann
New Member
Posts: 4
Joined: Mon Jul 31, 2017 3:40 am

Handwriting recognition software for indexing?

Postby spencerschumann » Mon Jul 31, 2017 5:58 am

I saw a snippet of video yesterday depicting pioneers pushing handcarts over the plains and through rivers. I marveled at how far we've come technologically since then. This journey was a tremendous, difficult, even life-threatening task for the early saints, but today we can simply board a plane or hop in a car and comfortably make the same trip in a matter of hours. I liken that experience of the pioneers to the massive genealogical indexing labor undertaken by saints of our day. While it doesn't threaten our personal safety, it is nonetheless a long and difficult process, a seemingly endless journey of small steps as each record is carefully transcribed.

My personal interest in family history has been growing recently. I served a mission in Brazil nearly 20 years ago. While exploring my family tree, I was surprised to learn that my grandfather had an aunt who moved to Brazil and raised a family there. I have very little information about this family other than some names and some rough dates. I saw firsthand the need for indexing as I tried to find more information about them. Searches turned up no direct hits, but I did find a large collection of handwritten birth records that hadn't been indexed. I tried to manually find information about this family in those records, but was unsuccessful. There's just too much data for one person to filter through.

Much like the leap from handcarts to planes, handwriting recognition (HWR, https://en.wikipedia.org/wiki/Handwriting_recognition, an advanced form of OCR) is the obvious technological step to hasten the work of indexing. I'm sure I'm not the only one to have thought of this, and I'm sure it has been used already in this work (for example, there's a mention of computer-indexed records at https://tech.lds.org/forum/viewtopic.php?f=58&t=28978). But today indexing is still primarily a manual task.

Despite being able to travel in speed and comfort now, we admire our pioneer ancestors for their strength and perseverance in completing the task given to them with the tools that were available at the time. In the same way, I expect that success in automatic indexing would not reduce our appreciation for the long hours of manual indexing labor that are being donated by people today.

I submit that the time is ripe for us to diligently work toward making automatic indexing a reality.

There have been many recent advances in AI, and we're seeing it used more and more in areas such as voice recognition and in the progress toward self-driving cars. We have a wealth of tools at our disposal that didn't exist just a few years ago. I am a software engineer. Around 15 years ago while in college I studied AI and machine learning, but I haven't done much with those subjects since then. I've recently come across several articles about AI that have sparked new ideas in my mind. I woke during the night last night, and as I lay awake I pondered those ideas. I was then struck with the thought that those ideas could be applied directly to the work of indexing, and that perhaps it is part of my life's mission to apply the knowledge and skills I've been given to advance the work of family history in this way.

My next thought after this realization was that this effort needs more hands than mine alone. I wondered what would be the best way to organize a community around this effort, and I soon found this forum, which is full of other like-minded individuals with a variety of skills that can be applied to this task. Who else out there would like to work on this ambitious goal?

drepouille
Senior Member
Posts: 1678
Joined: Sun Jul 01, 2007 5:06 pm
Location: Plattsmouth, NE

Re: Handwriting recognition software for indexing?

Postby drepouille » Mon Jul 31, 2017 8:13 am

Many printed obituary source documents available at FamilySearch contain the warning, "This record was indexed by a computer; there may be errors."
Dana Repouille, Plattsmouth, Nebraska

davesudweeks
Senior Member
Posts: 835
Joined: Sun May 09, 2010 8:16 pm
Location: Owasso, OK, USA

Re: Handwriting recognition software for indexing?

Postby davesudweeks » Mon Jul 31, 2017 10:09 am

drepouille wrote:Many printed obituary source documents available at FamilySearch contain the warning, "This record was indexed by a computer; there may be errors."


And there are. The software is pretty good with the letters, but getting the relationships correct with a complex document like an obituary fails more often than it succeeds. I'm ok with all that if the original is available to verify, but too many obituaries that have been indexed by computer are not available (or require a paid subscription) to view. :x

sbradshaw
Senior Member
Posts: 3771
Joined: Mon Sep 26, 2011 8:42 pm
Location: Provo, UT

Re: Handwriting recognition software for indexing?

Postby sbradshaw » Mon Jul 31, 2017 12:28 pm

I know that FamilySearch is looking into and experimenting with OCR and handwriting recognition, but I have no idea where they are in the process. If we want to organize a community effort, we'd want to be sure we're working with FamilySearch – otherwise a lot of effort will be wasted retreading the work that's already been done. Unfortunately, there aren't many developers who regularly visit the LDSTech forums (I don't know if I've ever seen any FamilySearch employees here), so the best bet would be sending feedback through the FamilySearch site.
Samuel Bradshaw • If you desire to serve God, you are called to the work.

spencerschumann
New Member
Posts: 4
Joined: Mon Jul 31, 2017 3:40 am

Re: Handwriting recognition software for indexing?

Postby spencerschumann » Mon Jul 31, 2017 5:13 pm

This is my first post here, so I wasn't sure what to expect. Thanks for the quick replies! I've posted a question on the FamilySearch site at http://gsfn.us/t/50zoz at sbradshaw's suggestion. I agree that avoiding duplicate work is important.

From what I can gather, OCR for printed text is getting good, but handwriting recognition, especially for cursive, lags behind. And unfortunately most of the handwritten records I've seen are written in organic, flowing script that I myself have a hard time reading.

It's too bad that the originals are often not freely available! But some are, and the ones that have already been indexed provide a wealth of training and testing data, which can be one of the big hurdles in making machine learning work.

Even before we can get to fully automated indexing, there's a helpful baby step: let the computer do its best first, then let a human review the output and correct any errors. Also, a computer could examine completed records to look for transcription errors. If the false positive rate is low enough, even this stage could be a helpful improvement over the fully manual approach.
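To make that baby step concrete, here is a minimal sketch of routing only the uncertain fields to a human reviewer. It assumes the recognizer reports a confidence score per field; the `FieldGuess` type, the `triage` function, the 0.9 threshold, and the sample values are all hypothetical and not taken from any FamilySearch tool.

```python
from dataclasses import dataclass


@dataclass
class FieldGuess:
    field: str         # which field of the record, e.g. "surname"
    text: str          # the recognizer's best transcription
    confidence: float  # hypothetical score in [0, 1] reported by the recognizer


def triage(guesses, threshold=0.9):
    """Split recognizer output into auto-accepted fields and fields for human review."""
    accepted, review = [], []
    for guess in guesses:
        if guess.confidence >= threshold:
            accepted.append(guess)
        else:
            review.append(guess)
    return accepted, review


# Made-up example output for one birth record.
guesses = [
    FieldGuess("given_name", "Maria", 0.97),
    FieldGuess("surname", "Schumann", 0.62),
    FieldGuess("birth_year", "1923", 0.91),
]
accepted, review = triage(guesses)
print("needs review:", [g.field for g in review])
```

The threshold is the knob that trades reviewer time against the false positive rate: raising it sends more fields to a human, lowering it accepts more machine output unchecked.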

sbradshaw
Senior Member
Posts: 3771
Joined: Mon Sep 26, 2011 8:42 pm
Location: Provo, UT

Re: Handwriting recognition software for indexing?

Postby sbradshaw » Mon Jul 31, 2017 8:09 pm

Commercial handwriting recognition gets better every year – though most consumer applications have the advantage of the person writing directly on a tablet or writing board, where hand movement can be tracked, and you don't have to deal with ink blots, fading text, or translucent pages with bleed-through from behind. Having a large corpus of human-verified text from similar documents would be key – FamilySearch's corpus is pretty good, but sometimes there are interpolations (like when the indexing project instructions have the indexer expand an abbreviation or skip punctuation). I'm certain that the technology will keep getting better every year, and eventually surpass the average human at interpreting text.

Another forum where you might get some traction is the FamilySearch Yammer community – I have seen employees interact with the community there.
Samuel Bradshaw • If you desire to serve God, you are called to the work.

russellhltn
Community Administrator
Posts: 24038
Joined: Sat Jan 20, 2007 2:53 pm
Location: U.S.

Re: Handwriting recognition software for indexing?

Postby russellhltn » Mon Jul 31, 2017 8:47 pm

spencerschumann wrote:let the computer do its best first, then let a human review the output and correct any errors.

I think that's actually a step backwards. From what I've seen, it's easier to "enter what you see" than to sit and compare two fields looking for mistakes.

I'm pretty sure FS uses multiple data-entry persons: accepting matches and sending discrepancies to arbitration. The computer might be useful as an "entry" person, with its output accepted if validated by other human entries. But it depends on the accuracy. If the quality is sub-human it still might have much to add.
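A rough sketch of that idea, treating the computer as just one more "entry" person: independent entries for a field are compared, agreement is accepted, and disagreement goes to an arbitrator. The `reconcile` function, the normalization rule (trim whitespace, ignore case), and the `min_agree` parameter are assumptions for illustration, not how FamilySearch actually does it.

```python
from collections import Counter


def reconcile(entries, min_agree=2):
    """Compare independent transcriptions of one field.

    Returns (value, None) when at least `min_agree` entries match after
    normalization, otherwise (None, candidates) so the field can be sent
    to arbitration.
    """
    if not entries:
        raise ValueError("need at least one entry")
    # Hypothetical normalization: ignore case and surrounding whitespace.
    normalized = [entry.strip().casefold() for entry in entries]
    counts = Counter(normalized)
    value, votes = counts.most_common(1)[0]
    if votes >= min_agree:
        return value, None
    return None, sorted(counts)


# One human entry plus one machine entry that agree: accepted.
agreed = reconcile(["Schumann", "schumann "])

# The machine misreads "m" as "rn": the field goes to arbitration.
disputed = reconcile(["Schumann", "Schurnann"])
```

With this arrangement a machine entry is never accepted on its own; it only saves work when it matches a human entry.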
Have you searched the Wiki?
Try using a Google search by adding "site:tech.lds.org/wiki" to the search criteria.

lidlurch
New Member
Posts: 1
Joined: Sun Dec 03, 2017 1:02 pm

Re: Handwriting recognition software for indexing?

Postby lidlurch » Sun Dec 03, 2017 1:16 pm

I completely agree with you! I'm an intern for the church right now and I work in their Business Intelligence department. I can tell you that the church isn't exactly on the bleeding edge of technology because of limited resources and the fact that it's not exactly competing with Google or Baidu. I do believe, however, that if the church open-sourced the millions of records that have been indexed and arbitrated successfully, a neural network could easily work through all that data and start indexing by itself (and the tools we'd need are open source tools). I would love to join you in this effort. I actually had a similar spark of inspiration yesterday while I was taking pictures of each page of my mission journal to digitize it. I had the thought that there must be some easier way; then, like a bolt of lightning, the idea came to my mind that an AI used for handwriting recognition could work through this easily!

I'm no expert in machine learning. I just started Andrew Ng's course on Coursera last week, but the concepts are simple enough and the technology is there. TensorFlow might be a good option for this, and there are plenty of open source libraries for handwriting recognition nowadays. The only scarce resources here are enough training and testing data, as well as a team of people who understand how to apply AI to a project of this scale. I'm not at that level yet, but this idea came to my mind yesterday so suddenly that I feel like it was a spark of inspiration! And I would love to join in with anybody else interested in this effort.

We've been blessed with these tools, and they certainly would not remove the need for help with indexing. But they would accelerate and hasten the work in a way that no number of humans could, with increasing accuracy!

A couple of thoughts with my limited knowledge of how machine learning works...
- Different sets of records may require different neural networks in order to optimize accuracy. This would help the network learn more quickly and get stumped less frequently because of handwriting differences.
- This could really remove the obstacle of having different languages needing to be indexed. If you feed an AI system enough data, it could "learn" how to translate any language.

sbradshaw
Senior Member
Posts: 3771
Joined: Mon Sep 26, 2011 8:42 pm
Location: Provo, UT

Re: Handwriting recognition software for indexing?

Postby sbradshaw » Sun Dec 03, 2017 1:46 pm

BYU has a research team that has been working with FamilySearch. They recently won second place in an international competition:
https://cs.byu.edu/article/cs-students-win-award-international-software-competition
Samuel Bradshaw • If you desire to serve God, you are called to the work.

devinbost
New Member
Posts: 7
Joined: Sun Aug 03, 2014 2:39 pm

Re: Handwriting recognition software for indexing?

Postby devinbost » Tue Dec 19, 2017 12:45 am

Today, the best algorithms for things like this involve Convolutional Neural Networks (CNNs). They are very good at this type of image processing, as well as at many of the other parts of a machine learning pipeline, such as image segmentation and character segmentation, object recognition, dealing with distortion and noise, etc.
Here's a good visualization of how neural networks operate: http://playground.tensorflow.org/
(Note that the visualization is for a very simple neural network that is only performing a binary classification, but it's a good example for starters.)
If you want to learn more about neural networks, I recommend taking Andrew Ng's course on Machine Learning on Coursera.com.
If you want to go deeper, his Deep Learning course is pretty impressive.
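For anyone curious what the "convolutional" part means, here is a tiny pure-Python sketch of the core operation: sliding a small kernel over an image patch and summing the products. The `conv2d_valid` helper, the 4x4 patch, and the hand-picked kernel are all made up for illustration; a real CNN learns its kernels from training data and stacks many such layers, usually through a library such as TensorFlow.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1.

    This is the cross-correlation that CNN layers compute (the kernel is
    not flipped). Both arguments are lists of equal-length rows.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 4x4 patch containing a vertical pen stroke in column 1 (1 = ink, 0 = paper).
patch = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]

# Hand-picked kernel: positive where ink sits directly to the left of paper.
edge_kernel = [
    [1, -1],
    [1, -1],
]

response = conv2d_valid(patch, edge_kernel)
for row in response:
    print(row)
```

The output is strongly positive along the right edge of the stroke, negative along its left edge, and zero over blank paper, which is the kind of low-level feature the first layer of a handwriting model tends to pick up.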

(Disclaimer: I lead the research on machine learning and artificial intelligence at a software company.)

