SuperPAF/HyperPAF Idea

Discussions around Genealogy technology.
User avatar
huffkw
Member
Posts: 54
Joined: Sun Jan 21, 2007 6:34 pm
Location: Spanish Fork, Utah
Contact:

Suggestions on private firm involvement in genealogy program development

#11

Post by huffkw »

Suggestions on private firm involvement in genealogy program development

Here is what I hope to see happen.

1. The Church sponsors the development and operation of this whole cooperative central “finished genealogy” database concept. That way, it can be sure the underlying basic software is appropriate and reliable. I see this database as the summary and index for the whole genealogy world, and it therefore needs to be done well.

Trying to chop that single development project into internal Church-created pieces, and external private company-created pieces, while maintaining consistency of concept and quality of code, sounds like a project manager’s coordination nightmare to me. It would probably be better to have a nice conceptual split between the two groups.

2. With the basic database in operation, supplying unique person numbers, the main focus of new private company work would be outside that database, exploiting all the new possibilities for storing genealogy-related media on the web, and then finding, collecting, and displaying that data, publishing it in one-of-a-kind books, formatting and editing it for use on family websites and wikis, storing copies of the collected data on genealogists’ PCs, etc.

All this web-stored data could be tied together by the central database person number. That is the logical breakthrough that can gradually overcome the chaos of today’s genealogy web research and data storage. Having a reliable person number available is probably far more valuable to private companies’ profitability than helping develop new ways to access the Church’s ordinance files, membership files, etc.

We might view the new person number as a development like the GPS satellites that allowed a huge flurry of commercial and consumer gadgets to be developed and sold.

The private companies would need to create the peer-to-peer client-based programs (simple at first, and then increasingly complex) that would do the “loosely coupled distributed database” processing to help find and assemble media from anywhere on the web, usually tied to individuals identified in the central database. (PAF-like products might find it useful to add this new data element of “central database person number” alongside their internal Record ID, Ancestral File ID number, etc., and add in some of the code to do the consolidating of web data.)
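
As a rough sketch of the idea (the field names, person-number format, and media-registry lookup below are all invented for illustration, not any actual PAF or Church data format), a PC-side record carrying the central database person number might look something like this:

```python
# Hypothetical sketch only: a PAF-like record extended with a central person number,
# used to gather media that independent websites have registered under that number.
from dataclasses import dataclass, field

@dataclass
class PersonRecord:
    local_record_id: int               # the PC program's own internal key
    ancestral_file_id: str | None      # legacy AFN, if known
    central_person_number: str | None  # the proposed shared key tying web data together
    name: str
    media_urls: list[str] = field(default_factory=list)

def collect_web_media(person: PersonRecord, registries: dict[str, list[str]]) -> list[str]:
    """Gather media URLs published anywhere on the web under this person's central number."""
    if person.central_person_number is None:
        return []
    return registries.get(person.central_person_number, [])

# Example: two independent sites have registered media under the same person number.
registries = {
    "CPN-0001234": [
        "http://example.org/photos/grandfather_1923.jpg",
        "http://example.net/journals/family_journal.pdf",
    ],
}
me = PersonRecord(1, "1ABC-DE2", "CPN-0001234", "John Doe")
print(collect_web_media(me, registries))
```

The point is only that one shared key lets loosely coupled programs find each other’s data; the actual record layout would be up to each vendor.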

At a genealogy conference I attended about two years ago (was it at the BYU Family History Technology conference, noon session?) a librarian from the Midwest forcefully made the point that the world is producing mountains of new data in numerous new media formats, but that there are few good places to store this data as it relates to genealogy and specific people. Since then we have seen YouTube.com, MySpace.com, photo storage sites, etc., but has anyone seen such storage options for more serious genealogy research purposes? I don’t know of any. This sounds like a rich mine for private companies to dig in.

3. The traditional PAF-like programs would continue to play an important role for a long time, although that role might gradually change a bit. Right now, I believe nearly all the good data in the world is stored in a PAF-like data file on someone’s PC. Normally, if it is in a central file somewhere, it has been degraded or trimmed in various ways on the way in. What is needed is a central file (system) that stores data of a quality as high as or higher than that found in PC holdings. That is what I am suggesting. But at least for Americans, the PC-held data will probably still be the trusted gold standard for a long time.

Although much new data will probably be keyed directly into the central database, especially links between families in separate submission spaces, large numbers of researched names will probably continue to be assembled and verified in PC-based, PAF-like programs and then uploaded and linked into the central file. But these uploads will likely be much smaller than the initial uploads. (Periodic downloads containing all updates will be available for backups, for analysis, and for family purposes no one has thought of yet.)
mikael.ronstrom
New Member
Posts: 11
Joined: Sun Jan 28, 2007 2:22 pm
Location: Sweden

About SuperPAF

#12

Post by mikael.ronstrom »

Great ideas. I've had similar ideas for 15 years, and the technology that makes it possible to start
implementing this vision is now becoming available. Let me add some more ideas to this.

You mention that it should be possible to link between source records and people's
genealogies using URLs of source records. I think it needs to be even more detailed
than that. First, one should be able to make references to part of a page, essentially
drawing a rectangle around something in the source record and creating a link record
from that rectangle. Moreover, it should be possible to link this record to other
things, such as transcribed records of the source. After some time there will be
many links, some of high quality and some of low quality, so it should be
possible to use both manual and automatic reference techniques to measure the
quality of links.
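
To make that concrete, here is a minimal sketch of what such a link record could contain (all field names, the fractional rectangle, and the quality score are my own illustrative assumptions):

```python
# Illustrative link record tying a rectangular region of a source image
# to a person entry and an optional transcription. Field names are assumptions.
from dataclasses import dataclass

@dataclass
class SourceLink:
    source_image_url: str                    # where the scanned page lives
    rect: tuple[float, float, float, float]  # (left, top, width, height) as fractions of the page
    person_id: str                           # identifier of the person the region refers to
    transcription_url: str | None            # optional link to a transcribed version of the record
    quality: float                           # 0.0 (doubtful) .. 1.0 (verified), set manually or automatically

link = SourceLink(
    source_image_url="http://example.org/parish/1851/page_041.jpg",
    rect=(0.12, 0.55, 0.40, 0.06),
    person_id="CPN-0001234",
    transcription_url="http://example.org/parish/1851/page_041.txt",
    quality=0.8,
)
```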

The basis of those ideas is written down in my Ph.D. thesis:
http://citeseer.ist.psu.edu/161921.html
I presented some of those ideas at the first workshop on Family History Technology at BYU,
2001:
http://www.fht.byu.edu/prev_workshops/w ... nstrom.pdf
And last year I gave a presentation at Google in which those ideas were part of a talk about
a future open source storage system I am thinking of.
http://video.google.com/videoplay?docid ... 2299655271

Rgrds Mikael Ronstrom
huffkw wrote:SuperPAF/HyperPAF Idea

I attended the recent TechTalk in Salt Lake City (Jan. 18) and found it very helpful. I want to take this opportunity to put in my two cents worth of opinion. I am happy to see the Church starting a new set of genealogy data projects.

A new worldwide missionary/genealogy system is needed
My main personal interest is in seeing Church genealogy systems include a major strategic element that is aimed at serving the entire world and drawing hundreds of millions of people into using it. Our existing database systems, containing many Americans and Europeans who made up so much of our early membership, could be viewed as an extended “beta test” we are still working on before rolling it out to the world.

The overriding goal of the coming worldwide system should be to entice billions of people who have never heard of the Church to ask themselves why the Church even cares about genealogies, even as those people are actively and happily using the system to enter their own genealogy for preservation purposes. Their questions will eventually lead them to learn about our Plan of Salvation. Every person on the planet cares about their ancestors, and many include that concern in their religions.

In summary, I believe we need a truly generalized worldwide “genealogy research results storage” system that encourages and supports extensive cooperation, something that is quite a bit more than the specialized temple work system we have now. It is possible now to go far beyond the Ancestral File, Pedigree Resource File, and similar commercial efforts of an earlier day.

This system is intended to sit on top of all the large “raw genealogy data” databases in the world, acting as the lineage-linked summary and index to it all.

The elusive “One World Family Tree”
As with our past systems, it should do at least as much missionary work as genealogy work, but on a much bigger scale.

Success with this grand idea has eluded many a serious attempt in the past, but it is now perfectly feasible to do a really fine job of it, if the genealogy community is willing to study and accept a slightly different paradigm, a new way of thinking about the old goals and methods. This year seems like the perfect time to explore it, with all the new resources going into genealogy systems. I could describe the suggested system to any desired level of detail, but this is probably not the right time and place to try to do that. (I do have a running prototype at http://www.genreg.com, which needs more introductory verbiage and pictures. There is a lot of theory, ten years’ worth, under the hood that is hard to spell out quickly.)

At the risk of causing instant misunderstanding, skepticism, and controversy, I will briefly describe three key features.

The most notable internal difference comes in storing the “finished” data in descendant format, because that discipline almost automatically eliminates 95% of the duplication problems and questions we see today.

The most notable external feature is that people could, theoretically, dispense with all their PC-based software and enter and modify all their data online, having near-instant access to everyone else’s best work as well, making ease of cooperation a major benefit. There are three software layers:
1) one that allows an individual to enter and modify his or her own data, as with today’s PAF;
2) a higher-level “SuperPAF” that allows many members of an extended family (or workgroup) to enter updates simultaneously (with appropriate controls); and
3) the highest-level “HyperPAF,” which allows anyone in the world to view lineage-linked data that has been designated as “public,” and to enter competing but nondestructive overlays of data for particular people or families.

The third crucial feature provides for the explicit linking of name records to source records and images or other documents or multimedia objects, by actual URL or by more generic library call-number references. No longer need these important links be buried in someone’s obscure notes; they can become part of the active database.

Perfecting this new system using the American/European data “beta test” should allow it to be rolled out to the world with high credibility for all to use.
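
(As a small illustration of the “nondestructive overlay” idea at the HyperPAF level, here is a sketch in code; the record layout and names are invented, and only the principle matters: the public record is never changed, and competing views sit alongside it.)

```python
# Sketch of nondestructive overlays: the base record stays untouched,
# and each contributor's competing view is stored as a separate layer.
# All names and structures here are illustrative assumptions.
base_record = {
    "person": "CPN-0001234",
    "name": "John Doe",
    "birth": "1823",
    "public": True,
}

overlays = [
    {"person": "CPN-0001234", "submitter": "cousin_a", "birth": "1824"},
    {"person": "CPN-0001234", "submitter": "cousin_b", "birth": "abt. 1823"},
]

def merged_view(base, overlays, prefer=None):
    """Return the base record, optionally with one submitter's overlay applied on top."""
    view = dict(base)
    for layer in overlays:
        if prefer and layer["submitter"] == prefer:
            view.update({k: v for k, v in layer.items() if k not in ("person", "submitter")})
    return view

print(merged_view(base_record, overlays))                      # the untouched public record
print(merged_view(base_record, overlays, prefer="cousin_a"))   # one competing view; base unchanged
```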

(Notes: 1. Purely practical and performance reasons would cause PAF-like software to be used for a long time, even if only as an assembler of data segments to be added to the larger collection. 2. This system could be viewed as a very specialized “wiki” linked to other less specialized wikis.)

This relatively inexpensive system (low storage requirements) would allow a person newly interested in genealogy research, anywhere in the world, Church member or not, to find out in a few minutes all the genealogy research that has been done in the world that is linked to him or her (or a near relative). (Their searching has nothing to do with temple work at this point.) They could then quickly determine what new, non-duplicative genealogy research they might profitably contribute to the whole.

This new system would be powerful for LDS use as well, because long before any questions about submitting temple work came up, all relevant completed genealogies (and related history, stories, photos, etc.) could be explored. This would almost completely avoid the current problem where people, especially new members, may work for years to gather family names for temple work, only to find at the end that 98% have had their work done multiple times before, probably because those names came out of Church systems in the first place. That is quite shocking and discouraging to many neophyte genealogists.

Incidental operational benefits
Strangely enough, implementing the new system, where “finished” data is stored, should eventually cut down drastically on the high volume of FamilySearch hits and other expected future open-ended searches of a similar nature. Besides its social and missionary contributions, it should more than pay for itself by perhaps cutting in half the number of servers the Church would otherwise need to deploy to support its genealogy systems.

Having the new system in place should offer a huge incentive for people to participate in the Online Indexing (microfilm data transcription) process related to the Granite Mountain Vault project, especially if people are able to choose the microfilm they wish to transcribe, film that has a high likelihood of relating to their family. Having a reliable central place to store some of the results of their transcription, especially the parts relating to their family name identification and lineage-linking, should give them a real sense of accomplishment, a job done well, once and for all.

In fact, if the database system I propose were done early, and the vault images were gradually put online (or copies of those same images offered by other online data suppliers were used), it would be theoretically possible to skip much of the (centrally driven) Online Indexing process as now envisioned for those vault images. People using the online images (or even the traditional microfilm copies), and linking them to their family members in the new database, using actual record keys or library-style references, would in the process be gradually creating an index to those images as a byproduct of their efforts. If desired, at some point a final visual comparison of the images and the partially completed index could finish off the indexing/transcription process for the most often-used segments of the image database. (The links of names to source images could be turned around, or inverted. This view would begin with source images and show the names linked to from each image. The results could be completed and corrected, if necessary, much like the current plan for Online Indexing.)
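
(A tiny sketch of that inversion, with invented record keys, just to show how the byproduct index falls out of the name-to-image links:)

```python
# Links are entered from people to source images; grouping them by image
# yields a partial index of each image as a byproduct. Data is invented.
name_to_image_links = [
    ("CPN-0001234", "film_1234_frame_041"),
    ("CPN-0005678", "film_1234_frame_041"),
    ("CPN-0009012", "film_1234_frame_042"),
]

def invert_links(links):
    """Turn person -> image links into an image -> persons index."""
    index = {}
    for person, image in links:
        index.setdefault(image, []).append(person)
    return index

for image, persons in sorted(invert_links(name_to_image_links).items()):
    print(image, "->", persons)
```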


Kent Huff

----------------------------------------
801-798-8441, cell 801-787-3729
1748 West 900 South
Spanish Fork, Utah 84660
huffkw@juno.com, http://www.GenReg.com
computer consultant, attorney, author
================================================
User avatar
huffkw
Member
Posts: 54
Joined: Sun Jan 21, 2007 6:34 pm
Location: Spanish Fork, Utah
Contact:

Quick reply, pending reading your materials

#13

Post by huffkw »

mikron:
Thanks for letting us know about your interest and expertise.
I want to read all your stuff, and watch the Google video, and react. But that will take a little time.

So I thought a quick reply might be in order, for those who read the thread this far.

One of my design preferences is that the most central database be pretty compact and mostly act as an index to everything else -- no huge photos or images, etc., and just enough notes to make clear the identification of individuals. For example, if eventually there are 1.2 billion non-duplicated American and European names in the database, at 2,000 bytes each, that is about 2.4 terabytes. It would cost maybe $2,000 to store. That is a very small space compared to the space needed for all the source document images that are planned.

The database would be intensely used and soak up a lot of CPU cycles, but would otherwise be quite manageable in size. To start with, a mere 200 GB drive would do the job, up to 100 million names. One could even cache multiple copies to improve performance. That is probably not something you would try with giant image databases.
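
The arithmetic behind those figures, for anyone who wants to check it (2,000 bytes per name is the rough working assumption from above):

```python
# Back-of-the-envelope storage estimates using the 2,000-bytes-per-name assumption.
BYTES_PER_NAME = 2_000

def storage_gb(names: int) -> float:
    return names * BYTES_PER_NAME / 1e9

print(storage_gb(100_000_000))    # 200.0 GB for 100 million names
print(storage_gb(1_200_000_000))  # 2400.0 GB, i.e. about 2.4 TB for 1.2 billion names
```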
User avatar
thedqs
Community Moderators
Posts: 1042
Joined: Wed Jan 24, 2007 8:53 am
Location: Redmond, WA
Contact:

#14

Post by thedqs »

Unless there is a third-party site out there that allows the archival of genealogical media, you could have the Church allow the storage of media on a separate server so that it wouldn't interfere with the basic storage of genealogical information.
- David
User avatar
huffkw
Member
Posts: 54
Joined: Sun Jan 21, 2007 6:34 pm
Location: Spanish Fork, Utah
Contact:

#15

Post by huffkw »

thedqs:
You are right.
I hope that is how it turns out.
brannonking
New Member
Posts: 8
Joined: Mon Jan 29, 2007 6:40 pm
Location: Logan

wiki is more important than space

#16

Post by brannonking »

I think many are missing the point here, though. Storage for the name hierarchy is nice, but not sufficient. People will use the website when it works like a wiki. In other words, not only do we want a hierarchy of all 20 billion people who've inhabited the earth in the past 6,000 years, we want a nice write-up on each with uploaded pictures, journals, ordinances for all religions, etc. We also want nice tools to search the hierarchy for relations, browse it for fun, and read it for education. Suppose you start out with 100 MB per entry. Multiply that by 20 billion and you need 2 exabytes of storage. It sounds huge and expensive, but in reality it is easily feasible over the next 20 years. We already have databases that can support 20 billion entries over a distributed system. That part is already done.

And all the talk about thumb drives... those are for places without internet connections. And all the talk about client software: what is it for? Web applications these days are plenty advanced for nice data entry, file format import, data browsing, etc.
huffkw wrote: thedqs: “the only problem is storage”

As I estimate it, the Church only has about 40 to 50 million unique names in all its active genealogy and temple ordinance files at this point. All the rest are duplicates of one kind or another, even if that totals 1 billion records. (At this point, it would be hard to get the real number of unique names).

Allowing 2,000 characters of storage for each of 50 million names only comes to 100 GB total. Pretty small. All the 300 million deceased Americans would only take up about 600 GB, still a small amount. Maybe spend $500 for a 1-terabyte drive.

Putting the whole “bit mountain” on disk in image and text format will truly take petabytes, with megabytes per image, but I am assuming that the file I am suggesting will have no more than short URLs or library-style general references to these huge image files, so it will not expand much.
===============================================
brannonking
New Member
Posts: 8
Joined: Mon Jan 29, 2007 6:40 pm
Location: Logan

personal genealogy? bah

#17

Post by brannonking »

haledn wrote: The value that I see in the new FamilySearch offering is to jump-start your pedigree so that you don't have to start with an empty database structure. Even though this data probably needs to be tweaked to be sure that it has the correct information and duplicates are handled correctly, this seems like a step in the right direction for many inexperienced genealogists who don't want to re-enter genealogical data which they know is already available.
I think you're missing the point here. Get over personal genealogy. We don't want personal genealogy. We want a record of the people of the planet that has no duplication.
User avatar
huffkw
Member
Posts: 54
Joined: Sun Jan 21, 2007 6:34 pm
Location: Spanish Fork, Utah
Contact:

#18

Post by huffkw »

brannon01 wrote: In other words, not only do we want a hierarchy of all 20 billion people who've inhabited the earth in the past 6k years, we want a nice write-up on each with uploaded pictures, journals, ordinances for all religions, etc.
. . . .
Web applications these days are plenty advanced for nice data entry, file format import, data browsing, etc.

I certainly agree with your sweeping view of what this should all be about, and the product we should be after.

I have just been trying to make it as manageable as possible by suggesting how it could be subdivided for actual construction.
User avatar
huffkw
Member
Posts: 54
Joined: Sun Jan 21, 2007 6:34 pm
Location: Spanish Fork, Utah
Contact:

Open source MySQL storage clusters for very large systems

#19

Post by huffkw »

mikron:
I like your more detailed link idea. I had not thought any more about it than just giving a line number as one might do with a book, page, and line. But it would be pretty cool to have a document image page come up with a highlighted rectangle around the data of interest. How would you describe the rectangle size and location in some generic way?
----------------------------
I read your 2001 FHT paper. I think it is great, especially your ability to estimate the number of servers needed to process the six object classes you describe: Application Objects, Catalog Objects, Link Objects, Scanned Objects, Translation Objects, and Name Translation Objects. I have wondered if it would be feasible to load the entire set of Application Objects and Link Objects (and perhaps other small objects) into server main memory, and it appears it is. That should deliver an enormous performance boost.

On your link object type that could specify which part of an image page is referenced, I wonder if that would need to be one step more generalized by using percentages of distances between the four corners of the image, since there could be many sizes and compressions used in moving the image around and displaying it, and any absolute size or pixel count could be inappropriate in some cases.
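
Something like the following sketch is what I have in mind: the rectangle is stored as fractions of the page's width and height, so it can be mapped back onto any rendition of the image regardless of pixel size (the numbers and function names are made up for illustration):

```python
# Store a highlight rectangle as fractions of the page, then map it onto
# whatever pixel size a given copy of the image happens to have.
def to_fractions(rect_px, img_w, img_h):
    """(x, y, w, h) in pixels -> fractions of the source image's dimensions."""
    x, y, w, h = rect_px
    return (x / img_w, y / img_h, w / img_w, h / img_h)

def to_pixels(rect_frac, img_w, img_h):
    """Fractions -> pixel rectangle for a (possibly resized) rendition of the image."""
    fx, fy, fw, fh = rect_frac
    return (round(fx * img_w), round(fy * img_h), round(fw * img_w), round(fh * img_h))

# Rectangle drawn on a 3000 x 4000 scan...
frac = to_fractions((360, 2200, 1200, 240), 3000, 4000)
# ...re-displayed on a 750 x 1000 thumbnail of the same page.
print(to_pixels(frac, 750, 1000))   # -> (90, 550, 300, 60)
```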

I see most of my ideas as relating to yours by specifying a further set of rules for the formation and interaction of objects, mostly the Application Objects and Link Objects. I am happy to leave the underlying industrial strength memory and storage systems to somebody else.
---------------------------
Your telecom paper sounds like slightly familiar territory.
I should mention that I worked for a time (among a staff of 900 programmers) on the gigantic Verizon telecom Event Processing System covering most of the US east coast. We were testing for about 200 million transactions a day in 1999, but I assume it has gotten much bigger since then.
I have read only the first chapter of your paper, but I was fascinated by the idea of using a second main memory as a backup or commit space, plus all the bottleneck studies and solutions.
---------------------------
The Google video mentioned that your major Ph.D. work was adopted for MySQL clusters. That was pretty interesting. It makes me wonder whether the Church guys have used any of your open source work on very large data storage systems. Seems like it would be a good thing for them to check out, if they haven’t already.
---------------------------
sochsner
New Member
Posts: 26
Joined: Tue Jan 30, 2007 8:08 am
Location: Henderson, Nevada

#20

Post by sochsner »

I heard talk that the records stored in the granite vaults are being recorded electronically. Do you think that PAF could connect to that database, so that when you enter someone it recognizes, it brings up a tree that already exists and asks if it is your family? If it is, then it imports it.

With this idea, making one big tree connecting all of us to Adam could be a little easier.