Auto-cross referencing Prototype

Discussions about the Notes and Journal tool on LDS.org. This includes the Study Toolbar as well as the scriptures and other content on LDS.org that is integrated with Notes and Journal.
josh-p40
New Member
Posts: 24
Joined: Fri Jan 26, 2007 10:25 am
Location: San Jose, CA

Auto-cross referencing Prototype

Postby josh-p40 » Wed Feb 14, 2007 12:17 am

Here is a VERY rough prototype for auto-cross referencing using the simple verse-matching method I mentioned in the other post.

It's still in development. There are 100 ways I can think of to improve it, so it may get better day to day. But as is, it's not bad.

Nothing fancy yet. You have to know the text of the verse ahead of time.

Basically, put in the exact verse reference written out fully (like 1 Nephi 3:7), then put in some word or phrase from the verse, like: commandments.

If you want the EXACT phrase, use quotes, like "go and do".
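Under the hood this is Lucene query syntax, but the quoted-vs-unquoted distinction can be sketched in plain Python (a hypothetical matches() helper for illustration, not the prototype's actual code):

```python
def matches(verse, query):
    """Quoted query: must appear in the verse as an exact phrase.
    Unquoted query: every word must occur somewhere in the verse."""
    text = verse.lower()
    if query.startswith('"') and query.endswith('"'):
        return query.strip('"').lower() in text
    return all(word.lower() in text.split() for word in query.split())

verse = "I will go and do the things which the Lord hath commanded"
print(matches(verse, '"go and do"'))  # True: exact phrase present
print(matches(verse, "do go"))        # True: both words occur somewhere
print(matches(verse, '"do go"'))      # False: phrase not present in that order
```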

http://menkefamily.no-ip.org/

[thread="181"]I posted it in the other post too[/thread]

I double-posted since I felt it was a new enough topic, being a specific instance of a stat-linking app.
--josh

WelchTC
Senior Member
Posts: 2088
Joined: Wed Sep 06, 2006 7:51 am
Location: Kaysville, UT, USA
Contact:

Postby WelchTC » Wed Feb 14, 2007 7:20 am

This is cool stuff. The online scriptures do some of this as well, but not to the extent you did. For example, if you use the online scriptures and search for "dwealt" (notice the misspelling), it will ask if you meant "dwelt". You can also search for "gather" and it will find all related words like "gathering" and "gathered".
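A minimal "did you mean" suggestion of that kind can be sketched with Python's standard difflib (a toy illustration, not what the online scriptures actually run; the vocabulary list here is made up):

```python
import difflib

# tiny made-up vocabulary standing in for the real index's term list
vocabulary = ["dwelt", "dwell", "dwelling", "gather", "gathered", "gathering"]

def suggest(word, vocab=vocabulary):
    """Return the closest known word, or None if nothing is similar enough."""
    close = difflib.get_close_matches(word, vocab, n=1, cutoff=0.75)
    return close[0] if close else None

print(suggest("dwealt"))  # -> dwelt
```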

Is your code open source so we can look at it?

Tom

josh-p40
New Member
Posts: 24
Joined: Fri Jan 26, 2007 10:25 am
Location: San Jose, CA

Postby josh-p40 » Wed Feb 14, 2007 2:17 pm

Hey, thanks.

Yeah, I don't do anything nearly as fancy as the scriptures.lds.org search yet. I mean, I don't handle alternate spellings, synonyms, word conjugations, or even misspellings. Those would all improve the results.
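Word conjugations, for instance, could be folded together with even a crude suffix-stripping stemmer. This toy sketch (nowhere near a real Porter stemmer) shows the idea:

```python
def crude_stem(word):
    """Strip a few common English suffixes so conjugations share one index
    term. A toy stand-in for a real stemmer like Porter's."""
    for suffix in ("ing", "eth", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ("gather", "gathered", "gathering", "gathereth"):
    print(w, "->", crude_stem(w))  # all four stem to "gather"
```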

As for open source, I'll clean up the code a little and then post it either here or somewhere.

If you guys use it, I'd be interested in hearing "how it's going" and even participating if possible.

Even without cleaning it up, it's only 162 lines of code. Of course, that assumes you already have access to a full-text index of all the scriptures.
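For anyone wondering what that index buys you: in miniature, a full-text index is just an inverted word-to-verses map. A pure-Python sketch, with a two-verse toy corpus standing in for the real verse files:

```python
from collections import defaultdict

# toy corpus standing in for the real per-verse text files
verses = {
    "1 Nephi 3:7": "i will go and do the things which the lord hath commanded",
    "Mosiah 2:17": "when ye are in the service of your fellow beings",
}

# inverted index: each word maps to the set of verses containing it
index = defaultdict(set)
for ref, text in verses.items():
    for word in text.split():
        index[word].add(ref)

print(sorted(index["the"]))        # both verses contain "the"
print(sorted(index["commanded"]))  # only 1 Nephi 3:7
```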

I'll post where the code is available in this thread and maybe the other.
--josh

josh-p40
New Member
Posts: 24
Joined: Fri Jan 26, 2007 10:25 am
Location: San Jose, CA

Postby josh-p40 » Wed Feb 14, 2007 5:27 pm

OK, here is the source code.

It's also available here for now:

http://menkefamily.no-ip.org/MatchVerses.pys

You still need a Lucene index and PyLucene to use it.

Code:

#!/usr/bin/env python
from PyLucene import QueryParser, IndexSearcher, StandardAnalyzer, FSDirectory, Term, QueryTermExtractor
from PyLucene import VERSION, LUCENE_VERSION

import string
import optparse

"""
MatchVerses.py
Author: Josh Menke
Date created: Feb 12 2007
Last Modified: Feb 14 2007

Given a verse and a search for the verse, this will find verses that contain the
search and will rank them by how closely they match the original verse.

Meant as an "auto-cross-indexer"

This script is loosely based on the Lucene (java implementation) demo class
org.apache.lucene.demo.SearchFiles.
"""

def remove_punctuation(text):
   cleaned = ""
   for char in text:
      if not char in string.punctuation:
         cleaned += char
   return cleaned

def get_bigrams(words):
   bigram_set = set()
   for word_number in range(len(words)-1):
      bigram_set.add(words[word_number]+words[word_number+1])
   return bigram_set

def run(searcher, analyzer,verse=None,command=None):

   # output is handled differently if this was called from the console vs. interactively
   is_console = verse != None
   if is_console:
      paragraph = "<p>"
   else:
      paragraph = ""

   while True:

      if not is_console:
         print paragraph
         verse = raw_input("Verse:")
         if verse == '':
             return

      print paragraph,"Verse:",verse

      verse_path = "/home/josh/code/statlink/lds-scriptures-2_5_0-csv/./verses/"+verse+".txt"
      try:
         verse_raw = open(verse_path).read()
      except IOError:
         print paragraph,"ERROR: Verse not found. Please use full names for now"
         if is_console:
            return
         else:
            continue

      print paragraph,verse_raw

      # weight terms using API into my PyLucene created full-text index
      term_iter = searcher.getIndexReader().terms()
      term_weight_map = {}
      while (term_iter.next()):
         term = term_iter.term()
         text = term.text()
         freq = term_iter.docFreq()
         term_weight_map[text] = 1-freq*1.0/searcher.maxDoc()

      # remove punctuation
      words_in_verse = remove_punctuation(verse_raw).split()

      # only use words Lucene likes (removes stopwords)
      word_set = set()
      for word in words_in_verse:
         query = QueryParser("contents", analyzer).parse(word)
         hits = searcher.search(query)
         if hits.length():
            word_set.add(word.lower())

      # calculate highest possible score
      highest_possible = 0
      for word in word_set:
         try:
            highest_possible += term_weight_map[word]
         except KeyError:
            continue

      # add bigrams
      bigram_set = get_bigrams(words_in_verse)
      highest_possible += len(bigram_set)

      if not is_console:
         print "Hit enter with no input to quit."
         command = raw_input("Query:")
         if command == '':
             return
         print

      if is_console:
         print "\n\n"

      print paragraph,"Searching for:", command

      query = QueryParser("contents", analyzer).parse(command)
      hits = searcher.search(query)
      print paragraph,"%s total matching documents." % hits.length()
      matching_docs = []
      for doc in hits:
         path = doc.get("path")
         path = "/home/josh/code/statlink/lds-scriptures-2_5_0-csv/"+path
         if path == verse_path:
            continue
         raw_verse = open(path).read()

         # remove punctuation
         verse_words = remove_punctuation(raw_verse).split()

         # score match
         matches = word_set.intersection(set(verse_words))
         match_score = 0
         for match in matches:
            try:
               match_score += term_weight_map[match.lower()]
            except KeyError:
               continue

         # add bigram bonus
         match_bigram_set = get_bigrams(verse_words)
         match_score += len(match_bigram_set.intersection(bigram_set))

         # store match
         matching_docs.append((match_score,doc.get("name"),raw_verse))

      matching_docs.sort()
      if is_console:
         matching_docs.reverse()
         matching_docs = matching_docs[0:10]
      else:
         matching_docs = matching_docs[-10:]

      if is_console:
         print "\n", "\n"
      for doc in matching_docs:
         print paragraph,doc[1][0:doc[1].find(".")],"Score:",round(float(doc[0])/highest_possible,2)
         print paragraph,doc[2]
         if is_console:
            print "\n", "\n"

      if is_console:
         return


if __name__ == '__main__':
   usage = "usage: %prog [options] [\"verse\" \"word or phrase\"]"
   parser = optparse.OptionParser(usage=usage)
   parser.add_option("-d", "--directory",dest="directory", help="directory with index", default="index")
   (options, args) = parser.parse_args()

   # accept either no positional args (interactive mode) or exactly two (verse, query)
   if len(args) not in (0, 2):
      parser.error("Incorrect number of arguments")

   STORE_DIR = options.directory
   directory = FSDirectory.getDirectory(STORE_DIR, False)
   searcher = IndexSearcher(directory)
   analyzer = StandardAnalyzer()
   if len(args) == 2:
      run(searcher,analyzer,args[0],args[1])
   else:
      run(searcher, analyzer)
   searcher.close()
--josh

WelchTC
Senior Member
Posts: 2088
Joined: Wed Sep 06, 2006 7:51 am
Location: Kaysville, UT, USA
Contact:

Postby WelchTC » Thu Feb 15, 2007 7:34 am

Cool! Nice work!

Tom

josh-p40
New Member
Posts: 24
Joined: Fri Jan 26, 2007 10:25 am
Location: San Jose, CA

Postby josh-p40 » Thu Feb 15, 2007 12:37 pm

FYI, I slightly modified the term weights:

I changed:

Code:

term_weight_map[text] = 1-freq*1.0/searcher.maxDoc()


to

Code:

term_weight_map[text] = 1-freq*4.0/searcher.maxDoc()


The most common word (outside the normal English stop words) is "unto", which appears in over 11,000 verses. By quadrupling the commonality penalty, "unto" becomes worth almost 0.
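The arithmetic is easy to check. Assuming roughly 42,000 verses for searcher.maxDoc() (my estimate; the post doesn't give the exact count):

```python
max_doc = 42000    # approximate total verse count (assumption)
unto_freq = 11000  # "unto" appears in over 11,000 verses

old_weight = 1 - unto_freq * 1.0 / max_doc
new_weight = 1 - unto_freq * 4.0 / max_doc
print(round(old_weight, 2))  # -> 0.74
print(round(new_weight, 2))  # -> -0.05
```

So under the original formula "unto" still carried a weight near 0.74; quadrupling the penalty drives it slightly below zero, effectively worthless (very common words can now even subtract a little from a match score).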
--josh

mkmurray
Senior Member
Posts: 3241
Joined: Tue Jan 23, 2007 9:56 pm
Location: Utah
Contact:

Postby mkmurray » Sun Feb 25, 2007 10:28 pm

Follow over to [thread=264]this thread[/thread] for a better AJAX-style prototype.

