Plain Text of General Conference Talks

nutterb
Member
Posts: 257
Joined: Tue Feb 10, 2009 7:06 am
Location: Shaker Heights, OH, USA

Plain Text of General Conference Talks

Postby nutterb » Wed Oct 08, 2014 6:03 am

Is there a resource somewhere that compiles all of the General Conference talks in plain text?

I want to try my hand at some textual analysis of General Conference talks. I've tried scraping the text out of the HTML and have had a fair amount of success. The problem is that some of the characters pulled from the web pages don't display well in my software (R), which makes it hard to match characters correctly. Most notably, I'm getting an odd character code for some spaces after names.

For example, '"By Elder Jeffrey R. Holland"' is the name I get for talks by Elder Holland. The ' ' appears to be two raw characters that appear as a single space in the web browser. The raw character for a space should be 20, while the raw code for the ' ' is 'c2 a0'.

I would assume a plain text file would remove some of these problems (the HTML on lds.org appears to use directional brackets; I would prefer a plain text file with non-directional brackets).

mevans
Senior Member
Posts: 1282
Joined: Tue May 22, 2012 12:52 pm
Location: California, USA

Re: Plain Text of General Conference Talks

Postby mevans » Mon Oct 20, 2014 11:10 pm

You could try an HTML to plain text converter. Maybe your browser even has a plugin that does this? You could also look for accessibility tools. I'm not sure whether you're programming or cutting and pasting, but either way such tools might help you get around your character problems. There are several HTML whitespace characters, and a converter should transform all of them into a plain space.

For example, I took a look at one of Jeffrey R. Holland's talks and see that there's a non-breaking space between "Jeffrey" and "R.".

Mikerowaved
Community Moderators
Posts: 3132
Joined: Sun Dec 23, 2007 12:56 am
Location: Layton, UT

Re: Plain Text of General Conference Talks

Postby Mikerowaved » Tue Oct 21, 2014 1:42 am

nutterb wrote:For example, '"By Elder Jeffrey R. Holland"' is the name I get for talks by Elder Holland. The ' ' appears to be two raw characters that appear as a single space in the web browser. The raw character for a space should be 20, while the raw code for the ' ' is 'c2 a0'.

You might want to read Wikipedia's page on the non-breaking space, particularly the Encodings section. It shows that "C2 A0" is the valid UTF-8 encoding of a non-breaking space (U+00A0).
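
To make that concrete, here is a small Python sketch showing that the two bytes c2 a0 decode to a single non-breaking space under UTF-8. It also shows one possible reason the bytes could look like two characters: reading them with a single-byte encoding such as Latin-1. That cause is only a guess, not something confirmed in this thread. The helper name replace_nbsp and the sample string are made up for illustration.

```python
raw = bytes.fromhex("c2a0")  # the two bytes reported in the quoted post

# Decoded as UTF-8, the pair is one character: U+00A0, the non-breaking space.
nbsp = raw.decode("utf-8")
print(len(nbsp), hex(ord(nbsp)))

# Decoded as Latin-1 (one byte per character), the same bytes become two
# characters: "Â" (U+00C2) followed by a non-breaking space (U+00A0).
misread = raw.decode("latin-1")
print(len(misread), [hex(ord(ch)) for ch in misread])


def replace_nbsp(text):
    """Swap every non-breaking space for an ordinary space (byte 0x20)."""
    return text.replace("\u00a0", " ")


name = "By Elder Jeffrey\u00a0R. Holland"  # illustrative string
print(replace_nbsp(name) == "By Elder Jeffrey R. Holland")
```

If the text is read as UTF-8 from the start, a single replacement like this should be enough to make the names match.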

johnshaw
Senior Member
Posts: 1839
Joined: Fri Jan 19, 2007 1:55 pm
Location: Syracuse, UT

Re: Plain Text of General Conference Talks

Postby johnshaw » Tue Oct 21, 2014 7:31 am

You could work with this database from BYU. Not only will it help with the text, but the analysis you want to do will be facilitated in many ways through this site.

http://www.lds-general-conference.org/

