Plain Text of General Conference Talks

nutterb
Member
Posts: 276
Joined: Tue Feb 10, 2009 7:06 am
Location: Berea, KY, USA

Plain Text of General Conference Talks

#1

Post by nutterb »

Is there a resource somewhere that compiles all of the General Conference talks in plain text?

I want to try my hand at some textual analysis of General Conference talks. I've tried scraping the text out of the HTML, and have actually had a fair amount of success doing it. The problem is that some of the characters pulled from the web pages aren't handled well by my software (R), which makes it hard to match characters correctly. Most notably, I'm getting an odd character code for some spaces after names.

For example, "By Elder Jeffrey R. Holland" is the name I get for talks by Elder Holland, but the space after "Jeffrey" is not an ordinary space. It appears to be two raw bytes that show up as a single space in the web browser. The raw byte for an ordinary space should be 20 (hex), while the raw bytes for this character are c2 a0.

I would assume a plain text file would remove some of these problems. (The HTML on lds.org appears to use directional brackets; I would prefer a plain text file with non-directional brackets.)
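If "directional brackets" here means curly quotation marks (that reading is an assumption, the post does not say), a small translation table can swap them for straight ones after scraping. This is only a sketch in Python; the function name `straighten_quotes` is made up for illustration, and the table covers just the four most common curly quote characters.

```python
# Map curly ("directional") quotation marks to their plain ASCII equivalents.
# Assumption: these four code points are the ones causing trouble.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(QUOTE_MAP)


sample = "\u201cBy Elder Holland\u201d"
print(straighten_quotes(sample))
```

Anything not listed in the table passes through unchanged, so other punctuation is left alone.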
mevans
Senior Member
Posts: 2049
Joined: Tue May 22, 2012 1:52 pm
Location: California, USA

Re: Plain Text of General Conference Talks

#2

Post by mevans »

You could try an HTML-to-plain-text converter. Maybe your browser even has a plugin that does this? You could also look for accessibility tools. I'm not sure whether you're programming or doing cut and paste, but you can search for such tools, and they might help you get around your character problems. HTML has several whitespace characters, and a converter should turn all of them into a plain space.

For example, I took a look at one of Jeffrey R. Holland's talks and see that there's a non-breaking space between "Jeffrey" and "R.".
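If you are programming, the converter idea can be sketched in a few lines. The example below is Python using only the standard library; the class name `TextExtractor`, the function name `html_to_plain_text`, the list of block tags, and the sample HTML are all made up for illustration and are not taken from the Church website. It pulls the text out of the markup and collapses every run of whitespace, non-breaking spaces included, into one ordinary space.

```python
import re
from html.parser import HTMLParser

# Tags after which a separating space is inserted (illustrative, not exhaustive).
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3"}


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_plain_text(html):
    """Strip tags and normalize all whitespace to single ordinary spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # For str patterns, \s matches Unicode whitespace, including U+00A0.
    return re.sub(r"\s+", " ", text).strip()


sample = "<p>By Elder Jeffrey&nbsp;R. Holland</p><p>Second   line</p>"
print(html_to_plain_text(sample))
```

Collapsing all whitespace also throws away paragraph breaks, which may or may not matter for the analysis you have in mind.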
Mikerowaved
Community Moderators
Posts: 4734
Joined: Sun Dec 23, 2007 12:56 am
Location: Layton, UT

Re: Plain Text of General Conference Talks

#3

Post by Mikerowaved »

nutterb wrote: For example, "By Elder Jeffrey R. Holland" is the name I get for talks by Elder Holland, but the space after "Jeffrey" is not an ordinary space. It appears to be two raw bytes that show up as a single space in the web browser. The raw byte for an ordinary space should be 20 (hex), while the raw bytes for this character are c2 a0.
You might want to read Wikipedia's page on the non-breaking space, particularly the Encodings section. It shows that "C2 A0" is the valid UTF-8 encoding of a non-breaking space.
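A quick way to see this for yourself is to look at the bytes directly. The sketch below is in Python rather than R, and the variable names are just for illustration. It also shows one possible reason (an assumption, not something confirmed in this thread) for seeing two characters: UTF-8 bytes being read as a single-byte encoding such as Latin-1.

```python
NBSP = "\u00a0"  # non-breaking space

name = "By Elder Jeffrey" + NBSP + "R. Holland"
raw = name.encode("utf-8")

# The non-breaking space takes two bytes in UTF-8; an ordinary space takes one.
nbsp_bytes = NBSP.encode("utf-8")
space_bytes = " ".encode("utf-8")
print(nbsp_bytes.hex(), space_bytes.hex())

# Assumed failure mode: decoding the UTF-8 bytes as Latin-1 turns the single
# non-breaking space into two characters (U+00C2 followed by U+00A0).
misread = raw.decode("latin-1")

# Decoding as UTF-8 and then replacing the non-breaking space gives plain text.
cleaned = raw.decode("utf-8").replace(NBSP, " ")
print(cleaned)
```

If the text is read as UTF-8 from the start, the non-breaking space is one character and a single replacement is enough to turn it into an ordinary space.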
johnshaw
Senior Member
Posts: 2273
Joined: Fri Jan 19, 2007 1:55 pm
Location: Syracuse, UT

Re: Plain Text of General Conference Talks

#4

Post by johnshaw »

You could work with this database from BYU. Not only will it help with the text, but the analysis you want to do will be facilitated in many ways through this site.

http://www.lds-general-conference.org/