Is there a resource somewhere that compiles all of the General Conference talks in plain text?
I want to try my hand at some textual analysis of General Conference talks. I've tried scraping the text out of the HTML, and have actually had a fair amount of success doing it. The problem is that some of the characters pulled from the web pages don't display properly in my software (R), which makes it hard to match strings correctly. Most notably, I'm getting an odd character code for some spaces after names.
For example, '"By Elder Jeffrey R. Holland"' is the name I get for talks by Elder Holland. The ' ' appears to be two raw characters that appear as a single space in the web browser. The raw character for a space should be 20, while the raw code for the ' ' is 'c2 a0'.
I would assume a plain text file would remove some of these problems (the HTML on lds.org appears to use directional brackets; I would prefer a plain text file with non-directional brackets).
Plain Text of General Conference Talks
- Member
- Posts: 276
- Joined: Tue Feb 10, 2009 7:06 am
- Location: Berea, KY, USA
- Senior Member
- Posts: 2052
- Joined: Tue May 22, 2012 1:52 pm
- Location: California, USA
Re: Plain Text of General Conference Talks
You could try an HTML-to-plain-text converter. Your browser may even have a plugin that does this, and accessibility tools are another place to look. I'm not sure whether you're programming or cutting and pasting, but either way a search for such tools might help you get around your character problems. HTML has several whitespace characters, and a converter should transform all of them into a plain space.
For example, I took a look at one of Jeffrey R. Holland's talks and see that there's a non-breaking space between "Jeffrey" and "R.".
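If you end up programming this rather than using an off-the-shelf converter, the idea might look something like the following. This is only a minimal sketch in Python (not R) using the standard library. The `TextExtractor` class, the `html_to_plain_text` function, and the sample markup are hypothetical names made up for illustration, not anything taken from lds.org.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into the
        # corresponding Unicode character (U+00A0) before handle_data runs.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_plain_text(html):
    """Return the text of `html` with every whitespace run as one plain space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # For str patterns, \s matches Unicode whitespace, including the
    # non-breaking space U+00A0, so it is replaced along with tabs and newlines.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical byline markup, assumed for illustration only.
sample = '<p class="byline">By Elder Jeffrey&nbsp;R. Holland</p>'
print(html_to_plain_text(sample))
```

A real page would also need its script and style blocks skipped, which this sketch does not handle.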
- Mikerowaved
- Community Moderators
- Posts: 4744
- Joined: Sun Dec 23, 2007 12:56 am
- Location: Layton, UT
Re: Plain Text of General Conference Talks
nutterb wrote: For example, '"By Elder Jeffrey R. Holland"' is the name I get for talks by Elder Holland. The ' ' appears to be two raw characters that appear as a single space in the web browser. The raw character for a space should be 20, while the raw code for the ' ' is 'c2 a0'.
You might want to read Wikipedia's page on non-breaking space, particularly the Encodings section. It shows that "C2 A0" is valid for a non-breaking space with UTF-8.
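To make the "C2 A0" versus "20" byte values concrete, here is a small Python sketch (Python rather than R, purely for illustration). It assumes the scraped bytes are UTF-8. The guess that an odd extra character comes from reading those bytes as Latin-1 is only one possible explanation, not something confirmed in this thread, and the `hex_bytes` helper is a hypothetical name.

```python
def hex_bytes(text):
    """Return the UTF-8 bytes of `text` as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


nbsp = "\u00a0"   # non-breaking space
space = " "       # ordinary space

print(hex_bytes(nbsp))   # c2 a0
print(hex_bytes(space))  # 20

# The same byline as raw UTF-8 bytes, as a scraper might receive it.
raw = "By Elder Jeffrey\u00a0R. Holland".encode("utf-8")

# Assumption: decoding with a single-byte encoding such as Latin-1 turns the
# two bytes c2 a0 into two separate characters instead of one.
misread = raw.decode("latin-1")

# Decoding as UTF-8 yields one non-breaking space, which can then be
# swapped for an ordinary space before matching names.
correct = raw.decode("utf-8")
cleaned = correct.replace("\u00a0", " ")
print(cleaned)
```

Once the text is decoded as UTF-8, matching against "Jeffrey R. Holland" with a plain space should work on the cleaned string.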
So we can better help you, please edit your Profile to include your general location.
- johnshaw
- Senior Member
- Posts: 2273
- Joined: Fri Jan 19, 2007 1:55 pm
- Location: Syracuse, UT
Re: Plain Text of General Conference Talks
You could work with this database from BYU. Not only will it help with the text, but the analysis you want to do will be facilitated in many ways through this site.
http://www.lds-general-conference.org/
“A long habit of not thinking a thing wrong, gives it a superficial appearance of being right, and raises at first a formidable outcry in defense of custom.”
― Thomas Paine, Common Sense