From HTML to a List of Words
Getting rid of HTML formatting
Often we're interested in keeping the textual content of an online source for processing, but we'd like to get rid of the HTML tags and metadata. We're going to start by doing this the quick and dirty way. In the HTML that you've seen so far, there have been a few basic kinds of tags. In each case, it looks as if we will be safe ignoring everything between a matching pair of angle brackets.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<!-- This is a comment -->
<title>Title of page</title>
Our algorithm is going to be as follows:
- Start with an empty string to store our text in
- Look at every character in the HTML string, one at a time
- If the character is a left angle bracket (<), we are now inside a tag, so ignore the character
- If the character is a right angle bracket (>), we are now leaving the tag, so ignore the character as well
- If we're inside a tag, ignore the character; otherwise append it to the text string
An algorithm is a procedure that has been specified in enough detail that it can be implemented on a computer. We turn to the implementation now.
More about Python strings
So far you've seen two ways that strings can be delimited, using either a matching pair of single or double quotes:
message1 = 'hello world'
message2 = "hello world"
Python has a third kind of string that can span multiple lines. This will be useful later.
message3 = """hello
hello
hello world"""
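To see what a multi-line string actually contains, here is a small sketch (not part of the original lesson code). It uses print(...) with parentheses so that it should run under both Python 2 and Python 3. Each line break inside the triple quotes is stored as a single newline character, written '\n'.

```python
message3 = """hello
hello
hello world"""

# Each line break inside the triple quotes is stored as one '\n' character.
print(message3.count('\n'))   # prints 2

# 5 + 1 + 5 + 1 + 11 characters, counting the two newlines.
print(len(message3))          # prints 23
```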
Python includes a number of operators and functions for manipulating strings. If you'd like to experiment with them, you can write and execute short programs as we've mostly been doing, or you can open up a Python shell.
You can concatenate strings (i.e., join them together) using the plus operator. Note that you have to be explicit about where you want blank spaces to occur. You can also create multiple copies of strings by using the multiplication operator.
message4 = 'hello' + ' ' + 'world'
print message4
-> hello world
message5a = 'hello ' * 3
message5b = 'world'
print message5a + message5b
-> hello hello hello world
What if you want to successively add material to the end of a string? There is a special operator for that.
message6 = 'howdy'
message6 += ' '
message6 += 'world'
print message6
-> howdy world
You can determine the number of characters in a string using len. Note that the blank space counts as a separate character.
message7 = 'hello' + ' ' + 'world'
print len(message7)
-> 11
Finally, you are occasionally in a situation where you need to include quotation marks of various kinds within a string, and you don't want the Python interpreter to get the wrong idea and end the string when it comes across one of these characters. In Python, you can put a backslash in front of a quotation mark so that it doesn't terminate the string. These are known as escape sequences.
print '\"'
-> "
print 'The program printed \"hello world\"'
-> The program printed "hello world"
Two other escape sequences allow you to print tabs and newlines:
print 'hello\thello\thello\nworld'
-> hello   hello   hello
world
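Two details about escape sequences are easy to check for yourself. The sketch below is an illustration rather than part of the lesson's code, and the variable names a and b are made up for the example. It shows that a quotation mark only needs a backslash when it matches the marks that delimit the string, and that an escape sequence takes two characters to type but counts as one character in the string.

```python
# A single quote inside a double-quoted string needs no backslash;
# inside a single-quoted string it does.
a = "It's here"
b = 'It\'s here'
print(a == b)        # prints True

# An escape sequence is typed as two characters but stored as one.
print(len('\t'))     # prints 1
print(len('a\nb'))   # prints 3
```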
To return to our algorithm, we first have the problem of creating an empty string to store text in.
text = ''
OK, that was easy. We already know how we're going to append characters to this string when we need to:
text += char
Looping
Now we need a way to look at every character in the html string, one at a time. Like many programming languages, Python includes a number of looping mechanisms. The one that we want is called a for loop. The version below tells the interpreter to do something for each character in a string named html. In effect, it creates a one-character-long string named char, which will contain each character from html in succession.
for char in html:
    # do something with char
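As a concrete illustration of the loop, the sketch below walks through a short, made-up html string and counts the characters as it goes. The sample string and the count variable are assumptions for the example only. Notice that the body of the loop is indented; that is how Python knows which lines to repeat.

```python
html = '<b>hi</b>'   # hypothetical sample string

count = 0
for char in html:
    # char is a one-character string on each pass through the loop
    print(char)
    count += 1

# The loop body ran once per character, so count matches len(html).
print(count)         # prints 9
```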
Branching
Next we need a way of testing the contents of a string, and choosing a course of action based on that test. Again, like many programming languages, Python includes a number of branching mechanisms. The one that we want is called an if statement. The version below tests to see whether the string char contains a left angle bracket.
if char == '<':
    # do something
A more general form of the if statement allows you to specify what to do in the event that your test is false.
if char == '<':
    # do something
else:
    # do something different
In Python you have the option of doing further tests after the first one, by using an elif statement (which is shorthand for "else if").
if char == '<':
    # do something
elif char == '>':
    # do another thing
else:
    # do something completely different
Just to avoid confusion, note that Python uses a single equals sign (=) for assignment, that is for setting one thing equal to something else. In order to test for equality, use double equals signs (==) instead. Beginning programmers often confuse the two.
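Here is one way the for loop and the if / elif / else statement might be combined. This is only a sketch: the sample string and the three counter names are invented for the example. It sorts each character into one of three groups, using == to test and = (or +=) to assign.

```python
html = '<p>2 > 1</p>'   # hypothetical sample string

opening = 0             # = assigns a value
closing = 0
other = 0
for char in html:
    if char == '<':     # == tests for equality
        opening += 1
    elif char == '>':
        closing += 1
    else:
        other += 1

print(opening)          # prints 2
print(closing)          # prints 3
print(other)            # prints 7
```

Note that the right angle bracket in the middle of the sample text is counted along with the ones that close tags. A test on a single character cannot tell them apart, which is why we need to keep track of whether we are inside a tag.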
How will we keep track of whether or not we're inside a tag? We can use an integer variable called inside, which will be 1 (true) if we're inside a tag and 0 (false) if we're not.
The stripTags routine
Putting it all together, the final version of our routine is shown below. Copy this code and paste it into Komodo Edit, then save it in a file called dh.py. This file is going to contain all of the code that we will wish to re-use. In other words, dh.py is a module.
# Given a string containing HTML, remove all characters
# between matching pairs of angled brackets, inclusive.
def stripTags(html):
    inside = 0
    text = ''
    for char in html:
        if char == '<':
            inside = 1
            continue
        elif (inside == 1 and char == '>'):
            inside = 0
            continue
        elif inside == 1:
            continue
        else:
            text += char
    return text
As you look over this code, you will notice that we needed one final command to make it work. The Python continue statement tells the interpreter to jump back to the top of the enclosing loop. So if the character is a left angle bracket, once you've made a note that you're inside a tag, you're finished processing that character. You want to go get the next character in the html string, rather than continuing to process the one you've already dealt with.
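To try the routine out without downloading anything, you could run something like the sketch below. The routine is repeated here only so that the example runs on its own; the two sample strings are made up, the first one from the tag examples at the start of the lesson. The second call shows why this is the quick and dirty way: the routine cannot tell an angle bracket in ordinary text from one that starts a tag.

```python
# Copy of the lesson's stripTags routine, repeated so this block runs on its own.
def stripTags(html):
    inside = 0
    text = ''
    for char in html:
        if char == '<':
            inside = 1
            continue
        elif (inside == 1 and char == '>'):
            inside = 0
            continue
        elif inside == 1:
            continue
        else:
            text += char
    return text

# Hypothetical sample input: a comment followed by a title element.
sample = '<!-- This is a comment --><title>Title of page</title>'
print(stripTags(sample))              # prints Title of page

# A limitation: everything from the bare '<' to the next '>' is dropped.
print(stripTags('3 < 5 and 7 > 6'))   # prints 3  6 (with two spaces)
```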
Python lists
Now that we have the ability to extract raw text from web pages, we're going to want to get the text in a form that is easy to process. So far, when we've needed to store information in our Python programs, we've usually used strings. There were a couple of exceptions, however. In the stripTags routine, we also made use of an integer named inside to store a 1 when we were processing a tag and a 0 when we weren't.
inside = 1
And whenever we've needed to read from or write to a file, we've used a special file handle like f in the example below.
f = open('helloworld.txt','w')
f.write('hello world')
f.close()
One of the most useful types of object that Python provides, however, is the list, an ordered collection of other objects (including, potentially, other lists). The fact that lists can contain lists makes them ideal for storing tree-like structures, something that we will explain soon and come back to repeatedly. It is also straightforward to turn a string into a list of characters or a list of words, as shown in the following program. Copy it into Komodo Edit, save it as string-to-list.py and execute it. Compare the two lists that are printed to the "Command Output" pane.
# string-to-list.py
# some strings
s1 = 'hello world'
s2 = 'howdy world'
# list of characters
charlist = []
for char in s1:
    charlist.append(char)
print charlist
# list of 'words'
wordlist = s2.split()
print wordlist
The first routine uses a for loop to step through each character in the string s1, and appends the character to the end of charlist. The second routine makes use of the split operation to break the string s2 apart wherever there is whitespace (spaces, tabs, returns and similar characters). Actually, it is a bit of a simplification to refer to the objects in the second list as 'words'. Try changing s2 in the above program to 'howdy world!' and running it again. What happened to the exclamation mark?
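A few more properties of lists and of split are worth trying in the Python shell. The sketch below is an illustration with made-up sample values, and the name nested is an assumption for the example. It shows that split treats any run of whitespace as a single separator, that items in a list are numbered starting from 0, and that a list can hold other lists.

```python
# split() with no arguments treats any run of whitespace as one separator.
print('one\ttwo\nthree   four'.split())   # prints ['one', 'two', 'three', 'four']

# Items in a list are numbered starting from 0.
wordlist = 'howdy world'.split()
print(wordlist[0])                        # prints howdy
print(len(wordlist))                      # prints 2

# A list can hold other lists; use one index per level to reach an item.
nested = [['hello', 'world'], ['howdy', 'world']]
print(nested[1][0])                       # prints howdy
```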
Given what you've learned so far, you can now open a URL, download the web page to a string, strip out the HTML and then split the text into a list of words. Try executing the following program.
# html-to-list-1.py
import urllib2
import dh
# note that we are using a copy of the old web page because the DCB site doesn't work the way it used to
url = 'http://niche-canada.org/files/dcb/dcb-34298.html'
response = urllib2.urlopen(url)
html = response.read()
text = dh.stripTags(html)
wordlist = text.split()
print wordlist[0:120]
You should get something like the following.
['Dictionary', 'of', 'Canadian', 'Biography', 'DOLLARD', 'DES',
'ORMEAUX', '(called', 'Daulat', 'in', 'his', 'death', 'certificate',
'and', 'Daulac', 'by', 'some', 'historians),', 'ADAM,', 'soldier,',
'\x93garrison', 'commander', 'of', 'the', 'fort', 'of',
'Ville-Marie', '[Montreal]\x94;', 'b.', '1635,', 'killed', 'by',
'the', 'Iroquois', 'at', 'the', 'Long', 'Sault', 'in',
'May\xa01660.', '\xa0\xa0\xa0\xa0\xa0', 'Nothing', 'is', 'known',
'of', 'Dollard\x92s', 'activities', 'prior', 'to', 'his', 'arrival',
'in', 'Canada', 'except', 'that', '\x93he', 'had', 'held', 'some',
'commands', 'in', 'the', 'armies', 'of', 'France.\x94', 'Having',
'come', 'to', 'Montreal', 'as', 'a', 'volunteer,', 'very',
'probably', 'in', '1658,', 'he', 'continued', 'his', 'military',
'career', 'there.', 'In', '1659', 'and', '1660', 'he', 'was',
'described', 'as', 'an', '\x93officer\x94', 'or', '\x93garrison',
'commander', 'of', 'the', 'fort', 'of', 'Ville-Marie,\x94', 'a',
'title', 'that', 'he', 'shared', 'with', 'Pierre', 'Picot\xe9',
'de', 'Belestre.', 'We', 'do', 'not', 'however', 'know', 'what',
'his', 'particular', 'responsibility', 'was.']
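Two things in this output may need a word of explanation. The first is the slice wordlist[0:120], which asks for the items from position 0 up to, but not including, position 120. The second is the odd-looking sequences such as \x93 and \x94. The sketch below rests on an assumption that the lesson does not state: that the page was saved in the Windows-1252 encoding, where those byte values stand for curly quotation marks. The short list named words is made up for the example.

```python
# Slicing returns the items from the first index up to, but not including, the second.
words = ['Dictionary', 'of', 'Canadian', 'Biography']
print(words[0:2])             # prints ['Dictionary', 'of']

# Asking for more items than the list holds is not an error.
print(len(words[0:120]))      # prints 4

# Assumption: the bytes are Windows-1252 text, in which 0x93 is a
# left curly double quotation mark (Unicode code point U+201C).
raw = b'\x93garrison'
decoded = raw.decode('cp1252')
print(decoded == u'\u201cgarrison')   # prints True
```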
Simply having a list of words doesn't buy us much yet. As human beings, we already have the ability to read. We're getting much closer to a representation that our programs can process, however.
Suggested Readings
- Lutz, Learning Python
- Ch. 7: Strings
- Ch. 8: Lists and Dictionaries
- Ch. 10: Introducing Python Statements
- Ch. 15: Function Basics