Keyword in Context (KWIC)
N-grams
Now that you know how to harvest the textual content of a web page automatically with Python, and have begun to use strings, lists and dictionaries for text processing, there are many other things that you can do with the text besides counting frequencies. What we're going to do next is develop the ability to display keywords in context (often abbreviated as KWIC). Given a text and a keyword, your program will list every occurrence of the keyword in the text, showing it in the context of a fixed number of words on either side. As before, we will wrap the output so that it can be viewed in Firefox and added easily to Zotero. We will also use the output of our KWIC routine as the basis for a number of automatically generated Google searches.
As you work with digital representations, you will come to see that many of them have a hierarchical, or tree-like structure. This is also true of many elements of natural language, particularly at the level of phrases and sentences. People who study the statistical properties of language have found, however, that there is much to be learned by studying linear sequences of linguistic units. These are known as bigrams (2 units), trigrams (3 units), or more generally as n-grams. In any given text, different n-grams will occur with different frequencies. Since natural languages have a certain amount of built-in redundancy, even very large samples of text in a particular language will exhibit a distribution of different frequencies for different n-grams. We will unpack this idea as we go, but for now, we note that in English texts, q is almost always followed by u. In a sense that can be made mathematically precise, if you know that the current character is q, you can predict that the next one will be u with a high degree of certainty. Another way of saying this is that the bigram qu occurs much more frequently in English than qa, qb, and so on.
From text to n-grams
Suppose you have a string like 'it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness'. You already know how to turn a string into a list using the split operation. In Python, you can retrieve a subsequence of a list by using what is called a slice, represented as two indexes separated by a colon. If the first index is empty, it is assumed to be the beginning of the list. If the second index is empty, it is assumed to be the end. Study the following examples.
wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'
wordlist = wordstring.split()
print wordlist[0:4]
-> ['it', 'was', 'the', 'best']
print wordlist[0:6]
-> ['it', 'was', 'the', 'best', 'of', 'times']
print wordlist[6:10]
-> ['it', 'was', 'the', 'worst']
print wordlist[0:12]
-> ['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']
print wordlist[:12]
-> ['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']
print wordlist[12:]
-> ['it', 'was', 'the', 'age', 'of', 'wisdom', 'it', 'was', 'the', 'age', 'of', 'foolishness']
We can combine the slice with a list comprehension to create a function that slides an n-gram window across a word list to create a list of n-grams. Copy the following function to your dh.py module.
# Given a list of words and a number n, return a list
# of n-grams.
def getNGrams(wordlist, n):
return [wordlist[i:i+n] for i in range(len(wordlist)-(n-1))]
If you want to test the function, you can paste the definition into a Python shell. Note that it returns an empty list if you ask for n-grams longer than the total length of your input.
test1 = 'here are four words'
test2 = 'this test sentence has eight words in it'
getNGrams(test1.split(), 5)
-> []
getNGrams(test2.split(), 5)
-> [['this', 'test', 'sentence', 'has', 'eight'],
['test', 'sentence', 'has', 'eight', 'words'],
['sentence', 'has', 'eight', 'words', 'in'],
['has', 'eight', 'words', 'in', 'it']]
Making an n-gram dictionary
Since we want to show our keyword in the context of neighboring terms, we are going to use an n-gram window with an odd-numbered length (3,5,7,...). The next step is to put each n-gram in a dictionary, using the middle word as the key. Since Python indexes start at 0, we can compute the index of this middle word by dividing n by two and losing the remainder. If we are working with 7-grams, for example, the left context will consist of terms indexed by 0, 1, 2, the keyword will be indexed by 3, and the right context terms indexed by 4, 5, 6. Study the following function before adding it to the dh.py module. In particular, you should be clear about what happens when a particular keyword appears in more than one context. Notice also that we determine n by measuring the length of the first item in the list of n-grams, and from n we calculate the index of the keyword using Python's floor division operator.
# Given a list of n-grams, return a dictionary of KWICs,
# indexed by keyword.
def nGramsToKWICDict(ngrams):
kwicdict = {}
keyindex = len(ngrams[0]) // 2
for k in ngrams:
if k[keyindex] not in kwicdict:
kwicdict[k[keyindex]] = [k]
else:
kwicdict[k[keyindex]].append(k)
return kwicdict
Pretty printing a KWIC
"Pretty printing" is the process of formatting output so that it can be easily read by human beings. In the case of our keywords in context, we want to have the keywords lined up in a column, with the terms in the left-hand context right justified, and the terms in the right-hand context left justified. In other words, we want our KWIC display to look something like this:
| killed by the | iroquois | at the long |
| episode in the | iroquois | wars it was |
| of 1660 the | iroquois | having sent forth |
| terror among the | iroquois | by such a |
| shape when 300 | iroquois | burst forth along |
| side of the | iroquois | dollard and his |
| ... |
To get this effect, we are going to need to do a number of list and string manipulations. At this point, you will probably want to open a Python shell so you can experiment a bit. Let's start by figuring out the slices of the list that give us the keyword, left-hand context and right-hand context.
kwic = 'killed by the iroquois at the long'.split()
n = len(kwic)
print n
-> 7
keyindex = n // 2
print keyindex
-> 3
print kwic[:keyindex]
-> ['killed', 'by', 'the']
print kwic[keyindex]
-> iroquois
print kwic[(keyindex+1):]
-> ['at', 'the', 'long']
Now we need to format each of the three columns of our display. The right-hand context is simply going to consist of a string of terms separated by blank spaces.
print ' '.join(kwic[(keyindex+1):])
-> at the long
We want the keywords to have a bit of whitespace padding around them. We can achieve this by using a string method called center and having the overall string be longer than the keyword itself. The expression below adds three blank spaces (6/2) to either side of the keyword. We've added hash marks at the beginning and end of the expression so you can see the leading and trailing blanks.
print '#' + str(kwic[keyindex]).center(len(kwic[keyindex])+6) + '#'
-> # iroquois #
Finally, we want the left-hand context to be right justified. Depending on how large n is, we are going to need the overall length of this column to increase. We do this by defining a variable called width and then making the column length a multiple of this variable (we used a width of 10 characters, but you can make it larger or smaller as desired). The rjust method handles right justification. Once again, we've added hash marks so you can see the leading blanks.
width = 10
print '#' + ' '.join(kwic[:keyindex]).rjust(width*keyindex) + '#'
-> # killed by the#
We can now combine these into a function that takes a KWIC and returns a pretty-printed string. Add this to the dh.py module.
# Given a KWIC, return a string that is formatted for
# pretty printing.
def prettyPrintKWIC(kwic):
n = len(kwic)
keyindex = n // 2
width = 10
outstring = ' '.join(kwic[:keyindex]).rjust(width*keyindex)
outstring += str(kwic[keyindex]).center(len(kwic[keyindex])+6)
outstring += ' '.join(kwic[(keyindex+1):])
return outstring
From HTML to KWIC
We can now create a program that, given a URL and a keyword, wraps a KWIC display in HTML and outputs it in Firefox. This program begins and ends in a similar fashion as the program that computed word frequencies. Copy the code to Komodo Edit, save it as html-to-kwic.py, and execute it.
# html-to-kwic.py
import dh
# create dictionary of n-grams
n = 7
url = 'http://niche-canada.org/files/dcb/dcb-34298.html'
text = dh.webPageToText(url)
fullwordlist = ('# ' * (n//2)).split()
fullwordlist += dh.stripNonAlphaNum(text)
fullwordlist += ('# ' * (n//2)).split()
ngrams = dh.getNGrams(fullwordlist, n)
worddict = dh.nGramsToKWICDict(ngrams)
# output KWIC and wrap with HTML
target = 'iroquois'
outstr = '<pre>'
if worddict.has_key(target):
for k in worddict[target]:
outstr += dh.prettyPrintKWIC(k)
outstr += '<br />'
else:
outstr += 'Keyword not found in source'
outstr += '</pre>'
dh.wrapStringInHTML('html-to-kwic', url, outstr)
There a few things in this code that need to be explained. We want the first few words and last few words in a text to show up when we generate a KWIC display. So we create some padding to put at the beginning and end of our word list. Say we are searching for the keyword 'dictionary' using a 7-gram context and it is the first word in the text. Then, since we are using hash marks for padding, the KWIC might be something like:
# # # dictionary of canadian biography
We're never going to need more than n//2 words of padding. The second thing we'd like to point out is that we've wrapped our display in the HTML pre tag to indicate that our material is preformatted... we don't want the browser monkeying around with our whitespace after we've gone to such lengths to get it right. Finally, notice that we use the has_key dictionary method to make sure that the keyword actually occurs in our text. If it doesn't, we can print a message for the user before sending the output to Firefox.
Turning each KWIC into a Google search link
As with the previous example, we can introduce one more useful revision at this point. Why not use each KWIC as the basis for a possible Google search? Add the following routine to dh.py. We will discuss the style property of the HTML a tag in the next section.
# Given a list of keywords and a link name, return an
# HTML link to a Google search for those terms.
def keywordListToGoogleSearchLink(keywords, linkname):
gsearch = '<a style=\"text-decoration:none\" '
gsearch += 'href=\"http://www.google.com/search?q='
gsearch += '+'.join(keywords)
gsearch += '\">'
gsearch += linkname
gsearch += '</a>'
return gsearch
Although Google tends to filter out stop words automatically, we can use our own routine to do this before creating the search link. The reason we want to filter stop words is because Google puts a limit on the number of search terms that it will pay attention to. If we have a long context for each of our keywords (say a 9-, 11-, or 13-gram) we want every meaningful word to count in our search. The revised version of our code is shown below. Copy it to Komodo Edit, save it as html-to-kwic-2.py and execute it.
# html-to-kwic-2.py
import dh
# create dictionary of n-grams
n = 7
url = 'http://niche-canada.org/files/dcb/dcb-34298.html'
text = dh.webPageToText(url)
fullwordlist = ('# ' * (n//2)).split()
fullwordlist += dh.stripNonAlphaNum(text)
fullwordlist += ('# ' * (n//2)).split()
ngrams = dh.getNGrams(fullwordlist, n)
worddict = dh.nGramsToKWICDict(ngrams)
# output KWIC and wrap with HTML
target = 'iroquois'
outstr = '<pre>'
if worddict.has_key(target):
for k in worddict[target]:
linkname = dh.prettyPrintKWIC(k)
keywords = dh.removeStopwords(k, dh.stopwords)
outstr += dh.keywordListToGoogleSearchLink(keywords, linkname)
outstr += '<br />'
else:
outstr += 'Keyword not found in source'
outstr += '</pre>'
dh.wrapStringInHTML('html-to-kwic-2', url, outstr)
Once the output opens in Firefox, try running the mouse over different lines in the display and checking the bottom bar of your browser (over to the left of where you click to open the Zotero pane). As you pick each hyperlink, you can compare the KWIC line to the Google search that your program constructed.
Try clicking on the KWIC line that reads "ambuscades for the iroquois when returning from"... it is the twentieth one from the top. You will find a number of potentially interesting documents, including links to JSTOR articles and to transcriptions of various volumes of the Jesuit Relations and the works of Francis Parkman. As a historian, you'll want to evaluate these sources critically. Do they add anything that you didn't already know? Are they authoritative? Can you find more reliable versions? Should you add them to Zotero and use them as the basis for further search and analysis? As a programming historian, you have more evidence that digital sources can be profitably 'read' by both human beings and machines.