Natural Language Processing Using Linux
This article discusses how to perform simple textual analysis, such as word counts, bigrams and trigrams, using standard Linux tools. It is heavily based upon Unix for Poets by Kenneth Ward Church, and I would strongly recommend that anyone interested in natural language processing on Unix/Linux download it.

Firstly, you will want to find yourself an interesting corpus of text. Corpora can be found all over the Internet, although the Oxford Text Archive is an excellent place to start. Many corpora are marked up with part-of-speech (POS) tags and other information. For the moment, we are interested only in the text itself. For the purposes of this article, I will be using Charles Dickens' A Christmas Carol.

Tokens, Bigrams and Trigrams

Tokenizing Text

The first simple task we will look at is to tokenize the text and count the individual words. We will use three commands: tr, sort and uniq. tr translates characters, sort will, well, sort, and uniq collapses adjacent duplicate lines of text. As with all Linux tools, each of these utilities has numerous options and modes; read the man pages for detailed information. Let us take a look at how we would use the three together:

[jmatthews@Niobe nlp]$ tr -sc '[A-Z][a-z]' '[\012*]' < corpora/xmasCarol.txt | sort | uniq -c > xmasCarol.hist

A breakdown: tr -sc '[A-Z][a-z]' '[\012*]' replaces every character that does not fall between A-Z or a-z with a line feed (octal 012); -c selects the complement of the first set and -s squeezes each run of line feeds into one. The corpus is redirected into tr, and the resulting list of tokens, one per line, is piped into sort. As you might imagine, sort then sorts the tokenized list of words, and this in turn is piped into uniq. uniq removes the duplicate lines, which are adjacent after sorting, while the -c option prefixes each remaining line with the number of times it occurred. The output should look something similar to this:
1
43 A
1 Abels
...
202 you
27 young
3 younger
1 youngest
45 your
3 yours
4 yourself
1 youth
1 zeal
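To make the pipeline easier to experiment with, here is a minimal sketch that runs the same three steps on a one-line sample instead of the corpus. The function name word_counts and the sample sentence are made up for illustration, and the character set is written as 'A-Za-z' with '\n', a common spelling of the same idea (as far as I know, GNU tr treats the square brackets in '[A-Z][a-z]' as literal characters).

```shell
# Tokenize standard input into one word per line, then count each distinct word.
# word_counts is a hypothetical helper name, not something from the article.
word_counts() {
    # -c takes the complement of the set (everything that is NOT a letter);
    # -s squeezes each run of replacement characters into a single newline.
    tr -sc 'A-Za-z' '\n' | sort | uniq -c
}

# Made-up sample text; "the" occurs three times and "and" twice.
printf 'the cat and the dog and the bird\n' | word_counts
```

The width of the count column differs between implementations of uniq, so it is safer to split the output on whitespace (for example with awk) than to rely on fixed columns.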
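We can use the word histogram further to identify the most commonly used words within the text. Note that the command below uses sort -rb, which compares the counts as strings rather than as numbers, so the listing starts at 99 even though you occurs 202 times; substitute sort -rn for a true numeric ordering: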
[jmatthews@Niobe nlp]$ sort -rb xmasCarol.hist | sed 5q
99 one
97 which
95 there
92 up
91 from
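The difference between comparing counts as text and as numbers can be seen on three lines taken from the listings in this article. This is only a sketch: the variable name hist is made up, and the lines are written without the leading padding that uniq -c normally adds.

```shell
# Three "count word" lines using counts that appear in the article's output.
hist='202 you
99 one
45 your'

# A plain reverse sort compares the lines as text, so "99" outranks "202".
printf '%s\n' "$hist" | sort -r | sed 1q     # prints: 99 one

# -n compares the leading number by value, so 202 comes first.
printf '%s\n' "$hist" | sort -rn | sed 1q    # prints: 202 you
```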
While looking at individual words can reveal some important details about the text, looking at pairs or triplets of words can yield much more information.

Bigrams and Trigrams

Bigrams are pairs of adjacent words. A list of bigrams, most common first, can be created using the following:

[jmatthews@Niobe nlp]$ tr -sc '[A-Z][a-z]' '[\012*]' < corpora/xmasCarol.txt > xmasCarol.tr
[jmatthews@Niobe nlp]$ tail -n +2 xmasCarol.tr > xmasCarol.nw
[jmatthews@Niobe nlp]$ paste xmasCarol.tr xmasCarol.nw | sort | uniq -c | sort -rg > xmasCarol.bigram

This works by tokenizing the text as we have seen. tail -n +2 creates another file with all but the first token (the obsolete shorthand tail +2 is not accepted by every modern tail). paste joins the two files side by side, line by line, so that each line holds a word followed by the word after it. This is piped into sort, then duplicates are removed and counted using uniq -c. The result is once again piped into sort, this time with the -rg options so that the counts are compared numerically and listed in descending order. The output will look something like this:
141 in the
107 of the
96 said Scrooge
74 the Ghost
59 and the
...
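The shift-and-paste idea can be tried on a tiny input. The sketch below wraps the three steps in a hypothetical function named bigrams and keeps its intermediate files in a scratch directory created with mktemp -d; both choices are assumptions for illustration, not part of the original commands.

```shell
# Count adjacent word pairs read from standard input, most frequent first.
# bigrams is a made-up helper name used only for this example.
bigrams() {
    # Private scratch directory for the two intermediate files.
    tmp=$(mktemp -d) || return 1
    tr -sc 'A-Za-z' '\n' > "$tmp/words"
    # The same tokens shifted up by one line, so line i holds word i+1.
    tail -n +2 "$tmp/words" > "$tmp/next"
    # paste joins word i and word i+1 with a tab; identical pairs are then counted.
    # The last word has no successor, so it appears once paired with nothing.
    paste "$tmp/words" "$tmp/next" | sort | uniq -c | sort -rn
    rm -r "$tmp"
}

# Made-up sample: the pair "the cat" occurs twice.
printf 'the cat saw the cat\n' | bigrams
```

Passing the whole corpus through the function, for example bigrams < corpora/xmasCarol.txt, should give the same kind of listing as the three separate commands.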
Calculating trigrams is very similar, requiring another call to tail. This time though, make a shell script to simplify matters. The shell script should look like this:

tr -sc '[A-Z][a-z]' '[\012*]' > $$words
tail -n +2 $$words > $$nextwords
tail -n +3 $$words > $$nextwords2
paste $$words $$nextwords $$nextwords2 | sort | uniq -c | sort -rg
rm $$words $$nextwords $$nextwords2

Here $$ expands to the process ID of the shell running the script, which gives the three temporary files names that are unlikely to clash with existing files. The shell script can then be run like this:
[jmatthews@Niobe nlp]$ sh trigram < corpora/xmasCarol.txt
20 Scrooge s nephew
20 I don t
14 said the Ghost
12 would have been
12 that it was
12 it was a
12 don t know
11 the Ghost of
11 Scrooge s niece
11 It was a
11 Ghost of Christmas
10 that he was
10 said Scrooge I
...
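Entries such as "Scrooge s nephew" and "don t know" come from the tokenizer splitting words at the apostrophe. As a rough sketch of one alternative, the apostrophe can be added to the set of characters that tr keeps. The function name below is made up, and this simple version also keeps apostrophes that are really single quotation marks.

```shell
# Tokenize standard input, treating the apostrophe as part of a word.
# tokenize_keep_apostrophes is a hypothetical name for illustration.
tokenize_keep_apostrophes() {
    # Double quotes let the set contain a literal apostrophe.
    tr -sc "A-Za-z'" '\n'
}

# Made-up sample built from words that appear in the trigram listing.
printf "Scrooge's nephew said I don't know\n" | tokenize_keep_apostrophes
```

With this change the sample yields six tokens, with Scrooge's and don't each kept as a single word.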
You can see that many of the trigrams contain words that were split at an apostrophe (for example, "Scrooge s nephew" from "Scrooge's nephew"). This is a feature or a bug, depending on how you want to tokenize the original text. By adding the apostrophe to the set of characters that tr keeps, you can tokenize apostrophized words as one (I'd, Scrooge's etc.).

Conclusion

Although only basic, these examples show how standard Linux utilities can be combined to create powerful textual analysis tools. Again, I strongly recommend that anyone who found this interesting check out Unix for Poets.

References

Church, Kenneth Ward. Unix for Poets. AT&T Research.
Jurafsky, D., Martin, J. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall. New Jersey: 2000.
Markert, Katja. AI31: Natural Language Processing Lecture Notes. University of Leeds, 2003/4.
Submitted: 24/10/2004 Article content copyright © James Matthews, 2004.
All content copyright © 1998-2007, Generation5 unless otherwise noted.