Natural Language Processing Using Linux

This article will discuss how to perform simple textual analysis such as word counts, bigrams and trigrams, using standard Linux tools. The article is heavily based upon Unix For Poets by Kenneth Ward Church, and I would strongly recommend anyone interested in natural language processing on Unix/Linux download it.

Firstly, you will want to find yourself an interesting corpus of text. These can be found all over the Internet, although the Oxford Text Archive is an excellent place to start. Many corpora are marked up with part-of-speech (POS) tags and other information. For the moment, we are interested in only the text itself. For the purposes of this article, I will be using Charles Dickens' A Christmas Carol.

Tokens, Bigrams and Trigrams

Tokenizing Text

The first simple task we will look at is tokenizing the text and counting the individual words. We will use three commands: tr, sort and uniq. tr translates characters, sort will, well, sort, and uniq removes adjacent duplicate lines of text. As with all Linux tools, each of these utilities has numerous options and modes; read the man pages for detailed information.
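One detail worth a quick sketch: uniq only collapses duplicate lines that sit next to each other, which is why sort comes before it in the pipelines that follow. The snippet below uses a made-up three-line input (not taken from the corpus), and the variable names are purely illustrative.

```shell
# Made-up input: "b" appears twice, but not on adjacent lines.
lines='b
a
b'

# Without sorting, uniq sees no adjacent repeats and keeps all three lines.
unsorted=$(printf '%s\n' "$lines" | uniq | wc -l | tr -d ' ')

# After sorting, the two "b" lines are neighbours and collapse into one.
sorted=$(printf '%s\n' "$lines" | sort | uniq | wc -l | tr -d ' ')

printf 'lines kept without sort: %s\n' "$unsorted"
printf 'lines kept with sort:    %s\n' "$sorted"
```

The first count should be 3 and the second 2.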

Let us take a look at how we would use the three together:

[jmatthews@Niobe nlp]$ tr -sc '[A-Z][a-z]' '[\012*]' \
  < corpora/xmasCarol.txt | sort | uniq -c > xmasCarol.hist

A breakdown: tr -sc '[A-Z][a-z]' '[\012*]' replaces every character that does not fall between A-Z or a-z with a line feed (octal 012); the -c option complements the set and -s squeezes each run of line feeds into one. The corpus is redirected into tr, and the resulting list of tokens, one per line, is piped into sort. As you might imagine, sort then sorts the tokenized list of words, which in turn is piped into uniq. uniq collapses duplicate lines, and the -c option prefixes each remaining line with the number of times it occurred. The output should look something similar to this:

      1
     43 A
      1 Abels
        ...
    202 you
     27 young
      3 younger
      1 youngest
     45 your
      3 yours
      4 yourself
      1 youth
      1 zeal
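To see the same pipeline on something small enough to check by hand, here is a minimal sketch. The function name word_hist and the sample sentence are my own inventions for illustration; only the tr, sort and uniq calls come from the command above.

```shell
# word_hist is a hypothetical helper name: it reads text on stdin and
# prints one "count token" line per distinct token.
word_hist() {
    # Replace every run of non-letters with a single newline (one token
    # per line), sort so equal tokens are adjacent, then count each run.
    tr -sc '[A-Z][a-z]' '[\012*]' | sort | uniq -c
}

# Made-up sample input standing in for the corpus file.
printf 'the cat and the hat and the bat\n' | word_hist
```

Here "the" should be reported with a count of 3 and "and" with a count of 2, with the remaining three words appearing once each.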

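We can use the word histogram further to look for the most commonly used words within the text. Be aware that sort -rb compares lines as text rather than as numbers, so the command below ranks 99 ahead of larger counts such as the 202 occurrences of "you" in the histogram; substituting sort -rn ranks by the numeric count instead: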

[jmatthews@Niobe nlp]$ sort -rb xmasCarol.hist | sed 5q
     99 one
     97 which
     95 there
     92 up
     91 from
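The difference between textual and numeric ordering can be sketched on a few histogram lines. The three lines below are hand-picked from the listing earlier in the article, and the variable names are only illustrative.

```shell
# Three lines in the "count token" layout produced by uniq -c.
hist='     99 one
    202 you
     45 your'

# Textual comparison: -b skips the leading blanks, then the digits are
# compared character by character, so "99" beats "45", which beats "202".
by_text=$(printf '%s\n' "$hist" | sort -rb | sed 1q)

# Numeric comparison: the leading number is read as a value, so 202 wins.
by_number=$(printf '%s\n' "$hist" | sort -rn | sed 1q)

printf 'text order starts with:    %s\n' "$by_text"
printf 'numeric order starts with: %s\n' "$by_number"
```

The textual ordering should start with the "99 one" line, the numeric ordering with "202 you".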

While looking at individual words can reveal some important details about the text, looking at pairs or triplets of words can yield much more information.

Bigrams and Trigrams

Bigrams are pairs of adjacent words. A ranked list of the most commonly used bigrams can be created using the following:

[jmatthews@Niobe nlp]$ tr -sc '[A-Z][a-z]' '[\012*]' \
   < corpora/xmasCarol.txt > xmasCarol.tr
[jmatthews@Niobe nlp]$ tail -n +2 xmasCarol.tr > xmasCarol.nw
[jmatthews@Niobe nlp]$ paste xmasCarol.tr xmasCarol.nw | sort \
   | uniq -c | sort -rg > xmasCarol.bigram

This works by first tokenizing the text as we have seen. tail -n +2 (written tail +2 on older systems) creates another file with all but the first token. paste joins the two files side by side, line by line, so each token sits next to its successor. This is piped into sort, then duplicates are collapsed and counted using uniq -c. The result is once again piped into sort, this time with the -rg options so that the counts are sorted numerically in descending order. The output will look something like this:

    141 in      the
    107 of      the
     96 said    Scrooge
     74 the     Ghost
     59 and     the
     ...
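The same idea can be sketched as a small function that cleans up after itself. This is only an illustration under a few assumptions: the name bigrams and the sample sentence are made up, mktemp is assumed to be available for the temporary file, and sort -rn is used in place of -rg (for whole-number counts the two give the same ranking).

```shell
# bigrams is a hypothetical helper: it reads text on stdin and prints
# "count first second" lines, most frequent pair first.
bigrams() {
    words=$(mktemp) || return 1
    # One token per line, as in the tokenizing step.
    tr -sc '[A-Z][a-z]' '[\012*]' > "$words"
    # Pair line i of the token list with line i+1 by pasting the list
    # against a copy of itself that starts at the second token.
    # -rn orders by the numeric count, largest first.
    tail -n +2 "$words" | paste "$words" - | sort | uniq -c | sort -rn
    rm -f "$words"
}

# Made-up sample input standing in for the corpus file.
printf 'said the Ghost and said the Spirit\n' | bigrams
```

The first line of output should be the pair "said the" with a count of 2. As with the commands above, the final token has no successor, so it shows up once paired with an empty second column.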

Calculating trigrams is very similar, requiring another call to tail. This time, though, make a shell script to simplify matters. The shell script should look like this:

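# Tokenize standard input into temporary files named after the shell's PID ($$).
tr -sc '[A-Z][a-z]' '[\012*]' > $$words
tail -n +2 $$words > $$nextwords
tail -n +3 $$words > $$nextwords2

# Line up each token with its two successors, then count and rank the triples.
paste $$words $$nextwords $$nextwords2 | sort | uniq -c \
  | sort -rg

rm $$words $$nextwords $$nextwords2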

The shell script can then be run like this:

[jmatthews@Niobe nlp]$ sh trigram < corpora/xmasCarol.txt
     20 Scrooge s       nephew
     20 I       don     t
     14 said    the     Ghost
     12 would   have    been
     12 that    it      was
     12 it      was     a
     12 don     t       know
     11 the     Ghost   of
     11 Scrooge s       niece
     11 It      was     a
     11 Ghost   of      Christmas
     10 that    he      was
     10 said    Scrooge I
     ...

You can see how many of the trigrams contain words that were split at an apostrophe (for example, "Scrooge s nephew" for "Scrooge's nephew"). This is a feature or a bug, depending on how you want to tokenize the original text. By altering the original call to tr, you can tokenize words containing an apostrophe as a single token (I'd, Scrooge's etc.).
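One way to make that change, sketched here as an assumption rather than the definitive fix, is to add the apostrophe to the set of characters that tr keeps. The function name tokenize_apos and the sample sentence are made up for illustration.

```shell
# tokenize_apos is a hypothetical helper: the same tr call as before, but
# the apostrophe joins the set of characters kept inside a token.
tokenize_apos() {
    tr -sc "[A-Z][a-z]'" '[\012*]'
}

# Made-up sample input standing in for the corpus file.
printf "I don't know, said Scrooge's nephew.\n" | tokenize_apos
```

This should print six tokens, with "don't" and "Scrooge's" each on a single line. The trade-off is that an apostrophe used as a single quotation mark will also stick to the neighbouring word, and typographic (curly) apostrophes are not covered by this set.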

Conclusion

Although these examples are only basic, you can see how standard Linux utilities can be combined to create powerful textual analysis tools. Again, I strongly recommend that anyone who found this interesting check out Unix for Poets.

References

Church, Kenneth Ward. Unix for Poets. AT&T Research.

Jurafsky, D., Martin, J. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall. New Jersey: 2000.

Market, Katja. AI31: Natural Language Processing Lecture Notes. University of Leeds, 2003/4.

Submitted: 24/10/2004

Article content copyright © James Matthews, 2004.
All content copyright © 1998-2007, Generation5 unless otherwise noted.