NLTK

Natural Language Toolkit, a Python package for NLP.

Corpus

nltk comes with several dozen corpora.

Data Structure

nltk.download() - is a cli that lets you list and download selected corpora. On my machine the corpora were downloaded to ~/nltk_data/corpora by default.

nltk.download('gutenberg') - downloaded a file gutenberg.zip and then unzipped it into a new folder ~/nltk_data/corpora/gutenberg. The file contained multiple .txt files, one for each document.

nltk.corpora.gutenberg.abspaths() - returns a list of full pathnames of the .txt files

nltk.corpora.gutenberg.fileids() - returns a list of filenames of the .txt files

Types

There are several types of corpora. “Type” refers to the format and structure of the data.

Each type has an associated Reader to give access to the corpora.

The gutenberg corpora is of type PlainText and is accessed via the PlainTextCorpusReader.

PlainText files assume paragraphs are separated by a blank line.

The brown corpora is of type Tagged

Plain text

Tagged

Chunked

Parsed

Word Lists and Lexicons

Corpus Reader methods

nltk.corpora.gutenberg.words() - using default tokenizer

nltk.corpora.gutenberg.sents() - using default tokenizer

Languages

English, Spanish, Indian, Japanese languages.

Curriculum

Table of Contents