Table of Contents
NLTK
Natural Language Toolkit, a Python package for NLP.
Corpus
nltk comes with several dozen corpora.
Data Structure
nltk.download() - is a cli that lets you list and download selected corpora. On my machine the corpora were downloaded to ~/nltk_data/corpora by default.
nltk.download('gutenberg') - downloaded a file gutenberg.zip and then unzipped it into a new folder ~/nltk_data/corpora/gutenberg. The file contained multiple .txt files, one for each document.
nltk.corpora.gutenberg.abspaths() - returns a list of full pathnames of the .txt files
nltk.corpora.gutenberg.fileids() - returns a list of filenames of the .txt files
Types
There are several types of corpora. “Type” refers to the format and structure of the data.
Each type has an associated Reader to give access to the corpora.
The gutenberg corpora is of type PlainText and is accessed via the PlainTextCorpusReader.
PlainText files assume paragraphs are separated by a blank line.
The brown corpora is of type Tagged
Plain text
Tagged
Chunked
Parsed
Word Lists and Lexicons
Corpus Reader methods
nltk.corpora.gutenberg.words() - using default tokenizer
nltk.corpora.gutenberg.sents() - using default tokenizer
Languages
English, Spanish, Indian, Japanese languages.