Package tml.corpus

Implements the classes required to manage corpora as bags of words. It also includes sentence-level natural language processing (NLP).

Class Summary
Corpus               A Corpus is a set of TextPassages that are processed to build a SemanticSpace.
CorpusParameters     Encapsulates all the parameters required to create a Corpus and its corresponding SemanticSpace.
Dictionary           Represents a group of Terms (words or symbols), usually obtained from a set of documents or text passages.
ParagraphCorpus      Corpus that represents the paragraphs of a TextDocument.
RepositoryCorpus     Represents a corpus containing all the documents in the repository.
SearchResultsCorpus  Represents a general corpus for which any search criteria can be used.
SentenceCorpus       Represents a corpus formed from the sentences of a document.
SimpleCorpus         A simple corpus containing the set of documents in a folder; it treats each document as a vector.
Term                 Represents a unique word within a Corpus.
TextDocument         Represents a whole document, comprising its content, a title, and a URL.
TextPassage          Represents a text passage that is part of a Corpus.

Enum Summary
CorpusParameters.DimensionalityReduction  Criteria by which a SemanticSpace reduces (or does not reduce) the dimensions of the space.
CorpusParameters.TermSelection            Criteria for selecting the terms that will be kept in the corpus.

Package tml.corpus Description

Implements the classes required to manage corpora as bags of words. It also includes sentence-level natural language processing (NLP).

Package Specification

This package implements the bag-of-words approach for documents at three levels: document, paragraph, and sentence. Because grammatical information is available at the sentence level, it also includes the Penn Treebank parse tree of each sentence.
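
The three levels can be illustrated with a minimal, self-contained sketch. It does not use the tml.corpus classes: BagOfWordsLevels and its methods are hypothetical names, and the regex-based tokenizer, paragraph splitter, and sentence splitter are simplifying assumptions, not the package's own segmentation. Parse trees are out of scope here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of tml.corpus.
public class BagOfWordsLevels {

    /** Counts lowercased word tokens in a piece of text (a bag of words). */
    public static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Assumption: a token is any run of letters or digits.
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Splits text on a separator regex and builds one bag per non-blank unit. */
    public static List<Map<String, Integer>> bagsPerUnit(String text, String separatorRegex) {
        List<Map<String, Integer>> bags = new ArrayList<>();
        for (String unit : text.split(separatorRegex)) {
            if (!unit.trim().isEmpty()) {
                bags.add(bagOfWords(unit));
            }
        }
        return bags;
    }

    /** Paragraph level. Assumption: paragraphs are separated by a blank line. */
    public static List<Map<String, Integer>> paragraphBags(String document) {
        return bagsPerUnit(document, "\\n\\s*\\n");
    }

    /** Sentence level. Assumption: a sentence ends with '.', '!' or '?' followed by whitespace. */
    public static List<Map<String, Integer>> sentenceBags(String document) {
        return bagsPerUnit(document, "(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String doc = "Cats chase mice. Mice hide.\n\nDogs chase cats.";
        System.out.println("Document:   " + bagOfWords(doc));
        System.out.println("Paragraphs: " + paragraphBags(doc));
        System.out.println("Sentences:  " + sentenceBags(doc));
    }
}
```

The same text yields one bag at the document level, one bag per paragraph, and one bag per sentence, which mirrors the roles the class summary assigns to document-, paragraph-, and sentence-based corpora.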