Corpus

tml.corpus
Class Corpus

java.lang.Object
  tml.corpus.Corpus

All Implemented Interfaces:: java.lang.Cloneable

Direct Known Subclasses:: ParagraphCorpus, RepositoryCorpus, SearchResultsCorpus, SentenceCorpus

public abstract class Corpus
extends java.lang.Object
implements java.lang.Cloneable

A Corpus is a set of TextPassages that are processed to build a SemanticSpace.

Steps of this process are:

Tokenizing the document, i.e. recognizing terms, URLs, etc.
Removing stopwords, like prepositions
Stemming
Term selection
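
The four steps above can be sketched in plain Java as follows. This is a minimal illustration, not the TML implementation: the class `PreprocessingSketch`, its methods, the stopword list, and the plural-stripping "stemmer" are all hypothetical stand-ins, and term selection is assumed to mean keeping terms with a minimum document frequency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; none of these names are part of the TML API.
final class PreprocessingSketch {

    // Tiny hypothetical stopword list; a real list would be far larger.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "of", "in", "a", "on", "and"));

    // Step 1: lowercase, then split on anything that is not a letter or digit.
    static List<String> tokenize(String passage) {
        List<String> tokens = new ArrayList<>();
        for (String token : passage.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 3 stand-in: strips a plural "s"; a real stemmer does much more.
    static String stem(String token) {
        boolean plural = token.length() > 3 && token.endsWith("s") && !token.endsWith("ss");
        return plural ? token.substring(0, token.length() - 1) : token;
    }

    // Steps 1-3 for one passage: the distinct stems that survive stopword removal.
    static Set<String> distinctStems(String passage) {
        Set<String> stems = new HashSet<>();
        for (String token : tokenize(passage)) {
            if (!STOPWORDS.contains(token)) { // Step 2: drop stopwords
                stems.add(stem(token));
            }
        }
        return stems;
    }

    // Step 4: keep only the terms that occur in at least minDf passages.
    static Map<String, Integer> selectTerms(List<String> passages, int minDf) {
        Map<String, Integer> documentFrequency = new LinkedHashMap<>();
        for (String passage : passages) {
            for (String stem : distinctStems(passage)) {
                documentFrequency.merge(stem, 1, Integer::sum);
            }
        }
        documentFrequency.values().removeIf(df -> df < minDf);
        return documentFrequency;
    }
}
```

Calling `PreprocessingSketch.selectTerms(passages, 2)` returns a map from each surviving term to the number of passages that contain it.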

Once the Corpus is loaded, it can create a SemanticSpace using a particular dimensionality reduction technique. Currently only singular value decomposition (SVD) is implemented; other techniques are planned.

The following code shows how to load a Corpus and create a SemanticSpace:

        ...
        corpus.setName("Structure of English"); // A human readable name for the corpus
        corpus.setTermSelectionCriteria(TermSelection.MIN_DF); // Every term must have a minimum document frequency
        corpus.setTermSelectionThreshold(1); // Terms must appear in at least 2 documents
        corpus.load(storage); // Load the corpus from the storage
        corpus.createSemanticSpace(); // Create an empty semanticSpace

        SemanticSpace space = corpus.getSemanticSpace();
        space.setTermWeightScheme(TermWeight.TF); // The term weight scheme will be the raw term frequency
        space.setNormalized(true); // The final vectors will be normalized
        space.setDimensionalityReduction(DimensionalityReduction.DIMENSIONS_MAX_NUMBER);
        space.setDimensionalityReductionThreshold(2); // Number of dimensions to keep on the dimensionality reduction
        space.setDimensionsReduced(true); // The dimensions will be reduced
        space.calculate(); // Calculate the semantic space
        ...
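
To make the effect of raw term frequency weighting followed by normalization more concrete, here is a small sketch. It assumes that "normalized" means scaling each document vector to unit Euclidean length, which this page does not state explicitly. The class `TermWeightSketch` and its method are hypothetical and not part of TML.

```java
// Illustrative sketch only; not part of the TML API.
final class TermWeightSketch {

    // Takes a terms x documents matrix of raw term frequencies and returns a copy
    // in which every column (document vector) has Euclidean length 1.
    // All-zero columns are left as zeros to avoid dividing by zero.
    static double[][] normalizeColumns(double[][] termDoc) {
        int terms = termDoc.length;
        int docs = terms == 0 ? 0 : termDoc[0].length;
        double[][] result = new double[terms][docs];
        for (int d = 0; d < docs; d++) {
            double sumOfSquares = 0.0;
            for (int t = 0; t < terms; t++) {
                sumOfSquares += termDoc[t][d] * termDoc[t][d];
            }
            double norm = Math.sqrt(sumOfSquares);
            for (int t = 0; t < terms; t++) {
                result[t][d] = norm == 0.0 ? 0.0 : termDoc[t][d] / norm;
            }
        }
        return result;
    }
}
```

For a document with raw frequencies 3 and 4 for two terms, the normalized vector is approximately (0.6, 0.8).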

Author:: Jorge Villalon

Nested Class Summary
`class`	`Corpus.PassageFreqs`

Constructor Summary
`Corpus()` Constructor for every `Corpus`.

Method Summary
`int`	`getDimensions()`
`Stats[]`	`getDocStats()`
`java.lang.String`	`getFilename()`
`int`	`getIndexOfTerm(java.lang.String term)` Retrieves the index of the term in the corpus
`java.lang.String`	`getLuceneQuery()` Returns the string representing the Lucene query used to create the `Corpus`
`java.lang.String`	`getName()`
`int`	`getNonzeros()`
`CorpusParameters`	`getParameters()`
`Corpus.PassageFreqs[]`	`getPassageFrequencies()`
`java.lang.String[]`	`getPassages()`
`int[]`	`getPassagesLuceneIds()`
`long`	`getProcessingTime()`
`Repository`	`getRepository()`
`SemanticSpace`	`getSemanticSpace()`
`Jama.Matrix`	`getTermDocMatrix()`
`double[]`	`getTermEntropies()`
`java.lang.String[]`	`getTerms()`
`Stats[]`	`getTermStats()`
`boolean`	`isDbAnnotations()`
`boolean`	`isProjection()`
`void`	`load(Repository repository)` Loads the content of the documents in the query and creates the term-doc matrix
`java.lang.String`	`parametersSummary()` Prints to the console the parameters used in this corpus
`java.lang.String`	`printFrequencies()`
`Corpus`	`projectCorpus(Corpus corpusToProject)` This method projects a `Corpus` into another one.
`void`	`setDbAnnotations(boolean dbAnnotations)`
`void`	`setDocStats(Stats[] docStats)`
`void`	`setName(java.lang.String name)`
`void`	`setParameters(CorpusParameters parameters)`
`void`	`setProjection(boolean projection)`
`void`	`setTermEntropies(double[] termEntropies)`
`void`	`setTermStats(Stats[] termStats)`
`java.lang.String`	`toString()` Returns the name of the `Corpus`.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Constructor Detail

Corpus

public Corpus()

Constructor for every Corpus.


Method Detail

isDbAnnotations

public boolean isDbAnnotations()

setDbAnnotations

public void setDbAnnotations(boolean dbAnnotations)

getPassagesLuceneIds

public int[] getPassagesLuceneIds()

Returns:: the passagesLuceneIds

isProjection

public boolean isProjection()

Returns:: the projection

getIndexOfTerm

public int getIndexOfTerm(java.lang.String term)

Retrieves the index of the term in the corpus

Parameters:: term - the term to look up
Returns:: the term index or -1 if not found
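
The contract above (an index for known terms, -1 otherwise) can be sketched with a hash map built once from the term array. How `Corpus` actually stores its terms is not documented here, so `TermIndexSketch` is a hypothetical illustration of the contract only.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the TML implementation.
final class TermIndexSketch {

    private final Map<String, Integer> indexByTerm = new HashMap<>();

    // Records the position of each term; the first occurrence wins on duplicates.
    TermIndexSketch(String[] terms) {
        for (int i = 0; i < terms.length; i++) {
            indexByTerm.putIfAbsent(terms[i], i);
        }
    }

    // Returns the index of the term, or -1 if the term is unknown.
    int indexOf(String term) {
        Integer index = indexByTerm.get(term);
        return index == null ? -1 : index;
    }
}
```

Callers should check for -1 before using the result as an array or matrix row index.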

getFilename

public java.lang.String getFilename()

getTermEntropies

public double[] getTermEntropies()

Returns:: the termEntropies

setTermEntropies

public void setTermEntropies(double[] termEntropies)

Parameters:: termEntropies - the termEntropies to set

getTermStats

public Stats[] getTermStats()

Returns:: the termStats

setTermStats

public void setTermStats(Stats[] termStats)

Parameters:: termStats - the termStats to set

getDocStats

public Stats[] getDocStats()

Returns:: the docStats

setDocStats

public void setDocStats(Stats[] docStats)

Parameters:: docStats - the docStats to set

setProjection

public void setProjection(boolean projection)

Parameters:: projection - the projection to set

getNonzeros

public int getNonzeros()

Returns:: the nonzeros

getLuceneQuery

public java.lang.String getLuceneQuery()

Returns the string representing the Lucene query used to create the Corpus

Returns:: the query used to create the Corpus

getName

public java.lang.String getName()

Returns:: the name of the Corpus

getParameters

public CorpusParameters getParameters()

Returns:: the parameters

getPassageFrequencies

public Corpus.PassageFreqs[] getPassageFrequencies()

Returns:: the passageFrequencies

getPassages

public java.lang.String[] getPassages()

Returns:: the passages

getProcessingTime

public long getProcessingTime()

Returns:: the time it took to load the Corpus

getRepository

public Repository getRepository()

Returns:: the repository

getSemanticSpace

public SemanticSpace getSemanticSpace()

Returns:: the SemanticSpace for the Corpus

getTermDocMatrix

public Jama.Matrix getTermDocMatrix()

Returns:: the raw matrix with the term frequencies for the Corpus
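
As a rough picture of what a raw term frequency matrix contains, the sketch below counts term occurrences per passage into a `double[][]` with one row per term and one column per passage. The row and column orientation is an assumption, and `TermDocMatrixSketch` is hypothetical. The real method returns a `Jama.Matrix`, which this sketch avoids so that it stays free of external dependencies.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the TML implementation.
final class TermDocMatrixSketch {

    // Builds a terms x passages matrix of raw counts.
    // Tokens that are not in the term list are ignored.
    static double[][] build(String[] terms, List<List<String>> tokenizedPassages) {
        Map<String, Integer> rowOfTerm = new HashMap<>();
        for (int t = 0; t < terms.length; t++) {
            rowOfTerm.putIfAbsent(terms[t], t);
        }
        double[][] matrix = new double[terms.length][tokenizedPassages.size()];
        for (int d = 0; d < tokenizedPassages.size(); d++) {
            for (String token : tokenizedPassages.get(d)) {
                Integer row = rowOfTerm.get(token);
                if (row != null) {
                    matrix[row][d] += 1.0;
                }
            }
        }
        return matrix;
    }

    // Counts the non-zero cells of the matrix.
    static int nonzeros(double[][] matrix) {
        int count = 0;
        for (double[] row : matrix) {
            for (double value : row) {
                if (value != 0.0) {
                    count++;
                }
            }
        }
        return count;
    }
}
```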

getTerms

public java.lang.String[] getTerms()

Returns:: the terms

load

public void load(Repository repository)
          throws NotEnoughTermsInCorpusException,
                 java.io.IOException,
                 NoDocumentsInCorpusException,
                 TermWeightingException

Loads the content of the documents in the query and creates the term-doc matrix

Parameters:: repository - the repository to search
Throws:: java.io.IOException; NotEnoughTermsInCorpusException; NoDocumentsInCorpusException; TermWeightingException

getDimensions

public int getDimensions()

Returns:: the dimensions

parametersSummary

public java.lang.String parametersSummary()

Prints to the console the parameters used in this corpus

printFrequencies

public java.lang.String printFrequencies()

projectCorpus

public Corpus projectCorpus(Corpus corpusToProject)
                     throws java.lang.Exception

This method projects a Corpus into another one. The Corpus to project is passed as the parameter, and the projected Corpus is returned. The returned Corpus has the same Dictionary as this Corpus and uses the same parameters to calculate its SemanticSpace.

Parameters:: corpusToProject - the Corpus to project
Returns:: the projected Corpus
Throws:: java.lang.Exception

setName

public void setName(java.lang.String name)

Parameters:: name - the name for the Corpus

setParameters

public void setParameters(CorpusParameters parameters)

Parameters:: parameters - the parameters to set

toString

public java.lang.String toString()

Returns the name of the Corpus.

Overrides:: toString in class java.lang.Object
