tml.corpus
Class Corpus

java.lang.Object
  extended by tml.corpus.Corpus
All Implemented Interfaces:
java.lang.Cloneable
Direct Known Subclasses:
ParagraphCorpus, RepositoryCorpus, SearchResultsCorpus, SentenceCorpus

public abstract class Corpus
extends java.lang.Object
implements java.lang.Cloneable

A Corpus is a set of TextPassages that are processed to build a SemanticSpace.

Steps of this process are:

Once the Corpus is loaded, it can create a SemanticSpace using a particular dimensionality reduction technique. Currently only singular value decomposition (SVD) is implemented; other techniques are planned.

The following code shows how to load a Corpus and create a SemanticSpace:

        ...
        corpus.setName("Structure of English"); // A human readable name for the corpus
        corpus.setTermSelectionCriteria(TermSelection.MIN_DF); // Every term must have a minimum document frequency
        corpus.setTermSelectionThreshold(1); // Terms must appear in at least 2 documents
        corpus.load(storage); // Load the corpus from the storage
        corpus.createSemanticSpace(); // Create an empty SemanticSpace

        SemanticSpace space = corpus.getSemanticSpace();
        space.setTermWeightScheme(TermWeight.TF); // The term weight scheme will be the raw term frequency
        space.setNormalized(true); // The final vectors will be normalized
        space.setDimensionalityReduction(DimensionalityReduction.DIMENSIONS_MAX_NUMBER);
        space.setDimensionalityReductionThreshold(2); // Number of dimensions to keep on the dimensionality reduction
        space.setDimensionsReduced(true); // The dimensions will be reduced
        space.calculate(); // Calculate the semantic space
        ...
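
The term selection step in the example above can be illustrated with a minimal, self-contained sketch. It does not use the tml classes: the class MinDfSketch, its whitespace tokenizer, and the strict comparison (document frequency greater than the threshold) are all assumptions. The strict comparison is inferred only from the comment that a threshold of 1 keeps terms appearing in at least 2 documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for minimum-document-frequency term selection.
class MinDfSketch {
    static List<String> selectTerms(String[] passages, int threshold) {
        // Count, for each term, the number of passages that contain it.
        Map<String, Integer> documentFrequency = new TreeMap<>();
        for (String passage : passages) {
            Set<String> seen = new HashSet<>(
                    Arrays.asList(passage.toLowerCase().split("\\s+")));
            for (String term : seen) {
                if (!term.isEmpty()) {
                    documentFrequency.merge(term, 1, Integer::sum);
                }
            }
        }
        // Keep terms whose document frequency exceeds the threshold (assumed rule).
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            if (entry.getValue() > threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] passages = {"the cat sat", "the dog sat", "a bird"};
        System.out.println(selectTerms(passages, 1)); // [sat, the]
    }
}
```

With a threshold of 1, only "sat" and "the" survive because every other term occurs in a single passage.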
 

Author:
Jorge Villalon

Nested Class Summary
 class Corpus.PassageFreqs
           
 
Constructor Summary
Corpus()
          Constructor for every Corpus.
 
Method Summary
 int getDimensions()
           
 Stats[] getDocStats()
           
 java.lang.String getFilename()
           
 int getIndexOfTerm(java.lang.String term)
          Retrieves the index of the term in the corpus
 java.lang.String getLuceneQuery()
          Returns the string representing the Lucene query used to create the Corpus
 java.lang.String getName()
           
 int getNonzeros()
           
 CorpusParameters getParameters()
           
 Corpus.PassageFreqs[] getPassageFrequencies()
           
 java.lang.String[] getPassages()
           
 int[] getPassagesLuceneIds()
           
 long getProcessingTime()
           
 Repository getRepository()
           
 SemanticSpace getSemanticSpace()
           
 Jama.Matrix getTermDocMatrix()
           
 double[] getTermEntropies()
           
 java.lang.String[] getTerms()
           
 Stats[] getTermStats()
           
 boolean isDbAnnotations()
           
 boolean isProjection()
           
 void load(Repository repository)
          Loads the content of the documents in the query and creates the term-doc matrix
 java.lang.String parametersSummary()
          Prints the parameters used in this corpus to the console
 java.lang.String printFrequencies()
           
 Corpus projectCorpus(Corpus corpusToProject)
          This method projects a Corpus into another one.
 void setDbAnnotations(boolean dbAnnotations)
           
 void setDocStats(Stats[] docStats)
           
 void setName(java.lang.String name)
           
 void setParameters(CorpusParameters parameters)
           
 void setProjection(boolean projection)
           
 void setTermEntropies(double[] termEntropies)
           
 void setTermStats(Stats[] termStats)
           
 java.lang.String toString()
          Returns the name of the Corpus.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Corpus

public Corpus()
Constructor for every Corpus.

Method Detail

isDbAnnotations

public boolean isDbAnnotations()

setDbAnnotations

public void setDbAnnotations(boolean dbAnnotations)

getPassagesLuceneIds

public int[] getPassagesLuceneIds()
Returns:
the passagesLuceneIds

isProjection

public boolean isProjection()
Returns:
the projection

getIndexOfTerm

public int getIndexOfTerm(java.lang.String term)
Retrieves the index of the term in the corpus

Parameters:
term - the term to look up
Returns:
the term index or -1 if not found
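
The return convention above (an index, or -1 when the term is absent) can be shown with a minimal sketch. The class TermIndexSketch, the zero-based indexing, and the linear scan over a String[] are assumptions for illustration; this page does not say how Corpus stores or searches its terms.

```java
// Hypothetical stand-in for a term lookup with a -1 "not found" result.
class TermIndexSketch {
    // terms plays the role of the array returned by getTerms().
    static int indexOfTerm(String[] terms, String term) {
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].equals(term)) {
                return i;
            }
        }
        return -1; // same "not found" convention as getIndexOfTerm
    }

    public static void main(String[] args) {
        String[] terms = {"corpus", "passage", "space"};
        System.out.println(indexOfTerm(terms, "passage")); // 1
        System.out.println(indexOfTerm(terms, "lucene"));  // -1
    }
}
```

Callers should check for -1 before using the result to index into a term array or matrix.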

getFilename

public java.lang.String getFilename()

getTermEntropies

public double[] getTermEntropies()
Returns:
the termEntropies

setTermEntropies

public void setTermEntropies(double[] termEntropies)
Parameters:
termEntropies - the termEntropies to set

getTermStats

public Stats[] getTermStats()
Returns:
the termStats

setTermStats

public void setTermStats(Stats[] termStats)
Parameters:
termStats - the termStats to set

getDocStats

public Stats[] getDocStats()
Returns:
the docStats

setDocStats

public void setDocStats(Stats[] docStats)
Parameters:
docStats - the docStats to set

setProjection

public void setProjection(boolean projection)
Parameters:
projection - the projection to set

getNonzeros

public int getNonzeros()
Returns:
the nonzeros

getLuceneQuery

public java.lang.String getLuceneQuery()
Returns the string representing the Lucene query used to create the Corpus

Returns:
the query used to create the Corpus

getName

public java.lang.String getName()
Returns:
the name of the Corpus

getParameters

public CorpusParameters getParameters()
Returns:
the parameters

getPassageFrequencies

public Corpus.PassageFreqs[] getPassageFrequencies()
Returns:
the passageFrequencies

getPassages

public java.lang.String[] getPassages()
Returns:
the passages

getProcessingTime

public long getProcessingTime()
Returns:
the time it took to load the Corpus

getRepository

public Repository getRepository()
Returns:
the repository

getSemanticSpace

public SemanticSpace getSemanticSpace()
Returns:
the SemanticSpace for the Corpus

getTermDocMatrix

public Jama.Matrix getTermDocMatrix()
Returns:
the raw matrix with the term frequencies for the Corpus
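
As an illustration of what a raw term-frequency matrix holds, the sketch below builds one from plain strings. It uses a double[][] rather than Jama.Matrix, and the class name, the whitespace tokenizer, and the orientation (rows are terms, columns are passages) are assumptions, not a description of the tml implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for building a raw term-doc frequency matrix.
class TermDocMatrixSketch {
    // Rows follow the order of terms; columns follow the order of passages.
    static double[][] build(List<String> terms, String[] passages) {
        double[][] matrix = new double[terms.size()][passages.length];
        for (int doc = 0; doc < passages.length; doc++) {
            for (String token : passages[doc].toLowerCase().split("\\s+")) {
                int row = terms.indexOf(token);
                if (row >= 0) {
                    matrix[row][doc] += 1.0; // one more occurrence of this term
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("the", "cat", "dog");
        String[] passages = {"the cat sat", "the cat and the dog"};
        for (double[] row : build(terms, passages)) {
            System.out.println(Arrays.toString(row));
        }
        // [1.0, 2.0]
        // [1.0, 1.0]
        // [0.0, 1.0]
    }
}
```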

getTerms

public java.lang.String[] getTerms()
Returns:
the terms

load

public void load(Repository repository)
          throws NotEnoughTermsInCorpusException,
                 java.io.IOException,
                 NoDocumentsInCorpusException,
                 TermWeightingException
Loads the content of the documents in the query and creates the term-doc matrix

Parameters:
repository - the repository to search
Throws:
java.io.IOException
NotEnoughTermsInCorpusException
NoDocumentsInCorpusException
TermWeightingException

getDimensions

public int getDimensions()
Returns:
the dimensions

parametersSummary

public java.lang.String parametersSummary()
Prints the parameters used in this corpus to the console


printFrequencies

public java.lang.String printFrequencies()

projectCorpus

public Corpus projectCorpus(Corpus corpusToProject)
                     throws java.lang.Exception
This method projects a Corpus into another one. The Corpus to project is the parameter, and the projected Corpus is what the method returns. The returned Corpus will have the same Dictionary as this Corpus, and will use the same parameters to calculate its SemanticSpace.

Parameters:
corpusToProject - the Corpus to project
Returns:
the projected Corpus
Throws:
java.lang.Exception
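
One way to picture a projected Corpus sharing the Dictionary of this Corpus is to count the terms of new text against a fixed dictionary. The sketch below is only that picture: the class ProjectionSketch, the Map-based dictionary, and the choice to skip terms missing from the dictionary are assumptions, since this page does not describe how projectCorpus treats unknown terms.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for representing new text in an existing dictionary.
class ProjectionSketch {
    // dictionary maps each known term to its row index.
    static double[] project(Map<String, Integer> dictionary, String passage) {
        double[] vector = new double[dictionary.size()];
        for (String token : passage.toLowerCase().split("\\s+")) {
            Integer row = dictionary.get(token);
            if (row != null) {
                vector[row] += 1.0; // known term: count it
            }
            // Terms outside the dictionary are skipped (assumption of this sketch).
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        dictionary.put("corpus", 0);
        dictionary.put("space", 1);
        double[] v = project(dictionary, "A corpus builds a space from a corpus");
        System.out.println(Arrays.toString(v)); // [2.0, 1.0]
    }
}
```

Because the vector length is fixed by the dictionary, passages from different corpora become directly comparable in this representation.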

setName

public void setName(java.lang.String name)
Parameters:
name - the name for the Corpus

setParameters

public void setParameters(CorpusParameters parameters)
Parameters:
parameters - the parameters to set

toString

public java.lang.String toString()
Returns the name of the Corpus.

Overrides:
toString in class java.lang.Object