java.lang.Object
  tml.corpus.Corpus
public abstract class Corpus
A Corpus is a set of TextPassages
that are processed to build a SemanticSpace.
Steps of this process are:
Once the Corpus is loaded, it can create a SemanticSpace
using a particular dimensionality reduction technique. For the moment only
SVD is implemented, but we expect to implement some others.
The following code shows how to load a Corpus and create a
SemanticSpace:
...
corpus.setName("Structure of English"); // A human readable name for the corpus
corpus.setTermSelectionCriteria(TermSelection.MIN_DF); // Every term must have a minimum document frequency
corpus.setTermSelectionThreshold(1); // Terms must appear in at least 2 documents
corpus.load(storage); // Load the corpus from the storage
corpus.createSemanticSpace(); // Create an empty semanticSpace
SemanticSpace space = corpus.getSemanticSpace();
space.setTermWeightScheme(TermWeight.TF); // The term weight scheme will be the raw term frequency
space.setNormalized(true); // The final vectors will be normalized
space.setDimensionalityReduction(DimensionalityReduction.DIMENSIONS_MAX_NUMBER);
space.setDimensionalityReductionThreshold(2); // Number of dimensions to keep on the dimensionality reduction
space.setDimensionsReduced(true); // The dimensions will be reduced
space.calculate(); // Calculate the semantic space
...
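To make the MIN_DF term selection above more concrete, here is a minimal stand-alone sketch of filtering terms by document frequency. It does not use the tml classes: `MinDfSketch`, `documentFrequencies` and `selectTerms` are hypothetical names, the whitespace tokenization is a simplification, and the "strictly greater than the threshold" rule is an assumption inferred from the comment that a threshold of 1 keeps terms appearing in at least 2 documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not part of the tml API.
class MinDfSketch {

    // Counts, for every term, the number of passages in which it occurs.
    static Map<String, Integer> documentFrequencies(List<String> passages) {
        Map<String, Integer> df = new TreeMap<>();
        for (String passage : passages) {
            // Use a set so a term repeated inside one passage is counted once.
            Set<String> unique =
                new HashSet<>(Arrays.asList(passage.toLowerCase().split("\\s+")));
            for (String term : unique) {
                if (!term.isEmpty()) {
                    df.merge(term, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    // Keeps terms whose document frequency is strictly greater than the
    // threshold (assumed semantics, see the lead-in).
    static List<String> selectTerms(List<String> passages, int threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : documentFrequencies(passages).entrySet()) {
            if (e.getValue() > threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> passages = List.of("the cat sat", "the dog sat", "a bird flew");
        System.out.println(selectTerms(passages, 1)); // prints [sat, the]
    }
}
```

With a threshold of 1, only "sat" and "the" survive, because every other term occurs in a single passage.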
| Nested Class Summary | |
|---|---|
| class | Corpus.PassageFreqs |
| Constructor Summary | |
|---|---|
| Corpus() | Constructor for every Corpus. |
| Method Summary | | |
|---|---|---|
| int | getDimensions() | |
| Stats[] | getDocStats() | |
| java.lang.String | getFilename() | |
| int | getIndexOfTerm(java.lang.String term) | Retrieves the index of the term in the corpus |
| java.lang.String | getLuceneQuery() | Returns the string representing the Lucene query used to create the Corpus |
| java.lang.String | getName() | |
| int | getNonzeros() | |
| CorpusParameters | getParameters() | |
| Corpus.PassageFreqs[] | getPassageFrequencies() | |
| java.lang.String[] | getPassages() | |
| int[] | getPassagesLuceneIds() | |
| long | getProcessingTime() | |
| Repository | getRepository() | |
| SemanticSpace | getSemanticSpace() | |
| Jama.Matrix | getTermDocMatrix() | |
| double[] | getTermEntropies() | |
| java.lang.String[] | getTerms() | |
| Stats[] | getTermStats() | |
| boolean | isDbAnnotations() | |
| boolean | isProjection() | |
| void | load(Repository repository) | Loads the content of the documents in the query and creates the term-doc matrix |
| java.lang.String | parametersSummary() | Prints in the console the parameters used in this corpus |
| java.lang.String | printFrequencies() | |
| Corpus | projectCorpus(Corpus corpusToProject) | This method projects a Corpus into another one. |
| void | setDbAnnotations(boolean dbAnnotations) | |
| void | setDocStats(Stats[] docStats) | |
| void | setName(java.lang.String name) | |
| void | setParameters(CorpusParameters parameters) | |
| void | setProjection(boolean projection) | |
| void | setTermEntropies(double[] termEntropies) | |
| void | setTermStats(Stats[] termStats) | |
| java.lang.String | toString() | Returns the name of the Corpus. |
| Methods inherited from class java.lang.Object |
|---|
| equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Constructor Detail |
|---|
public Corpus()
Constructor for every Corpus.
document - the TextDocument to which the belongs
| Method Detail |
|---|
public boolean isDbAnnotations()
public void setDbAnnotations(boolean dbAnnotations)
public int[] getPassagesLuceneIds()
public boolean isProjection()
public int getIndexOfTerm(java.lang.String term)
term -
public java.lang.String getFilename()
public double[] getTermEntropies()
public void setTermEntropies(double[] termEntropies)
termEntropies - the termEntropies to set
public Stats[] getTermStats()
public void setTermStats(Stats[] termStats)
termStats - the termStats to set
public Stats[] getDocStats()
public void setDocStats(Stats[] docStats)
docStats - the docStats to set
public void setProjection(boolean projection)
projection - the projection to set
public int getNonzeros()
public java.lang.String getLuceneQuery()
Returns the string representing the Lucene query used to create the Corpus
Corpus
public java.lang.String getName()
Corpus
public CorpusParameters getParameters()
public Corpus.PassageFreqs[] getPassageFrequencies()
public java.lang.String[] getPassages()
public long getProcessingTime()
Corpus
public Repository getRepository()
public SemanticSpace getSemanticSpace()
SemanticSpace for the Corpus
public Jama.Matrix getTermDocMatrix()
Corpus
public java.lang.String[] getTerms()
public void load(Repository repository)
throws NotEnoughTermsInCorpusException,
java.io.IOException,
NoDocumentsInCorpusException,
TermWeightingException
storage - the repository to search
java.io.IOException
NotEnoughTermsInCorpusException
NoDocumentsInCorpusException
TermWeightingException
public int getDimensions()
public java.lang.String parametersSummary()
public java.lang.String printFrequencies()
public Corpus projectCorpus(Corpus corpusToProject)
throws java.lang.Exception
This method projects a Corpus into another one. The Corpus
to project is the parameter, and the projected Corpus is what the
method returns.
The returned Corpus will have the same Dictionary as
this Corpus, and will use the same parameters to calculate its
SemanticSpace.
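As an illustration of what sharing a Dictionary can mean in practice, the following stand-alone sketch counts the terms of new passages against a fixed term list, so that the resulting matrix has the same rows as the original corpus. It is only a sketch under assumptions: `ProjectionSketch`, `indexTerms` and `project` are hypothetical names, the whitespace tokenization is a simplification, and how tml actually performs the projection is not stated in this page.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the tml API.
class ProjectionSketch {

    // Maps each term of the original corpus to its row index.
    static Map<String, Integer> indexTerms(String[] terms) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            index.put(terms[i], i);
        }
        return index;
    }

    // Builds a term-by-passage count matrix whose rows are the ORIGINAL terms.
    static int[][] project(String[] terms, List<String> passages) {
        Map<String, Integer> index = indexTerms(terms);
        int[][] counts = new int[terms.length][passages.size()];
        for (int col = 0; col < passages.size(); col++) {
            for (String token : passages.get(col).toLowerCase().split("\\s+")) {
                Integer row = index.get(token);
                if (row != null) {
                    counts[row][col]++; // tokens outside the dictionary are ignored
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] terms = {"cat", "dog", "sat"};
        int[][] m = project(terms, List.of("the cat sat", "dog dog ran"));
        System.out.println(Arrays.deepToString(m)); // prints [[1, 0], [0, 2], [1, 0]]
    }
}
```

Tokens such as "the" and "ran" do not appear in the original term list, so they contribute nothing to the projected matrix.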
corpusToProject - the Corpus to project
Corpus
java.lang.Exception
public void setName(java.lang.String name)
name - the name for the Corpus
public void setParameters(CorpusParameters parameters)
parameters - the parameters to set
public java.lang.String toString()
Returns the name of the Corpus.
toString in class java.lang.Object