java.lang.Object
  tml.corpus.Corpus
public abstract class Corpus
A Corpus is a set of TextPassages that are processed to build a SemanticSpace.
Steps of this process are:
Once the Corpus is loaded, it can create a SemanticSpace using a particular dimensionality reduction technique. For the moment only SVD is implemented, but we expect to implement some others.
The following code shows how to load a Corpus and create a SemanticSpace:
...
corpus.setName("Structure of English"); // A human readable name for the corpus
corpus.setTermSelectionCriteria(TermSelection.MIN_DF); // Every term must have a minimum document frequency
corpus.setTermSelectionThreshold(1); // Terms must appear in at least 2 documents
corpus.load(storage); // Load the corpus from the storage
corpus.createSemanticSpace(); // Create an empty SemanticSpace
SemanticSpace space = corpus.getSemanticSpace();
space.setTermWeightScheme(TermWeight.TF); // The term weight scheme will be the raw term frequency
space.setNormalized(true); // The final vectors will be normalized
space.setDimensionalityReduction(DimensionalityReduction.DIMENSIONS_MAX_NUMBER);
space.setDimensionalityReductionThreshold(2); // Number of dimensions to keep on the dimensionality reduction
space.setDimensionsReduced(true); // The dimensions will be reduced
space.calculate(); // Calculate the semantic space
...
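To make the term selection and weighting settings in the example more concrete, here is a minimal, self-contained sketch of what a minimum document frequency filter and normalized raw term frequency vectors could look like. It does not use TML and is not its implementation: the class `MinDfSketch` and its methods are hypothetical, and the "strictly more than the threshold" rule is an assumption taken from the comment above (a threshold of 1 keeps terms that appear in at least 2 documents).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; MinDfSketch is a hypothetical class, not part of TML. */
final class MinDfSketch {

    /** Keeps the terms found in more than {@code threshold} passages, in sorted order. */
    static List<String> selectTerms(String[][] passages, int threshold) {
        Map<String, Integer> documentFrequency = new TreeMap<>();
        for (String[] passage : passages) {
            // The set makes each term count at most once per passage.
            for (String term : new HashSet<String>(Arrays.asList(passage))) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            // Assumption: the threshold is exclusive, as the example's comment suggests.
            if (entry.getValue() > threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    /** Builds a raw term frequency vector over {@code terms} and scales it to unit length. */
    static double[] normalizedTf(String[] passage, List<String> terms) {
        double[] vector = new double[terms.size()];
        for (String term : passage) {
            int index = terms.indexOf(term);
            if (index >= 0) {
                vector[index] += 1.0; // terms that were filtered out are ignored
            }
        }
        double squaredSum = 0.0;
        for (double value : vector) {
            squaredSum += value * value;
        }
        double norm = Math.sqrt(squaredSum);
        if (norm > 0.0) {
            for (int i = 0; i < vector.length; i++) {
                vector[i] /= norm;
            }
        }
        return vector;
    }
}
```

With three toy passages, `selectTerms(passages, 1)` drops every term that occurs in a single passage, and `normalizedTf` then yields a vector of Euclidean length 1 for any passage that contains at least one kept term.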
Nested Class Summary | |
---|---|
class | Corpus.PassageFreqs |
Constructor Summary | |
---|---|
Corpus() | Constructor for every Corpus. |
Method Summary | |
---|---|
int | getDimensions() |
Stats[] | getDocStats() |
java.lang.String | getFilename() |
int | getIndexOfTerm(java.lang.String term): Retrieves the index of the term in the corpus |
java.lang.String | getLuceneQuery(): Returns the string representing the Lucene query used to create the Corpus |
java.lang.String | getName() |
int | getNonzeros() |
CorpusParameters | getParameters() |
Corpus.PassageFreqs[] | getPassageFrequencies() |
java.lang.String[] | getPassages() |
int[] | getPassagesLuceneIds() |
long | getProcessingTime() |
Repository | getRepository() |
SemanticSpace | getSemanticSpace() |
Jama.Matrix | getTermDocMatrix() |
double[] | getTermEntropies() |
java.lang.String[] | getTerms() |
Stats[] | getTermStats() |
boolean | isDbAnnotations() |
boolean | isProjection() |
void | load(Repository repository): Loads the content of the documents in the query and creates the term-doc matrix |
java.lang.String | parametersSummary(): Prints in the console the parameters used in this corpus |
java.lang.String | printFrequencies() |
Corpus | projectCorpus(Corpus corpusToProject): This method projects a Corpus into another one. |
void | setDbAnnotations(boolean dbAnnotations) |
void | setDocStats(Stats[] docStats) |
void | setName(java.lang.String name) |
void | setParameters(CorpusParameters parameters) |
void | setProjection(boolean projection) |
void | setTermEntropies(double[] termEntropies) |
void | setTermStats(Stats[] termStats) |
java.lang.String | toString(): Returns the name of the Corpus. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
public Corpus()
Constructor for every Corpus.
Parameters:
document - the TextDocument to which the belongs
Method Detail |
---|
public boolean isDbAnnotations()
public void setDbAnnotations(boolean dbAnnotations)
public int[] getPassagesLuceneIds()
public boolean isProjection()
public int getIndexOfTerm(java.lang.String term)
Retrieves the index of the term in the corpus.
Parameters:
term - the term whose index is retrieved
public java.lang.String getFilename()
public double[] getTermEntropies()
public void setTermEntropies(double[] termEntropies)
Parameters:
termEntropies - the termEntropies to set
public Stats[] getTermStats()
public void setTermStats(Stats[] termStats)
Parameters:
termStats - the termStats to set
public Stats[] getDocStats()
public void setDocStats(Stats[] docStats)
Parameters:
docStats - the docStats to set
public void setProjection(boolean projection)
Parameters:
projection - the projection to set
public int getNonzeros()
public java.lang.String getLuceneQuery()
Returns the string representing the Lucene query used to create the Corpus.
public java.lang.String getName()
Returns the name of the Corpus.
public CorpusParameters getParameters()
public Corpus.PassageFreqs[] getPassageFrequencies()
public java.lang.String[] getPassages()
public long getProcessingTime()
Corpus
public Repository getRepository()
public SemanticSpace getSemanticSpace()
Returns the SemanticSpace for the Corpus.
public Jama.Matrix getTermDocMatrix()
Corpus
public java.lang.String[] getTerms()
public void load(Repository repository) throws NotEnoughTermsInCorpusException, java.io.IOException, NoDocumentsInCorpusException, TermWeightingException
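Loads the content of the documents in the query and creates the term-doc matrix.
Parameters:
repository - the repository to search
Throws:
java.io.IOException
NotEnoughTermsInCorpusException
NoDocumentsInCorpusException
TermWeightingException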
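As a rough illustration of what a term-doc matrix holds, and of what accessors such as getIndexOfTerm and getNonzeros might report, the following sketch builds one from already tokenized passages. Everything here is hypothetical and independent of TML: the class `TermDocMatrixSketch`, the rows-are-terms orientation, the use of plain arrays instead of Jama.Matrix, and the choice of -1 for an unknown term are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; TermDocMatrixSketch is a hypothetical class, not part of TML. */
final class TermDocMatrixSketch {

    /** Maps each term to its row, in order of first appearance. */
    final Map<String, Integer> termIndex = new LinkedHashMap<>();

    /** Raw counts: one row per term, one column per passage (assumed orientation). */
    final double[][] matrix;

    TermDocMatrixSketch(String[][] passages) {
        // First pass: assign a row to every distinct term.
        for (String[] passage : passages) {
            for (String term : passage) {
                termIndex.putIfAbsent(term, termIndex.size());
            }
        }
        // Second pass: count occurrences of each term in each passage.
        matrix = new double[termIndex.size()][passages.length];
        for (int doc = 0; doc < passages.length; doc++) {
            for (String term : passages[doc]) {
                matrix[termIndex.get(term)][doc] += 1.0;
            }
        }
    }

    /** Returns the row of the term, or -1 if the term is unknown (an assumption). */
    int indexOfTerm(String term) {
        Integer index = termIndex.get(term);
        return index == null ? -1 : index;
    }

    /** Counts the matrix cells that are not zero. */
    int nonzeros() {
        int count = 0;
        for (double[] row : matrix) {
            for (double value : row) {
                if (value != 0.0) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Term-document matrices are typically sparse, which is why the number of nonzero cells is a useful size measure alongside the number of terms and passages.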
public int getDimensions()
public java.lang.String parametersSummary()
public java.lang.String printFrequencies()
public Corpus projectCorpus(Corpus corpusToProject) throws java.lang.Exception
This method projects a Corpus into another one. The Corpus to project is the parameter, and the projected Corpus is what the method returns.
The returned Corpus will have the same Dictionary as this Corpus, and will use the same parameters to calculate its SemanticSpace.
Parameters:
corpusToProject - the Corpus to project
Returns:
the projected Corpus
Throws:
java.lang.Exception
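One way to picture "the same Dictionary" is that the passages to project are counted against an existing, fixed list of terms, so the resulting matrix has the same rows as the original one and any unseen terms are dropped. The sketch below shows only that idea. It is an assumption about the general technique, not a description of how projectCorpus is implemented, and `ProjectionSketch` and its `project` method are hypothetical names.

```java
import java.util.List;

/** Illustrative sketch only; ProjectionSketch is a hypothetical class, not part of TML. */
final class ProjectionSketch {

    /**
     * Counts the passages to project against a fixed dictionary.
     * Row i of the result corresponds to dictionary.get(i); terms that are
     * absent from the dictionary are ignored.
     */
    static double[][] project(List<String> dictionary, String[][] passagesToProject) {
        double[][] matrix = new double[dictionary.size()][passagesToProject.length];
        for (int doc = 0; doc < passagesToProject.length; doc++) {
            for (String term : passagesToProject[doc]) {
                int row = dictionary.indexOf(term);
                if (row >= 0) {
                    matrix[row][doc] += 1.0;
                }
            }
        }
        return matrix;
    }
}
```

Because the row layout is fixed by the dictionary, vectors produced this way are directly comparable with the vectors of the original passages.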
public void setName(java.lang.String name)
Parameters:
name - the name for the Corpus
public void setParameters(CorpusParameters parameters)
Parameters:
parameters - the parameters to set
public java.lang.String toString()
Returns the name of the Corpus.
Overrides:
toString in class java.lang.Object