tml.corpus
Class SimpleCorpus
java.lang.Object
tml.corpus.SimpleCorpus
public class SimpleCorpus
- extends java.lang.Object
SimpleCorpus is a simple corpus that contains the set of documents found in a
folder and treats each document as a vector. It automatically loads the
documents and creates a weighted term/document matrix.
You can change the parameters for term loading by accessing the internal
corpus (see getCorpus()). See Corpus for more details.
- Author:
- Jorge Villalon
- See Also:
Corpus
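To illustrate what "each document as a vector" means, the following is a minimal, self-contained sketch that builds a term/document count matrix from plain strings. It is not TML code: the class TermDocumentSketch, its methods, and the naive tokenizer are hypothetical, and the sketch assumes rows are terms and columns are documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

/** Illustrative stand-in only; it does not use or reproduce tml.corpus.SimpleCorpus. */
final class TermDocumentSketch {

    /** Lower-cases the text and splits on anything that is not a-z (a naive tokenizer). */
    static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens.toArray(new String[0]);
    }

    /** Returns the alphabetically sorted vocabulary of all documents. */
    static String[] terms(String[] documents) {
        TreeSet<String> vocabulary = new TreeSet<>();
        for (String doc : documents) {
            vocabulary.addAll(Arrays.asList(tokenize(doc)));
        }
        return vocabulary.toArray(new String[0]);
    }

    /** Assumed layout: matrix[i][j] is the number of times terms[i] occurs in documents[j]. */
    static double[][] countMatrix(String[] terms, String[] documents) {
        double[][] matrix = new double[terms.length][documents.length];
        for (int j = 0; j < documents.length; j++) {
            for (String t : tokenize(documents[j])) {
                int i = Arrays.binarySearch(terms, t); // terms must be sorted
                if (i >= 0) {
                    matrix[i][j] += 1.0;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        String[] docs = { "the cat sat", "the cat saw the dog" };
        String[] terms = terms(docs);
        double[][] matrix = countMatrix(terms, docs);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " " + Arrays.toString(matrix[i]));
        }
    }
}
```

Each column of the resulting matrix is one document vector. The weighting that SimpleCorpus applies on top of such counts is configured through the internal Corpus and is not shown here.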
Constructor Summary
SimpleCorpus(java.lang.String pathToDocuments,
java.lang.String pathToRepository)
SimpleCorpus(java.lang.String pathToDocuments,
java.lang.String pathToRepository,
boolean load)
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
SimpleCorpus
public SimpleCorpus(java.lang.String pathToDocuments,
java.lang.String pathToRepository)
throws org.apache.lucene.index.CorruptIndexException,
org.apache.lucene.store.LockObtainFailedException,
java.io.IOException,
org.apache.lucene.queryParser.ParseException,
NotEnoughTermsInCorpusException,
NoDocumentsInCorpusException,
TermWeightingException,
java.sql.SQLException
- Parameters:
pathToDocuments - the folder from which the documents are read
pathToRepository - the folder where the Lucene index is stored
- Throws:
java.io.IOException
org.apache.lucene.store.LockObtainFailedException
org.apache.lucene.index.CorruptIndexException
org.apache.lucene.queryParser.ParseException
NoDocumentsInCorpusException
NotEnoughTermsInCorpusException
NormalizationException
TermWeightingException
java.sql.SQLException
SimpleCorpus
public SimpleCorpus(java.lang.String pathToDocuments,
java.lang.String pathToRepository,
boolean load)
throws org.apache.lucene.index.CorruptIndexException,
org.apache.lucene.store.LockObtainFailedException,
java.io.IOException,
org.apache.lucene.queryParser.ParseException,
NotEnoughTermsInCorpusException,
NoDocumentsInCorpusException,
TermWeightingException,
java.sql.SQLException
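- Parameters:
pathToDocuments - the folder from which the documents are read
pathToRepository - the folder where the Lucene index is stored
load - whether the corpus is loaded automatically on construction (see load())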
- Throws:
java.io.IOException
org.apache.lucene.store.LockObtainFailedException
org.apache.lucene.index.CorruptIndexException
org.apache.lucene.queryParser.ParseException
NoDocumentsInCorpusException
NotEnoughTermsInCorpusException
NormalizationException
TermWeightingException
java.sql.SQLException
getCorpus
public Corpus getCorpus()
- Returns:
- the internal corpus
getDocuments
public java.lang.String[] getDocuments()
- Returns:
- the list of documents in the corpus
getMatrix
public double[][] getMatrix()
- Returns:
- a two-dimensional array of doubles holding the weighted term/document matrix
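A common use of such a matrix is comparing documents. The sketch below computes the cosine similarity between two document vectors of a double[][]. It is a hypothetical helper, not part of TML, and it assumes that rows are terms and columns are documents; this page does not state the orientation, so swap the indices if the matrix is laid out the other way.

```java
/** Illustrative helper only; DocumentSimilaritySketch is not a TML class. */
final class DocumentSimilaritySketch {

    /**
     * Cosine similarity between columns docA and docB.
     * Assumes matrix[term][document]; returns 0.0 if either column is all zeros.
     */
    static double cosine(double[][] matrix, int docA, int docB) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (double[] termRow : matrix) {
            dot += termRow[docA] * termRow[docB];
            normA += termRow[docA] * termRow[docA];
            normB += termRow[docB] * termRow[docB];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // similarity is undefined for an empty vector; 0.0 is this sketch's choice
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With a real corpus, the first argument would be the array returned by getMatrix(), and the document indices would presumably correspond to positions in getDocuments().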
getPathToDocuments
public java.lang.String getPathToDocuments()
- Returns:
- the folder from which the documents were processed
getPathToRepository
public java.lang.String getPathToRepository()
- Returns:
- the folder where the Lucene index is stored
getTerms
public java.lang.String[] getTerms()
- Returns:
- the list of terms in the corpus
load
public void load()
throws NotEnoughTermsInCorpusException,
java.io.IOException,
NoDocumentsInCorpusException,
TermWeightingException
- Loads the corpus (if not loaded automatically).
- Throws:
NotEnoughTermsInCorpusException
java.io.IOException
NoDocumentsInCorpusException
NormalizationException
TermWeightingException
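The relationship between the boolean load constructor argument and load() can be pictured as a deferred-loading pattern. The sketch below is a hypothetical stand-in, not the SimpleCorpus implementation: the class name, the isLoaded() method, the placeholder matrix, and the decision to throw when the matrix is read before loading are all assumptions made for illustration.

```java
/** Hypothetical stand-in showing deferred loading; it is not tml.corpus.SimpleCorpus. */
final class DeferredLoadSketch {
    private final String[] documents;
    private double[][] matrix; // stays null until load() has run

    DeferredLoadSketch(String[] documents, boolean load) {
        this.documents = documents.clone();
        if (load) {
            load(); // load immediately, as the two-argument constructor is assumed to do
        }
    }

    /** Builds a placeholder 1 x N matrix whose cells are the document lengths. */
    void load() {
        double[][] m = new double[1][documents.length];
        for (int j = 0; j < documents.length; j++) {
            m[0][j] = documents[j].length();
        }
        matrix = m;
    }

    boolean isLoaded() {
        return matrix != null;
    }

    /** Failing fast before load() is a choice of this sketch, not documented TML behaviour. */
    double[][] getMatrix() {
        if (matrix == null) {
            throw new IllegalStateException("call load() first");
        }
        return matrix;
    }
}
```

Deferring the load in this way leaves room to adjust settings between construction and loading, which matches the note above about changing term-loading parameters through the internal corpus.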
loadTfIdfNormalised
public void loadTfIdfNormalised()
throws NotEnoughTermsInCorpusException,
java.io.IOException,
NoDocumentsInCorpusException,
TermWeightingException
- Throws:
NotEnoughTermsInCorpusException
java.io.IOException
NoDocumentsInCorpusException
TermWeightingException
NormalizationException
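This page does not describe loadTfIdfNormalised, but its name suggests TF-IDF weighting followed by normalisation. As a rough illustration of that idea only, the sketch below applies a textbook variant (raw term frequency times ln(N / df), then scaling each document column to unit Euclidean length). The class and method are hypothetical, the matrix is assumed to be laid out as [term][document], and the formula TML actually uses may differ.

```java
import java.util.Arrays;

/** Textbook TF-IDF with unit-length document vectors; not taken from TML. */
final class TfIdfSketch {

    /** counts[i][j] is the raw frequency of term i in document j (assumed layout). */
    static double[][] tfIdfNormalised(double[][] counts) {
        int nTerms = counts.length;
        int nDocs = nTerms == 0 ? 0 : counts[0].length;
        double[][] weights = new double[nTerms][nDocs];

        // TF-IDF: weight = tf * ln(N / df), where df is the number of documents containing the term.
        for (int i = 0; i < nTerms; i++) {
            int df = 0;
            for (int j = 0; j < nDocs; j++) {
                if (counts[i][j] > 0.0) {
                    df++;
                }
            }
            double idf = df == 0 ? 0.0 : Math.log((double) nDocs / df);
            for (int j = 0; j < nDocs; j++) {
                weights[i][j] = counts[i][j] * idf;
            }
        }

        // Normalisation: scale each document column to Euclidean length 1 (all-zero columns are left as is).
        for (int j = 0; j < nDocs; j++) {
            double sumSquares = 0.0;
            for (int i = 0; i < nTerms; i++) {
                sumSquares += weights[i][j] * weights[i][j];
            }
            double norm = Math.sqrt(sumSquares);
            if (norm > 0.0) {
                for (int i = 0; i < nTerms; i++) {
                    weights[i][j] /= norm;
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        double[][] counts = { { 1, 1 }, { 2, 0 }, { 0, 3 } };
        for (double[] row : tfIdfNormalised(counts)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Note that a term occurring in every document gets an IDF of zero under this variant, so it contributes nothing to any document vector.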