tml.corpus
Class SimpleCorpus

java.lang.Object
  extended by tml.corpus.SimpleCorpus

public class SimpleCorpus
extends java.lang.Object

SimpleCorpus is a simple corpus that contains the set of documents found in a folder and treats each document as a vector. It automatically loads the documents and creates a weighted term/document matrix. You can change the term-loading parameters by accessing the internal corpus (see getCorpus()). See Corpus for more details.

Author:
Jorge Villalon
See Also:
Corpus

Constructor Summary
SimpleCorpus(java.lang.String pathToDocuments, java.lang.String pathToRepository)
           
SimpleCorpus(java.lang.String pathToDocuments, java.lang.String pathToRepository, boolean load)
           
 
Method Summary
 Corpus getCorpus()
           
 java.lang.String[] getDocuments()
           
 double[][] getMatrix()
           
 java.lang.String getPathToDocuments()
           
 java.lang.String getPathToRepository()
           
 java.lang.String[] getTerms()
           
 void load()
          Loads the corpus (if not loaded automatically).
 void loadTfIdfNormalised()
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SimpleCorpus

public SimpleCorpus(java.lang.String pathToDocuments,
                    java.lang.String pathToRepository)
             throws org.apache.lucene.index.CorruptIndexException,
                    org.apache.lucene.store.LockObtainFailedException,
                    java.io.IOException,
                    org.apache.lucene.queryParser.ParseException,
                    NotEnoughTermsInCorpusException,
                    NoDocumentsInCorpusException,
                    TermWeightingException,
                    java.sql.SQLException
Parameters:
pathToDocuments - the folder from which the documents are processed
pathToRepository - the folder where the Lucene index is stored
Throws:
java.io.IOException
org.apache.lucene.store.LockObtainFailedException
org.apache.lucene.index.CorruptIndexException
org.apache.lucene.queryParser.ParseException
NoDocumentsInCorpusException
NotEnoughTermsInCorpusException
NormalizationException
TermWeightingException
java.sql.SQLException

SimpleCorpus

public SimpleCorpus(java.lang.String pathToDocuments,
                    java.lang.String pathToRepository,
                    boolean load)
             throws org.apache.lucene.index.CorruptIndexException,
                    org.apache.lucene.store.LockObtainFailedException,
                    java.io.IOException,
                    org.apache.lucene.queryParser.ParseException,
                    NotEnoughTermsInCorpusException,
                    NoDocumentsInCorpusException,
                    TermWeightingException,
                    java.sql.SQLException
Parameters:
pathToDocuments - the folder from which the documents are processed
pathToRepository - the folder where the Lucene index is stored
load - whether the corpus is loaded automatically on construction; if it is not, call load() afterwards
Throws:
java.io.IOException
org.apache.lucene.store.LockObtainFailedException
org.apache.lucene.index.CorruptIndexException
org.apache.lucene.queryParser.ParseException
NoDocumentsInCorpusException
NotEnoughTermsInCorpusException
NormalizationException
TermWeightingException
java.sql.SQLException
Method Detail

getCorpus

public Corpus getCorpus()
Returns:
the internal corpus

getDocuments

public java.lang.String[] getDocuments()
Returns:
the list of documents in the corpus

getMatrix

public double[][] getMatrix()
Returns:
a two-dimensional array of doubles holding the weighted term/document matrix
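
Because each document is treated as a vector, the returned matrix can be used for vector-space comparisons. The following is a minimal sketch and not part of the tml API: the class CosineSketch and its methods are hypothetical, the matrix is a hand-written stand-in for the value returned by getMatrix(), and the matrix[term][document] layout is an assumption that this page does not confirm.

```java
final class CosineSketch {

    /** Copies one document column out of a matrix assumed to be laid out as matrix[term][document]. */
    static double[] documentVector(double[][] matrix, int doc) {
        double[] v = new double[matrix.length];
        for (int t = 0; t < matrix.length; t++) {
            v[t] = matrix[t][doc];
        }
        return v;
    }

    /** Cosine similarity of two equal-length vectors; returns 0.0 if either vector has zero norm. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for simpleCorpus.getMatrix(): 3 terms x 3 documents.
        double[][] matrix = {
            {1.0, 2.0, 0.0},
            {0.0, 0.0, 3.0},
            {2.0, 4.0, 0.0}
        };
        double[] d0 = documentVector(matrix, 0);
        double[] d1 = documentVector(matrix, 1);
        double[] d2 = documentVector(matrix, 2);
        // doc0 and doc1 point in the same direction; doc0 and doc2 share no terms.
        System.out.println("doc0 vs doc1: " + cosine(d0, d1));
        System.out.println("doc0 vs doc2: " + cosine(d0, d2));
    }
}
```

If the actual layout is matrix[document][term], the rows can be passed to cosine directly and documentVector is not needed.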

getPathToDocuments

public java.lang.String getPathToDocuments()
Returns:
the folder from which the documents were processed

getPathToRepository

public java.lang.String getPathToRepository()
Returns:
the folder where the Lucene index is stored

getTerms

public java.lang.String[] getTerms()
Returns:
the list of terms in the corpus
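
The term list is most useful when read alongside the weighted matrix. The sketch below is illustrative only: TopTermSketch and topTerm are hypothetical names, the arrays are hand-written stand-ins for the values returned by getTerms() and getMatrix(), and it assumes that row t of the matrix corresponds to terms[t], which this page does not state.

```java
final class TopTermSketch {

    /**
     * Returns the term with the largest weight in the given document,
     * assuming matrix[t][doc] is the weight of terms[t] in document doc.
     */
    static String topTerm(String[] terms, double[][] matrix, int doc) {
        if (terms.length == 0 || terms.length != matrix.length) {
            throw new IllegalArgumentException("terms and matrix rows must match and be non-empty");
        }
        int best = 0;
        for (int t = 1; t < terms.length; t++) {
            if (matrix[t][doc] > matrix[best][doc]) {
                best = t;
            }
        }
        return terms[best];
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for getTerms() and getMatrix().
        String[] terms = {"corpus", "lucene", "vector"};
        double[][] matrix = {
            {0.1, 0.7},
            {0.9, 0.2},
            {0.3, 0.5}
        };
        System.out.println(topTerm(terms, matrix, 0)); // prints "lucene"
        System.out.println(topTerm(terms, matrix, 1)); // prints "corpus"
    }
}
```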

load

public void load()
          throws NotEnoughTermsInCorpusException,
                 java.io.IOException,
                 NoDocumentsInCorpusException,
                 TermWeightingException
Loads the corpus (if not loaded automatically).

Throws:
NotEnoughTermsInCorpusException
java.io.IOException
NoDocumentsInCorpusException
NormalizationException
TermWeightingException

loadTfIdfNormalised

public void loadTfIdfNormalised()
                         throws NotEnoughTermsInCorpusException,
                                java.io.IOException,
                                NoDocumentsInCorpusException,
                                TermWeightingException
Throws:
NotEnoughTermsInCorpusException
java.io.IOException
NoDocumentsInCorpusException
TermWeightingException
NormalizationException
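
This page does not document the weighting that loadTfIdfNormalised() applies. As a rough illustration of what a normalised TF-IDF weighting usually means, the sketch below uses a common textbook variant (raw term frequency times log(N/df), followed by L2 normalisation of each document vector). It is not taken from the tml source; TfIdfSketch, tfIdfNormalised, and the matrix[term][document] layout are all assumptions.

```java
final class TfIdfSketch {

    /**
     * Textbook TF-IDF with per-document L2 normalisation.
     * counts[t][d] is the raw frequency of term t in document d (assumed layout).
     */
    static double[][] tfIdfNormalised(double[][] counts) {
        int terms = counts.length;
        int docs = terms == 0 ? 0 : counts[0].length;
        double[][] w = new double[terms][docs];

        // Weight each term by its inverse document frequency.
        for (int t = 0; t < terms; t++) {
            int df = 0;
            for (int d = 0; d < docs; d++) {
                if (counts[t][d] > 0) {
                    df++;
                }
            }
            double idf = df == 0 ? 0.0 : Math.log((double) docs / df);
            for (int d = 0; d < docs; d++) {
                w[t][d] = counts[t][d] * idf;
            }
        }

        // Scale each document column to unit Euclidean length (all-zero columns are left as is).
        for (int d = 0; d < docs; d++) {
            double sumSquares = 0.0;
            for (int t = 0; t < terms; t++) {
                sumSquares += w[t][d] * w[t][d];
            }
            double norm = Math.sqrt(sumSquares);
            if (norm > 0.0) {
                for (int t = 0; t < terms; t++) {
                    w[t][d] /= norm;
                }
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Hypothetical raw counts: 3 terms x 2 documents.
        double[][] counts = {
            {2.0, 0.0},
            {1.0, 1.0},
            {0.0, 3.0}
        };
        double[][] w = tfIdfNormalised(counts);
        for (double[] row : w) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

In this variant, a term that appears in every document (the second row above) receives an IDF of zero and therefore drops out of every document vector.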