tml.corpus
Class TextDocument

java.lang.Object
  extended by tml.corpus.TextDocument

public class TextDocument
extends java.lang.Object

The TextDocument class represents a whole document, which comprises content, a title and a URL. Each document is identified by an id known as the externalId. It also has an internal id, the Lucene id, which identifies the document within the underlying Lucene index.

A TextDocument contains two corpora: a sentence-based Corpus and a paragraph-based Corpus. The TextDocument is responsible for loading both and for assigning the parameters needed to create them. This means that the construction of the Corpus and the SemanticSpace is defined on a per-document basis.

The TextDocument keeps a duplicate of its content, which can cause scalability problems with long documents (more than 2000 terms, approx. 10000 words).
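
As a rough guard against this scalability issue, the content can be checked before the corpora are loaded. The sketch below is not part of tml: the class ContentSizeCheck, its methods and the tokenization rule are assumptions made for illustration, and tml's own term extraction will most likely count terms differently. Only the threshold of 2000 terms is taken from the note above.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ContentSizeCheck {
    // Threshold taken from the scalability note (about 2000 terms).
    static final int MAX_TERMS = 2000;

    /**
     * Counts distinct lower-cased tokens made of letters and digits.
     * This is only an approximation of a term count (assumption, not tml's algorithm).
     */
    static int countDistinctTerms(String content) {
        Set<String> terms = new HashSet<>();
        for (String token : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms.size();
    }

    /** Returns true when the approximate term count exceeds the threshold. */
    static boolean isLikelyTooLong(String content) {
        return countDistinctTerms(content) > MAX_TERMS;
    }

    public static void main(String[] args) {
        String content = "The cat sat. The dog sat too.";
        System.out.println(countDistinctTerms(content) + " distinct terms, too long: "
                + isLikelyTooLong(content));
    }
}
```

With a real document the check would be applied to the string returned by getContent().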

The most basic way to use a TextDocument is to perform operations on its corpora, for example calculating semantic distances between sentences or extracting the most important paragraphs (based on variance).

The following example shows how to obtain a TextDocument from a Repository and then how to extract the key sentences.

 Repository repository = new Repository("path/to/repository");
 TextDocument document = repository.getTextDocument("foo");
 if (document != null) {
        System.out.println("Document " + document.getTitle() + " found");
 }
 

Next, we set the parameters that control how the document's corpora are created, and then load the corpora.

 document.setTermSelection(TermSelection.MIN_DF);
 document.setTermSelectionThreshold(1);
 document.setTermLocalWeight(LocalWeight.TF);
 document.setTermGlobalWeight(GlobalWeight.Idf);
 document.setDimensionalityReduction(DimensionalityReduction.DIMENSIONS_MAX_PERCENTAGE);
 document.setDimensionalityReductionThreshold(50);
 document.setDimensionsReduced(true);
 document.setNormalized(true);
 document.load(repository);
 

Finally, we can run an operation on the sentence corpus and print the results.

 KeyTextPassages operation = new KeyTextPassages();
 operation.setCorpus(document.getSentenceCorpus());
 operation.start();
 
 for (KeyTextPassagesResult result : operation.getResults()) {
        System.out.println("Sentence id: " + result.getTextPassageId()
                        + " from eigenvector:" + result.getEigenVectorIndex()
                        + " with load:" + result.getLoad() + " content:"
                        + result.getTextPassageContent());
 }
 

Author:
Jorge Villalon
See Also:
AbstractOperation, Corpus

Constructor Summary
TextDocument(int luceneId, java.lang.String title, java.lang.String url, java.lang.String externalId, java.lang.String content)
          Constructor of TextDocument.
 
Method Summary
 java.lang.String getContent()
          Gets the content of the document
 java.lang.String getExternalId()
          Gets the external id used when the document was inserted.
 int getLuceneId()
          Gets the Lucene internal id of the document
 ParagraphCorpus getParagraphCorpus()
          Gets the ParagraphCorpus created with the paragraphs of the TextDocument
 CorpusParameters getParameters()
          Gets the parameters used to create the corpora of the TextDocument
 SentenceCorpus getSentenceCorpus()
          Gets the SentenceCorpus created with the sentences of the TextDocument
 java.lang.String getTitle()
          Gets the title of the document
 java.lang.String getUrl()
          Gets the url of the document
 void load(Repository repository)
          Loads the corpora for the TextDocument with all the parameters that the document has set.
 void setParameters(CorpusParameters parameters)
          Sets the parameters used to create the corpora of the TextDocument
 java.lang.String toString()
          The default view of a TextDocument is its title
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TextDocument

public TextDocument(int luceneId,
                    java.lang.String title,
                    java.lang.String url,
                    java.lang.String externalId,
                    java.lang.String content)
Constructor of TextDocument. It creates a new instance of a TextDocument. It should be used only by the Repository.

Parameters:
luceneId - the id within the Lucene index
title - the title of the document
url - the url of the document
externalId - the external id
content - the content of the document
Method Detail

getContent

public java.lang.String getContent()
Gets the content of the document

Returns:
the content

getExternalId

public java.lang.String getExternalId()
Gets the external id used when the document was inserted.

Returns:
the external id

getLuceneId

public int getLuceneId()
Gets the Lucene internal id of the document

Returns:
the Lucene id

getParagraphCorpus

public ParagraphCorpus getParagraphCorpus()
Gets the ParagraphCorpus created with the paragraphs of the TextDocument

Returns:
A ParagraphCorpus object or null.

getSentenceCorpus

public SentenceCorpus getSentenceCorpus()
Gets the SentenceCorpus created with the sentences of the TextDocument

Returns:
A SentenceCorpus object or null.
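
Because both corpus getters may return null, callers may want to fail early with a clear message instead of hitting a NullPointerException later. The following is a minimal sketch: CorpusGuard and requireCorpus are hypothetical names that do not exist in tml, and the hint in the error message assumes that a null corpus means load(repository) has not been called successfully, which this documentation does not state explicitly.

```java
public class CorpusGuard {
    /**
     * Returns the given corpus unchanged, or throws if it is null.
     * Generic so that it works for SentenceCorpus and ParagraphCorpus alike.
     */
    static <T> T requireCorpus(T corpus, String kind) {
        if (corpus == null) {
            // Assumption: a null corpus usually means the document was not loaded.
            throw new IllegalStateException("No " + kind
                    + " corpus available; was load(repository) called?");
        }
        return corpus;
    }

    public static void main(String[] args) {
        // With tml on the classpath the call would look like:
        //   SentenceCorpus sentences =
        //           requireCorpus(document.getSentenceCorpus(), "sentence");
        String standIn = requireCorpus("stand-in corpus", "sentence");
        System.out.println(standIn);

        try {
            requireCorpus(null, "paragraph");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```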

getTitle

public java.lang.String getTitle()
Gets the title of the document

Returns:
the title

getUrl

public java.lang.String getUrl()
Gets the url of the document

Returns:
the url

load

public void load(Repository repository)
          throws java.lang.Exception,
                 java.io.IOException,
                 org.apache.lucene.queryParser.ParseException,
                 NotEnoughTermsInCorpusException,
                 NoDocumentsInCorpusException,
                 TermWeightingException
Loads the corpora for the TextDocument using the parameters currently set on the document. A reference to the repository is required in order to load the term frequency vectors.

Parameters:
repository - the Repository from which the term frequency vectors are loaded
Throws:
java.lang.Exception
java.io.IOException
org.apache.lucene.queryParser.ParseException
NotEnoughTermsInCorpusException
NoDocumentsInCorpusException
TermWeightingException
NormalizationException
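
Since load declares java.lang.Exception alongside more specific exception types, a caller that wants to react differently to each failure has to catch the specific types before the general one. The sketch below illustrates that ordering. To compile without tml it uses stand-ins: the nested exception classes only borrow the names listed above, and the Loadable interface, LoadErrorHandling class and returned messages are hypothetical.

```java
import java.io.IOException;

public class LoadErrorHandling {
    // Stand-ins named after the tml exceptions; the real classes live in the library.
    public static class NoDocumentsInCorpusException extends Exception {}
    public static class NotEnoughTermsInCorpusException extends Exception {}

    /** Hypothetical wrapper around a call such as document.load(repository). */
    public interface Loadable {
        void load() throws Exception;
    }

    /** Runs the load and maps each failure type to a short description. */
    static String tryLoad(Loadable document) {
        try {
            document.load();
            return "loaded";
        } catch (NoDocumentsInCorpusException | NotEnoughTermsInCorpusException e) {
            // The document does not contain enough text to build a corpus.
            return "too little text";
        } catch (IOException e) {
            // The underlying index could not be read.
            return "I/O problem";
        } catch (Exception e) {
            // Everything else, caught last because it is the most general type.
            return "other failure";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryLoad(() -> { }));
        System.out.println(tryLoad(() -> { throw new IOException("index unreadable"); }));
        System.out.println(tryLoad(() -> { throw new NoDocumentsInCorpusException(); }));
    }
}
```

With the real library, the body of the try block would be the call document.load(repository), and the remaining declared exceptions could be given their own catch clauses in the same way.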

getParameters

public CorpusParameters getParameters()
Gets the parameters used to create the corpora of the TextDocument

Returns:
the parameters

setParameters

public void setParameters(CorpusParameters parameters)
Sets the parameters used to create the corpora of the TextDocument

Parameters:
parameters - the parameters to set

toString

public java.lang.String toString()
The default view of a TextDocument is its title

Overrides:
toString in class java.lang.Object