TextDocument

tml.corpus
Class TextDocument

java.lang.Object
  tml.corpus.TextDocument

public class TextDocument
extends java.lang.Object

The TextDocument class represents a whole document, which comprises its content, a title and a URL. Each document is identified by an id, known as the externalId. It also has an internal id, the Lucene id, which identifies the document within the underlying Lucene index.

A TextDocument contains two corpora: a sentence-based Corpus and a paragraph-based Corpus. The TextDocument is responsible for loading both and for assigning the parameters needed to create them. This means that the construction of the Corpus and the SemanticSpace is defined on a per-document basis.

The TextDocument holds a duplicate of its content, which can cause scalability problems with long documents (more than 2000 terms, approx. 10000 words).

The most basic way to use a TextDocument is to perform operations on its corpora, for example calculating semantic distances between sentences or extracting the most important paragraphs (based on variance).
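To make the idea of a semantic distance between two sentences concrete, here is a minimal standalone sketch that computes the cosine similarity between two sparse term vectors. Cosine similarity is one common choice for this kind of comparison; whether TML's operations use exactly this measure is an assumption, and the class `CosineSketch` and its methods are hypothetical names that do not belong to the library.

```java
import java.util.Map;

/** Hypothetical helper, not part of TML: cosine similarity between sparse term vectors. */
final class CosineSketch {

    /** Returns the cosine of the angle between two term-weight vectors, or 0 if either is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        // Only terms present in both vectors contribute to the dot product.
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> vector) {
        double sum = 0.0;
        for (double weight : vector.values()) {
            sum += weight * weight;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Two toy "sentences" that share one of their two terms.
        Map<String, Double> first = Map.of("semantic", 1.0, "space", 1.0);
        Map<String, Double> second = Map.of("semantic", 1.0, "index", 1.0);
        System.out.println(cosine(first, second)); // approximately 0.5
    }
}
```

A value close to 1 means the two vectors point in nearly the same direction, and a value of 0 means they share no weighted terms.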

The following example shows how to obtain a TextDocument from a Repository and then how to extract the key sentences.

 Repository repository = new Repository("path/to/repository");
 TextDocument document = repository.getTextDocument("foo");
 if (document != null) {
     System.out.println("Document " + document.getTitle() + " found");
 }

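Next, set the parameters used to build the document's corpora, and then load them. (The individual setters shown here do not appear in the Method Summary below, which lists only getParameters() and setParameters(CorpusParameters), so they may belong to a different version of the class.)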

 document.setTermSelection(TermSelection.MIN_DF);
 document.setTermSelectionThreshold(1);
 document.setTermLocalWeight(LocalWeight.TF);
 document.setTermGlobalWeight(GlobalWeight.Idf);
 document.setDimensionalityReduction(DimensionalityReduction.DIMENSIONS_MAX_PERCENTAGE);
 document.setDimensionalityReductionThreshold(50);
 document.setDimensionsReduced(true);
 document.setNormalized(true);
 document.load(repository);
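The local weight TF and global weight Idf selected above suggest a TF-IDF style term weighting. The following standalone sketch shows one textbook form of it: raw term frequency multiplied by ln(N / df). The exact formulas TML applies are not stated here, so the formula, the class `TfIdfSketch` and its method names are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not TML code: a TF local weight combined with an IDF global weight. */
final class TfIdfSketch {

    /** Local weight: how many times the term occurs in one passage. */
    static int termFrequency(List<String> passage, String term) {
        int count = 0;
        for (String token : passage) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** Number of passages that contain the term at least once. */
    static int documentFrequency(List<List<String>> passages, String term) {
        int count = 0;
        for (List<String> passage : passages) {
            if (passage.contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** Global weight: ln(N / df), or 0 when the term occurs in no passage. */
    static double inverseDocumentFrequency(List<List<String>> passages, String term) {
        int df = documentFrequency(passages, term);
        if (df == 0) {
            return 0.0;
        }
        return Math.log((double) passages.size() / df);
    }

    /** Combined weight of a term in a passage. */
    static double weight(List<List<String>> passages, List<String> passage, String term) {
        return termFrequency(passage, term) * inverseDocumentFrequency(passages, term);
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("corpus", "term", "corpus");
        List<String> second = Arrays.asList("term", "index");
        List<List<String>> passages = Arrays.asList(first, second);
        // "corpus" occurs twice in the first passage and in one of two passages overall.
        System.out.println(weight(passages, first, "corpus"));
        // "term" occurs in every passage, so its global weight is zero.
        System.out.println(weight(passages, first, "term"));
    }
}
```

Under this weighting, a term that appears in every passage carries no discriminating information and receives a weight of zero.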

Finally we can perform an operation and show the results.

 KeyTextPassages operation = new KeyTextPassages();
 operation.setCorpus(document.getSentenceCorpus());
 operation.start();
 
 for (KeyTextPassagesResult result : operation.getResults()) {
     System.out.println("Sentence id: " + result.getTextPassageId()
             + " from eigenvector:" + result.getEigenVectorIndex()
             + " with load:" + result.getLoad()
             + " content:" + result.getTextPassageContent());
 }

Author:: Jorge Villalon
See Also:: AbstractOperation, Corpus

Constructor Summary
`TextDocument(int luceneId, java.lang.String title, java.lang.String url, java.lang.String externalId, java.lang.String content)` Constructor of `TextDocument`.

Method Summary
`java.lang.String`	`getContent()` Gets the content of the document
`java.lang.String`	`getExternalId()` Gets the external id used when the document was inserted.
`int`	`getLuceneId()` Gets the Lucene internal id of the document
`ParagraphCorpus`	`getParagraphCorpus()` Gets the `ParagraphCorpus` created with the paragraphs of the `TextDocument`
`CorpusParameters`	`getParameters()`
`SentenceCorpus`	`getSentenceCorpus()` Gets the `SentenceCorpus` created with the sentences of the `TextDocument`
`java.lang.String`	`getTitle()` Gets the title of the document
`java.lang.String`	`getUrl()` Gets the url of the document
`void`	`load(Repository repository)` Loads the corpora for the `TextDocument` with all the parameters that the document has set.
`void`	`setParameters(CorpusParameters parameters)`
`java.lang.String`	`toString()` The default view of a TextDocument is its title

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Constructor Detail

TextDocument

public TextDocument(int luceneId,
                    java.lang.String title,
                    java.lang.String url,
                    java.lang.String externalId,
                    java.lang.String content)

Constructor of TextDocument. It creates a new instance of a TextDocument. It should be used only by the Repository.

Parameters:: luceneId - the id within the Lucene index; title - the title of the document; url - the url of the document; externalId - the external id; content - the content of the document

Method Detail

getContent

public java.lang.String getContent()

Gets the content of the document

Returns:: the content

getExternalId

public java.lang.String getExternalId()

Gets the external id used when the document was inserted.

Returns:: the external id

getLuceneId

public int getLuceneId()

Gets the Lucene internal id of the document

Returns:: the Lucene id

getParagraphCorpus

public ParagraphCorpus getParagraphCorpus()

Gets the ParagraphCorpus created with the paragraphs of the TextDocument

Returns:: A ParagraphCorpus object or null.

getSentenceCorpus

public SentenceCorpus getSentenceCorpus()

Gets the SentenceCorpus created with the sentences of the TextDocument

Returns:: A SentenceCorpus object or null.

getTitle

public java.lang.String getTitle()

Gets the title of the document

Returns:: the title

getUrl

public java.lang.String getUrl()

Gets the url of the document

Returns:: the url

load

public void load(Repository repository)
          throws java.lang.Exception,
                 java.io.IOException,
                 org.apache.lucene.queryParser.ParseException,
                 NotEnoughTermsInCorpusException,
                 NoDocumentsInCorpusException,
                 TermWeightingException

Loads the corpora for the TextDocument using all the parameters that have been set on the document. A reference to the repository is required in order to load the term frequency vectors.

Parameters:: repository - the Repository from which the term frequency vectors are loaded
Throws:: java.lang.Exception; java.io.IOException; org.apache.lucene.queryParser.ParseException; NotEnoughTermsInCorpusException; NoDocumentsInCorpusException; TermWeightingException; NormalizationException

getParameters

public CorpusParameters getParameters()

Returns:: the parameters

setParameters

public void setParameters(CorpusParameters parameters)

Parameters:: parameters - the parameters to set

toString

public java.lang.String toString()

The default view of a TextDocument is its title

Overrides:: toString in class java.lang.Object
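Because toString returns the title, a document can be used directly in string concatenation or passed to println. The sketch below illustrates this with a hypothetical stand-in class, `TitledDocumentSketch`; it is not the source of tml.corpus.TextDocument and only mirrors the documented behaviour.

```java
/** Hypothetical stand-in for illustration only; the real class is tml.corpus.TextDocument. */
final class TitledDocumentSketch {

    private final String title;

    TitledDocumentSketch(String title) {
        this.title = title;
    }

    String getTitle() {
        return title;
    }

    /** Mirrors the documented behaviour: the default view of the document is its title. */
    @Override
    public String toString() {
        return title;
    }

    public static void main(String[] args) {
        TitledDocumentSketch document = new TitledDocumentSketch("Latent Semantic Analysis");
        // String concatenation calls toString() implicitly, so the title is printed.
        System.out.println("Document " + document + " found");
    }
}
```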
