java.lang.Object
  tml.corpus.TextDocument
public class TextDocument
The TextDocument class represents a whole document, which comprises content, a title and a URL. Each document is identified by an id, known as the externalId. It also has an internal id, the Lucene id, which identifies the document within the underlying Lucene index.
A TextDocument contains two corpora, a sentence-based Corpus and a paragraph-based Corpus. The TextDocument is responsible for loading both and assigning the necessary parameters for their creation. This means that the construction of the Corpus and the SemanticSpace is defined on a per-document basis.
The TextDocument contains a duplicate of its content, which can cause scalability problems with long documents (more than 2000 terms, approx. 10000 words).
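The size limit mentioned above can be checked before a document is loaded. The following is only a rough sketch: it assumes that "terms" means distinct lowercase tokens and splits on anything that is not a letter or digit, whereas TML extracts terms through Lucene and may count differently. The class `ContentSizeCheck` and its methods are hypothetical and not part of TML.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of TML: rough size check for a document's content.
class ContentSizeCheck {
    // Threshold taken from the note above (more than 2000 terms).
    static final int MAX_TERMS = 2000;

    // Counts distinct lowercase tokens, splitting on anything that is not a letter or digit.
    // Assumption: this only approximates the term count that Lucene would produce.
    static int countDistinctTerms(String content) {
        Set<String> terms = new HashSet<>();
        for (String token : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms.size();
    }

    // True when the content exceeds the documented term threshold.
    static boolean isLikelyTooLong(String content) {
        return countDistinctTerms(content) > MAX_TERMS;
    }

    public static void main(String[] args) {
        String content = "The cat sat. The dog sat too.";
        System.out.println(countDistinctTerms(content) + " distinct terms");
        System.out.println("too long: " + isLikelyTooLong(content));
    }
}
```

A caller could run `ContentSizeCheck.isLikelyTooLong(document.getContent())` and, for example, log a warning before loading the corpora.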
The most basic way to use a TextDocument is to perform operations on its corpora. Examples of operations include calculating semantic distances between sentences and extracting the most important paragraphs (based on variance).
The following example shows how to obtain a TextDocument from a Repository and then how to extract the key sentences.
    Repository repository = new Repository("path/to/repository");
    TextDocument document = repository.getTextDocument("foo");
    if (document != null) {
        System.out.println("Document " + document.getTitle() + " found");
    }
Next, we set the parameters for the document's corpora and then load them.
    document.setTermSelection(TermSelection.MIN_DF);
    document.setTermSelectionThreshold(1);
    document.setTermLocalWeight(LocalWeight.TF);
    document.setTermGlobalWeight(GlobalWeight.Idf);
    document.setDimensionalityReduction(DimensionalityReduction.DIMENSIONS_MAX_PERCENTAGE);
    document.setDimensionalityReductionThreshold(50);
    document.setDimensionsReduced(true);
    document.setNormalized(true);
    document.load(repository);
Finally we can perform an operation and show the results.
    KeyTextPassages operation = new KeyTextPassages();
    operation.setCorpus(document.getSentenceCorpus());
    operation.start();
    for (KeyTextPassagesResult result : operation.getResults()) {
        System.out.println("Sentence id: " + result.getTextPassageId()
                + " from eigenvector:" + result.getEigenVectorIndex()
                + " with load:" + result.getLoad()
                + " content:" + result.getTextPassageContent());
    }
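The method details for getSentenceCorpus() and getParagraphCorpus() state that they return the corpus object or null, so it can help to fail early with a clear message instead of passing null to an operation. The sketch below is one possible way to do that. The helper `CorpusAccess.requireCorpus` is hypothetical and not part of TML, and the hint in its error message assumes that a null corpus means load(repository) has not been called, which this page does not confirm.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper, not part of TML: wraps a getter that may return null.
class CorpusAccess {
    // Returns the value supplied by the getter, or fails with a descriptive message if it is null.
    static <C> C requireCorpus(Supplier<C> getter, String name) {
        return Optional.ofNullable(getter.get())
                .orElseThrow(() -> new IllegalStateException(
                        name + " is null; was load(repository) called?"));
    }

    public static void main(String[] args) {
        // Stand-in for document::getSentenceCorpus on a document whose corpus is null.
        Supplier<Object> missing = () -> null;
        try {
            requireCorpus(missing, "sentence corpus");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the real classes, the call in the example would become `operation.setCorpus(CorpusAccess.requireCorpus(document::getSentenceCorpus, "sentence corpus"));`.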
See Also: AbstractOperation, Corpus
Constructor Summary | |
---|---|
TextDocument(int luceneId, java.lang.String title, java.lang.String url, java.lang.String externalId, java.lang.String content) Constructor of TextDocument. |
Method Summary | |
---|---|
java.lang.String | getContent() Gets the content of the document. |
java.lang.String | getExternalId() Gets the external id used when the document was inserted. |
int | getLuceneId() Gets the Lucene internal id of the document. |
ParagraphCorpus | getParagraphCorpus() Gets the ParagraphCorpus created with the paragraphs of the TextDocument. |
CorpusParameters | getParameters() |
SentenceCorpus | getSentenceCorpus() Gets the SentenceCorpus created with the sentences of the TextDocument. |
java.lang.String | getTitle() Gets the title of the document. |
java.lang.String | getUrl() Gets the url of the document. |
void | load(Repository repository) Loads the corpora for the TextDocument with all the parameters that the document has set. |
void | setParameters(CorpusParameters parameters) |
java.lang.String | toString() The default view of a TextDocument is its title. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
public TextDocument(int luceneId, java.lang.String title, java.lang.String url, java.lang.String externalId, java.lang.String content)
Constructor of TextDocument. It creates a new instance of a TextDocument. It should be used only by the Repository.
Parameters:
luceneId - the id within the Lucene index
title - the title of the document
url - the url of the document
externalId - the external id
content - the content of the document
Method Detail |
---|
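public java.lang.String getContent()
Gets the content of the document.
public java.lang.String getExternalId()
Gets the external id used when the document was inserted.
public int getLuceneId()
Gets the Lucene internal id of the document.
public ParagraphCorpus getParagraphCorpus()
Gets the ParagraphCorpus created with the paragraphs of the TextDocument.
Returns: the ParagraphCorpus object or null.
public SentenceCorpus getSentenceCorpus()
Gets the SentenceCorpus created with the sentences of the TextDocument.
Returns: the SentenceCorpus object or null.
public java.lang.String getTitle()
Gets the title of the document.
public java.lang.String getUrl()
Gets the url of the document.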
public void load(Repository repository) throws java.lang.Exception, java.io.IOException, org.apache.lucene.queryParser.ParseException, NotEnoughTermsInCorpusException, NoDocumentsInCorpusException, TermWeightingException
Loads the corpora for the TextDocument with all the parameters that the document has set. To load the term frequency vectors, a reference to the repository is necessary.
Parameters:
repository -
Throws:
java.lang.Exception
java.io.IOException
org.apache.lucene.queryParser.ParseException
NotEnoughTermsInCorpusException
NoDocumentsInCorpusException
TermWeightingException
NormalizationException
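Because load is declared to throw java.lang.Exception alongside more specific types, a caller has to handle the general case and must place the more specific catch clauses first. The sketch below illustrates that ordering with stand-in types only: the `Loader` interface and `LoadErrorHandling` class are hypothetical, and no TML or Lucene classes are used, so the TML-specific exceptions listed above would fall into the general branch unless they get their own catch clauses.

```java
import java.io.IOException;

// Hypothetical sketch, not part of TML: catch ordering for a method declared `throws Exception`.
class LoadErrorHandling {
    // Stand-in for a call such as document.load(repository).
    interface Loader {
        void load() throws Exception;
    }

    // Runs the loader and maps the outcome to a short status string.
    static String tryLoad(Loader loader) {
        try {
            loader.load();
            return "loaded";
        } catch (IOException e) {
            // Most specific type first, otherwise the compiler rejects the clause as unreachable.
            return "I/O problem: " + e.getMessage();
        } catch (Exception e) {
            // General fallback, required because the method declares java.lang.Exception.
            return "load failed: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryLoad(() -> { }));
        System.out.println(tryLoad(() -> { throw new IOException("index missing"); }));
        System.out.println(tryLoad(() -> { throw new IllegalStateException("empty corpus"); }));
    }
}
```

In real code, additional catch clauses for types such as NotEnoughTermsInCorpusException could be inserted before the general one to report corpus problems separately.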
public CorpusParameters getParameters()
public void setParameters(CorpusParameters parameters)
Parameters:
parameters - the parameters to set
public java.lang.String toString()
The default view of a TextDocument is its title.
Overrides:
toString in class java.lang.Object
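To make the documented shape of the class easier to picture, here is a small stand-in that mirrors the constructor arguments, the simple getters and the toString behaviour described on this page. It is not the TML source: the class name `TextDocumentSketch` is hypothetical, the corpora and parameters are left out, and the sample values (including the URL) are placeholders.

```java
// Hypothetical stand-in, not the TML source: mirrors the documented constructor and getters.
class TextDocumentSketch {
    private final int luceneId;       // id within the Lucene index
    private final String title;
    private final String url;
    private final String externalId;  // id used when the document was inserted
    private final String content;

    TextDocumentSketch(int luceneId, String title, String url, String externalId, String content) {
        this.luceneId = luceneId;
        this.title = title;
        this.url = url;
        this.externalId = externalId;
        this.content = content;
    }

    int getLuceneId() { return luceneId; }
    String getTitle() { return title; }
    String getUrl() { return url; }
    String getExternalId() { return externalId; }
    String getContent() { return content; }

    // As documented, the default view of a document is its title.
    @Override
    public String toString() {
        return title;
    }

    public static void main(String[] args) {
        TextDocumentSketch doc = new TextDocumentSketch(
                7, "My title", "http://example.org/doc", "foo", "Some content.");
        System.out.println("Document " + doc + " has external id " + doc.getExternalId());
    }
}
```

Since toString returns the title, a document can be concatenated directly into log messages or shown in a list without calling getTitle().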