Repository

tml.storage
Class Repository

java.lang.Object
  tml.storage.Repository

public class Repository
extends java.lang.Object

This class represents a document repository. Documents can be inserted into, deleted from and searched in a Repository. All documents that were successfully inserted in a repository can later be used to create a Corpus and perform operations on them.

At the heart of a repository lies the TextDocument, which represents a text document and is accessible through any id of your choice (e.g. one taken from a database or from the filesystem). The content of a new document is expected to be plain text. Importers for other formats will be provided in time; for the moment only a Wiki cleaner is available.

Once inserted in the Repository, documents can be searched using the searchTextDocuments method. Queries use the Apache Lucene query syntax.

Code examples

Initialising a Repository:

 Repository repository = new Repository("path/to/repository/folder");

Obtaining all the documents in a Repository

 ...
 List<TextDocument> documents = repository.getAllTextDocuments();
 for(TextDocument doc : documents) {
   System.out.println("Document:" + doc.getTitle());
 }
 ...

Inserting a document

 String content = "The content of my document";
 String title = "A title";
 String url = "http://www/mydoc.txt";
 String id = "TheIdOfMyDoc";
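 // addDocument also expects an Importer; here the repository's current one is reused
 repository.addDocument(id, content, title, url, repository.getParser());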

Obtaining a document from the repository

 String id = "TheIdOfMyDoc";
 TextDocument doc = repository.getTextDocument(id);

Removing a document from the repository

 TextDocument doc = repository.getTextDocument("someId");
 repository.deleteTextDocument(doc);

Searching for documents containing "foo"

 String query = "foo";
 List<TextDocument> documents = repository.searchTextDocuments(query);
 for (TextDocument doc : documents) {
   System.out.println("Document found:" + doc.getTitle());
 }

Author:: Jorge Villalon
See Also:: TextDocument, Corpus

Constructor Summary
`Repository()`
`Repository(java.lang.String luceneIndexPath)` Creates a new instance of the class `Repository` using a Standard Analyzer without stop words removal.
`Repository(java.lang.String luceneIndexPath, java.util.Locale locale)`

Method Summary
`void`	`addAnnotator(Annotator annotator)` Adds an annotator to the repository
`void`	`addDocument(java.lang.String externalId, java.lang.String content, java.lang.String title, java.lang.String url, Importer importer)` Adds a new document to the repository
`void`	`addDocumentsInFolder(java.lang.String folder)` Adds all the files in a folder to the Lucene index.
`void`	`addDocumentsInFolder(java.lang.String folder, int maxDocs)` Adds the files in a folder to the Lucene index, up to a maximum number of documents.
`void`	`addDocumentsInList(java.io.File[] fileList)` Adds all the files in the list to the repository.
`void`	`addRepositoryListener(RepositoryListener l)` Adds a listener so that the Repository can asynchronously report the state of the processing
`java.lang.Thread`	`annotateDocuments()`
`static java.lang.String`	`cleanIdForLucene(java.lang.String id)` Cleans an id (typically a file name) to suit the syntax of Lucene
`static void`	`cleanStorage(java.lang.String indexPath)` Deletes all the files of the `Repository`.
`java.lang.Thread`	`cleanup()`
`static java.lang.String`	`cleanWord(java.lang.String word)` This method is necessary due to problems processing UTF-8 encoded text pasted from Word.
`void`	`deleteTextDocument(TextDocument document)` Deletes a document from the repository.
`java.lang.String[][]`	`getAllDocuments()`
`java.util.List<TextDocument>`	`getAllTextDocuments()` Returns a list with all the documents in the repository in `TextDocument` form
`org.apache.lucene.analysis.Analyzer`	`getAnalyzer()` Gets the Lucene analyzer that the `Repository` is using
`java.lang.String`	`getAnnotations(org.apache.lucene.document.Document luceneDocument, java.lang.String documentId, java.lang.String fieldName)`
`java.util.List<Annotator>`	`getAnnotators()`
`DbConnection`	`getDbConnection()`
`java.lang.String`	`getDocumentField(java.lang.String externalId, java.lang.String fieldname)` Gets the content of a field for a document, using its external id.
`java.lang.String`	`getEncoding()`
`java.lang.String`	`getExecPath()`
`static java.lang.String`	`getFileContent(java.io.File file, java.lang.String charset)` Obtains the content of a text file.
`java.lang.String`	`getIndexPath()`
`org.apache.lucene.index.IndexReader`	`getIndexReader()` Obtains an IndexReader of the Lucene index
`org.apache.lucene.search.IndexSearcher`	`getIndexSearcher()` Obtains an IndexSearcher for the Lucene index
`java.util.Locale`	`getLocale()`
`java.lang.String`	`getLuceneContentField()` Gets the name of the field used by the underlying Lucene index for the content
`java.lang.String`	`getLuceneExternalIdField()` Gets the name of the field used by the underlying Lucene index for the external id
`java.lang.String`	`getLuceneParentDocumentField()`
`java.lang.String`	`getLuceneParentField()` Gets the name of the field used by the underlying Lucene index for the parent
`java.lang.String`	`getLucenePenntreeField()`
`java.lang.String`	`getLuceneTitleField()` Gets the name of the field used by the underlying Lucene index for the title
`java.lang.String`	`getLuceneTypeField()`
`java.lang.String`	`getLuceneUrlField()` Gets the name of the field used by the underlying Lucene index for the url
`int`	`getMaxDocumentsToIndex()`
`Importer`	`getParser()` Gets the `Importer` used to transform the content before inserting into the `Repository`
`java.lang.String`	`getProcessedPath()`
`java.lang.String[]`	`getStopwords()`
`java.lang.String`	`getSvdStoragePath()`
`TextDocument`	`getTextDocument(java.lang.String externalId)` Gets a document from the repository by its external id.
`java.lang.String`	`getTmpPath()`
`boolean`	`isBibliographyTitle(java.lang.String sentence)` Checks whether a sentence is the title of the references section
`void`	`removeAnnotator(Annotator annotator)` Removes an annotator from the repository
`void`	`removeRepositoryListener(RepositoryListener l)` Removes a listener that was previously added, if it exists
`void`	`setEncoding(java.lang.String encoding)` Sets the character encoding that will be used in this repository
`void`	`setExecPath(java.lang.String execPath)`
`void`	`setMaxDocumentsToIndex(int maxDocumentsToIndex)`

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

Repository

public Repository()
           throws java.io.IOException,
                  java.sql.SQLException

Throws:: java.io.IOException; java.sql.SQLException

Repository

public Repository(java.lang.String luceneIndexPath)
           throws java.io.IOException,
                  java.sql.SQLException

Creates a new instance of the class Repository using a Standard Analyzer without stop words removal.

Parameters:: luceneIndexPath - an absolute path to the folder that stores the Lucene Index
Throws:: java.io.IOException; java.sql.SQLException

Repository

public Repository(java.lang.String luceneIndexPath,
                  java.util.Locale locale)
           throws java.io.IOException,
                  java.sql.SQLException

Parameters:: luceneIndexPath - an absolute path to the folder that stores the Lucene Index; locale - the locale to use for the repository
Throws:: java.io.IOException; java.sql.SQLException

Method Detail

cleanIdForLucene

public static java.lang.String cleanIdForLucene(java.lang.String id)

Cleans an id (typically a file name) to suit the syntax of Lucene

Parameters:: id - the external id of a document
Returns:: the id clean of special characters that Lucene uses
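
The exact cleaning rules are not documented here. As an illustration only, the following sketch replaces each character that has a special meaning in Lucene's classic query syntax with an underscore. The class name LuceneIdSketch, the method cleanId and the choice of an underscore as replacement are assumptions made for this example; the real cleanIdForLucene may behave differently.

```java
// Hypothetical sketch; not the TML implementation of cleanIdForLucene.
final class LuceneIdSketch {
    // Characters with a special meaning in Lucene's classic query syntax.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    /** Replaces every Lucene special character in the id with an underscore. */
    static String cleanId(String id) {
        StringBuilder cleaned = new StringBuilder(id.length());
        for (int i = 0; i < id.length(); i++) {
            char c = id.charAt(i);
            cleaned.append(SPECIAL.indexOf(c) >= 0 ? '_' : c);
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        // ':' , '(' and ')' are replaced; letters, digits and '.' are kept.
        System.out.println(cleanId("report:2010(final).txt"));
    }
}
```

Cleaning an id this way lets a file name be used directly in a query without it being misread as query operators.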

cleanStorage

public static void cleanStorage(java.lang.String indexPath)
                         throws org.apache.lucene.index.CorruptIndexException,
                                org.apache.lucene.store.LockObtainFailedException,
                                java.io.IOException,
                                java.sql.SQLException

Deletes all the files of the Repository.

Parameters:: indexPath - The path to the folder where the LuceneIndex files are stored
Throws:: java.io.IOException; org.apache.lucene.store.LockObtainFailedException; org.apache.lucene.index.CorruptIndexException; java.sql.SQLException

cleanWord

public static java.lang.String cleanWord(java.lang.String word)

This method is necessary due to problems processing UTF-8 encoded text pasted from Word. Single and double quotation marks often arrive as typographic characters that do not match the plain quotation characters, which makes them impossible for the parsers to detect.

Parameters:: word - the word to clean
Returns:: the cleaned word
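
To make the problem concrete, the sketch below maps the typographic quotation marks that Word commonly produces to their plain ASCII equivalents. It is a guess at the kind of normalisation described above, not the actual cleanWord code; the names QuoteCleanerSketch and cleanQuotes are hypothetical, and the real method may handle more characters.

```java
// Hypothetical sketch; the real cleanWord may cover other characters as well.
final class QuoteCleanerSketch {
    /** Maps typographic quotation marks to their plain ASCII equivalents. */
    static String cleanQuotes(String word) {
        return word
                .replace('\u2018', '\'')  // left single quotation mark
                .replace('\u2019', '\'')  // right single quotation mark
                .replace('\u201C', '"')   // left double quotation mark
                .replace('\u201D', '"');  // right double quotation mark
    }

    public static void main(String[] args) {
        // The curly quotes around and inside the word become straight quotes.
        System.out.println(cleanQuotes("\u201Cdon\u2019t\u201D"));
    }
}
```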

getFileContent

public static java.lang.String getFileContent(java.io.File file,
                                              java.lang.String charset)
                                       throws java.io.UnsupportedEncodingException,
                                              java.io.FileNotFoundException,
                                              java.io.IOException

Obtains the content of a text file. The file is read line by line and each line is written back terminated by a single \n, so any \r is removed to make further processing easier.

Parameters:: file - an absolute path to the file; charset - the charset used (default is UTF-8)
Returns:: the text content of the file
Throws:: java.io.UnsupportedEncodingException; java.io.FileNotFoundException; java.io.IOException
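
The line-ending normalisation described above can be sketched with standard Java I/O. This is a minimal illustration, not the TML source: the names FileContentSketch and readNormalized are hypothetical, and appending a \n after the last line is an assumption.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;

// Hypothetical sketch of reading a file while normalising line endings.
final class FileContentSketch {
    /** Reads a text file line by line and terminates every line with a single newline. */
    static String readNormalized(File file, String charset) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            String line;
            // readLine() strips LF, CR and CRLF terminators alike.
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("sketch", ".txt");
        tmp.deleteOnExit();
        // Write a file with Windows line endings, then read it back normalised.
        Files.write(tmp.toPath(), "a\r\nb\r\n".getBytes("UTF-8"));
        System.out.print(readNormalized(tmp, "UTF-8"));
    }
}
```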

getTmpPath

public java.lang.String getTmpPath()

getProcessedPath

public java.lang.String getProcessedPath()

getExecPath

public java.lang.String getExecPath()

setExecPath

public void setExecPath(java.lang.String execPath)

getDbConnection

public DbConnection getDbConnection()

getLuceneParentDocumentField

public java.lang.String getLuceneParentDocumentField()

Returns:: the luceneParentDocumentField

getAllDocuments

public java.lang.String[][] getAllDocuments()

addAnnotator

public void addAnnotator(Annotator annotator)

Adds an annotator to the repository

Parameters:: annotator - the annotator

addRepositoryListener

public void addRepositoryListener(RepositoryListener l)

Adds a listener so that the Repository can asynchronously report the state of the processing

Parameters:: l - the listener to add

removeRepositoryListener

public void removeRepositoryListener(RepositoryListener l)

Removes a listener that was previously added, if it exists

Parameters:: l - the listener to remove
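
The RepositoryListener interface itself is not documented on this page, so the following is only a generic sketch of the add/remove listener pattern these two methods suggest. ListenerSketch, ProgressListener, onProgress and fire are hypothetical names invented for the example and do not describe the TML API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of listener registration; not the TML classes.
final class ListenerSketch {
    /** Hypothetical callback; the real RepositoryListener interface may differ. */
    interface ProgressListener {
        void onProgress(String message);
    }

    // A copy-on-write list tolerates listeners being added or removed while events fire.
    private final List<ProgressListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(ProgressListener l) {
        listeners.add(l);
    }

    /** Removing a listener that was never added is a no-op. */
    void removeListener(ProgressListener l) {
        listeners.remove(l);
    }

    /** Notifies every registered listener of a progress message. */
    void fire(String message) {
        for (ProgressListener l : listeners) {
            l.onProgress(message);
        }
    }

    public static void main(String[] args) {
        ListenerSketch source = new ListenerSketch();
        ProgressListener printer = message -> System.out.println("Progress: " + message);
        source.addListener(printer);
        source.fire("indexing started");
        source.removeListener(printer);
        source.fire("indexing finished"); // nobody is listening any more
    }
}
```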

addDocument

public void addDocument(java.lang.String externalId,
                        java.lang.String content,
                        java.lang.String title,
                        java.lang.String url,
                        Importer importer)
                 throws java.io.IOException,
                        java.sql.SQLException

Adds a new document to the repository

Parameters:: externalId - an external id to identify the document; content - the content of the document; title - the title of the document; url - a url to find the document (optional); importer - an importer (how to decode the content)
Throws:: java.io.IOException; java.sql.SQLException

addDocumentsInFolder

public void addDocumentsInFolder(java.lang.String folder)
                          throws java.io.IOException

Adds all the files in a folder to the Lucene index. Only .txt files are processed.

Parameters:: folder - an absolute path to the folder that contains the files
Throws:: java.io.IOException

addDocumentsInFolder

public void addDocumentsInFolder(java.lang.String folder,
                                 int maxDocs)
                          throws java.io.IOException

Adds the files in a folder to the Lucene index, up to a maximum number of documents. Only .txt files are processed.

Parameters:: folder - an absolute path to the folder that contains the files; maxDocs - the maximum number of documents to index
Throws:: java.io.IOException

addDocumentsInList

public void addDocumentsInList(java.io.File[] fileList)
                        throws org.apache.lucene.index.CorruptIndexException,
                               java.io.IOException

Adds all the files in the list to the repository. Files are filtered by extension: only those ending with ".txt" are loaded, and files whose names start with a dot (".") are ignored.

Parameters:: fileList - the files to add
Throws:: org.apache.lucene.index.CorruptIndexException; java.io.IOException
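
The filtering rule stated above can be sketched as follows. The names TxtFilterSketch and selectTextFiles are hypothetical, and the case-sensitive comparison of the extension is an assumption; the sketch only shows which files would pass the filter, not how TML indexes them.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the ".txt" and leading-dot filter; not the TML source.
final class TxtFilterSketch {
    /** Keeps files whose names end with ".txt" and do not start with a dot. */
    static List<File> selectTextFiles(File[] fileList) {
        List<File> selected = new ArrayList<>();
        for (File f : fileList) {
            String name = f.getName();
            if (name.endsWith(".txt") && !name.startsWith(".")) {
                selected.add(f);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        File[] candidates = {
            new File("a.txt"), new File(".hidden.txt"), new File("b.pdf"), new File("dir/c.txt")
        };
        // Only a.txt and c.txt pass the filter.
        for (File f : selectTextFiles(candidates)) {
            System.out.println(f.getName());
        }
    }
}
```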

annotateDocuments

public java.lang.Thread annotateDocuments()

deleteTextDocument

public void deleteTextDocument(TextDocument document)
                        throws java.io.IOException

Deletes a document from the repository. A TextDocument object is required, so the document must first be obtained from the repository.

Parameters:: document - the TextDocument to delete
Throws:: java.io.IOException

getAllTextDocuments

public java.util.List<TextDocument> getAllTextDocuments()
                                                 throws java.lang.Exception

Returns a list with all the documents in the repository in TextDocument form

Returns:: a list of TextDocument
Throws:: java.lang.Exception

getAnalyzer

public org.apache.lucene.analysis.Analyzer getAnalyzer()

Gets the Lucene analyzer that the Repository is using

Returns:: the Analyzer

getAnnotators

public java.util.List<Annotator> getAnnotators()

Returns:: the annotators available for this repository

getDocumentField

public java.lang.String getDocumentField(java.lang.String externalId,
                                         java.lang.String fieldname)
                                  throws java.io.IOException

Gets the content of a field for a document, using its external id.

Parameters:: externalId - the id of the document; fieldname - the name of the field to retrieve
Returns:: the content of the field
Throws:: java.io.IOException

getEncoding

public java.lang.String getEncoding()

Returns:: the encoding used by TML

getIndexPath

public java.lang.String getIndexPath()

Returns:: the path to the Lucene index

getIndexReader

public org.apache.lucene.index.IndexReader getIndexReader()
                                                   throws java.io.IOException

Obtains an IndexReader of the Lucene index

Returns:: the IndexReader
Throws:: java.io.IOException

getIndexSearcher

public org.apache.lucene.search.IndexSearcher getIndexSearcher()
                                                        throws java.io.IOException

Obtains an IndexSearcher for the Lucene index

Returns:: the IndexSearcher
Throws:: java.io.IOException

getLocale

public java.util.Locale getLocale()

Returns:: the Locale being used by TML

getLuceneContentField

public java.lang.String getLuceneContentField()

Gets the name of the field used by the underlying Lucene index for the content

Returns:: the name of the content field

getLuceneExternalIdField

public java.lang.String getLuceneExternalIdField()

Gets the name of the field used by the underlying Lucene index for the external id

Returns:: the name of the external id field

getLuceneParentField

public java.lang.String getLuceneParentField()

Gets the name of the field used by the underlying Lucene index for the parent

Returns:: the name of the parent field

getLucenePenntreeField

public java.lang.String getLucenePenntreeField()

Returns:: the name of the field used to store the PennTree bank string

getLuceneTitleField

public java.lang.String getLuceneTitleField()

Gets the name of the field used by the underlying Lucene index for the title

Returns:: the name of the title field

getLuceneTypeField

public java.lang.String getLuceneTypeField()

Returns:: the name of the field that stores the type of the Lucene document (document, paragraph or sentence)

getLuceneUrlField

public java.lang.String getLuceneUrlField()

Gets the name of the field used by the underlying Lucene index for the url

Returns:: the name of the url field

getMaxDocumentsToIndex

public int getMaxDocumentsToIndex()

Returns:: the maxDocumentsToIndex

getParser

public Importer getParser()

Gets the Importer used to transform the content before inserting into the Repository

Returns:: the Importer being used by TML

getStopwords

public java.lang.String[] getStopwords()

Returns:: the list of stopwords used to analyse and parse documents

getSvdStoragePath

public java.lang.String getSvdStoragePath()

Returns:: the svdStoragePath

getTextDocument

public TextDocument getTextDocument(java.lang.String externalId)
                             throws java.io.IOException

Gets a document from the repository by its external id. Returns a TextDocument object with basic information about the document, such as its title and url. To perform operations on the document, it must be loaded, which means that a Corpus and its inner SemanticSpace will be created.

Parameters:: externalId - the id of the document
Returns:: a TextDocument
Throws:: java.io.IOException

isBibliographyTitle

public boolean isBibliographyTitle(java.lang.String sentence)

Checks whether a sentence is the title of the references section

Parameters:: sentence - the sentence to evaluate
Returns:: true if the sentence corresponds to the title of the references section

removeAnnotator

public void removeAnnotator(Annotator annotator)

Removes an annotator from the repository

Parameters:: annotator - the annotator

setEncoding

public void setEncoding(java.lang.String encoding)

Sets the character encoding that will be used in this repository

Parameters:: encoding - the character encoding to use

setMaxDocumentsToIndex

public void setMaxDocumentsToIndex(int maxDocumentsToIndex)

Parameters:: maxDocumentsToIndex - the maxDocumentsToIndex to set

getAnnotations

public java.lang.String getAnnotations(org.apache.lucene.document.Document luceneDocument,
                                       java.lang.String documentId,
                                       java.lang.String fieldName)

cleanup

public java.lang.Thread cleanup()
