tml.storage
Class Repository

java.lang.Object
  extended by tml.storage.Repository

public class Repository
extends java.lang.Object

This class represents a document repository. Documents can be inserted into, deleted from and searched in a Repository. All documents that were successfully inserted into a repository can later be used to create a Corpus and perform operations on it.

At the heart of a repository lies the TextDocument, which represents a text document and is accessible using any id of your choice (e.g. from a database, or from the filesystem). The content of a new document is expected to be plain text. Importers for different formats will be provided in time; for the moment there is only a Wiki cleaner.

Once inserted in the Repository, all the documents can be searched using the searchTextDocuments method. Queries use the syntax of Apache Lucene.

Code examples

Initialising a Repository:

 Repository repository = new Repository("/path/to/repository/folder");
 

Obtaining all the documents in a Repository

 ...
 List<TextDocument> documents = repository.getAllTextDocuments();
 for(TextDocument doc : documents) {
   System.out.println("Document:" + doc.getTitle());
 }
 ...
 

Inserting a document

 String content = "The content of my document";
 String title = "A title";
 String url = "http://www/mydoc.txt";
 String id = "TheIdOfMyDoc";
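 // The documented addDocument signature also takes an Importer; reusing the
 // repository's current one via getParser() is an assumption.
 repository.addDocument(id, content, title, url, repository.getParser());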
 

Obtaining a document from the repository

 String id = "TheIdOfMyDoc";
 TextDocument doc = repository.getTextDocument(id);
 

Removing a document from the repository

 TextDocument doc = repository.getTextDocument("someId");
 repository.deleteTextDocument(doc);
 

Searching for documents containing "foo"

 String query = "foo";
 List<TextDocument> documents = repository.searchTextDocuments(query);
 for (TextDocument doc : documents) {
   System.out.println("Document found:" + doc.getTitle());
 }
 

Author:
Jorge Villalon
See Also:
TextDocument, Corpus

Constructor Summary
Repository()
           
Repository(java.lang.String luceneIndexPath)
          Creates a new instance of the class Repository using a Standard Analyzer without stop-word removal.
Repository(java.lang.String luceneIndexPath, java.util.Locale locale)
           
 
Method Summary
 void addAnnotator(Annotator annotator)
          Adds an annotator to the repository
 void addDocument(java.lang.String externalId, java.lang.String content, java.lang.String title, java.lang.String url, Importer importer)
          Adds a new document to the repository
 void addDocumentsInFolder(java.lang.String folder)
          Adds all the files in a folder to the Lucene index.
 void addDocumentsInFolder(java.lang.String folder, int maxDocs)
          Adds the files in a folder to the Lucene index, up to a maximum number.
 void addDocumentsInList(java.io.File[] fileList)
          Adds all the files in the list to the repository.
 void addRepositoryListener(RepositoryListener l)
          Adds a listener so that the Repository can asynchronously report the state of the processing
 java.lang.Thread annotateDocuments()
           
static java.lang.String cleanIdForLucene(java.lang.String id)
          Cleans an id (typically a file name) to suit the syntax of Lucene
static void cleanStorage(java.lang.String indexPath)
          Deletes all the files of the Repository.
 java.lang.Thread cleanup()
           
static java.lang.String cleanWord(java.lang.String word)
          This method is necessary due to problems processing UTF-8 encoded text that was pasted from Word.
 void deleteTextDocument(TextDocument document)
          Deletes a document from the repository.
 java.lang.String[][] getAllDocuments()
           
 java.util.List<TextDocument> getAllTextDocuments()
          Returns a list with all the documents in the repository in TextDocument form
 org.apache.lucene.analysis.Analyzer getAnalyzer()
          Gets the Lucene analyzer that the Repository is using
 java.lang.String getAnnotations(org.apache.lucene.document.Document luceneDocument, java.lang.String documentId, java.lang.String fieldName)
           
 java.util.List<Annotator> getAnnotators()
           
 DbConnection getDbConnection()
           
 java.lang.String getDocumentField(java.lang.String externalId, java.lang.String fieldname)
          Gets the content of a field for a document, using its external id.
 java.lang.String getEncoding()
           
 java.lang.String getExecPath()
           
static java.lang.String getFileContent(java.io.File file, java.lang.String charset)
          Obtains the content of a text file.
 java.lang.String getIndexPath()
           
 org.apache.lucene.index.IndexReader getIndexReader()
          Obtains an IndexReader of the Lucene index
 org.apache.lucene.search.IndexSearcher getIndexSearcher()
          Obtains an IndexSearcher for the Lucene index
 java.util.Locale getLocale()
           
 java.lang.String getLuceneContentField()
          Gets the name of the field used by the underlying Lucene index for the content
 java.lang.String getLuceneExternalIdField()
          Gets the name of the field used by the underlying Lucene index for the external id
 java.lang.String getLuceneParentDocumentField()
           
 java.lang.String getLuceneParentField()
          Gets the name of the field used by the underlying Lucene index for the parent
 java.lang.String getLucenePenntreeField()
           
 java.lang.String getLuceneTitleField()
          Gets the name of the field used by the underlying Lucene index for the title
 java.lang.String getLuceneTypeField()
           
 java.lang.String getLuceneUrlField()
          Gets the name of the field used by the underlying Lucene index for the url
 int getMaxDocumentsToIndex()
           
 Importer getParser()
          Gets the Importer used to transform the content before inserting into the Repository
 java.lang.String getProcessedPath()
           
 java.lang.String[] getStopwords()
           
 java.lang.String getSvdStoragePath()
           
 TextDocument getTextDocument(java.lang.String externalId)
          Gets a document from the repository by its external id.
 java.lang.String getTmpPath()
           
 boolean isBibliographyTitle(java.lang.String sentence)
          Add reference
 void removeAnnotator(Annotator annotator)
          Removes an annotator from the repository
 void removeRepositoryListener(RepositoryListener l)
          Removes a previously added listener, if it exists
 void setEncoding(java.lang.String encoding)
          Sets the character encoding that will be used in this repository
 void setExecPath(java.lang.String execPath)
           
 void setMaxDocumentsToIndex(int maxDocumentsToIndex)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Repository

public Repository()
           throws java.io.IOException,
                  java.sql.SQLException
Throws:
java.io.IOException
java.sql.SQLException

Repository

public Repository(java.lang.String luceneIndexPath)
           throws java.io.IOException,
                  java.sql.SQLException
Creates a new instance of the class Repository using a Standard Analyzer without stop-word removal.

Parameters:
luceneIndexPath - an absolute path to the folder that stores the Lucene Index
Throws:
java.io.IOException
java.sql.SQLException

Repository

public Repository(java.lang.String luceneIndexPath,
                  java.util.Locale locale)
           throws java.io.IOException,
                  java.sql.SQLException
Parameters:
luceneIndexPath -
locale -
Throws:
java.io.IOException
java.sql.SQLException
Method Detail

cleanIdForLucene

public static java.lang.String cleanIdForLucene(java.lang.String id)
Cleans an id (typically a file name) to suit the syntax of Lucene

Parameters:
id - the external id of a document
Returns:
the id, cleaned of the special characters that Lucene uses
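
To illustrate the kind of cleaning described above, here is a minimal sketch. It is not the TML implementation: the class name, the method name, and the choice of replacing each character with an underscore are all assumptions. The only grounded part is the set of characters that carry special meaning in Lucene's classic query syntax.

```java
class LuceneIdCleanerSketch {
    // Characters with special meaning in Lucene's classic query syntax.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    // Hypothetical helper: replaces special characters and whitespace with '_'.
    // The replacement strategy is an assumption, not TML's documented behaviour.
    static String cleanId(String id) {
        StringBuilder sb = new StringBuilder(id.length());
        for (char c : id.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('_');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A file name containing whitespace, parentheses and a colon.
        System.out.println(cleanId("my file (v1):draft.txt"));
    }
}
```

An id cleaned this way can be embedded in a Lucene query without being interpreted as query operators.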

cleanStorage

public static void cleanStorage(java.lang.String indexPath)
                         throws org.apache.lucene.index.CorruptIndexException,
                                org.apache.lucene.store.LockObtainFailedException,
                                java.io.IOException,
                                java.sql.SQLException
Deletes all the files of the Repository.

Parameters:
indexPath - The path to the folder where the LuceneIndex files are stored
Throws:
java.io.IOException
org.apache.lucene.store.LockObtainFailedException
org.apache.lucene.index.CorruptIndexException
java.sql.SQLException

cleanWord

public static java.lang.String cleanWord(java.lang.String word)
This method is necessary due to problems processing UTF-8 encoded text that was pasted from Word. Single and double quotation marks usually arrive as typographic characters that differ from the plain quotation characters, which makes them impossible for the parsers to detect.

Parameters:
word -
Returns:
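
The problem described above can be sketched as follows. This is only an illustration under assumptions: the class and method names are hypothetical, and the four characters handled here are the common typographic quotation marks, which may not match the exact set that cleanWord treats.

```java
class SmartQuoteCleanerSketch {
    // Hypothetical helper: maps typographic quotation marks to plain ASCII ones.
    static String normalizeQuotes(String word) {
        return word
            .replace('\u2018', '\'')  // left single quotation mark
            .replace('\u2019', '\'')  // right single quotation mark
            .replace('\u201C', '"')   // left double quotation mark
            .replace('\u201D', '"');  // right double quotation mark
    }

    public static void main(String[] args) {
        // The input is wrapped in typographic double quotes.
        System.out.println(normalizeQuotes("\u201Cquoted\u201D")); // prints "quoted"
    }
}
```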

getFileContent

public static java.lang.String getFileContent(java.io.File file,
                                              java.lang.String charset)
                                       throws java.io.UnsupportedEncodingException,
                                              java.io.FileNotFoundException,
                                              java.io.IOException
Obtains the content of a text file. It reads the file line by line and writes only a \n for each newline, which removes any \r and makes further processing easier.

Parameters:
file - an absolute path to the file
charset - the charset used (default is UTF-8)
Returns:
the text content of the file
Throws:
java.io.UnsupportedEncodingException
java.io.FileNotFoundException
java.io.IOException
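
The line-by-line reading described above can be sketched with the standard library alone. The class and method names are hypothetical, and appending a \n after the last line even when the file lacks one is an assumption about the behaviour, not something this page states.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class FileContentSketch {
    // Hypothetical helper: reads a text file and normalises line endings to \n.
    static String readNormalized(File file, String charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            String line;
            // readLine() strips \n, \r and \r\n terminators alike.
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("sketch", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "first\r\nsecond\r\n".getBytes(StandardCharsets.UTF_8));
        String content = readNormalized(tmp, "UTF-8");
        System.out.println(content.contains("\r")); // prints false
    }
}
```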

getTmpPath

public java.lang.String getTmpPath()

getProcessedPath

public java.lang.String getProcessedPath()

getExecPath

public java.lang.String getExecPath()

setExecPath

public void setExecPath(java.lang.String execPath)

getDbConnection

public DbConnection getDbConnection()

getLuceneParentDocumentField

public java.lang.String getLuceneParentDocumentField()
Returns:
the luceneParentDocumentField

getAllDocuments

public java.lang.String[][] getAllDocuments()

addAnnotator

public void addAnnotator(Annotator annotator)
Adds an annotator to the repository

Parameters:
annotator - the annotator

addRepositoryListener

public void addRepositoryListener(RepositoryListener l)
Adds a listener so that the Repository can asynchronously report the state of the processing

Parameters:
l - the listener to add
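
As a rough illustration of asynchronous reporting through listeners, consider the sketch below. The methods of the real RepositoryListener interface are not shown on this page, so the ProgressListener interface, its progress method, and every other name here are hypothetical stand-ins, not TML's API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

class ListenerSketch {
    // Hypothetical callback; NOT the real RepositoryListener interface.
    interface ProgressListener {
        void progress(int done, int total);
    }

    // Thread-safe list so listeners can be added or removed while work runs.
    private final List<ProgressListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(ProgressListener l) { listeners.add(l); }

    void removeListener(ProgressListener l) { listeners.remove(l); }

    // Runs the (simulated) work on a background thread and notifies
    // every registered listener after each step.
    Thread process(int total) {
        Thread worker = new Thread(() -> {
            for (int i = 1; i <= total; i++) {
                for (ProgressListener l : listeners) {
                    l.progress(i, total);
                }
            }
        });
        worker.start();
        return worker;
    }

    public static void main(String[] args) throws InterruptedException {
        ListenerSketch sketch = new ListenerSketch();
        AtomicInteger calls = new AtomicInteger();
        sketch.addListener((done, total) -> calls.incrementAndGet());
        sketch.process(3).join(); // wait for the background work to finish
        System.out.println(calls.get()); // prints 3
    }
}
```

The caller keeps control while the work runs and only blocks if it chooses to join the returned thread.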

removeRepositoryListener

public void removeRepositoryListener(RepositoryListener l)
Removes a previously added listener, if it exists

Parameters:
l - the listener to remove

addDocument

public void addDocument(java.lang.String externalId,
                        java.lang.String content,
                        java.lang.String title,
                        java.lang.String url,
                        Importer importer)
                 throws java.io.IOException,
                        java.sql.SQLException
Adds a new document to the repository

Parameters:
externalId - an external id to identify the document
content - the content of the document
title - the title of the document
url - a url to find the document (optional)
importer - an importer (how to decode the content)
Throws:
java.io.IOException
java.sql.SQLException

addDocumentsInFolder

public void addDocumentsInFolder(java.lang.String folder)
                          throws java.io.IOException
Adds all the files in a folder to the Lucene index. It can only process .txt files.

Parameters:
folder - an absolute path to the folder that contains the files
Throws:
java.io.IOException

addDocumentsInFolder

public void addDocumentsInFolder(java.lang.String folder,
                                 int maxDocs)
                          throws java.io.IOException
Adds the files in a folder to the Lucene index, up to a maximum number. It can only process .txt files.

Parameters:
folder - an absolute path to the folder that contains the files
maxDocs - the maximum number of documents to index
Throws:
java.io.IOException

addDocumentsInList

public void addDocumentsInList(java.io.File[] fileList)
                        throws org.apache.lucene.index.CorruptIndexException,
                               java.io.IOException
Adds all the files in the list to the repository. It filters by extension and only loads files ending with ".txt". It also ignores files whose names start with a dot (".").

Parameters:
fileList -
Throws:
org.apache.lucene.index.CorruptIndexException
java.io.IOException
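
The filtering rule described above can be sketched as follows. The class and method names are hypothetical, and treating the ".txt" check as case-sensitive is an assumption, since this page does not say how the extension is compared.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class TxtFileFilterSketch {
    // Hypothetical helper: keeps files ending with ".txt" whose names
    // do not start with a dot. The comparison is case-sensitive (assumption).
    static List<File> selectLoadable(File[] fileList) {
        List<File> selected = new ArrayList<>();
        for (File f : fileList) {
            String name = f.getName();
            if (name.endsWith(".txt") && !name.startsWith(".")) {
                selected.add(f);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        File[] candidates = {
            new File("notes.txt"),    // kept
            new File(".hidden.txt"),  // skipped: starts with a dot
            new File("paper.pdf")     // skipped: wrong extension
        };
        System.out.println(selectLoadable(candidates).size()); // prints 1
    }
}
```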

annotateDocuments

public java.lang.Thread annotateDocuments()

deleteTextDocument

public void deleteTextDocument(TextDocument document)
                        throws java.io.IOException
Deletes a document from the repository. A TextDocument object must be passed, so the document must first be obtained from the repository.

Parameters:
document -
Throws:
java.io.IOException

getAllTextDocuments

public java.util.List<TextDocument> getAllTextDocuments()
                                                 throws java.lang.Exception
Returns a list with all the documents in the repository in TextDocument form

Returns:
a list of TextDocument
Throws:
java.lang.Exception

getAnalyzer

public org.apache.lucene.analysis.Analyzer getAnalyzer()
Gets the Lucene analyzer that the Repository is using

Returns:
the Analyzer

getAnnotators

public java.util.List<Annotator> getAnnotators()
Returns:
the annotators available for this repository

getDocumentField

public java.lang.String getDocumentField(java.lang.String externalId,
                                         java.lang.String fieldname)
                                  throws java.io.IOException
Gets the content of a field for a document, using its external id.

Parameters:
externalId - the id of the document
fieldname - the name of the field to retrieve
Returns:
the content of the field
Throws:
java.io.IOException

getEncoding

public java.lang.String getEncoding()
Returns:
the encoding used by TML

getIndexPath

public java.lang.String getIndexPath()
Returns:
the path to the Lucene index

getIndexReader

public org.apache.lucene.index.IndexReader getIndexReader()
                                                   throws java.io.IOException
Obtains an IndexReader of the Lucene index

Returns:
the IndexReader
Throws:
java.io.IOException

getIndexSearcher

public org.apache.lucene.search.IndexSearcher getIndexSearcher()
                                                        throws java.io.IOException
Obtains an IndexSearcher for the Lucene index

Returns:
the IndexSearcher
Throws:
java.io.IOException

getLocale

public java.util.Locale getLocale()
Returns:
the Locale being used by TML

getLuceneContentField

public java.lang.String getLuceneContentField()
Gets the name of the field used by the underlying Lucene index for the content

Returns:
the name of the content field

getLuceneExternalIdField

public java.lang.String getLuceneExternalIdField()
Gets the name of the field used by the underlying Lucene index for the external id

Returns:
the name of the external id field

getLuceneParentField

public java.lang.String getLuceneParentField()
Gets the name of the field used by the underlying Lucene index for the parent

Returns:
the name of the parent field

getLucenePenntreeField

public java.lang.String getLucenePenntreeField()
Returns:
the name of the field used to store the Penn Treebank string

getLuceneTitleField

public java.lang.String getLuceneTitleField()
Gets the name of the field used by the underlying Lucene index for the title

Returns:
the name of the title field

getLuceneTypeField

public java.lang.String getLuceneTypeField()
Returns:
the name of the field that stores the type of the Lucene document (document, paragraph or sentence)

getLuceneUrlField

public java.lang.String getLuceneUrlField()
Gets the name of the field used by the underlying Lucene index for the url

Returns:
the name of the url field

getMaxDocumentsToIndex

public int getMaxDocumentsToIndex()
Returns:
the maxDocumentsToIndex

getParser

public Importer getParser()
Gets the Importer used to transform the content before inserting into the Repository

Returns:
the Importer being used by TML

getStopwords

public java.lang.String[] getStopwords()
Returns:
the list of stopwords used to analyse and parse documents

getSvdStoragePath

public java.lang.String getSvdStoragePath()
Returns:
the svdStoragePath

getTextDocument

public TextDocument getTextDocument(java.lang.String externalId)
                             throws java.io.IOException
Gets a document from the repository by its external id. Returns a TextDocument object with basic information about the document, such as its title and url. To perform operations on the document, it must first be loaded, which means that a Corpus and its inner SemanticSpace will be created.

Parameters:
externalId - the id of the document
Returns:
a TextDocument
Throws:
java.io.IOException

isBibliographyTitle

public boolean isBibliographyTitle(java.lang.String sentence)
Add reference

Parameters:
sentence - the sentence to evaluate
Returns:
true if the sentence corresponds to the title of the references section

removeAnnotator

public void removeAnnotator(Annotator annotator)
Removes an annotator from the repository

Parameters:
annotator - the annotator

setEncoding

public void setEncoding(java.lang.String encoding)
Sets the character encoding that will be used in this repository

Parameters:
encoding -

setMaxDocumentsToIndex

public void setMaxDocumentsToIndex(int maxDocumentsToIndex)
Parameters:
maxDocumentsToIndex - the maxDocumentsToIndex to set

getAnnotations

public java.lang.String getAnnotations(org.apache.lucene.document.Document luceneDocument,
                                       java.lang.String documentId,
                                       java.lang.String fieldName)

cleanup

public java.lang.Thread cleanup()