java.lang.Object tml.storage.Repository
public class Repository
This class represents a document repository. Documents can be inserted into,
deleted from and searched in a Repository. All documents that were successfully
inserted in a repository can later be used to create a Corpus
and perform operations on them.
At the heart of a repository lies a TextDocument, which represents a
text document and is accessible using any id of your choice (e.g. from a
database, or from the filesystem). The content of a new document is expected
to be just plain text. Importers for other formats will be provided in
time; for the moment there is only a Wiki cleaner.
Once inserted in the Repository, all documents can be searched using the searchTextDocuments method. Queries use the syntax of Apache Lucene.
Code examples
Initialising a Repository:
    // luceneIndexPath is documented as an absolute path
    Repository repository = new Repository("/path/to/repository/folder");
Obtaining all the documents in a Repository:
    ...
    List<TextDocument> documents = repository.getAllTextDocuments();
    for (TextDocument doc : documents) {
        System.out.println("Document:" + doc.getTitle());
    }
    ...
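Inserting a document:
    String content = "The content of my document";
    String title = "A title";
    String url = "http://www/mydoc.txt";
    String id = "TheIdOfMyDoc";
    // addDocument takes an Importer as its fifth argument; here the repository's current one is reused
    repository.addDocument(id, content, title, url, repository.getParser());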
Obtaining a document from the repository:
    String id = "TheIdOfMyDoc";
    TextDocument doc = repository.getTextDocument(id);
Removing a document from the repository:
    TextDocument doc = repository.getTextDocument("someId");
    repository.deleteTextDocument(doc);
Searching for documents containing "foo":
    String query = "foo";
    List<TextDocument> documents = repository.searchTextDocuments(query);
    for (TextDocument doc : documents) {
        System.out.println("Document found:" + doc.getTitle());
    }
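Because queries follow Lucene's query syntax, characters such as + - ( ) : " have a special meaning inside a query string. The following is a minimal sketch, in plain Java and independent of TML, of how a query string might be assembled before it is passed to the search method. The class name LuceneQueryStrings, its helper methods and the field name "title" are assumptions made for illustration only; the real field names are the ones returned by the getLucene...Field() methods.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of TML: builds query strings in Lucene's classic syntax.
final class LuceneQueryStrings {
    // Characters that Lucene's classic query syntax treats as operators.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Prefixes every special character with a backslash so it is matched literally.
    static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Joins the escaped terms with AND, so that every term must be present.
    static String allOf(List<String> terms) {
        return terms.stream()
                .map(LuceneQueryStrings::escape)
                .collect(Collectors.joining(" AND "));
    }

    // Wraps the text in double quotes to form a phrase query.
    static String phrase(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Restricts a clause to one field, e.g. title:foo.
    static String inField(String field, String clause) {
        return field + ":" + clause;
    }
}
```

A string built this way, for example LuceneQueryStrings.inField("title", LuceneQueryStrings.phrase("A title")), would then be used in place of the literal "foo" in the search example.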
See Also: TextDocument, Corpus
Constructor Summary | |
---|---|
Repository() | |
Repository(java.lang.String luceneIndexPath) | Creates a new instance of the class Repository using a Standard Analyzer without stop words removal. |
Repository(java.lang.String luceneIndexPath, java.util.Locale locale) | |
Method Summary | | |
---|---|---|
void | addAnnotator(Annotator annotator) | Adds an annotator to the repository. |
void | addDocument(java.lang.String externalId, java.lang.String content, java.lang.String title, java.lang.String url, Importer importer) | Adds a new document to the repository. |
void | addDocumentsInFolder(java.lang.String folder) | Adds all the files in a folder to the Lucene Index. |
void | addDocumentsInFolder(java.lang.String folder, int maxDocs) | Adds all the files in a folder to the Lucene Index. |
void | addDocumentsInList(java.io.File[] fileList) | Adds all the files in the list to the repository. |
void | addRepositoryListener(RepositoryListener l) | Adds a listener so that the Repository can asynchronously report the state of the processing. |
java.lang.Thread | annotateDocuments() | |
static java.lang.String | cleanIdForLucene(java.lang.String id) | Cleans an id (typically a file name) to suit the syntax of Lucene. |
static void | cleanStorage(java.lang.String indexPath) | Deletes all the files of the Repository. |
java.lang.Thread | cleanup() | |
static java.lang.String | cleanWord(java.lang.String word) | This method is necessary due to problems processing UTF-8 encoded text that was pasted from Word. |
void | deleteTextDocument(TextDocument document) | Deletes a document from the repository. |
java.lang.String[][] | getAllDocuments() | |
java.util.List<TextDocument> | getAllTextDocuments() | Returns a list with all the documents in the repository in TextDocument form. |
org.apache.lucene.analysis.Analyzer | getAnalyzer() | Gets the Lucene analyzer that the Repository is using. |
java.lang.String | getAnnotations(org.apache.lucene.document.Document luceneDocument, java.lang.String documentId, java.lang.String fieldName) | |
java.util.List<Annotator> | getAnnotators() | |
DbConnection | getDbConnection() | |
java.lang.String | getDocumentField(java.lang.String externalId, java.lang.String fieldname) | Gets the content of a field for a document, using its external id. |
java.lang.String | getEncoding() | |
java.lang.String | getExecPath() | |
static java.lang.String | getFileContent(java.io.File file, java.lang.String charset) | Obtains the content of a text file. |
java.lang.String | getIndexPath() | |
org.apache.lucene.index.IndexReader | getIndexReader() | Obtains an IndexReader of the Lucene index. |
org.apache.lucene.search.IndexSearcher | getIndexSearcher() | Obtains an IndexSearcher for the Lucene index. |
java.util.Locale | getLocale() | |
java.lang.String | getLuceneContentField() | Gets the name of the field used by the underlying Lucene index for the content. |
java.lang.String | getLuceneExternalIdField() | Gets the name of the field used by the underlying Lucene index for the external id. |
java.lang.String | getLuceneParentDocumentField() | |
java.lang.String | getLuceneParentField() | Gets the name of the field used by the underlying Lucene index for the parent. |
java.lang.String | getLucenePenntreeField() | |
java.lang.String | getLuceneTitleField() | Gets the name of the field used by the underlying Lucene index for the title. |
java.lang.String | getLuceneTypeField() | |
java.lang.String | getLuceneUrlField() | Gets the name of the field used by the underlying Lucene index for the url. |
int | getMaxDocumentsToIndex() | |
Importer | getParser() | Gets the Importer used to transform the content before inserting into the Repository. |
java.lang.String | getProcessedPath() | |
java.lang.String[] | getStopwords() | |
java.lang.String | getSvdStoragePath() | |
TextDocument | getTextDocument(java.lang.String externalId) | Gets a document from the repository by its external id. |
java.lang.String | getTmpPath() | |
boolean | isBibliographyTitle(java.lang.String sentence) | Add reference |
void | removeAnnotator(Annotator annotator) | Removes an annotator from the repository. |
void | removeRepositoryListener(RepositoryListener l) | Removes a previously added listener, if it exists. |
void | setEncoding(java.lang.String encoding) | Sets the character encoding that will be used in this repository. |
void | setExecPath(java.lang.String execPath) | |
void | setMaxDocumentsToIndex(int maxDocumentsToIndex) | |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public Repository() throws java.io.IOException, java.sql.SQLException
Throws:
java.io.IOException
java.sql.SQLException
public Repository(java.lang.String luceneIndexPath) throws java.io.IOException, java.sql.SQLException
Creates a new instance of the class Repository using a Standard Analyzer without stop words removal.
Parameters:
luceneIndexPath - an absolute path to the folder that stores the Lucene Index
Throws:
java.io.IOException
java.sql.SQLException
public Repository(java.lang.String luceneIndexPath, java.util.Locale locale) throws java.io.IOException, java.sql.SQLException
Parameters:
luceneIndexPath -
locale -
Throws:
java.io.IOException
java.sql.SQLException
Method Detail |
---|
public static java.lang.String cleanIdForLucene(java.lang.String id)
Cleans an id (typically a file name) to suit the syntax of Lucene.
Parameters:
id - the external id of a document
public static void cleanStorage(java.lang.String indexPath) throws org.apache.lucene.index.CorruptIndexException, org.apache.lucene.store.LockObtainFailedException, java.io.IOException, java.sql.SQLException
Deletes all the files of the Repository.
Parameters:
indexPath - the path to the folder where the Lucene Index files are stored
Throws:
java.io.IOException
org.apache.lucene.store.LockObtainFailedException
org.apache.lucene.index.CorruptIndexException
java.sql.SQLException
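public static java.lang.String cleanWord(java.lang.String word)
This method is necessary due to problems processing UTF-8 encoded text that was pasted from Word.
Parameters:
word -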
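The characters that cleanWord handles are not listed in this documentation. As an illustration of the kind of problem it refers to, the sketch below replaces a few typographic characters that text pasted from a word processor commonly contains with plain ASCII equivalents. The class name WordPasteSketch and the chosen set of replacements are assumptions and are not taken from TML.

```java
// Hypothetical sketch, not TML's cleanWord: maps common typographic characters to ASCII.
final class WordPasteSketch {
    static String normalize(String text) {
        return text
                .replace('\u2018', '\'')    // left single quotation mark
                .replace('\u2019', '\'')    // right single quotation mark
                .replace('\u201C', '"')     // left double quotation mark
                .replace('\u201D', '"')     // right double quotation mark
                .replace('\u2013', '-')     // en dash
                .replace('\u2014', '-')     // em dash
                .replace("\u2026", "...")   // horizontal ellipsis
                .replace('\u00A0', ' ');    // no-break space
    }
}
```

Normalising such characters before indexing keeps a token like It's identical whether it was typed in a plain-text editor or pasted from a word processor.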
public static java.lang.String getFileContent(java.io.File file, java.lang.String charset) throws java.io.UnsupportedEncodingException, java.io.FileNotFoundException, java.io.IOException
Obtains the content of a text file.
Parameters:
file - an absolute path to the file
charset - the charset used (default is UTF-8)
Throws:
java.io.UnsupportedEncodingException
java.io.FileNotFoundException
java.io.IOException
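Reading a text file with an explicit charset can be sketched with the standard library alone. The example below is not TML's implementation: the class FileContentSketch and its methods are hypothetical, a null charset name is assumed to mean UTF-8, and checked exceptions are wrapped in UncheckedIOException, whereas getFileContent declares them.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical sketch of reading a whole text file with a given charset.
final class FileContentSketch {
    // Assumption: a null charset name falls back to UTF-8.
    private static Charset resolve(String charsetName) {
        return charsetName == null ? StandardCharsets.UTF_8 : Charset.forName(charsetName);
    }

    // Reads all bytes of the file and decodes them with the resolved charset.
    static String read(File file, String charsetName) {
        try {
            return new String(Files.readAllBytes(file.toPath()), resolve(charsetName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the text to a temporary file and reads it back with the same charset.
    static String roundTrip(String text, String charsetName) {
        try {
            File tmp = File.createTempFile("file-content-sketch", ".txt");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), text.getBytes(resolve(charsetName)));
            return read(tmp, charsetName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The text is only recovered unchanged when the charset used for reading matches the one the file was written with, which is why the charset is passed explicitly.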
public java.lang.String getTmpPath()
public java.lang.String getProcessedPath()
public java.lang.String getExecPath()
public void setExecPath(java.lang.String execPath)
public DbConnection getDbConnection()
public java.lang.String getLuceneParentDocumentField()
public java.lang.String[][] getAllDocuments()
public void addAnnotator(Annotator annotator)
Adds an annotator to the repository.
Parameters:
annotator - the annotator
public void addRepositoryListener(RepositoryListener l)
Adds a listener so that the Repository can asynchronously report the state of the processing.
Parameters:
l - the listener to add
public void removeRepositoryListener(RepositoryListener l)
Removes a previously added listener, if it exists.
Parameters:
l - the listener to remove
public void addDocument(java.lang.String externalId, java.lang.String content, java.lang.String title, java.lang.String url, Importer importer) throws java.io.IOException, java.sql.SQLException
Adds a new document to the repository.
Parameters:
externalId - an external id to identify the document
content - the content of the document
title - the title of the document
url - a url to find the document (optional)
importer - an importer (how to decode the content)
Throws:
java.io.IOException
java.sql.SQLException
public void addDocumentsInFolder(java.lang.String folder) throws java.io.IOException
Adds all the files in a folder to the Lucene Index.
Parameters:
folder - an absolute path to the folder that contains the files
Throws:
java.io.IOException
public void addDocumentsInFolder(java.lang.String folder, int maxDocs) throws java.io.IOException
Adds all the files in a folder to the Lucene Index.
Parameters:
folder - an absolute path to the folder that contains the files
maxDocs - the maximum number of documents to index
Throws:
java.io.IOException
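How maxDocs selects files when the folder holds more than that number is not specified here. The sketch below shows one possible way to cap a file listing deterministically; the class FolderSelectionSketch and the choice to sort by path before truncating are assumptions, not TML behaviour.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: keep at most maxDocs files from a listing, in path order.
final class FolderSelectionSketch {
    static List<File> select(File[] candidates, int maxDocs) {
        File[] sorted = candidates.clone();
        Arrays.sort(sorted); // File is Comparable and orders by pathname
        List<File> selected = new ArrayList<>();
        for (File f : sorted) {
            if (selected.size() >= maxDocs) {
                break; // the cap has been reached
            }
            selected.add(f);
        }
        return selected;
    }
}
```

A caller could obtain the candidates with new File(folder).listFiles() and pass the selected files, converted to an array, to addDocumentsInList(java.io.File[]).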
public void addDocumentsInList(java.io.File[] fileList) throws org.apache.lucene.index.CorruptIndexException, java.io.IOException
Adds all the files in the list to the repository.
Parameters:
fileList -
Throws:
org.apache.lucene.index.CorruptIndexException
java.io.IOException
public java.lang.Thread annotateDocuments()
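annotateDocuments() and cleanup() both return a java.lang.Thread. This documentation does not say whether the returned thread has already been started, so the sketch below only illustrates the general pattern under the assumption that it is running: a method starts background work and hands back the Thread, and the caller waits for it with join. BackgroundTaskSketch and its methods are hypothetical names.

```java
// Hypothetical sketch of the "start work, return the Thread" pattern.
final class BackgroundTaskSketch {
    // Starts the work on a new thread and returns that thread to the caller.
    static Thread startTask(Runnable work) {
        Thread t = new Thread(work, "background-task-sketch");
        t.start();
        return t;
    }

    // Waits up to timeoutMillis for the thread to finish; returns true if it did.
    static boolean await(Thread t, long timeoutMillis) {
        try {
            t.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
        return !t.isAlive();
    }
}
```

Under that assumption, a caller would write something like Thread t = repository.annotateDocuments(); followed by a join on t before reading the annotations.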
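public void deleteTextDocument(TextDocument document) throws java.io.IOException
Deletes a document from the repository.
Parameters:
document -
Throws:
java.io.IOException
public java.util.List<TextDocument> getAllTextDocuments() throws java.lang.Exception
Returns a list with all the documents in the repository in TextDocument form.
Returns:
a list of TextDocument
Throws:
java.lang.Exception
public org.apache.lucene.analysis.Analyzer getAnalyzer()
Gets the Lucene analyzer that the Repository is using.
Returns:
the Analyzer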
public java.util.List<Annotator> getAnnotators()
public java.lang.String getDocumentField(java.lang.String externalId, java.lang.String fieldname) throws java.io.IOException
Gets the content of a field for a document, using its external id.
Parameters:
externalId - the id of the document
fieldname - the name of the field to retrieve
Throws:
java.io.IOException
public java.lang.String getEncoding()
public java.lang.String getIndexPath()
public org.apache.lucene.index.IndexReader getIndexReader() throws java.io.IOException
Obtains an IndexReader of the Lucene index.
Throws:
java.io.IOException
public org.apache.lucene.search.IndexSearcher getIndexSearcher() throws java.io.IOException
Obtains an IndexSearcher for the Lucene index.
Throws:
java.io.IOException
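public java.util.Locale getLocale()
Returns:
the Locale being used by TML
public java.lang.String getLuceneContentField()
Gets the name of the field used by the underlying Lucene index for the content.
public java.lang.String getLuceneExternalIdField()
Gets the name of the field used by the underlying Lucene index for the external id.
public java.lang.String getLuceneParentField()
Gets the name of the field used by the underlying Lucene index for the parent.
public java.lang.String getLucenePenntreeField()
public java.lang.String getLuceneTitleField()
Gets the name of the field used by the underlying Lucene index for the title.
public java.lang.String getLuceneTypeField()
public java.lang.String getLuceneUrlField()
Gets the name of the field used by the underlying Lucene index for the url.
public int getMaxDocumentsToIndex()
public Importer getParser()
Gets the Importer used to transform the content before inserting into the Repository.
Returns:
the Importer being used by TML
public java.lang.String[] getStopwords()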
public java.lang.String getSvdStoragePath()
public TextDocument getTextDocument(java.lang.String externalId) throws java.io.IOException
Gets a document from the repository by its external id. The result is a TextDocument object with basic information about the document, like title and url. In order to perform operations on the document, it must be loaded, which means that a Corpus and its inner SemanticSpace will be created.
Parameters:
externalId - the id of the document
Returns:
the TextDocument
Throws:
java.io.IOException
public boolean isBibliographyTitle(java.lang.String sentence)
Parameters:
sentence - the sentence to evaluate
public void removeAnnotator(Annotator annotator)
Removes an annotator from the repository.
Parameters:
annotator - the annotator
public void setEncoding(java.lang.String encoding)
Sets the character encoding that will be used in this repository.
Parameters:
encoding -
public void setMaxDocumentsToIndex(int maxDocumentsToIndex)
Parameters:
maxDocumentsToIndex - the maxDocumentsToIndex to set
public java.lang.String getAnnotations(org.apache.lucene.document.Document luceneDocument, java.lang.String documentId, java.lang.String fieldName)
public java.lang.Thread cleanup()