TML - Text Mining Library for LSA (Latent Semantic Analysis)

TML is a text mining (TM) library for LSA written in Java, focused on ease of use, scalability and extensibility. TML is maintained by Jorge Villalón and is part of a development effort by the Learning and Affect Technologies Group at the University of Sydney.

Ease of use

TML aims to help developers write applications that use TM techniques without having to be experts in the area and with no licensing problems (TML is released under the Apache License v2.0). TML also aims to help researchers speed up their experiments by providing a platform they can trust (validated against academic papers), so they can focus on their new ideas.

Scalability

One of the biggest problems in TM is that many algorithms are computationally expensive. TML does not solve this problem; instead, it tackles scalability by decoupling the most expensive processes.

TML is integrated with Apache Lucene, a high-performance search engine, for fast document indexing and corpus definition (the documents you'll work on). Lucene scales to web-sized collections, and because TML defines a corpus as a set of search results, document selection is extremely fast.

TML has a parallel process that adds annotations on demand. For example, if you want to use Part-of-Speech (POS) tags, you can run the annotator offline, at a time when you know the server can afford it. This way TML always stays responsive, and uses the new data as it becomes available.

Finally, TML caches models (SVD and NMF decompositions) for faster execution.
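The caching itself is internal to TML, but the idea can be sketched in a few lines: an expensive decomposition is computed once per corpus configuration and then reused. The following is only an illustration of that pattern, not TML's actual code; the cache key and the placeholder "model" are stand-ins.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ModelCacheSketch {

    // Cache keyed by a corpus/parameter fingerprint; values are the expensive model.
    private static final Map<String, double[][]> CACHE = new ConcurrentHashMap<>();

    // Stand-in for an expensive decomposition such as SVD or NMF.
    static double[][] decompose(String corpusKey) {
        return CACHE.computeIfAbsent(corpusKey, key -> {
            System.out.println("Computing model for " + key);
            return new double[][] { { 1.0, 0.0 }, { 0.0, 1.0 } }; // placeholder result
        });
    }

    public static void main(String[] args) {
        double[][] first = decompose("corpus:type:document");
        double[][] second = decompose("corpus:type:document"); // served from cache
        System.out.println(first == second); // same cached instance, computed only once
    }
}
```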

Extensibility

TML is designed so it can be easily extended at every step of the process.

Implemented operations

TML already implements several operations:

It is able to create semantic spaces from a corpus of documents, and use that space as background knowledge to calculate semantic distances within the same corpus or on a different one. TML processes all documents at three levels: document, paragraph and sentence. This means that corpora can be created using whole documents, their parts, or a combination of both.
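In an LSA semantic space each passage becomes a vector, and semantic distance is typically measured as the cosine of the angle between two passage vectors. A self-contained sketch of that computation in plain Java (this is the underlying idea, not TML's API):

```java
public class CosineSimilaritySketch {

    // Cosine similarity between two passage vectors in a semantic space.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] p1 = { 1.0, 2.0, 0.0 };
        double[] p2 = { 2.0, 4.0, 0.0 }; // same direction as p1: similarity 1.0
        double[] p3 = { 0.0, 0.0, 3.0 }; // orthogonal to p1: similarity 0.0
        System.out.println(cosine(p1, p2)); // 1.0 (within rounding)
        System.out.println(cosine(p1, p3)); // 0.0
    }
}
```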

TML is built on top of Lucene, so any search can be used to create a corpus. In other words, you can build a corpus with all the sentences of all the documents that contain the word dog.

TML also uses grammatical information from the Stanford parser at the sentence level, so each sentence stores its own PennTree string. This allows the grammatical tree to be reconstructed quickly in order to perform grammatical operations.
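Reconstructing a tree from a bracketed Penn-style string is a simple recursive parse. The sketch below is self-contained and not TML's implementation; it only shows why the reconstruction is fast (a single pass over the string):

```java
import java.util.ArrayList;
import java.util.List;

public class PennTreeSketch {

    static class Node {
        String label;
        List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    // Parses a bracketed Penn-style string such as
    // "(S (NP (DT the) (NN dog)) (VP (VBZ barks)))" in one left-to-right pass.
    static Node parse(String s, int[] pos) {
        pos[0]++; // skip '('
        StringBuilder label = new StringBuilder();
        while (s.charAt(pos[0]) != ' ' && s.charAt(pos[0]) != ')')
            label.append(s.charAt(pos[0]++));
        Node node = new Node(label.toString());
        while (s.charAt(pos[0]) != ')') {
            if (s.charAt(pos[0]) == ' ') { pos[0]++; continue; }
            if (s.charAt(pos[0]) == '(') {
                node.children.add(parse(s, pos)); // nested constituent
            } else { // leaf token (a word)
                StringBuilder word = new StringBuilder();
                while (s.charAt(pos[0]) != ')' && s.charAt(pos[0]) != ' ')
                    word.append(s.charAt(pos[0]++));
                node.children.add(new Node(word.toString()));
            }
        }
        pos[0]++; // skip ')'
        return node;
    }

    public static void main(String[] args) {
        Node root = parse("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))", new int[] { 0 });
        System.out.println(root.label);                 // S
        System.out.println(root.children.size());       // 2
        System.out.println(root.children.get(0).label); // NP
    }
}
```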

Download

You can download the latest version of TML here.

Quick start guide

The easiest way to use TML is as a command-line tool.

Using TML from the command line

You can execute TML from the command line with the following command:

java -jar tml-xxx.jar

Adding documents to a repository

java -jar tml-xxx.jar -I -repo /path/to/repository --idocs /path/to/txt/files

Executing operations on a corpus

java -jar tml-xxx.jar -O -repo /path/to/repository --ocorpus type:document --odim NUM --odimth 2 --operations PassagesSimilarity

For a full list of the available operations, check the package tml.vectorspace.operations in the API docs.

Using TML from a Java program

To use TML from another Java program you have to include TML in your classpath. You can use the provided tml-xxx-core.jar, which does not include dependencies, to avoid conflicting jars and save disk space.

Simple program that adds documents to a repository:

import tml.storage.*;

public class AddingFilesToRepository {

    public static void main(String[] args) throws Exception {
        Repository repository = new Repository("path/to/repository");

        repository.addDocumentsInFolder("path/to/txt/files");
        
        System.out.println("Documents added to repository successfully!");
    }
}

Simple program that runs an operation with the documents in the repository:

import tml.vectorspace.TermWeighting.GlobalWeight;
import tml.vectorspace.TermWeighting.LocalWeight;
import tml.vectorspace.operations.PassagesSimilarity;
import tml.annotators.PennTreeAnnotator;
import tml.corpus.SearchResultsCorpus;
import tml.corpus.CorpusParameters.DimensionalityReduction;
import tml.corpus.CorpusParameters.TermSelection;
import tml.storage.Repository;

public class PerformingOperationOnCorpus {

    public static void main(String[] args) throws Exception {
        Repository repository = new Repository("path/to/repository");

        SearchResultsCorpus corpus = new SearchResultsCorpus("type:document");
        corpus.getParameters().setTermSelectionCriterion(TermSelection.DF);
        corpus.getParameters().setTermSelectionThreshold(0);
        corpus.getParameters().setDimensionalityReduction(DimensionalityReduction.NUM);
        corpus.getParameters().setDimensionalityReductionThreshold(50);
        corpus.getParameters().setTermWeightGlobal(GlobalWeight.Entropy);
        corpus.getParameters().setTermWeightLocal(LocalWeight.LOGTF);
        corpus.load(repository);
        
        System.out.println("Corpus loaded and Semantic space calculated");
        System.out.println("Total documents: " + corpus.getPassages().length);

        PassagesSimilarity distances = new PassagesSimilarity();
        distances.setCorpus(corpus);
        distances.start();

        distances.printResults();
    }
}
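The weighting scheme configured above (LocalWeight.LOGTF with GlobalWeight.Entropy) corresponds to the standard log-entropy weighting used in LSA: the local weight is log(1 + tf), and the global weight is 1 plus the normalized entropy of each term's distribution over documents, so terms spread evenly across the corpus are down-weighted. The following is a self-contained illustration of those formulas, not TML's code:

```java
public class LogEntropyWeightingSketch {

    // w[i][j] = local(tf_ij) * global(i), with
    //   local(tf)  = log(1 + tf)
    //   global(i)  = 1 + sum_j p_ij * log(p_ij) / log(nDocs)
    //   p_ij       = tf_ij / gf_i  (term i's count in doc j over its total count)
    static double[][] logEntropy(int[][] tf) {
        int terms = tf.length, docs = tf[0].length;
        double[][] w = new double[terms][docs];
        for (int i = 0; i < terms; i++) {
            double gf = 0;
            for (int j = 0; j < docs; j++) gf += tf[i][j];
            double entropy = 0;
            for (int j = 0; j < docs; j++) {
                if (tf[i][j] > 0) {
                    double p = tf[i][j] / gf;
                    entropy += p * Math.log(p) / Math.log(docs);
                }
            }
            double global = 1 + entropy; // entropy term is <= 0, so global is in [0, 1]
            for (int j = 0; j < docs; j++)
                w[i][j] = global * Math.log(1 + tf[i][j]);
        }
        return w;
    }

    public static void main(String[] args) {
        // Term 0 appears evenly in both documents: global weight 0 (uninformative).
        // Term 1 is concentrated in one document: global weight 1 (fully kept).
        int[][] tf = { { 1, 1 }, { 2, 0 } };
        double[][] w = logEntropy(tf);
        System.out.println(w[0][0]); // 0.0: evenly spread term carries no information
        System.out.println(w[1][0]); // log(3): concentrated term keeps its full local weight
    }
}
```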