TML is a text mining (TM) library for latent semantic analysis (LSA), written in Java and focused on ease of use, scalability and extensibility. TML is maintained by Jorge Villalón and is part of a development effort by the Learning and Affect Technologies Group at the University of Sydney.
TML aims to help developers write applications that use TM techniques without having to be experts in the area and without licensing problems (TML is released under the Apache License 2.0). TML also aims to help researchers speed up their experiments by providing a platform they can trust (validated using academic papers), so they can focus on their new ideas.
One of the biggest problems in TM is that many algorithms are computationally expensive. TML doesn't solve this problem; however, it tackles scalability by decoupling the most complicated processes.
TML is integrated with Apache Lucene, a high-performance search engine, for high-speed document indexing and corpus definition (the corpus being the set of documents you'll work on). Lucene scales to very large document collections, and TML defines a corpus as a set of search results, so document selection is very fast.
TML has a parallel process that adds annotations on demand. For example, if you want to use part-of-speech (POS) tags, you can run the annotator offline, and only when you know the server can handle the load. In this way TML will always respond, and will use new data as it becomes available.
Finally, TML caches models (SVD and NMF decompositions) for faster execution.
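The caching idea can be illustrated with a minimal sketch. This is not TML's implementation: the class ModelCacheSketch, the key format "corpus-A:svd" and the placeholder array standing in for a decomposition are all hypothetical. The sketch only shows the general pattern of computing an expensive model once and serving later requests from memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical memoizing cache; an illustration only, not TML's actual code. */
public class ModelCacheSketch<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> expensiveBuilder;

    public ModelCacheSketch(Function<K, V> expensiveBuilder) {
        this.expensiveBuilder = expensiveBuilder;
    }

    /** Returns the cached model for the key, building it only on the first request. */
    public V get(K key) {
        return cache.computeIfAbsent(key, expensiveBuilder);
    }

    public static void main(String[] args) {
        int[] builds = {0};
        ModelCacheSketch<String, double[]> models = new ModelCacheSketch<>(key -> {
            builds[0]++;                      // stands in for an expensive SVD or NMF run
            return new double[] {1.0, 0.5};   // placeholder for the real decomposition
        });
        models.get("corpus-A:svd");           // computed here
        models.get("corpus-A:svd");           // served from the cache
        System.out.println("decompositions computed: " + builds[0]);
    }
}
```

A real cache would also need an invalidation rule for when the underlying corpus changes, which this sketch leaves out.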
TML can be easily extended in many ways at every step.
TML already implements several operations:
It can create semantic spaces from a corpus of documents and use such a space as background knowledge to calculate semantic distances within the same corpus or in a different one. TML processes all documents at three levels: document, paragraph and sentence. This means that corpora can be created from whole documents, their parts, or a combination of both.
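In LSA, passages are usually represented as vectors in the reduced semantic space and compared with cosine similarity. The sketch below shows that comparison on plain arrays. The class name CosineSketch and the example vectors are made up for illustration, and whether TML's own operations use exactly this measure is an assumption.

```java
/** Illustrative cosine similarity between two passage vectors; not TML's API. */
public class CosineSketch {

    /** Returns the cosine of the angle between a and b, or 0 if either vector has zero norm. */
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // similarity is undefined for a zero vector; 0 is a convention here
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical coordinates of two sentences in a 3-dimensional semantic space.
        double[] sentenceA = {0.8, 0.1, 0.3};
        double[] sentenceB = {0.7, 0.2, 0.4};
        System.out.println("similarity: " + cosine(sentenceA, sentenceB));
    }
}
```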
TML is built on top of Lucene, so it can use any search to create a corpus. For example, you can build a corpus with all the sentences of all the documents that contain the word dog.
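To make the idea of "a corpus as a set of search results" concrete, here is a small sketch in plain Java. It does not use Lucene or TML: it splits documents into sentences with the JDK's BreakIterator and keeps the sentences that contain a given word. The class SentenceCorpusSketch and its method names are hypothetical, and it reads the example above as selecting the sentences that contain the word.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustration only: selects sentences by a word query, without Lucene or TML. */
public class SentenceCorpusSketch {

    /** Splits a text into trimmed, non-empty sentences using the JDK's BreakIterator. */
    public static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    /** Builds a sentence-level "corpus": every sentence that contains the whole word. */
    public static List<String> sentencesContaining(List<String> documents, String word) {
        Pattern query = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        List<String> corpus = new ArrayList<>();
        for (String document : documents) {
            for (String sentence : sentences(document)) {
                if (query.matcher(sentence).find()) {
                    corpus.add(sentence);
                }
            }
        }
        return corpus;
    }

    public static void main(String[] args) {
        List<String> documents = List.of(
                "The dog barked. The cat slept.",
                "My dog is old. It rains a lot.");
        for (String sentence : sentencesContaining(documents, "dog")) {
            System.out.println(sentence);
        }
    }
}
```

A linear scan like this would not scale to large collections, which is exactly why delegating selection to an inverted index such as Lucene's is attractive.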
TML also uses grammatical information from the Stanford parser at the sentence level, so each sentence contains its own PennTree string. This allows the grammatical tree to be reconstructed quickly in order to perform grammatical operations.
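Reconstructing a tree from a bracketed Penn Treebank style string is cheap because it needs only one pass over the tokens. The following sketch shows one way to do it with a small recursive-descent parser. It is not TML's or the Stanford parser's code: the class PennTreeSketch, its Node type and the sample sentence are hypothetical, and it assumes every opening bracket is followed by a label.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: rebuilds a tree from a bracketed string such as "(S (NP ...) (VP ...))". */
public class PennTreeSketch {

    /** A tree node; a node without children is treated as a word (leaf). */
    public static final class Node {
        public final String label;
        public final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
        public boolean isLeaf() { return children.isEmpty(); }
    }

    private final String[] tokens;
    private int pos = 0;

    private PennTreeSketch(String pennString) {
        // Surround brackets with spaces so that splitting on whitespace isolates them.
        tokens = pennString.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
    }

    /** Parses a single bracketed tree and returns its root. */
    public static Node parse(String pennString) {
        PennTreeSketch parser = new PennTreeSketch(pennString);
        Node root = parser.parseNode();
        if (parser.pos != parser.tokens.length) {
            throw new IllegalArgumentException("unexpected tokens after the tree");
        }
        return root;
    }

    private Node parseNode() {
        if (pos >= tokens.length) throw new IllegalArgumentException("unexpected end of input");
        String token = tokens[pos++];
        if (!token.equals("(")) {
            return new Node(token);                    // a bare token is a word
        }
        if (pos >= tokens.length) throw new IllegalArgumentException("missing label");
        Node node = new Node(tokens[pos++]);           // the token after "(" is the label
        while (pos < tokens.length && !tokens[pos].equals(")")) {
            node.children.add(parseNode());
        }
        if (pos >= tokens.length) throw new IllegalArgumentException("missing closing bracket");
        pos++;                                         // consume ")"
        return node;
    }

    /** Collects the words (leaves) of the tree from left to right. */
    public static List<String> words(Node node) {
        List<String> out = new ArrayList<>();
        collect(node, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        if (node.isLeaf()) { out.add(node.label); return; }
        for (Node child : node.children) collect(child, out);
    }

    public static void main(String[] args) {
        Node root = parse("(ROOT (S (NP (DT The) (NN dog)) (VP (VBZ barks))))");
        System.out.println(root.label + " -> " + words(root));
    }
}
```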
You can download the latest version of TML here.
The easiest way to use TML is as a command line tool. In order to do this you need:
You can execute TML from the command line with the following command:
java -jar tml-xxx.jar
Adding documents to a repository:
java -jar tml-xxx.jar -I -repo /path/to/repository --idocs /path/to/txt/files
Executing operations on a corpus:
java -jar tml-xxx.jar -O -repo /path/to/repository --ocorpus type:document --odim NUM --odimth 2 --operations PassagesSimilarity
For a full list of the available operations, check the package tml.vectorspace.operations in the API docs.
To use TML from another Java program you have to include TML in your classpath. You can use the provided tml-xxx-core.jar, which does not include dependencies, to avoid conflicting jars and save disk space.