Content mining with Apache Tika

Apache Tika is a content-mining library that extracts both metadata and text content from documents of many different types. Instead of turning to a variety of parser libraries, each offering slightly different options, you can learn Tika's API once and apply it to any format that Tika supports, including RTF, Microsoft Office formats, and PDF. You can also write parsers for new, currently unsupported formats and hook them into Tika. That makes searching or indexing an existing corpus of documents much easier.

Tika started life as part of Apache Lucene but is now a top-level project in its own right. As a toolkit, it uses existing parser libraries rather than reinventing the wheel, so it provides a single standard interface for existing parsers. It's written in Java, so you can either run it from the command line or hook it into your Java project. We'll try running Tika both ways.

Tika as a command-line tool

You can download Tika either as a runnable Java archive or as source code (which you then have to compile yourself) from the download page. (You can also check out the current source from version control if you want to try out the very latest code, but it may not be stable.)

To extract metadata or content by running Tika from the command line, use the prepackaged jar file. For example, this command outputs the contents of the file test.doc to standard output in text format:

java -jar tika-app-1.4.jar --text test.doc

If you just want the file's metadata, again in text format, try:

java -jar tika-app-1.4.jar --metadata test.doc

Short forms of these commands are also available; run java -jar tika-app-1.4.jar --help to get the full list of available options. You can output the content information in HTML (replace --text with --html) or XHTML (replace --text with --xml) if you prefer. You can output the metadata as JSON (replace --metadata with --json) or XMP (replace --metadata with --xmp).

You can also hook Tika into a standard Unix pipeline, as with any other Unix-style command. For example, you can use cURL to fetch a file, parse its content into HTML using Tika, and then send that HTML output to a file:

curl http://example.com/test.doc | java -jar tika-app-1.4.jar --html > test.html

With just a little scripting, you could wrangle a whole set of PDF, Microsoft Office, OpenDocument, or RTF documents into HTML for ease of online access. You could even set up a job to run automatically overnight to index new documents if you regularly acquire new ones. If you do something like this, the metadata may be useful for an index or tag library, allowing you to improve document searching. You could then provide links to the original document types and/or an HTML translation.
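The batch conversion described above could be sketched in Java roughly as follows. This is a minimal sketch rather than a finished tool: it assumes tika-app-1.4.jar sits in the working directory, the extension list is only an illustrative subset of what Tika supports, and the class name, method names, and output naming scheme are all hypothetical choices made for this example. It runs the same --html command shown earlier once per file and redirects standard output to an HTML file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: convert each matching document in a directory to HTML by
// running the Tika command-line jar once per file.
class TikaBatchConvert {

    // Assumption: the runnable jar is in the working directory.
    static final String TIKA_JAR = "tika-app-1.4.jar";

    // Assumption: an illustrative subset of extensions, not Tika's full list.
    static final List<String> EXTENSIONS =
            Arrays.asList(".pdf", ".doc", ".docx", ".odt", ".rtf");

    // True if the file name ends with one of the extensions above.
    static boolean isConvertible(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        for (String ext : EXTENSIONS) {
            if (name.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    // report.pdf -> <outDir>/report.pdf.html; keeping the original extension
    // stops report.pdf and report.doc from overwriting each other.
    static Path htmlTarget(Path source, Path outDir) {
        return outDir.resolve(source.getFileName().toString() + ".html");
    }

    // The same command line as in the article, built as an argument list.
    static List<String> tikaCommand(Path source) {
        return Arrays.asList("java", "-jar", TIKA_JAR, "--html", source.toString());
    }

    static void convertAll(Path inDir, Path outDir)
            throws IOException, InterruptedException {
        Files.createDirectories(outDir);
        List<Path> sources;
        try (Stream<Path> files = Files.list(inDir)) {
            sources = files.filter(Files::isRegularFile)
                    .filter(TikaBatchConvert::isConvertible)
                    .sorted()
                    .collect(Collectors.toList());
        }
        for (Path source : sources) {
            // Send Tika's standard output straight to the target file.
            Process process = new ProcessBuilder(tikaCommand(source))
                    .redirectOutput(htmlTarget(source, outDir).toFile())
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            int exit = process.waitFor();
            if (exit != 0) {
                System.err.println("Tika failed on " + source + " (exit " + exit + ")");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java TikaBatchConvert <input-dir> <output-dir>");
            return;
        }
        convertAll(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Run from a scheduler such as cron, a program like this would cover the overnight indexing job mentioned above. Starting a new JVM per document is slow for a large corpus, so treat it as a starting point only.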

Inevitably, the quality of the HTML or XML that Tika produces is in part a reflection of the structure of the original document. You're likely to have a few extraneous italic or bold tags if font attributes are used in, say, a Word file, and the HTML will not reflect font choices or sizes. Tika is more about getting the content out, and not so much about the presentation.

Other command-line uses

In addition to working with metadata and content, Tika can also detect the file type and even the language that a file is written in. This can be useful if metadata is lacking:

$ java -jar tika-app-1.4.jar --detect test.doc 
application/rtf
$ java -jar tika-app-1.4.jar --language test_french.doc 
fr

You could also use the file-type detection output to route a file into another pipeline or another part of a Java app. Tika can even handle metadata from files that contain EXIF information.
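As a rough illustration of routing on the detected type, the sketch below runs the --detect command shown above and maps the resulting MIME type to an action. The class name, the Action values, and the routing policy itself are hypothetical and exist only for this example; it also assumes tika-app-1.4.jar is in the working directory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Sketch: decide what to do with a file based on Tika's --detect output.
class TikaDetectRouter {

    // Assumption: the runnable jar is in the working directory.
    static final String TIKA_JAR = "tika-app-1.4.jar";

    // Hypothetical actions for a downstream pipeline.
    enum Action { FULL_TEXT, METADATA_ONLY, SKIP }

    // Runs "java -jar tika-app-1.4.jar --detect <file>" and returns the
    // first line of output, or an empty string if there was none.
    static String detect(String fileName) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("java", "-jar", TIKA_JAR, "--detect", fileName)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String first = null;
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            // Drain all output so the child process never blocks on a full pipe.
            while ((line = out.readLine()) != null) {
                if (first == null) {
                    first = line.trim();
                }
            }
        }
        process.waitFor();
        return first == null ? "" : first;
    }

    // Illustrative policy: images get metadata extraction only (for example
    // EXIF), common document and text types get full text, the rest is skipped.
    static Action route(String mimeType) {
        // Drop any parameters such as "; charset=UTF-8" before comparing.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (base.startsWith("image/")) {
            return Action.METADATA_ONLY;
        }
        if (base.startsWith("text/")
                || base.equals("application/pdf")
                || base.equals("application/rtf")
                || base.equals("application/msword")) {
            return Action.FULL_TEXT;
        }
        return Action.SKIP;
    }
}
```

A caller would pass the result of detect("test.doc") to route() and then run Tika with --text or --metadata as appropriate.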

Another command-line option lets you fire Tika up with a GUI, enabling you to drag a file into it for decoding. You can also start it as a server, listening to a particular port, and then feed files into that port.

[Figure: Apache Tika GUI]

Tika in your Java project

Now let's move away from the command line and look at a bit of code that uses Tika in your own Java programs. This example is very basic, and you'll want to explore the Tika API to get the most out of it for your needs. This code simply reads in a specific file, prints its metadata, and then prints its body text:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

class TestTika {

  public static void main(String[] args)
      throws IOException, SAXException, TikaException {

    // Set up the input file, its stream, and the Tika helper objects.
    File file = new File("testing.rtf");
    InputStream is = new FileInputStream(file);
    Metadata metadata = new Metadata();
    BodyContentHandler ch = new BodyContentHandler();
    AutoDetectParser parser = new AutoDetectParser();

    // Detect the type from the File and record it in the metadata.
    String mimeType = new Tika().detect(file);
    metadata.set(Metadata.CONTENT_TYPE, mimeType);

    // Parse the document, closing the stream even if parsing fails.
    try {
      parser.parse(is, ch, metadata, new ParseContext());
    } finally {
      is.close();
    }

    // Print each metadata field as "name -- value".
    for (String name : metadata.names()) {
      System.out.println(name + " -- " + metadata.get(name));
    }

    // Print the extracted body text.
    System.out.println(ch.toString());
  }
}

This code starts by setting up the file to parse, an InputStream for it, and the Tika Metadata, BodyContentHandler, and AutoDetectParser objects. The useful AutoDetectParser class allows you to forgo specifying what type of file you are expecting; Tika will work it out for you and use the correct parser library. (If you prefer, you can specify one of the many Tika parsers. This may give you more options if you know, for example, that all of your files are PDF or RTF.)

The code then detects the file type (it's important to use the File here, not the InputStream) and sets the metadata accordingly, then parses the file. It loops over the metadata list (accessed with names()) and prints out each item, then prints out the body with ch.toString().

Use this basic way of accessing the body contents with caution. If you have a very large file, you might have memory problems. Tika offers other options for managing the content in a more sophisticated way – check out the full API.

If this tutorial has whetted your appetite, Tika may well be useful in your environment. For more on creating Tika-based Java apps, read the documentation for the Parser interface. To make the most of Tika's capabilities, in particular if you have a large corpus of documents that you want to access, check out the book Tika in Action.

This work is licensed under a Creative Commons Attribution 3.0 Unported License