8 Apr 2009

Gutting a PDF

Posted by khk


OK, I have to admit, the title is just for show 🙂 I don’t really want to gut a PDF – that would mean killing it, and PDFs are pretty useful, so we should treat them well…

He's Dead, Jim!

What I’m after is to extract arbitrary information from a PDF file – information that may not be accessible in any other way. Some 3rd party Acrobat plug-ins save information in a PDF file so that once the document is opened again, the plug-in “knows” that the current file was already processed, or that a user interface window can be populated with the previously saved settings, or … There are many reasons why that could come in handy.

If you take a look at the PDF Reference document, you can find all the information necessary to understand how data can be saved in a PDF file. Adobe does allow 3rd party developers to store information in a PDF file as long as it is clear that the data is private. By using a four-letter developer prefix for all such data, the developer can make sure that nobody else reads that information by accident.

I’ve mentioned before that there are tools that allow us to look at the structure of a PDF file (e.g. the Enfocus Browser or Acrobat’s own Preflight tool). For now let’s assume that the data we are interested in is actually saved in the PDF’s metadata stream – if you don’t know what that means, please go back to the PDF Reference document.

For this example, let’s try something simple that just illustrates the process and the tools we need. With that knowledge and background, it is easy to perform more sophisticated tasks with PDF files.

Newer PDF files contain metadata not only in the form of the document info dictionary, but also as XMP metadata, which is an XML-based format. Let’s try to extract that XML data stream from a PDF file.

Because I don’t want to hide the interesting parts of the solution behind infrastructure, I am hardcoding the PDF filename and the text output file name. I’m also assuming that the iText library JAR file is in the same directory as the source code.

Here is the complete program. I’ll go through the different sections once we’ve compiled and executed it:

/* based on some sample code from iText library */

import java.io.*;

import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfDictionary;
import com.lowagie.text.pdf.PdfName;
import com.lowagie.text.pdf.PdfStream;
import com.lowagie.text.pdf.PRStream;

public class MetaData {

    public static void main(String[] args) {

        System.out.println("Trying to extract XML metadata");

        try {

            PdfReader reader = new PdfReader("first.pdf");

            PdfDictionary dict = reader.getCatalog();

            PdfDictionary metaData = dict.getAsStream(new PdfName("Metadata"));
            if (metaData == null)
            {
                System.out.println("Cannot get metaData");
                return;
            }

            if (metaData.isStream())
            {
                OutputStream f = new FileOutputStream("metaData.txt");

                byte[] data = PdfReader.getStreamBytes((PRStream) metaData);
                f.write(data);

                f.close();

            }
            else
            {
                System.out.println("Metadata is not a stream object");
            }

        }
        catch (Exception de) {
               de.printStackTrace();
        }
    }
}

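Let’s assume that the source code is in a file named MetaData.java. To compile and run the program, we need to execute the following commands. The iText JAR has to be on the classpath; the wildcard used here picks up every JAR in the current directory (this needs Java 6 or later, and on Windows the separator is ";" instead of ":"):

javac -cp ".:*" MetaData.java
java -cp ".:*" MetaData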

Before you execute the program, make sure that there is a PDF file named “first.pdf” in the same directory as the program and the iText library.

How Does It Work?


First we need to import a bunch of “stuff” – we need the Java IO system, and then a few classes from the iText library:

import java.io.*;

import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfDictionary;
import com.lowagie.text.pdf.PdfName;
import com.lowagie.text.pdf.PdfStream;
import com.lowagie.text.pdf.PRStream;

In the next few lines, we define our class and declare our main function. Also, all of the iText-related code is in one try/catch block. For a real application, you would want to create smaller try/catch blocks so that you can recover from problems.

public class MetaData {

    public static void main(String[] args) {

        System.out.println("Trying to extract XML metadata");

        try {

Now we are ready to create a new PdfReader object – this is how iText accesses the data in a PDF file. From that PdfReader object we can then get the “Catalog”, which is the root object of all COS objects that are used in this document:

            PdfReader reader = new PdfReader("first.pdf");
            PdfDictionary dict = reader.getCatalog();

The meta data is stored as a stream (if you don’t know what that is, read up on it in the PDF spec) as a direct child of the Catalog dictionary (again, if you don’t know what that is, read up on it). This means that we can access it without any further navigating through the COS objects. For a real project, chances are that you have to go a few more levels deep into the COS structure. The PdfDictionary object has a method to get a stream:

            PdfDictionary metaData = dict.getAsStream(new PdfName("Metadata"));
            if (metaData == null)
            {
                System.out.println("Cannot get metaData");
                return;
            }

When it comes to software, I’m not a very trusting person, so I want to make sure that we are indeed dealing with a stream and nothing else. Therefore I call the isStream() method to find out if that’s the case. If we are dealing with a stream, I create a new FileOutputStream (a text file that will receive the XML data), read the actual COS stream data, and write it to the output file. iText will take care of any filters that were applied to the stream data (e.g. compression), so I don’t have to deal with that directly.

            if (metaData.isStream())
            {
                OutputStream f = new FileOutputStream("metaData.txt");

                byte[] data = PdfReader.getStreamBytes((PRStream) metaData);
                f.write(data);

                f.close();

            }
            else
            {
                System.out.println("Metadata is not a stream object");
            }
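
To make the remark about filters a bit more concrete: a very common stream filter in PDF files is FlateDecode, which is zlib/deflate compression, and the JDK can undo that on its own. The following is only a sketch of the idea using java.util.zip. It is not how iText implements it, and the names FlateSketch, deflate and inflate are made up for this illustration. It compresses a sample string and then inflates it again:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlateSketch {

    // Compress raw bytes into the zlib format used by FlateDecode streams.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Undo the compression; this is the kind of work a PDF library does for us.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and not finished: the input is truncated or unsupported.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or unsupported data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "sample stream data, sample stream data, sample stream data"
                .getBytes(StandardCharsets.UTF_8);
        byte[] packed = deflate(raw);
        byte[] unpacked = inflate(packed);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes compressed");
        System.out.println(new String(unpacked, StandardCharsets.UTF_8));
    }
}
```

With iText none of this is needed, because PdfReader.getStreamBytes() already hands back the decoded bytes.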

That’s it. Now we just need to make sure that we have a catch block to handle any exceptions:

        catch (Exception de) {
               de.printStackTrace();
        }
    }
}
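
The earlier advice about smaller try/catch blocks might look like the sketch below. To keep it runnable without iText, loadData is a hypothetical stand-in for the PDF-reading step, and the class name SmallerTryBlocks is invented as well. Each step gets its own handler, so the caller can tell a read problem from a write problem, and try-with-resources (Java 7 or later) closes the output file even if the write fails:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class SmallerTryBlocks {

    // Hypothetical stand-in for "open the PDF and read the metadata stream".
    static byte[] loadData(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("simulated read failure");
        }
        return "<x:xmpmeta/>".getBytes(StandardCharsets.UTF_8);
    }

    // Returns a short status string so the caller can react to each failure separately.
    static String run(boolean failRead, String outputName) {
        byte[] data;
        try {
            data = loadData(failRead);
        } catch (IOException e) {
            // Nothing was written yet, so there is nothing to clean up.
            return "read failed: " + e.getMessage();
        }

        // The stream is closed automatically, whether or not write() throws.
        try (OutputStream out = new FileOutputStream(outputName)) {
            out.write(data);
        } catch (IOException e) {
            return "write failed: " + e.getMessage();
        }
        return "ok";
    }

    public static void main(String[] args) {
        System.out.println(run(false, "metaData.txt"));
        System.out.println(run(true, "metaData.txt"));
    }
}
```

In the original program the call to f.close() is skipped if write() throws, which is one more reason to structure the file handling this way.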

Instead of saving the XML metadata to a file, we could have used a Java-based XML parser and extracted data from it.
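
As a rough sketch of that idea, the bytes returned by getStreamBytes() could be handed to the DOM parser that ships with the JDK. The example below uses a small hand-written XMP packet instead of a real PDF, and the class name XmpProducer and the method findProducer are made up for illustration. It looks for a pdf:Producer element in the Adobe PDF schema namespace. Keep in mind that XMP may also store simple properties as attributes of rdf:Description, which this sketch does not handle:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class XmpProducer {

    // Namespace of the Adobe PDF schema used inside XMP packets.
    static final String PDF_NS = "http://ns.adobe.com/pdf/1.3/";

    // Returns the text of the first pdf:Producer element, or null if there is none.
    static String findProducer(byte[] xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmp));
            NodeList nodes = doc.getElementsByTagNameNS(PDF_NS, "Producer");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XMP data", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample packet; a real one would come from the PDF.
        String xmp =
              "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
            + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
            + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
            + "<rdf:Description rdf:about=\"\" xmlns:pdf=\"http://ns.adobe.com/pdf/1.3/\">"
            + "<pdf:Producer>Example Producer 1.0</pdf:Producer>"
            + "</rdf:Description></rdf:RDF></x:xmpmeta>"
            + "<?xpacket end=\"w\"?>";
        System.out.println(findProducer(xmp.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In the program above, the byte array named data would take the place of the sample string.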

The same technique can also be used to read other data from a PDF file (e.g. names, numbers, …).

Let me know if you have more questions about either iText, or how to access information in a PDF file with it.

