<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Karl Heinz Kremer&#039;s Ramblings &#187; cos</title>
	<atom:link href="http://www.khk.net/wordpress/tag/cos/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.khk.net/wordpress</link>
	<description>Stuff, stuff and more stuff</description>
	<lastBuildDate>Sun, 25 Sep 2011 18:38:40 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=</generator>
		<item>
		<title>Gutting a PDF</title>
		<link>http://www.khk.net/wordpress/2009/04/08/gutting-a-pdf/</link>
		<comments>http://www.khk.net/wordpress/2009/04/08/gutting-a-pdf/#comments</comments>
		<pubDate>Wed, 08 Apr 2009 15:23:31 +0000</pubDate>
		<dc:creator>khk</dc:creator>
				<category><![CDATA[PDF]]></category>
		<category><![CDATA[Photos]]></category>
		<category><![CDATA[cos]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[fish]]></category>
		<category><![CDATA[itext]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[meta data]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xmp]]></category>

		<guid isPermaLink="false">http://khk.net/wordpress/?p=314</guid>
		<description><![CDATA[OK, I have to admit, the title is just for show I don&#8217;t really want to gut a PDF &#8211; that would mean to kill it, and PDFs are pretty useful, so we should treat them well&#8230; What I&#8217;m after is to extract arbitrary information from a PDF file &#8211; information that may not be [...]]]></description>
			<content:encoded><![CDATA[
<p>OK, I have to admit, the title is just for show <img src='http://www.khk.net/wordpress/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  I don&#8217;t really want to gut a PDF &#8211; that would mean killing it, and PDFs are pretty useful, so we should treat them well&#8230;</p>
<p><a href="http://www.flickr.com/photos/68335338@N00/2651668890" title="View 'He's Dead, Jim!' on Flickr.com">
<div style="text-align:center;"><img src="http://farm4.static.flickr.com/3146/2651668890_89bd1c3971.jpg" alt="He's Dead, Jim!" class="flickr" /></div>
<p></a></p>
<p>What I&#8217;m after is to extract arbitrary information from a PDF file &#8211; information that may not be accessible in any other way. Some 3rd party Acrobat plug-ins save information in a PDF file so that once the document is opened again, the plug-in &#8220;knows&#8221; that the current file was already processed, or that a user interface window can be populated with the previously saved settings, or &#8230; There are many reasons why that could come in handy. </p>
<p>If you take a look at the <a href="http://www.adobe.com/devnet/pdf/pdf_reference.html">PDF Reference document</a>, you can find all the information necessary to understand how data can be saved in a PDF file. Adobe does allow 3rd party developers to store information in a PDF file as long as it is clear that the data is private. The developer can make sure that nobody else reads that information by accident by using a four letter developer prefix for all such data. </p>
<p>I&#8217;ve mentioned before that there are tools that allow us to look at the structure of a PDF file (e.g. the Enfocus Browser or Acrobat&#8217;s own Preflight tool). For now let&#8217;s assume that the data we are interested in is actually saved in the PDF&#8217;s metadata stream &#8211; if you don&#8217;t know what that means, please go back to the PDF Reference document. </p>
<p><span id="more-314"></span>For this example, let&#8217;s try something simple that just illustrates the process and the tools we need. With that knowledge and background, it is easy to perform more sophisticated tasks with PDF files. </p>
<p>Newer PDF files contain metadata not only in the form of the document info dictionary, but also as XMP metadata, which is an XML-based format. Let&#8217;s try to extract that XML data stream from a PDF file. </p>
<p>Because I don&#8217;t want to bury the interesting parts of the solution under infrastructure, I am hardcoding the PDF file name and the text output file name. I&#8217;m also assuming that the iText library JAR file is in the same directory as the source code. </p>
<p>Here is the complete program. I&#8217;ll go through the different sections once we&#8217;ve compiled and executed it:<br />
<pre><pre>
/* based on some sample code from iText library */

import java.io.*;

import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfDictionary;
import com.lowagie.text.pdf.PdfName;
import com.lowagie.text.pdf.PdfStream;
import com.lowagie.text.pdf.PRStream;

public class MetaData {

&nbsp;&nbsp;&nbsp;&nbsp;public static void main(String[] args) {

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(&quot;Trying to extract XML metadata&quot;);

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;try {

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;PdfReader reader = new PdfReader(&quot;first.pdf&quot;);

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;PdfDictionary dict = reader.getCatalog();

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;PdfDictionary metaData = dict.getAsStream(new PdfName(&quot;Metadata&quot;));
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (metaData == null)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(&quot;Cannot get metaData&quot;);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (metaData.isStream())
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;OutputStream f = new FileOutputStream(&quot;metaData.txt&quot;);

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;byte[] data = PdfReader.getStreamBytes((PRStream) metaData);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;f.write(data);

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;f.close();

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(&quot;Metadata is not a stream object&quot;);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;catch (Exception de) {
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;de.printStackTrace();
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;}
}
</pre></pre></p>
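<p>Let&#8217;s assume that the source code is in a file named MetaData.java. To compile and run the program, we need to execute the following commands. The iText JAR has to be on the classpath; iText.jar below is a placeholder for the actual file name of your iText JAR (on Windows, use ; instead of : as the separator):<br />
<pre><pre>
javac -cp .:iText.jar MetaData.java
java -cp .:iText.jar MetaData
</pre></pre></p>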
<p>Before you execute the program, make sure that there is a PDF file named &#8220;first.pdf&#8221; in the same directory as the program and the iText library. </p>
<p><H3>How Does It Work?</H3><br />
First we need to import a bunch of &#8220;stuff&#8221; &#8211; the Java IO system, and then a few classes from the iText library:<br />
<pre><pre>
import java.io.*;

import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfDictionary;
import com.lowagie.text.pdf.PdfName;
import com.lowagie.text.pdf.PdfStream;
import com.lowagie.text.pdf.PRStream;
</pre></pre></p>
<p>In the next few lines, we define our class and declare our main function. All of the iText-related code sits in a single try/catch block. For a real application, you would want smaller try/catch blocks so that you can recover from individual problems.<br />
<pre><pre>
public class MetaData {

&nbsp;&nbsp;&nbsp;&nbsp;public static void main(String[] args) {

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(&quot;Trying to extract XML metadata&quot;);

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;try {
</pre></pre></p>
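<p>To illustrate the remark about smaller try/catch blocks, here is a minimal sketch that uses only the Java standard library and no iText at all. The class name GranularCopy, its copy method and the status strings are made up for this example. The point is only that reading and writing are guarded separately, so the program can tell which step failed and react to it.</p>

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of iText: copies a file in two separately guarded steps.
class GranularCopy {

    // Returns "ok", "read failed" or "write failed" depending on which step broke.
    static String copy(Path in, Path out) {
        byte[] data;
        try {
            data = Files.readAllBytes(in);
        } catch (IOException e) {
            // Nothing was read, so there is nothing to write.
            return "read failed";
        }

        try {
            Files.write(out, data);
        } catch (IOException e) {
            // The input was fine; a caller could retry with another output location.
            return "write failed";
        }
        return "ok";
    }

    public static void main(String[] args) {
        System.out.println(copy(Paths.get("first.pdf"), Paths.get("copy.pdf")));
    }
}
```

<p>With one big try/catch block, both failures would end up in the same handler and the program could not easily tell them apart.</p>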
<p>Now we are ready to create a new PdfReader object &#8211; this is how iText accesses the data in a PDF file. From that PdfReader object we can then get the &#8220;Catalog&#8221;, which is the root of all COS objects that are used in this document:<br />
<pre><pre>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;PdfReader reader = new PdfReader(&quot;first.pdf&quot;);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;PdfDictionary dict = reader.getCatalog();
</pre></pre></p>
<p>The metadata is stored as a stream (if you don&#8217;t know what that is, read up on it in the PDF spec) that is a direct child of the Catalog dictionary (again, if you don&#8217;t know what that is, read up on it). This means that we can access it without navigating any further through the COS objects. For a real project, chances are that you will have to go a few more levels deep into the COS structure. The PdfDictionary object has a method to get a stream:<br />
<pre><pre>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;PdfDictionary metaData = dict.getAsStream(new PdfName(&quot;Metadata&quot;));
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (metaData == null)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(&quot;Cannot get metaData&quot;);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
</pre></pre></p>
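<p>Going &#8220;a few more levels deep&#8221; boils down to following a path of keys from one dictionary to the next. The sketch below shows that idea with plain java.util.Map objects standing in for COS dictionaries. It does not use the iText classes, and the CosPath class and its lookup method are invented for this illustration. Pages and Count are ordinary PDF key names, used here only as sample data.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in model: nested Maps play the role of nested COS dictionaries.
class CosPath {

    // Follows the given keys through nested maps. Returns null when a key is
    // missing or an intermediate value is not a dictionary.
    static Object lookup(Map<String, Object> root, String... keys) {
        Object current = root;
        for (String key : keys) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> pages = new HashMap<>();
        pages.put("Count", 3);
        Map<String, Object> catalog = new HashMap<>();
        catalog.put("Pages", pages);

        System.out.println(lookup(catalog, "Pages", "Count"));   // prints 3
        System.out.println(lookup(catalog, "Metadata", "Type")); // prints null
    }
}
```

<p>The null checks matter just as much in a real PDF, because any entry along the path may be missing or may have a different type than expected.</p>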
<p>When it comes to software, I&#8217;m not a very trusting person, so I want to make sure that we are indeed dealing with a stream and nothing else. Therefore I call the isStream() method to find out if that&#8217;s the case. If we are dealing with a stream, I create a new FileOutputStream (for a text file that will receive the XML data), then read the actual COS stream data and write it to the output file. iText takes care of any filters that were applied to the stream data (e.g. compression), so I don&#8217;t have to deal with that directly.<br />
<pre><pre>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (metaData.isStream())
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;OutputStream f = new FileOutputStream(&quot;metaData.txt&quot;);

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;byte[] data = PdfReader.getStreamBytes((PRStream) metaData);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;f.write(data);

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;f.close();

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(&quot;Metadata is not a stream object&quot;);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
</pre></pre></p>
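<p>To give an idea of what &#8220;taking care of filters&#8221; means: a stream that uses the FlateDecode filter holds zlib/deflate compressed bytes, and those have to be inflated before the XML becomes readable. The following sketch does such a round trip with java.util.zip from the standard library. It is not how iText implements it, and the FlateDemo class with its two methods is made up for this example.</p>

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: compresses and decompresses bytes in the zlib format.
class FlateDemo {

    // Compresses the given bytes (what a PDF writer would do before storing them).
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses zlib data (what a reader has to do to undo the filter).
    static byte[] inflate(byte[] packed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            // Guard against looping forever on truncated input.
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                throw new DataFormatException("truncated or unsupported data");
            }
            out.write(buffer, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] raw = "<x:xmpmeta/>".getBytes(StandardCharsets.UTF_8);
        byte[] restored = inflate(deflate(raw));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

<p>In the program above none of this is needed, because getStreamBytes() already hands back the decoded data.</p>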
<p>That&#8217;s it. Now we just need to make sure that we have a catch block for our exception handler:<br />
<pre><pre>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;catch (Exception de) {
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;de.printStackTrace();
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;}
}

</pre></pre></p>
<p>Instead of saving the XML metadata to a file, we could have fed it to a Java-based XML parser and extracted individual values from it. </p>
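<p>As a rough sketch of that idea, the following code parses a byte array with the DOM parser that ships with Java and pulls out the text of the xmp:CreatorTool element. The XmpCreatorTool class and the tiny hand-written sample packet are assumptions made for this example. In the real program, the byte array would be the data returned by getStreamBytes().</p>

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads one value out of an XMP packet.
class XmpCreatorTool {

    // Namespace of the basic XMP schema (prefix "xmp").
    static final String XMP_NS = "http://ns.adobe.com/xap/1.0/";

    // Returns the text of the first xmp:CreatorTool element, or null if there is none.
    static String creatorTool(byte[] xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the namespace lookup below
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xmp));
        NodeList nodes = doc.getElementsByTagNameNS(XMP_NS, "CreatorTool");
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample data, much smaller than a real XMP packet.
        String sample =
              "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
            + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
            + "<rdf:Description rdf:about=\"\" xmlns:xmp=\"http://ns.adobe.com/xap/1.0/\">"
            + "<xmp:CreatorTool>Example Writer</xmp:CreatorTool>"
            + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(creatorTool(sample.getBytes(StandardCharsets.UTF_8)));
    }
}
```

<p>Keep in mind that XMP allows simple properties to be written as attributes of rdf:Description instead of child elements. This sketch only handles the element form, so a more complete version would have to check both.</p>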
<p>The same technique can also be used to read other data from a PDF file (e.g. names, numbers, &#8230;).</p>
<p>Let me know if you have more questions about either iText, or how to access information in a PDF file with it. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.khk.net/wordpress/2009/04/08/gutting-a-pdf/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

