Sitecore PDF indexing errors

Some time ago, we noticed that our Sitecore 8.0 installation was logging errors like this one in the production crawling logs:

ManagedPoolThread #16 08:02:41 ERROR Could not compute value for ComputedIndexField: _content for indexable: sitecore://web/{8767E7A4-171A-478B-B421-ACF46B60D4F1}?lang=en&ver=1 Exception: System.Runtime.InteropServices.COMException

Using Ryan Bailey’s blog, we found this was a problem with indexing the PDFs in our media library. Apparently, Sitecore’s Lucene crawler defaults to indexing PDFs, but cannot extract the contents of the PDFs on its own.
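To check which media items are involved, you can pull the item IDs out of the crawling log and look them up in the media library. Below is a minimal Python sketch, assuming your log lines follow the same format as the error shown above. The function name `find_failed_indexables` and the regular expression are our own illustration, not part of Sitecore.

```python
import re

# Matches the crawler error format shown above and captures the field name,
# database, item ID, language and version from the indexable URI.
# Assumption: other Sitecore versions may format this line differently.
ERROR_PATTERN = re.compile(
    r"Could not compute value for ComputedIndexField: (?P<field>\S+) "
    r"for indexable: sitecore://(?P<database>\w+)/\{(?P<item_id>[0-9A-Fa-f-]{36})\}"
    r"\?lang=(?P<lang>[\w-]+)&ver=(?P<version>\d+)"
)


def find_failed_indexables(lines):
    """Return one dict per log line that matches the crawler error format."""
    failures = []
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            failures.append(match.groupdict())
    return failures


if __name__ == "__main__":
    sample = [
        "ManagedPoolThread #16 08:02:41 ERROR Could not compute value for "
        "ComputedIndexField: _content for indexable: "
        "sitecore://web/{8767E7A4-171A-478B-B421-ACF46B60D4F1}?lang=en&ver=1 "
        "Exception: System.Runtime.InteropServices.COMException",
    ]
    for failure in find_failed_indexables(sample):
        print(failure["item_id"], failure["field"], failure["lang"], failure["version"])
```

If most of the IDs you find this way belong to PDF media items, you are probably looking at the same problem we had.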

In general, there are two solutions to this problem: enable Sitecore to extract the contents of PDFs, or exclude PDFs from indexing. To enable indexing, have a look at this post by John West and this Stack Overflow answer. The solution involves adding a series of DLLs to your Sitecore installation, which extract the PDF contents for processing by Lucene. You could also have a look at these two posts by Ryan Bailey.

We, however, opted to exclude PDFs from indexing for now, as our site visitors did not (yet) use the Lucene search. To do this, remove the MIME type application/pdf from the list of indexed MIME types:

<?xml version="1.0"?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <indexConfigurations>
        <defaultLuceneIndexConfiguration>
          <mediaIndexing>
            <mimeTypes>
              <includes>
                <mimeType>
                  <patch:delete>application/pdf</patch:delete>
                </mimeType>
              </includes>
            </mimeTypes>
          </mediaIndexing>
        </defaultLuceneIndexConfiguration>
      </indexConfigurations>
    </contentSearch>
  </sitecore>
</configuration>

Personal note: As I’ll be switching companies next month, fewer posts may be about Sitecore and more about .NET development in general. But not to worry, Sitecore is certainly not out of the picture!
