Searching PDF with Umbraco

Umbraco has long used the powerful Lucene.Net indexing and search engine to drive its back-office search. When the need arose for a search module, Umbraco adopted Examine, a provider-based indexer/searcher API that wraps Lucene.Net and works easily with Lucene indexes. Given a suitable indexer, Examine can quickly index and search many kinds of content (PDF, DOCX, DOC, etc.), and Lucene stays fast even on huge volumes of data. The natural result was ‘Umbraco Examine’, a combination of Umbraco, Examine and Lucene.Net.

Umbraco Examine uses Umbraco as the data source for its Lucene index and provides keyword-based site search. By default, Examine searches only Umbraco nodes; it does not search the content of Umbraco media files such as PDFs. To search the content of PDF files, you need to install an additional NuGet package called UmbracoExamine.PDF (source: http://www.wiliam.com.au/wiliam-blog/searching-pdfs-with-umbraco).

Before exploring how to set up PDF search, let’s first see how Umbraco Examine can be configured to perform indexing and search tasks in Umbraco.

Basics of Umbraco Examine

Examine is not exclusive to Umbraco. It can be used as a standalone component in any project that needs a fast index. Since Examine is a provider-based API, it is extensible, and you can define as many indexes as you want, each configured individually.

Umbraco Examine ships with Umbraco as an out-of-the-box feature with a basic configuration. Because Examine is configuration driven, developers need to adjust its settings to suit their needs.

Managing Settings and Configurations

  1. Index Sets

Usually, developers start with an index set, which is nothing but an index definition (outlining the fields and field types that are included in the index).

To configure an index set, edit the /config/ExamineIndex.config file.

Search index configurations and settings can be changed using the following configuration files:

~\config\ExamineIndex.config

~\config\ExamineSettings.config

The ~\config\ExamineIndex.config file has three index sets by default: InternalIndexSet, InternalMemberIndexSet and ExternalIndexSet.

The default index set looks like this:

<IndexSet SetName="InternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Internal/"/>
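For orientation, here is a minimal sketch of how the three default index sets might sit together in ~\config\ExamineIndex.config. The ExamineLuceneIndexSets root element and the InternalMember index path are assumptions based on a typical default Umbraco installation, not something stated in this post, and a real file usually lists fields inside some of the sets, so compare it with your own file before copying anything.

```xml
<?xml version="1.0"?>
<!-- Assumed root element; check it against your own ExamineIndex.config. -->
<ExamineLuceneIndexSets>
  <!-- Used by the back-office content search -->
  <IndexSet SetName="InternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Internal/"/>
  <!-- Used by the back-office member search (index path is an assumption) -->
  <IndexSet SetName="InternalMemberIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/InternalMember/"/>
  <!-- Intended for the public site search -->
  <IndexSet SetName="ExternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/External/"/>
</ExamineLuceneIndexSets>
```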

You can customize an index set by specifying which document types and properties are indexed.

A sample customized IndexSet will look like this:

<IndexSet SetName="ExternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/External/">
  <IndexAttributeFields>
    <!-- Set here all page properties that we want to be indexed. -->
    <add Name="id" />
    <add Name="version" />
    <add Name="parentID" />
    <add Name="writerID" />
    <add Name="creatorID" />
  </IndexAttributeFields>
  <IndexUserFields>
    <!-- Set here all site custom properties that we want to be indexed. -->
    <add Name="testTitle" EnableSorting="true" />
    <add Name="testDescription" EnableSorting="true" />
  </IndexUserFields>
  <IncludeNodeTypes>
    <!-- Set here all site document types that we want to be indexed. -->
    <add Name="Test"/>
  </IncludeNodeTypes>
  <ExcludeNodeTypes>
    <!-- Set here all site document types that we want to NOT be indexed. -->
  </ExcludeNodeTypes>
</IndexSet>

For more details on configuring IndexSet, visit https://github.com/Shazwazza/Examine/wiki/IndexSet

  2. Examine Index Providers and Examine Search Providers

Once the index has been defined, you must tell Examine what to do with it by configuring an indexer and a searcher.

Both are configured in the /Config/ExamineSettings.config file, which contains two main sections: ExamineIndexProviders and ExamineSearchProviders.

A sample Index Providers element looks like this:

<add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"/>

You can extend it with additional properties such as:

  1. dataService - the type that this provider instantiates to query Umbraco for the data it requires. Generally, this doesn’t need to change unless you want to use test data from a non-Umbraco source or you have specific custom requirements.
  2. indexSet - explicitly specifies the index set to use. Generally, this is wired up based on naming conventions.
  3. supportUnpublished - set this if you want the indexer to index content that is not published.
  4. supportProtected - set this if you want the indexer to index content that is protected.
  5. runAsync - processes the queue files into the index asynchronously. Unless you are testing, this should always be true.
  6. interval - defines how often, in seconds, the asynchronous service processes the file queue.
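As a rough illustration of the properties listed above, an index provider entry with each attribute spelled out might look like the following. The values are assumptions chosen for illustration: the dataService type is the one commonly seen in older default Umbraco configurations, and the 10-second interval is arbitrary. Treat it as a sketch to adapt, not as a recommended configuration.

```xml
<!-- Illustrative values only; adjust or remove attributes to suit your site. -->
<add name="ExternalIndexer"
     type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
     dataService="UmbracoExamine.DataServices.UmbracoDataService, UmbracoExamine"
     indexSet="ExternalIndexSet"
     supportUnpublished="false"
     supportProtected="false"
     runAsync="true"
     interval="10"/>
```

With supportUnpublished and supportProtected both set to false, the index should contain only content that anonymous visitors are allowed to see, which is usually what a public site search needs.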

A default external searcher element looks like this:

<ExamineSearchProviders defaultProvider="ExternalSearcher">
  <providers>
    <add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />
  </providers>
</ExamineSearchProviders>

You can customize the search provider settings by providing values for properties such as enableLeadingWildcard and analyzer.
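For example, a searcher with an explicit analyzer could be sketched as follows. The analyzer type string is the one this post uses later for the MultiIndexSearcher; whether the standard analyzer is the right choice depends on your content, so take this as one possible setup.

```xml
<ExamineSearchProviders defaultProvider="ExternalSearcher">
  <providers>
    <!-- The analyzer attribute controls how search terms are tokenized at query time. -->
    <add name="ExternalSearcher"
         type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
         analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
  </providers>
</ExamineSearchProviders>
```

As a general Lucene guideline, the searcher should use the same analyzer as the indexer that built the index, otherwise query terms may not match the indexed tokens.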

Viewing Examine Index Configurations on Umbraco

To view the Examine index configurations in the Umbraco back office, go to the Developer section.

Configuring to Search Content Under PDF Files

As mentioned earlier, the default Examine search covers only Umbraco nodes; it does not search the content of Umbraco media files such as PDFs. To search the content of PDF files, you need to install an additional NuGet package called UmbracoExamine.PDF.

If you see errors related to ‘Newtonsoft.Json’ or ‘System.Threading.Tasks.Dataflow’ while installing the package, delete the references to these packages from the packages.config file.

After the NuGet package has been installed successfully, you will see that the ~\config\ExamineIndex.config and ~\config\ExamineSettings.config files have been updated with the following configuration elements:

  • ~\config\ExamineIndex.config

<IndexSet SetName="PDFIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/PDFs"/>

  • ~\config\ExamineSettings.config

Index Provider

<add name="PDFIndexer" type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF" extensions=".pdf" umbracoFileProperty="umbracoFile"/>

Search Provider

<add name="PDFSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"/>

Ideally, we don’t want separate searches for Umbraco nodes and PDF files. To combine both into a single set of search results, we need to create a MultiIndexSearcher in the ~\config\ExamineSettings.config file.

A sample MultiIndexSearcher element looks like this:

<add name="ContentAndPdfSearcher" type="Examine.LuceneEngine.Providers.MultiIndexSearcher,Examine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer,Lucene.Net" enableLeadingWildcards="true" indexSets="ExternalIndexSet,PDFIndexSet"/>
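To show where these pieces live, here is a trimmed sketch of ~\config\ExamineSettings.config with the external, PDF and combined providers in place. The provider entries are the ones shown in this post; the Examine root element is an assumption based on a typical Umbraco installation, and the internal back-office providers are left out for brevity.

```xml
<!-- Assumed root element; internal (back-office) providers omitted for brevity. -->
<Examine>
  <ExamineIndexProviders>
    <providers>
      <!-- Indexes Umbraco content into ExternalIndexSet -->
      <add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"/>
      <!-- Added by the UmbracoExamine.PDF package; indexes PDF media into PDFIndexSet -->
      <add name="PDFIndexer" type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF"
           extensions=".pdf" umbracoFileProperty="umbracoFile"/>
    </providers>
  </ExamineIndexProviders>
  <ExamineSearchProviders defaultProvider="ExternalSearcher">
    <providers>
      <add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"/>
      <add name="PDFSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"/>
      <!-- Searches both index sets and returns one combined result list -->
      <add name="ContentAndPdfSearcher"
           type="Examine.LuceneEngine.Providers.MultiIndexSearcher,Examine"
           analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer,Lucene.Net"
           enableLeadingWildcards="true"
           indexSets="ExternalIndexSet,PDFIndexSet"/>
    </providers>
  </ExamineSearchProviders>
</Examine>
```

Note that the names in the indexSets attribute must match the SetName values defined in ExamineIndex.config.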

Now, change the code in the search controller along the lines of the following sample:

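// Resolve the combined searcher defined in ExamineSettings.config.
var searcher = ExamineManager.Instance.SearchProviderCollection["ContentAndPdfSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
// Match the keyword against any of the listed fields.
var query = searchCriteria.GroupedOr(new[] { "nodeName", "name", "title", "body", "FileTextContent" }, filter.Keyword).Compile();
var searchResults = searcher.Search(query);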

This code looks for the search keyword in the node name, name, title and body fields and in the PDF content (assuming the ‘name’, ‘title’ and ‘body’ properties exist). The text extracted from a PDF is stored in the ‘FileTextContent’ field.

Since PDF search results and regular content search results have different properties, we could use the sample code below to build the result set for the summary page:

// "pages" is assumed to hold the search results to display on the summary page.
Items = new List<SearchResultItem>();
foreach (var item in pages)
{
    if (item.Fields.ContainsKey("FileTextContent"))
    {
        // PDF hit: look up the media item and summarize the extracted text.
        var node = helper.TypedMedia(item.Fields["__NodeId"]);
        Items.Add(new SearchResultItem()
        {
            Title = node.Name,
            Url = node.Url,
            Summary = StringHelpers.Truncate(item.Fields["FileTextContent"] ?? string.Empty, 300)
        });
    }
    else
    {
        // Content hit: look up the content node and summarize the body field, if present.
        var node = helper.TypedContent(item.Fields["id"]);
        Items.Add(new SearchResultItem()
        {
            Title = item.Fields["title"],
            Url = node.Url,
            Summary = item.Fields.ContainsKey("body") ? StringHelpers.Truncate(item.Fields["body"] ?? string.Empty, 300) : MvcHtmlString.Empty
        });
    }
}

By installing the additional NuGet package and applying simple configuration settings such as those shown above, you can enjoy a fully functional keyword-based site search that covers not only Umbraco node content but also the text inside stored PDF files!

Shwetha Bhat | Blogger

Manjunath Govindappa | ASP.NET Technical Lead
