Solr Data Import Handler for Search in XML Files

By Raghavendran Pedapati

Apache Solr is a web application built around Lucene. Lucene is a powerful search library that provides full-text indexing. The key to Lucene's search is its inverted index, a keyword-centric data structure that maps word -> pages rather than page -> words.
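
To make the word -> pages idea concrete, here is a minimal Python sketch of an inverted index. It only illustrates the data structure and is not how Lucene implements it; the function name `build_inverted_index` and the sample pages are assumptions made up for this example.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each lowercased word to the sorted list of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        # Naive whitespace tokenization; a real analyzer does far more.
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample data: page id -> page text.
pages = {
    "page1": "solr is built around lucene",
    "page2": "lucene provides full-text indexing",
}

index = build_inverted_index(pages)
print(index["lucene"])  # pages that contain the word "lucene"
```

Looking up a word is then a single dictionary access, which is why a keyword query does not have to scan every page.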

Solr takes advantage of Lucene's strengths, including the inverted index, spellchecking, hit highlighting, and advanced analysis and tokenization, and builds on them to offer a powerful search application through the Solr API. One of Solr's advanced features is faceting, i.e. grouping search results into categories and showing a numerical count of matches for each key term.
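
As a rough illustration of what faceting produces, the Python sketch below counts how many results fall under each value of a field. It is a toy model of the idea, not the Solr API; the `facet_counts` helper and the `format` field are hypothetical.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


# Hypothetical search results, each a dict of field -> value.
results = [
    {"title": "A", "format": "xml"},
    {"title": "B", "format": "pdf"},
    {"title": "C", "format": "xml"},
]

facets = facet_counts(results, "format")
for value, count in facets.most_common():
    print(value, count)
```

A search UI would typically show these value/count pairs next to the result list so the user can narrow the search.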

This makes Solr a very attractive platform for programmers who want to develop sophisticated and efficient search applications, since it also makes scaling and distribution easier.

The DataImportHandler (DIH) is a mechanism for importing structured data from a data store into Solr. It is often used with relational databases, but it can also handle XML with its XPathEntityProcessor. We can pass incoming XML through an XSL stylesheet, as well as parse and transform the XML with built-in DIH transformers. We could translate our arbitrary XML to Solr’s standard input XML format via XSL, or map and transform the arbitrary XML to the Solr schema fields directly in the DIH config file, or use a combination of both. DIH is flexible.
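
The idea of mapping arbitrary XML onto schema fields by XPath can be sketched in plain Python. This is only a conceptual stand-in for what the DIH does from its config file, not DIH itself; the sample XML, the field names, and the `xml_to_docs` helper are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: schema field name -> path relative to each record.
FIELD_XPATHS = {
    "id": "./id",
    "title": "./meta/title",
    "author": "./meta/author",
}

# Hypothetical input XML with one <book> element per record.
SAMPLE = """<library>
  <book><id>1</id><meta><title>Solr Basics</title><author>A. Writer</author></meta></book>
  <book><id>2</id><meta><title>XML Search</title></meta></book>
</library>"""


def xml_to_docs(xml_text, record_path, field_xpaths):
    """Turn each record element into a flat dict of field -> text."""
    root = ET.fromstring(xml_text)
    docs = []
    for record in root.findall(record_path):
        doc = {}
        for field, path in field_xpaths.items():
            node = record.find(path)
            # Skip fields that are missing or empty in this record.
            if node is not None and node.text:
                doc[field] = node.text.strip()
        docs.append(doc)
    return docs


docs = xml_to_docs(SAMPLE, "./book", FIELD_XPATHS)
for doc in docs:
    print(doc)
```

Each resulting dict corresponds to one document to be indexed; keeping the mapping in a separate table mirrors the way the field-to-XPath mapping lives in configuration rather than in code.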

Here I will discuss how to deploy the Solr DIH to search XML files.
