Solr Data Import Handler for Search in XML Files

By Raghavendran Pedapati

Apache Solr is a web application built around Lucene. Lucene is a powerful search library that provides full-text indexing. The significant aspect of Lucene search is its inverted index, a keyword-centric data structure that maps word -> pages rather than page -> words.

Solr not only takes advantage of Lucene's features, such as the inverted index, spellchecking, hit highlighting, and advanced analysis/tokenization capabilities, but also exposes them as a search application through the Solr API. One of Solr's advanced features is faceting, i.e. grouping search results into categories with a numerical count for each key term.

This makes Solr a good fit for programmers who want to develop sophisticated and efficient search applications, and it also makes scaling and distribution easier.

The DataImportHandler (DIH) is a mechanism for importing structured data from a data store into Solr. It is often used with relational databases, but it can also handle XML with its XPathEntityProcessor. We can pass incoming XML to an XSL stylesheet, as well as parse and transform the XML with built-in DIH transformers. We could translate our arbitrary XML to Solr’s standard input XML format via XSL, map or transform the arbitrary XML to the Solr schema fields directly in the DIH config file, or use a combination of both. DIH is flexible.

Here I will discuss how to set up the Solr DIH to index and search XML files.

Ensure that the JDK is installed on your system and that JAVA_HOME is set appropriately. Then install Solr 5.0.0 or later. Once the installation is complete, go to the Solr root directory and then into the bin folder.

Step 1:

Start Solr from the bin folder using its start command.
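
As a minimal sketch of this step, assuming the current directory is the Solr bin folder and that the default port 8983 is free (both are assumptions, not stated in the article), the start command could look like this:

```shell
# Assumption: the current directory is <solr root>/bin and port 8983 is free.
SOLR_PORT=8983

# Start Solr in the background on the chosen port.
./solr start -p "$SOLR_PORT" \
  || echo "Could not start Solr; check that this is run from the bin folder" >&2
```

If the command succeeds, the Admin UI should be reachable at http://localhost:8983/solr under these assumptions.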

Step 2:
Create a collection named Manufactures.
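
One way to do this, sketched under the assumption that Solr is already running and that the current directory is the bin folder, is the `create` command of the Solr control script:

```shell
# Assumption: Solr is running and the current directory is <solr root>/bin.
CORE_NAME="Manufactures"

# In standalone mode this creates a core; in SolrCloud mode it creates a collection.
./solr create -c "$CORE_NAME" \
  || echo "Could not create $CORE_NAME; is Solr running?" >&2
```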

The Manufactures core now appears in the core selector, and the Solr Admin UI shows the statistics of the core.

Step 3:
Create an .xml file and place it in the Solr root directory.

The example file is named manufactures.xml.
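
The original listing is not reproduced here, so the following is only an illustrative stand-in. The element names (`manufactures`, `manufacturer`, `id`, `name`, `country`) and the two records are hypothetical placeholders chosen to show the kind of structure DIH can index:

```shell
# Hypothetical sample data; element names and values are placeholders for illustration.
# Run from the Solr root directory so the file lands where this step expects it.
cat > manufactures.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<manufactures>
  <manufacturer>
    <id>1</id>
    <name>Example Motors</name>
    <country>India</country>
  </manufacturer>
  <manufacturer>
    <id>2</id>
    <name>Sample Electronics</name>
    <country>Japan</country>
  </manufacturer>
</manufactures>
EOF
```

Each repeated `manufacturer` element is intended to become one Solr document.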


Step 4:

Configure solrconfig.xml

Open solrconfig.xml (for example with vim solrconfig.xml) and add the DataImportHandler configuration to it.
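
A typical DIH registration, sketched here as an assumption rather than the article's exact listing, loads the DIH jar and declares a request handler that points at the data import configuration file. The snippet file name `dih-handler-snippet.xml` is hypothetical; the fragment it contains is meant to be pasted inside the `<config>` element of solrconfig.xml:

```shell
# Writes and prints a fragment to paste inside the <config> element of solrconfig.xml.
# Assumptions: the DIH jars live in the dist/ folder of the Solr install, and the
# handler is exposed at /dataimport; adjust both to your setup.
cat > dih-handler-snippet.xml <<'EOF'
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- Name of the data import configuration file created in Step 5 -->
    <str name="config">Manufacturesconfig.xml</str>
  </lst>
</requestHandler>
EOF
cat dih-handler-snippet.xml
```

The heredoc delimiter is quoted so that the shell leaves `${solr.install.dir:...}` untouched for Solr to resolve.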


Step 5:

Data Import configuration

Create a file named Manufacturesconfig.xml to hold the data import configuration.
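
A sketch of what such a file could contain is shown below. It assumes a hypothetical input file whose root element is `manufactures` with repeated `manufacturer` children, and it uses `/opt/solr` as a placeholder for the Solr root directory. A `FileListEntityProcessor` picks the file from the base directory, and a nested `XPathEntityProcessor` maps XML elements to Solr fields:

```shell
# Assumption: CONF_DIR is the conf folder of the Manufactures core (for example
# server/solr/Manufactures/conf); it defaults to the current directory here.
CONF_DIR="${CONF_DIR:-.}"

cat > "$CONF_DIR/Manufacturesconfig.xml" <<'EOF'
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- Outer entity: lists the files to index. baseDir is a placeholder path. -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/opt/solr" fileName="manufactures\.xml"
            recursive="false" rootEntity="false" dataSource="null">
      <!-- Inner entity: one Solr document per element matched by forEach. -->
      <entity name="manufacturer" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}"
              forEach="/manufactures/manufacturer">
        <field column="id"      xpath="/manufactures/manufacturer/id" />
        <field column="name"    xpath="/manufactures/manufacturer/name" />
        <field column="country" xpath="/manufactures/manufacturer/country" />
      </entity>
    </entity>
  </document>
</dataConfig>
EOF
```

The `fileName` attribute is a regular expression, so a pattern such as `.*\.xml` would select a group of files instead of a single one.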


Note 1: The source folder should contain only one file, the .xml file to index. To index more than one file, either give the specific path and name of each file, or specify the exact file name or a group of files in the data config file.

Note 2: The base directory should be the Solr root installation directory (<solr_installation_root_dir>).

Note 3: The forEach path has to be changed according to the structure of the .xml file.

Step 6:

Configure managed-schema
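
The schema needs a field for every column that the data import configuration produces. As a hedged sketch, assuming hypothetical `name` and `country` columns and an `id` field that already exists in the schema (as it does in Solr's stock configsets), the definitions to add inside the `<schema>` element of managed-schema might look like this. The snippet file name is hypothetical:

```shell
# Writes and prints field definitions to add inside the <schema> element of managed-schema.
# The field names are hypothetical and must match the column names in the data config.
cat > schema-fields-snippet.xml <<'EOF'
<field name="name"    type="string" indexed="true" stored="true" />
<field name="country" type="string" indexed="true" stored="true" />
EOF
cat schema-fields-snippet.xml
```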


Step 7:

Restart Solr.
Go to the bin folder and use the restart command.
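
Assuming Solr was started on the default port 8983 (an assumption; use whatever port your instance runs on), the restart could be sketched as:

```shell
# Assumption: the current directory is <solr root>/bin and Solr runs on port 8983.
SOLR_PORT=8983

# Stop and start the instance so the changed configuration files are reloaded.
./solr restart -p "$SOLR_PORT" \
  || echo "Could not restart Solr; check the port and the current directory" >&2
```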

Step 8:
Go to the Admin console URL and click on DataImportHandler.

Step 9:
Select full-import and then click Execute.
The imported data now appears in the Admin UI console, and search operations can be run against it.
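
The same import and a first search can also be driven over HTTP instead of through the Admin UI. The sketch below assumes Solr on localhost:8983, a core named Manufactures, and a DIH handler registered at `/dataimport`; all three are assumptions about the local setup:

```shell
# Assumptions: Solr on localhost:8983, core named Manufactures, handler at /dataimport.
SOLR_URL="http://localhost:8983/solr/Manufactures"

# Trigger a full import (the same action as the Execute button in the Admin UI).
curl -s "$SOLR_URL/dataimport?command=full-import" \
  || echo "Solr is not reachable at $SOLR_URL" >&2

# The import runs asynchronously, so poll its status.
curl -s "$SOLR_URL/dataimport?command=status" || true

# Query all indexed documents.
curl -s "$SOLR_URL/select?q=*:*&wt=json" || true
```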

Note: To import data from a relational database, place the JDBC driver configuration in the data config file.
This applies, for example, to an HSQL database.
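
As an illustration only, a data config for HSQLDB could look like the sketch below. The database path, the `manufacturer` table, and its columns are hypothetical, and the snippet file name is a placeholder. The HSQLDB driver jar is assumed to be on Solr's classpath:

```shell
# Hypothetical JDBC data config; path, table, and column names are placeholders.
cat > db-data-config-snippet.xml <<'EOF'
<dataConfig>
  <!-- The driver jar must be on Solr's classpath, e.g. via a <lib> entry in solrconfig.xml. -->
  <dataSource type="JdbcDataSource"
              driver="org.hsqldb.jdbcDriver"
              url="jdbc:hsqldb:file:/path/to/hsqldb/ex"
              user="sa" />
  <document>
    <!-- One Solr document per row returned by the query. -->
    <entity name="manufacturer" query="SELECT id, name, country FROM manufacturer">
      <field column="ID"      name="id" />
      <field column="NAME"    name="name" />
      <field column="COUNTRY" name="country" />
    </entity>
  </document>
</dataConfig>
EOF
cat db-data-config-snippet.xml
```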