Wednesday 28 September 2011

Indexing an Oracle database with Solr 3.4 and integrating Solr with Tika



  1. Download Tomcat 5.5.33 from here.
  2. Install Tomcat (no special instructions here; just run the install and select whichever directory you wish to install to).
  3. Start Tomcat by running startup.sh in the bin directory.
  4. Verify the Tomcat installation by going to http://localhost:8080
  5. Download Solr from one of the mirrors found here (I downloaded the apache-solr-3.4.0-src.tgz package) and extract the package. e.g. Solr is extracted at /home/abashetti/Downloads/apache-solr-3.4.0/
  6. Open a terminal and go to the extracted Apache Solr folder. e.g. cd /home/abashetti/Downloads/apache-solr-3.4.0/solr
  7. Create the Solr war by running the ant commands: ant clean, ant compile and ant dist.
  8. ant dist creates the *solr*.war in the */solr/dist/ folder. e.g. the path for the war file is (/home/abashetti/Downloads/apache-solr-3.4.0/solr/dist).
  9. To use the DataImportHandler functionality, add the apache-solr-dataimporthandler and apache-solr-dataimporthandler-extras jars to the Solr lib.
  10. The apache-solr-dataimporthandler and apache-solr-dataimporthandler-extras jars are available at */apache-solr-3.4.0/solr/contrib/dataimporthandler/target/
    e.g. the path I copied the jar files from is (/home/abashetti/Downloads/apache-solr-3.4.0/solr/contrib/dataimporthandler/target/)
    and the Solr lib path is (/home/abashetti/Downloads/apache-solr-3.4.0/solr/lib).
  11. Apache Tika is used to extract the text from various document formats. Download Apache Tika from here.
  12. Build the Apache Tika source code using Maven. For the Maven setup, read here.
  13. Copy the jar files named tika-app, tika-bundle, tika-core and tika-parsers from target to the Solr lib. In my case the Solr lib path is (/home/abashetti/Downloads/apache-solr-3.4.0/solr/lib).
  14. Create the Solr war again after adding the jars by running ant clean, ant compile and ant dist.
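
Steps 9 and 10 can be scripted. The following is a minimal sketch, not the exact commands used for this post: copy_dih_jars is a hypothetical helper name, and both directory arguments are assumptions to be replaced with the paths on your machine. A glob is used because the jar file names produced by the build may carry a version suffix.

```shell
# Copy the DataImportHandler jars produced by the build into the Solr lib directory.
# copy_dih_jars is a hypothetical helper; both arguments are paths you supply.
copy_dih_jars() {
    src_dir="$1"   # e.g. */solr/contrib/dataimporthandler/target
    lib_dir="$2"   # e.g. */solr/lib
    mkdir -p "$lib_dir" || return 1
    # The glob matches both the dataimporthandler and the dataimporthandler-extras jars.
    cp "$src_dir"/apache-solr-dataimporthandler*.jar "$lib_dir"/
}

# Example call with the paths from this post, followed by the rebuild from step 14:
#   copy_dih_jars /home/abashetti/Downloads/apache-solr-3.4.0/solr/contrib/dataimporthandler/target \
#                 /home/abashetti/Downloads/apache-solr-3.4.0/solr/lib
#   ant clean && ant compile && ant dist
```

The function returns a non-zero status if no matching jar is found, which usually means the build has not been run yet or the target path is different in your Solr version.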

  1. Create a directory for Solr. This is the SOLR HOME, from which Solr will be hosted
    (e.g. /home/abashetti/Downloads/solr).
  2. Copy the files and folders from the path /home/abashetti/Downloads/apache-solr-3.4.0/solr/example/solr to your SOLR HOME. e.g. the destination path is
    (/home/abashetti/Downloads/solr/).
  3. Visit http://localhost:8080/solr/admin to make sure everything is still running.
  4. Go to the conf directory of your SOLR HOME (e.g. /home/abashetti/Downloads/solr/conf), which was copied from /home/abashetti/Downloads/apache-solr-3.4.0/solr/example/solr/conf.
  5. Create a file named data-config.xml. Add the database connection information and the query
    to this file.
  6. Configure the data sources in data-config.xml:

<dataConfig>
  <!-- JDBC connection to Oracle; replace url, user and password with your own values -->
  <dataSource name="ds-db" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@127.0.0.1:1521:test" user="root" password="root"/>
  <!-- Binary file source from which Tika reads the documents -->
  <dataSource name="ds-file" type="BinFileDataSource"/>
  <document name="documents">
    <entity name="document" dataSource="ds-db" query="select distinct
        doc.document_id as id,
        doc.title,
        doc.author,
        doc.publisher,
        (case when doc.content_format_code not in('doc','pdf','xml','txt','ppt','xls') then
          ( select path.document_path from document_path path where path.doc_id = doc.id )
        else
          ''
        end) contentpath
        from ds_document_c doc
        where doc.index_state_modification_date >= to_date('${dataimporter.request.lastIndexDate}', 'DD/MM/YYYY HH24:MI:SS')" transformer="DateFormatTransformer">

      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="author" name="author"/>
      <field column="publisher" name="publisher"/>

      <!-- Nested inside "document" so that ${document.CONTENTPATH} resolves for each row -->
      <entity name="textEntity" processor="TikaEntityProcessor" url="${document.CONTENTPATH}" dataSource="ds-file" format="text" onError="continue">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Substitute the database username and password with your database credentials.
  1. Add the location of data-config.xml to solrconfig.xml under the DataImportHandler section.
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
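
Before deploying, it can help to confirm that both files are in place. The following is a minimal sketch under the assumption that data-config.xml and solrconfig.xml live in the same conf directory; check_dih_config is a hypothetical helper name, not part of Solr.

```shell
# Sanity-check a Solr conf directory before starting Tomcat.
# check_dih_config is a hypothetical helper; pass your own conf directory as the argument.
check_dih_config() {
    conf_dir="$1"
    # data-config.xml is referenced by its bare file name, so it should sit next to solrconfig.xml.
    [ -f "$conf_dir/data-config.xml" ] || { echo "missing data-config.xml"; return 1; }
    # The handler class must be registered in solrconfig.xml.
    grep -qF 'org.apache.solr.handler.dataimport.DataImportHandler' "$conf_dir/solrconfig.xml" 2>/dev/null \
        || { echo "DataImportHandler not registered"; return 1; }
    echo "ok"
}

# Example call with the SOLR HOME used in this post:
#   check_dih_config /home/abashetti/Downloads/solr/conf
```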


  1. Edit the schema.xml file. The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.

<field name="id" type="integer" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="author" type="string" indexed="true" stored="true" />
<field name="publisher" type="string" indexed="true" stored="true" />
<field name="text" type="text" indexed="true" stored="true" />

Find the “<uniqueKey>” node and change it to: <uniqueKey>id</uniqueKey>

Find the “<defaultSearchField>” node and change it to: <defaultSearchField>text</defaultSearchField>

Delete all the “<copyField>” nodes.
  1. Copy the *solr*.war file from the dist directory of the extracted Solr package to your Tomcat webapps folder.
  2. Rename the *solr*.war file to solr.war.
  3. Specify the Solr home in catalina.sh:
    JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/home/abashetti/Downloads/solr"
  4. Add the above line just below the line JAVA_OPTS="$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager"
  5. Copy the ojdbc6.jar file to the path */apache-tomcat-5.5.33/common/lib
  6. Now go to http://localhost:8080/solr/admin/dataimport.jsp and click on the /DATAIMPORT link. You will see the dataimporter console. Click the “Full Import With Cleaning” button to start indexing. Click the status button to see the progress of the indexing: while indexing is in progress the status is shown as “busy”, otherwise as “Indexing completed for “number” of documents”.
  7. Once the indexing is completed, go to http://localhost:8080/solr/admin
and click on the search button to check the result.
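
The import can also be started without the console by calling the /dataimport handler directly. The following is a minimal sketch, assuming Solr is deployed at http://localhost:8080/solr and that the query in data-config.xml wraps the lastIndexDate placeholder in single quotes (if it does not, the quotes would have to be part of the parameter value). The function names and the example date are hypothetical.

```shell
# Base URL of the handler registered as /dataimport (host and port are assumptions).
DIH_URL="${DIH_URL:-http://localhost:8080/solr/dataimport}"

# Percent-encode the characters that occur in a DD/MM/YYYY HH24:MI:SS timestamp.
encode_date() {
    printf '%s' "$1" | sed -e 's/ /%20/g' -e 's|/|%2F|g' -e 's/:/%3A/g'
}

# Build the URL for a full import with cleaning. The lastIndexDate request parameter
# is what data-config.xml reads as ${dataimporter.request.lastIndexDate}.
full_import_url() {
    printf '%s?command=full-import&clean=true&lastIndexDate=%s' "$DIH_URL" "$(encode_date "$1")"
}

# Build the URL that reports the progress of the import.
status_url() {
    printf '%s?command=status' "$DIH_URL"
}

# Example usage (requires curl and a running Solr; the date is only an example value):
#   curl "$(full_import_url '28/09/2011 00:00:00')"
#   curl "$(status_url)"
```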

5 comments:

  1. there's no folder named as target : Help plz

    Replies
    1. I have used solr 3.4... and this setup is for the same version. which version you are using....?

    2. if you are using solr 4.3.1 then the data-import handler jars would be available at path "*/solr-4.3.1/solr/build/contrib".

    3. Are you still facing the same issue..?

  2. Which version of solr you are using ...?
