Index and Search External Data
Notice: This tutorial complements the previous one; please read both of them (see Index and Search your Models).
When the data you want to search is not managed by your app, you can index and search it as easily as if it were in your own DB. Indeed, you can use elasticsearch as if it were a sort of DB itself, managing its indices and types with your models as you do with databases and tables.
In this tutorial we will create a small Rails app that will crawl this very documentation and index all its content with the elasticsearch-mapper-attachments plugin. It will do so without using any DB, using only one elasticsearch index, managed by a simple model. Then we will add a search form that will search the elasticsearch index, highlight the results and link to the original documentation page, so you will also learn how to use the results that you pull from your searches.
Prerequisite
Elasticsearch must be installed and running. If you are on a Mac you can just install it with homebrew:
$ brew install elasticsearch
If you are on any other OS, read elasticsearch installation.
For the purpose of this tutorial, you also need to install the elasticsearch-mapper-attachments plugin. The command should be something like the following (but please check the latest version on its github page):
$ bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.7.0
Setup
Create a rails app:
$ rails new flex_doc_search --skip-active-record --skip-bundle
We don’t need any database for this tutorial (hence --skip-active-record), and we will run bundle install later (hence --skip-bundle).
Now open the Gemfile and add a few gems that we will use in this tutorial in order to crawl and index the flex-doc site, search it and paginate the results:
gem 'rest-client'
gem 'flex-rails'
gem 'anemone'
gem 'kaminari'
Now run the bundle install command:
$ bundle install
When it finishes, run the generator:
$ rails generate flex:setup
# press return when asked
Model and Index
First, we add a Flex::ActiveModel model that will manage the pages we crawl. It will use the elasticsearch index as if it were a database. Just add the file app/models/flex_doc_page.rb:
class FlexDocPage
  include Flex::ActiveModel
  attribute :url
  attribute_created_at
  attribute_attachment
end
The attribute :url is a custom attribute that we will use to store the original url of any crawled page.
The attribute_created_at will automatically add a field with the creation date.
The attribute_attachment is a special attribute that integrates the elasticsearch-mapper-attachments plugin: it can index various types of content, like html, pdf, word, excel, etc. (see Flex::ActiveModel Attributes).
Now we must add FlexDocPage to the flex_active_models array in config/initializers/flex.rb, because flex needs to know about it in order to create the index/indices:
config.flex_active_models |= [ FlexDocPage ]
Create the index with the rake task:
$ rake flex:index:create
Crawler
Now we can create the rake task that will crawl the flex-doc site and index its content. Just create the file lib/tasks/crawler.rake and paste the following content in it:
desc 'Crawl and index the Flex Doc Site'
task :index_flex_doc => :environment do
  puts "Crawling The Flex Doc site:"
  # delete all the pages we may already have in the index, so they will be fresh each time we re-crawl
  FlexDocPage.delete
  Anemone.crawl('http://ddnexus.github.io/flex/', :verbose => true) do |anemone|
    anemone.on_every_page do |page|
      # index only the successful html pages with some content
      if page.code == 200 && page.url.to_s =~ /\.html$/ && page.body.length > 0
        # remove the common parts not useful for searching
        %w[#page-nav #footer .breadcrumb].each{|css| page.doc.css(css).remove}
        # we index the page content by just passing it as a Base64 encoded string
        FlexDocPage.create :url        => page.url.to_s,
                           :attachment => Base64.encode64(page.doc.to_s)
      end
    end
  end
end
Most of the code in the task is related to the anemone crawling, which will fetch the pages. The important part is the FlexDocPage.create statement, where we create a new FlexDocPage document, as we would do in order to store the content in a DB: that will analyze and store the page content into the index.
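The filtering and encoding steps of the task can be tried in isolation, without a crawler or an index. The following is only a minimal sketch: indexable? is a hypothetical helper (it is not part of Flex or anemone) that mirrors the condition used in the task, and the HTML string stands in for a crawled page.

```ruby
require 'base64'

# Hypothetical helper (not part of Flex or anemone): mirrors the condition
# used in the rake task, keeping only successful, non-empty .html pages.
def indexable?(code, url, body)
  code == 200 && url.end_with?('.html') && !body.empty?
end

# A stand-in for the markup of a crawled page.
html = '<html><head><title>Demo</title></head><body>Hello</body></html>'

# This is the kind of string the task passes as :attachment.
encoded = Base64.encode64(html)

# Decoding gives back the original markup, so nothing is lost by encoding.
decoded = Base64.decode64(encoded)

puts indexable?(200, 'http://example.com/doc/index.html', html)
puts decoded == html
```

The encoding is only a transport format: the content that gets analyzed and searched is the decoded document.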
Run the task to crawl and index the whole Flex-Doc site:
$ rake index_flex_doc
Check the Index
Now we have all the content in the index managed by our application, so let’s play with it in the console.
Let’s start by checking the existence of the index:
>> Flex.exist?
FLEX-INFO: Rendered Flex.exist? from: (irb):1:in `irb_binding'
FLEX-DEBUG: :request:
FLEX-DEBUG: :method: HEAD
FLEX-DEBUG: :path: /flex_doc_search_development
=> true
As you can see, each time flex sends a request, it prints it on your screen. That’s useful for checking the actual request sent to the elasticsearch server. This is just the default basic logging: you may want to disable it, or you may need more info. You can experiment with different logger settings by changing them in the console. For example, to watch the raw result returned by elasticsearch, just type Flex::Conf.logger.debug_result = true right in the console (see Configuration).
>> FlexDocPage.count
FLEX-INFO Rendered Flex::Scope::Query.get from: (irb):2:in `irb_binding'
FLEX-DEBUG :request:
FLEX-DEBUG :method: GET
FLEX-DEBUG :path: /flex_doc_search_development/flex_doc_page/_search?version=true&search_type=count
FLEX-DEBUG :data:
FLEX-DEBUG query:
FLEX-DEBUG query_string:
FLEX-DEBUG query: ! '*'
=> 30
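The logged request above is an ordinary elasticsearch search request. As a rough sketch, its path and JSON body can be rebuilt in plain Ruby from the values shown in the log; the variable names are only illustrative, and nothing is actually sent to the server.

```ruby
require 'json'

# Values copied from the logged request above.
index = 'flex_doc_search_development'
type  = 'flex_doc_page'

# The path and the request data as they appear in the log.
path = "/#{index}/#{type}/_search?version=true&search_type=count"
body = { 'query' => { 'query_string' => { 'query' => '*' } } }

puts "GET #{path}"
puts JSON.generate(body)
```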
You can also get an indexed page, for example:
>> FlexDocPage.first
... long Base64 encoded content ...
And if you want to avoid the encoded content, just use the attachment_scope:
>> FlexDocPage.attachment_scope.first
...
>> FlexDocPage.attachment_scope.all
...
>> FlexDocPage.attachment_scope.all(:page => 2)
...
Searching
Now that we have an index, let’s add a custom scope to our FlexDocPage model: we will use it to search the index easily. Let’s call it :searchable. Here is how the model will appear after adding our scope:
class FlexDocPage
  include Flex::ActiveModel
  attribute :url
  attribute_created_at
  attribute_attachment
  scope :searchable do |q|
    attachment_scope
      .highlight(:fields => { :attachment         => {},
                              :'attachment.title' => {} })
      .query(q)
  end
end
In the block of our scope we have chained together a few predefined scopes (see flex-scopes).
The attribute_attachment declaration (that we added to the FlexDocPage model) added the attachment_scope to our class: we can use it to also include the meta-fields of the page (like title, author, content-type, etc.) in our result and, as we saw in the console session above, it will also exclude the attachment field itself from the result, so it will be easier to handle (see attribute_attachment).
The attachment field contains the Base64-encoded content, and unless we need to display it in some view, we can omit it from the result (indeed, in this tutorial we will link to the original page). However, although we omit it from the result, it is still searched by our queries.
The highlight scope will add the highlights for the attachment and the attachment.title fields to our results, and the query scope will handle the query_string that we will use to search (passed to the scope as an argument).
Let’s try it in the console:
>> my_scope = FlexDocPage.searchable('flex')
=> #<Flex::Scope ...>
Notice that the searchable scope is like any other scope (predefined or custom), so it is chainable with other scopes as well if we want to modify our search criteria. If we need the actual results from that scope, we should call one of the query-scopes, like all, first, last, etc. (see Query Scopes):
>> result = my_scope.all
=> ...
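The difference between chaining scopes and calling a query-scope can be illustrated with a toy class. This is only an illustration of the general pattern, not Flex's implementation: MiniScope and its methods are invented for this sketch, and its all method simply returns the collected parameters instead of querying elasticsearch.

```ruby
# Toy illustration only: NOT Flex's implementation.
class MiniScope
  attr_reader :params

  def initialize(params = {})
    @params = params.freeze
  end

  # Each chaining method returns a new scope with the merged parameters,
  # leaving the receiver untouched.
  def query(q)
    MiniScope.new(params.merge(:query => q))
  end

  def highlight(opts)
    MiniScope.new(params.merge(:highlight => opts))
  end

  # A terminal method is the point where a real scope would send the request;
  # here it just returns the parameters collected so far.
  def all
    params
  end
end

base  = MiniScope.new
scope = base.highlight(:fields => { :attachment => {} }).query('flex')

p base.params
p scope.all
```

Because each step returns a new object, a scope can be stored, passed around and refined further before any request is made.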
User Interface
So let’s add the app/controllers/searches_controller.rb with just one action in it:
class SearchesController < ApplicationController
  def search
    return if params[:q].blank?
    @result = FlexDocPage.searchable(params[:q]).all(:page => params[:page])
  end
end
We will search the pages by passing the query string to our searchable scope. We also pass the special variable :page, which selects the right page of the paginated results.
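Elasticsearch itself paginates with a from offset and a size, so a page number has to be translated into that window at some point. The following sketch shows the usual arithmetic; page_window is a hypothetical helper, and the default of 25 results per page is an assumption (it is kaminari's default), not something taken from the Flex source.

```ruby
# Hypothetical helper: converts a 1-based page number into the
# from/size window used by elasticsearch search requests.
def page_window(page, per_page = 25)
  page = [page.to_i, 1].max   # nil, '' and 0 all fall back to the first page
  { :from => (page - 1) * per_page, :size => per_page }
end

p page_window(nil)
p page_window('3', 10)
```

The fallback to the first page matters in our controller, because params[:page] is nil on the first search.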
We make the search action the root route in config/routes.rb:
root 'searches#search'
Then we need to create the app/views/searches/search.html.erb, with a form, a loop to display all the paginated results, and the pagination header and footer. Let’s paste the following content in it:
<h3> Search the FlexDoc Site:</h3>
<%= form_tag(root_path, :method => 'GET', :id => 'search-form') do %>
  <%= text_field_tag(:q, params[:q]) %>
  <%= submit_tag('Search', :id => 'submit-search') %>
  <%= submit_tag('Reset', :name => 'reset', :id => 'reset-button' ) %>
<% end %>
<% if @result %>
  <div class="pagination">
    <%= page_entries_info(@result.collection, :entry_name => 'result') %>
  </div>
  <% @result.collection.each do |page| %>
    <div class="hit">
      <div class="attachment_title">
        <%= link_to page.highlighted_attachment_title, page.url, :target => '_blank' %>
      </div>
      <div class="attachment">
        <%= page.highlighted_attachment %>
      </div>
    </div>
  <% end %>
  <%= paginate(@result.collection) %>
<% end %>
Notice: the highlighted_* methods are special helpers generated by the flex-rails gem. They return the joined string from the elasticsearch highlights; if there are no highlights and the attribute exists, they return the attribute; in any other case they return an empty string (see document.highlighted_*).
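The fallback rule described in the notice can be written down as a small plain-Ruby function. This is only a sketch of the rule as stated above, not the code of the flex-rails gem: highlighted is a hypothetical stand-in, and the separator used to join the fragments is an assumption.

```ruby
# Hypothetical stand-in for the rule described above (not the flex-rails code).
# highlights: array of highlighted fragments, or nil
# attribute:  the plain attribute value, or nil
# separator:  assumed joiner between fragments
def highlighted(highlights, attribute, separator = ' ... ')
  # 1. highlights present: return them joined
  return highlights.join(separator) if highlights && !highlights.empty?
  # 2. no highlights: fall back to the attribute, or to an empty string
  attribute.nil? ? '' : attribute.to_s
end

puts highlighted(['using <em>flex</em>', 'the <em>flex</em> gem'], 'Overview')
puts highlighted(nil, 'Overview')
puts highlighted([], nil).inspect
```

This is why the view can call the helpers unconditionally: it always gets a string back, whether or not a given hit has highlights.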
Now let’s add just a minimum of CSS rules to make our search page easy to read. Let’s create the file app/assets/stylesheets/search.sass and paste the following content in it:
body
  margin: 3em
em
  background-color: yellow
  padding:
    left: .2em
    right: .2em
.hit
  margin-top: 2em
.attachment_title a
  font-weight: bold
.pagination
  margin-top: 2em
  border-top: 1px solid gray
  padding-top: .5em
  span
    margin-right: .5em
Job done! Now we can start the rails server, and point the browser to our new search app.