
Searching parts of words with elasticsearch

I figured out how to search parts of words with elasticsearch. If you are not familiar with the terms used in the documentation, this can be quite a challenge. I will show you the key parts of the configuration you need.

 

First, you need to know a little about the search engine. When you store documents, the text is indexed. This is done by tokenizing the text, basically chopping it up, and then running those chunks through one or more filters. When searching, the same happens to the search terms (tokenize / filter / search).

 

The tokenization and filtering for indexing and searching can be configured independently. Most of the time these configurations should be kept the same, as illustrated by this example that uses a lowercase filter:

  • text to be indexed: 'Lorem Ipsum'
  • tokenized: 'Lorem', 'Ipsum'
  • filtered: 'lorem', 'ipsum'

 

If you search for this text (using the lowercase filter again):

  • text you search for: 'Lorem'
  • tokenized: 'Lorem'
  • filtered: 'lorem'

 

Notice that both the indexed text and the search term are lowercase and will thus match. If you did not configure a lowercase filter at search time, this search would not return a result.

 
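You can see this tokenize / filter pipeline at work with the _analyze API. This is a minimal sketch using the JSON request syntax of recent Elasticsearch versions, with the built-in standard tokenizer and lowercase filter:

    POST /_analyze
    {
      "tokenizer": "standard",
      "filter": ["lowercase"],
      "text": "Lorem Ipsum"
    }

The response lists the tokens 'lorem' and 'ipsum', matching the example above.
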

When you want to search for parts of words, however, the filters used at index time differ from the filters used at search time. Here is the elasticsearch configuration for creating an index, which defines the filters to be used for indexing and for searching:

 


 
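What follows is a sketch of what such a configuration can look like, written in the create-index syntax of recent Elasticsearch versions. The index name 'my_index' and the analyzer names are illustrative; the filter name 'translation' is the one used in the explanation below, and the min_gram / max_gram values are just examples. The (1), (2) and (3) markers are not part of the JSON:

    PUT /my_index
    {
      "settings": {
        "index": {
          "max_ngram_diff": 20
        },
        "analysis": {
          "analyzer": {
            "index_analyzer": {                      (1)
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase", "translation"]
            },
            "search_analyzer": {                     (2)
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase"]
            }
          },
          "filter": {
            "translation": {                         (3)
              "type": "ngram",
              "min_gram": 1,
              "max_gram": 20
            }
          }
        }
      }
    }

The max_ngram_diff setting is only needed on Elasticsearch 7 and later, where the difference between max_gram and min_gram is limited to 1 by default.
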

The numbers correspond to the critical parts of the configuration (explained below).

  1. At index time, we use the translation filter, which is of type ngram and is declared at (3).
  2. Notice that for the search we DO NOT use the ngram filter.
  3. The declaration of the ngram filter.
The way the ngram filter works is that it explodes the token stream into more tokens: every substring of a token, from min_gram up to max_gram characters long, becomes a token of its own. The token 'lorem', for example, is exploded into these tokens (one row per starting position):

  • l, lo, lor, lore, lorem
  • o, or, ore, orem
  • r, re, rem
  • e, em
  • m

 

All of these are indexed, making it possible to find any part of the word lorem. The configuration parameters min_gram and max_gram indicate the shortest and longest token to be extracted from the input token. A min_gram of 2 and a max_gram of 4 would result in:
  • lo, lor, lore
  • or, ore, orem
  • re, rem
  • em

 

Limiting the range of gram sizes like this shortens the time required for indexing and keeps the index smaller, but it limits the parts of the words you are able to find.
 
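The _analyze API is again a convenient way to experiment with these parameters. This sketch defines the ngram filter inline, which recent Elasticsearch versions allow, and runs against the 'my_index' created above so its max_ngram_diff setting applies:

    POST /my_index/_analyze
    {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        { "type": "ngram", "min_gram": 2, "max_gram": 4 }
      ],
      "text": "Lorem"
    }

This returns exactly the nine tokens listed above.
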
To use the defined analyzers for a field, you would use this configuration in your mapping:
 

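Here is a sketch of such a mapping, again in the syntax of recent Elasticsearch versions. The field name 'title' is just an example; the analyzer names are the ones from the settings sketch above:

    PUT /my_index/_mapping
    {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "index_analyzer",
          "search_analyzer": "search_analyzer"
        }
      }
    }

The analyzer parameter is applied when the field is indexed, while search_analyzer is applied to the terms you search for, which gives you exactly the split described above: ngrams in the index, plain lowercased terms in the query.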