Talk: Elasticsearch basics and Rails integration with searchkick


What is Elastic?

  • Search engine based on Apache Lucene
  • Mostly open source
  • Some commercial features ("Elastic Stack", formerly "X-Pack")
    • Access control (until recently)
    • Monitoring
    • Reporting
    • Graph support
    • Machine learning
  • REST API (JSON over HTTP)

Basics

By default, Elastic listens on port 9200:

http GET :9200
{
  "name": "ntK2ZrY",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "Bbc-ix5bQZij5vfFU29-Cw",
  "version": {
    "number": "6.7.1",
    "build_flavor": "default",
    "build_type": "deb",
    "build_hash": "2f32220",
    "build_date": "2019-04-02T15:59:27.961366Z",
    "build_snapshot": false,
    "lucene_version": "7.7.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}

Indexes

Documents in Elastic live in an index (analogous to a SQL table).

Create an index:

http PUT :9200/my_documents

Inspect the index:

http GET :9200/my_documents
{
  "my_documents": {
    "aliases": {
    },
    "mappings": {
    },
    "settings": {
      "index": {
        "creation_date": "1556793804746",
        "number_of_replicas": "1",
        "number_of_shards": "5",
        "provided_name": "my_documents",
        "uuid": "Q9nNksJhT3Ky7SOqDQa4bA",
        "version": {
          "created": "6070199"
        }
      }
    }
  }
}

Documents

Index a document:

cat | http PUT :9200/my_documents/post/1
{
  "title": "A sample document",
  "body": "I wrote something",
  "author": {
    "name": "Tobias Kraze"
  }
}
{
  "_id": "1",
  "_index": "my_documents",
  "_primary_term": 1,
  "_seq_no": 0,
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "post",
  "_version": 1,
  "result": "created"
}

Searching

http GET :9200/my_documents/_search?q=tobias
{
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 5,
    "total": 5
  },
  "hits": {
    "hits": [
      {
        "_id": "1",
        "_index": "my_documents",
        "_score": 0.2876821,
        "_source": {
          "author": {
            "name": "Tobias Kraze"
          },
          "body": "I wrote something",
          "title": "A sample document"
        },
        "_type": "post"
      }
    ],
    "max_score": 0.2876821,
    "total": 1
  },
  "timed_out": false,
  "took": 1
}

http GET :9200/my_documents/_search?q=body:tobias

-> no results

http GET :9200/my_documents/_search?q=body:something

-> match

http GET :9200/my_documents/_search?q=tob

-> no results

http GET :9200/my_documents/_search?q=tob*

-> match
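
Why does `tob` find nothing while `tob*` matches? A text field is split into whole-word tokens at index time, and a plain query term has to equal one of those tokens. The Ruby sketch below models this with a toy in-memory inverted index. The tokenizer and the `build_index` and `search` helpers are assumptions made for illustration, not Elasticsearch's actual implementation.

```ruby
# Toy model of an inverted index. The tokenizer (lowercase, split on
# non-word characters) is a simplifying assumption, not Lucene's analyzer.
def tokenize(text)
  text.downcase.scan(/\w+/)
end

# Maps each token to the ids of the documents that contain it.
def build_index(documents)
  index = Hash.new { |hash, token| hash[token] = [] }
  documents.each do |id, text|
    tokenize(text).uniq.each { |token| index[token] << id }
  end
  index
end

# A plain term must equal a whole token; a trailing "*" matches token prefixes.
def search(index, query)
  term = query.downcase
  if term.end_with?('*')
    prefix = term.chomp('*')
    index.select { |token, _ids| token.start_with?(prefix) }.values.flatten.uniq
  else
    index.fetch(term, [])
  end
end

index = build_index({ 1 => 'Tobias Kraze', 2 => 'I wrote something' })

p search(index, 'tobias') # finds document 1
p search(index, 'tob')    # finds nothing: "tob" is not a token
p search(index, 'tob*')   # finds document 1 via prefix expansion
```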

Searching with a JSON body

For software, this is the less ambiguous way to build queries:

cat | http GET :9200/my_documents/_search
{
  "query": {
    "match": { "author.name": "tobias" }
  }
}

cat | http GET :9200/my_documents/_search
{
  "query": {
    "multi_match": { 
      "query": "wrote document",
      "fields": ["*"]
    }
  }
}
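
From Ruby, a search like this can be prepared with the standard library alone. This is only a minimal sketch: `BASE_URL`, `search_request` and `run_search` are names invented here, the URL assumes the default local installation shown earlier, and a real application would normally leave this work to a client gem. The sketch uses POST, which the search endpoint accepts as well as GET, because not every HTTP stack sends a body with GET.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumption: a default local installation, as in the examples above.
BASE_URL = 'http://localhost:9200'

# Builds (but does not send) a search request with a JSON body.
def search_request(index, query)
  uri = URI("#{BASE_URL}/#{index}/_search")
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate({ query: query })
  request
end

# Sends the request and parses the JSON response. Needs a running server,
# so it is defined here but not called.
def run_search(request)
  response = Net::HTTP.start(request.uri.host, request.uri.port) do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end

request = search_request(
  'my_documents',
  { multi_match: { query: 'wrote document', fields: ['*'] } }
)
puts request.path
puts request.body
```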

Elastic has a large number of different query types that can be combined freely:

  • Match All Query
  • Full text queries
    • Match Query
    • Match Phrase Query
    • Match Phrase Prefix Query
    • Multi Match Query
    • Common Terms Query
    • Query String Query
    • Simple Query String Query
    • Intervals query
  • Term level queries
    • Term Query
    • Terms Query
    • Terms Set Query
    • Range Query
    • Exists Query
    • Prefix Query
    • Wildcard Query
    • Regexp Query
    • Fuzzy Query
    • Type Query
    • Ids Query
  • Compound queries
    • ...
  • Joining queries
    • ...
  • Geo queries
    • ...

For example (from the Elastic documentation):

GET /_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "Search"
          }
        },
        {
          "match": {
            "content": "Elasticsearch"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "status": "published"
          }
        },
        {
          "range": {
            "publish_date": {
              "gte": "2015-01-01"
            }
          }
        }
      ]
    }
  }
}
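
Because query bodies are plain JSON, they compose naturally as nested hashes in application code. The sketch below rebuilds the example above from small helper methods. The helper names (`match`, `term`, `range_from`, `bool`) are invented for illustration and are not part of any client library.

```ruby
require 'json'

# Hypothetical helpers that return fragments of the query DSL as hashes.
def match(field, text)
  { match: { field => text } }
end

def term(field, value)
  { term: { field => value } }
end

def range_from(field, lower_bound)
  { range: { field => { gte: lower_bound } } }
end

# "must" clauses contribute to the score, "filter" clauses only restrict hits.
def bool(must: [], filter: [])
  { bool: { must: must, filter: filter } }
end

query = bool(
  must: [match('title', 'Search'), match('content', 'Elasticsearch')],
  filter: [term('status', 'published'), range_from('publish_date', '2015-01-01')]
)

puts JSON.pretty_generate({ query: query })
```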

Cluster / Nodes / Shards

  • A node is an Elasticsearch server on one machine
  • A cluster is a group of nodes (analogous to a distributed SQL database)
    • In practice you always talk to the cluster
    • Load balancing happens internally and automatically
  • The cluster has indexes (analogous to SQL tables)
  • Every index has n shards (default: 5)
  • A document (analogous to a SQL row) ends up in one of the shards
  • Shards can be replicated automatically
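
Which shard a document ends up in is, in essence, decided by hashing its routing value (by default the document id) and taking the result modulo the number of primary shards. The Ruby sketch below illustrates the idea. `shard_for` is a made-up name, and CRC32 stands in for the hash function only because it ships with Ruby; Elasticsearch uses a different hash internally.

```ruby
require 'zlib'

NUMBER_OF_SHARDS = 5 # the default mentioned above

# Picks a shard for a document id. CRC32 is a stand-in hash function,
# not the one Elasticsearch uses.
def shard_for(document_id, number_of_shards = NUMBER_OF_SHARDS)
  Zlib.crc32(document_id.to_s) % number_of_shards
end

# Distribute 1000 hypothetical document ids over the shards.
ids = (1..1000).map(&:to_s)
distribution = ids.group_by { |id| shard_for(id) }

distribution.sort.each do |shard, shard_ids|
  puts "shard #{shard}: #{shard_ids.size} documents"
end
```

Since the shard is a pure function of the id and the shard count, a lookup by id can go straight to the right shard. It also suggests why the number of primary shards cannot simply be changed on an existing index: existing documents would no longer be where the formula expects them.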


Mappings

Look at the index again:

http GET :9200/my_documents/
{
  "my_documents": {
    "aliases": {
    },
    "mappings": {
      "post": {
        "properties": {
          "author": {
            "properties": {
              "name": {
                "fields": {
                  "keyword": {
                    "ignore_above": 256,
                    "type": "keyword"
                  }
                },
                "type": "text"
              }
            }
          },
          "body": {
            "fields": {
              "keyword": {
                "ignore_above": 256,
                "type": "keyword"
              }
            },
            "type": "text"
          },
          "title": {
            "fields": {
              "keyword": {
                "ignore_above": 256,
                "type": "keyword"
              }
            },
            "type": "text"
          }
        }
      }
    },
    "settings": {
      # ...
    }
  }
}

Keyword vs Text

  • Text fields are tokenized and, where applicable, processed further
  • Keyword fields only ever match "exactly"
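
The difference between the two field types can be pictured with a small Ruby sketch. `index_as_text`, `index_as_keyword` and `term_matches?` are simplified stand-ins invented here; real analyzers do considerably more.

```ruby
# Simplified stand-ins for the two field types.
def index_as_text(value)
  value.downcase.scan(/\w+/) # several lowercase tokens
end

def index_as_keyword(value)
  [value] # a single token: the unmodified value
end

# A term-level lookup succeeds only if the term equals an indexed token.
def term_matches?(indexed_tokens, term)
  indexed_tokens.include?(term)
end

name = 'Tobias Kraze'

puts term_matches?(index_as_text(name), 'tobias')          # true
puts term_matches?(index_as_text(name), 'Tobias Kraze')    # false
puts term_matches?(index_as_keyword(name), 'Tobias Kraze') # true
puts term_matches?(index_as_keyword(name), 'tobias')       # false
```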

Derived Fields

  • Fields can end up in the index several times, with different analyzers

cat | http GET :9200/my_documents/_search
{
  "query": {
    "term": {
      "author.name.keyword": "Tobias Kraze"
    }
  }
}

Advanced Analyzer

  • A dedicated analyzer can be used per field
  • This is typically how stemming etc. is done

Example: mapping for "I don't know the language"

{
  "mappings": {
    "post": {
      "properties": {
        "body": {
          "analyzer": "english",
          "type": "text"
        }
      }
    }
  }
}
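
Stemming works because the same chain of steps runs both when a document is indexed and when a query is analyzed, so different word forms end up as the same token. The Ruby sketch below shows such a chain with a deliberately naive stemmer. `STOPWORDS`, `toy_stem` and `analyze` are invented for illustration and are far simpler than the built-in `english` analyzer.

```ruby
# A tiny, made-up stopword list.
STOPWORDS = %w[a an and the of to].freeze

# Deliberately naive suffix stripping; real stemmers handle far more cases.
def toy_stem(token)
  token.sub(/(ing|ed|s)\z/, '')
end

# tokenizer -> lowercase filter -> stopword filter -> stemmer
def analyze(text)
  text.downcase
      .scan(/[a-z]+/)
      .reject { |token| STOPWORDS.include?(token) }
      .map { |token| toy_stem(token) }
end

indexed = analyze('Searching the indexed documents')
queried = analyze('search index document')

p indexed
p queried
puts indexed == queried # both sides produce the same tokens
```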

Rails integration + searchkick

What is needed to integrate Elastic into a Rails application?

At a minimum:

  • A client to talk to Elastic
  • Creating indexes and suitable mappings, plus migrating mappings
  • Automatic indexing and deletion of documents (after_commit, possibly async)
  • Bulk reindexing (possibly async)
  • A way to map search results back to database records

A client is available as elasticsearch-api.

Everything else is provided by searchkick.
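
The last requirement in the list above deserves a closer look: search hits come back as ids ordered by relevance, while a database lookup by a list of ids returns rows in no particular order, and the index may briefly contain hits for records that no longer exist. The Ruby sketch below shows one way to handle both. `Record` and `records_for_hits` are stand-ins invented for illustration and do not show how searchkick implements this.

```ruby
# Stand-in for an ActiveRecord model; only the id matters here.
Record = Struct.new(:id, :title)

# Returns records in the order of the search hits and silently drops hits
# whose record is missing from the given records.
def records_for_hits(hits, records)
  records_by_id = records.to_h { |record| [record.id.to_s, record] }
  hits.map { |hit| records_by_id[hit['_id']] }.compact
end

# Hypothetical data: hit "99" has no matching record.
hits = [{ '_id' => '3' }, { '_id' => '1' }, { '_id' => '99' }]
records = [Record.new(1, 'First post'), Record.new(3, 'Third post')]

p records_for_hits(hits, records).map(&:title)
```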

Integration (somewhat simplified):

module Post::DoesSearch
  as_trait do

    searchkick callbacks: :async # use sidekiq

    scope :search_import, -> { published.includes(:channel, :link) } # use for bulk indexing

    # index these fields
    def search_data
      attributes.slice(
        'title',
        'url',
      ).merge(
        body: html_to_text(body_html),
        user: {
          name: user&.name,
        },
        acl_group_ids: channel&.group_ids, # we need to somehow model powers in elastic
      )
    end

    # do not index, if not published
    def should_index?
      published?
    end

  end
end

Index with:

Post.reindex(async: true)

Callbacks in other models:

class User < ApplicationRecord

  has_many :posts
  after_commit :reindex_posts

  def reindex_posts
    if saved_change_to_attribute?(:name)
      posts.reindex # this is not async, but we have workarounds!
    end
  end

end

Search:

Searchkick.search(params[:query].to_s, models: [Post, Message, Document], page: params[:page], per_page: 20).to_a

What Searchkick does behind the scenes

  • One index per model
  • Index aliases
  • Default mappings
  • Search across multiple indexes
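
Index aliases are what make a full reindex possible without downtime: a fresh index is built next to the live one, and the alias is repointed once the new index is complete. The Ruby sketch below models that idea entirely in memory. `FakeCluster` and the timestamp-suffixed index names are assumptions for illustration, and nothing here talks to a real server or reproduces searchkick's code.

```ruby
# Tiny in-memory stand-in for a cluster:
# index name => documents, alias name => index name.
class FakeCluster
  attr_reader :indexes, :aliases

  def initialize
    @indexes = {}
    @aliases = {}
  end

  # Builds a fresh index next to the live one, then repoints the alias
  # and removes the previous index.
  def reindex(alias_name, documents, timestamp)
    new_index = "#{alias_name}_#{timestamp}"
    indexes[new_index] = documents.dup
    old_index = aliases[alias_name]
    aliases[alias_name] = new_index # searches switch over here
    indexes.delete(old_index) if old_index
    new_index
  end

  # Searches go through the alias, never through a concrete index name.
  def search_target(alias_name)
    indexes.fetch(aliases.fetch(alias_name))
  end
end

cluster = FakeCluster.new
cluster.reindex('posts_development', ['post 1'], '20190502120000')
cluster.reindex('posts_development', ['post 1', 'post 2'], '20190502130000')

p cluster.aliases
p cluster.indexes.keys
p cluster.search_target('posts_development')
```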

Inspect the mapping:

http GET :9200/posts_development

A search issues the following query:

cat | http GET :9200/documents_development,posts_development/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "dis_max": {
            "queries": [
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 10,
                  "operator": "and",
                  "analyzer": "searchkick_search",
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              },
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 10,
                  "operator": "and",
                  "analyzer": "searchkick_search2",
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              },
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 1,
                  "operator": "and",
                  "analyzer": "searchkick_search",
                  "fuzziness": 1,
                  "prefix_length": 0,
                  "max_expansions": 3,
                  "fuzzy_transpositions": true,
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              },
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 1,
                  "operator": "and",
                  "analyzer": "searchkick_search2",
                  "fuzziness": 1,
                  "prefix_length": 0,
                  "max_expansions": 3,
                  "fuzzy_transpositions": true,
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              }
            ]
          }
        }
      ]
    }
  },
  "timeout": "11s",
  "_source": false,
  "size": 10000,
  "from": 0
}
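
The query above wraps four `multi_match` clauses in a `dis_max`: two exact variants with boost 10 and two fuzzy variants with boost 1. A `dis_max` query scores a document by its best-matching clause instead of adding all clauses up, so a document that matches both exactly and fuzzily is not counted twice. The Ruby sketch below spells out that scoring rule. `dis_max_score` is a name invented here and the input scores are hypothetical.

```ruby
# Score of a dis_max query for one document, given the scores of the
# subqueries that matched it. A tie_breaker of 0.0 is the default, in
# which case only the best subquery counts.
def dis_max_score(subquery_scores, tie_breaker: 0.0)
  return 0.0 if subquery_scores.empty?

  best = subquery_scores.max
  others = subquery_scores.sum - best
  best + tie_breaker * others
end

# Hypothetical scores: an exact clause and a fuzzy clause matched the
# same document.
p dis_max_score([7.5, 0.75])
p dis_max_score([7.5, 0.5], tie_breaker: 0.5)
```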

Aggregations

  • Aggregations return "statistics" about the search hits
  • They used to be called "facets"
  • Somewhat analogous to "GROUP BY" in SQL
  • There is a very long list of supported aggregations:
    • Metrics
      • Avg
      • Weighted avg
      • Geo bounds
      • Max
      • Min
      • ...
    • Buckets
      • Terms
      • Date Histogram
      • IP Range
      • Significant Terms

Searchkick.search(query, models: [Post, Message], limit: 0, aggs: [:_type]).to_a

cat | http GET :9200/documents_development,posts_development/_search
{
  "query": { ... },
  "aggs": {
    "_type": {
      "terms": {
        "field": "_type",
        "size": 1000
      }
    }
  }
}
{
  "aggregations": {
    "_type": {
      "buckets": [
        {
          "doc_count": 476,
          "key": "document"
        },
        {
          "doc_count": 31,
          "key": "post"
        }
      ],
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0
    }
  },
  "hits": { ... }
}
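
To make the "GROUP BY" analogy concrete, the Ruby sketch below computes what a terms aggregation returns, using a plain array of hits. `terms_buckets` and the sample hits are invented for illustration; the real aggregation runs inside Elastic across all shards.

```ruby
# Hypothetical search hits, reduced to the field that is aggregated on.
hits = [
  { '_type' => 'document' },
  { '_type' => 'post' },
  { '_type' => 'document' }
]

# Counts hits per distinct value and returns buckets, largest first,
# shaped like the "buckets" array of a terms aggregation.
def terms_buckets(hits, field)
  counts = hits.each_with_object(Hash.new(0)) { |hit, acc| acc[hit[field]] += 1 }
  counts.sort_by { |_key, count| -count }
        .map { |key, count| { 'key' => key, 'doc_count' => count } }
end

p terms_buckets(hits, '_type')
```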

Testing

Indexing is off by default and is switched on for tagged specs (here: search: true). The index then also has to be created once.

# spec/spec_helper.rb

require 'searchkick'
Searchkick.index_suffix = ENV['TEST_ENV_NUMBER'] # must happen before models are loaded

# search_spec.rb

describe Search, search: true do # enable indexing
  # ...

  it 'only returns posts for a post query' do
    post = create(:post)
    create(:event)
    Post.search_index.refresh # force index to update
    Event.search_index.refresh # force index to update

    expect(described_class.new(query: '*', type: 'post').each.map(&:record)).to eq [post]
  end

  # ...

end
License
Source code in this card is licensed under the MIT License.
Posted by Tobias Kraze to makandra dev (2019-05-02 12:47)