Posted about 2 years ago. Visible to the public.

Talk: Elasticsearch basics and Rails integration with searchkick

What is Elastic?

  • Search engine based on Apache Lucene
  • Mostly open source
  • Some commercial features ("Elastic Stack", formerly "X-Pack")
    • Access control (until recently)
    • Monitoring
    • Reporting
    • Graph support
    • Machine learning
  • REST API (JSON over HTTP)

Basics

By default, Elastic listens on port 9200:

    http GET :9200

    {
      "name": "ntK2ZrY",
      "cluster_name": "elasticsearch",
      "cluster_uuid": "Bbc-ix5bQZij5vfFU29-Cw",
      "version": {
        "number": "6.7.1",
        "build_flavor": "default",
        "build_type": "deb",
        "build_hash": "2f32220",
        "build_date": "2019-04-02T15:59:27.961366Z",
        "build_snapshot": false,
        "lucene_version": "7.7.0",
        "minimum_wire_compatibility_version": "5.6.0",
        "minimum_index_compatibility_version": "5.0.0"
      },
      "tagline": "You Know, for Search"
    }

Indices

Documents in Elastic live in an index (analogous to a SQL table).

Create an index:

    http PUT :9200/my_documents

View the index:

    http GET :9200/my_documents

    {
      "my_documents": {
        "aliases": { },
        "mappings": { },
        "settings": {
          "index": {
            "creation_date": "1556793804746",
            "number_of_replicas": "1",
            "number_of_shards": "5",
            "provided_name": "my_documents",
            "uuid": "Q9nNksJhT3Ky7SOqDQa4bA",
            "version": { "created": "6070199" }
          }
        }
      }
    }

Documents

Index a document:

    cat | http PUT :9200/my_documents/post/1
    {
      "title": "A sample document",
      "body": "I wrote something",
      "author": { "name": "Tobias Kraze" }
    }

    {
      "_id": "1",
      "_index": "my_documents",
      "_primary_term": 1,
      "_seq_no": 0,
      "_shards": { "failed": 0, "successful": 1, "total": 2 },
      "_type": "post",
      "_version": 1,
      "result": "created"
    }

Searching

    http GET :9200/my_documents/_search?q=tobias

    {
      "_shards": { "failed": 0, "skipped": 0, "successful": 5, "total": 5 },
      "hits": {
        "hits": [
          {
            "_id": "1",
            "_index": "my_documents",
            "_score": 0.2876821,
            "_source": {
              "author": { "name": "Tobias Kraze" },
              "body": "I wrote something",
              "title": "A sample document"
            },
            "_type": "post"
          }
        ],
        "max_score": 0.2876821,
        "total": 1
      },
      "timed_out": false,
      "took": 1
    }

    http GET :9200/my_documents/_search?q=body:tobias

-> no hits (the body field does not contain "tobias")

    http GET :9200/my_documents/_search?q=body:something

-> finds the document

    http GET :9200/my_documents/_search?q=tob

-> no hits (only whole tokens match, not parts of them)

    http GET :9200/my_documents/_search?q=tob*

-> finds the document
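
Why `tob` finds nothing while `tob*` does can be illustrated with a toy inverted index. This is a deliberately simplified sketch, not how Elasticsearch or Lucene are implemented; the class `ToyIndex` and its methods are made up for this illustration.

```ruby
# Toy model of an inverted index (illustration only, not Elasticsearch's implementation).
class ToyIndex
  def initialize
    @postings = Hash.new { |hash, token| hash[token] = [] } # token => [document ids]
  end

  # Split text into lowercase word tokens, roughly what a standard analyzer does.
  def self.tokenize(text)
    text.downcase.scan(/[[:alnum:]]+/)
  end

  def add(id, text)
    self.class.tokenize(text).each { |token| @postings[token] |= [id] }
  end

  # Plain search: the query term must equal an indexed token.
  def search(term)
    @postings.fetch(term.downcase, [])
  end

  # Wildcard-style search ("tob*"): every indexed token starting with the prefix matches.
  def prefix_search(prefix)
    @postings.select { |token, _ids| token.start_with?(prefix.downcase) }.values.flatten.uniq
  end
end

index = ToyIndex.new
index.add(1, 'Tobias Kraze')

p index.search('tobias')     # [1]
p index.search('tob')        # []
p index.prefix_search('tob') # [1]
```

The index only knows the tokens `tobias` and `kraze`, so a lookup for `tob` misses, while a prefix scan over all tokens succeeds.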

Searching with a JSON body

For software, this is the less ambiguous way to build queries:

    cat | http GET :9200/my_documents/_search
    {
      "query": {
        "match": { "author.name": "tobias" }
      }
    }

    cat | http GET :9200/my_documents/_search
    {
      "query": {
        "multi_match": {
          "query": "wrote document",
          "fields": ["*"]
        }
      }
    }
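
From Ruby, the same kind of request can be built with the standard library alone. The following is a minimal sketch: the helper name `build_search_request` is made up, and it uses POST because Elasticsearch accepts both GET and POST for `_search` and not every HTTP client sends a body with GET.

```ruby
require 'json'
require 'net/http'

# Build (but do not send) a search request with a JSON body.
# `build_search_request` is a hypothetical helper for this sketch.
def build_search_request(index, query)
  request = Net::HTTP::Post.new("/#{index}/_search", 'Content-Type' => 'application/json')
  request.body = JSON.generate({ query: query })
  request
end

request = build_search_request('my_documents', { match: { 'author.name' => 'tobias' } })
puts request.path
puts request.body

# To send it to a node running locally on the default port:
# response = Net::HTTP.start('localhost', 9200) { |http| http.request(request) }
# puts response.body
```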

Elastic offers a large number of query types that can be combined freely:

  • Match All Query
  • Full text queries
    • Match Query
    • Match Phrase Query
    • Match Phrase Prefix Query
    • Multi Match Query
    • Common Terms Query
    • Query String Query
    • Simple Query String Query
    • Intervals query
  • Term level queries
    • Term Query
    • Terms Query
    • Terms Set Query
    • Range Query
    • Exists Query
    • Prefix Query
    • Wildcard Query
    • Regexp Query
    • Fuzzy Query
    • Type Query
    • Ids Query
  • Compound queries
    • ...
  • Joining queries
    • ...
  • Geo queries
    • ...

For example (from the Elastic documentation):

    GET /_search
    {
      "query": {
        "bool": {
          "must": [
            { "match": { "title": "Search" } },
            { "match": { "content": "Elasticsearch" } }
          ],
          "filter": [
            { "term": { "status": "published" } },
            { "range": { "publish_date": { "gte": "2015-01-01" } } }
          ]
        }
      }
    }
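
Since queries are plain nested JSON, combining them in application code is mostly a matter of composing hashes. A small sketch with a made-up helper `bool_query`; `must` clauses contribute to the score, `filter` clauses only include or exclude documents:

```ruby
require 'json'

# Combine scoring clauses (must) and non-scoring clauses (filter) into one bool query.
# `bool_query` is a hypothetical helper for this sketch.
def bool_query(must: [], filter: [])
  { query: { bool: { must: must, filter: filter } } }
end

query = bool_query(
  must: [
    { match: { title: 'Search' } },
    { match: { content: 'Elasticsearch' } }
  ],
  filter: [
    { term: { status: 'published' } },
    { range: { publish_date: { gte: '2015-01-01' } } }
  ]
)

puts JSON.pretty_generate(query)
```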

Cluster / Nodes / Shards

  • A node is an Elasticsearch server on one machine
  • A cluster is a group of nodes (analogous to a distributed SQL database)
    • In practice you always talk to the cluster
    • Load balancing happens internally and automatically
  • The cluster has indices (analogous to SQL tables)
  • Each index has n shards (default: 5 in the Elasticsearch 6.x used here)
  • A document (analogous to a SQL row) ends up in exactly one of the shards
  • Shards can be replicated automatically
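
How a document ends up in one particular shard can be sketched as follows. The Elasticsearch documentation describes routing roughly as `hash(routing) % number_of_primary_shards`, with the document ID as the default routing value. In this sketch `Zlib.crc32` is only a stand-in hash (Elasticsearch uses a different hash function), and `shard_for` is a made-up name.

```ruby
require 'zlib'

# Pick a shard for a routing value (by default the document ID).
# Zlib.crc32 is a stand-in for this sketch; Elasticsearch uses a different hash function.
def shard_for(routing, number_of_shards)
  Zlib.crc32(routing.to_s) % number_of_shards
end

# Distribute 1000 document IDs over 5 shards and count documents per shard.
ids = (1..1000).map(&:to_s)
distribution = ids.group_by { |id| shard_for(id, 5) }.transform_values(&:size)
p distribution.sort.to_h
```

Because the shard is derived from the number of primary shards, that number cannot simply be changed for an existing index without moving documents around.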


Mappings

Look at the index again:

    http GET :9200/my_documents/

    {
      "my_documents": {
        "aliases": { },
        "mappings": {
          "post": {
            "properties": {
              "author": {
                "properties": {
                  "name": {
                    "fields": { "keyword": { "ignore_above": 256, "type": "keyword" } },
                    "type": "text"
                  }
                }
              },
              "body": {
                "fields": { "keyword": { "ignore_above": 256, "type": "keyword" } },
                "type": "text"
              },
              "title": {
                "fields": { "keyword": { "ignore_above": 256, "type": "keyword" } },
                "type": "text"
              }
            }
          }
        },
        "settings": {
          # ...
        }
      }
    }

Keyword vs Text

  • Text fields are tokenized and, where applicable, processed further
  • Keyword fields only ever match "exactly"

Derived Fields

  • Fields can end up in the index multiple times, with different analyzers

    cat | http GET :9200/my_documents/_search
    {
      "query": {
        "term": { "author.name.keyword": "Tobias Kraze" }
      }
    }
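
The effect of indexing one value both as text and as keyword can be modeled in a few lines. This is a toy illustration with made-up helper names, not Elasticsearch's implementation: a term query compares the query string to the indexed terms without analyzing it first.

```ruby
# Toy model of a multi-field: one source value, indexed once as analyzed text
# and once as an exact keyword (illustration only).
def index_multi_field(value)
  {
    text: value.downcase.scan(/[[:alnum:]]+/), # analyzed: lowercase tokens
    keyword: [value]                            # not analyzed: the whole value is one term
  }
end

# A term query looks up the query string as-is among the indexed terms.
def term_match?(terms, query)
  terms.include?(query)
end

field = index_multi_field('Tobias Kraze')

p term_match?(field[:keyword], 'Tobias Kraze') # true
p term_match?(field[:text], 'Tobias Kraze')    # false
p term_match?(field[:text], 'tobias')          # true
```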

Advanced Analyzer

  • You can use a dedicated analyzer per field
  • This is typically how you get stemming etc.

Example: mapping for "I don't know the language":

    {
      "mappings": {
        "post": {
          "properties": {
            "body": { "analyzer": "english", "type": "text" }
          }
        }
      }
    }

Rails integration with searchkick

What do you need to integrate Elastic into a Rails application?

At least:

  • A client to talk to Elastic
  • Creation of indices and suitable mappings, plus migration of mappings
  • Automatic indexing and deletion of documents (after_commit, possibly async)
  • Bulk reindexing (possibly async)
  • A way to map search results back to database records

A client is available with elasticsearch-api.

searchkick provides everything else.
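
The last requirement in the list above, mapping search hits back to database records, can be sketched without Rails. `Record` and `DATABASE` are stand-ins for an ActiveRecord model and its table; in a real application the lookup would be a single query by ID. This illustrates the idea only and is not searchkick's code.

```ruby
# Stand-in for an ActiveRecord model and its table (hypothetical data).
Record = Struct.new(:id, :title)

DATABASE = [
  Record.new(1, 'A sample document'),
  Record.new(2, 'Another document'),
  Record.new(3, 'A third document')
].freeze

# Load the records for all hits in one lookup and keep the relevance order of the hits.
# Hits whose record no longer exists (stale index entries) are dropped.
def records_for(hits, records)
  by_id = records.to_h { |record| [record.id.to_s, record] }
  hits.map { |hit| by_id[hit['_id']] }.compact
end

hits = [
  { '_id' => '3', '_score' => 1.2 },
  { '_id' => '1', '_score' => 0.3 },
  { '_id' => '99', '_score' => 0.1 } # deleted from the database, still in the index
]

p records_for(hits, DATABASE).map(&:id) # [3, 1]
```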

Integration (somewhat simplified):

    module Post::DoesSearch
      as_trait do
        searchkick callbacks: :async # use sidekiq

        scope :search_import, -> { published.includes(:channel, :link) } # use for bulk indexing

        # index these fields
        def search_data
          attributes.slice(
            'title',
            'url',
          ).merge(
            body: html_to_text(body_html),
            user: {
              name: user&.name,
            },
            acl_group_ids: channel&.group_ids, # we need to somehow model powers in elastic
          )
        end

        # do not index, if not published
        def should_index?
          published?
        end
      end
    end

Index with:

    Post.reindex(async: true)

Callbacks in other models:

    class User < ApplicationRecord
      has_many :posts

      after_commit :reindex_posts

      def reindex_posts
        if saved_change_to_attribute?(:name)
          posts.reindex # this is not async, but we have workarounds!
        end
      end
    end

Search:

    Searchkick.search(
      params[:query].to_s,
      models: [Post, Message, Document],
      page: params[:page],
      per_page: 20
    ).to_a

What Searchkick does behind the scenes

  • One index per model
  • Index aliases
  • Default mappings
  • Search across multiple indices
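
Index aliases are what make reindexing without downtime possible: a new index is filled in the background, then the alias is repointed in a single atomic request to Elasticsearch's `_aliases` endpoint. The sketch below only builds the request body; the helper name and the concrete index names are hypothetical, and this is not searchkick's actual code.

```ruby
require 'json'

# Build the body for POST /_aliases that atomically moves an alias to a new index.
# `alias_swap_actions` and the index names used below are hypothetical.
def alias_swap_actions(alias_name, old_index, new_index)
  {
    actions: [
      { remove: { index: old_index, alias: alias_name } },
      { add: { index: new_index, alias: alias_name } }
    ]
  }
end

body = alias_swap_actions('posts_development', 'posts_development_v1', 'posts_development_v2')
puts JSON.pretty_generate(body)
```

The application keeps searching `posts_development` the whole time and never needs to know which physical index is behind the alias.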

View the mapping:

    http GET :9200/posts_development

A search issues the following query:

    cat | http GET :9200/documents_development,posts_development/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "dis_max": {
                "queries": [
                  {
                    "multi_match": {
                      "query": "my search query",
                      "boost": 10,
                      "operator": "and",
                      "analyzer": "searchkick_search",
                      "fields": ["*.analyzed"],
                      "type": "best_fields"
                    }
                  },
                  {
                    "multi_match": {
                      "query": "my search query",
                      "boost": 10,
                      "operator": "and",
                      "analyzer": "searchkick_search2",
                      "fields": ["*.analyzed"],
                      "type": "best_fields"
                    }
                  },
                  {
                    "multi_match": {
                      "query": "my search query",
                      "boost": 1,
                      "operator": "and",
                      "analyzer": "searchkick_search",
                      "fuzziness": 1,
                      "prefix_length": 0,
                      "max_expansions": 3,
                      "fuzzy_transpositions": true,
                      "fields": ["*.analyzed"],
                      "type": "best_fields"
                    }
                  },
                  {
                    "multi_match": {
                      "query": "my search query",
                      "boost": 1,
                      "operator": "and",
                      "analyzer": "searchkick_search2",
                      "fuzziness": 1,
                      "prefix_length": 0,
                      "max_expansions": 3,
                      "fuzzy_transpositions": true,
                      "fields": ["*.analyzed"],
                      "type": "best_fields"
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "timeout": "11s",
      "_source": false,
      "size": 10000,
      "from": 0
    }

Aggregations

  • Aggregations provide "statistics" about the search hits
  • They used to be called "facets"
  • Somewhat analogous to "GROUP BY" in SQL
  • There is a seemingly endless list of supported aggregations:
    • Metrics
      • Avg
      • Weighted avg
      • Geo bounds
      • Max
      • Min
      • ...
    • Buckets
      • Terms
      • Date Histogram
      • IP Range
      • Significant Terms

    Searchkick.search(query, models: [Post, Message], limit: 0, aggs: [:_type]).to_a

    cat | http GET :9200/documents_development,posts_development/_search
    {
      "query": { ... },
      "aggs": {
        "_type": { "terms": { "field": "_type", "size": 1000 } }
      }
    }

    {
      "aggregations": {
        "_type": {
          "buckets": [
            { "doc_count": 476, "key": "document" },
            { "doc_count": 31, "key": "post" }
          ],
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0
        }
      },
      "hits": { ... }
    }
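
On the application side, the buckets of such a terms aggregation are usually turned into a simple lookup of key to count, e.g. to show "Documents (476)" next to a filter. A minimal sketch that works on the raw JSON response shown above; `bucket_counts` is a made-up helper, not part of searchkick.

```ruby
require 'json'

# The aggregation part of the response shown above.
response = JSON.parse(<<~JSON)
  {
    "aggregations": {
      "_type": {
        "buckets": [
          { "doc_count": 476, "key": "document" },
          { "doc_count": 31, "key": "post" }
        ],
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0
      }
    }
  }
JSON

# Turn the buckets of a terms aggregation into a hash of key => document count.
# Returns an empty hash if the aggregation is missing from the response.
def bucket_counts(response, aggregation_name)
  buckets = response.dig('aggregations', aggregation_name, 'buckets') || []
  buckets.to_h { |bucket| [bucket['key'], bucket['doc_count']] }
end

counts = bucket_counts(response, '_type')
puts "Documents: #{counts['document']}, posts: #{counts['post']}"
```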

Testing

Indexing is disabled by default and is enabled for specs tagged accordingly (search: true in the example below). In that case the index also has to be created once.

    # spec/spec_helper.rb
    require 'searchkick'
    Searchkick.index_suffix = ENV['TEST_ENV_NUMBER'] # must happen before models are loaded

    # search_spec.rb
    describe Search, search: true do # enable indexing
      # ...

      it 'only returns posts for a post query' do
        post = create(:post)
        create(:event)
        Post.search_index.refresh # force index to update
        Event.search_index.refresh # force index to update

        expect(described_class.new(query: '*', type: 'post').each.map(&:record)).to eq [post]
      end

      # ...
    end

Owner of this card:

Tobias Kraze
Last edit:
7 months ago
by Tobias Kraze
Posted by Tobias Kraze to makandra dev