Posted almost 2 years ago. Visible to the public.

Talk: Elasticsearch basics and Rails integration with searchkick

What is Elastic?

  • Search engine based on Apache Lucene
  • Mostly open source
  • Some commercial features ("Elastic Stack", formerly "X-Pack")
    • Access control (until recently)
    • Monitoring
    • Reporting
    • Graph support
    • Machine learning
  • REST API (JSON over HTTP)

Basics

Elastic listens on port 9200 by default. The examples use the HTTPie command-line client (http), where :9200 is shorthand for localhost:9200.

    http GET :9200

    {
      "name": "ntK2ZrY",
      "cluster_name": "elasticsearch",
      "cluster_uuid": "Bbc-ix5bQZij5vfFU29-Cw",
      "version": {
        "number": "6.7.1",
        "build_flavor": "default",
        "build_type": "deb",
        "build_hash": "2f32220",
        "build_date": "2019-04-02T15:59:27.961366Z",
        "build_snapshot": false,
        "lucene_version": "7.7.0",
        "minimum_wire_compatibility_version": "5.6.0",
        "minimum_index_compatibility_version": "5.0.0"
      },
      "tagline": "You Know, for Search"
    }

Indexes

Documents in Elastic live in an index (analogous to a SQL table).

Create an index:

    http PUT :9200/my_documents

Inspect the index:

    http GET :9200/my_documents

    {
      "my_documents": {
        "aliases": { },
        "mappings": { },
        "settings": {
          "index": {
            "creation_date": "1556793804746",
            "number_of_replicas": "1",
            "number_of_shards": "5",
            "provided_name": "my_documents",
            "uuid": "Q9nNksJhT3Ky7SOqDQa4bA",
            "version": { "created": "6070199" }
          }
        }
      }
    }

Documents

Index a document (the JSON body is read from standard input):

    cat | http PUT :9200/my_documents/post/1
    {
      "title": "A sample document",
      "body": "I wrote something",
      "author": {
        "name": "Tobias Kraze"
      }
    }

    {
      "_id": "1",
      "_index": "my_documents",
      "_primary_term": 1,
      "_seq_no": 0,
      "_shards": { "failed": 0, "successful": 1, "total": 2 },
      "_type": "post",
      "_version": 1,
      "result": "created"
    }
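
Because the API is plain JSON over HTTP, the same request can be built from Ruby with the standard library alone. The following is a minimal sketch, not how any particular client gem works; the helper name build_index_request is hypothetical, and host and port in the commented-out part assume the default local setup from above.

```ruby
require 'json'
require 'net/http'

# Builds (but does not send) the HTTP request that indexes one document.
# build_index_request is a hypothetical helper, not part of any gem.
def build_index_request(index:, type:, id:, document:)
  request = Net::HTTP::Put.new("/#{index}/#{type}/#{id}", 'Content-Type' => 'application/json')
  request.body = JSON.generate(document)
  request
end

request = build_index_request(
  index: 'my_documents',
  type: 'post',
  id: 1,
  document: {
    title: 'A sample document',
    body: 'I wrote something',
    author: { name: 'Tobias Kraze' }
  }
)

puts "#{request.method} #{request.path}"
puts request.body

# To send it to a node running on the default port (assumption: localhost:9200):
# response = Net::HTTP.start('localhost', 9200) { |http| http.request(request) }
# puts response.body
```

Keeping request construction separate from sending makes the request easy to inspect in tests without a running cluster.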

Searching

    http GET :9200/my_documents/_search?q=tobias

    {
      "_shards": { "failed": 0, "skipped": 0, "successful": 5, "total": 5 },
      "hits": {
        "hits": [
          {
            "_id": "1",
            "_index": "my_documents",
            "_score": 0.2876821,
            "_source": {
              "author": { "name": "Tobias Kraze" },
              "body": "I wrote something",
              "title": "A sample document"
            },
            "_type": "post"
          }
        ],
        "max_score": 0.2876821,
        "total": 1
      },
      "timed_out": false,
      "took": 1
    }

    http GET :9200/my_documents/_search?q=body:tobias

-> no hits (the name does not occur in the body field)

    http GET :9200/my_documents/_search?q=body:something

-> found

    http GET :9200/my_documents/_search?q=tob

-> no hits (only whole tokens match)

    http GET :9200/my_documents/_search?q=tob*

-> found
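
The four outcomes above follow from how a full-text index stores whole tokens per field. The following toy inverted index is a simplified model for illustration only, not Elasticsearch's implementation; the class ToyIndex and its query parsing are assumptions made for this sketch.

```ruby
# Toy inverted index: maps each lowercased token to the [document id, field] pairs containing it.
class ToyIndex
  def initialize
    @postings = Hash.new { |hash, token| hash[token] = [] }
  end

  def add(id, fields)
    fields.each do |field, text|
      text.downcase.scan(/\w+/).each { |token| @postings[token] << [id, field] }
    end
  end

  # Mimics q=term, q=field:term and q=prefix* by looking up whole tokens.
  def search(query)
    field, term = query.include?(':') ? query.split(':', 2) : [nil, query]
    term = term.downcase
    matching_tokens =
      if term.end_with?('*')
        prefix = term.chomp('*')
        @postings.keys.select { |token| token.start_with?(prefix) }
      else
        @postings.key?(term) ? [term] : []
      end

    matching_tokens
      .flat_map { |token| @postings[token] }
      .select { |(_id, hit_field)| field.nil? || hit_field == field }
      .map(&:first)
      .uniq
  end
end

index = ToyIndex.new
index.add(1, 'title' => 'A sample document', 'body' => 'I wrote something', 'author.name' => 'Tobias Kraze')

p index.search('tobias')         # [1]
p index.search('body:tobias')    # []
p index.search('body:something') # [1]
p index.search('tob')            # []
p index.search('tob*')           # [1]
```

A lookup for "tob" fails because only the token "tobias" is stored; the wildcard form has to scan the token dictionary for matching prefixes, which is also why prefix queries tend to be more expensive.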

Searching with a JSON body

For software, this is the less ambiguous way to build queries:

    cat | http GET :9200/my_documents/_search
    {
      "query": {
        "match": {
          "author.name": "tob*"
        }
      }
    }

    cat | http GET :9200/my_documents/_search
    {
      "query": {
        "multi_match": {
          "query": "wrote document",
          "fields": ["*"]
        }
      }
    }

Elastic has a large number of query types that can be combined freely:

  • Match All Query
  • Full text queries
    • Match Query
    • Match Phrase Query
    • Match Phrase Prefix Query
    • Multi Match Query
    • Common Terms Query
    • Query String Query
    • Simple Query String Query
    • Intervals query
  • Term level queries
    • Term Query
    • Terms Query
    • Terms Set Query
    • Range Query
    • Exists Query
    • Prefix Query
    • Wildcard Query
    • Regexp Query
    • Fuzzy Query
    • Type Query
    • Ids Query
  • Compound queries
    • ...
  • Joining queries
    • ...
  • Geo queries
    • ...

For example (from the Elastic documentation):

    GET /_search
    {
      "query": {
        "bool": {
          "must": [
            { "match": { "title": "Search" } },
            { "match": { "content": "Elasticsearch" } }
          ],
          "filter": [
            { "term": { "status": "published" } },
            { "range": { "publish_date": { "gte": "2015-01-01" } } }
          ]
        }
      }
    }
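
In application code, such a query is usually assembled as a nested hash and serialized to JSON. The sketch below rebuilds the bool query above in plain Ruby; the helper name published_search_query and its parameters are hypothetical.

```ruby
require 'json'

# Hypothetical helper: full-text clauses go into "must" (they contribute to the score),
# exact conditions go into "filter" (they only include or exclude documents).
def published_search_query(title:, content:, since:)
  {
    query: {
      bool: {
        must: [
          { match: { title: title } },
          { match: { content: content } }
        ],
        filter: [
          { term: { status: 'published' } },
          { range: { publish_date: { gte: since } } }
        ]
      }
    }
  }
end

query = published_search_query(title: 'Search', content: 'Elasticsearch', since: '2015-01-01')
puts JSON.pretty_generate(query)
```

Building the query as data rather than as a string avoids escaping problems and makes individual clauses easy to add or drop conditionally.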

Cluster / Nodes / Shards

  • A node is an Elasticsearch server running on one machine
  • A cluster is a group of nodes (analogous to a distributed SQL database)
    • In practice you always talk to the cluster
    • Load balancing happens internally and automatically
  • The cluster has indexes (analogous to SQL tables)
  • Every index has n shards (default 5 in Elasticsearch 6.x, 1 since 7.0)
  • A document (analogous to a SQL row) ends up in exactly one of the shards
  • Shards can be replicated automatically
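
How a document ends up in one particular shard can be sketched as hashing its routing value (the document ID by default) modulo the number of primary shards. Elasticsearch uses a Murmur3 hash for this; the sketch below substitutes CRC32 from Ruby's standard library, so the concrete shard numbers are not what a real cluster would compute. The function name shard_for is hypothetical.

```ruby
require 'zlib'

# Stand-in routing function: hash(routing value) modulo number of primary shards.
# CRC32 replaces the Murmur3 hash that Elasticsearch actually uses.
def shard_for(document_id, number_of_shards: 5)
  Zlib.crc32(document_id.to_s) % number_of_shards
end

# Distribute 1000 document IDs and count how many land in each shard.
distribution = (1..1000).group_by { |id| shard_for(id) }.transform_values(&:size)

distribution.sort.each do |shard, count|
  puts "shard #{shard}: #{count} documents"
end
```

Since the shard number depends on the shard count, changing the number of primary shards would send existing IDs to different shards, which is one reason the count is fixed when the index is created.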


Mappings

Look at the index again:

    http GET :9200/my_documents/

    {
      "my_documents": {
        "aliases": { },
        "mappings": {
          "post": {
            "properties": {
              "author": {
                "properties": {
                  "name": {
                    "fields": {
                      "keyword": { "ignore_above": 256, "type": "keyword" }
                    },
                    "type": "text"
                  }
                }
              },
              "body": {
                "fields": {
                  "keyword": { "ignore_above": 256, "type": "keyword" }
                },
                "type": "text"
              },
              "title": {
                "fields": {
                  "keyword": { "ignore_above": 256, "type": "keyword" }
                },
                "type": "text"
              }
            }
          }
        },
        "settings": {
          # ...
        }
      }
    }

Keyword vs Text

  • Text fields are tokenized and, where configured, processed further
  • Keyword fields only ever match "exactly"

Derived Fields

  • Fields can be stored in the index several times, each with a different analyzer

    cat | http GET :9200/my_documents/_search
    {
      "query": {
        "term": {
          "author.name.keyword": "Tobias Kraze"
        }
      }
    }
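
Why the term query on author.name.keyword needs the complete, unmodified string can be seen in a small simulation of the default "text plus .keyword" mapping. The analysis functions below are simplified stand-ins written for this sketch, not Elasticsearch's analyzers.

```ruby
# Simplified stand-in for analyzing a text field: lowercase and split into word tokens.
def analyze_text(value)
  value.downcase.scan(/\w+/)
end

# Simplified stand-in for a keyword field: the value is kept as one unmodified term.
def analyze_keyword(value)
  [value]
end

# Stores the value twice, like the default mapping with a "keyword" sub-field.
def index_terms(field, value)
  {
    field => analyze_text(value),
    "#{field}.keyword" => analyze_keyword(value)
  }
end

terms = index_terms('author.name', 'Tobias Kraze')

p terms['author.name']                                  # ["tobias", "kraze"]
p terms['author.name.keyword']                          # ["Tobias Kraze"]
p terms['author.name.keyword'].include?('Tobias Kraze') # true
p terms['author.name.keyword'].include?('tobias')       # false
```

The text variant is what full-text search runs against, while the keyword variant is the one suited to exact filters, sorting, and aggregations.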

Advanced analyzers

  • You can configure a specific analyzer per field
  • This is typically how you get stemming and similar processing

Example: mapping for "I don't know the language"

    {
      "mappings": {
        "post": {
          "properties": {
            "body": {
              "analyzer": "english",
              "type": "text"
            }
          }
        }
      }
    }
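
What stemming buys you can be sketched with a deliberately crude suffix stripper. This is an illustration only: the real english analyzer uses a proper stemming algorithm and also removes stop words, and the function names here are hypothetical.

```ruby
# Very crude stemmer: strips a few common English suffixes.
# Only meant to show the idea of reducing words to a common root.
def crude_stem(token)
  token.sub(/(ing|ed|es|s)\z/, '')
end

# The same analysis is applied to indexed text and to the search query,
# so both sides end up with comparable terms.
def analyze_english(text)
  text.downcase.scan(/\w+/).map { |token| crude_stem(token) }
end

indexed_terms = analyze_english('I wrote something about searching documents')
query_terms   = analyze_english('document search')

all_terms_found = (query_terms - indexed_terms).empty?

p indexed_terms
p query_terms
puts "query matches: #{all_terms_found}"
```

Without stemming, the query "document search" would miss a text that only contains "documents" and "searching".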

Integration into Rails with searchkick

What do you need to integrate Elastic into a Rails application?

At a minimum:

  • A client to talk to Elastic
  • Creation of indexes and suitable mappings, plus migration of mappings
  • Automatic indexing and deletion of documents (after_commit, possibly async)
  • Bulk reindexing (possibly async)
  • A way to map search results back to database records
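
The last requirement, mapping hits back to records, can be sketched in plain Ruby as follows. The names PostRecord and records_for_hits are hypothetical, and the hash stands in for a database lookup by ID; the point is that relevance order comes from the search hits, not from the database.

```ruby
PostRecord = Struct.new(:id, :title)

# Returns the records for the given hits in hit (relevance) order.
# Hits whose record no longer exists are skipped, since the index can lag behind the database.
def records_for_hits(hits, records_by_id)
  hits.map { |hit| records_by_id[hit['_id'].to_i] }.compact
end

# Stand-in for the database, keyed by primary key.
database = {
  1 => PostRecord.new(1, 'A sample document'),
  2 => PostRecord.new(2, 'Another document')
}

hits = [
  { '_id' => '2', '_score' => 1.3 },
  { '_id' => '7', '_score' => 0.9 }, # stale hit, record was deleted
  { '_id' => '1', '_score' => 0.2 }
]

records = records_for_hits(hits, database)
p records.map(&:title) # ["Another document", "A sample document"]
```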

A client is available in the elasticsearch-api gem.

Everything else is provided by searchkick.

Integration (slightly simplified):

    module Post::DoesSearch
      as_trait do
        searchkick callbacks: :async # use sidekiq

        scope :search_import, -> { published.includes(:channel, :link) } # use for bulk indexing

        # index these fields
        def search_data
          attributes.slice(
            'title',
            'url',
          ).merge(
            body: html_to_text(body_html),
            user: {
              name: user&.name,
            },
            acl_group_ids: channel&.group_ids, # we need to somehow model powers in elastic
          )
        end

        # do not index, if not published
        def should_index?
          published?
        end
      end
    end

Index with:

    Post.reindex(async: true)

Callbacks in other models:

    class User < ApplicationRecord
      has_many :posts

      after_commit :reindex_posts

      def reindex_posts
        if saved_change_to_attribute?(:name)
          posts.reindex # this is not async, but we have workarounds!
        end
      end
    end

Search:

    Searchkick.search(
      params[:query].to_s,
      models: [Post, Message, Document],
      page: params[:page],
      per_page: 20
    ).to_a

What Searchkick does behind the scenes

  • One index per model
  • Index aliases
  • Default mappings
  • Search across multiple indexes
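
Index aliases are what make reindexing without downtime possible: a fresh index is filled under a new name, and the alias is repointed once it is complete. The toy model below sketches that idea in plain Ruby. ToyCluster and its methods are hypothetical and only approximate what Searchkick and Elasticsearch do.

```ruby
# Toy model of a cluster that knows physical indexes and the aliases pointing at them.
class ToyCluster
  attr_reader :indexes, :aliases

  def initialize
    @indexes = {}
    @aliases = {}
  end

  # Builds a fresh, uniquely named index, repoints the alias, then drops the old index.
  # Searches through the alias see the old data until the swap and the new data afterwards.
  def reindex(alias_name, documents, timestamp:)
    new_index = "#{alias_name}_#{timestamp}"
    @indexes[new_index] = documents
    old_index = @aliases[alias_name]
    @aliases[alias_name] = new_index
    @indexes.delete(old_index) if old_index
    new_index
  end

  # The application only ever uses the alias name.
  def search(alias_name)
    @indexes.fetch(@aliases.fetch(alias_name))
  end
end

cluster = ToyCluster.new
cluster.reindex('posts_development', ['old post'], timestamp: '20190502120000')
cluster.reindex('posts_development', ['old post', 'new post'], timestamp: '20190503120000')

p cluster.aliases                     # alias now points at the newer index
p cluster.indexes.keys                # only the newer physical index remains
p cluster.search('posts_development') # ["old post", "new post"]
```

The same mechanism also allows mapping changes, since a mapping can be defined freshly on the new index before the alias is switched.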

Inspect the mapping:

    http GET :9200/posts_development

A search runs the following query:

    cat | http GET :9200/documents_development,posts_development/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "dis_max": {
                "queries": [
                  {
                    "multi_match": {
                      "query": "my search query",
                      "boost": 10,
                      "operator": "and",
                      "analyzer": "searchkick_search",
                      "fields": [ "*.analyzed" ],
                      "type": "best_fields"
                    }
                  },
                  {
                    "multi_match": {
                      "query": "my search query",
                      "boost": 10,
                      "operator": "and",
                      "analyzer": "searchkick_search2",
                      "fields": [ "*.analyzed" ],
                      "type": "best_fields"
                    }
                  },
                  {
                    "multi_match": {
                      "query": "my search query",
                      "boost": 1,
                      "operator": "and",
                      "analyzer": "searchkick_search",
                      "fuzziness": 1,
                      "prefix_length": 0,
                      "max_expansions": 3,
                      "fuzzy_transpositions": true,
                      "fields": [ "*.analyzed" ],
                      "type": "best_fields"
                    }
                  },
                  {
                    "multi_match": {
                      "query": "my search query",
                      "boost": 1,
                      "operator": "and",
                      "analyzer": "searchkick_search2",
                      "fuzziness": 1,
                      "prefix_length": 0,
                      "max_expansions": 3,
                      "fuzzy_transpositions": true,
                      "fields": [ "*.analyzed" ],
                      "type": "best_fields"
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "timeout": "11s",
      "_source": false,
      "size": 10000,
      "from": 0
    }

Aggregations

  • Aggregations return "statistics" about the search hits
  • They used to be called "facets"
  • Somewhat analogous to "GROUP BY" in SQL
  • There is a very long list of supported aggregations:
    • Metrics
      • Avg
      • Weighted avg
      • Geo bounds
      • Max
      • Min
      • ...
    • Buckets
      • Terms
      • Date Histogram
      • IP Range
      • Significant Terms

    Searchkick.search(query, models: [Post, Document], limit: 0, aggs: [:_type]).to_a

    cat | http GET :9200/documents_development,posts_development/_search
    {
      "query": { ... },
      "aggs": {
        "_type": {
          "terms": {
            "field": "_type",
            "size": 1000
          }
        }
      }
    }

    {
      "aggregations": {
        "_type": {
          "buckets": [
            { "doc_count": 476, "key": "document" },
            { "doc_count": 31, "key": "post" }
          ],
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0
        }
      },
      "hits": { ... }
    }
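
The GROUP BY analogy can be made concrete with a plain Ruby sketch that computes the same kind of buckets from a list of hits. The function terms_aggregation and the sample data are made up for illustration; a real terms aggregation is computed per shard inside the cluster and then merged.

```ruby
# Groups hits by the value of one field and counts each group,
# largest bucket first, like the buckets of a terms aggregation.
def terms_aggregation(hits, field)
  hits
    .group_by { |hit| hit[field] }
    .map { |key, group| { 'key' => key, 'doc_count' => group.size } }
    .sort_by { |bucket| -bucket['doc_count'] }
end

# Made-up sample hits.
hits = [
  { '_id' => '1', '_type' => 'document' },
  { '_id' => '2', '_type' => 'post' },
  { '_id' => '3', '_type' => 'document' },
  { '_id' => '4', '_type' => 'document' }
]

buckets = terms_aggregation(hits, '_type')
p buckets
# [{"key"=>"document", "doc_count"=>3}, {"key"=>"post", "doc_count"=>1}]
```

With limit: 0, as in the Searchkick call above, only these counts are returned and no hits, which suits a result-type filter shown next to a search form.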

Testing

Indexing is disabled by default in tests and only enabled for specs that carry a tag (here search: true). For those specs, the index also has to be created once.

    # spec/spec_helper.rb
    require 'searchkick'
    Searchkick.index_suffix = ENV['TEST_ENV_NUMBER'] # must happen before models are loaded

    # search_spec.rb
    describe Search, search: true do # enable indexing

      # ...

      it 'only returns posts for a post query' do
        post = create(:post)
        create(:event)
        Post.search_index.refresh # force index to update
        Event.search_index.refresh # force index to update

        expect(described_class.new(query: '*', type: 'post').each.map(&:record)).to eq [post]
      end

      # ...

    end

Posted by Tobias Kraze to makandra dev