Talk: Elasticsearch basics and Rails integration with searchkick

What is Elastic?

Search engine based on Apache Lucene
mostly open source
some commercial features ("Elastic Stack", formerly "X-Pack")
- Access control (until recently)
- Monitoring
- Reporting
- Graph support
- Machine learning
REST API (JSON over HTTP)

Basics

Elastic responds on port 9200 by default

http GET :9200

{
  "name": "ntK2ZrY",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "Bbc-ix5bQZij5vfFU29-Cw",
  "version": {
    "number": "6.7.1",
    "build_flavor": "default",
    "build_type": "deb",
    "build_hash": "2f32220",
    "build_date": "2019-04-02T15:59:27.961366Z",
    "build_snapshot": false,
    "lucene_version": "7.7.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}

Indices

Documents in Elastic live in an index (analogous to a SQL table)

Create an index:

http PUT :9200/my_documents

View the index:

http GET :9200/my_documents

{
  "my_documents": {
    "aliases": {
    },
    "mappings": {
    },
    "settings": {
      "index": {
        "creation_date": "1556793804746",
        "number_of_replicas": "1",
        "number_of_shards": "5",
        "provided_name": "my_documents",
        "uuid": "Q9nNksJhT3Ky7SOqDQa4bA",
        "version": {
          "created": "6070199"
        }
      }
    }
  }
}

Documents

Index a document:

cat | http PUT :9200/my_documents/post/1
{
  "title": "A sample document",
  "body": "I wrote something",
  "author": {
    "name": "Tobias Kraze"
  }
}

{
  "_id": "1",
  "_index": "my_documents",
  "_primary_term": 1,
  "_seq_no": 0,
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "post",
  "_version": 1,
  "result": "created"
}

Searching

http GET :9200/my_documents/_search?q=tobias

{
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 5,
    "total": 5
  },
  "hits": {
    "hits": [
      {
        "_id": "1",
        "_index": "my_documents",
        "_score": 0.2876821,
        "_source": {
          "author": {
            "name": "Tobias Kraze"
          },
          "body": "I wrote something",
          "title": "A sample document"
        },
        "_type": "post"
      }
    ],
    "max_score": 0.2876821,
    "total": 1
  },
  "timed_out": false,
  "took": 1
}

http GET :9200/my_documents/_search?q=body:tobias

-> no hits

http GET :9200/my_documents/_search?q=body:something

-> finds the document

http GET :9200/my_documents/_search?q=tob

-> no hits

http GET :9200/my_documents/_search?q=tob*

-> finds the document
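
Why these four searches behave differently can be sketched with a toy index. This is only a mental model, not how Lucene is implemented: the class ToyIndex and its methods are made up for illustration, and the tokenizer is a rough stand-in for Elastic's default analysis.

```ruby
# Toy model of a full-text index: every field stores lowercase word tokens,
# and a search term must equal a whole token unless it ends in "*".
class ToyIndex
  def initialize
    # field name => { token => [document ids] }
    @postings = Hash.new { |fields, field| fields[field] = {} }
  end

  # Split a text into lowercase word tokens.
  def tokenize(text)
    text.downcase.scan(/[[:alnum:]]+/)
  end

  def add(id, fields)
    fields.each do |field, text|
      tokenize(text).each do |token|
        ids = (@postings[field][token] ||= [])
        ids << id unless ids.include?(id)
      end
    end
  end

  # term: a single search word; a trailing "*" makes it a prefix search.
  # field: restrict the search to one field, or nil to search all fields.
  def search(term, field: nil)
    term = term.downcase
    candidates = field ? [@postings.fetch(field, {})] : @postings.values
    candidates.flat_map do |tokens|
      if term.end_with?('*')
        prefix = term.chomp('*')
        tokens.select { |token, _ids| token.start_with?(prefix) }.values.flatten
      else
        tokens.fetch(term, [])
      end
    end.uniq
  end
end

index = ToyIndex.new
index.add(1, 'title' => 'A sample document',
             'body' => 'I wrote something',
             'author.name' => 'Tobias Kraze')

index.search('tobias')                  # => [1]  found via author.name
index.search('tobias', field: 'body')   # => []   body has no such token
index.search('something', field: 'body') # => [1]
index.search('tob')                     # => []   "tob" is not a whole token
index.search('tob*')                    # => [1]  prefix matches "tobias"
```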

Searching with a JSON body

For software, this is the less ambiguous way to build queries:

cat | http GET :9200/my_documents/_search
{
  "query": {
    "match": { "author.name": "tobias" }
  }
}

cat | http GET :9200/my_documents/_search
{
  "query": {
    "multi_match": { 
      "query": "wrote document",
      "fields": ["*"]
    }
  }
}
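
From application code, such bodies are usually built as plain hashes and serialized. A minimal sketch, assuming nothing beyond Ruby's JSON library; the helper names match_query and multi_match_query are made up for illustration and are not part of any client gem:

```ruby
require 'json'

# Body for a match query on a single field.
def match_query(field, text)
  { query: { match: { field => text } } }
end

# Body for a multi_match query; fields defaults to all fields.
def multi_match_query(text, fields: ['*'])
  { query: { multi_match: { query: text, fields: fields } } }
end

puts JSON.pretty_generate(match_query('author.name', 'tobias'))
puts JSON.pretty_generate(multi_match_query('wrote document'))
```

The generated JSON can then be sent as the request body of GET :9200/my_documents/_search.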

Elastic has a large number of query types that can be combined freely:

Match All Query
Full text queries
- Match Query
- Match Phrase Query
- Match Phrase Prefix Query
- Multi Match Query
- Common Terms Query
- Query String Query
- Simple Query String Query
- Intervals query
Term level queries
- Term Query
- Terms Query
- Terms Set Query
- Range Query
- Exists Query
- Prefix Query
- Wildcard Query
- Regexp Query
- Fuzzy Query
- Type Query
- Ids Query
Compound queries
- ...
Joining queries
- ...
Geo queries
- ...

For example (from the Elastic documentation):

GET /_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "Search"
          }
        },
        {
          "match": {
            "content": "Elasticsearch"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "status": "published"
          }
        },
        {
          "range": {
            "publish_date": {
              "gte": "2015-01-01"
            }
          }
        }
      ]
    }
  }
}
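
In a bool query, must clauses contribute to the score, while filter clauses only include or exclude documents. A small sketch of how such a body could be assembled in Ruby; bool_query is a hypothetical helper, not an API of the Elasticsearch client:

```ruby
require 'json'

# must:   clauses that have to match and that influence the score
# filter: clauses that have to match but do not influence the score
def bool_query(must: [], filter: [])
  bool = {}
  bool[:must] = must unless must.empty?
  bool[:filter] = filter unless filter.empty?
  { query: { bool: bool } }
end

body = bool_query(
  must: [
    { match: { title: 'Search' } },
    { match: { content: 'Elasticsearch' } }
  ],
  filter: [
    { term: { status: 'published' } },
    { range: { publish_date: { gte: '2015-01-01' } } }
  ]
)

puts JSON.pretty_generate(body)
```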

Cluster / Nodes / Shards

A node is an Elasticsearch server running on one machine
A cluster is a group of nodes (analogous to a distributed SQL database)
- In practice you always talk to the cluster
- Load balancing happens internally and automatically
The cluster has indices (analogous to SQL tables)
Each index has n shards (default 5 in Elasticsearch 6.x)
A document (analogous to a SQL row) ends up in one of the shards
Shards can be replicated automatically
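
How a document ends up in "one of the shards" can be sketched as follows. Elasticsearch documents the rule as hash(routing) % number_of_primary_shards, where the routing value defaults to the document id. The sketch below uses CRC32 as a stand-in hash, which is an assumption for illustration only; the real hash function differs, so the shard numbers will not match a real cluster.

```ruby
require 'zlib'

# Pick a shard for a document id.
# CRC32 is only a stand-in for the hash Elasticsearch really uses.
def shard_for(document_id, number_of_shards: 5)
  Zlib.crc32(document_id.to_s) % number_of_shards
end

# The same id always lands on the same shard ...
shard_for(1) == shard_for(1) # => true

# ... and many ids spread across all shards.
distribution = (1..1000).map { |id| shard_for(id) }.tally
puts distribution.sort.to_h
```

Because the shard count is part of the formula, changing it would send existing ids to different shards. This is why the number of primary shards is fixed when the index is created.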

Mappings

Look at the index again:

http GET :9200/my_documents/

{
  "my_documents": {
    "aliases": {
    },
    "mappings": {
      "post": {
        "properties": {
          "author": {
            "properties": {
              "name": {
                "fields": {
                  "keyword": {
                    "ignore_above": 256,
                    "type": "keyword"
                  }
                },
                "type": "text"
              }
            }
          },
          "body": {
            "fields": {
              "keyword": {
                "ignore_above": 256,
                "type": "keyword"
              }
            },
            "type": "text"
          },
          "title": {
            "fields": {
              "keyword": {
                "ignore_above": 256,
                "type": "keyword"
              }
            },
            "type": "text"
          }
        }
      }
    },
    "settings": {
      # ...
    }
  }
}

Keyword vs Text

Text fields are tokenized and, where applicable, processed further
Keyword fields only ever match "exactly"

Derived Fields

Fields can be indexed multiple times, with different analyzers

cat | http GET :9200/my_documents/_search
{
  "query": {
    "term": {
      "author.name.keyword": "Tobias Kraze"
    }
  }
}
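
The difference between the text field author.name and its keyword sub-field author.name.keyword can be illustrated with a toy comparison. The functions below are made up for this sketch, and analyze is only a rough stand-in for the default analyzer:

```ruby
# Toy analysis step: lowercase and split into word tokens.
def analyze(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# "text" field: the query is analyzed as well, and the document matches
# if it shares at least one token with the query.
def text_match?(stored, query)
  (analyze(stored) & analyze(query)).any?
end

# "keyword" field: the stored value is one unmodified term,
# so only the identical string matches.
def keyword_match?(stored, query)
  stored == query
end

text_match?('Tobias Kraze', 'tobias')          # => true
keyword_match?('Tobias Kraze', 'tobias')       # => false
keyword_match?('Tobias Kraze', 'Tobias Kraze') # => true
```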

Advanced Analyzer

You can use a dedicated analyzer per field
This is typically how you get stemming etc.

Example: mapping for "I don't know the language"

{
  "mappings": {
    "post": {
      "properties": {
        "body": {
          "analyzer": "english",
          "type": "text"
        }
      }
    }
  }
}
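
What stemming buys you can be shown with a deliberately crude sketch. The suffix stripping below is not the algorithm behind the english analyzer; crude_stem and analyze_english are hypothetical names, and the rules are simplified to the point of being wrong for many words.

```ruby
# Strip a few common English suffixes, keeping at least three characters.
def crude_stem(token)
  %w[ing ed s].each do |suffix|
    stem = token.delete_suffix(suffix)
    return stem if stem != token && stem.length >= 3
  end
  token
end

# Tokenize, lowercase, then stem every token.
def analyze_english(text)
  text.downcase.scan(/[[:alnum:]]+/).map { |token| crude_stem(token) }
end

analyze_english('Searching documents') # => ["search", "document"]
analyze_english('A sample document')   # => ["a", "sample", "document"]
```

Since the same analysis is applied at index time and at search time, a search for "documents" finds a post that only contains "document".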

Integration into Rails + searchkick

What do you need to integrate Elastic into a Rails application?

At minimum:

A client to talk to Elastic
Creation of indices and suitable mappings, plus migration of mappings
Automatic indexing and deletion of documents (after_commit, possibly async)
Bulk reindexing (possibly async)
A way to map search results back to database records

A client is available with elasticsearch-api.

Everything else is provided by searchkick.

Integration (slightly simplified):

module Post::DoesSearch
  as_trait do

    searchkick callbacks: :async # use sidekiq

    scope :search_import, -> { published.includes(:channel, :link) } # use for bulk indexing

    # index these fields
    def search_data
      attributes.slice(
        'title',
        'url',
      ).merge(
        body: html_to_text(body_html),
        user: {
          name: user&.name,
        },
        acl_group_ids: channel&.group_ids, # we need to somehow model powers in elastic
      )
    end

    # do not index, if not published
    def should_index?
      published?
    end

  end
end

Reindex with:

Post.reindex(async: true)

Callbacks in other models:

class User < ApplicationRecord

  has_many :posts
  after_commit :reindex_posts

  def reindex_posts
    if saved_change_to_attribute?(:name)
      posts.reindex # this is not async, but we have workarounds!
    end
  end

end

Search:

Searchkick.search(params[:query].to_s, models: [Post, Message, Document], page: params[:page], per_page: 20).to_a
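
The indexed acl_group_ids field only helps if searches filter on it. One way this could look is Searchkick's where option. The sketch below only builds the options hash in plain Ruby; search_options and the idea of passing the current user's group ids are assumptions for illustration, not the setup used in the talk.

```ruby
# Hypothetical helper: options for Searchkick.search that restrict hits
# to documents visible to the given groups.
def search_options(group_ids:, page: nil, per_page: 20)
  {
    # An array value means: acl_group_ids contains any of these ids.
    where: { acl_group_ids: group_ids },
    page: (page || 1).to_i,
    per_page: per_page
  }
end

options = search_options(group_ids: [3, 7], page: '2')
p options

# In the Rails app this would be passed on roughly as:
#   Searchkick.search(query, models: [Post, Message, Document], **options)
```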

What Searchkick does behind the scenes

One index per model
Index aliases
Default mappings
Search across multiple indices
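
Index aliases are what make a full reindex possible without downtime: a new index is built next to the old one, and the alias is switched once the import is done. A toy in-memory simulation of that idea follows; ToyCluster and the suffix values are made up for this sketch, and real index names and steps may differ in detail.

```ruby
# In-memory stand-in for a cluster.
class ToyCluster
  attr_reader :indices, :aliases

  def initialize
    @indices = {} # index name => documents
    @aliases = {} # alias name => index name
  end

  # Build a fresh index, then repoint the alias and drop the old index.
  def reindex(alias_name, documents, suffix:)
    new_index = "#{alias_name}_#{suffix}"
    @indices[new_index] = documents.dup # searches still hit the old index
    old_index = @aliases[alias_name]
    @aliases[alias_name] = new_index    # the switch
    @indices.delete(old_index) if old_index
    new_index
  end

  # The application only ever addresses the alias.
  def search_target(alias_name)
    @aliases.fetch(alias_name)
  end
end

cluster = ToyCluster.new
cluster.reindex('posts_development', [{ title: 'first' }], suffix: '001')
cluster.reindex('posts_development', [{ title: 'first' }, { title: 'second' }], suffix: '002')

cluster.search_target('posts_development') # => "posts_development_002"
cluster.indices.keys                       # => ["posts_development_002"]
```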

View the mapping

http GET :9200/posts_development

The search issues the following query:

cat | http GET :9200/documents_development,posts_development/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "dis_max": {
            "queries": [
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 10,
                  "operator": "and",
                  "analyzer": "searchkick_search",
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              },
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 10,
                  "operator": "and",
                  "analyzer": "searchkick_search2",
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              },
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 1,
                  "operator": "and",
                  "analyzer": "searchkick_search",
                  "fuzziness": 1,
                  "prefix_length": 0,
                  "max_expansions": 3,
                  "fuzzy_transpositions": true,
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              },
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 1,
                  "operator": "and",
                  "analyzer": "searchkick_search2",
                  "fuzziness": 1,
                  "prefix_length": 0,
                  "max_expansions": 3,
                  "fuzzy_transpositions": true,
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              }
            ]
          }
        }
      ]
    }
  },
  "timeout": "11s",
  "_source": false,
  "size": 10000,
  "from": 0
}
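
The query wraps four variants (exact and fuzzy, with two analyzers) in a dis_max. A dis_max query scores a document with the best-scoring sub-query, plus tie_breaker times the scores of the other matching sub-queries; the tie_breaker defaults to 0. The exact variants carry boost 10 and the fuzzy ones boost 1, so an exact hit normally wins. A small sketch of that scoring rule, with made-up scores:

```ruby
# Score of a dis_max query, given the scores of the matching sub-queries.
def dis_max_score(scores, tie_breaker: 0.0)
  return 0.0 if scores.empty?

  best = scores.max
  best + tie_breaker * (scores.sum - best)
end

dis_max_score([3.0, 1.0, 0.5])                   # => 3.0
dis_max_score([3.0, 1.0, 0.5], tie_breaker: 0.5) # => 3.75
```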

Aggregations

aggregations provide "statistics" about search hits
used to be called "facets"
somewhat analogous to "GROUP BY" in SQL
there is a very long list of supported aggregations:
- Metrics
  - Avg
  - Weighted avg
  - Geo bounds
  - Max
  - Min
  - ...
- Buckets
  - Terms
  - Date Histogram
  - IP Range
  - Significant Terms

Searchkick.search(query, models: [Post, Message], limit: 0, aggs: [:_type]).to_a

cat | http GET :9200/documents_development,posts_development/_search
{
  "query": { ... },
  "aggs": {
    "_type": {
      "terms": {
        "field": "_type",
        "size": 1000
      }
    }
  }
}

{
  "aggregations": {
    "_type": {
      "buckets": [
        {
          "doc_count": 476,
          "key": "document"
        },
        {
          "doc_count": 31,
          "key": "post"
        }
      ],
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0
    }
  },
  "hits": { ... }
}
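
A terms aggregation response like the one above is easy to turn into a simple counts hash. A minimal sketch in plain Ruby; bucket_counts is a hypothetical helper, and the sample response is reduced to the aggregations part:

```ruby
require 'json'

# Turn the buckets of a terms aggregation into { key => doc_count }.
def bucket_counts(response, aggregation_name)
  buckets = response.fetch('aggregations').fetch(aggregation_name).fetch('buckets')
  buckets.map { |bucket| [bucket['key'], bucket['doc_count']] }.to_h
end

response = JSON.parse(<<~JSON)
  {
    "aggregations": {
      "_type": {
        "buckets": [
          { "doc_count": 476, "key": "document" },
          { "doc_count": 31, "key": "post" }
        ],
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0
      }
    }
  }
JSON

bucket_counts(response, '_type') # => {"document"=>476, "post"=>31}
```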

Testing

Indexing is off by default and is enabled via a tag (search: true in the spec below). The index then also has to be created once.

# spec/spec_helper.rb

require 'searchkick'
Searchkick.index_suffix = ENV['TEST_ENV_NUMBER'] # must happen before models are loaded

# search_spec.rb

describe Search, search: true do # enable indexing
  # ...

  it 'only returns posts for a post query' do
    post = create(:post)
    create(:event)
    Post.search_index.refresh # force index to update
    Event.search_index.refresh # force index to update

    expect(described_class.new(query: '*', type: 'post').each.map(&:record)).to eq [post]
  end

  # ...

end

Tobias Kraze

makandra.de

Last edit: 2019-05-22 (Tobias Kraze)

License

Source code in this card is licensed under the MIT License.

Posted by Tobias Kraze to makandra dev (2019-05-02 12:47)