What is Elastic?
- Search engine based on Apache Lucene
- Mostly open source
- Some commercial features ("Elastic Stack", formerly "X-Pack")
  - Access control (until recently)
  - Monitoring
  - Reporting
  - Graph support
  - Machine learning
- REST API (JSON over HTTP)
Basics
Elastic listens on port 9200 by default:
http GET :9200
{
"name": "ntK2ZrY",
"cluster_name": "elasticsearch",
"cluster_uuid": "Bbc-ix5bQZij5vfFU29-Cw",
"version": {
"number": "6.7.1",
"build_flavor": "default",
"build_type": "deb",
"build_hash": "2f32220",
"build_date": "2019-04-02T15:59:27.961366Z",
"build_snapshot": false,
"lucene_version": "7.7.0",
"minimum_wire_compatibility_version": "5.6.0",
"minimum_index_compatibility_version": "5.0.0"
},
"tagline": "You Know, for Search"
}
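The same request can be made from Ruby with the standard library. A minimal sketch: the helper names cluster_info and version_number and the URL http://localhost:9200/ are assumptions for illustration, not part of any client gem.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Fetches the root endpoint and parses the JSON body.
# Assumes an unauthenticated node reachable at base_url (hypothetical default).
def cluster_info(base_url = 'http://localhost:9200/')
  JSON.parse(Net::HTTP.get(URI(base_url)))
end

# Extracts the version string from an already parsed root response.
def version_number(info)
  info.fetch('version').fetch('number')
end
```

With a node running, version_number(cluster_info) would return the value shown under "version" / "number" above.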
Indexes
Documents in Elastic live in an index (analogous to a SQL table).
Create an index:
http PUT :9200/my_documents
Inspect the index:
http GET :9200/my_documents
{
"my_documents": {
"aliases": {
},
"mappings": {
},
"settings": {
"index": {
"creation_date": "1556793804746",
"number_of_replicas": "1",
"number_of_shards": "5",
"provided_name": "my_documents",
"uuid": "Q9nNksJhT3Ky7SOqDQa4bA",
"version": {
"created": "6070199"
}
}
}
}
}
Documents
Index a document:
cat | http PUT :9200/my_documents/post/1
{
"title": "A sample document",
"body": "I wrote something",
"author": {
"name": "Tobias Kraze"
}
}
{
"_id": "1",
"_index": "my_documents",
"_primary_term": 1,
"_seq_no": 0,
"_shards": {
"failed": 0,
"successful": 1,
"total": 2
},
"_type": "post",
"_version": 1,
"result": "created"
}
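For comparison, the same indexing call built in Ruby. A sketch using only Net::HTTP; the helper name index_request is hypothetical, and the request is only built here, not sent.

```ruby
require 'json'
require 'net/http'

# Builds (but does not send) the PUT request for indexing one document.
# The path layout /<index>/<type>/<id> mirrors the httpie call above.
def index_request(index, type, id, document)
  request = Net::HTTP::Put.new("/#{index}/#{type}/#{id}", 'Content-Type' => 'application/json')
  request.body = JSON.generate(document)
  request
end

# Sending it would look roughly like this, assuming a node on localhost:9200:
#   doc = { title: 'A sample document', body: 'I wrote something' }
#   Net::HTTP.start('localhost', 9200) { |http| http.request(index_request('my_documents', 'post', 1, doc)) }
```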
Searching
http GET :9200/my_documents/_search?q=tobias
{
"_shards": {
"failed": 0,
"skipped": 0,
"successful": 5,
"total": 5
},
"hits": {
"hits": [
{
"_id": "1",
"_index": "my_documents",
"_score": 0.2876821,
"_source": {
"author": {
"name": "Tobias Kraze"
},
"body": "I wrote something",
"title": "A sample document"
},
"_type": "post"
}
],
"max_score": 0.2876821,
"total": 1
},
"timed_out": false,
"took": 1
}
http GET :9200/my_documents/_search?q=body:tobias
-> no hits (the name is not in the body field)
http GET :9200/my_documents/_search?q=body:something
-> finds the document
http GET :9200/my_documents/_search?q=tob
-> no hits (only whole tokens match)
http GET :9200/my_documents/_search?q=tob*
-> finds the document (wildcard)
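Why "tob" finds nothing while "tob*" does can be illustrated with a toy inverted index. This is a simplified model, not how Elasticsearch is implemented: tokenization is just "lowercase and split on non-word characters", and toy_add / toy_search are made-up helper names.

```ruby
# Adds a document to a toy inverted index (token => list of document ids).
def toy_add(postings, id, text)
  text.downcase.scan(/\w+/).uniq.each { |token| (postings[token] ||= []) << id }
  postings
end

# Without a trailing "*" only whole tokens are looked up;
# with it, every token starting with the prefix counts as a match.
def toy_search(postings, query)
  term = query.downcase
  return postings.fetch(term, []) unless term.end_with?('*')

  prefix = term.chomp('*')
  postings.select { |token, _ids| token.start_with?(prefix) }.values.flatten.uniq
end

postings = toy_add({}, 1, 'Tobias Kraze')
toy_search(postings, 'tobias') # [1]
toy_search(postings, 'tob')    # []
toy_search(postings, 'tob*')   # [1]
```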
Search with a JSON body
For software, this is the less ambiguous way to build queries:
cat | http GET :9200/my_documents/_search
{
"query": {
"match": { "author.name": "tobias" }
}
}
cat | http GET :9200/my_documents/_search
{
"query": {
"multi_match": {
"query": "wrote document",
"fields": ["*"]
}
}
}
Elastic offers a large number of query types that can be combined freely:
- Match All Query
- Full text queries
  - Match Query
  - Match Phrase Query
  - Match Phrase Prefix Query
  - Multi Match Query
  - Common Terms Query
  - Query String Query
  - Simple Query String Query
  - Intervals query
- Term level queries
  - Term Query
  - Terms Query
  - Terms Set Query
  - Range Query
  - Exists Query
  - Prefix Query
  - Wildcard Query
  - Regexp Query
  - Fuzzy Query
  - Type Query
  - Ids Query
- Compound queries
  - ...
- Joining queries
  - ...
- Geo queries
  - ...
E.g. (from the Elastic documentation):
GET /_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "Search"
}
},
{
"match": {
"content": "Elasticsearch"
}
}
],
"filter": [
{
"term": {
"status": "published"
}
},
{
"range": {
"publish_date": {
"gte": "2015-01-01"
}
}
}
]
}
}
}
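In application code such a body is usually assembled as a plain hash and serialized to JSON. A sketch with a hypothetical helper bool_query (not part of any gem): must clauses contribute to the score, filter clauses only restrict the result set.

```ruby
require 'json'

# Hypothetical helper: wraps scoring clauses (must) and
# non-scoring clauses (filter) into a bool query body.
def bool_query(must: [], filter: [])
  { query: { bool: { must: must, filter: filter } } }
end

body = bool_query(
  must: [
    { match: { title: 'Search' } },
    { match: { content: 'Elasticsearch' } }
  ],
  filter: [
    { term: { status: 'published' } },
    { range: { publish_date: { gte: '2015-01-01' } } }
  ]
)
puts JSON.pretty_generate(body)
```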
Cluster / Nodes / Shards
- A node is an Elasticsearch server running on one machine
- A cluster is a group of nodes (analogous to a distributed SQL database)
- You effectively always talk to the cluster
- Load balancing happens internally and automatically
- The cluster has indexes (analogous to SQL tables)
- Each index has n shards (default: 5 in Elasticsearch 6.x)
- A document (analogous to a SQL row) ends up in one of the shards
- Shards can be replicated automatically
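How a document ends up in "one of the shards" can be sketched as hash(routing value) modulo the number of primary shards, where the routing value defaults to the document id. In the sketch below Zlib.crc32 is only a stand-in hash for illustration (Elasticsearch uses a different hash function), and shard_for is a made-up name.

```ruby
require 'zlib'

# Picks a shard number for a document id.
# Zlib.crc32 is an assumption for illustration, not Elasticsearch's hash.
def shard_for(document_id, number_of_shards = 5)
  Zlib.crc32(document_id.to_s) % number_of_shards
end

# Count how 1000 ids spread over the 5 shards.
counts = (1..1000).each_with_object(Hash.new(0)) { |id, c| c[shard_for(id)] += 1 }
puts counts.sort.to_h
```

The same id always maps to the same shard, which is what lets the cluster find a document again by id without asking every shard.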
Mappings
Look at the index again:
http GET :9200/my_documents/
{
"my_documents": {
"aliases": {
},
"mappings": {
"post": {
"properties": {
"author": {
"properties": {
"name": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
}
}
},
"body": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
},
"title": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
}
}
}
},
"settings": {
# ...
}
}
}
Keyword vs. Text
- Text fields are tokenized and, where applicable, processed further
- Keyword fields only ever match "exactly"
Derived Fields
- Fields can end up in the index multiple times, with different analyzers
cat | http GET :9200/my_documents/_search
{
"query": {
"term": {
"author.name.keyword": "Tobias Kraze"
}
}
}
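The difference between the two field types, and why the term query above targets author.name.keyword, in a small Ruby model. The functions are simplified stand-ins with made-up names, not Elasticsearch code.

```ruby
# A text field stores analyzed tokens (here: lowercased words).
def text_terms(value)
  value.downcase.scan(/\w+/)
end

# A keyword field stores the value unchanged as a single term.
def keyword_terms(value)
  [value]
end

# A term query compares the search term with the stored terms as-is,
# without analyzing the search term first.
def term_match?(stored_terms, term)
  stored_terms.include?(term)
end

name = 'Tobias Kraze'
term_match?(keyword_terms(name), 'Tobias Kraze') # true: the exact value is stored
term_match?(text_terms(name), 'Tobias Kraze')    # false: only "tobias" and "kraze" are stored
term_match?(text_terms(name), 'tobias')          # true
```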
Advanced Analyzer
- You can use a specific analyzer per field
- This is typically used for stemming etc.
Example: mapping for "I don't know the language"
{
"mappings": {
"post": {
"properties": {
"body": {
"analyzer": "english",
"type": "text"
}
}
}
}
}
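What stemming buys you, as a deliberately naive toy: different word forms are reduced to the same token at index time and at search time, so they match each other. The suffix list and the names toy_stem / toy_analyze are assumptions for illustration; this is not the algorithm behind the "english" analyzer.

```ruby
# Strips one common English suffix, keeping at least three characters.
def toy_stem(token)
  suffix = %w[ing ed s].find { |s| token.end_with?(s) && token.length > s.length + 2 }
  suffix ? token[0...-suffix.length] : token
end

# Lowercases, splits into words and stems each word.
def toy_analyze(text)
  text.downcase.scan(/\w+/).map { |token| toy_stem(token) }
end

toy_analyze('Searching documents') # ["search", "document"]
toy_analyze('searched document')   # ["search", "document"]
```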
Integration in Rails + searchkick
What do you need to integrate Elastic into a Rails application?
At a minimum:
- A client to talk to Elastic
- Creating indexes and suitable mappings, plus migrating mappings
- Automatic indexing and deletion of documents (after_commit, possibly async)
- Bulk reindexing (possibly async)
- A way to map search results back to database records
A client is available with elasticsearch-api.
Everything else is provided by searchkick.
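To illustrate the last requirement (mapping hits back to records): hits only carry a type and an id, so one plausible approach is to group the ids by type and load each group with a single database query. A plain-Ruby sketch; ids_by_type is a hypothetical helper and not necessarily how searchkick does it internally.

```ruby
# Groups raw hits by type so each model needs only one database query.
# The hit keys (_type, _id) follow the search responses shown earlier.
def ids_by_type(hits)
  hits.each_with_object({}) do |hit, grouped|
    (grouped[hit.fetch('_type')] ||= []) << hit.fetch('_id')
  end
end

hits = [
  { '_type' => 'post', '_id' => '1' },
  { '_type' => 'document', '_id' => '7' },
  { '_type' => 'post', '_id' => '3' }
]
ids_by_type(hits) # {"post"=>["1", "3"], "document"=>["7"]}
# In a Rails app each group could then be loaded with e.g. Post.where(id: ids).
```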
Integration (slightly simplified):
module Post::DoesSearch
  as_trait do
    searchkick callbacks: :async # use sidekiq
    scope :search_import, -> { published.includes(:channel, :link) } # use for bulk indexing
    # index these fields
    def search_data
      attributes.slice(
        'title',
        'url',
      ).merge(
        body: html_to_text(body_html),
        user: {
          name: user&.name,
        },
        acl_group_ids: channel&.group_ids, # we need to somehow model powers in elastic
      )
    end
    # do not index, if not published
    def should_index?
      published?
    end
  end
end
Index with:
Post.reindex(async: true)
Callbacks in other models:
class User < ApplicationRecord
  has_many :posts
  after_commit :reindex_posts
  def reindex_posts
    if saved_change_to_attribute?(:name)
      posts.reindex # this is not async, but we have workarounds!
    end
  end
end
Search:
Searchkick.search(params[:query].to_s, models: [Post, Message, Document], page: params[:page], per_page: 20).to_a
What Searchkick does behind the scenes
- One index per model
- Index aliases
- Default mappings
- Search across multiple indexes
Inspect the mapping:
http GET :9200/posts_development
The search runs the following query:
cat | http GET :9200/documents_development,posts_development/_search
{
"query": {
"bool": {
"should": [
{
"dis_max": {
"queries": [
{
"multi_match": {
"query": "my search query",
"boost": 10,
"operator": "and",
"analyzer": "searchkick_search",
"fields": [
"*.analyzed"
],
"type": "best_fields"
}
},
{
"multi_match": {
"query": "my search query",
"boost": 10,
"operator": "and",
"analyzer": "searchkick_search2",
"fields": [
"*.analyzed"
],
"type": "best_fields"
}
},
{
"multi_match": {
"query": "my search query",
"boost": 1,
"operator": "and",
"analyzer": "searchkick_search",
"fuzziness": 1,
"prefix_length": 0,
"max_expansions": 3,
"fuzzy_transpositions": true,
"fields": [
"*.analyzed"
],
"type": "best_fields"
}
},
{
"multi_match": {
"query": "my search query",
"boost": 1,
"operator": "and",
"analyzer": "searchkick_search2",
"fuzziness": 1,
"prefix_length": 0,
"max_expansions": 3,
"fuzzy_transpositions": true,
"fields": [
"*.analyzed"
],
"type": "best_fields"
}
}
]
}
}
]
}
},
"timeout": "11s",
"_source": false,
"size": 10000,
"from": 0
}
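The structure of this query is easier to see when it is generated: two analyzers, each once exact (boost 10) and once fuzzy (boost 1), combined in a dis_max. The sketch below only rebuilds the JSON shown above; the helper names are made up and this is not searchkick's actual code.

```ruby
# One multi_match clause; fuzzy clauses get a lower boost plus fuzziness options.
def multi_match_clause(term, analyzer, fuzzy:)
  clause = {
    query: term, boost: fuzzy ? 1 : 10, operator: 'and',
    analyzer: analyzer, fields: ['*.analyzed'], type: 'best_fields'
  }
  clause.merge!(fuzziness: 1, prefix_length: 0, max_expansions: 3, fuzzy_transpositions: true) if fuzzy
  { multi_match: clause }
end

# Exact clauses for both analyzers first, then the fuzzy variants.
def dis_max_query(term)
  analyzers = %w[searchkick_search searchkick_search2]
  queries = [false, true].flat_map do |fuzzy|
    analyzers.map { |analyzer| multi_match_clause(term, analyzer, fuzzy: fuzzy) }
  end
  { query: { bool: { should: [{ dis_max: { queries: queries } }] } } }
end
```

dis_max scores a document by its best-matching clause, so an exact match (boost 10) outranks a merely fuzzy one (boost 1).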
Aggregations
- Aggregations provide "statistics" about the search hits
- They used to be called "facets"
- Somewhat analogous to "GROUP BY" in SQL
- There is a long list of supported aggregations:
- Metrics
  - Avg
  - Weighted avg
  - Geo bounds
  - Max
  - Min
  - ...
- Buckets
  - Terms
  - Date Histogram
  - IP Range
  - Significant Terms
- Metrics
Searchkick.search(query, models: [Post, Message], limit: 0, aggs: [:_type]).to_a
cat | http GET :9200/documents_development,posts_development/_search
{
"query": { ... },
"aggs": {
"_type": {
"terms": {
"field": "_type",
"size": 1000
}
}
}
}
{
"aggregations": {
"_type": {
"buckets": [
{
"doc_count": 476,
"key": "document"
},
{
"doc_count": 31,
"key": "post"
}
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
}
},
"hits": { ... }
}
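Reading such a response in Ruby comes down to walking the buckets. A small sketch on a parsed response hash; bucket_counts is a hypothetical helper, and the sample data is taken from the response above.

```ruby
# Turns the buckets of a terms aggregation into { key => doc_count }.
def bucket_counts(response, aggregation_name)
  buckets = response.fetch('aggregations').fetch(aggregation_name).fetch('buckets')
  buckets.map { |bucket| [bucket.fetch('key'), bucket.fetch('doc_count')] }.to_h
end

response = {
  'aggregations' => {
    '_type' => {
      'buckets' => [
        { 'doc_count' => 476, 'key' => 'document' },
        { 'doc_count' => 31, 'key' => 'post' }
      ]
    }
  }
}
bucket_counts(response, '_type') # {"document"=>476, "post"=>31}
```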
Testing
Indexing is off by default and is switched on per spec via a tag (search: true in the example below). The index must then also be created once.
# spec/spec_helper.rb
require 'searchkick'
Searchkick.index_suffix = ENV['TEST_ENV_NUMBER'] # must happen before models are loaded
# search_spec.rb
describe Search, search: true do # enable indexing
  # ...
  it 'only returns posts for a post query' do
    post = create(:post)
    create(:event)
    Post.search_index.refresh # force index to update
    Event.search_index.refresh # force index to update
    expect(described_class.new(query: '*', type: 'post').each.map(&:record)).to eq [post]
  end
  # ...
end