What is Elastic?
- Search engine based on Apache Lucene
- Mostly open source
- Some commercial features ("Elastic Stack", formerly "X-Pack")
  - Access control (until recently)
  - Monitoring
  - Reporting
  - Graph support
  - Machine learning
- REST API (JSON over HTTP)
 
Basics
By default, Elastic listens on port 9200:
http GET :9200
{
  "name": "ntK2ZrY",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "Bbc-ix5bQZij5vfFU29-Cw",
  "version": {
    "number": "6.7.1",
    "build_flavor": "default",
    "build_type": "deb",
    "build_hash": "2f32220",
    "build_date": "2019-04-02T15:59:27.961366Z",
    "build_snapshot": false,
    "lucene_version": "7.7.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}
Indices
Documents in Elastic live in an index (analogous to a SQL table).
Create an index:
http PUT :9200/my_documents
Inspect the index:
http GET :9200/my_documents
{
  "my_documents": {
    "aliases": {
    },
    "mappings": {
    },
    "settings": {
      "index": {
        "creation_date": "1556793804746",
        "number_of_replicas": "1",
        "number_of_shards": "5",
        "provided_name": "my_documents",
        "uuid": "Q9nNksJhT3Ky7SOqDQa4bA",
        "version": {
          "created": "6070199"
        }
      }
    }
  }
}
Documents
Index a document:
cat | http PUT :9200/my_documents/post/1
{
  "title": "A sample document",
  "body": "I wrote something",
  "author": {
    "name": "Tobias Kraze"
  }
}
{
  "_id": "1",
  "_index": "my_documents",
  "_primary_term": 1,
  "_seq_no": 0,
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "post",
  "_version": 1,
  "result": "created"
}
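The same request can be built from Ruby with the standard library. This is a minimal sketch, assuming an unsecured node on localhost:9200; the helper name `build_index_request` is made up for illustration, and the request is only constructed here, not sent.

```ruby
require 'json'
require 'net/http'

# Hypothetical helper: builds (but does not send) the PUT request that
# indexes `document` under /<index>/<type>/<id>, like the HTTPie call above.
def build_index_request(index, type, id, document)
  request = Net::HTTP::Put.new("/#{index}/#{type}/#{id}")
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(document)
  request
end

request = build_index_request('my_documents', 'post', 1, {
  title: 'A sample document',
  body: 'I wrote something',
  author: { name: 'Tobias Kraze' }
})

# To actually send it (assumes Elasticsearch on localhost:9200):
#   response = Net::HTTP.start('localhost', 9200) { |http| http.request(request) }
#   JSON.parse(response.body)['result']
```

Keeping construction and sending apart makes the request easy to inspect in a console before anything touches the cluster.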
Searching
http GET ':9200/my_documents/_search?q=tobias'
{
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 5,
    "total": 5
  },
  "hits": {
    "hits": [
      {
        "_id": "1",
        "_index": "my_documents",
        "_score": 0.2876821,
        "_source": {
          "author": {
            "name": "Tobias Kraze"
          },
          "body": "I wrote something",
          "title": "A sample document"
        },
        "_type": "post"
      }
    ],
    "max_score": 0.2876821,
    "total": 1
  },
  "timed_out": false,
  "took": 1
}
http GET ':9200/my_documents/_search?q=body:tobias'
-> no hits
http GET ':9200/my_documents/_search?q=body:something'
-> match
http GET ':9200/my_documents/_search?q=tob'
-> no hits
http GET ':9200/my_documents/_search?q=tob*'
-> match
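These results follow from how full-text fields are indexed: the text is split into lowercase tokens, a plain query term has to equal a whole token, and `tob*` matches any token with that prefix. A toy sketch in Ruby, assuming a drastically simplified analyzer (the real standard analyzer does more); all names are illustrative and not part of any Elasticsearch client.

```ruby
# Toy inverted index: token => ids of the documents containing it.
# The analyzer (split on non-alphanumerics, downcase) is a simplification.
def analyze(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

def build_index(documents)
  index = Hash.new { |hash, token| hash[token] = [] }
  documents.each do |id, text|
    analyze(text).uniq.each { |token| index[token] << id }
  end
  index
end

# A plain term must equal a complete token.
def search_term(index, term)
  index.fetch(term.downcase, [])
end

# A trailing wildcard matches every token starting with the prefix.
def search_prefix(index, prefix)
  index.select { |token, _ids| token.start_with?(prefix.downcase) }
       .values.flatten.uniq
end

index = build_index(1 => 'Tobias Kraze')

search_term(index, 'tobias')  # => [1]
search_term(index, 'tob')     # => []
search_prefix(index, 'tob')   # => [1]
```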
Searching with a JSON body
For software, this is the less ambiguous way to build queries:
cat | http GET :9200/my_documents/_search
{
  "query": {
    "match": { "author.name": "tobias" }
  }
}
cat | http GET :9200/my_documents/_search
{
  "query": {
    "multi_match": { 
      "query": "wrote document",
      "fields": ["*"]
    }
  }
}
Elastic offers a large number of query types that can be combined freely:
- Match All Query
- Full text queries
  - Match Query
  - Match Phrase Query
  - Match Phrase Prefix Query
  - Multi Match Query
  - Common Terms Query
  - Query String Query
  - Simple Query String Query
  - Intervals query
- Term level queries
  - Term Query
  - Terms Query
  - Terms Set Query
  - Range Query
  - Exists Query
  - Prefix Query
  - Wildcard Query
  - Regexp Query
  - Fuzzy Query
  - Type Query
  - Ids Query
- Compound queries
  - ...
- Joining queries
  - ...
- Geo queries
  - ...
 
 
For example (from the Elastic documentation):
GET /_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "Search"
          }
        },
        {
          "match": {
            "content": "Elasticsearch"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "status": "published"
          }
        },
        {
          "range": {
            "publish_date": {
              "gte": "2015-01-01"
            }
          }
        }
      ]
    }
  }
}
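When queries like this are assembled in application code, it helps to build them as plain hashes and serialize only at the end. A small sketch, assuming the caller passes ready-made clauses; `bool_query` is a hypothetical helper, not part of any client library.

```ruby
require 'json'

# Hypothetical helper: combines scoring clauses (must) with
# non-scoring clauses (filter) into one bool query.
def bool_query(must: [], filter: [])
  bool = {}
  bool[:must] = must unless must.empty?
  bool[:filter] = filter unless filter.empty?
  { query: { bool: bool } }
end

query = bool_query(
  must: [
    { match: { title: 'Search' } },
    { match: { content: 'Elasticsearch' } }
  ],
  filter: [
    { term: { status: 'published' } },
    { range: { publish_date: { gte: '2015-01-01' } } }
  ]
)

puts JSON.pretty_generate(query)
```

Clauses under `must` contribute to the relevance score, while clauses under `filter` only include or exclude documents.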
Cluster / Nodes / Shards
- A node is an Elasticsearch server running on one machine
- A cluster is a group of nodes (analogous to a distributed SQL database)
  - In practice you always talk to the cluster
  - Load balancing happens automatically internally
- The cluster has indices (analogous to SQL tables)
- Each index has n shards (default 5 in Elasticsearch 6.x)
- A document (analogous to a SQL row) ends up in one of the shards
- Shards can be replicated automatically
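How a document ends up in one particular shard can be pictured as hashing its routing value (by default the document ID) modulo the number of primary shards. The sketch below assumes CRC32 as a stand-in hash; Elasticsearch itself uses a different hash function (Murmur3), so the actual shard numbers will differ.

```ruby
require 'zlib'

NUMBER_OF_SHARDS = 5 # the value shown in the index settings earlier

# Stand-in for Elasticsearch's routing: hash the routing value (the document
# ID unless specified otherwise) and take it modulo the primary shard count.
# CRC32 is an assumption made for illustration only.
def shard_for(document_id, number_of_shards = NUMBER_OF_SHARDS)
  Zlib.crc32(document_id.to_s) % number_of_shards
end

shards = Hash.new { |hash, shard| hash[shard] = [] }
(1..20).each { |id| shards[shard_for(id)] << id }

shards.sort.each do |shard, ids|
  puts "shard #{shard}: documents #{ids.join(', ')}"
end
```

Because the shard is derived from the shard count, changing the number of primary shards later would send existing IDs to different shards.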
 

Mappings
Look at the index again:
http GET :9200/my_documents/
{
  "my_documents": {
    "aliases": {
    },
    "mappings": {
      "post": {
        "properties": {
          "author": {
            "properties": {
              "name": {
                "fields": {
                  "keyword": {
                    "ignore_above": 256,
                    "type": "keyword"
                  }
                },
                "type": "text"
              }
            }
          },
          "body": {
            "fields": {
              "keyword": {
                "ignore_above": 256,
                "type": "keyword"
              }
            },
            "type": "text"
          },
          "title": {
            "fields": {
              "keyword": {
                "ignore_above": 256,
                "type": "keyword"
              }
            },
            "type": "text"
          }
        }
      }
    },
    "settings": {
      # ...
    }
  }
}
Keyword vs Text
- Text fields are tokenized and, where necessary, processed further
- Keyword fields only ever match "exactly"
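The difference can be illustrated with a toy model: a `text` field stores the individual tokens produced by an analyzer, while a `keyword` field stores the whole value unchanged. The analyzer below is a crude assumption (lowercase, split on non-alphanumerics), and the method names are invented for this sketch.

```ruby
# What a text field stores in this toy model: lowercase tokens.
def text_terms(value)
  value.downcase.scan(/[[:alnum:]]+/)
end

# What a keyword field stores: the unmodified value as a single term.
def keyword_terms(value)
  [value]
end

# A term query compares the query string with the stored terms as-is.
def term_match?(stored_terms, query)
  stored_terms.include?(query)
end

name = 'Tobias Kraze'

term_match?(text_terms(name), 'tobias')          # => true
term_match?(text_terms(name), 'Tobias Kraze')    # => false
term_match?(keyword_terms(name), 'Tobias Kraze') # => true
term_match?(keyword_terms(name), 'tobias')       # => false
```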
 
Derived Fields
- A field can be indexed multiple times, each time with a different analyzer
 
cat | http GET :9200/my_documents/_search
{
  "query": {
    "term": {
      "author.name.keyword": "Tobias Kraze"
    }
  }
}
Advanced Analyzer
- You can configure a dedicated analyzer per field
- This is typically used for stemming and similar processing
 
Example: mapping for "I don't know the language"
{
  "mappings": {
    "post": {
      "properties": {
        "body": {
          "analyzer": "english",
          "type": "text"
        }
      }
    }
  }
}
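To see why stemming helps, consider a deliberately crude stemmer that strips a few suffixes. This is not the algorithm the `english` analyzer uses; it is a toy stand-in with invented names, meant only to show that indexed text and query text meet on a common root form.

```ruby
# Crude suffix stripping, for illustration only. Real stemmers
# handle far more cases.
SUFFIXES = %w[ing ed s].freeze

def toy_stem(token)
  suffix = SUFFIXES.find { |s| token.end_with?(s) && token.length > s.length + 2 }
  suffix ? token.delete_suffix(suffix) : token
end

# Applies the same processing at index time and at query time.
def toy_analyze(text)
  text.downcase.scan(/[[:alnum:]]+/).map { |token| toy_stem(token) }
end

indexed = toy_analyze('Searching documents')  # => ["search", "document"]
query   = toy_analyze('searched document')    # => ["search", "document"]

indexed & query # both terms overlap although the surface forms differ
```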
Integrating with Rails + searchkick
What do you need to integrate Elastic into a Rails application?
At a minimum:
- A client to talk to Elastic
- Creation of indices and suitable mappings, plus migration of mappings
- Automatic indexing and deletion of documents (after_commit, possibly async)
- Bulk reindexing (possibly async)
- A way to map search results back to database records
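The last point deserves a sketch: a search response identifies each hit by type and ID, in relevance order, so the application has to load the matching records and restore that order. The example below uses plain hashes in place of ActiveRecord models; `RECORDS` and `records_for_hits` are hypothetical names.

```ruby
# Stand-in for database tables, keyed by type and then by primary key.
RECORDS = {
  'post'     => { 1 => 'Post #1', 2 => 'Post #2' },
  'document' => { 7 => 'Document #7' }
}.freeze

# Loads the records for all hits with one lookup per type and returns
# them in the order of the hits. Hits without a record (e.g. deleted
# since indexing) are dropped.
def records_for_hits(hits)
  ids_by_type = hits.group_by { |hit| hit['_type'] }
                    .transform_values { |group| group.map { |hit| hit['_id'].to_i } }

  loaded = ids_by_type.to_h do |type, ids|
    [type, RECORDS.fetch(type, {}).slice(*ids)] # think: Model.where(id: ids)
  end

  hits.map { |hit| loaded[hit['_type']][hit['_id'].to_i] }.compact
end

hits = [
  { '_type' => 'document', '_id' => '7' },
  { '_type' => 'post',     '_id' => '2' },
  { '_type' => 'post',     '_id' => '99' }, # stale hit, no record
  { '_type' => 'post',     '_id' => '1' }
]

records_for_hits(hits) # => ["Document #7", "Post #2", "Post #1"]
```

Grouping by type first keeps the number of lookups at one per model instead of one per hit.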
 
A client is available via elasticsearch-api.
Everything else is provided by searchkick.
Integration (slightly simplified):
module Post::DoesSearch
  as_trait do
    searchkick callbacks: :async # use sidekiq
    scope :search_import, -> { published.includes(:channel, :link) } # use for bulk indexing
    # index these fields
    def search_data
      attributes.slice(
        'title',
        'url',
      ).merge(
        body: html_to_text(body_html),
        user: {
          name: user&.name,
        },
        acl_group_ids: channel&.group_ids, # we need to somehow model powers in elastic
      )
    end
    # do not index, if not published
    def should_index?
      published?
    end
  end
end
Index with:
Post.reindex(async: true)
Callbacks in other models:
class User < ApplicationRecord
  has_many :posts
  after_commit :reindex_posts
  def reindex_posts
    if saved_change_to_attribute?(:name)
      posts.reindex # this is not async, but we have workarounds!
    end
  end
end
Search:
Searchkick.search(params[:query].to_s, models: [Post, Message, Document], page: params[:page], per_page: 20).to_a
What Searchkick does behind the scenes
- One index per model
- Index aliases
- Default mappings
- Search across multiple indices
 
Inspect the mapping:
http GET :9200/posts_development
A search issues the following query:
cat | http GET :9200/documents_development,posts_development/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "dis_max": {
            "queries": [
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 10,
                  "operator": "and",
                  "analyzer": "searchkick_search",
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              },
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 10,
                  "operator": "and",
                  "analyzer": "searchkick_search2",
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              },
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 1,
                  "operator": "and",
                  "analyzer": "searchkick_search",
                  "fuzziness": 1,
                  "prefix_length": 0,
                  "max_expansions": 3,
                  "fuzzy_transpositions": true,
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              },
              {
                "multi_match": {
                  "query": "my search query",
                  "boost": 1,
                  "operator": "and",
                  "analyzer": "searchkick_search2",
                  "fuzziness": 1,
                  "prefix_length": 0,
                  "max_expansions": 3,
                  "fuzzy_transpositions": true,
                  "fields": [
                    "*.analyzed"
                  ],
                  "type": "best_fields"
                }
              }
            ]
          }
        }
      ]
    }
  },
  "timeout": "11s",
  "_source": false,
  "size": 10000,
  "from": 0
}
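The `dis_max` wrapper in this query means that a document is scored by its best-matching sub-query rather than by the sum of all of them; combined with the boosts, an exact match (boost 10) outranks a fuzzy one (boost 1). A rough model of that scoring, with made-up raw scores, the default `tie_breaker` of 0, and boost treated as a plain multiplier (a simplification):

```ruby
# Rough model of dis_max: the best sub-query score wins; the other matching
# sub-queries only contribute via tie_breaker (0.0 by default).
def dis_max_score(sub_scores, tie_breaker: 0.0)
  return 0.0 if sub_scores.empty?

  best = sub_scores.max
  best + tie_breaker * (sub_scores.sum - best)
end

# Made-up raw scores, multiplied by the boosts from the query above.
exact_match = 0.5 * 10
fuzzy_match = 0.5 * 1

dis_max_score([exact_match, fuzzy_match])  # exact and fuzzy sub-queries match
dis_max_score([fuzzy_match])               # only the fuzzy sub-query matches
```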
Aggregations
- Aggregations provide "statistics" about the search hits
- They used to be called "facets"
- Somewhat analogous to "GROUP BY" in SQL
- There is a seemingly endless list of supported aggregations:
  - Metrics
    - Avg
    - Weighted avg
    - Geo bounds
    - Max
    - Min
    - ...
  - Buckets
    - Terms
    - Date Histogram
    - IP Range
    - Significant Terms
- Metrics
 
Searchkick.search(query, models: [Post, Message], limit: 0, aggs: [:_type]).to_a
cat | http GET :9200/documents_development,posts_development/_search
{
  "query": { ... },
  "aggs": {
    "_type": {
      "terms": {
        "field": "_type",
        "size": 1000
      }
    }
  }
}
{
  "aggregations": {
    "_type": {
      "buckets": [
        {
          "doc_count": 476,
          "key": "document"
        },
        {
          "doc_count": 31,
          "key": "post"
        }
      ],
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0
    }
  },
  "hits": { ... }
}
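Conceptually, the `terms` aggregation above does what the following Ruby does with an in-memory list of hits: group by a field and count. This is only a sketch of the semantics with invented sample data, not of how Elasticsearch computes the buckets across shards.

```ruby
# Counts hits per distinct value of `field`, largest bucket first,
# mirroring the shape of the "buckets" array in the response.
def terms_aggregation(hits, field, size: 1000)
  hits.group_by { |hit| hit[field] }
      .map { |key, group| { 'key' => key, 'doc_count' => group.size } }
      .sort_by { |bucket| -bucket['doc_count'] }
      .first(size)
end

hits = [
  { '_type' => 'post' },
  { '_type' => 'document' },
  { '_type' => 'document' }
]

buckets = terms_aggregation(hits, '_type')
buckets.each { |bucket| puts "#{bucket['key']}: #{bucket['doc_count']}" }
```

In SQL terms this corresponds to selecting the field and a count, grouped by that field and ordered by the count descending.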
Testing
Indexing is off by default in tests and is switched on via a tag for the specs that need it. In that case the index also has to be created once.
# spec/spec_helper.rb
require 'searchkick'
Searchkick.index_suffix = ENV['TEST_ENV_NUMBER'] # must happen before the models are loaded
# search_spec.rb
describe Search, search: true do # enable indexing
  # ...
  it 'only returns posts for a post query' do
    post = create(:post)
    create(:event)
    Post.search_index.refresh # force index to update
    Event.search_index.refresh # force index to update
    expect(described_class.new(query: '*', type: 'post').each.map(&:record)).to eq [post]
  end
  # ...
end