Customize tokenization of the MySQL FULLTEXT parser

Henning Koch
January 14, 2013
Software engineer at makandra GmbH

The way MySQL's FULLTEXT tokenizer splits text into word tokens might not always be what you need. E.g. it splits a word at period characters.

Since the tokenizer has almost no configuration options (only the minimum word length and the stopword list), you need to work around it. There are three options available.

Option 1: If you like pain

Write a full-text parser plugin in C.

Option 2: Make the problem go away

Normalize the text before it is stored in your FULLTEXT column. E.g. if you don't want MySQL to split at period characters, just strip out period characters.
If you apply the same kind of normalization to your queries against that column, stored text and search terms will match again, and you have won.
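
A minimal sketch of this approach in Python, assuming you only want to get rid of period characters. The function names, the `pages` table and the `search_text` column are hypothetical and only serve as illustration; the `%s` placeholder style is the one used by common Python MySQL drivers.

```python
from typing import Tuple


def normalize_for_fulltext(text: str) -> str:
    """Strip period characters so MySQL has nothing to split on."""
    return text.replace(".", "")


def build_search_query(user_input: str) -> Tuple[str, Tuple[str]]:
    """Return a parameterized SQL statement and its bound values.

    `pages` and `search_text` are hypothetical names, not from the article.
    """
    sql = "SELECT id FROM pages WHERE MATCH (search_text) AGAINST (%s)"
    # Apply the SAME normalization to the query as to the stored text.
    return sql, (normalize_for_fulltext(user_input),)


# Normalize the text before storing it in the FULLTEXT column ...
stored = normalize_for_fulltext("Visit makandra.com today.")

# ... and normalize every query against that column in the same way.
sql, params = build_search_query("makandra.com")
```

Since the normalized text is no longer fit for display, you would probably keep the original text in a separate column and use the normalized copy for searching only.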

Option 3: Switch to a proper search engine

Indexers like Solr give you near-unlimited ways to configure search and tokenization. Note that Solr is somewhat painful to deploy and operate, so consult your ops team first.

Posted by Henning Koch to makandra dev (2013-01-14 15:34)