Customize tokenization of the MySQL FULLTEXT parser

Henning Koch
January 14, 2013
Software engineer at makandra GmbH

The way MySQL's FULLTEXT tokenizer splits text into word tokens might not always be what you need. E.g. it splits a word at period characters.

Since the tokenizer has near-zero configuration options (only the minimum word length and the stopword list), you need to work around it. There are three options available.

Option 1: If you like pain

Write a Full-Text parser plugin in C.

Option 2: Make the problem go away

Normalize the text before it is stored in your FULLTEXT column. E.g. if you don't want MySQL to split at period characters, just strip out period characters.
If you apply the same kind of normalization to your queries of that column, you have won.
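
A minimal sketch of this idea in Python, assuming periods are the only characters you want to keep MySQL from splitting on. The function name `normalize_for_fulltext` and the sample strings are made up for illustration; the point is only that the stored text and the query go through the same function.

```python
def normalize_for_fulltext(text: str) -> str:
    """Strip period characters so the tokenizer has nothing to split on.

    Assumption: periods are the only problematic characters. Extend the
    replacement if other characters also need to be removed.
    """
    return text.replace(".", "")


# Normalize the text before writing it to the FULLTEXT column ...
stored = normalize_for_fulltext("Contact us at makandra.de")

# ... and normalize the search term the same way before querying.
query = normalize_for_fulltext("makandra.de")
```

Since normalization is lossy, you would probably keep the original text in a separate column for display and use the normalized copy only for searching.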

Option 3: Switch to a proper search engine

Indexers like Solr give you near-unlimited ways to configure search and tokenization. Note that Solr is sort of painful to deploy and operate, so consult with your ops team first.

Posted by Henning Koch to makandra dev (2013-01-14 15:34)