Customize tokenization of the MySQL FULLTEXT parser


The way MySQL's FULLTEXT tokenizer splits text into word tokens might not always be what you need. For example, it splits a word at period characters, so a term like "example.com" is tokenized as "example" and "com" rather than as one word.
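To get a feel for the effect, here is a rough Python sketch of that behavior. It is not MySQL's actual code: the regular expression only approximates the documented word rule (letters, digits, underscores and apostrophes), stopwords are ignored, and the function name is made up for this example. The default `min_len=4` mirrors the MyISAM default for `ft_min_word_len`.

```python
import re

# Approximation of the built-in parser's word rule: runs of letters, digits,
# underscores and apostrophes. Everything else (including ".") separates words.
# Assumption: this is an illustration only, not how MySQL is implemented.
WORD = re.compile(r"[\w']+")

def approximate_tokens(text, min_len=4):
    """Split text roughly like the built-in parser and drop short words."""
    return [word for word in WORD.findall(text) if len(word) >= min_len]

print(approximate_tokens("Deploy node.js with config.yml"))
# ['Deploy', 'node', 'with', 'config']
```

In this sketch "node.js" never reaches the index as one word, and the fragments "js" and "yml" fall below the minimum length and disappear entirely.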

Since the tokenizer has near-zero configuration options (minimum word length and stopwords list), you need to hack it. There are three options available.

Option 1: If you like pain

Write a Full-Text parser plugin in C.

Option 2: Make the problem go away

Normalize the text before it is stored in your FULLTEXT column. E.g. if you don't want MySQL to split at period characters, just strip out period characters.
If you apply the same kind of normalization to your queries of that column, you have won.
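A minimal sketch of this idea in Python, assuming you only care about periods. The function name is made up for this example, and how you wire it into your model and query code depends on your application.

```python
def normalize_for_fulltext(text):
    """Strip periods so the tokenizer has nothing to split on."""
    return text.replace(".", "")

# Apply it at write time, to the value stored in the FULLTEXT column ...
stored = normalize_for_fulltext("Deploy node.js with config.yml")

# ... and at read time, to the search terms before they go into the query.
query = normalize_for_fulltext("node.js")

print(stored)  # Deploy nodejs with configyml
print(query)   # nodejs
```

Since the normalized text is no longer fit for display, you will probably want to keep the original text in a separate column and use the normalized copy for searching only.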

Option 3: Switch to a proper search engine

Indexers like Solr give you near-unlimited ways to configure search and tokenization. Note that Solr is sort of painful to deploy and operate, so probably consult with your ops team first.

Keywords
tokenizer, token, tokenize
License
Source code in this card is licensed under the MIT License.
Posted by Henning Koch to makandra dev (2013-01-14 14:34)