Posted about 9 years ago. Visible to the public.

Customize tokenization of the MySQL FULLTEXT parser

The way MySQL's FULLTEXT tokenizer splits text into word tokens might not always be what you need. E.g. it splits a word at period characters.

Since the tokenizer has near-zero configuration options (minimum word length and stopwords list), you need to hack it. There are three options available.

Option 1: If you like pain

Write a Full-Text parser plugin in C.

Option 2: Make the problem go away

Normalize the text before it is stored in your FULLTEXT column. E.g. if you don't want MySQL to split at period characters, just strip out period characters.
If you apply the same kind of normalization to your queries of that column, you have won.
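
As a rough illustration of option 2, here is a minimal Python sketch. It assumes a hypothetical table `articles` with a FULLTEXT-indexed column `search_text`; both names, and both function names, are made up for this example. The key point is that one normalization function is used for the stored text and for the search query alike.

```python
def normalize_for_fulltext(text):
    """Strip period characters so the FULLTEXT tokenizer cannot split on them."""
    return text.replace(".", "")


def build_search(query):
    """Return a parameterized SQL statement and its bind values.

    `articles` and `search_text` are hypothetical names; `%s` is the
    placeholder style used by common Python MySQL drivers.
    """
    sql = "SELECT id FROM articles WHERE MATCH(search_text) AGAINST (%s)"
    return sql, (normalize_for_fulltext(query),)


# Normalize once when writing to the FULLTEXT column ...
stored_text = normalize_for_fulltext("We deploy node.js apps")

# ... and again, with the same function, when searching.
sql, params = build_search("node.js")
```

With this, `node.js` is stored and searched as the single token `nodejs`. You will probably want to keep the original, unnormalized text in a separate column for display.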

Option 3: Switch to a proper search engine

Indexers like Solr give you near-unlimited ways to configure search and tokenization. Note that Solr is sort of painful to deploy and operate, so probably consult with your ops team first.

Owner of this card:

Henning Koch
Last edit:
about 9 years ago
Keywords:
tokenizer, token, tokenize
About this deck:
We are makandra and do test-driven, agile Ruby on Rails software development.
Posted by Henning Koch to makandra dev