Re: Free text search in MySQL

From: Gordon Burditt <gordonb.vgcnj_at_burditt.org>
Date: Thu, 31 Mar 2016 03:02:02 -0500
Message-ID: <CoSdnXOGOOLnRGHLnZ2dnUU7-W3NnZ2d_at_posted.internetamerica>


>> Hi,
>> If I have the text "A rat in a hat" inside a MySQL DB, and if on my webpage
>> someone searches for the text "rat", I would like them to get that line -
>> how do I do that? I realise that this sort of *precludes* my previous
>> question about fuzzy matching, but I would still like to know....
>
> SELECT column FROM table WHERE text_key_column LIKE '%rat%';

Beware that this also matches words or phrases like 'Democrat', 'movie rating', and 'proration'. If you put spaces around the word in the query, you will fail to find mentions of that word that begin or end the entry, or where the word has punctuation instead of a trailing space.

Some forms of regular expressions have a better facility for matching "word boundaries" for some reasonable definition of the term, useful if your database provides that.

It usually doesn't make a lot of sense to search *every* column in a database. Do you really want to search people's names, addresses, license plate numbers, hat size, ring size, account number, VIN number, and the names of companies they've paid with their credit card? What does the result mean?

It might make sense to prepare an index you can search before the query. It helps if your database supports triggers.

Suppose you have a table of postings. Each posting has a Subject: and a Text Body (in separate columns). There is also a primary key for the article, which we'll assume is just a number. Every time an article is inserted, this processing is done:

	Take the contents of the Subject and Text Body, split it into
	words, remove the common words (who needs to search for all
	articles containing "the", "and" and "but"?), and add
	records to the word-search table:
		the primary key of the article involved,
		the word (you insert one record for each distinct word),
		the count of number of times that word was used.

If an article is deleted, also delete all the records in the word-search table with the article ID of the article involved.

Now, with a 3-table self-join, you can find all aricles containing 3 specific words chosen by the user. Using the counts of how many times the words were used, you can come up with a ranking.

Doing phrases is harder and needs more storage, as you have to record the position of the words in the post.

This approach can chew up quite a bit of storage compared with just storing the articles. Received on Thu Mar 31 2016 - 10:02:02 CEST

Original text of this message