Peter Gulutzan is a Software Architect at MySQL AB, and author of SQL Performance Tuning. He's also another Canadian member of the MySQL Documentation Team.
By Peter Gulutzan
The MySQL Reference Manual describes "Full-Text Search Functions" with lots of exciting details.
But we leave some things out because they're ephemeral. That is, they could change at any moment. Here, for people who really want to know what's going on with full-text searching as of May 2004 in version 4.1.2, is some information to supplement the manual:
When we see "two words" it's fairly obvious that "two" is a word" and "words" is another word -- we can tell because there's a space between. Generally, anything that's alphabetic or numeric (including accented characters) is part of a word, anything that's not alphabetic or numeric (even "-") is not part of a word and is therefore used as a separator. Two interesting exceptions are the underscore '_' and the apostrophe ', as this example shows:
mysql> INSERT INTO articles (title) mysql> VALUES ('John O''Hara''s nightmare'); Query OK, 1 row affected (0.00 sec) mysql> SELECT title FROM articles -> WHERE MATCH (title, body) AGAINST ('O''Hara''s'); +-------------------------+ | title | +-------------------------+ | John O'Hara's nightmare | +-------------------------+ 1 row in set (0.00 sec)
The full-text index is much like other indexes: a sorted list of "keys" which point to records in the data file. Each key has:
{ Word -- VARCHAR. a word within the text. Count -- LONG. how many times word occurs in text. } { Weight -- FLOAT. Our evaluation of the word's importance. Rowid -- a pointer to the row in the data file. }
This is the version 4.1 format. Earlier, it was a bit simpler.
MySQL's full-text creator Sergei Golubchik observes that IN BOOLEAN
MODE
can be "suitable for experienced users" but "formulating a proper
query can be very difficult". I don't find it so, in fact I find boolean mode
easier because it's closer to what other DBMSs have for full-text search, and
closer to the standard (SQL/MM) specification. But let me clear up three things
in the documentation.
1. The manual says:
"A leading plus sign indicates that this word must be present in every row returned."
That's not always true. Here's a complicated query:
mysql> SELECT * FROM articles WHERE MATCH (title,body) -> AGAINST ('(dbms -wombat) (+mysqld)' IN BOOLEAN MODE); +----+-------------------+-------------------------------------+ | id | title | body | +----+-------------------+-------------------------------------+ | 1 | MySQL Tutorial | DBMS stands for DataBase ... | | 4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... | | 11 | mysqld | wombat | +----+-------------------+-------------------------------------+ 3 rows in set (0.00 sec)
The word 'mysqld' isn't in every row, it's only in the rows that match the parenthesized expression that contains +mysqld.
2. The manual says:
"- A leading minus sign indicates that this word must not be present in any row returned."
Again, that's not always true, and here's another complex example:
mysql> SELECT * FROM articles WHERE MATCH (title,body) -> AGAINST ('(dbms -wombat) (mysqld)' IN BOOLEAN MODE); +----+-------------------+-------------------------------------+ | id | title | body | +----+-------------------+-------------------------------------+ | 1 | MySQL Tutorial | DBMS stands for DataBase ... | | 4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... | | 11 | mysqld | wombat | +----+-------------------+-------------------------------------+ 3 rows in set (0.00 sec)
Notice the wombat? This example makes it clear that a leading minus sign can merely mean AND NOT.
3. The manual says:
"The argument to AGAINST() must be a constant string."
This is nothing to worry about. We just haven't updated the manual yet to show that the AGAINST() argument can be a variable or parameter, if you use MySQL Version 5. For example:
mysql> CREATE PROCEDURE p4 (param1 VARCHAR(100)) -> SELECT * FROM articles -> WHERE MATCH (title, body) AGAINST (param1); Query OK, 0 rows affected (0.00 sec) mysql> CALL p4('database'); +----+-------------------+------------------------------------------+ | id | title | body | +----+-------------------+------------------------------------------+ | 5 | MySQL vs. YourSQL | In the following database comparison ... | | 1 | MySQL Tutorial | DBMS stands for DataBase ... | +----+-------------------+------------------------------------------+ 2 rows in set (0.00 sec) Query OK, -1 rows affected (0.00 sec)
Suppose you want to search for a text that contains "you want to".
SELECT ... FROM table1 WHERE MATCH(textcolumn) AGAINST ('"you want to"' IN BOOLEAN MODE)
This is awful, for two reasons:
So MySQL can't get the answer just by looking in the index. The method has to be: first try to filter out the candidate rows by looking for whatever words aren't stopwords, then examine each candidate row in the data file and make sure the words are there in the correct order. This means exact phrase searching can be a little slower than ordinary searching. On the other hand, MySQL won't find "Sam was tall" if you search for "Sam grew tall". (If "was" and "grew" are both stopwords, there's another DBMS that would find "Sam was tall". We think that looks like a bug.)
As well as the MySQL Reference Manual, many of the many books about MySQL have a section about full-text indexing and querying. Plus, here are some articles available free on the web:
Here are all the stopwords in the default English file, which (in the MySQL
Source download) is myisam/ft_static.c
.
a's, able, about, above, according, accordingly, across, actually, after, afterwards, again, against, ain't, all, allow, allows, almost, alone, along, already, also, although, always, am, among, amongst, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere, apart, appear, appreciate, appropriate, are, aren't, around, as, aside, ask, asking, associated, at, available, away, awfully, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, believe, below, beside, besides, best, better, between, beyond, both, brief, but, by, c'mon, c's, came, can, can't, cannot, cant, cause, causes, certain, certainly, changes, clearly, co, com, come, comes, concerning, consequently, consider, considering, contain, containing, contains, corresponding, could, couldn't, course, currently, definitely, described, despite, did, didn't, different, do, does, doesn't, doing, don't, done, down, downwards, during, each, edu, eg, eight, either, else, elsewhere, enough, entirely, especially, et, etc, even, ever, every, everybody, everyone, everything, everywhere, ex, exactly, example, except, far, few, fifth, first, five, followed, following, follows, for, former, formerly, forth, four, from, further, furthermore, get, gets, getting, given, gives, go, goes, going, gone, got, gotten, greetings, had, hadn't, happens, hardly, has, hasn't, have, haven't, having, he, he's, hello, help, hence, her, here, here's, hereafter, hereby, herein, hereupon, hers, herself, hi, him, himself, his, hither, hopefully, how, howbeit, however, i'd, i'll, i'm, i've, ie, if, ignored, immediate, in, inasmuch, inc, indeed, indicate, indicated, indicates, inner, insofar, instead, into, inward, is, isn't, it, it'd, it'll, it's, its, itself, just, keep, keeps, kept, know, knows, known, last, lately, later, latter, latterly, least, less, lest, let, let's, like, liked, likely, little, look, looking, looks, ltd, mainly, many, may, maybe, me, mean, meanwhile, merely, might, more, moreover, most, mostly, much, must, my, myself, name, namely, nd, near, nearly, necessary, need, needs, neither, never, nevertheless, new, next, nine, no, nobody, non, none, noone, nor, normally, not, nothing, novel, now, nowhere, obviously, of, off, often, oh, ok, okay, old, on, once, one, ones, only, onto, or, other, others, otherwise, ought, our, ours, ourselves, out, outside, over, overall, own, particular, particularly, per, perhaps, placed, please, plus, possible, presumably, probably, provides, que, quite, qv, rather, rd, re, really, reasonably, regarding, regardless, regards, relatively, respectively, right, said, same, saw, say, saying, says, second, secondly, see, seeing, seem, seemed, seeming, seems, seen, self, selves, sensible, sent, serious, seriously, seven, several, shall, she, should, shouldn't, since, six, so, some, somebody, somehow, someone, something, sometime, sometimes, somewhat, somewhere, soon, sorry, specified, specify, specifying, still, sub, such, sup, sure, t's, take, taken, tell, tends, th, than, thank, thanks, thanx, that, that's, thats, the, their, theirs, them, themselves, then, thence, there, there's, thereafter, thereby, therefore, therein, theres, thereupon, these, they, they'd, they'll, they're, they've, think, third, this, thorough, thoroughly, those, though, three, through, throughout, thru, thus, to, together, too, took, toward, towards, tried, tries, truly, try, trying, twice, two, un, under, unfortunately, unless, unlikely, until, unto, up, upon, us, use, used, useful, uses, using, usually, value, various, very, via, viz, vs, want, wants, was, wasn't, way, we, we'd, we'll, we're, we've, welcome, well, went, were, weren't, what, what's, whatever, when, whence, whenever, where, where's, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, who's, whoever, whole, whom, whose, why, will, willing, wish, with, within, without, won't, wonder, would, would, wouldn't, yes, yet, you, you'd, you'll, you're, you've, your, yours, yourself, yourselves, zero