jueves, 14 de enero de 2010

Search in text fields with T-SQL: beyond LIKE

Versión en castellano

One of the first things you learn when begin with T-SQL is to use the LIKE clause to make comparisons in text fields (char, nchar, varchar, nvarchar or text). But this clause is quite limited and don’t have good performance. So, what can be used instead of LIKE?

A good option is to use CONTAINS(that to be honest, I discovered by accident: P), which allows us to refine more comparisons and offers much better performance.


To give an example of using CONTAINS I will use the PUBS database (Microsoft offers as an example with SQL Server and can be downloaded from here) and we will find records in the "jobs" table filtering by "job_desc" field.

To give you an idea (or if you don't want download example databases) the "jobs" table has the following records:


job_id job_desc min_lvl max_lvl
-------- -------------------------------------------------- ------- -------
1 New Hire - Job not specified 10 10
2 Chief Executive Officer 200 250
3 Business Operations Manager 175 225
4 Chief Financial Officier 175 250
5 Publisher 150 250
6 Managing Editor 140 225
7 Marketing Manager 120 200
8 Public Relations Manager 100 175
9 Acquisitions Manager 75 175
10 Productions Manager 75 165
11 Operations Manager 75 150
12 Editor 25 100
13 Sales Representative 25 100
14 Designer 25 100


Before using CONTAINS we must create a FULLTEXT index in the table, some like this:


CREATE FULLTEXT CATALOG ft AS DEFAULT;
CREATE FULLTEXT INDEX ON Jobs(job_desc)
KEY INDEX PK__jobs__173876EA
GO


PK__jobs__173876EA is the table’s PRIMARY KEY.


Once this is done we can start with examples:

Search exact word

Obtains all records containing the word 'manager' in "job_desc" field.
Note: is not case sensitive.


select * from jobs where CONTAINS( job_desc, 'manager' )

job_id job_desc min_lvl max_lvl
-------- -------------------------------------------------- ------- -------
3 Business Operations Manager 175 225
7 Marketing Manager 120 200
8 Public Relations Manager 100 175
9 Acquisitions Manager 75 175
10 Productions Manager 75 165
11 Operations Manager 75 150


Also you can search for phrases (more than 1 word) and separators are ignored.


select * from jobs where CONTAINS (job_desc, '"hire job"');

job_id job_desc min_lvl max_lvl
-------- -------------------------------------------------- ------- -------
1 New Hire - Job not specified 10 10


Search by approximation with several options

Finds the records containing any word beginning with 'pub' or 'edit'.


select * from jobs where CONTAINS(job_desc, '"pub*" OR "edit*"');

job_id job_desc min_lvl max_lvl
-------- -------------------------------------------------- ------- -------
5 Publisher 150 250
6 Managing Editor 140 225
8 Public Relations Manager 100 175
12 Editor 25 100


CONTAINS supports AND, AND NOT, OR.

Search a word near another one

Finds records that contain a word beginning with 'pub' and that word is near 'manager'.


select * from jobs where CONTAINS (job_desc, '"pub*" NEAR manager')

job_id job_desc min_lvl max_lvl
-------- -------------------------------------------------- ------- -------
8 Public Relations Manager 100 175


CONTAINS can also search for synonyms and derived words, but this requires using a thesaurus.

Happy codding ;)

Extra ball
If you find CONTAINS interested, see also FREETEXT and CONTAINSTABLE.

2 comentarios:

Guti dijo...

Sólo puntualizar que el motor de indexación de texto completo, no se instala por defecto.

Pablo Bouzada dijo...

Gracias por el apunte Guti.