Open Source Software Technical Articles

Want the Best of the Wazi Blogs Delivered Directly to your Inbox?

Subscribe to Wazi by Email

Your email:

Connect with Us!

Current Articles | RSS Feed RSS Feed

Seven expert tools for advanced string search in PostgreSQL

  
  
  

You can improve PostgreSQL text search performance by executing searches directly in the database interface layer. Advanced text search mechanisms include SELECT LIKE queries, queries with regular expressions, the Levenshtein algorithm, Trigram, TSvector and TSquery, and Metaphone.

To get started, first make sure you have installed the latest stable version of PostgreSQL on your CentOS server. Pick the correct RPM package for your architecture from the download page and run the necessary commands to install PostgreSQL:

wget http://yum.postgresql.org/9.3/redhat/rhel-6-i386/pgdg-centos93-9.3-1.noarch.rpm
rpm -ivH pgdg-centos93-9.3-1.noarch.rpm
yum install postgresql93-devel postgresql93-server postgresql93-contrib

Next, initialize the PostgreSQL cluster, start the server, enter the command-line interface, and create a new database:

service postgresql-9.3 initdb
Initializing database:                                     [  OK  ]
/etc/init.d/postgresql-9.3 start
Starting postgresql-9.3 service:                           [  OK  ]
su postgres
bash-4.1$ psql
postgres=# create database vehicles;
CREATE DATABASE

Connect to the new database, set up tables, and add some sample data for a test case:

postgres=# \c vehicles;
You are now connected to database "vehicles" as user "postgres".
vehicles=# CREATE TABLE cars(id SERIAL PRIMARY KEY NOT NULL, name TEXT NOT NULL, description TEXT NOT NULL);
CREATE TABLE
vehicles=# INSERT INTO cars(name, description) VALUES ('BMW X5', 'Luxury German SUV with front-engine and four-wheel-drive'), ('Mercedes SLR', 'Luxury German grand tourer with front-engine and rear-wheel-drive'), ('Cadillac Escalade', 'Luxury USA SUV with front-engine and rear-wheel-drive or four-wheel-drive'),('Volkswagen Phaeton', 'Luxury German sedan with front-engine and four-wheel drive'), ('Ford Probe', 'USA sports coupe with front-engine and rear-wheel drive'), ('Honda City', 'Japanese compact car with front-engine and front-wheel drive'), ('KIA Sportage', 'South-Korean crossover with front-engine and rear-wheel-drive or four-wheel-drive'),('Skoda Yeti', 'Czech compact SUV with front-engine and front-wheel-drive or four-wheel-drive');
INSERT 0 8
vehicles=# CREATE TABLE engines_manufacturers(id SERIAL PRIMARY KEY NOT NULL, manufacturer_name TEXT NOT NULL);
CREATE TABLE
vehicles=# INSERT INTO engines_manufacturers(manufacturer_name) VALUES ('BMW'),('McLaren'),('GM'),('Mercedes Benz'),('Ford Motor Company'),('VW'),('Hyundai-Kia'),('Peugeot Citroen Moteurs'),('Honda Motor Company');
INSERT 0 9
vehicles=# CREATE TABLE cars_engines(cars_id INT REFERENCES cars NOT NULL, engines_id INT REFERENCES engines_manufacturers NOT NULL, UNIQUE(cars_id,engines_id));
CREATE TABLE
vehicles=# INSERT INTO cars_engines(cars_id,engines_id) VALUES(1,1),(2,2),(3,3),(4,6),(5,5),(6,9),(7,7),(8,6);
INSERT 0 8

Standard string search methods

PostgreSQL comes with two built-in mechanisms for string searches. Let's start with the one based on the standard SQL LIKE query structure. LIKE and ILIKE (case-insensitive) syntax can include the wild-card characters % and _ before and after the searched string. The percent sign replaces any given sequence of characters, while the underscore replaces a single character. For instance, this query lists records that include "luxury" and "suv" in any case in their description fields:

vehicles=# SELECT name, description FROM cars WHERE description ILIKE '%luxury%suv%';
       name        |                                description
-------------------+---------------------------------------------------------------------------
 BMW X5            | Luxury German SUV with front-engine and four-wheel-drive
 Cadillac Escalade | Luxury USA SUV with front-engine and rear-wheel-drive or four-wheel-drive
(2 rows)

You can also search for strings in text based on POSIX-style regular expressions. The ~ match operator must precede a regular expression. You can use the optional ! operator to inverse the logic (not match) and the * operator for case-insensitive search. The following query list all non-German cars from our database built by VW:

vehicles=# SELECT cars.name as Cars, engines_manufacturers.manufacturer_name as Manufacturer FROM cars INNER JOIN cars_engines ON cars.id = cars_engines.cars_id INNER JOIN engines_manufacturers ON cars_engines.engines_id = engines_manufacturers.id WHERE engines_manufacturers.manufacturer_name ILIKE 'vw' AND cars.description !~* '.*german.*';
    cars    | manufacturer
------------+--------------
 Skoda Yeti | VW
(1 row)

Fuzzy string matching

In addition to the standard solutions, PostgreSQL has several contributed packages that allow you to perform more complicated text searches. You'll find them in the PostgreSQL contrib package (in our case postgresql93-contrib), which we installed earlier.

The Levenshtein algorithm calculates the distance between two strings based on the total number of steps that should be completed to change one of the strings into the other. To determine the number of steps, it adds one point for each new character added, existing character removed, and replaced character.

Install and verify the fuzzystrmatch extension, which implements the Levenshtein algorithm, by running the command CREATE EXTENSION fuzzystrmatch;. You can then use the function to see the closest car name to a particular string – for example, "Sportage." Notice that even the difference in case adds one point to the calculation:

vehicles=# SELECT id, name FROM cars WHERE levenshtein(name, 'sportage') <=5;
 id |     name
----+--------------
  7 | KIA Sportage
(1 row)
vehicles=# SELECT id, name FROM cars WHERE levenshtein(name, 'sportage') <=4;
 id | name
----+------
(0 rows)
vehicles=# SELECT id, name FROM cars WHERE levenshtein(lower(name), lower('sportage')) <=4;
 id |     name
----+--------------
  7 | KIA Sportage
(1 row)

Trigram

Another tool for string searches, Trigram, splits the chosen string into substrings of three consecutive characters. The query result lists the string with most matching trigrams. That makes it especially useful if you enter a string in your query that may have slight misspellings. According to the PostgreSQL documentation each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string.

Install Trigram by executing the command CREATE EXTENSION pg_trgm; then query on the string "Ford":

vehicles=# select show_trgm('Ford');
          show_trgm
-----------------------------
 {"  f"," fo",for,ord,"rd "}
(1 row)

You can directly use the specific Trigram SELECT FROM WHERE % construction, but when you have a really large table with many records, you should optimize the searching process by creating a special index on the text column you plan to search before running the actual SELECT query. The Generalized Index Search Tree (GIST) provides fast text searches by converting the text into a vector of trigrams:

vehicles=# CREATE INDEX cars_name_trigram ON cars USING gist(name gist_trgm_ops);
CREATE INDEX

You can then look for a car model in your table even without spelling it correctly. Trigram's default similarity threshold is 0.3. You can change it to anything between 0 and 1; the smaller it gets, the more spelling error-tolerant it becomes.

vehicles=# select show_limit();
 show_limit
------------
        0.3
(1 row)
vehicles=# SELECT * FROM cars WHERE name % 'Folkswagen';
 id |        name        |                        description
----+--------------------+------------------------------------------------------------
  4 | Volkswagen Phaeton | Luxury German sedan with front-engine and four-wheel drive
(1 row)
vehicles=# SELECT * FROM cars WHERE name % 'Folkswagon';
 id | name | description
----+------+-------------
(0 rows)
vehicles=# select set_limit(0.2);
 set_limit
-----------
       0.2
(1 row)
vehicles=# SELECT * FROM cars WHERE name % 'Folkswagon';
 id |        name        |                        description
----+--------------------+------------------------------------------------------------
  4 | Volkswagen Phaeton | Luxury German sedan with front-engine and four-wheel drive
(1 row)

TSvector and TSquery

TSvector and TSquery allow full-text searches based on several words from an entire sentence. TSvector splits the full text against which you want to run the query into an array of tokens (words) called lexemes, which are stored in TSvector along with their position in the text. TSquery performs the actual query. You can specify the language whose dictionary should be used and the words that should be matched. The words are united with the combining operator &. The vector and the query interact via the full-text query operator @@. For example, you can see the luxury SUVs from your database with a query such as:

vehicles=# SELECT name, description FROM cars WHERE to_tsvector(description) @@ to_tsquery('english', 'luxury & SUVs');
       name        |                                description
-------------------+---------------------------------------------------------------------------
 BMW X5            | Luxury German SUV with front-engine and four-wheel-drive
 Cadillac Escalade | Luxury USA SUV with front-engine and rear-wheel-drive or four-wheel-drive
(2 rows)

As you can see, PostgreSQL returns the correct result even if one of the searched words is written in a plural form recognized as such by the dictionary used.

You can run the queries for text searches in different languages. PostgreSQL bundles more than a dozen text search dictionaries:

vehicles=# \dFd
                             List of text search dictionaries
   Schema   |      Name       |                        Description
------------+-----------------+-----------------------------------------------------------
 pg_catalog | danish_stem     | snowball stemmer for danish language
 pg_catalog | dutch_stem      | snowball stemmer for dutch language
 pg_catalog | english_stem    | snowball stemmer for english language
 pg_catalog | finnish_stem    | snowball stemmer for finnish language
 pg_catalog | french_stem     | snowball stemmer for french language
 pg_catalog | german_stem     | snowball stemmer for german language
 pg_catalog | hungarian_stem  | snowball stemmer for hungarian language
 pg_catalog | italian_stem    | snowball stemmer for italian language
 pg_catalog | norwegian_stem  | snowball stemmer for norwegian language
 pg_catalog | portuguese_stem | snowball stemmer for portuguese language
 pg_catalog | romanian_stem   | snowball stemmer for romanian language
 pg_catalog | russian_stem    | snowball stemmer for russian language
 pg_catalog | simple          | simple dictionary: just lower case and check for stopword
 pg_catalog | spanish_stem    | snowball stemmer for spanish language
 pg_catalog | swedish_stem    | snowball stemmer for swedish language
 pg_catalog | turkish_stem    | snowball stemmer for turkish language
(16 rows)

Here again, full-text search with really large tables can be quite slow. You can use the Generalized Inverted Index (GIN) to significantly increase the searching speed and improve the performance. It needs more time to be generated compared to GIST, but it runs faster. GIN indexes are preferred when you work with mostly static data, since queries will be completed much faster. GIST indexes are updated much faster when new data is inserted in the corresponding table, so they are more suitable for dynamic data.

vehicles=# CREATE INDEX cars_vector_search ON cars USING gist(to_tsvector('english', description));                          CREATE INDEX

Metaphone

Finally, the Metaphone algorithm, which is installed with the fuzzystrmatch extension, matches the searched string by the way it sounds. Metaphone allows more liberty than the other text searching tools described above when it comes to the way a searched word is spelled. It is the most spelling error-tolerant tool listed in the article.

vehicles=# SELECT metaphone('Honda City', 5);
 metaphone
-----------
 HNTST
(1 row)

A similar function, dmetaphone (short for double metaphone) generates two output strings based on the way a source string sounds, from the results of the dmetaphone() and dmetaphone_alt() functions. Usually they return same results, but when it comes to non-English names the results might be different, depending on the pronunciation. Using dmetaphone instead of metaphone gives you better chance to match the misspelled word with the original one stored in your database.

You can run several sample queries to become acquainted with the functions:

vehicles=# select manufacturer_name, metaphone(manufacturer_name,11), dmetaphone(manufacturer_name), dmetaphone_alt(manufacturer_name) from engines_manufacturers where engines_manufacturers.id=8;
    manufacturer_name     |  metaphone  | dmetaphone | dmetaphone_alt
-------------------------+-------------+------------+----------------
 Peugeot Citroen Moteurs | PJTSTRNMTRS | PJTS       | PKTS
(1 row)
vehicles=# SELECT manufacturer_name FROM engines_manufacturers WHERE metaphone(engines_manufacturers.manufacturer_name, 2) = metaphone('Pejo', 2);
    manufacturer_name
-------------------------
 Peugeot Citroen Moteurs
(1 row)

With a word like Peugout, in this example, the pronunciation and the spelling differ. In this case dmetaphone gives you two options instead of one with metaphone. In the second query, if you lower the max_output_length value, you have a better chance to match the manufacturer name you are looking for.

Next, you can search for the car models of a manufacturer which name is misspelled:

SELECT cars.name as Cars, engines_manufacturers.manufacturer_name as Manufacturer FROM cars INNER JOIN cars_engines ON cars.id = cars_engines.cars_id INNER JOIN engines_manufacturers ON cars_engines.engines_id = engines_manufacturers.id WHERE metaphone(engines_manufacturers.manufacturer_name, 6) = metaphone('FW', 6);
        cars        | manufacturer
--------------------+--------------
 Volkswagen Phaeton | VW
 Skoda Yeti         | VW
(2 rows)

To decide which algorithm to use, carefully analyze the data in your database records and the requirements for your project defined during the design stage. Sometimes one solutions works better and faster that the other. Or you can combine several solutions in a single query in order to get the best result.

The techniques described in this article allow application developers to save programming time and improve performance by completing text searches directly on the database level. Using the PostgreSQL syntax makes code easier for other IT specialists to understand.


Do you want to receive a compilation of Wazi's top
blog posts in the past year delivered directly to your inbox?






This work is licensed under a Creative Commons Attribution 3.0 Unported License
Creative Commons License.

Comments

 
 
At times Discount Designer Handbags are much less high-priced since they may be purchased wholesale. Naturally, if a organization purchases a Replica Handbags UK bigger quantity they save dollars and may pass a portion of those savings along to their buyers, resulting in a reduce value for you. A lot of Cheap Replicas are manufactured in China where the expenses to make them are decrease as a Cheap Replica Handbags consequence of their reduce wages even though the price to import them needs to be factored in. 
 
Posted @ Monday, July 21, 2014 1:33 AM by kenny
Pampering all by yourself utilizing extravagance is certainly the way to truly feel hard, as well as compensation all by yourself using an product or service to your chanel replica bags, the fact that pleases, together with inspires one other 7-day period for labor, together with endless effort and hard work. A bit of indulgent purchase cure should not get guilt ridden, or simply a major money judgement, utilizing price reduction fashionable scents with My best Scents On line. Soon there will be pampering all by yourself rather launches endorphins towards an individual's replica burberry bags. Endorphins widely-used to raise away your entire body as well as get you to appearance, together with truly feel, more effective. They can be like as a all natural together with suitable huge. Regretably, most of overly most people are attempting save you rolex replica watches nearly you can easliy and tend to be start to refrain from pampering us in anyway. Truthfully the fact that a little pampering, which include thru choosing chanel replica on line, tend to make everyone more joyful together with far healthier. However , so why happen to be people cutting short a huge amount of. No matter where everyone choose or simply in your city, it's hard to refrain from replica watches for sale: one can find major difficulties with any North american market at this moment.
Posted @ Wednesday, August 06, 2014 3:48 AM by fake chanel
It's actually a widely authorized truth disposition with the help of best suited blend of fashion accessories will make a woman be different in your public. replica cartier Wholesale handbags are actually this sort gadget which may be long been used by a lot of women as eras. fake chanel coco But, they already have increased because of his or her's typical character from pouring only a click holder from gear whereas moving forward out from the residential home. hermes replica Presently, they've been thought of as a significant obligation, of which murmurs very much concerning disposition from a partner as symbolic from their experience from type. fake cartier At the same time, with the help of trendy purses caused concerning most of typically the racks from life not to mention web stores, a lot of women presently don't need to take things every day. 
fake gucci Believe it or not, he or she can nowadays prefer to get trendy wholesale handbags for the purpose of completely different moments.
Posted @ Sunday, September 28, 2014 1:09 AM by xxccc
Paris show Fleur De Jais series of another bag, count the years of Speedy, which one do you like best? Stars favorite graffiti series, Louis vuitton graffiti. Still cheap designer handbags the ribbon Speedy to mark the 100th anniversary of the limited edition. Contracted and Prada, compared to neutral, I prefer young and jump Miu Miu. Miuccia Prada in the design of the brand named after her nickname, also seems to be more casual and Ugg Boots New Arrivals not stick to one pattern. In the autumn wind restoring ancient ways through the fashion world, Miu Miu also should allow lively elf, introduced the fair maiden nostalgic faux suede handbags.
Posted @ Wednesday, October 08, 2014 1:21 AM by vuitton
Post Comment
Name
 *
Email
 *
Website (optional)
Comment
 *

Allowed tags: <a> link, <b> bold, <i> italics