Open Source Software Technical Articles

Want the Best of the Wazi Blogs Delivered Directly to your Inbox?

Subscribe to Wazi by Email

Your email:

Connect with Us!

Current Articles | RSS Feed RSS Feed

Seven expert tools for advanced string search in PostgreSQL


You can improve PostgreSQL text search performance by executing searches directly in the database interface layer. Advanced text search mechanisms include SELECT LIKE queries, queries with regular expressions, the Levenshtein algorithm, Trigram, TSvector and TSquery, and Metaphone.

To get started, first make sure you have installed the latest stable version of PostgreSQL on your CentOS server. Pick the correct RPM package for your architecture from the download page and run the necessary commands to install PostgreSQL:

rpm -ivH pgdg-centos93-9.3-1.noarch.rpm
yum install postgresql93-devel postgresql93-server postgresql93-contrib

Next, initialize the PostgreSQL cluster, start the server, enter the command-line interface, and create a new database:

service postgresql-9.3 initdb
Initializing database:                                     [  OK  ]
/etc/init.d/postgresql-9.3 start
Starting postgresql-9.3 service:                           [  OK  ]
su postgres
bash-4.1$ psql
postgres=# create database vehicles;

Connect to the new database, set up tables, and add some sample data for a test case:

postgres=# \c vehicles;
You are now connected to database "vehicles" as user "postgres".
vehicles=# INSERT INTO cars(name, description) VALUES ('BMW X5', 'Luxury German SUV with front-engine and four-wheel-drive'), ('Mercedes SLR', 'Luxury German grand tourer with front-engine and rear-wheel-drive'), ('Cadillac Escalade', 'Luxury USA SUV with front-engine and rear-wheel-drive or four-wheel-drive'),('Volkswagen Phaeton', 'Luxury German sedan with front-engine and four-wheel drive'), ('Ford Probe', 'USA sports coupe with front-engine and rear-wheel drive'), ('Honda City', 'Japanese compact car with front-engine and front-wheel drive'), ('KIA Sportage', 'South-Korean crossover with front-engine and rear-wheel-drive or four-wheel-drive'),('Skoda Yeti', 'Czech compact SUV with front-engine and front-wheel-drive or four-wheel-drive');
vehicles=# CREATE TABLE engines_manufacturers(id SERIAL PRIMARY KEY NOT NULL, manufacturer_name TEXT NOT NULL);
vehicles=# INSERT INTO engines_manufacturers(manufacturer_name) VALUES ('BMW'),('McLaren'),('GM'),('Mercedes Benz'),('Ford Motor Company'),('VW'),('Hyundai-Kia'),('Peugeot Citroen Moteurs'),('Honda Motor Company');
vehicles=# CREATE TABLE cars_engines(cars_id INT REFERENCES cars NOT NULL, engines_id INT REFERENCES engines_manufacturers NOT NULL, UNIQUE(cars_id,engines_id));
vehicles=# INSERT INTO cars_engines(cars_id,engines_id) VALUES(1,1),(2,2),(3,3),(4,6),(5,5),(6,9),(7,7),(8,6);

Standard string search methods

PostgreSQL comes with two built-in mechanisms for string searches. Let's start with the one based on the standard SQL LIKE query structure. LIKE and ILIKE (case-insensitive) syntax can include the wild-card characters % and _ before and after the searched string. The percent sign replaces any given sequence of characters, while the underscore replaces a single character. For instance, this query lists records that include "luxury" and "suv" in any case in their description fields:

vehicles=# SELECT name, description FROM cars WHERE description ILIKE '%luxury%suv%';
       name        |                                description
 BMW X5            | Luxury German SUV with front-engine and four-wheel-drive
 Cadillac Escalade | Luxury USA SUV with front-engine and rear-wheel-drive or four-wheel-drive
(2 rows)

You can also search for strings in text based on POSIX-style regular expressions. The ~ match operator must precede a regular expression. You can use the optional ! operator to inverse the logic (not match) and the * operator for case-insensitive search. The following query list all non-German cars from our database built by VW:

vehicles=# SELECT as Cars, engines_manufacturers.manufacturer_name as Manufacturer FROM cars INNER JOIN cars_engines ON = cars_engines.cars_id INNER JOIN engines_manufacturers ON cars_engines.engines_id = WHERE engines_manufacturers.manufacturer_name ILIKE 'vw' AND cars.description !~* '.*german.*';
    cars    | manufacturer
 Skoda Yeti | VW
(1 row)

Fuzzy string matching

In addition to the standard solutions, PostgreSQL has several contributed packages that allow you to perform more complicated text searches. You'll find them in the PostgreSQL contrib package (in our case postgresql93-contrib), which we installed earlier.

The Levenshtein algorithm calculates the distance between two strings based on the total number of steps that should be completed to change one of the strings into the other. To determine the number of steps, it adds one point for each new character added, existing character removed, and replaced character.

Install and verify the fuzzystrmatch extension, which implements the Levenshtein algorithm, by running the command CREATE EXTENSION fuzzystrmatch;. You can then use the function to see the closest car name to a particular string – for example, "Sportage." Notice that even the difference in case adds one point to the calculation:

vehicles=# SELECT id, name FROM cars WHERE levenshtein(name, 'sportage') <=5;
 id |     name
  7 | KIA Sportage
(1 row)
vehicles=# SELECT id, name FROM cars WHERE levenshtein(name, 'sportage') <=4;
 id | name
(0 rows)
vehicles=# SELECT id, name FROM cars WHERE levenshtein(lower(name), lower('sportage')) <=4;
 id |     name
  7 | KIA Sportage
(1 row)


Another tool for string searches, Trigram, splits the chosen string into substrings of three consecutive characters. The query result lists the string with most matching trigrams. That makes it especially useful if you enter a string in your query that may have slight misspellings. According to the PostgreSQL documentation each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string.

Install Trigram by executing the command CREATE EXTENSION pg_trgm; then query on the string "Ford":

vehicles=# select show_trgm('Ford');
 {"  f"," fo",for,ord,"rd "}
(1 row)

You can directly use the specific Trigram SELECT FROM WHERE % construction, but when you have a really large table with many records, you should optimize the searching process by creating a special index on the text column you plan to search before running the actual SELECT query. The Generalized Index Search Tree (GIST) provides fast text searches by converting the text into a vector of trigrams:

vehicles=# CREATE INDEX cars_name_trigram ON cars USING gist(name gist_trgm_ops);

You can then look for a car model in your table even without spelling it correctly. Trigram's default similarity threshold is 0.3. You can change it to anything between 0 and 1; the smaller it gets, the more spelling error-tolerant it becomes.

vehicles=# select show_limit();
(1 row)
vehicles=# SELECT * FROM cars WHERE name % 'Folkswagen';
 id |        name        |                        description
  4 | Volkswagen Phaeton | Luxury German sedan with front-engine and four-wheel drive
(1 row)
vehicles=# SELECT * FROM cars WHERE name % 'Folkswagon';
 id | name | description
(0 rows)
vehicles=# select set_limit(0.2);
(1 row)
vehicles=# SELECT * FROM cars WHERE name % 'Folkswagon';
 id |        name        |                        description
  4 | Volkswagen Phaeton | Luxury German sedan with front-engine and four-wheel drive
(1 row)

TSvector and TSquery

TSvector and TSquery allow full-text searches based on several words from an entire sentence. TSvector splits the full text against which you want to run the query into an array of tokens (words) called lexemes, which are stored in TSvector along with their position in the text. TSquery performs the actual query. You can specify the language whose dictionary should be used and the words that should be matched. The words are united with the combining operator &. The vector and the query interact via the full-text query operator @@. For example, you can see the luxury SUVs from your database with a query such as:

vehicles=# SELECT name, description FROM cars WHERE to_tsvector(description) @@ to_tsquery('english', 'luxury & SUVs');
       name        |                                description
 BMW X5            | Luxury German SUV with front-engine and four-wheel-drive
 Cadillac Escalade | Luxury USA SUV with front-engine and rear-wheel-drive or four-wheel-drive
(2 rows)

As you can see, PostgreSQL returns the correct result even if one of the searched words is written in a plural form recognized as such by the dictionary used.

You can run the queries for text searches in different languages. PostgreSQL bundles more than a dozen text search dictionaries:

vehicles=# \dFd
                             List of text search dictionaries
   Schema   |      Name       |                        Description
 pg_catalog | danish_stem     | snowball stemmer for danish language
 pg_catalog | dutch_stem      | snowball stemmer for dutch language
 pg_catalog | english_stem    | snowball stemmer for english language
 pg_catalog | finnish_stem    | snowball stemmer for finnish language
 pg_catalog | french_stem     | snowball stemmer for french language
 pg_catalog | german_stem     | snowball stemmer for german language
 pg_catalog | hungarian_stem  | snowball stemmer for hungarian language
 pg_catalog | italian_stem    | snowball stemmer for italian language
 pg_catalog | norwegian_stem  | snowball stemmer for norwegian language
 pg_catalog | portuguese_stem | snowball stemmer for portuguese language
 pg_catalog | romanian_stem   | snowball stemmer for romanian language
 pg_catalog | russian_stem    | snowball stemmer for russian language
 pg_catalog | simple          | simple dictionary: just lower case and check for stopword
 pg_catalog | spanish_stem    | snowball stemmer for spanish language
 pg_catalog | swedish_stem    | snowball stemmer for swedish language
 pg_catalog | turkish_stem    | snowball stemmer for turkish language
(16 rows)

Here again, full-text search with really large tables can be quite slow. You can use the Generalized Inverted Index (GIN) to significantly increase the searching speed and improve the performance. It needs more time to be generated compared to GIST, but it runs faster. GIN indexes are preferred when you work with mostly static data, since queries will be completed much faster. GIST indexes are updated much faster when new data is inserted in the corresponding table, so they are more suitable for dynamic data.

vehicles=# CREATE INDEX cars_vector_search ON cars USING gist(to_tsvector('english', description));                          CREATE INDEX


Finally, the Metaphone algorithm, which is installed with the fuzzystrmatch extension, matches the searched string by the way it sounds. Metaphone allows more liberty than the other text searching tools described above when it comes to the way a searched word is spelled. It is the most spelling error-tolerant tool listed in the article.

vehicles=# SELECT metaphone('Honda City', 5);
(1 row)

A similar function, dmetaphone (short for double metaphone) generates two output strings based on the way a source string sounds, from the results of the dmetaphone() and dmetaphone_alt() functions. Usually they return same results, but when it comes to non-English names the results might be different, depending on the pronunciation. Using dmetaphone instead of metaphone gives you better chance to match the misspelled word with the original one stored in your database.

You can run several sample queries to become acquainted with the functions:

vehicles=# select manufacturer_name, metaphone(manufacturer_name,11), dmetaphone(manufacturer_name), dmetaphone_alt(manufacturer_name) from engines_manufacturers where;
    manufacturer_name     |  metaphone  | dmetaphone | dmetaphone_alt
 Peugeot Citroen Moteurs | PJTSTRNMTRS | PJTS       | PKTS
(1 row)
vehicles=# SELECT manufacturer_name FROM engines_manufacturers WHERE metaphone(engines_manufacturers.manufacturer_name, 2) = metaphone('Pejo', 2);
 Peugeot Citroen Moteurs
(1 row)

With a word like Peugout, in this example, the pronunciation and the spelling differ. In this case dmetaphone gives you two options instead of one with metaphone. In the second query, if you lower the max_output_length value, you have a better chance to match the manufacturer name you are looking for.

Next, you can search for the car models of a manufacturer which name is misspelled:

SELECT as Cars, engines_manufacturers.manufacturer_name as Manufacturer FROM cars INNER JOIN cars_engines ON = cars_engines.cars_id INNER JOIN engines_manufacturers ON cars_engines.engines_id = WHERE metaphone(engines_manufacturers.manufacturer_name, 6) = metaphone('FW', 6);
        cars        | manufacturer
 Volkswagen Phaeton | VW
 Skoda Yeti         | VW
(2 rows)

To decide which algorithm to use, carefully analyze the data in your database records and the requirements for your project defined during the design stage. Sometimes one solutions works better and faster that the other. Or you can combine several solutions in a single query in order to get the best result.

The techniques described in this article allow application developers to save programming time and improve performance by completing text searches directly on the database level. Using the PostgreSQL syntax makes code easier for other IT specialists to understand.

Do you want to receive a compilation of Wazi's top
blog posts in the past year delivered directly to your inbox?

This work is licensed under a Creative Commons Attribution 3.0 Unported License
Creative Commons License.


At times Discount Designer Handbags are much less high-priced since they may be purchased wholesale. Naturally, if a organization purchases a Replica Handbags UK bigger quantity they save dollars and may pass a portion of those savings along to their buyers, resulting in a reduce value for you. A lot of Cheap Replicas are manufactured in China where the expenses to make them are decrease as a Cheap Replica Handbags consequence of their reduce wages even though the price to import them needs to be factored in. 
Posted @ Monday, July 21, 2014 1:33 AM by kenny
Pampering all by yourself utilizing extravagance is certainly the way to truly feel hard, as well as compensation all by yourself using an product or service to your chanel replica bags, the fact that pleases, together with inspires one other 7-day period for labor, together with endless effort and hard work. A bit of indulgent purchase cure should not get guilt ridden, or simply a major money judgement, utilizing price reduction fashionable scents with My best Scents On line. Soon there will be pampering all by yourself rather launches endorphins towards an individual's replica burberry bags. Endorphins widely-used to raise away your entire body as well as get you to appearance, together with truly feel, more effective. They can be like as a all natural together with suitable huge. Regretably, most of overly most people are attempting save you rolex replica watches nearly you can easliy and tend to be start to refrain from pampering us in anyway. Truthfully the fact that a little pampering, which include thru choosing chanel replica on line, tend to make everyone more joyful together with far healthier. However , so why happen to be people cutting short a huge amount of. No matter where everyone choose or simply in your city, it's hard to refrain from replica watches for sale: one can find major difficulties with any North american market at this moment.
Posted @ Wednesday, August 06, 2014 3:48 AM by fake chanel
Post Comment
Website (optional)

Allowed tags: <a> link, <b> bold, <i> italics