Search beyond MySQL with Sphinx
MySQL is a powerful database, but its search capabilities are most efficient for searches based upon a single database index. If your application requires complex searching, consider adding Sphinxto the mix to gain searching speed. Sphinx's main selling point is its full-text search capability, but it can also shine with regular queries.
Our test data
To show how Sphinx can improve searching performance, I needed a database with at least a few million records. I turned to the World Cities data I used in my recent Apache and RESTful Web Services article. The "world" database schema holds data about a couple of hundred countries, their four thousand regions, and more than three million towns and cities. Countries have an ID (two letters) and a name; regions in a country have an ID (two letters) and a name; cities in a region have an ID (integer), a name, an accented version of the name with possibly foreign characters, population (if known, zero otherwise), and geographic coordinates (latitude, longitude). Searches in the country and region tables don't require much processing power, but joins involving cities can, so I used Sphinx mainly for such searches.
Sphinx (a sort of acronym for "SQL Phrase Index") is a full-text open source search engine that supports MySQL databases (and others) natively. Its development began in 2001, and its latest version, 2.0.6, was released last month. Sphinx uses its own tuned-for-speed index files, and can thus achieve much faster search speeds than MySQL alone. It is being used for databases with billions of documents, spanning terabytes of disk space; for scalability, you can set up a cluster of Sphinx machines.
Installation of Sphinx is simple. Binary versions are available for a few Linux distributions, but installation from source isn't that complicated, so we'll take that route. After downloading the source tarball from the Sphinx download page, extract its file, and run the usual
./configure, make, make install sequence, but specify the path for Sphinx's binaries; I opted for bin/sphinx under my home directory. Try
./configure --help to get a full list of available options.
> tar -zxf sphinx-2.0.6-release.tar.gz
> cd sphinx-2.0.6-release
> ./configure --prefix=/home/fkereki/bin/sphinx
> make install
After you install Sphinx you must configure it so it can access your MySQL database, create whatever index files you request, index your data, and run a daemon that actually performs searches.
You configure Sphinx by editing its etc/sphinx.conf file in the installation directory. The configuration file is divided into stanzas. For each index you want to create, you must provide a
source, which tells Sphinx how to procure the data it must index, its data types, and more, and an
index definition, which specifies how to index the data, where to store the index files, and so on:
type = mysql
sql_host = 192.168.1.200
sql_port = 3306
sql_user = fkereki_user
sql_pass = fkereki_pass
sql_db = world
sql_query = \
cityCode id, \
FROM cities \
JOIN regions USING (countryCode,regionCode) \
JOIN countries USING (countryCode)
sql_field_string = countryName
sql_field_string = regionName
sql_field_string = cityName
sql_attr_bigint = population
sql_attr_float = latitude
sql_attr_float = longitude
We want to index cities data. The first attribute tells Sphinx it will work with a MySQL database, and the five following ones (host, port, user, password, and schema name) tell it how to access it. The
sql_query definition tells Sphinx how to get the "documents" it will index; in our case, it will use the results of joining together countries, regions, and cities, using most of their fields. A "document ID," a unique unsigned positive integer number, must always be the first field, followed by as many as 32 full-text fields, and any number of other attributes. The last
sql_ lines define the data types of your document fields; see the table below for common allowed data types. You could also have
sql_query_post options, to specify SQL commands to be run before or after indexing; these could set up temporary tables, mark records to be indexed, update counters or totals, and clean up afterward.
Sphinx frequently used data types
|Data Type||MySQL equivalent|
||BIGINT, signed, 64 bits wide
||FLOAT, 32 bits wide
||UNSIGNED INTEGER, up to 32 bits wide
||UNIX timestamps, 32 bits wide.
||VARCHAR fields, not indexed
||VARCHAR fields, full-text indexed
index entry specifies where the index files and attribute files are to be stored, whether you can do '*' queries (similar to LIKE queries with the '%' special character), and whether you are working with single-byte data (we are; the "world" schema is in ISO-8859-1) or multi-byte (UTF-8, for example) data. Setting
enable_star=1 makes for larger index files and slower index times, so don't specify it unless your business logic does require '*' queries.
source = citiesSource
path = /home/fkereki/bin/sphinx/var/data/world
docinfo = extern
enable_star = 1
min_infix_len = 3
charset_type = sbcs
Finally, you must also specify some parameters for
indexer, the Sphinx module that creates and rotates index files, and
searchd, the Sphinx search deamon, which actually runs searches. For the former, you can specify how much RAM it will use (the more, the faster) and some other parameters, such as throttling conditions and buffer sizes. For the latter, you must specify the port that the Sphinx daemon will listen on, some file paths for data and logs, and some parameters specifying, for example, how many queries to run in parallel and whether to preopen index files on startup; the default values usually work.
mem_limit = 1024M
listen = 9306:mysql41
binlog_path = /home/fkereki/bin/sphinx/var/data
log = /home/fkereki/bin/sphinx/var/log/searchd.log
pid_file = /home/fkereki/bin/sphinx/var/log/searchd.pid
query_log = /home/fkereki/bin/sphinx/var/log/query.log
You can query Sphinx through an API, but for MySQL programmers, using SphinxQL, a subset of SQL that supports the SELECT statement for running queries, is a better option. You must pick a port for SphinxQL to listen on; 9306 (similar to MySQL's 3306) is suggested. You'll be able to work with MySQL and Sphinx at the same time, each on its own port; we'll see that in a moment, but first let's index and reindex our data.
Indexing the data
Sphinx can recreate all of its index files even when it's running, because it produces an alternate set of files, which it then rotates. Indexing is very fast, and the more RAM, the faster it goes. On my machine it can index the three million city documents in about 25 seconds. Enabling star searches (
enable-star=1) can drastically and negatively impact that; when I made that change, indexing took six times as long, or about 150 seconds, and index sizes were also bigger. If required, you can reindex just a single index at the time, or specify
--all to reindex everything.
> ../bin/indexer --all --rotate
Sphinx 2.0.6-release (r3473)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/home/fkereki/bin/sphinx/etc/sphinx.conf'...
indexing index 'citiesIndex'...
collected 3033147 docs, 89.2 MB
sorted 13.2 Mhits, 100.0% done
total 3033147 docs, 89186515 bytes
total 25.680 sec, 3472869 bytes/sec, 118108.93 docs/sec
total 3 reads, 0.228 sec, 59220.6 kb/call avg, 76.0 msec/call avg
total 552 writes, 3.968 sec, 791.8 kb/call avg, 7.1 msec/call avg
rotating indices: successfully sent SIGHUP to searchd (pid=30416).
You should also think about how you want to keep index files up to date. If your data doesn't change frequently and changes involve a small percent of all data (a common situation) you can have a "main+delta" structure, with a main index for data that rarely (if ever) changes, and a delta index for new data. You can reindex the smaller delta data set frequently, but you'll have to query both index files in order to get a full answer. The Sphinx documentation talks at length about this type of solution.
Sphinx also supports RT (Real Time) index files, which can be updated on the fly by using INSERT, DELETE, and REPLACE statements. Of course, this means that every operation that updates your database must also be replicated for the benefit of Sphinx, so every involved index file is updated. If you require searching the very latest data, you have to decide between RT index files and delta files, which lead to some further research. If not, reindexing periodically can be a solution, as long as you can live with some out-of-date data for a little while.
Trying out some searches
Using SphinxQL isn't that different from using MySQL; mostly it's just a matter of specifying port 9306 instead of 3306. No matter what language you use, if you have an available MySQL client, you'll be able to easily access SphinxQL. However, don't plan on a line-for-line replacement; SphinxQL isn't full MySQL, and you have to work around some restrictions.
As an example of some quirks you have to live with, you cannot directly use string attributes for WHERE conditions, so instead of
cityname='MONTEVIDEO', you have to write
MATCH('@cityname ^MONTEVIDEO$'); the Sphinx site suggests other alternative usages. You also require workarounds with DATE and DATETIME fields; Sphinx doesn't provide those types, so you have to transform DATE fields into integers with TO_DAYS(), and DATETIME fields to timestamps with UNIX_TIMESTAMP(). Every SELECT must include at least one MATCH() call. If you don't specify a LIMIT clause, LIMIT 0,20 is assumed; MySQL would instead return all matching records. Check the latest documentation to see how these restrictions evolve and work out; it's likely most will be, over time, removed, but for the time being they may prove a problem, depending on your needs.
Despite the differences, the speed Sphinx achieves can make it worth using. For example, suppose I want data on countries having cities that include "DARWIN" in their name, with known population; Sphinx can return this information in less than 1/100th of a second. As a second query, to show grouping, I'd like to know which countries have cities with a population greater than that of Uruguay, my home country; this takes Sphinx 0.01 seconds. Note the
@count notation instead of a straight COUNT(*), and also note that queries always return two extra attributes,
weight; the first identifies the original record, and the second measures how well it matches your query.
>mysql -h192.168.1.200 -P9306 -ufkereki_user -pfkereki_pass
mysql> SELECT countryname,regionname,cityname,population
-> FROM citiesIndex
-> WHERE MATCH('DARWIN') AND population>0
-> ORDER BY countryname ASC, regionname ASC;
| id | weight | countryname | regionname | cityname | population |
| 153585 | 1685 | Australia | Northern Territory | darwin | 93081 |
| 3173417 | 1685 | Zimbabwe | Mashonaland Central | mount darwin | 11963 |
2 rows in set (0.00 sec)
mysql> SELECT countryName,@count
-> FROM citiesIndex
-> WHERE population>3500000
-> GROUP BY countryName
-> ORDER BY @count DESC, countryname ASC
-> LIMIT 10;
| id | weight | countryname | @count |
| 437596 | 1 | China | 8 |
| 1322104 | 1 | India | 8 |
| 1995219 | 1 | Nigeria | 3 |
| 156938 | 1 | Australia | 2 |
| 313817 | 1 | Brazil | 2 |
| 840438 | 1 | Egypt | 2 |
| 2200162 | 1 | Pakistan | 2 |
| 2461969 | 1 | Russian Federation | 2 |
| 2769407 | 1 | Turkey | 2 |
| 2920653 | 1 | United States | 2 |
10 rows in set (0.01 sec)
Working with MySQL produces the same results, but at a higher cost. The query for "DARWIN" cities, which requires JOINing three tables, takes almost 10 seconds, and the query for countries with cities more populated than Uruguay takes almost 1 second:
> mysql -h192.168.1.200 -ufkereki_user -pfkereki_pass
mysql> USE world;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
mysql> SELECT countryName,regionName,cityName,population
-> FROM cities
-> JOIN regions USING (countryCode,regionCode)
-> JOIN countries USING (countryCode)
-> WHERE cityName LIKE "%DARWIN%" AND population>0
-> ORDER BY countryName, regionName;
| countryName | regionName | cityName | population |
| Australia | Northern Territory | darwin | 93081 |
| Zimbabwe | Mashonaland Central | mount darwin | 11963 |
2 rows in set (9.71 sec)
mysql> SELECT countryName,count(*)
-> FROM cities join countries USING (countryCode)
-> WHERE population>3500000
-> GROUP by countryName
-> ORDER by 2 desc, 1 asc
-> LIMIT 10;
| countryName | count(*) |
| China | 8 |
| India | 8 |
| Nigeria | 3 |
| Australia | 2 |
| Brazil | 2 |
| Egypt | 2 |
| Pakistan | 2 |
| Russian Federation | 2 |
| Turkey | 2 |
| United States | 2 |
10 rows in set (0.89 sec)
Despite differences in syntax, and some restrictions for writing queries, you should appreciate that if Sphinx handles the kind of queries you require, your search times are going to be far shorter.
Adding a new component to a system can take time and resources, but Sphinx can become a significant performance enhancer, which compensates for the added complexity. Using Sphinx efficiently requires some forethought in defining sources and index files, but that is reasonable. Sphinx is very good with static data, but for frequently updated data, you'll have to go with delta files or real-time indexing, both of which carry performance penalties of their own. All in all, adding Sphinx to your MySQL installation isn't trivial, but it's not too complicated either, so when fast search times are a necessity, consider using this tool.
This work is licensed under a Creative Commons Attribution 3.0 Unported License