Search beyond MySQL with Sphinx

MySQL is a powerful database, but its search capabilities are most efficient for searches based upon a single database index. If your application requires complex searching, consider adding Sphinx to the mix to gain searching speed. Sphinx's main selling point is its full-text search capability, but it can also shine with regular queries.

Our test data

To show how Sphinx can improve searching performance, I needed a database with at least a few million records. I turned to the World Cities data I used in my recent Apache and RESTful Web Services article. The "world" database schema holds data about a couple of hundred countries, their four thousand regions, and more than three million towns and cities. Countries have an ID (two letters) and a name; regions within a country have an ID (two letters) and a name; cities within a region have an ID (integer), a name, an accented version of the name with possibly foreign characters, a population (if known, zero otherwise), and geographic coordinates (latitude and longitude). Searches in the country and region tables don't require much processing power, but joins involving cities can, so I used Sphinx mainly for such searches.
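
For reference, here's a minimal sketch of what the three tables look like; the exact column types in the original "world" schema may differ, but the column names match those used in the indexing query below:

CREATE TABLE countries (
  countryCode CHAR(2) PRIMARY KEY,          -- two-letter country ID
  countryName VARCHAR(50)
);

CREATE TABLE regions (
  countryCode CHAR(2),
  regionCode  CHAR(2),                      -- two-letter region ID, unique within a country
  regionName  VARCHAR(50),
  PRIMARY KEY (countryCode, regionCode)
);

CREATE TABLE cities (
  cityCode     INT UNSIGNED PRIMARY KEY,    -- integer city ID
  countryCode  CHAR(2),
  regionCode   CHAR(2),
  cityName     VARCHAR(50),
  cityAccented VARCHAR(50),                 -- name with accents/foreign characters
  population   INT UNSIGNED,                -- zero when unknown
  latitude     FLOAT,
  longitude    FLOAT
);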

Sphinx (a sort of acronym for "SQL Phrase Index") is a full-text open source search engine that supports MySQL databases (and others) natively. Its development began in 2001, and its latest version, 2.0.6, was released last month. Sphinx uses its own tuned-for-speed index files, and can thus achieve much faster search speeds than MySQL alone. It is being used for databases with billions of documents, spanning terabytes of disk space; for scalability, you can set up a cluster of Sphinx machines.

Installation of Sphinx is simple. Binary versions are available for a few Linux distributions, but installation from source isn't that complicated, so we'll take that route. After downloading the source tarball from the Sphinx download page, extract its files and run the usual ./configure, make, make install sequence, specifying the path for Sphinx's binaries; I opted for bin/sphinx under my home directory. Try ./configure --help to get a full list of available options.

> tar -zxf sphinx-2.0.6-release.tar.gz
> cd sphinx-2.0.6-release
> ./configure --prefix=/home/fkereki/bin/sphinx

...

> make

...

> make install

After you install Sphinx, you must configure it so it can access your MySQL database and create whatever index files you request, then index your data and run the daemon that actually performs the searches.

You configure Sphinx by editing its etc/sphinx.conf file in the installation directory. The configuration file is divided into stanzas. For each index you want to create, you must provide a source, which tells Sphinx how to procure the data it must index, its data types, and more, and an index definition, which specifies how to index the data, where to store the index files, and so on:

source citiesSource
{
  type              = mysql
  sql_host          = 192.168.1.200
  sql_port          = 3306
  sql_user          = fkereki_user
  sql_pass          = fkereki_pass
  sql_db            = world

  sql_query    =        \
    SELECT              \
      cityCode id,      \
      countryName,      \
      regionName,       \
      cityName,         \
      population,       \
      latitude,         \
      longitude         \
    FROM cities         \
      JOIN regions   USING (countryCode,regionCode) \
      JOIN countries USING (countryCode)

  sql_field_string  = countryName
  sql_field_string  = regionName
  sql_field_string  = cityName
  sql_attr_bigint   = population
  sql_attr_float    = latitude
  sql_attr_float    = longitude
}

We want to index cities data. The first attribute tells Sphinx it will work with a MySQL database, and the following five (host, port, user, password, and schema name) tell it how to access it. The sql_query definition tells Sphinx how to get the "documents" it will index; in our case, it will use the results of joining together countries, regions, and cities, using most of their fields. A "document ID," a unique unsigned positive integer, must always be the first field, followed by up to 32 full-text fields and any number of other attributes. The last sql_ lines define the data types of your document fields; see the table below for common allowed data types. You can also add sql_query_pre and sql_query_post options to specify SQL commands to be run before or after indexing; these could set up temporary tables, mark records to be indexed, update counters or totals, and clean up afterward.
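
As a hypothetical illustration (these statements aren't part of our configuration, and the index_log table is made up for the example), pre- and post-queries could look like this:

source citiesSource
{
  ...
  # before indexing: match the schema's ISO-8859-1 encoding, skip the query cache
  sql_query_pre  = SET NAMES latin1
  sql_query_pre  = SET SESSION query_cache_type = OFF

  # after indexing: record when the last full indexing run took place
  sql_query_post = REPLACE INTO index_log (event, happened) \
                   VALUES ('full-index', NOW())
  ...
}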

Sphinx frequently used data types

Data Type           MySQL equivalent
sql_attr_bool       BOOLEAN, TINYINT
sql_attr_bigint     BIGINT, signed, 64 bits wide
sql_attr_float      FLOAT, 32 bits wide
sql_attr_uint       UNSIGNED INTEGER, up to 32 bits wide
sql_attr_timestamp  UNIX timestamps, 32 bits wide
sql_attr_string     VARCHAR fields, not indexed
sql_field_string    VARCHAR fields, full-text indexed

The corresponding index entry specifies where the index and attribute files are to be stored, whether you can run '*' queries (similar to LIKE queries with the '%' special character), and whether you are working with single-byte data (we are; the "world" schema is in ISO-8859-1) or multi-byte (UTF-8, for example) data. Setting enable_star=1 makes for larger index files and slower indexing, so don't specify it unless your business logic really requires '*' queries.

index citiesIndex
{
  source            = citiesSource
  path              = /home/fkereki/bin/sphinx/var/data/world
  docinfo           = extern
  enable_star       = 1
  min_infix_len     = 3
  charset_type      = sbcs
}

Finally, you must also specify some parameters for indexer, the Sphinx module that creates and rotates index files, and for searchd, the Sphinx search daemon, which actually runs searches. For the former, you can specify how much RAM it may use (the more, the faster) and some other parameters, such as throttling conditions and buffer sizes. For the latter, you must specify the port the Sphinx daemon will listen on, some file paths for data and logs, and some parameters controlling, for example, how many queries to run in parallel and whether to preopen index files on startup; the default values usually work.

indexer
{
  mem_limit         = 1024M
}

searchd
{
  listen            = 9306:mysql41

  binlog_path       = /home/fkereki/bin/sphinx/var/data
  log               = /home/fkereki/bin/sphinx/var/log/searchd.log
  pid_file          = /home/fkereki/bin/sphinx/var/log/searchd.pid
  query_log         = /home/fkereki/bin/sphinx/var/log/query.log
}

You can query Sphinx through an API, but for MySQL programmers, using SphinxQL, a subset of SQL that supports the SELECT statement for running queries, is a better option. You must pick a port for SphinxQL to listen on; 9306 (similar to MySQL's 3306) is suggested. You'll be able to work with MySQL and Sphinx at the same time, each on its own port; we'll see that in a moment, but first let's index and reindex our data.

Indexing the data

Sphinx can recreate all of its index files even while it's running, because it produces an alternate set of files, which it then rotates into place. Indexing is very fast, and the more RAM, the faster it goes. On my machine Sphinx can index the three million city documents in about 25 seconds. Enabling star searches (enable_star=1) can drastically and negatively impact that; when I made that change, indexing took six times as long, about 150 seconds, and the index files were also bigger. If required, you can reindex just a single index at a time, or specify --all to reindex everything.

> ../bin/indexer --all --rotate
Sphinx 2.0.6-release (r3473)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/home/fkereki/bin/sphinx/etc/sphinx.conf'...
indexing index 'citiesIndex'...
collected 3033147 docs, 89.2 MB
sorted 13.2 Mhits, 100.0% done
total 3033147 docs, 89186515 bytes
total 25.680 sec, 3472869 bytes/sec, 118108.93 docs/sec
total 3 reads, 0.228 sec, 59220.6 kb/call avg, 76.0 msec/call avg
total 552 writes, 3.968 sec, 791.8 kb/call avg, 7.1 msec/call avg
rotating indices: successfully sent SIGHUP to searchd (pid=30416).
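
If you later define more than one index, you can rebuild just one of them by naming it instead of passing --all; for example, to rebuild only the hypothetical delta index sketched below:

> ../bin/indexer citiesDeltaIndex --rotate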

You should also think about how you want to keep index files up to date. If your data doesn't change frequently, and changes involve a small percentage of all data (a common situation), you can use a "main+delta" structure, with a main index for data that rarely (if ever) changes and a delta index for new data. You can reindex the smaller delta data set frequently, but you'll have to query both index files to get a full answer. The Sphinx documentation discusses this type of solution at length.
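
Here's a minimal sketch of that idea, using Sphinx's configuration inheritance; the lastIndexed helper table is hypothetical, standing in for whatever bookkeeping tells you where the main index stops:

source citiesDeltaSource : citiesSource
{
  # index only cities added after the last full run of the main index;
  # lastIndexed is a made-up one-row bookkeeping table
  sql_query = \
    SELECT cityCode id, countryName, regionName, cityName, \
      population, latitude, longitude \
    FROM cities \
      JOIN regions   USING (countryCode,regionCode) \
      JOIN countries USING (countryCode) \
    WHERE cityCode > (SELECT maxId FROM lastIndexed)
}

index citiesDeltaIndex : citiesIndex
{
  source = citiesDeltaSource
  path   = /home/fkereki/bin/sphinx/var/data/world_delta
}

A query then names both index files to get the full answer, as in SELECT ... FROM citiesIndex, citiesDeltaIndex WHERE MATCH('...').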

Sphinx also supports RT (real-time) index files, which can be updated on the fly with INSERT, DELETE, and REPLACE statements. Of course, this means that every operation that updates your database must also be replicated for Sphinx's benefit, so that every involved index file stays current. If you need to search the very latest data, you'll have to choose between RT index files and delta files, a decision that calls for some further research. If not, reindexing periodically can be a solution, as long as you can live with slightly out-of-date data for a little while.
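
To give an idea of what that looks like, here is a hedged sketch; the citiesRT index and its sample city are illustrative, not part of our setup. An RT index is defined directly in sphinx.conf, with no source, and is populated through SphinxQL:

index citiesRT
{
  type           = rt
  path           = /home/fkereki/bin/sphinx/var/data/world_rt
  rt_field       = cityname
  rt_attr_bigint = population
  rt_attr_float  = latitude
  rt_attr_float  = longitude
}

-- every change to the cities table must be mirrored here by your application
INSERT INTO citiesRT (id, cityname, population, latitude, longitude)
  VALUES (3033148, 'newtown', 12345, -34.9, -56.2);
REPLACE INTO citiesRT (id, cityname, population, latitude, longitude)
  VALUES (3033148, 'newtown', 12400, -34.9, -56.2);
DELETE FROM citiesRT WHERE id = 3033148;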

Trying out some searches

Using SphinxQL isn't that different from using MySQL; mostly it's just a matter of specifying port 9306 instead of 3306. No matter what language you use, if you have an available MySQL client, you'll be able to easily access SphinxQL. However, don't plan on a line-for-line replacement; SphinxQL isn't full MySQL, and you have to work around some restrictions.

As an example of some quirks you have to live with, you cannot directly use string attributes in WHERE conditions; instead of cityname='MONTEVIDEO', you have to write MATCH('@cityname ^MONTEVIDEO$'); the Sphinx site suggests other alternatives. You also need workarounds for DATE and DATETIME fields; Sphinx doesn't provide those types, so you have to transform DATE fields into integers with TO_DAYS(), and DATETIME fields into timestamps with UNIX_TIMESTAMP(). A SELECT can include at most one MATCH() call. If you don't specify a LIMIT clause, LIMIT 0,20 is assumed; MySQL would instead return all matching records. Check the latest documentation to see how these restrictions evolve; most will likely be removed over time, but for now they may prove a problem, depending on your needs.
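
A couple of hedged sketches to make those workarounds concrete (the users table and its signupDate and lastLogin columns are hypothetical):

-- string attributes can't appear in WHERE; match them as full-text fields instead
SELECT * FROM citiesIndex WHERE MATCH('@cityname ^MONTEVIDEO$');

and, in the indexing query of a source definition,

sql_query = SELECT userId id, userName, \
    TO_DAYS(signupDate) signupDay, \
    UNIX_TIMESTAMP(lastLogin) lastLogin \
  FROM users
sql_attr_uint      = signupDay
sql_attr_timestamp = lastLogin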

Despite the differences, the speed Sphinx achieves can make it worth using. For example, suppose I want data on countries having cities that include "DARWIN" in their name, with a known population; Sphinx returns this information in less than 1/100th of a second. As a second query, to show grouping, I'd like to know which countries have cities with a population greater than that of Uruguay, my home country; this takes Sphinx 0.01 seconds. Note the @count notation instead of a straight COUNT(*), and also note that queries always return two extra attributes, id and weight; the first identifies the original record, and the second measures how well it matches your query.

> mysql -h192.168.1.200 -P9306 -ufkereki_user -pfkereki_pass

...

mysql> SELECT countryname,regionname,cityname,population
    -> FROM citiesIndex
    -> WHERE MATCH('DARWIN') AND population>0
    -> ORDER BY countryname ASC, regionname ASC;
+---------+--------+-------------+---------------------+--------------+------------+
| id      | weight | countryname | regionname          | cityname     | population |
+---------+--------+-------------+---------------------+--------------+------------+
|  153585 |   1685 | Australia   | Northern Territory  | darwin       |      93081 |
| 3173417 |   1685 | Zimbabwe    | Mashonaland Central | mount darwin |      11963 |
+---------+--------+-------------+---------------------+--------------+------------+
2 rows in set (0.00 sec)

mysql> SELECT countryName,@count
    -> FROM citiesIndex
    -> WHERE population>3500000
    -> GROUP BY countryName
    -> ORDER BY @count DESC, countryname ASC
    -> LIMIT 10;
+---------+--------+--------------------+--------+
| id      | weight | countryname        | @count |
+---------+--------+--------------------+--------+
|  437596 |      1 | China              |      8 |
| 1322104 |      1 | India              |      8 |
| 1995219 |      1 | Nigeria            |      3 |
|  156938 |      1 | Australia          |      2 |
|  313817 |      1 | Brazil             |      2 |
|  840438 |      1 | Egypt              |      2 |
| 2200162 |      1 | Pakistan           |      2 |
| 2461969 |      1 | Russian Federation |      2 |
| 2769407 |      1 | Turkey             |      2 |
| 2920653 |      1 | United States      |      2 |
+---------+--------+--------------------+--------+
10 rows in set (0.01 sec)

Working with MySQL produces the same results, but at a higher cost. The query for "DARWIN" cities, which requires JOINing three tables, takes almost 10 seconds, and the query for countries with cities more populated than Uruguay takes almost 1 second:

> mysql -h192.168.1.200 -ufkereki_user -pfkereki_pass

...

mysql> USE world;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed

mysql> SELECT countryName,regionName,cityName,population
    -> FROM cities
    -> JOIN regions USING (countryCode,regionCode)
    -> JOIN countries USING (countryCode)
    -> WHERE cityName LIKE "%DARWIN%" AND population>0
    -> ORDER BY countryName, regionName;
+-------------+---------------------+--------------+------------+
| countryName | regionName          | cityName     | population |
+-------------+---------------------+--------------+------------+
| Australia   | Northern Territory  | darwin       |      93081 |
| Zimbabwe    | Mashonaland Central | mount darwin |      11963 |
+-------------+---------------------+--------------+------------+
2 rows in set (9.71 sec)

mysql> SELECT countryName,count(*)
    -> FROM cities join countries USING (countryCode)
    -> WHERE population>3500000
    -> GROUP by countryName
    -> ORDER by 2 desc, 1 asc
    -> LIMIT 10;
+--------------------+----------+
| countryName        | count(*) |
+--------------------+----------+
| China              |        8 |
| India              |        8 |
| Nigeria            |        3 |
| Australia          |        2 |
| Brazil             |        2 |
| Egypt              |        2 |
| Pakistan           |        2 |
| Russian Federation |        2 |
| Turkey             |        2 |
| United States      |        2 |
+--------------------+----------+
10 rows in set (0.89 sec)

Despite the differences in syntax and the restrictions on writing queries, if Sphinx handles the kinds of queries you require, your search times will be far shorter.

In conclusion

Adding a new component to a system takes time and resources, but Sphinx can deliver a performance boost significant enough to compensate for the added complexity. Using Sphinx efficiently requires some forethought in defining sources and index files, but that effort is reasonable. Sphinx is very good with static data; for frequently updated data, you'll have to go with delta files or real-time indexing, both of which carry performance penalties of their own. All in all, adding Sphinx to your MySQL installation isn't trivial, but it's not too complicated either, so when fast search times are a necessity, consider this tool.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Comments

Another alternative is SphinxSE. It's a MySQL plugin that enables Sphinx as a storage engine. You create a table with the Sphinx engine among your normal InnoDB/MyISAM tables and query it as you would any other table. The only difference is that you run your Sphinx query within a single field:

select id from sphinx_table where query="search query;filter=etc.;sort=extended:etc."

The reason I love this method is that you can join your traditional MySQL tables to the results with a single database connection.

It does add a little complexity, because you need to compile the plugin against your version of MySQL. This can be a problem if you're running a cPanel server that auto-upgrades your MySQL server.
Posted @ Monday, November 12, 2012 10:35 AM by Scott Aikin