At OpenLogic, we manage a lot of data. We track all the world's open source software: every project, every version, every file, every line of code. Keeping all that information up-to-date and available for search is a challenge. We use the open source applications Solr, Hadoop, and HBase to store, maintain, and query our big data sets.
With the scope of data we're talking about, individual tables hold many terabytes of data. You can't fit that into a relational database, because relational databases don't scale to that extent; you can't just plug in a new machine and gain more performance. In our case, we need real-time random access to a huge, non-static set of data so we can scan an organization's code to see if it's using improperly licensed code. We also need to execute long-running, complex analysis jobs against the same database of code so we can analyze trends and spot relationships.

The solution involves Hadoop, HBase, and Solr. Hadoop gives you a scalable distributed file system and a MapReduce framework for parallelizing jobs. HBase gives you a NoSQL data store on top of Hadoop. Solr, which is based on Lucene, gives you the ability to search what you've put in those databases. We also use Stargate, a REST (representational state transfer) interface in front of HBase that lets us access HBase from a Ruby and Ruby on Rails environment.

Here's the way our system works. People query our data via both web browser and scanner clients. Those clients take requests and pass them to our application layer. We store our primary data in a MySQL database. We use Redis for cross-machine coordination and message queuing via Resque. We talk to a live replicated Solr cluster to ask for matches. If there's a match, the Resque workers ask a Stargate interface that sits in front of HBase for the full content of the matched file. The results then get put into MySQL, where they can be picked up by front-end application servers.

All of this software runs on a private cloud with more than 100 CPU cores and 100TB of disk. Why not use a public cloud like Amazon EC2? EC2 is great for computational bursts, but it's expensive for long-term storage of big data, and not yet consistent enough for mission-critical information in general and HBase in particular, which requires low latency. With EC2, you sometimes get latency spikes of several seconds or longer. Buying or leasing your own hardware is much more cost-effective.
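As a rough illustration of the Stargate step, here's a minimal Java sketch of the kind of row lookup a Resque worker might perform against the REST interface. The host name, port, table name (source_files), and row key are all hypothetical; Stargate serves a row at /&lt;table&gt;/&lt;row&gt;, and cell values in the JSON response come back base64-encoded.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class StargateFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Stargate endpoint: GET /<table>/<row> returns the full row
        URL url = new URL("http://stargate-host:8080/source_files/" + args[0]);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json"); // cell values arrive base64-encoded

        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // in production, decode the cells and hand the content back via MySQL
        }
        in.close();
    }
}
```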
When you have big data, you need to get information out efficiently. You want the equivalent of a relational primary-key lookup rather than table joins, but there is no primary key in NoSQL; in HBase, the row key plays that role. HBase is more like a hash table that you scan than a relational database that you query.
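Here's a minimal sketch of that difference using the HBase Java client (0.90-era API). The table name and row-key scheme are hypothetical; the point is that you either fetch a row directly by its key or scan a range of keys, and nothing else.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "source_files"); // hypothetical table name

        // Direct fetch by row key -- the NoSQL analogue of a primary-key lookup
        Get get = new Get(Bytes.toBytes("org.example/project-1.0/src/Foo.java"));
        Result row = table.get(get);
        System.out.println("cells in row: " + row.size());

        // Range scan over a block of adjacent row keys -- there are no joins, only scans
        Scan scan = new Scan(Bytes.toBytes("org.example/"), Bytes.toBytes("org.example0"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
    }
}
```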
Making the HBase information available to search is where Solr comes in. It offers built-in sharding and replication, along with a dynamic schema, powerful query capabilities, and faceted search, and it's accessible through a simple API from a number of languages and technologies.

With sharding, you can query any server in a cluster; that server automatically executes the same query on all of its peers in the cluster, aggregates the results, and returns them to the original caller. Solr also offers asynchronous replication, which lets you drop in more hardware resources when you need to.

At OpenLogic, we have our own Solr farm, sharded, cross-replicated, and front-ended with HAProxy for high availability and for load-balanced writes across our masters and reads across both slaves and masters. We have billions of lines of code in HBase, all indexed in Solr, with more than 20 fields indexed per source file.

We replicate each Solr core, or instance, from the master on one machine to a slave on the next, and so on down the line, until at the end we wrap around back to the first machine. That means if we lose any core, or any machine, or even the entire master or slave layer, we still have access to all of our data. Reads, writes, and deletes get distributed by HAProxy, and Solr takes care of the operation without our having to know anything about specific cores or hardware.
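As a rough sketch of what a distributed query looks like from the client side, here's a minimal SolrJ example using Solr's standard shards parameter. The host names, core name, and query field are hypothetical, and older SolrJ releases use CommonsHttpSolrServer where newer ones use HttpSolrServer; in a setup like ours, you would point the client at the HAProxy front end rather than at an individual node.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical HAProxy front end that balances reads across masters and slaves
        SolrServer solr = new HttpSolrServer("http://solr-haproxy:8983/solr/source_files");

        SolrQuery query = new SolrQuery("contents:\"GNU General Public License\"");
        // Ask whichever core receives the query to fan it out to its peers and merge the results
        query.set("shards",
                "solr1:8983/solr/source_files,solr2:8983/solr/source_files,solr3:8983/solr/source_files");
        query.setRows(10);

        QueryResponse response = solr.query(query);
        System.out.println("matches: " + response.getResults().getNumFound());
    }
}
```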
Here are a few tips on the often troublesome process of loading big data. In Solr, experiment with the merge factor during the load; it controls how many index segments Solr lets accumulate before merging them on disk. If you raise the factor to, say, 25, you minimize the merge work Solr has to do while you're loading. On the other hand, having a large number of segments makes searching slower, so after the initial load, when you'll be adding data more slowly, you can shrink the factor down to something like five.
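For reference, here's roughly where that setting lives. This is an illustrative solrconfig.xml excerpt rather than our exact configuration, and the value simply echoes the number above.

```xml
<!-- solrconfig.xml (illustrative excerpt) -->
<mainIndex>
  <!-- 25 during the initial bulk load; drop back to something like 5 once loading slows down -->
  <mergeFactor>25</mergeFactor>
</mainIndex>
```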
A Solr commit isn't like a commit in a relational database. Once you add data to Solr, it's there and it's durable, but only after you perform a commit does it become visible to queries. Solr isn't built to perform many commits per second, so change the auto-commit value during loads.

Test your write-focused load balancing so that you don't wind up with a big variance in Solr index size from machine to machine. You may have to commit, optimize, write again, and commit again before you can tell what your index sizes really are, because Solr does a lot of caching to optimize for speed, and here you're optimizing for size, which will ultimately enhance speed.

Make sure your replication slaves are keeping up, and use identical hardware for masters and slaves. If the index directories on the masters and slaves don't look the same, it's a sign that your slaves aren't keeping up with live backup of the data.

Avoid putting large values (greater than 5MB) into a single cell in HBase. Doing so can cause instability or strange performance issues that are hard to track down. HBase is designed to handle billions of rows by millions of columns, so you can use as many as you want; split your values across rows or columns.

Don't use a single machine to try to load your cluster; you might not live long enough to see it finish. Spread the load across as many machines as you can, with many hard drives per machine.

Load your big data into HBase via Hadoop MapReduce jobs to take advantage of Hadoop's parallel load capabilities. Turn off the write-ahead log (WAL) in HBase via the call put.setWriteToWAL(false) to greatly enhance performance, as in the sketch below. You want those writes in a production environment, but you don't need them during the initial load. You should also index into Solr as you go; this helps test your load balancing, your write schemes, and your replication setup.

Scripting languages can help make data loading jobs less tedious and help with system administration tasks. We use JRuby extensively; in fact, the HBase shell is based on JRuby. It's easy to write and then maintain MapReduce jobs using JRuby and the open source project Wukong.
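Here's a minimal sketch of a load-time write with the WAL turned off, using the HBase Java client (0.90-era API). The table name, column family, and row key are hypothetical, and in a real load this logic would sit inside your MapReduce job rather than a standalone program.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BulkLoadPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "source_files"); // hypothetical table name
        table.setAutoFlush(false);                       // batch puts client-side during the load

        Put put = new Put(Bytes.toBytes("org.example/project-1.0/src/Foo.java"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("text"),
                Bytes.toBytes("...file contents go here..."));
        put.setWriteToWAL(false); // skip the write-ahead log during the initial load only
        table.put(put);

        table.flushCommits(); // push the batched puts to the region servers
        table.close();
    }
}
```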
HBase and Solr are fast enough to serve more than a hundred random queries per second against huge datasets, returning results within a few milliseconds. Just give them plenty of memory.
HBase scales very well. Solr scales well up to a point, but if you find yourself outgrowing a rack of Solr instances, you should begin thinking about ways to partition your data explicitly rather than relying on the software's automatic sharding.

You can host your own big data by taking advantage of open source tools that didn't exist a few years ago. Prototyping is fast, but getting a system production-ready takes time. Budget time and money for training and support, and plan to experiment to get all the pieces to work well together.

Watch a recording of the webinar "Real-Time Searching of Big Data with Solr and Hadoop," featuring Rod Cope.