provides software and services that enable enterprises
Live Chat 1-888-673-6564

Open Source Software Technical Articles

  • Home
  • Search
  • Contact Us
  • Products and Support
  • Services
  • Enterprise OSS Blog
  • Wazi Technical Blog
  • About Wazi
  • Attributions and Licensing
  • Supply Chain Compliance
  • How to Contribute
  • Contributors
  • Resources Library
  • Cloud Services
  • Partners
  • Customers
  • Community
  • Company
  • Careers
  • News and Events

Subscribe to Wazi by Email

Your email:


Enterprise Developer Support 24 x 7, Get a Support Quote Now!


click-here-to-chat-with-an-online-representative

download-oss-discovery

Latest Posts

  • Use Perl to enhance ModSecurity
  • The secret to great reporting with Drupal 7
  • A more colorful LibreOffice unveiled
  • Toward a more colorful LibreOffice
  • Flexible administration with Puppet's Facter and templates
  • Knock for OpenSSH
  • Get more out of phpMyAdmin
  • Image annotation in GIMP, Dia, and OpenOffice Draw
  • Solr, Drupal 7, and faceted search
  • Using FreeNAS' new full disk encryption for ZFS

Connect with Us!

Current Articles | RSS Feed RSS Feed

Tips on Loading and Real-Time Searching of Big Data Sets

Posted by Rod Cope on Tue, May 17, 2011
  
Email This Email Article  
Tweet  
  

At OpenLogic, we manage a lot of data. We track all the world's open source software: every project, every version, every file, every line of code. Keeping all that information up-to-date and available for search is a challenge. We use the open source applications Solr, Hadoop, and HBase to store, maintain, and query our big data sets.



With the scope of data we're talking about, individual tables hold many terabytes of data. You can't fit that into a relational database, because relational databases don't scale to that extent - you can't just plug in a new machine and gain more performance. In our case, we need real-time random access to a huge non-static set of data so we can scan an organization's code to see if they're using improperly licensed code. We also need to execute long-running, complex analysis jobs against the same database of code so we can analyze trends and spot relationships.

The solution involves Hadoop, HBase, and Solr. Hadoop gives you a scalable distributed file system and a MapReduce framework for parallelizing jobs. HBase give you a NoSQL data store on top of Hadoop. Solr, which is based on Lucene, gives you the ability to search what you've put in those databases. We also use software like Stargate, which is a REST-based interface (representational state transfer) in front of HBase, which lets us access HBase from a Ruby and Ruby on Rails environment.



Here's the way our system works. People query our data via both web browser and scanner clients. Those clients take requests and pass them to our application layer. We store our primary data in a MySQL database. We use Redis for cross-machine coordination message queuing via Resque. We talk to a live replicated Solr cluster to ask for matches. If there's a match, the Resque workers ask a Stargate interface that sits in front of HBase for the full content of the matched file. The results then get put into MySQL, where they can be picked up by front-end application servers.

All of this software runs on a private cloud with more than 100 CPU cores and 100TB of disk. Why not use a public cloud like Amazon EC2? EC2 is great for computational bursts, but it's expensive for long-term storage of big data, and not yet consistent enough for mission-critical information in general and HBase in particular, which requires low latency. With EC2, you sometimes get latency spikes of several seconds or longer. Buying or leasing your own hardware is much more cost-effective.

Getting Data Out of HBase


When you have big data, you need to get information out efficiently. You want the equivalent in a relational model to a lookup via primary key rather than table joins, but there is no primary key with NoSQL. HBase is more like a hash table that you scan than a relational database that you query.



Making the HBase information available to search is where Solr comes in. It offers built-in sharding and replication, along with a dynamic schema, powerful query capabilities, and faceted search, and it's accessible through a simple API via a number of languages and technologies.

With sharding, you can query any server in a cluster, and thereby automatically execute the same query on all of the server's peers in the cluster, aggregate the results, and return them to the original caller. Sharding also offers asynchronous replication, which allows you to drop in more hardware resources if you need to.

At OpenLogic, we have our own Solr farm, sharded, cross-replicated, and front-ended with HAProxy for high availability, and for load-balanced writes across our masters and reads across both slaves and masters. We have billions of lines of code in HBase, all indexed in Solr – in fact, more than 20 fields indexed per source file.

We replicate each Solr core, or instance, from the master on one machine to a slave on the next, and so on down the line, until at the end we wrap around back to the first machine. That means if we lose any core, or any machine, or even the entire master or slave layer, we still have access to all of our data. Reads, writes, and deletes get distributed by HAProxy, and Solr takes care of the operation without our having to know anything about specific cores or hardware.

19a98812-f823-48dc-841e-bf029c63c6d7

Because there are so many parts in this solution, configuration is key. Using an open source configuration management and provisioning tool like Chef or Puppet or Cfengine to keep your machines synchronized with the same versions of software takes away the pain of having something be slightly off on one machine, and lets you quickly replace hardware if necessary and recover from any problems. Even with one of these tools, big data puts a heavy load on the underlying operating system. You should look at many of the operating system's limits on things like number of open files; you may have to raise them significantly so that you're not underutilizing your hardware. The HBase and Hadoop wikis can point you in the right direction.

Don't try to run big data on commodity hardware. Use modern name-brand rack-mounted dual-quad-core servers with 32GB or more of ECC RAM and dual- or quad-gigabit NICs. Expect to pay $6-8,000 per machine. You don't need RAID on Hadoop data disks, because Hadoop automatically gives you triple redundancy, but you may want to put your operating system and software disks on RAID. Use enterprise drives, which are built to handle the vibrations you get with machines in big server racks. Connect the nodes through redundant enterprise switches. Regardless of what hardware you use, you will have hardware failures, but Hadoop is very good at working around them. You'll also have weird cases with your own code or data that cause failures. Log the failures for later processing; don't stop jobs to fix those errors. And when you change software versions, expect more things to break.

Bottom line: If you're expecting something rock-solid and infallible, you'll be disappointed, but if you use appropriately provisioned hardware and versions of software that are known to work well together, they will run pretty well in production.

 

Loading Big Data


Here are a few tips on the often troublesome process of loading big data. In Solr, experiment with the load merge factor, which tells Solr how many indices you want to manage. If you raise the factor to, say, 25, you can minimize the work Solr has to go through to combine indices to optimize what's on disk. On the other hand, having a large number of indices makes searching slower, so after the initial load, when you'll be adding data more slowly, you can shrink the factor down to something like five.



A Solr commit isn't like a commit in a relational database. Once you add data under Solr, it's there and it's durable, but after you perform the commit, it will also be visible to queries. Solr isn't built to perform many commits per second, so change the auto-commit value during loads.

Test your write-focused load balancing so that you don't wind up with a big variance in Solr index size from machine to machine. You may have to do a commit, optimize, write again, and commit again before you can tell what your index sizes really are, because Solr does a lot of caching to optimize for speed, and here you're optimizing for size, which will ultimately enhance speed.

Make sure your replication slaves are keeping up. Use identical hardware for masters and slaves. If your index directories on the masters and slaves don't look the same, it's an indication that your slaves aren't keeping up with live backup of the data.

Avoid putting large values (greater than 5MB) into a single cell in HBase. Doing so can cause instability or strange performance issues that are hard to track down. HBase is designed to handle billions of rows by millions of columns, so you can use as many as you want. Split your values across rows or columns.

Don't use a single machine to try to load your cluster. You might not live long enough to see it finish. Spread the load across as many machines as you can, with many hard drives per machine.

Load your big data into HBase via Hadoop MapReduce jobs to take advantage of Hadoop's parallel load capabilities. Turn off the write-ahead log (WAL) in HBase via the command put.setWriteToWAL(false) in order to greatly enhance performance. You want those writes in a production environment, but you don't need them during the initial load. You should also index into Solr as you go. This helps test your load balancing, your write schemes, and your replication setup.

Scripting languages can help make data loading jobs less tedious, and help with system administration tasks. We use JRuby extensively, and in fact the HBase shell is based on JRuby. It's easy to write and then maintain MapReduce jobs using JRuby and the open source project Wukong.

Final Thoughts


HBase and Solr are fast enough to return data for more than a hundred random queries per second on huge datasets within a few milliseconds. Just give them plenty of memory.



HBase scales very well. Solr scales well up to a point, but if you find yourself outgrowing a rack of Solr instances, you should begin thinking about ways to partition your data explicitly rather than relying on the software's automatic sharding.

You can host your own big data by taking advantage of open source tools that didn't exist a few years ago. Prototyping is fast, but getting a system production-ready takes time. Budget time and money for training and support, and plan to experiment to get all the pieces to work well together.

Watch a recording of the webinar "Real-Time Searching of Big Data with Solr and Hadoop," featuring Rod Cope.

Write for Wazi, and get paid!

Follow @openlogic
Follow @CloudSwing

This work is licensed under a Creative Commons Attribution 3.0 Unported License
Creative Commons License.Follow @openlogic
Follow @OSCloudServices

This work is licensed under a Creative Commons Attribution 3.0 Unported License
Creative Commons License.
Tags: MySQL, Technical, Redis, Cloud, Solr, Chef, Puppet, Cfengine, Lucene, Hadoop, Wukong, HBase, Stargate

Comments

Currently, there are no comments. Be the first to post one!
Post Comment
Name
 *
Email
 *
Website (optional)
Comment
 *

Allowed tags: <a> link, <b> bold, <i> italics

Loading...
Error sending email
Email sent successfully

Email article
Email To : 
Your name : 
Message : (maximum 200 characters)
Home | Search | Contact Us | Products and Support | Services | Enterprise OSS Blog | Wazi Technical Blog | Resources Library | Cloud Services | Partners | Customers | Community | Company | Careers | News and Events
Products
OpenLogic Exchange (OLEX)
License Compliance Module
OSS Discovery
OSS Deep Discovery
OpenUpdate
Services
Open Source Support
CentOS Support
Scanning & Compliance
Open Source Training
Professional Services
Solutions
Support & Indemnification
Open Source Governance
Open Source Scanning
Open Source Provisioning
Consulting & Training
Contact Us
1-888-673-6564


© 2013 OpenLogic, Inc. All rights reserved.
Site Map  |  Privacy Policy