At OpenLogic, we manage a lot of data. We track all the world’s open source software: every project, every version, every file, every line of code. Keeping all that information up-to-date and available for search is a challenge. We use the open source applications Solr, Hadoop, and HBase to store, maintain, and query our big data sets.
With the scope of data we’re talking about, individual tables hold many terabytes of data. You can’t fit that into a relational database, because relational databases don’t scale to that extent – you can’t just plug in a new machine and gain more performance. In our case, we need real-time random access to a huge non-static set of data so we can scan an organization’s code to see if they’re using improperly licensed code. We also need to execute long-running, complex analysis jobs against the same database of code so we can analyze trends and spot relationships.
The solution involves Hadoop, HBase, and Solr. Hadoop gives you a scalable distributed file system and a MapReduce framework for parallelizing jobs. HBase gives you a NoSQL data store on top of Hadoop. Solr, which is based on Lucene, gives you the ability to search what you’ve put in those databases. We also use Stargate, a REST (representational state transfer) interface in front of HBase, which lets us access HBase from a Ruby and Ruby on Rails environment.

Here’s the way our system works. People query our data via both web browser and scanner clients. Those clients take requests and pass them to our application layer. We store our primary data in a MySQL database. We use Redis for cross-machine coordination and for message queuing via Resque. We talk to a live replicated Solr cluster to ask for matches. If there’s a match, the Resque workers ask a Stargate interface that sits in front of HBase for the full content of the matched file. The results then get put into MySQL, where they can be picked up by front-end application servers.
All of this software runs on a private cloud with more than 100 CPU cores and 100TB of disk. Why not use a public cloud like Amazon EC2? EC2 is great for computational bursts, but it’s expensive for long-term storage of big data, and not yet consistent enough for mission-critical information in general and HBase in particular, which requires low latency. With EC2, you sometimes get latency spikes of several seconds or longer. Buying or leasing your own hardware is much more cost-effective.
Getting Data Out of HBase
When you have big data, you need to get information out efficiently. You want the equivalent of a relational lookup by primary key rather than a set of table joins, but there is no primary key with NoSQL. HBase is more like a hash table that you scan than a relational database that you query.
Making the HBase information available to search is where Solr comes in. It offers built-in sharding and replication, along with a dynamic schema, powerful query capabilities, and faceted search, and it’s accessible through a simple API via a number of languages and technologies.
With sharding, you can query any server in a cluster. That server automatically executes the same query on all of its peers in the cluster, aggregates the results, and returns them to the original caller. Solr also offers asynchronous replication, which allows you to drop in more hardware resources if you need to.
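The scatter-and-gather pattern behind a sharded query can be sketched in a few lines of Python. This is a toy model, not Solr's implementation: the shard names, the hits, and the `query_shard` function are made up for illustration, and a real cluster would send the query over HTTP rather than read from a dictionary.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Toy shards: each holds (score, doc_id) hits for its slice of the index.
# All names and scores here are invented for the example.
SHARDS = {
    "shard1": [(0.9, "a.c"), (0.4, "b.c")],
    "shard2": [(0.7, "c.c")],
    "shard3": [(0.95, "d.c"), (0.1, "e.c")],
}

def query_shard(name, min_score):
    """Stand-in for sending the query to one shard and reading its hits."""
    return [hit for hit in SHARDS[name] if hit[0] >= min_score]

def distributed_query(min_score, top_n):
    """Run the same query on every shard in parallel, then merge by score."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda n: query_shard(n, min_score), SHARDS))
    all_hits = [hit for part in partials for hit in part]
    return heapq.nlargest(top_n, all_hits)

results = distributed_query(min_score=0.3, top_n=3)
print(results)
```

The caller sees one merged, ranked list and never needs to know which shard held which document.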
At OpenLogic, we have our own Solr farm, sharded, cross-replicated, and front-ended with HAProxy for high availability, and for load-balanced writes across our masters and reads across both slaves and masters. We have billions of lines of code in HBase, all indexed in Solr – in fact, more than 20 fields indexed per source file.
We replicate each Solr core, or instance, from the master on one machine to a slave on the next, and so on down the line, until at the end we wrap around back to the first machine. That means if we lose any core, or any machine, or even the entire master or slave layer, we still have access to all of our data. Reads, writes, and deletes get distributed by HAProxy, and Solr takes care of the operation without our having to know anything about specific cores or hardware.
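To see why a ring of masters and slaves survives the loss of a machine, here is a small Python model of the layout. The machine names and the `ring_layout` and `cores_available` helpers are hypothetical, and the model assumes one master core per machine, which may not match a real deployment.

```python
def ring_layout(machines):
    """Assign each machine one master core and one slave replica.

    Machine i hosts master core i and a slave copy of the core mastered
    on the previous machine, so the last machine replicates to the first.
    """
    n = len(machines)
    layout = {}
    for i, machine in enumerate(machines):
        layout[machine] = {"master": i, "slave_of": (i - 1) % n}
    return layout

def cores_available(layout, failed):
    """Return the set of cores still readable after the given machines fail."""
    alive = set()
    for machine, roles in layout.items():
        if machine not in failed:
            alive.add(roles["master"])
            alive.add(roles["slave_of"])
    return alive

layout = ring_layout(["solr1", "solr2", "solr3", "solr4"])
all_cores = {0, 1, 2, 3}

# Losing any single machine leaves every core readable somewhere.
for machine in layout:
    print(machine, "down ->", sorted(cores_available(layout, {machine})))
```

In this simplified model, losing two adjacent machines does lose a core, because a core's master and its only slave sit on neighboring machines.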
Because there are so many parts in this solution, configuration is key. Using an open source configuration management and provisioning tool like Chef or Puppet or Cfengine to keep your machines synchronized with the same versions of software takes away the pain of having something be slightly off on one machine, and lets you quickly replace hardware if necessary and recover from any problems. Even with one of these tools, big data puts a heavy load on the underlying operating system. You should look at many of the operating system’s limits on things like number of open files; you may have to raise them significantly so that you’re not underutilizing your hardware. The HBase and Hadoop wikis can point you in the right direction.
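As one example of checking an operating system limit, the following Python sketch reads the open-file limit for the current process using the standard `resource` module (Unix only). The target of 32768 is an assumed placeholder, not a recommendation from this article; check the HBase and Hadoop documentation for values that suit your versions.

```python
import resource

WANTED = 32768  # hypothetical target, chosen only for illustration

def open_file_limits():
    """Return the (soft, hard) limits on open files for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def needs_raising(soft, wanted):
    """True if the soft limit is finite and below the wanted value."""
    if soft == resource.RLIM_INFINITY:
        return False
    return soft < wanted

soft, hard = open_file_limits()
print("soft limit:", soft, "hard limit:", hard)
if needs_raising(soft, WANTED):
    print("Consider raising the open-file limit for the HBase and Hadoop users")
```

The limit that matters is the one seen by the user account that runs the daemons, so a check like this is most useful when run as that user.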
Don’t try to run big data on commodity hardware. Use modern name-brand rack-mounted dual-quad-core servers with 32GB or more of ECC RAM and dual- or quad-gigabit NICs. Expect to pay $6,000 to $8,000 per machine. You don’t need RAID on Hadoop data disks, because Hadoop automatically gives you triple redundancy, but you may want to put your operating system and software disks on RAID. Use enterprise drives, which are built to handle the vibrations you get with machines in big server racks. Connect the nodes through redundant enterprise switches. Regardless of what hardware you use, you will have hardware failures, but Hadoop is very good at working around them. You’ll also have weird cases with your own code or data that cause failures. Log the failures for later processing; don’t stop jobs to fix those errors. And when you change software versions, expect more things to break.
Bottom line: If you’re expecting something rock-solid and infallible, you’ll be disappointed, but if you use appropriately provisioned hardware and versions of software that are known to work well together, they will run pretty well in production.
Loading Big Data
Here are a few tips on the often troublesome process of loading big data. In Solr, experiment with the merge factor (the mergeFactor setting), which tells Solr how many index segments to keep before merging them. If you raise the factor to, say, 25, you minimize the work Solr has to do to combine segments on disk during the load. On the other hand, having a large number of segments makes searching slower, so after the initial load, when you’ll be adding data more slowly, you can shrink the factor down to something like five.
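The trade-off can be illustrated with a deliberately simplified Python model of merging. It assumes every flush writes a segment of the same size and that a level is merged as soon as it holds merge-factor segments; Lucene's real merge policies are more involved, and the function name and numbers below are invented for the example.

```python
def simulate_load(num_flushes, merge_factor, docs_per_flush=1000):
    """Return (segments_left, docs_rewritten) for a simplified merge model."""
    levels = [[]]  # levels[k] holds the sizes of segments at merge level k
    docs_rewritten = 0
    for _ in range(num_flushes):
        levels[0].append(docs_per_flush)
        level = 0
        # Cascade: whenever a level fills up, merge it into one bigger segment.
        while len(levels[level]) == merge_factor:
            merged = sum(levels[level])
            docs_rewritten += merged
            levels[level] = []
            if level + 1 == len(levels):
                levels.append([])
            levels[level + 1].append(merged)
            level += 1
    segments_left = sum(len(lvl) for lvl in levels)
    return segments_left, docs_rewritten

for factor in (5, 25):
    segments, rewritten = simulate_load(600, factor)
    print("merge factor", factor, "segments", segments, "docs rewritten", rewritten)
```

In this model, 600 flushes with a factor of 25 rewrite 600,000 documents and leave 24 segments, while a factor of 5 rewrites 1,700,000 documents and leaves 8 segments: less merge work during the load, at the cost of more segments to search.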
A Solr commit isn’t like a commit in a relational database. Once you add data under Solr, it’s there and it’s durable, but it becomes visible to queries only after you perform the commit. Solr isn’t built to perform many commits per second, so change the auto-commit value during loads so that commits happen less often.
Test your write-focused load balancing so that you don’t wind up with a big variance in Solr index size from machine to machine. You may have to do a commit, optimize, write again, and commit again before you can tell what your index sizes really are, because Solr does a lot of caching to optimize for speed, and here you’re optimizing for size, which will ultimately enhance speed.
Make sure your replication slaves are keeping up. Use identical hardware for masters and slaves. If your index directories on the masters and slaves don’t look the same, it’s an indication that your slaves aren’t keeping up with live backup of the data.
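A rough way to automate that comparison is to list the files in each index directory and report anything on the master that the slave lacks or holds at a different size. The Python sketch below assumes both directories are reachable from one machine (for example over a network mount or after copying the listings), and the file names in the demo are made up.

```python
import os
import tempfile

def index_listing(index_dir):
    """Return {file name: size in bytes} for regular files in a directory."""
    listing = {}
    for entry in os.scandir(index_dir):
        if entry.is_file():
            listing[entry.name] = entry.stat().st_size
    return listing

def replication_lag(master_dir, slave_dir):
    """Return names of master files that are missing or differ on the slave."""
    master = index_listing(master_dir)
    slave = index_listing(slave_dir)
    return sorted(name for name, size in master.items() if slave.get(name) != size)

# Demo with temporary directories standing in for real index directories.
with tempfile.TemporaryDirectory() as master, tempfile.TemporaryDirectory() as slave:
    for name in ("_0.fdt", "_1.fdt"):
        with open(os.path.join(master, name), "wb") as f:
            f.write(b"x" * 10)
    with open(os.path.join(slave, "_0.fdt"), "wb") as f:
        f.write(b"x" * 10)
    behind = replication_lag(master, slave)

print("files the slave has not caught up on:", behind)
```

A non-empty result right after a commit is normal; a result that stays non-empty or keeps growing suggests the slave is falling behind.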
Avoid putting large values (greater than 5MB) into a single cell in HBase. Doing so can cause instability or strange performance issues that are hard to track down. HBase is designed to handle billions of rows by millions of columns, so you can use as many as you want. Split your values across rows or columns.
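Splitting a value across columns can be as simple as cutting it into fixed-size chunks and naming the column qualifiers so that they sort in chunk order. The Python sketch below only builds and reassembles the chunks; the 4MB chunk size and the `chunk_NNNNN` naming are assumptions for illustration, and writing the columns to HBase is left to whatever client you use.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # assumed size, chosen to stay under the 5MB guideline

def split_value(value, chunk_size=CHUNK_SIZE):
    """Split bytes into {column qualifier: chunk}, qualifiers sorting in order."""
    if not value:
        return {"chunk_00000": b""}
    return {
        "chunk_%05d" % i: value[offset:offset + chunk_size]
        for i, offset in enumerate(range(0, len(value), chunk_size))
    }

def join_value(columns):
    """Reassemble the original bytes from the chunk columns."""
    return b"".join(columns[qualifier] for qualifier in sorted(columns))

data = bytes(range(256)) * 10          # 2,560 bytes of sample data
columns = split_value(data, chunk_size=1000)
print(sorted(columns), [len(columns[q]) for q in sorted(columns)])
```

Zero-padded qualifiers keep the chunks in order when the columns of a row are read back sorted, so reassembly needs no extra index.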
Don’t use a single machine to try to load your cluster. You might not live long enough to see it finish. Spread the load across as many machines as you can, with many hard drives per machine.
Load your big data into HBase via Hadoop MapReduce jobs to take advantage of Hadoop’s parallel load capabilities. Turn off the write-ahead log (WAL) in HBase by calling put.setWriteToWAL(false) on each Put in order to greatly enhance performance. You want WAL writes in a production environment, but you don’t need them during the initial load. You should also index into Solr as you go. This helps test your load balancing, your write schemes, and your replication setup.
Scripting languages can help make data loading jobs less tedious, and help with system administration tasks. We use JRuby extensively, and in fact the HBase shell is based on JRuby. It’s easy to write and then maintain MapReduce jobs using JRuby and the open source project Wukong.
Final Thoughts
HBase and Solr are fast enough to return data for more than a hundred random queries per second on huge datasets within a few milliseconds. Just give them plenty of memory.
HBase scales very well. Solr scales well up to a point, but if you find yourself outgrowing a rack of Solr instances, you should begin thinking about ways to partition your data explicitly rather than relying on the software’s automatic sharding.
You can host your own big data by taking advantage of open source tools that didn’t exist a few years ago. Prototyping is fast, but getting a system production-ready takes time. Budget time and money for training and support, and plan to experiment to get all the pieces to work well together.