November 28, 2023

What Is Apache HBase? HBase Features, Use Cases, and Alternatives

Apache HBase is part of the Hadoop ecosystem, which is widely used in the big data space because it can store and analyze high volumes of unstructured data. Originally prototyped in 2006, HBase has gained traction and has been a top-level Apache project since 2010. Keep reading to learn about HBase architecture, features, use cases, and alternatives.

What Is HBase?
HBase Architecture
Notable HBase Features
HBase Use Cases
HBase Alternatives
Final Thoughts

What Is HBase?

Apache HBase is an open source distributed database built on top of the Hadoop Distributed File System (HDFS). HBase is written in Java and is a NoSQL column-oriented database capable of managing massive amounts of data, potentially billions of rows and millions of columns.

HBase was developed in 2006 by Powerset, a company later acquired by Microsoft, as part of a project to create a natural language search engine for the Internet. Its design comes from “Bigtable: A Distributed Storage System for Structured Data,” a paper published by Google that describes the API for a distributed storage system that can manage petabytes of structured data using a cluster of commodity servers. HBase was later contributed as open source and became a sub-project of Hadoop in 2008.

HBase Architecture

People typically associate the term “database” with relational databases (RDBMS). With that in mind, it’s best to think of HBase as a “data store”, since it does not have all the bells and whistles and guardrails that are standard in an RDBMS. In HBase, there are no defined column types, column constraints, action triggers, compound indexes, or native SQL query support.

HBase is built on top of Hadoop, which is geared toward batch processing, and HDFS is not suited to random disk access. In fact, HDFS cannot update a file in place; all updates require rewriting the entire file. Random access and in-place updates are precisely the strengths of a traditional RDBMS, which is optimized for fast random data I/O. Unfortunately, relational databases struggle to handle large volumes of data due to indexing overhead, as well as the need to maintain performance while providing all the conveniences mentioned previously. An RDBMS is usually scaled vertically and requires specialized hardware and storage devices for optimal performance.

HBase was envisioned, architected, and developed to strike a balance between HDFS and an RDBMS, designed to overcome the drawbacks that existed in real-time data processing in Hadoop. It accomplishes this by focusing on the specific problems of real-time access while leveraging the strengths of some existing components of the Hadoop ecosystem to do the rest.

HBase Data Hierarchy

In HBase:

a table is made up of one or more rows
a row, identified by a unique row key, consists of one or more columns
a column contains cells, which are timestamped versions of the value in that column
columns are grouped into column families

HBase requires a predefined table schema that specifies the column families. However, there is flexibility in lower levels of the hierarchy, as new columns can be added to families any time, allowing the schema to adapt to evolving application requirements.
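
The hierarchy above can be pictured as nested maps. The following is a minimal in-memory sketch, not the HBase client API: `ToyTable` and its methods are hypothetical names introduced only for illustration. It assumes column families are fixed at table creation, while columns within a family can appear at any time and every write is stored as a timestamped version.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase table (hypothetical, for illustration)."""

    def __init__(self, column_families):
        # Column families are part of the predefined schema.
        self.column_families = set(column_families)
        # row key -> "family:column" -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value, timestamp):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Columns need no prior declaration; each write adds a version.
        self.rows[row_key][f"{family}:{column}"][timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        """Return the newest version at or before `timestamp` (latest if None)."""
        versions = self.rows.get(row_key, {}).get(f"{family}:{column}", {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None


table = ToyTable(column_families=["info", "metrics"])
table.put("user#42", "info", "email", "a@example.com", timestamp=1)
table.put("user#42", "info", "email", "b@example.com", timestamp=2)
# A new column in an existing family requires no schema change.
table.put("user#42", "info", "nickname", "bee", timestamp=2)
latest_email = table.get("user#42", "info", "email")
older_email = table.get("user#42", "info", "email", timestamp=1)
```

Writing to a family that was not declared, such as `table.put("user#42", "audit", "seen", "yes", timestamp=3)`, raises a `KeyError` in this sketch, mirroring the rule that only column families belong to the fixed schema.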

HBase I/O Flow

**Reads (HBase client perspective)**
- Request the location of the META system table from ZooKeeper.
- Cache the location of the META system table for future requests.
- Request the location of the RegionServer hosting the desired row key from the META system table.
- Cache the location of the RegionServer for future requests.
- Request the row from the corresponding RegionServer.

**Writes (HBase server perspective)**
- Write data updates to the Write Ahead Log (WAL) journal (via HDFS HLog).
- Write data to the in-memory MemStore.
- When the MemStore reaches capacity, write data to permanent storage (via HDFS HFile).
- When a RegionServer outage or failure occurs, use the WAL to replay updates on the new RegionServer assigned by HMaster.
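
The server-side write steps listed above can be sketched as a toy model. This is an illustration under simplifying assumptions, not HBase code: `ToyRegionServer` is a hypothetical class, plain Python lists and dicts stand in for HLog and HFile storage on HDFS, and the MemStore "capacity" is a row count rather than a size in bytes.

```python
class ToyRegionServer:
    """Toy write path: WAL first, then MemStore, flush to files when full."""

    def __init__(self, memstore_limit=3):
        self.wal = []        # stands in for the HLog on HDFS
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # flushed, immutable snapshots (stand in for HFiles)
        self.memstore_limit = memstore_limit

    def put(self, row_key, value):
        self.wal.append((row_key, value))   # 1. log the update
        self.memstore[row_key] = value      # 2. apply it in memory
        if len(self.memstore) >= self.memstore_limit:
            self.flush()                    # 3. persist when "full"

    def flush(self):
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}
        self.wal = []  # flushed edits no longer need to be replayed

    def get(self, row_key):
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):  # newest flushed file first
            if row_key in hfile:
                return hfile[row_key]
        return None

    @classmethod
    def recover(cls, wal, hfiles, memstore_limit=3):
        """4. Rebuild state on a replacement server by replaying the WAL."""
        server = cls(memstore_limit)
        server.hfiles = list(hfiles)
        server.wal = list(wal)
        for row_key, value in wal:
            server.memstore[row_key] = value
        return server


rs = ToyRegionServer(memstore_limit=3)
for i in range(4):
    rs.put(f"row{i}", f"v{i}")
# row0..row2 were flushed; row3 exists only in the WAL and the MemStore.
replacement = ToyRegionServer.recover(rs.wal, rs.hfiles)
```

After the simulated failure, `replacement.get("row3")` is served from the replayed WAL and `replacement.get("row0")` from the flushed file, which is the reason the log is written before the in-memory update.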

The HBase client API uses the META system table to identify the region hosting the requested key, so it can read or write to the node directly without interacting with the HMaster node.
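
As a rough sketch of that lookup, the toy model below caches region locations on the client so that repeated requests skip the META table. `ToyCluster` and `ToyClient` are hypothetical names, the ZooKeeper step is folded into the cluster object, and regions are assumed to be defined by sorted start keys.

```python
import bisect


class ToyCluster:
    """Holds the region-to-server mapping that the META table would hold."""

    def __init__(self, region_start_keys, region_servers):
        # Region i covers [start_keys[i], start_keys[i + 1]); the first start key is "".
        self.start_keys = region_start_keys
        self.servers = region_servers
        self.meta_lookups = 0

    def locate(self, row_key):
        self.meta_lookups += 1
        idx = bisect.bisect_right(self.start_keys, row_key) - 1
        end = self.start_keys[idx + 1] if idx + 1 < len(self.start_keys) else None
        return self.start_keys[idx], end, self.servers[idx]


class ToyClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.region_cache = []  # (start_key, end_key, server) entries

    def server_for(self, row_key):
        for start, end, server in self.region_cache:
            if start <= row_key and (end is None or row_key < end):
                return server  # cache hit: no META lookup needed
        region = self.cluster.locate(row_key)
        self.region_cache.append(region)
        return region[2]


cluster = ToyCluster(["", "g", "p"], ["rs1", "rs2", "rs3"])
client = ToyClient(cluster)
first = client.server_for("house")   # consults the toy META mapping
second = client.server_for("hat")    # same region, answered from the cache
```

Both keys fall in the region starting at `"g"`, so only the first call increments `cluster.meta_lookups`; neither call involves a master node.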

Clients can write data through HBase or, for bulk loads, generate HFiles on HDFS directly and hand them to HBase. Either way, the data is accessible through HBase.

HBase Responsibilities Summary

**HBase Proper**
HMaster
- Perform administrative operations on the cluster
- Apply DDL actions for creating or altering tables
- Assign and distribute regions among RegionServers (stored in META system table)
- Conduct region load balancing across RegionServers
- Handle RegionServer failures by assigning the Region to another RegionServer

RegionServers
- Function as clients that buffer I/O operations; store/access data on HDFS
- Host a MemStore per column family
- Manage the WAL (by default, one per RegionServer, shared by the regions it hosts)
- Manage one or more regions
- Typically are collocated on the same hardware as HDFS DataNodes

Regions
- Are used to balance the data processing load
- Rows of a table are first stored in a single Region
- Rows are spread across Regions, in key ranges, as data in the table grows
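
The way rows spread across Regions by key range can be illustrated with a small sketch. It is a simplification under stated assumptions: `ToyRegion` and `split_if_needed` are hypothetical names, and the split is triggered by a row count, whereas a real cluster decides based on configured size thresholds.

```python
class ToyRegion:
    """A contiguous range of row keys, starting at start_key."""

    def __init__(self, start_key, rows=None):
        self.start_key = start_key
        self.rows = dict(rows or {})


def split_if_needed(region, max_rows):
    """Return [region] unchanged, or two regions split at the middle row key."""
    if len(region.rows) <= max_rows:
        return [region]
    keys = sorted(region.rows)
    mid_key = keys[len(keys) // 2]
    lower = ToyRegion(
        region.start_key,
        {k: v for k, v in region.rows.items() if k < mid_key},
    )
    upper = ToyRegion(
        mid_key,
        {k: v for k, v in region.rows.items() if k >= mid_key},
    )
    return [lower, upper]


# A table starts as a single region covering every key.
region = ToyRegion("", {f"row{i:02d}": i for i in range(6)})
regions = split_if_needed(region, max_rows=4)
```

Here the single region holding `row00` through `row05` becomes two regions, the second starting at `row03`; each resulting region could then be assigned to a different RegionServer.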

**Leveraging Hadoop**
HDFS
- Store HLogs, the write ahead log (WAL) files
- Store HFiles persisting all columns in a column family

ZooKeeper
- Track location of META system table
- Receive heartbeat signals from HMaster and RegionServers
- Provide HMaster with RegionServer failure notifications
- Initiate HMaster fail-over protocol

HBase ships with a bundled ZooKeeper that it can manage itself; however, a production cluster should have a dedicated ZooKeeper cluster that is integrated with the HBase cluster.

Notable HBase Features

While many NoSQL databases offer eventual consistency, HBase touts strong consistency as a core design tenet. For any given subset of the data, exactly one node in the HBase cluster is responsible for atomic row operations, so HBase is able to guarantee consistency.
Traditional databases require manual sharding. HBase, like many NoSQL databases, provides automatic sharding. The tables are distributed across the cluster via regions, which are automatically split and re-distributed as the data grows. Each individual node has access to the data in HDFS to service reads and writes, and this allows HBase to achieve low latency random access to petabytes of data by distributing requests from applications across a cluster of nodes.
Many databases and data stores require complicated configuration, architectural decisions, and potentially coding or external product integrations to achieve a high degree of fault tolerance to cover node availability issues. HBase leverages the fault tolerance features of HDFS, as it splits data stored in tables across multiple hosts in the cluster, so it can withstand the failure of an individual node. It achieves this by automatically assigning a healthy node to serve the data previously provided by the failed node, then replaying the Write Ahead Log (WAL) to recover data in motion.
Because HBase was developed with Hadoop in mind, it natively supports and leverages other components of that ecosystem. Some examples:
- HBase supports and uses HDFS by default as its distributed file system.
- HBase supports massively parallelized processing via MapReduce, and it can be leveraged as both a source and a sink for MapReduce jobs.
- Although HBase does not support SQL syntax natively, this can be achieved through the use of Apache Phoenix, a complementary open source project.
- Likewise, Apache Hive allows users to query HBase tables using the Hive Query Language, which is similar to SQL.
HBase is developed in Java, and it has a Java client API for convenient access from Java-based applications; however, it also has both Thrift and REST APIs for language-agnostic interactions.
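
To give a feel for the language-agnostic route, the sketch below decodes a row returned as JSON. It assumes the layout used by the HBase REST gateway, in which row keys, column names, and cell values are base64-encoded under the `Row`, `Cell`, `key`, `column`, and `$` fields; verify the field names against the REST documentation for your HBase version. The sample payload is fabricated for illustration, values are assumed to be UTF-8 text, and `decode_rest_rows` is a hypothetical helper, not part of any HBase library.

```python
import base64
import json


def decode_rest_rows(payload):
    """Decode a JSON row response into {row_key: {column: (timestamp, value)}}."""

    def b64(text):
        return base64.b64decode(text).decode("utf-8")

    decoded = {}
    for row in json.loads(payload)["Row"]:
        cells = decoded.setdefault(b64(row["key"]), {})
        for cell in row["Cell"]:
            cells[b64(cell["column"])] = (cell["timestamp"], b64(cell["$"]))
    return decoded


def _enc(text):
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Illustrative payload shaped like the assumed response format.
sample = json.dumps({
    "Row": [{
        "key": _enc("user#42"),
        "Cell": [{
            "column": _enc("info:email"),
            "timestamp": 1700000000000,
            "$": _enc("b@example.com"),
        }],
    }]
})

rows = decode_rest_rows(sample)
```

Because the exchange is plain HTTP and JSON, the same decoding can be written in any language with base64 and JSON support, which is the point of offering these APIs alongside the Java client.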

HBase Use Cases

HBase is used both for write-heavy applications and for applications that need to provide fast, random access to vast amounts of data. Some examples include:

Storing clickstream data for downstream analysis
Storing application logs for diagnostic and trend analysis
Storing document fingerprints used to identify potential plagiarism
Storing genome sequences and the disease history of people in a particular demographic
Storing head-to-head competition histories in sports for better analytics and outcome predictions

HBase Alternatives

Broadly speaking, there are many alternatives to HBase. Any data store or database can be a contender for solving the same problems. For teams evaluating different open source databases, considering your specific data management needs can help narrow down this pool of options.

For systems that need to house and process thousands, or maybe even millions, of rows, horizontal scaling is unlikely to be a factor. In these cases, most of the data could be stored on a single server, so some form of RDBMS would be a good choice due to all the conveniences that it provides. There are countless options in this space, but the most popular open source relational databases are PostgreSQL, MySQL, and MariaDB.

For systems that need to house and process millions or billions of rows in a performant way, a NoSQL solution like HBase is going to be on the table. Again, there are many NoSQL options with varying strengths and weaknesses that depend on the specifics of the problem. Some open source options (other than HBase) include Cassandra, Redis, and MongoDB. In environments without a Hadoop implementation, these NoSQL options may be more attractive because they are designed to be used as standalone data stores.

Final Thoughts

Realistically, HBase is the most logical choice when there is an existing investment in the Hadoop ecosystem for housing and managing Big Data. This is because HBase depends heavily on other components of the Hadoop ecosystem, such as HDFS, MapReduce, and ZooKeeper. The volume of data being processed, managed, and stored, and whether or not you are already using Hadoop, will likely be the key factors that determine if it makes sense to deploy HBase in your environment.
