For organizations that want to break free of expensive data platforms, or that simply want to leverage the latest innovations in data management, open source databases are a popular option. However, the wide selection of seemingly similar databases can make finding the right one a difficult task.
In our intro to open source databases, we give an overview of available databases, cover the various open source database types and the popular databases within those types, and provide resources that will help you learn more about the databases you need.
An open source database is a database, or database management system, that is free to download, modify, and re-use.
While the exact details on how open source databases can be used vary by open source license type, the ruling principle is that the underlying code for the database is publicly accessible and modifiable.
A list of popular open source databases can be found below.
The list below is borrowed from our recent white paper, the Decision Maker’s Guide to Open Source Databases. While it’s not a comprehensive list of all available open source databases, it does feature the databases we viewed to be viable for enterprise organizations.
Relational databases are based on the relational model of data, and provide a means to store and access related data. Relational databases are typically ACID compliant, meaning they operate with atomicity, consistency, isolation, and durability.
Relational databases are ideal for applications or processes that use structured data and need a high level of data integrity and security.
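To make atomicity concrete, here is a minimal sketch using Python's built-in `sqlite3` module rather than any of the databases listed below. The `accounts` table, the `transfer` helper, and the sample balances are hypothetical and exist only for illustration. The transfer credits one account and then debits another; if the debit violates the `CHECK` constraint, the whole transaction is rolled back, so the credit disappears too.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit fails the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical schema and data, held in memory only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # rolled back as a unit
print(ok, failed, balances(conn))
```

The second transfer leaves both balances exactly as the first transfer left them, which is the "all or nothing" behavior that atomicity describes.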
Popular open source relational databases / database management systems include:
PostgreSQL is one of the most well-known open source databases, and is often compared in terms of features and functionality with larger, commercial databases such as Oracle and DB2.
Postgres achieves strong consistency and ACID compliance through its use of MVCC (Multi-Version Concurrency Control) and WAL (Write-Ahead Logging).
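The idea behind MVCC can be illustrated with a toy sketch. This is not how Postgres implements it; the `VersionedStore` class and its logical clock are assumptions made purely to show the concept: every write adds a new version instead of overwriting the old one, so a reader holding an older snapshot keeps seeing the data as it was when the snapshot was taken.

```python
class VersionedStore:
    """Toy multi-version store: each key keeps a list of (timestamp, value)."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # logical clock, incremented on every write

    def write(self, key, value):
        """Append a new version and return its commit timestamp."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return a timestamp that identifies the current state."""
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None


store = VersionedStore()
store.write("plan", "basic")
old_snapshot = store.snapshot()   # a reader starts here
store.write("plan", "premium")    # a later writer does not block the reader

seen_by_old_reader = store.read("plan", old_snapshot)
seen_by_new_reader = store.read("plan", store.snapshot())
print(seen_by_old_reader, seen_by_new_reader)
```

Because readers consult versions rather than locking rows, readers and writers in this sketch never wait on each other, which is the main benefit usually attributed to MVCC.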
Perhaps the best-known open source RDBMS, MySQL forms the M in the ubiquitous LAMP stack.
MySQL is a well-rounded database. While it is not as capable on enterprise concerns as something like Postgres or Oracle, it adapts well to most use cases requiring moderate scale.
Forked from MySQL, MariaDB has since added new features, including XtraDB, an enhanced version of the InnoDB storage engine.
Much like its progenitor, MariaDB is a popular, well-rounded database. Unlike MySQL, it’s “guaranteed to stay open source.”
CockroachDB uses the replication and scalability features of Kubernetes to bring container-native RDBMS functionality to the platform. It provides a distributed database based on Facebook’s RocksDB that uses Kubernetes-native functionality to provide support for SQL concepts such as transactions.
Graph databases are designed to maintain both data and the relationships between data points. In fact, the relationship data is as important as (if not more important than) the data points themselves.
Graph databases are useful when the connections between data points are important. Potential use cases include fraud detection, network operations, access management, and real-time recommendation engines.
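As a rough illustration of why relationships matter, the sketch below ranks recommendations by walking two hops through a small graph. It uses a plain Python dictionary instead of a real graph database, and the `FOLLOWS` data and `recommend` function are hypothetical.

```python
from collections import Counter

# Hypothetical "follows" graph as an adjacency map: node -> set of neighbors.
FOLLOWS = {
    "ana": {"ben", "cy"},
    "ben": {"dee", "eli"},
    "cy": {"dee"},
    "dee": set(),
    "eli": set(),
}


def recommend(graph, user):
    """Rank accounts followed by the user's connections but not by the user."""
    direct = graph.get(user, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            if candidate != user and candidate not in direct:
                counts[candidate] += 1
    # Most shared connections first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


suggestions = recommend(FOLLOWS, "ana")
print(suggestions)
```

The answer comes entirely from the edges, not from any attribute stored on the nodes. A graph database is built to make this kind of traversal cheap even when the graph is far larger than this example.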
Neo4j is a well-known and widely used graph database implementation for Java applications. It exists in both a community and enterprise edition and focuses on performance and ease of use. The community edition comes with an impressive set of features, but for enhanced security, availability, and scale, the enterprise edition is recommended.
JanusGraph is an open source, community fork of the Titan graph database product from DataStax Enterprise. As with most DataStax-backed products, it focuses on high distribution, throughput, and the ability to handle heavy complexity.
Wide column databases are defined by their ability to use variable column names and formats across rows. This type of database excels at quickly accessing columnar data, and can be sharded to enhance scalability.
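The data model can be sketched in a few lines of Python. The `WideColumnTable` class below is a hypothetical in-memory stand-in, not the API of Cassandra or any other product; it only shows that each row can carry its own set of columns and that a single column can be read across rows.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy wide column table: every row key maps to its own set of columns."""

    def __init__(self):
        self._rows = defaultdict(dict)  # row_key -> {column_name: value}

    def put(self, row_key, column, value):
        """Set one column on one row; other rows need not have that column."""
        self._rows[row_key][column] = value

    def get_row(self, row_key):
        """Return a copy of all columns stored for a row."""
        return dict(self._rows.get(row_key, {}))

    def scan_column(self, column):
        """Return {row_key: value} for every row that has the column."""
        return {
            key: cols[column]
            for key, cols in self._rows.items()
            if column in cols
        }


table = WideColumnTable()
table.put("sensor-1", "temp_c", 21.5)
table.put("sensor-1", "humidity", 40)
table.put("sensor-2", "temp_c", 19.0)
table.put("sensor-2", "battery", 0.87)  # a column that sensor-1 lacks

temps = table.scan_column("temp_c")
print(temps, table.get_row("sensor-2"))
```

Neither row had to declare its columns up front, which is the flexibility the definition above refers to.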
Popular open source wide column databases include Cassandra and Hadoop.
Cassandra is an open source wide-column NoSQL database originally conceived at Facebook. It focuses on being highly distributed, deploying easily across multiple clouds.
Cassandra’s wide distribution makes it an ideal candidate for pairing with streaming data solutions such as Kafka and Spark, as its write-optimized architecture will provide minimal bottlenecks when deployed for those purposes.
Hadoop was the original big data open source ecosystem and saw tremendous success early on. Hadoop paved the way for numerous well-known and accepted big data concepts including data lakes and distributed ledgers.
Though still widespread in its use and adoption, Hadoop’s batch-oriented patterns are not always suitable for predictive analytics, which focuses on streaming and analyzing large amounts of data at once, in-memory.
Key value databases are a popular type of non-relational database. They are used in instances where horizontal scaling is a necessity. They use a key-value approach that associates a value with a key, which is used in identifying the object.
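One reason key value stores scale horizontally so well is that a key alone is enough to decide where a record lives. The sketch below shows that idea with simple hash partitioning. The `ShardedKV` class is hypothetical, each shard is just an in-memory dictionary standing in for a separate server, and real systems typically use more elaborate schemes.

```python
import hashlib


class ShardedKV:
    """Toy key value store that spreads keys across a fixed number of shards."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate node in a real deployment.
        self._shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        """Map a string key to a shard index with a stable hash."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self._shards)

    def put(self, key, value):
        self._shards[self._shard_for(key)][key] = value

    def get(self, key, default=None):
        return self._shards[self._shard_for(key)].get(key, default)

    def shard_sizes(self):
        """Return how many keys each shard currently holds."""
        return [len(shard) for shard in self._shards]


store = ShardedKV(shard_count=4)
for n in range(100):
    store.put(f"session:{n}", {"user_id": n})

print(store.get("session:42"), store.shard_sizes())
```

Lookups never need to consult more than one shard, so adding shards adds capacity without adding work per request. A fixed modulo like this one does force most keys to move when the shard count changes, which is one reason production systems often prefer consistent hashing.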
Redis was one of the first key value caching solutions available as open source and has seen widespread adoption across a range of use cases. One of its most popular use cases was as an enterprise-class session cache, but it has since found applications in other data use cases such as fraud analysis and inventory systems.
ElasticSearch is a “Search Engine” style of key value database. It takes the capabilities and simplicity that come with key value stores, but extends the indexing and searching features a little further.
This makes it ideal for searching lots of freeform data, which is why ElasticSearch forms the critical E in the ELK Stack.
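A common technique behind full-text search is the inverted index, which maps each term to the documents that contain it. The sketch below is a deliberately simplified illustration of that technique, not a description of ElasticSearch internals; the `LOGS` data and both functions are hypothetical, and tokenizing by whitespace is an assumption made to keep the example short.

```python
from collections import defaultdict

# Hypothetical freeform log lines keyed by document id.
LOGS = {
    1: "disk full on node a",
    2: "node b restarted",
    3: "disk latency high on node b",
}


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    matches = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        matches &= index.get(token, set())
    return matches


index = build_index(LOGS)
disk_hits = search(index, "disk node")
print(disk_hits, search(index, "node b"))
```

Each query term becomes a direct lookup, so the search does not need to scan the text of every document.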
Etcd is the default service registry and backing store application included with Kubernetes, and was designed to be a highly scalable database to hold service endpoints inside a Kubernetes deployment. Etcd’s data model is solidly in the realm of a key value structure, but its primary access methods are meant to be universal and ubiquitous, so it allows for cloud-compatible integrations such as JSON/HTTP and gRPC.
Prometheus, the second project to be sponsored by the Cloud Native Computing Foundation (after Kubernetes), has become the de facto standard for gathering metric data from Kubernetes implementations. It’s a high-performance time series key value database with a focus on accessibility.
Document databases, or document-oriented databases, are non-relational databases that are used to store and manage semi-structured (document-oriented) data.
Though similar to key value databases, document databases use internal structure within the document to extract metadata.
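The difference from a plain key value store is easiest to see in a query. In the sketch below, documents are ordinary Python dictionaries and the store can look inside them. The `CUSTOMERS` data and the `get_path` and `find` helpers are hypothetical and do not reflect the query API of any particular document database.

```python
# Hypothetical collection; documents do not share a fixed schema.
CUSTOMERS = [
    {"_id": 1, "name": "Ana", "address": {"city": "Lisbon"}},
    {"_id": 2, "name": "Ben", "tags": ["vip"]},
    {"_id": 3, "name": "Cy", "address": {"city": "Lisbon", "zip": "1000"}},
]


def get_path(document, path):
    """Follow a dotted path such as 'address.city'; None if any part is missing."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, path, expected):
    """Return every document whose value at `path` equals `expected`."""
    return [doc for doc in collection if get_path(doc, path) == expected]


lisbon_ids = [doc["_id"] for doc in find(CUSTOMERS, "address.city", "Lisbon")]
print(lisbon_ids)
```

A key value store would treat each of these documents as an opaque value retrievable only by its key. Here the nested `address.city` field is itself queryable, even though one document has no address at all.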
Couchbase Server, originally known as the Membase project, is a NoSQL document database with a focus on performance and scale. It contains three internal database engines (a cache, a key value store, and a document database), allowing for flexibility in its use case.
MongoDB has seen meteoric popularity since its release in February of 2009. MongoDB currently sits at number 5 in popularity on the list of databases on DB-Engines.
As a traditional document store, MongoDB is capable of ingesting large, unstructured documents of data in JSON and reliably presenting and preserving those documents.
Apache Jackrabbit is an implementation of the Java Content Repository (JCR) standard. This is an object store for Java, which can effectively act as a document database, in that unstructured data in the form of Java objects can be persisted and retrieved from the store natively.
Looking for additional resources on open source databases? Be sure to review the links below.