Databases
Terms
Normalization
Relational databases are organized into tables, which have columns specifying a single type of data in order to reduce data redundancy and prevent corruption. Normalization involves following these rules:
- First Normal Form (1NF) Atomic Values: Records have a single value for each column. No arrays.
- Second Normal Form (2NF) Columns Depend On a Single Primary Key: A record in one table can relate to a record in another table by including a primary key (usually the id) of the other record. Thus, for a Comment to belong to a Post, it will have a column for post_id.
- Third Normal Form (3NF) Non-keys Describe the Key and Nothing Else: All the other fields are specific to that record. Don't include data that isn't relevant or belong in other tables.
There are other normal forms, but they are not as important.
You may frequently need or want to denormalize your database and violate the above rules. This is probably fine if you know what you're not violating the spirit of the rules (you're duplicating lots of data in a way that can lead to inconsistencies).
ACID
ACID is a set of properties of a database that will ensure the data remains intact even if there are errors or system failures. The properties are as follows:
- Atomicity: The database won't save partial data. If a transaction fails partway through, the database state will revert back to before the attempted change occurred.
- Consistency: Any transaction will transition the database from one valid state to another. Invalid data, with respect to the rules defined for the database, will not be allowed to persist.
- Isolation: Depending on method of concurrency control, determines when a transaction becomes available to subsequent transactions.
- Durability: Transactions are store permanently, and will remain there even if power is lost. Data is save to the disk.
SQL
SQLite
SQLite isn't intended to be used on a server, but instead as an embedded database for local/client in app storage. Thus, it is present in many popular browsers, operating systems, and mobile phones, making it one of the most widely deployed databases.
Features:
- Serverless - doesn't need a separate process to operate
- Zero-configuration - no setup or administration needed
- It's stored in a single cross-platform disk file.
- Very small and light weight
- Self-contained - no external dependencies
- Fully ACID compliant
SQLite is a good option for embedded applications and testing, but should not be used for applications with multiple clients or that require lots of writes.
MySQL
MySQL is a more fully featured than SQLite, yet it doesn't implement the full SQL standard, which does provide performance benefits. It's easy to use, secure, scalable, and very fast. While it is open source, it is owned and primarily developed by Oracle. This is a controversial aspect of MySQL, but it does mean that Oracle offers paid editions of the database and provides user support. Nonetheless, there is a large community built around it with third-party tools and libraries.
MySQL can be a strong choice for most general purpose web servers.
MariaDB
MariaDB is a fork of MySQL, borne out of Oracle's acquisition of the database to ensure it remains free under the GNU GPL. It aims to be backwards compatible with MySQL. It offers a number of commands and features that MySQL does not, and replaces features that negatively impact performance.
PostgreSQL
Postgres is the most feature rich and advanced of the SQL databases. It is standards compliant, and can be extended for custom procedures. On account of some its features, it can be slower under certain circumstances than the others, so it may not be the best option when only a simple set up is sufficient. Still, for more complex database, Postgres is a good choice.
NoSQL
Cassandra
Cassandra is a column based NoSQL database designed for big data and high reliability.
The major features are:
- Elastic scalability - easily scale horizontally
- Always on architecture - No single point of failure. It's aways available, which is important for business-critical applications
- Fast linear scale performance - throughput increases as number of nodes in cluster increases, maintaining fast response time.
- Flexible data storage - accommodates structured, semi-structured, and unstructured data.
- Easy data distribution - replicate data across multiple data centers
- Transaction support - supports ACID
- Fast writes - designed to run on cheap commodity hardware to store hundreds of terabytes while performing very fast writes and highly efficient reads.
The data model differs from that of relational databases by not requiring a schema. Columns can be added at any time, and rows aren't forced to have values for each column.
MongoDB
MongoDB is a document oriented database, which aims to function similarly, and with high compatibility, to JSON. It is likely the most popular alternative to SQL databases, however, it should not be used as a direct replacement. Mongo is capable for handling some complex relationships between documents, but nothing nearly as complex as with SQL DBs.
A MongoDB database will be made up of Collections (analogous to tables), and Documents (analogous to rows). The data in each document will resemble JSON objects, which allows for easy nesting of objects and arrays in documents. You should use Mongo instead of SQL if you know documents' ownership won't be shared by many users. If you have an interface where all of the data concerns only that user (like a bank account), then you can query only for that user and get everything you need with that single query.
Features:
- Schema-less
- Dynamic queries
- Clear document structure
- No complex joins
- Easily scalable - horizontal scaling
- High availability
CouchDB
CouchDB is very similar to MongoDB. It as a REST API, giving it a low learning curve. It also supports ACID, which makes the data consistent and reliable. It may not be the best choice if your data needs to change frequently.
Redis
Redis is an in-memory, key-value store database. Although persistence is possible with Redis, it's not really intended to be a long term data store. Without having to write to the disk for every operation, Redis can be very fast.
Other features:
- Supports rich data types - lists, sets, sorted sets, hashes
- Atomic Operations - Redis will receive updated value when two clients access data concurrently
- Multi-utility tool - used for caching, messaging-queues, sessions, and any other short lived data
Redis can be an powerful (and often required) tool for real-time applications.
ElasticSearch
Elasticsearch is more of a search engine than a database. You're likely going to use it alongside your main data store. It provides nearly real-time full-text search, and can easily scale horizontally and be distributed.
Neo4J
Neo4J is a graph database, which is made to resemble a graph data structure. Thus, everything is stored as an edge, node, or attribute, and its relationships and connections are first-class entities.
Features:
- Flexible data model
- Real-time insights - can change analysis as data arrives
- High availability
- Connected and semi-structured data
- Easy retrieval - connect data faster than other DBs
- Cypher query language - a declarative, human readable, easy to learn query language to represent the graph visually.
- No joins