Data Modeling (Part 2):

I recently worked through Udacity’s Data Engineering nanodegree program, which consisted of four lessons: Data Modeling (PostgreSQL and Cassandra), Data Warehousing (Redshift), Data Lakes (Spark), and Pipeline Orchestration (Airflow).

In this post, I’ll share some of my notes from the first lesson: Data Modeling. The lesson focused on two database management systems in particular, PostgreSQL and Cassandra, so some of the discussion will be platform-specific.

This lesson in the Udacity curriculum covers much of the same material you would cover in a college-level database course. Understanding databases is a fundamental skill for software engineers, data scientists, and data engineers, so I highly recommend a database course to anyone who hasn’t taken one.

In the fewest possible words: data modeling is the abstract process of designing a database. There are trade-offs between making your database more or less normalized. Additionally, there are trade-offs between the different types of database architectures, such as Relational and NoSQL databases. I’ll review some of these trade-offs below and how to strike the right balance in certain situations.

# Databases

This won’t be an introduction to SQL. Any understanding of databases should start with learning SQL first. I’ll start by describing the two broad categories of databases: Relational and NoSQL (named for “Not Only SQL,” because their query languages are typically derivatives of SQL). Among the many differences between these architectures is the fact that Relational databases offer ACID transactions, which guarantee validity even in the event of errors or hardware failures. ACID transactions are an incredibly valuable property for a database to have: essentially, a transaction on the database is “all or nothing,” so a hardware or software failure during a transaction will not corrupt the data in the database. Like nearly everything in computer science, ACID is an acronym, and it stands for the four properties that transactions on Relational databases have (a minimal PostgreSQL example follows the list):

  • Atomicity: transactions are all or nothing. They either happen or not. There is no middle ground.
  • Consistency: Only transactions that follow the proper rules are accepted, otherwise the database is kept in its original state.
  • Isolation: Transactions are independent of each other. Simultaneous transactions execute the same way they would if the queries were run sequentially, i.e. rows and tables lock when a transaction begins and aren’t released until the transaction is either committed or rejected.
  • Durability: Even in the case of hardware failure, committed transactions will hold. All committed transactions on the database are final and are written to nonvolatile storage.
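
For a concrete picture, here is a minimal PostgreSQL sketch of atomicity in action. The accounts table and values are hypothetical, not from the course:

```sql
-- Hypothetical transfer between two bank accounts.
-- Either both UPDATEs take effect, or neither does.
BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

-- If anything above fails (a constraint violation, a crash, etc.),
-- PostgreSQL rolls the whole transaction back and the data is untouched.
COMMIT;

-- To abandon the changes deliberately, run ROLLBACK; instead of COMMIT;
```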

Additional notes:

  • One can imagine how a property like this makes Relational databases valuable for banking applications, where it prevents the “double-spend” problem.
  • Per the above, Relational databases can process multiple transactions in parallel. The data is stored on disk and the queries are executed by the CPU, with the results output to memory. This means that machines with multiple CPUs (or cores) can run multiple queries at a time, e.g. a relational database running on a cloud virtual machine with 4 virtual CPUs can process 4 queries in parallel. Distributed systems such as MPP warehouses (like Redshift or BigQuery) and NoSQL databases work differently: they spread data and query execution across many nodes.

Conversely, NoSQL databases are constrained by the CAP Theorem, which states that it is impossible for a distributed data store to simultaneously guarantee more than two of the following three properties:

  • Consistency – Every read from the database gets the latest (and correct) piece of data or an error.
  • Availability – Every request receives a response (not a guarantee that the data is the most up-to-date, just that a valid response will be given).
  • Partition Tolerance – The system continues to work even if network connectivity between nodes is lost.

The CAP theorem prevents distributed databases like NoSQL stores from offering full ACID transactions. This is a big loss, but the benefits of a distributed system can be worth the trade-off. The various NoSQL architectures, like Cassandra, MongoDB, and DynamoDB, each strike a different balance in the trade-off between C, A, and P. Cassandra chooses to prioritize A and P at the cost of C.

Another way of putting the CAP theorem is that if the database has no failures, it can provide both C and A. But when a network partition occurs (i.e. when its architecture must guarantee P), the database has to choose between A and C. Cassandra chooses Availability in the case of network failures.
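
In practice, Cassandra makes this trade-off tunable per session or per query via consistency levels. A minimal cqlsh sketch (the keyspace and table here are hypothetical):

```sql
-- cqlsh: CONSISTENCY controls how many replicas must acknowledge a request.
-- ONE favors availability; QUORUM gives stronger consistency at some cost.
CONSISTENCY ONE;
SELECT * FROM store.orders WHERE order_id = 1234;

CONSISTENCY QUORUM;
SELECT * FROM store.orders WHERE order_id = 1234;
```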

The main benefit of Cassandra is that it’s a distributed database optimized for extremely fast writes (INSERTs), which makes it an excellent OLTP database.

# Pros and cons of each as an OLTP database:

|  | Relational | NoSQL |
| --- | --- | --- |
| Has ACID transactions | YES | NO |
| Fast reads and writes | ACID transactions slow things down | YES |
| Scales vertically (adding more memory or compute to the database server) | YES | YES |
| Scales horizontally (adding more nodes to the database) | NO | YES |
| Flexible schemas | NO | Sometimes |
| Always provides consistency | YES | Sometimes (not Cassandra; Cassandra is designed to be “eventually” consistent) |
| Cheap storage of data | Sometimes | YES |
| Performs JOINs, GROUP BY, and subqueries | YES | NO (not in Cassandra, at least) |

# Normalization and Denormalization:

Another concept that is crucial to data modeling is the distinction between normalized and denormalized data models. Normalization is the standard process in data modeling in which the schema and dependencies of the entities in the database are determined. The schema of the database is gradually refined by satisfying a sequence of normal forms (a small example schema follows the list below).

  1. First Normal Form (1NF):
    • Atomic values: each cell contains a single, indivisible value (no lists or sets of values)
    • Be able to add data without altering tables
    • Separate different relations into different tables
    • Keep relationships between tables together with foreign keys
  2. Second Normal Form (2NF):
    • Have reached 1NF
    • Every non-key column must depend on the entire Primary Key (no partial dependencies on just part of a composite key)
  3. Third Normal Form (3NF):
    • Must be in 2nd Normal Form
    • No transitive dependencies (non-key columns must depend only on the primary key, not on other non-key columns)
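
As a rough sketch of what a schema in 3NF looks like (the tables and columns here are hypothetical), each fact lives in exactly one place and the tables are tied together with foreign keys:

```sql
-- PostgreSQL: a small order-tracking schema in Third Normal Form.
-- Every non-key column depends only on its table's primary key.
CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT UNIQUE
);

CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    name       TEXT NOT NULL,
    price      NUMERIC(10, 2) NOT NULL
);

CREATE TABLE orders (
    order_id    SERIAL PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers (customer_id),
    ordered_at  TIMESTAMP NOT NULL DEFAULT now()
);

CREATE TABLE order_items (
    order_id   INT NOT NULL REFERENCES orders (order_id),
    product_id INT NOT NULL REFERENCES products (product_id),
    quantity   INT NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
```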

There are higher-order normal forms, like Boyce-Codd normal form, but too much normalization is often undesirable. A highly normalized database makes inserts easy but analysis hard, because the more normalized a database is, the more JOINs are needed to perform common analytical queries.

Denormalization is an optimization step that comes after a data model has been normalized. Denormalized databases might repeat data within a single table to avoid having to JOIN two tables together for analytics. In general, write-heavy databases are highly normalized while read-heavy databases can be somewhat denormalized. Remember the structure of the data pipeline, with the OLTP database at the beginning and the OLAP database at the end. Typically, ETL is a process of aggregating and denormalizing the data in the transactional databases and presenting the result in easily queryable OLAP databases.
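
Continuing the hypothetical schema above, an ETL job might build a denormalized, read-optimized table so analysts can skip the three-way JOIN entirely:

```sql
-- Customer and product attributes are copied onto every row on purpose.
CREATE TABLE order_items_wide AS
SELECT
    o.order_id,
    o.ordered_at,
    c.name                AS customer_name,
    p.name                AS product_name,
    p.price,
    oi.quantity,
    p.price * oi.quantity AS line_total
FROM order_items oi
JOIN orders    o ON o.order_id    = oi.order_id
JOIN customers c ON c.customer_id = o.customer_id
JOIN products  p ON p.product_id  = oi.product_id;

-- A common analytical query is now a single-table scan, no JOINs required:
SELECT customer_name, SUM(line_total) AS revenue
FROM order_items_wide
GROUP BY customer_name;
```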

Denormalization is essential in Cassandra, whether it serves as an OLTP or an OLAP database. NoSQL turns traditional data modeling on its head. In traditional data modeling, the database schema is determined by abstracting the entities of interest into atomic components and then forming queries against the final schema after normalization. When designing NoSQL databases, the design of the queries comes FIRST. There are no JOINs in Apache Cassandra (i.e. you can only query one table at a time). The process of data modeling a NoSQL database starts with thinking about what queries will be run against the database and then optimizing the schema for those queries. You can’t simply port a relational schema into Cassandra; you’ll have to do some new data modeling for it.
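
For example, if the query you need is “all orders for a given customer, newest first,” a Cassandra table would be shaped around exactly that query. This is a hypothetical sketch, not the course’s schema, and it assumes a keyspace has already been created and selected:

```sql
-- CQL: the table is modeled for one query, not for the entities.
-- customer_id is the partition key; ordered_at and order_id are clustering
-- columns, so rows within a partition come back already sorted.
CREATE TABLE orders_by_customer (
    customer_id  INT,
    ordered_at   TIMESTAMP,
    order_id     INT,
    product_name TEXT,
    quantity     INT,
    PRIMARY KEY ((customer_id), ordered_at, order_id)
) WITH CLUSTERING ORDER BY (ordered_at DESC, order_id ASC);

-- The one query this table is designed to answer:
SELECT * FROM orders_by_customer WHERE customer_id = 42;
```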

The Udacity course itself went into more detail regarding the installation, code, and design of these databases. For brevity, I’ll just summarize a few key concepts in Cassandra and move on:

  • Cassandra’s clustering columns sort the data within each partition (ascending by default; you can specify DESC in the CLUSTERING ORDER BY clause). More than one clustering column can be added.
  • The Partition Key determines how the data is distributed across the machines in the cluster. Choose a partition column whose values are roughly uniformly distributed so that the data isn’t skewed onto (and overburdening) a few nodes.
  • The Partition Key columns support only two operators: = and IN. Cassandra does not support <, >, ≤, ≥ directly on the partition key.
  • Anything in the WHERE clause of the query must appear in the primary key (or in a secondary index) and must be referenced in the order it is defined in the table. The partition key values get hashed to locate the data in the distributed cluster (see the query sketch after this list).

Finally, modern data lake technologies, like AWS Glue or Spark + HDFS, allow for schema-on-read processing, which essentially provides the functionality of an RDBMS on top of a distributed file system (including compressed files). That is an incredible achievement in the development of data engineering technologies, but it comes nowhere near replacing the need for databases and data modeling. Data modeling, including normalization, denormalization, creating dependencies, and enforcing data type restrictions, is a ubiquitous element of building a data pipeline. Databases are everywhere, and understanding the basics is essential for aspiring Data Scientists and Data Engineers.

Next up, Data Warehousing.
