Book Review: 97 Things Every Data Engineer Should Know

I recently read 97 Things Every Data Engineer Should Know: Collective Wisdom from the Experts, written by a collection of data engineers and CEOs, young and old. The advice and overall quality were hit or miss, but the book did provide some useful perspective on where the Data Engineering industry came from and where it’s heading. For someone like me who first heard about “Data Engineering” within the last 12 months, this was a refreshing perspective.

Here are the 8 favorite points I collected from the book:

1. Where did Data Engineering come from?  

Businesses have always had large volumes of records. Large businesses with large paper records existed in the 1950s, before the widespread use of computers. The relational model and SQL were developed in the 1970s precisely to help businesses manage these internal records and transaction logs. But none of this required workers who could “engineer” around the data.

But as the world digitized through the ’90s and beyond, the data just kept growing, culminating with the internet companies of the late 2000s (Facebook, Twitter, et al.) creating entirely new types of data demands. SQL is still around (to everyone’s astonishment), but Data Engineering and the cornucopia of tools it uses grew out of this expanding data landscape. Organizations now store and process petabytes instead of gigabytes of data. Customers’ access to web and mobile applications required much lower-latency database transactions than the internal database tools used in past decades. And the data that companies need to process is no longer just tabular, structured data; it also includes unstructured data like large natural language files, images, and audio files. Everything got bigger, better, and more complex. I hear all the time that in the current Information Age, organizations are drowning in data but starved for knowledge.

As the data landscape grew, new niches opened up for specialized workers. Data Analysts could focus on business insights and key performance indicators. Data Scientists could focus on applying machine learning models to make better business decisions. Data Engineers could focus on ingesting, preprocessing, and modeling the data to feed the data into these two teams’ workflows. And IT infrastructure teams and cloud providers could focus on the hardware provisioning to keep the whole organization running smoothly.

It’s not just the job boards that are getting more complex. The number of tools available and the complexity of each of these job titles continue to increase. Data Engineers have to be familiar with a laundry list of tools such as Hadoop HDFS, Spark, Flink, Kafka, Airflow, PostgreSQL, Cassandra, MongoDB, Kubernetes, Docker, Alluxio, Istio, Terraform, Redis, and much, much more… The perspective the authors give is that the number of tools and frameworks will continue to increase, but the counterforce to this expansion is the tendency to automate and abstract rudimentary tasks (as DevOps has already done for Software Engineering). Many of the tasks that Data Scientists and Data Engineers perform now will be automated away by DevOps-like tools (sometimes referred to as DataOps), but the core concepts of how to handle, process, ingest, and store data will remain relevant. This brings me to the next piece of advice.

2. You don’t want your career’s legacy to be a tool stack.

Early in your career, you need to learn hard skills (i.e. specific tools like Spark rather than abstract concepts) to be hirable in an entry-level position, but in the long run, tools come and go. Once you get beyond a hirable level of understanding of a particular tool, it’s more important to learn the fundamental theories – the first principles – of the systems than the fine details of the particular tool you’re working with at your job. One of the authors wrote, “By definition, fundamentals are the primary principles that the rest of the field can be derived from. Thus, if you master the fundamentals, you can master any new emerging field.”

This piece of advice came up again and again in the book: commit to life-long learning and embrace the widening complexity of the field, but don’t corner yourself into a position by solving every problem with just the familiar tools. Learn “problem-solution pairs” and use new tools, even if it takes longer to finish the project. And always ask why a database design or a tool operates in a particular way. Learning unfamiliar tools and googling “why does […] do […]?” is a great way of learning the fundamentals of the problem-solution pair. Learning new tools should be in service of learning new concepts, not to put another notch on your resume.

3. When looking for a job, there will be two types of Data Engineers.

I’ve been thinking of Data Engineers as software developers that work with data in the cloud, but this book offered two job descriptions to look out for:

  1. Data Processing (SQL focused): these engineers are more analysts than coders. They know data processing, databases, and SQL, and are comfortable working in the cloud. They sometimes use click-and-drag data processing, ETL pipelines, and BI/dashboard technologies. “Sometimes their titles are database administrator (DBA), SQL developer, or ETL developer. These engineers have little to no programming experience.”
  2. Traditional Software Engineer (Big Data focused): these engineers have some analytical skills, but their main focus is on Big Data projects in the cloud. They use tools like Cassandra, Spark, Flink, Scala, and Python. “The second type of data engineer is a software engineer who specializes in big data.”

In some of my interviews, I’ve noticed this unwritten distinction in Data Science job postings as well. Some companies will put up a job posting for a “Data Scientist” when they really just need some SQL work and dashboards set up – not machine learning.

4. For the “Software Engineers who specialize in Big Data”, follow software engineering best practices.

This includes working with CI/CD delivery pipelines, containerization (e.g. Docker), unit tests (e.g. pytest), Git revision control, documentation (e.g. wikis and docstrings), modular coding practices, job runners (e.g. Make), writing reproducible code with random seeds, and publishing the versions of your dependencies (e.g. requirements.txt files for pip). If you aren’t familiar with these things, learn them. “Though it may not seem as glamorous at first, it will pay dividends over the life of both your project and your career.” In particular, I’ve read elsewhere that a pipeline without these best practices is like “debt” incurred on your infrastructure. If the goal is to make reliable, scalable, and maintainable data infrastructure, data engineers must at least follow the established software engineering best practices.
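As a small illustration of two of these practices, here is a minimal sketch of fixing random seeds for reproducibility and pinning dependency versions. The seed value and package names are just examples I chose, not recommendations from the book.

```python
# Minimal reproducibility sketch (illustrative; not from the book).
import random

import numpy as np

SEED = 42  # fix the seed so re-running the pipeline gives identical results

random.seed(SEED)
np.random.seed(SEED)

# Publish exact dependency versions alongside the code, e.g.:
#   pip freeze > requirements.txt
# so anyone can rebuild the same environment with `pip install -r requirements.txt`.
```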

Many authors repeated the importance of following unit test best practices:

  1. Ensure that your unit tests can run locally (this makes debugging failures easier).
  2. Use in-process versions of the systems in your pipeline, like Apache Spark’s local mode, to provide a self-contained local environment.
  3. Make unit tests deterministic by sorting the output, setting seeds, and running tests on attributes that aren’t subject to change.
  4. “Parameterize your unit tests by input file so you can run the same test on multiple inputs. Consider adding a switch that allows the test to record the output for a new edge case input, so you can eyeball it for correctness and add it as expected output.”
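To make points 2–4 concrete, here is a minimal pytest sketch of my own; it assumes pyspark and pytest are installed, and the transform under test (add_totals) is a hypothetical example function, not code from the book.

```python
# A small pytest sketch: Spark local mode, deterministic output, parameterized inputs.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Spark's local mode gives a self-contained, in-process "cluster" for tests.
    return (SparkSession.builder
            .master("local[1]")
            .appName("unit-tests")
            .getOrCreate())


def add_totals(df):
    # Hypothetical transform under test: compute quantity * price per order.
    return df.selectExpr("order_id", "quantity * price AS total")


@pytest.mark.parametrize("rows,expected", [
    ([(1, 2, 3.0)], [(1, 6.0)]),
    ([(2, 0, 9.9)], [(2, 0.0)]),
])
def test_add_totals(spark, rows, expected):
    df = spark.createDataFrame(rows, ["order_id", "quantity", "price"])
    # Sort the output so the test stays deterministic regardless of partitioning.
    result = sorted(add_totals(df).collect())
    assert [tuple(r) for r in result] == expected
```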

5. In addition to Unit Tests on the code, Data Engineers should run Data Quality Tests on the data.

Data Quality Tests are to data in a data pipeline what Unit Tests are to code in a CI/CD pipeline. They are pre-defined checks that are run automatically to ensure the fundamental validity of the data. Many authors recommend the popular open-source tool Great Expectations. Some larger organizations also build their own internal tools and dashboards that they run on a schedule to ensure the ingestion, processing, and storage of the data in the pipeline are working as expected. This typically includes null-value checks, average (or expected-range) value checks, privacy checks (e.g. encrypt or remove columns that contain personal information), and checks on the number of “expected” outliers and anomalies. Data latency can also be part of the quality tests, by sampling the latency of API calls, database queries, and query processing times and reporting summary statistics and anomalous behaviors.
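To give a feel for what a few of these checks might look like, here is a minimal sketch using plain pandas rather than Great Expectations; the column names, thresholds, and input file are illustrative assumptions of mine, not examples from the book.

```python
# A minimal, schedulable set of data quality checks (illustrative assumptions throughout).
import pandas as pd


def run_quality_checks(df: pd.DataFrame) -> list[str]:
    failures = []
    # Null check: key identifiers must always be present.
    if df["customer_id"].isna().any():
        failures.append("customer_id contains nulls")
    # Expected-range check: order totals should fall within a plausible band.
    if not df["order_total"].between(0, 10_000).all():
        failures.append("order_total outside expected range [0, 10000]")
    # Privacy check: raw email addresses should not reach this stage of the pipeline.
    if "email" in df.columns:
        failures.append("email column should be removed or encrypted upstream")
    return failures


if __name__ == "__main__":
    df = pd.read_csv("orders.csv")  # hypothetical input
    problems = run_quality_checks(df)
    if problems:
        raise SystemExit("Data quality tests failed: " + "; ".join(problems))
```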

One article that I especially liked was called “Your data test failed, now what?”. Here are the author’s 6 responses, in chronological order, that an organization should establish in the event of Data Quality Test failures. This is a good illustration of where Data Quality Tests fit into the workflow of an organization (a small code sketch follows the list):

  1. System response – this could be either doing nothing, isolating the “bad” data and continuing the pipeline, or stopping the pipeline altogether.
  2. Logging and Alerting – Your pipeline should keep data logs and send alerts (email, Slack, etc.). When designing the alert, know the timeliness of the alert (depending on the urgency: instantaneous, at the end of a pipeline run, or at a fixed interval (e.g. daily reporting)).
  3. Alert Response – your organization should have a pre-determined response when certain errors happen. Who is on-call? Who is the stakeholder? This is all determined and documented when designing the data quality tests.
  4. Stakeholder communication – Identify the data consumers and let them know that “we just got an alert that X is wrong with the data”.
  5. Root Cause Identification – identify the issue, then identify the cause of the issue, then fix the actual cause. Don’t just patch over it by restarting the pipeline. Get to the root cause.
  6. Issue Resolution – either the test was wrong (and needs to be fixed), or the data is wrong (and needs to be fixed), or the data is wrong and can’t be fixed (in which case you might adjust your downstream use of the data).
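As promised above, here is a minimal sketch of the first two responses: isolating the “bad” data so the rest of the pipeline can continue, plus logging and alerting. The quarantine path and Slack webhook URL are hypothetical placeholders of mine, not details from the book.

```python
# Sketch of an "isolate bad data, log, and alert" response (illustrative only).
import json
import logging
import urllib.request

import pandas as pd

logger = logging.getLogger("pipeline")


def quarantine_and_alert(df: pd.DataFrame, bad_mask: pd.Series,
                         webhook_url: str) -> pd.DataFrame:
    bad_rows = df[bad_mask]
    if not bad_rows.empty:
        # System response: set aside the "bad" data so the rest of the pipeline continues.
        bad_rows.to_csv("quarantine/bad_rows.csv", index=False)
        # Logging: keep a record of what was removed and why.
        logger.warning("Quarantined %d rows failing quality checks", len(bad_rows))
        # Alerting: notify the on-call engineer, e.g. via a Slack incoming webhook.
        payload = json.dumps(
            {"text": f"Data quality alert: {len(bad_rows)} rows quarantined"})
        req = urllib.request.Request(webhook_url, data=payload.encode(),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)
    return df[~bad_mask]
```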

6. 87% of Machine Learning projects fail! Here’s why:

This stat was cited from 98 Things that can go wrong in an ML project. The exact number isn’t important, but what is relevant is knowing the common causes of why these projects fail. Most of these are problems with the data, not the ML model (some MLOps professionals have started talking about Data-Centric AI for this reason). Here are the main reasons the author gave for project failure:  

  1. Inadequate documentation of the data. “I thought this attribute meant something else”.
  2. Vague definitions of success from the start of the project. There are likely too many different business metrics, and if you don’t optimize the model for any one of them, the algorithm won’t show any impact.
  3. Schema changes at the source database are not typically coordinated with downstream ETL processing teams. (This is important for Data Engineers! Always communicate workflow changes with data stakeholders.)
  4. Changes to the ETL pipeline and to the definitions of the metrics in the Data Warehouse often have poor versioning and documentation, making historical data inconsistent. Changes need to be documented, and documents need to be read.
  5. Systematic data issues cause bias in the overall dataset.
  6. Pipeline development can be like the emperor’s new clothes: there is a long lead time where money is spent, hours are billed, and no visible progress is being made. Management can get impatient if projects take longer than expected and the engineering team isn’t able to provide tangible evidence of progress. (I can say from experience that this is especially true for developing preprocessing pipelines for NLP projects. It’s incredibly important to show examples of the data throughout the project timeline, to demonstrate how the pipeline is being scaled and how the transformations are changing.)

7. Companies need Data Platforms, not just Data Pipelines. Ambitious infrastructure requires more ambitious “data generalists”, not just teams of specialists.

All of the tools, frameworks, and responsibilities in the world of data engineering can be extremely intimidating to an early-career engineer (trust me), but one author urges engineers to push past this and embrace the challenge. As I alluded to in the first section, Where did Data Engineering come from?, the needs of modern data infrastructure are constantly and rapidly changing. Embrace this change. Today, companies working on successful applications of deep learning require much more than the traditional data pipeline that processes structured records of business transactions. They need fully developed Data Platforms: central repositories for all the handling, collection, cleaning, transformation, and application of large, unstructured datasets. This type of infrastructure also needs to be dynamic and change with the company’s business goals while keeping all the validity and availability requirements of modern data infrastructure. Ambitions for large infrastructure require ambitious engineers who are willing to learn the skills to build it, but with the growing complexity of this infrastructure comes the calcification of over-specialized engineering teams. “Status updates like ‘waiting on ETL changes’ or ‘waiting on ML Eng for implementation’ are common symptoms that you have overspecialized [your organization]. … The challenge with [specialized knowledge] is that data science products and services can rarely be designed up front. They need to be learned and developed via iterations.” The author advocates for a “full-stack data science generalist” with all the skills of a trained software engineer, the best practices of a DevOps engineer, and the machine learning knowledge of a professional data scientist.

8. Contribute to Open Source!

This last one is self-explanatory. You learn a lot. It looks good on your resume. It makes the world a better place. And it’s fun!
