Databricks Resources for Practitioners, Managers, and Decision-makers


Why Databricks?

There’s a lot to like about Databricks, a platform that has taken the data world by storm over the last several years. One of the many things I like about Databricks is all the valuable product and training information they offer, information that often goes beyond the Databricks products themselves and extends to Spark and the underlying technologies. Databricks was founded by some of the creators of the Apache Spark project, one of the foundational tools in the data engineering / science / analytics space, so their training material focuses heavily on Spark, since the two products are so closely tied together.

Free Databricks Knowledge Resources

I discovered that Databricks publishes a number of short training ebooks that can be downloaded free on their site. These are typically condensed versions of longer works, often featuring preview material from books that haven’t been published yet. Going through this material is a great way to build a number of skills and get access to leading-edge information that isn’t otherwise easily available.

Databricks is a standout company that offers truly excellent products and services, and any company looking to establish or develop its data operations and culture should strongly consider Databricks. Their offerings are comprehensive, from the conceptual (the lakehouse data architecture) to the technically foundational (the Delta Lake open source storage and processing framework) to a full suite of software tools (the Databricks cloud-based SaaS product, which integrates data analysis, business intelligence, and AI/ML).

List of Books and Papers

I downloaded the handful of ebooks that I could find on the site and am re-posting them here with brief descriptions:

Data Engineer’s Guide to Apache Spark and Delta Lake - Contains a lot of basic but essential information on setting up and using Spark, and finishes with a chapter on using Spark with Delta Lake. From the Databricks docs: “Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform. Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration with Structured Streaming, allowing you to easily use a single copy of data for both batch and streaming operations and providing incremental processing at scale.”
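The file-based transaction log mentioned in that quote is the heart of Delta Lake. As a rough illustration of the idea, here is a toy sketch in plain Python - not Delta’s actual on-disk format (real Delta Lake writes JSON action files under a `_delta_log/` directory alongside the Parquet data), just a minimal demonstration of atomic, ordered commits whose replay defines the table state:

```python
import json
import os
import tempfile

class ToyCommitLog:
    """Toy file-based commit log, loosely inspired by Delta Lake's _delta_log.

    Each commit is a numbered JSON file; the table state is whatever you get
    by replaying all commits in order. NOT Delta's real on-disk format.
    """

    def __init__(self, log_dir):
        self.log_dir = log_dir
        os.makedirs(log_dir, exist_ok=True)

    def _commit_files(self):
        # Commit files sort lexicographically, which matches version order.
        return sorted(n for n in os.listdir(self.log_dir) if n.endswith(".json"))

    def commit(self, actions):
        """Write actions to a temp file, then atomically rename it into place."""
        version = len(self._commit_files())
        fd, tmp = tempfile.mkstemp(dir=self.log_dir, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(actions, f)
        # rename() is atomic on POSIX: readers see the whole commit or nothing.
        os.rename(tmp, os.path.join(self.log_dir, f"{version:020d}.json"))
        return version

    def live_files(self):
        """Replay every commit in order to compute the current set of data files."""
        files = set()
        for name in self._commit_files():
            with open(os.path.join(self.log_dir, name)) as f:
                for action in json.load(f):
                    if action["op"] == "add":
                        files.add(action["path"])
                    else:  # "remove"
                        files.discard(action["path"])
        return files

log = ToyCommitLog(tempfile.mkdtemp())
log.commit([{"op": "add", "path": "part-0.parquet"}])
log.commit([{"op": "add", "path": "part-1.parquet"},
            {"op": "remove", "path": "part-0.parquet"}])
print(log.live_files())  # {'part-1.parquet'}
```

The write-then-rename pattern is what makes each commit all-or-nothing: a reader replaying the log never observes a half-written commit, which is the essence of how a log over immutable files can provide ACID-style guarantees on top of plain object storage.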

Delta Lake: Up and Running: Modern Data Lakehouse Architectures with Delta Lake - Contains great information about data warehouse architectures in general, getting started with Delta Lake, and a lot of detail about the file format and how to work with it. Some of the other books/papers here are more conceptual, but this one is very much hands-on with the technical aspects. You’ll learn about using Docker, PySpark, the Parquet format, the Delta format, the Delta transaction log and how it implements ACID guarantees, and metadata scaling. Pretty good for just 50-odd pages.

The Big Book of Data Engineering: A collection of technical blogs, including code samples and notebooks - Contains an introduction to data engineering on Databricks, several interesting and diverse use cases for the Databricks Lakehouse platform, and a section on customer stories.

Delta Lake, The Definitive Guide: Modern Data Lakehouse Architectures with Delta Lake - Contains information on basic Delta Lake operations, ‘Time Travel’, and continuous / streaming operations.
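Delta Lake’s ‘Time Travel’ falls naturally out of its transaction-log design: every table version corresponds to a prefix of the ordered commit log, so reading an older version just means replaying fewer commits. (In real Delta Lake this is exposed through reader options such as `versionAsOf`; the sketch below is a hypothetical plain-Python illustration of the replay idea, not the actual API.)

```python
def state_as_of(commits, version):
    """Reconstruct the set of live data files as of a given version by
    replaying only commits 0..version. `commits` is an ordered list of
    action lists, standing in for Delta's numbered _delta_log entries."""
    files = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if action["op"] == "add":
                files.add(action["path"])
            else:  # "remove"
                files.discard(action["path"])
    return files

commits = [
    [{"op": "add", "path": "part-0.parquet"}],     # version 0
    [{"op": "add", "path": "part-1.parquet"}],     # version 1
    [{"op": "remove", "path": "part-0.parquet"}],  # version 2
]
print(state_as_of(commits, 0))  # {'part-0.parquet'}
print(state_as_of(commits, 2))  # {'part-1.parquet'}
```

Because data files are immutable and only the log determines which ones are “live,” old versions remain readable for free until their underlying files are vacuumed away.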

The Data Lakehouse Platform For Dummies - Covers the limitations of current data warehouse architectures, introduces the lakehouse approach, describes the value and benefits of the newer approach, and discusses building a modern lakehouse architecture with Databricks. I love this little book because it’s compact (you can read it in half an hour) but highly informative. Databricks has developed ideas and tools in direct response to the most common challenges that have arisen in recent years as companies attempt to get the most out of their data. This short volume condenses a lot of that experience with the shortcomings of previous data management approaches, and will almost certainly leave you chuckling along the way and saying, “Yeah, we’ve been there.” If you’ve ever felt that what you’re doing with your data isn’t working as well as it could, and that there surely must be a better, simpler way, this book will speak to you.

Why This Is All Worth Your Time

There are a few hundred pages of great information in these ebooks, from the conceptual (discussion of various warehouse architectures and all the lakehouse has to offer) to the technical and very hands-on. Here at Northwest Data Consulting we strive to stay as up to date as possible on the concepts and tools of the trade, and the lakehouse architecture (along with Databricks’ Delta Lake-based implementation of it) has become a force to be reckoned with in the cloud data space. Gartner has named Databricks a Leader two years running in Cloud Database Management Systems, so they are definitely a company to watch, and they’re making a lot of great information available for free to those of us trying to stay on top of the often-difficult task of getting the most out of our data.

Databricks Might Be The Answer You’ve Been Looking For

I respect Databricks (the company and the product) because they’re truly breaking new ground and doing things that other companies simply aren’t capable of. There are a LOT of data tools and products out there - databases and storage systems, cloud platforms, ELT tools, reporting and analytics tools, and on and on. Databricks is the rare company that sees the big picture and is not simply building its own version of an existing product, or even building a better mousetrap, but (to continue the analogy) actually redefining the mousetrap, creating new mousetrap-building tools, and then producing a new type of mousetrap no one had imagined before. I’m pretty jaded after spending twenty years in tech, and it’s rare to find companies and products that are truly transformational in the way they claim. But Databricks is such a company.

If you’re a newer company looking to build your data stack and culture, it would be hard to do better than Databricks. They provide a comprehensive set of products, so they’re not just another small-piece-of-the-puzzle vendor. Between the Delta Lake framework, the lakehouse architecture, and the Databricks cloud software suite, they offer tools you could otherwise only put in place by assembling software from multiple vendors (like the common stack of an analytics database such as Redshift, Tableau for reporting, and separate cloud services for ML applications). And as we all know, the more tools and vendors you’re using, the more difficult your development, operations, maintenance, and support picture quickly becomes.

Databricks has also picked its battles well - they have not, for example, developed their own data warehouse tool that competes directly with something like Redshift. What they have done is even more powerful: they have developed an open-source framework (Delta Lake) that’s truly innovative - a word I rarely use, because it’s heard so often in tech these days that its meaning has been diluted beyond all reasonable bounds. Delta Lake builds on existing, widely available technology (cloud storage) and adds tremendous new capability, creating something that can actually replace a product like Redshift with something more powerful and flexible that dramatically simplifies data operations and is significantly cheaper to boot.

Even If You Have An Existing Stack, Databricks Has Much To Offer

And if your company already has well-established data operations, Databricks has the tools to help solve the problems that have bedeviled you. Feel like you’re managing way too many data sources, and unsure how to make them all accessible in the ways you need? Have multiple teams that need access to the same data, but have trouble collaborating because data is walled off? Dealing with multiple copies of data in different places, and wish you could have a single source without extra tools to synchronize it (and without always worrying whether you’re working with the right version)? Often feel that you’re working with data from days or weeks or months ago, or (even worse) unsure what data you’re actually working with and how to keep track? Need to automate your reporting so that analysts and decision makers can get what they need without having to order reports? Have a good product or good reporting, but trouble scaling as the company grows? Scratching your head at how to efficiently run up-to-date reporting and ML pipelines from the same data without breaking the bank on cloud resources? Does your soul cry out for some simplification - to pare down and not have so insanely many pieces of software and tables and versions and copies and interfaces and everything else? Databricks can help with these problems and many more.

If I sound impressed with Databricks, it’s because I am. As a data consultant, it’s my job, first and foremost, to have the best possible understanding of the big picture of what’s most effective in the data space. I put my two decades of experience to work, and spend a lot of time trying out different tools and techniques, so that I can help business owners, managers, and other decision makers make big wins with their companies’ data. Working with Databricks is exciting for me because it helps solve longstanding problems in deep ways, applying not just new software tools but new ways of thinking about data architecture that fundamentally change the way companies work with data. The world of data, for all its massive development in my lifetime, is still remarkably fragmented, with too many tools, techniques, standards, and products to count. It’s common for companies, despite spending a lot of money and having an (apparent) wealth of tools at their disposal, to struggle, and for their developers, managers, and executives to feel frustrated at how hard it can be to build a healthy data ecosystem.

A Vision Of What Data Operations Can Look Like (if you do it right)

Most companies don’t even know what that looks like - people are so used to being intimidated by the hard work of data operations that they think this is simply how things must be. They don’t realize that it is possible to build data ecosystems in which accurate data is easily available in real time, accessible across multiple platforms and applications (like analytics and ML), and managed securely, with a complete picture of data quality across the organization. That’s the goal: a unified data picture in which high-quality data flows freely into, through, and out of the organization without impediments; various teams (analysts, developers, scientists, engineers) can easily access accurate data for their own purposes and collaborate freely without silos; there’s a single source of truth, without constant costly ambiguity about data freshness and accuracy; the tools are unified and well integrated, with not too many separate ones required; and all of this happens within a budget that just about any company can afford.

The Many Challenges And Frustrations Of Leveraging Data

This vision sounds, perhaps, almost utopian, because it’s so far from the current state of affairs in the world of tech and data. Too many companies struggle to get insight from their data. Data is siloed and inaccessible to parts of the organization that need it. Too many tools are in play, and they don’t all play nice with each other. Data is out of date, inaccurate, and moves slowly and without enough automation, requiring constant manual intervention and costly development and troubleshooting work. There’s no clear picture of the state of the data, so important decisions are made using data that’s outdated, inaccurate, or lacking context. Reputations, fortunes, and the privacy of customers and partners are at risk for lack of security and good governance. Obscene amounts of money and time are spent on unnecessary migrations and integrations as companies, wooed by expensive marketing campaigns, experiment with an ever-growing stack of tools and methods that are often inappropriate for the task at hand. I could go on, but suffice it to say that while a small minority of companies are achieving big wins with their data, most are just trying to keep their heads above water, overwhelmed by ever-increasing volumes of data and unsure how to leverage them.

You Don’t Have To Struggle

Databricks, more than almost any company in the data space, provides the concepts and tools to help transform your company’s data picture from mediocre (and frustrating and costly) to outstanding (and exciting and affordable). And I have the expertise and the passion to walk with you on that journey of transformation, getting your company’s data operations and culture to a place of consistent success. Too often data is a burden - it takes a lot of work and a lot of difficult decisions, and even after all that work the results are too often underwhelming. But it doesn’t have to be that way. The tools and the knowledge exist to help you take it to the next level - but they can be very hard to assemble in the right way. Click here to book an appointment with me on Square. Initial consultations are free, and I’ll be happy to listen to your concerns and help take your data situation to the next level.