My personal list of resources and samples related to working with Databricks. Opinions are my own and not the views of my employer.
- By Cloud: AWS | Azure | Google Cloud
- By Industry: FSI | Healthcare | Media & Entertainment | Retail and CPG
- dbdemos.ai
- Demo Hub: access ready-to-run Notebooks that walk you through common workflows and use cases on Databricks
- ▶️ Data + AI Summit (DAIS) recordings: 2023 | 2022
- Weekly "Ask Databricks" Live Sessions. You can also catch replays on Databricks' and Advancing Analytics' YouTube channels
- What's coming page: provides an overview of what's coming in upcoming releases, helping you plan ahead
- Free Live Onboarding Training: no-cost, role-based onboarding training, offered multiple times a day across key geographic regions, for Databricks customers, partners, and the general public
- Free half-day live training classes
- Customer Academy: all your self-paced training and certification needs in one place
- Paid Instructor-Led Training (ILT)
- Request private training
- Databricks Community: community forum
- Enroll in our New Expert-Led Large Language Models (LLMs) Courses on edX starting Jun 8! Master Large Language Models with expert guidance, hands-on learning, and insights from industry pioneers. Explore cutting-edge techniques like prompt engineering, embeddings, vector databases, and model tuning. Learn from luminaries like Stanford Professor & Databricks Co-Founder Matei Zaharia and the creators of Dolly. Consistent with our goal of keeping things open, course materials are free for anyone to audit. Enroll today
- Databricks New Features and Capabilities: helps you get up to speed with what has shipped over the last 6-12 months
- Databricks Onboarding Series
- Databricks Lakehouse Fundamentals
- Generative AI Fundamentals
- Azure Platform Architect
- GCP Platform Architect
- Data Engineering Practitioner (also prepares you for certification)
- Machine Learning Practitioner (also prepares you for certification)
- Full catalog
- Short tutorials
- Azure Databricks: Platform release notes | Runtime release notes | Databricks SQL release notes
- 📄 Databricks Academy lab notebooks
- 📄 Databricks Industry Solutions notebooks
- 📄 Databricks ML examples with detailed Notebooks on how to work with OSS LLMs (e.g. `Llama-2-*`, `mpt-*`) using MLflow in Databricks
- 📄 LLM GPU serving examples
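The linked repos show richer end-to-end patterns; as a quick orientation, here is a minimal, hedged sketch of logging an open-source LLM with MLflow's `transformers` flavour so it can be registered and served on Databricks (the model checkpoint and run layout below are illustrative, not taken from those repos):

```python
# Minimal sketch: log a Hugging Face text-generation pipeline with MLflow so it can be
# registered in the model registry and deployed to Model Serving. Model name is a placeholder.
import mlflow
from transformers import pipeline

generator = pipeline("text-generation", model="databricks/dolly-v2-3b")

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model=generator,
        artifact_path="llm",
        input_example="What is Delta Lake?",
    )
```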
- Notebook gallery
- Try the Community Edition for free (no Databricks or AWS costs are incurred by you)
- Databricks Community Q&A
- Stack Overflow: databricks, apache-spark, pyspark, apache-spark-sql
- User Groups: 🇦🇺 Sydney | Melbourne
- Submit feature requests (ideas) through the Ideas Portal
- CIO Vision 2025 report in conjunction with MIT
- 2023 State of Data + AI report
- Databricks 30 Index: (March 2021) The Databricks 30 is an equal-weight price index composed of 5 marquee customers each across Retail/Consumer Products, Financial Services, Healthcare, Media/Entertainment, Manufacturing/Logistics, in addition to 5 strategic partners
- Forrester Total Economic Impact™ Study (April 2020)
- Customers averaged nearly $29 million in total economic impact, with ROI over three years totaling 417%, driven by:
- 5% increase in revenue by unlocking new data science opportunities
- $11 million savings from retiring on-prem infrastructure and legacy licenses
- Faster time to market due to improved data team productivity of up to 25%
- Why we bet our business on Databricks, and why you should too
- Founding member of the Data Cloud Alliance: "Commitment to accelerating adoption across industries through common industry data models, open standards, processes, and end-to-end integrated products and solutions"
- Databricks Named a Leader in 2022 Gartner® Magic Quadrant™ for Cloud Database Management Systems (Dec 2022)
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics: research paper from the 11th Annual Conference on Innovative Data Systems Research (CIDR '21), January 11-15, 2021. My annotated version
- Don't Let a Cloud Data Warehouse Bottleneck your Machine Learning
- 6 Guiding Principles to Build an Effective Data Lakehouse
- A data architecture pattern to maximize the value of the Lakehouse
- Data Warehousing Modeling Techniques and Their Implementation on the Databricks Lakehouse Platform
- Five Simple Steps for Implementing a Star Schema in Databricks With Delta Lake
- Databricks Lakehouse and Data Mesh (Part 1) (Part 2)
- Dimensional modeling implementation on the modern lakehouse using Delta Live Tables: covers SCD1 & SCD2, PK/FK constraints, and IDENTITY columns (📄 Notebook)
- Prescriptive Guidance for Implementing a Data Vault Model on the Databricks Lakehouse Platform
- Architecting MLOps on the Lakehouse
- Leveraging Delta Across Teams at McGraw Hill (source code to automate the Databricks to Athena manifest based integration)
- Data Modeling using erwin on Databricks
- Multi-cloud Architecture for Portable Data and AI Processing in Financial Services: a useful blueprint for owning a reliable and governed multi-cloud data architecture. Although this article is focused on the financial services industry, it is relevant for any organisations with data footprints across clouds
- Databricks Workspace Administration - Best Practices for Account, Workspace and Metastore Admins
- Functional Workspace Organization on Databricks (Databricks Admin Essentials: Blog 1/5)
- Monitoring Your Databricks Lakehouse Platform with Audit Logs (Databricks Admin Essentials: Blog 2/5) (Notebook)
- Serving Up a Primer for Unity Catalog Onboarding (Databricks Admin Essentials)
- Intelligently Balance Cost Optimization & Reliability on Databricks
- Best Practices for Cost Management on Databricks
- Cluster Policy Primer: Cluster policies allow you to enforce security and cost controls on new clusters. This post walks you through what cluster policies are and how to design them for your Workspaces
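For orientation, a hedged sketch of what a cluster policy definition looks like (the attribute names follow the cluster policies JSON format, but treat the specific keys and values as illustrative; the policy would be applied through the UI, the REST API, or Terraform):

```python
# Illustrative cluster policy: cap autotermination and cluster size, restrict node types,
# and force a team tag. Values are placeholders for your own workspace standards.
import json

policy = {
    "autotermination_minutes": {"type": "range", "maxValue": 120, "defaultValue": 60},
    "num_workers": {"type": "range", "maxValue": 10},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "custom_tags.team": {"type": "fixed", "value": "data-engineering"},
}
print(json.dumps(policy, indent=2))
```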
- Cost-conscious R&D in the Cloud that Data Scientists Will Love
- Databricks cost management at Coinbase: outlines the cost management strategy implemented at Coinbase for Databricks applications, including the launch of a cost insights platform and quota enforcement platform
- How We Cut Our Databricks Costs by 50% (AWS)
- Disaster Recovery Overview, Strategies, and Assessment (Part 1 of DR series) (Part 2) (Part 3)
- How illimity Bank Built a Disaster Recovery Strategy on the Lakehouse - DR strategy, Terraform management, data & metadata replication strategy
- Security and Trust Center
- Databricks Bug Bounty Program (example of bug bounty response: Admin Isolation on Shared Clusters)
- Commitment to Responsible AI
- Security best practices including comprehensive checklists for GCP, Azure, and AWS
- Security Analysis Tool (SAT): 📄 GitHub repo, ▶️ Setup instructions
- GxP Best Practices Whitepaper: GxP stands for "Good x Practices", where the variable `x` refers to a specific discipline, such as clinical, manufacturing, or laboratory. The goal of GxP compliance is to ensure that regulated industries have a process that runs reliably, can survive failures and human error, and meets global traceability, accountability, and data integrity requirements. No matter what industry you operate in, I believe these are solid practices to align with
- Audit Log schema
- Admin Isolation on Shared Clusters
- Scanning for Arbitrary Code in Databricks Workspace With Improved Search and Audit Logs
- Monitoring Notebook Command Logs With Static Analysis Tools (📄 Notebooks)
- Protecting Your Compute Resources From Bitcoin Miners With a Data Lakehouse
- An example of using the `MASK()` function (available in DBR 12.2+) for easy data masking (a short sketch follows below)
- How Databricks restricts third party libraries in JVM compute platforms
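A quick sketch of the `mask()` SQL function referenced above (DBR 12.2+ / Spark 3.4+). By default it replaces upper-case letters with `X`, lower-case letters with `x`, and digits with `n`, leaving other characters untouched:

```python
# Assumes a Databricks notebook / PySpark session where `spark` is available.
spark.sql("SELECT mask('AbCd-1234') AS masked").show()
# +---------+
# |   masked|
# +---------+
# |XxXx-nnnn|
# +---------+
```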
- Building the Trusted Research Environment with Azure Databricks
- How PII scanning is done at Seek using Presidio and Unity Catalog
- Identifying and Tagging PII data with Unity Catalog
- Serving Up a Primer for Unity Catalog Onboarding (Databricks Admin Essentials)
- 📄 Unity Catalog example Notebook
- The Hitchhiker's Guide to data privilege model and access control in Unity Catalog
- How Terraform can enable Unity Catalog deployment at scale for different governance models
- Terraform scripts
- ▶️ How to Sync nested AD groups to Databricks
- Export lineage via API example
- How to Seamlessly Upgrade Your Hive Metastore Objects to the Unity Catalog Metastore Using SYNC (📄 notebook)
- Simplify Access Policy Management With Privilege Inheritance in Unity Catalog
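To make the privilege-model and inheritance items above concrete, a minimal sketch of Unity Catalog's SQL grants across the three-level namespace (catalog, schema, table); names and groups are placeholders:

```python
# Grant a group access step by step...
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")

# ...or rely on privilege inheritance: a single schema-level grant cascades to its tables.
spark.sql("GRANT SELECT ON SCHEMA main.sales TO `data_analysts`")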
- How Gemini Built a Cryptocurrency Analytics Platform Using Lakehouse for Financial Services: "The core lakehouse foundation and features resonated with the team as an efficient way to build the data platform"
- Apache Spark and Photon Receive SIGMOD Awards
- Apache Spark wins 2022 ACM SIGMOD Systems Award! "Apache Spark is an innovative, widely-used, open-source, unified data processing system encompassing relational, streaming, and machine-learning workloads."
- GitHub: Apache Spark
- Learning Spark (2nd Edition) (book)
- Learning Spark code samples
- Photon: A Fast Query Engine for Lakehouse Systems: SIGMOD 2022 Paper
- Apache Spark and Photon Receive SIGMOD Awards
- List of expressions supported by Photon
- ▶️ Advancing Spark - The Photon Whitepaper
- How DuPont achieved 11x latency reduction and 4x cost reduction with Photon
- Newsletter: Last Week in a Byte
- Roadmap
- Releases
- Release Milestones
- Delta Transactional Log Protocol
- 📄 Delta Lake VLDB paper (my annotated version)
- 📄 Delta Lake: The Definitive Guide (O'Reilly) (access free preview | PDF direct link)
- Diving Into Delta Lake: Unpacking The Transaction Log
- Diving Into Delta Lake: Schema Enforcement & Evolution
- Diving Into Delta Lake: DML Internals (Update, Delete, Merge)
- Processing Petabytes of Data in Seconds with Databricks Delta
- Top 5 Reasons to Convert Your Cloud Data Lake to a Delta Lake
- How to Rollback a Delta Lake Table to a Previous Version with Restore
- Exploring Delta Lake's `ZORDER` and Performance
- Idempotent Writes to Delta Lake Tables
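Two hedged sketches tied to the bullets above: Z-Ordering a Delta table, and an idempotent append using Delta's `txnAppId`/`txnVersion` writer options (table names and values are placeholders):

```python
# Co-locate related values in fewer files to speed up selective reads.
spark.sql("OPTIMIZE sales.orders ZORDER BY (customer_id, order_date)")

# Idempotent write: Delta skips the commit if this (txnAppId, txnVersion) pair was already written.
(df.write.format("delta")
   .option("txnAppId", "nightly-orders-load")   # stable id for this writing application
   .option("txnVersion", 42)                    # monotonically increasing version per run
   .mode("append")
   .saveAsTable("sales.orders"))
```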
- hydro 💧: a collection of Python-based Apache Spark and Delta Lake extensions
- The Ubiquity of Delta Standalone: a JVM library that can be used to read and write Delta Lake tables. Unlike Delta Lake Core, this project does not use Spark to read or write tables and has only a few transitive dependencies. It can be used by any application (e.g. Power BI) that cannot use a Spark cluster. The project allows developers to build a Delta connector for an external processing engine following the Delta protocol without using a manifest file.
- GitHub repository
- Release Milestones
- ▶️ Databricks Delta Sharing demo
- ▶️ PowerBI and Delta Sharing
- ▶️ Advancing Spark - Delta Sharing and Excel (via PowerBI)
- Arcuate: Machine Learning Model Exchange With Delta Sharing and MLflow
- Java connector (supporting blog post)
- Security Best Practices for Delta Sharing
- Using Structured Streaming with Delta Sharing in Unity Catalog
- Cost Effective and Secure Data Sharing: The Advantages of Leveraging Data Partitions for Sharing Large Datasets
- Using Delta Sharing to Accelerate Insights with Nasdaq's Digital Assets Market Data: an example of analysing Nasdaq data shared via Delta Sharing (sample 📄 Notebooks)
- How Delta Sharing Helped Rearc Simplify Data Sharing and Maximize the Business Value of Its Data: with 450+ open curated data products available across different sectors, Rearc's cross-industry catalog of datasets is one of the largest available today (Rearc data library)
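On the recipient side, the open-source `delta-sharing` Python client is all you need to read a share; a minimal sketch (the profile file is what a provider sends you, and the table path format is `<profile>#<share>.<schema>.<table>` - names below are placeholders):

```python
import delta_sharing

profile = "/dbfs/FileStore/config.share"

# Discover what the provider has shared with you.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table into pandas (Spark readers are also available).
pdf = delta_sharing.load_as_pandas(f"{profile}#retail.sales.orders")
```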
- Auto-Loader
- Easy Ingestion to Lakehouse With `COPY INTO` (a short ingestion sketch follows below)
- dbt (GitHub)
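Hedged sketches of the two managed-ingestion paths referenced above - Auto Loader for continuous, incremental file discovery and `COPY INTO` for idempotent batch loads. Paths and table names are placeholders:

```python
# Auto Loader: incrementally pick up new files as they land and stream them into a Delta table.
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")
    .load("s3://my-bucket/landing/orders/")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .toTable("bronze.orders"))

# COPY INTO: idempotent batch load - already-loaded files are skipped on re-run.
spark.sql("""
  COPY INTO bronze.orders
  FROM 's3://my-bucket/landing/orders/'
  FILEFORMAT = JSON
""")
```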
- ▶️ dbt Projects Integration in Databricks Workflows
- Best Practices for Super Powering Your dbt Project on Databricks
- Build Data and ML Pipelines More Easily With Databricks and Apache Airflow
- Ingesting emails (IMAP)
- Latency goes subsecond in Apache Spark Structured Streaming: progress has been made on Project Lightspeed, and the introduction of async offset management shows a 3x improvement in stream latency
- Real-Time Insights: The Top Three Reasons Why Customers Love Data Streaming with Databricks
- Simplifying Streaming Data Ingestion into Delta Lake
- Streaming in Production: Collected Best Practices - Part 1, Part 2
- Speed Up Streaming Queries With Asynchronous State Checkpointing
- Scalable Spark Structured Streaming for REST API Destinations: How to use Spark Structured Streaming's foreachBatch to scalably publish data to REST APIs
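A simplified sketch of the `foreachBatch` pattern from that post: each micro-batch is handed to an ordinary Python function, which posts the records to a REST endpoint (the endpoint URL and table names are placeholders):

```python
import requests

def post_to_api(batch_df, batch_id):
    # Fine for small batches; at scale, iterate with foreachPartition and batch the POSTs.
    for row in batch_df.toJSON().collect():
        requests.post("https://api.example.com/events", data=row)

(spark.readStream.table("silver.events")
    .writeStream
    .foreachBatch(post_to_api)
    .option("checkpointLocation", "/tmp/checkpoints/events_to_api")
    .start())
```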
- Feature Deep Dive: Watermarking in Apache Spark Structured Streaming
- Python Arbitrary Stateful Processing in Structured Streaming
- Monitoring streaming queries (PySpark | Scala)
- Using Spark Structured Streaming to Scale Your Analytics
- State Rebalancing in Structured Streaming
- ▶️ Streaming data into the Lakehouse
- Simplifying Streaming Data Ingestion into Delta Lake
- Enhanced Fan-Out for Kinesis on Databricks
- Roadmap: Project Lightspeed: Faster and Simpler Stream Processing With Apache Spark
- Debugging using the Structured Streaming UI (Spark docs)
- Confluent Streaming for Databricks: Build Scalable Real-time Applications on the Lakehouse (Part I) (Part II)
- Delta Live Tables Notebooks
- How We Performed ETL on One Billion Records For Under $1 With Delta Live Tables (DLT): learn how to run a terabyte-scale incremental pipeline for under $1, covering varied data structures (CSV & XML), CDC with SCD Type II, modeling, and data quality enforcement, all while utilising spot compute. Check out the DLT definitions in this GitHub repo. What stood out for me (an illustrative DLT sketch follows this list):
- 2x speedup due to DLT making efficient compute decisions
- Data quality checks exposed data errors others hadn't picked up before (needle in a haystack of 1.5B records)
- Easy SCD Type II logic handling
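For readers new to DLT, a minimal sketch (not the linked repo's code) of the pattern behind these results - declarative tables plus expectations that drop bad records; it only runs inside a Delta Live Tables pipeline, and the paths are placeholders:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders loaded incrementally with Auto Loader")
def orders_bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .load("/mnt/landing/orders/"))

@dlt.table(comment="Cleansed orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_silver():
    return dlt.read_stream("orders_bronze").withColumn("ingested_at", F.current_timestamp())
```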
- Processing data simultaneously from multiple streaming platforms using Delta Live Tables: walkthrough of simultaneously ingesting and transforming streams across Azure Event Hubs, Kafka, and Kinesis
- Dimensional modeling implementation on the modern lakehouse using Delta Live Tables: covers SCD1 & SCD2, PK/FK constraints, and IDENTITY columns (📄 Notebook)
- Data Vault Best practice & Implementation on the Lakehouse
- Track health and fitness goals with Apple Healthkit and Databricks: great walkthrough of building out health insights using a metadata-driven approach with Delta Live Tables for ETL (GitHub repo)
- Deloitte's Guide to Declarative Data Pipelines With Delta Live Tables (Whitepaper)
- Build a Customer 360 Solution with Fivetran and Delta Live Tables - includes SCD2 example
- Simplifying Change Data Capture With Databricks Delta Live Tables
- Delivering Real-Time Data to Retailers with Delta Live Tables (fully documented notebooks)
- Building ETL pipelines for the cybersecurity lakehouse with Delta Live Tables: ingest & evaluate AWS CloudTrail & VPC Flow logs (accompanying notebooks: CloudTrail DLT pipeline, VPC Flow Logs DLT pipeline, Zeek DLT pipeline)
- Low-latency Streaming Data Pipelines with Delta Live Tables and Apache Kafka
- ▶️ Apache Kafka and Delta Live Tables
- How I Built A Streaming Analytics App With SQL and Delta Live Tables: accompanying repo
- How Uplift built CDC and Multiplexing data pipelines with Databricks Delta Live Tables
- Near Real-Time Anomaly Detection with Delta Live Tables and Databricks Machine Learning
- How Collective Health uses Delta Live Tables and Structured Streaming for Data Integration: informative walkthrough of how they use DLT to perform quality control on data shared by partners while also managing schema evolution
- How Audantic Uses Databricks Delta Live Tables to Increase Productivity for Real Estate Market Segments
- `dbt-databricks`
- ▶️ dbt core on Databricks
- Faster MERGE Performance With Low-Shuffle MERGE and Photon
- SQL CLI: run SQL queries on your SQL endpoints from your terminal. From the command line, you get productivity features such as suggestions and syntax highlighting
- sqlparse: open source library for formatting and analysing SQL strings
- Databricks Workflows Through Terraform: Part I, Part II
- Why We Migrated From Apache Airflow to Databricks Workflows at YipitData
- Save Time and Money on Data and ML Workflows With "Repair and Rerun"
- VS Code extension (short ▶️ video)
- Using Databricks SQL in VSCode
- Use an IDE with Databricks
- Debug your code and notebooks by using Visual Studio Code
- Interactive debugging with Databricks Connect: developers can step through their code and inspect variables in real time. Databricks Connect runs your Spark code on remote clusters from the IDE, so you can step through it while debugging
- Support for `ipynb` notebooks
- Support for `dbutils` and Spark SQL
- ▶️ Databricks Connect v2 Quickstart
- Spark Connect Available in Apache Spark 3.4: Run Spark Applications Everywhere: write PySpark in any environment and have your instructions processed in a remote Spark environment (e.g. a Databricks cluster). `databricks-connect` supports Spark Connect when using DBR 13.0+. This simplifies client application development, mitigates memory contention on the Spark driver, separates dependency management for client applications, allows independent client and server upgrades, and provides step-through IDE debugging plus thin-client logging and metrics
- Use Databricks from anywhere with Databricks Connect v2: use the power of Databricks from any application running anywhere. It is also included in the VS Code extension, enabling built-in debugging of code on Databricks. Here's a sample application
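A minimal Databricks Connect v2 sketch (`databricks-connect` 13+ against a DBR 13+ cluster): the session is created locally, but all Spark work runs on the remote cluster. Connection details are assumed to come from your Databricks config profile or environment variables:

```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Executed remotely on the Databricks cluster, results returned to your local process.
spark.range(10).selectExpr("id", "id * 2 AS doubled").show()
```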
- ▶️ Databricks Connect + Spark Connect: How you can build on Spark from anywhere: "Databricks Connect v2 leverages Spark Connect so you can connect to your Spark clusters within Databricks"
- Software engineering best practices for notebooks (accompanying notebooks)
- Build Reliable Production Data and ML Pipelines With Git Support for Databricks Workflows (📄 notebooks)
- Run SQL Queries on Databricks From Visual Studio Code: makes life easy if you use the SQLTools extension and want to iterate on your SQL logic while in your local environment (e.g. while debugging `dbt` model logic) (download the Databricks driver for SQLTools)
- 30+ reusable Terraform modules to provision your Databricks Lakehouse platform: the newly released Terraform Registry modules for Databricks provide reusable modules and examples for provisioning your Databricks Lakehouse platform. I've found them useful for exploring best practices and speeding up Terraform development
- Spark test suite (Scala)
- Introduction to Databricks SQL: a good walkthrough by one of our partners on the capabilities of Databricks SQL
- Understanding Caching in Databricks SQL: UI, Result, and Disk Caches
- Actioning Customer Reviews at Scale with Databricks SQL AI Functions: bring meaning to unstructured data using the simplicity of SQL and GPT-3.5
- What's New With SQL User-Defined Functions (2023-01-18)
- Comprehensive Guide to Optimize Databricks, Spark and Delta Lake Workloads: a new guide walking you through all the possible ways to tune your workloads for ⚡️ performance and 💰 cost
- Streaming in Production: Collected Best Practices - Part 1, Part 2
- 10 Best Practices for writing SQL in Databricks
- Comprehensive Guide to Optimize Databricks, Spark and Delta Lake Workloads
- Delta Lake best practices
- Optimize performance with file management
- Make Your Data Lakehouse Run, Faster With Delta Lake 1.1
- Get to Know Your Queries With the New Databricks SQL Query Profile
- Top 5 Performance Tips
- Memory Profiling in PySpark
- How to consistently get the best performance from star schema databases
- Delta - Best Practices for Managing Performance by partner Daimlinc
- Introducing Ingestion Time Clustering with Databricks SQL and Databricks Runtime 11.2: 19x faster query performance out-of-the-box. A write optimization that ensures clustering is always maintained by ingestion time, yielding significant query performance gains
- Faster insights With Databricks Photon Using AWS i4i Instances With the Latest Intel Ice Lake Scalable Processors: Up to 2.5x price/performance benefits and 5.3x speed up!
- Improved Performance and Value With Databricks Photon and Azure Lasv3 Instances Using AMD 3rd Gen EPYC™ 7763v Processors: Up to 2.5x price/performance benefits and 5.3x speed up!
- Reduce Time to Decision With the Databricks Lakehouse Platform and Latest Intel 3rd Gen Xeon Scalable Processors: "By enabling Databricks Photon and using Intel's 3rd Gen Xeon Scalable processors, without making any code modifications, we were able to save ⅓ of the costs on our TPC-DS benchmark at 10TB and run 6.7 times quicker"
- Delta Lake orders the data in the Parquet files to make range selection on object storage more efficient
- Limit the number of columns in the Z-Order to the best 1-4
- `ANALYZE TABLE db_name.table_name COMPUTE STATISTICS FOR ALL COLUMNS`
- Utilised for Adaptive Query Execution (AQE), which re-optimises plans during query execution
- Major features of AQE
- Coalescing post-shuffle partitions (combine small partitions into reasonably sized partitions)
- Converting sort-merge joins to broadcast hash joins
- Skew join optimisation by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks
- Dynamically detects and propagates empty relations
- `ANALYZE TABLE` collects table statistics that allow AQE to choose the best plan for you
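A short sketch tying these together: collect column statistics so the optimizer (and AQE re-optimisations) can pick better plans, and confirm the relevant AQE features are on - they are enabled by default on recent Spark/DBR versions, so the `conf.set` calls are only illustrative:

```python
# Table name is a placeholder.
spark.sql("ANALYZE TABLE sales.orders COMPUTE STATISTICS FOR ALL COLUMNS")

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```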
- Architecting MLOps on the Lakehouse
- MLOps at Walgreens Boots Alliance With Databricks Lakehouse Platform - experiences with preview of MLOps Stack
- ▶️ MLflow YouTube channel
- Cross-version Testing in MLflow: MLflow integrates with several popular ML frameworks. See how the Databricks Engineering team proactively adapts MLflow and third-party libraries to guard against breaking changes
- Model Evaluation in MLflow
- Synthetic Data for Better Machine Learning: a guide to using the Synthetic Data Vault (SDV) with MLflow to generate synthetic data that reflects real-world data, for better models or safer data sharing between teams
- How (Not) To Scale Deep Learning in 6 Easy Steps
- PyTorch on Databricks - Introducing the Spark PyTorch Distributor: you can now easily perform distributed PyTorch training with PySpark and Databricks! (sample 📄 Notebook)
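A hedged outline of the distributor's shape (`pyspark.ml.torch.distributor`, Spark 3.4+ / DBR ML 13+): `train_fn` stands in for your ordinary PyTorch training function, and the distributor launches it across the cluster's processes/GPUs:

```python
from pyspark.ml.torch.distributor import TorchDistributor

def train_fn(learning_rate):
    # ... a standard PyTorch (DDP) training loop would go here ...
    return learning_rate

# num_processes / use_gpu are illustrative; size them to your cluster.
distributor = TorchDistributor(num_processes=2, local_mode=False, use_gpu=True)
result = distributor.run(train_fn, 1e-3)
```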
- Accelerating Your Deep Learning with PyTorch Lightning on Databricks
- Ray support on Databricks and Apache Spark Clusters
- ▶️ Scaling Deep Learning on Databricks
- Rapid NLP Development With Databricks, Delta, and Transformers
- Mitigating Bias in Machine Learning With SHAP and Fairlearn (accompanying 📄 notebook)
- Parallel ML: How Compass Built a Framework for Training Many Machine Learning Models
- LLM Model Recommendations: we often get asked which OSS LLMs are best for which use case. This frequently updated matrix is a handy reference for identifying the right models by use case, depending on whether you are seeking quality-, balanced-, or speed-optimised models
- GitHub repo with detailed Notebooks on how to work with OSS LLMs (e.g. `Llama-2-*`, `mpt-*`) using MLflow in Databricks
- 📄 LLM GPU serving examples
- Actioning Customer Reviews at Scale with Databricks SQL AI Functions: bring meaning to unstructured data using the simplicity of SQL and GPT-3.5
- Tutorial: Connecting Databricks' DBRX LLM to a Power BI report
- Implementing LLM Guardrails for Safe and Responsible Generative AI Deployment
- Getting Started with Personalization through Propensity Scoring (accompanying notebooks)
- Building an End-to-End No Code Pipeline with Databricks
- Using MLflow to deploy Graph Neural Networks for Monitoring Supply Chain Risk
- Predicting the 2022 World Cup with no-code data science and machine learning: covers `bamboolib` + AutoML + serverless inference (repo)
- Fine-Tuning Large Language Models with Hugging Face and DeepSpeed
- How Outreach Productionizes PyTorch-based Hugging Face Transformers for NLP
- Getting started with NLP using Hugging Face transformers pipelines
- Rapid NLP Development With Databricks, Delta, and Transformers: Hugging Face, BERT
- GPU-accelerated Sentiment Analysis Using Pytorch and Hugging Face on Databricks
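A small sketch of the transformers-pipelines pattern referenced in the items above: wrap a pipeline in a pandas UDF so inference scales out across the cluster (model and table names are examples, not taken from the linked posts):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from transformers import pipeline

# Small model so the pipeline can be shipped to executors via the UDF closure.
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

@pandas_udf("string")
def predict_sentiment(texts: pd.Series) -> pd.Series:
    return pd.Series([r["label"] for r in sentiment(texts.tolist(), truncation=True)])

reviews = spark.table("silver.product_reviews")
reviews.withColumn("sentiment", predict_sentiment("review_text")).show(5, truncate=False)
```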
- ▶️ Streaming Data with Twitter, Delta Live Tables, Databricks Workflows, and Hugging Face
- Quantifying uncertainty with TensorFlow Probability
- How Corning Built End-to-end ML on Databricks Lakehouse Platform (▶️ AWS re:Invent 2022 talk)
- Scale Vision Transformers (ViT) on the Databricks Lakehouse Platform with Spark NLP: Spark NLP's latest release includes support for vision transformers. This guide walks through scaling ViTs on Databricks
- Mosaic: a Databricks Labs extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets
- GitHub: Mosaic
- Building a Geospatial Lakehouse, Part 1
- Building a Geospatial Lakehouse, Part 2: includes downloadable notebooks
- High Scale Geospatial Processing With Mosaic: writeup on the underlying philosophy behind Mosaic's design
- Building Geospatial Data Products
- Built-in H3 Expressions for Geospatial Processing and Analytics
- Supercharging H3 for Geospatial Analytics "In this blog, you will learn about the new expressions, performance benchmarks from our vectorized columnar implementation, and multiple approaches for point-in-polygon spatial joins using H3"
- Spatial Analytics at Any Scale With H3 and Photon: A Comparison of Discrete, Vector, and Hybrid Approaches
- How Thasos Optimized and Scaled Geospatial Workloads with Mosaic on Databricks: Thasos is an alternative data intelligence firm that transforms real-time location data from mobile phones into actionable business performance insights. To derive actionable insights from mobile phone ping data (a time series of points defined by a latitude and longitude pair), Thasos created, maintains and manages a vast collection of verified geofences
- ArcGIS GeoAnalytics Engine in Databricks
- Unsupervised Outlier Detection on Databricks utilising Databricks' new Kakapo package (which integrates the vast PyOD library of outlier detection algorithms with MLflow for tracking and packaging of models, and Hyperopt for exploring vast, complex and heterogeneous search spaces) (sample 📄 Notebook)
- Build your own Chatbot: walks through indexing documents, generating embeddings (using OpenAI embeddings), persisting embeddings in a vector store (FAISS), creating a Q&A flow (using Langchain), persisting the model in MLflow registry, and serving the model for your applications (Accompanying blog post)
- Customer Entity Resolution (Solution Accelerator page | Notebooks)
- The Emergence of the Composable Customer Data Platform (whitepaper)
- Hunting for IOCs Without Knowing Table Names or Field Labels
- Hunting Anomalous Connections and Infrastructure With TLS Certificates: TLS hashes as a source for the cybersecurity threat hunting program
- Cybersecurity in the Era of Multiple Clouds and Regions
- Building ETL pipelines for the cybersecurity lakehouse with Delta Live Tables: ingest & evaluate AWS CloudTrail & VPC Flow logs (accompanying notebooks: CloudTrail DLT pipeline, VPC Flow Logs DLT pipeline, Zeek DLT pipeline)
- Accelerating SIEM Migrations With the SPL to PySpark Transpiler
- Learn how to connect Databricks to Okta to ingest System Logs, retain, and analyze for complete visibility using your Databricks Lakehouse Platform (accompanying notebooks)
- Streaming Windows Event Logs into the Cybersecurity Lakehouse (notebook)
- Building a Cybersecurity Lakehouse for CrowdStrike Falcon Events Part I, Part II, Part III
- ▶️ Vlogs on security engineering for big data and cybersecurity: by Lipyeow Lim, Technical Director, Cybersecurity GTM, Databricks
- Simplify entity resolution with Databricks Automated Record Connector (ARC): ARC abstracts away the complexity of utilising the UK Ministry of Justice's Splink library for entity resolution. It determines the optimal set of blocking rules, comparisons, and deterministic rules (GitHub repo)
- Solution Accelerator: Automated Analysis of Product Reviews Using Large Language Models (LLMs) (📄 Notebooks)
- Real-Time Propensity Estimation to Drive Online Sales: real-time scoring of purchase intent doesn't have to be hard! This Solution Accelerator walks you through the end-to-end process of standing up your own real-time scoring model on Databricks. Check out the 📄 detailed notebooks that walk you through data preparation, ETL, model training with Feature Store, model registry, processing live events in streaming or batch, and deploying the model for real-time inference
- Enhancing the Amperity CDP with Personalized Product Recommendations: move identity resolution data easily between Amperity and Databricks using Amperity's Databricks Delta table destination connector (sample 📄 Notebook)
- Managing Complex Propensity Scoring Scenarios with Databricks (Notebooks | GitHub): a new Solution Accelerator to manage regular feature updates (utilising Feature Store) and periodic model re-training
- Enhancing Product Search with LLMs: this example utilises Wayfair's annotation dataset (WANDS) to fine-tune a `SentenceTransformer('all-MiniLM-L12-v2')` model, generate and store embeddings in a vector store (Chroma), register the model and embeddings in the MLflow registry, and finally deploy the model for serving (accompanying blog post)
- Databricks Labs Data Generator: Python library for generating synthetic data within the Databricks environment using Spark. The generated data may be used for testing, benchmarking, demos, and many other uses.
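A rough sketch of what using the Data Generator (`dbldatagen`) looks like; treat the specific column options as illustrative, since option names vary a little between versions:

```python
import dbldatagen as dg

# Generate 1M synthetic customer transactions for testing or benchmarking.
df = (dg.DataGenerator(spark, rows=1_000_000, partitions=8)
        .withColumn("customer_id", "long", uniqueValues=100_000)
        .withColumn("amount", "decimal(10,2)", minValue=1.0, maxValue=500.0, random=True)
        .withColumn("country", "string", values=["AU", "NZ", "SG"], random=True)
        .build())

df.show(5)
```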
- `dbx` (DataBricks CLI eXtensions): a CLI tool for advanced Databricks jobs management
- Seamlessly Migrate Your Apache Parquet Data Lake to Delta Lake
- How To Migrate Your Oracle PL/SQL Code to Databricks Lakehouse Platform
- Australian Red Cross Lifeblood wins 2023 Data for Good Award: Red Cross are able to attract more donations and strengthen community bonds through initiatives that span granular forecasting, real-time wait-time predictions, customer segmentation, and marketing attribution
- Monash University stands up 'lakehouse' in Databricks, Azure
- Australia's heavy vehicle regulator builds 'fatigue engine' to reduce truckie deaths
- Accelerating Innovation at JetBlue ✈️ Using Databricks (▶️ video): thanks to Databricks, JetBlue has benefited from rapid prototyping, iteration, and launching of data pipelines, jobs, and ML models; an elevated customer experience; continuous improvement of overall value; and lowered TCO
- How Akamai Leverages Databricks Unity Catalog For Distributed Data Governance (accompanying Summit presentation and slides): with 50 exabytes of data accessed across 80 workspaces, Unity Catalog brought many efficiencies to the team:
- Eliminated the need for duplicating mounts for shared datasets across workspaces
- Implemented fine-grained access controls at the row and column levels
- Eliminated the pains of syncing user management and access control across multiple workspaces
- Single pane of glass for data observability
- Why we migrated to a Data Lakehouse on Delta Lake for T-Mobile Data Science and Analytics Team
- FactSet's Lakehouse adoption resulted in faster processing times and improved team productivity while reducing costs by 83%
- How Stack Overflow built their new course recommendations solution on Azure Databricks. I think it's safe to say we all owe Stack Overflow some thanks for aiding us in our technical careers; so I was happy to see Databricks play a role in helping the community discover relevant courses. "[It was] clear that leveraging one platform for as much as possible would be wise, and our platform of choice was Azure Databricks. This allowed us to keep all data processing, feature engineering, model versioning, serving, and orchestration all in one place."
- How Instacart Ads Modularized Data Pipelines With Lakehouse Architecture and Spark
- Ahold Delhaize: Workflows helps data teams scale and reduce costs: 1K daily ingestion jobs with 50% cost reduction
- Having your cake and eating it too: How Vizio built a next-generation data platform to enable BI reporting, real-time streaming, and AI/ML: Vizio's journey in adopting the Lakehouse for a single platform that met their data warehouse and ML needs. "Databricks was the only platform that could handle ETL, monitoring, orchestration, streaming, ML, and Data Governance on a single platform. Not only was Databricks SQL + Delta able to run queries faster on real-world data (in our analysis, Databricks was 3x faster) but we no longer needed to buy other services just to run the platform and add features in the future"
- Ripple: ML Training and Deployment Pipeline Using Databricks: how Ripple uses Databricks to manage robust MLOps pipelines across a multi-cloud (GCP and AWS) architecture: "ML flow tracking and MLflow API help coordinate these actions with ease in spite of using different platforms for model development, testing and deployment"

















