vinoaj/databricks-resources

Databricks Resources

My personal list of resources and samples related to working with Databricks. Opinions are my own and not the views of my employer.



Keep Current and Learning Resources

News and Learning Content

โ–ถ๏ธ YouTube channel | ๐ŸŽง Data Brew Podcast | ๐Ÿ“– Databricks Blog

Useful Customer Academy Courses

Release Notes

GitHub repos

Community & Support

Feedback / Feature Requests


Value Generation

OSS & No Lock-in

  • Founding member of the Data Cloud Alliance: "Commitment to accelerating adoption across industries through common industry data models, open standards, processes, and end-to-end integrated products and solutions"

Analyst Evaluations


Lakehouse Paradigm


Deployment Architecture & Management

Architecture & Data Model Design

Administration

Cost Management

Disaster Recovery (DR) and High Availability (HA)

๐Ÿ” Security

PII

Unity Catalog

Migrating to Unity Catalog

Customer Implementations


Under the Hood: Apache Spark

Apache Spark


Under the Hood: Photon Engine


Under the Hood: Delta Lake

Delta Lake Benchmarking ⚡️

Utilities

  • hydro 💧: a collection of Python-based Apache Spark and Delta Lake extensions

Developing with Delta Lake

  • The Ubiquity of Delta Standalone: a JVM library that can be used to read and write Delta Lake tables. Unlike Delta Lake Core, this project does not use Spark to read or write tables and has only a few transitive dependencies. It can be used by any application (e.g. Power BI) that cannot use a Spark cluster. The project allows developers to build a Delta connector for an external processing engine following the Delta protocol without using a manifest file.

Delta Sharing


ETL / ELT Patterns

Design

Ingestion

Ingestion: Streaming

Delta Live Tables (DLT)

Transformation


Development

  • SQL CLI: run SQL queries on your SQL endpoints from your terminal. From the command line, you get productivity features such as suggestions and syntax highlighting
  • sqlparse: open source library for formatting and analysing SQL strings

Orchestration

Databricks Workflows


DataOps

IDEs

GitHub

Terraform

Unit Testing


Databricks SQL, Analysis & Business Intelligence (BI)

SQL

ODBC & JDBC connectivity

Analyst Experience


Best Practices

Performance tuning

Z-Ordering

  • Z-Ordering colocates related data in the same set of Parquet files, which makes range selection on object storage more efficient because fewer files need to be read
  • Limit the Z-Order to the 1-4 most useful columns
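To make the idea behind Z-Ordering concrete, here is a minimal pure-Python sketch of bit interleaving along a Z-order (Morton) curve. It illustrates the general technique only and is not Delta Lake's implementation. The helper names `interleave_bits` and `zorder_statement`, and the column names `event_date` and `customer_id`, are hypothetical. The generated statement uses the `OPTIMIZE ... ZORDER BY` syntax and enforces the 1-4 column guideline from the notes above.

```python
def interleave_bits(values, bits=8):
    """Interleave the low `bits` bits of each non-negative integer into one Z-value."""
    z = 0
    for bit in range(bits):
        for dim, value in enumerate(values):
            # Take bit `bit` of this dimension and place it at its interleaved position.
            z |= ((value >> bit) & 1) << (bit * len(values) + dim)
    return z


def zorder_statement(table, columns, max_columns=4):
    """Build an OPTIMIZE ... ZORDER BY statement, enforcing the 1-4 column guideline."""
    if not 1 <= len(columns) <= max_columns:
        raise ValueError(f"expected 1-{max_columns} Z-Order columns, got {len(columns)}")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


# Sort points on a 4x4 grid by Z-value: neighbours in (x, y) end up close together.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: interleave_bits(p, bits=2))
print(ordered[:4])
print(zorder_statement("db_name.table_name", ["event_date", "customer_id"]))
```

In this toy grid the first four points in Z-value order form one 2x2 block, which is the locality property that lets a range filter on either column skip most files.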

ANALYZE

ANALYZE TABLE db_name.table_name COMPUTE STATISTICS FOR ALL COLUMNS

  • The collected statistics are used by Adaptive Query Execution (AQE), which re-optimises the query plan during query execution
  • 4 major features of AQE
    • Coalescing post-shuffle partitions (combining small partitions into reasonably sized partitions)
    • Converting sort-merge joins to broadcast hash joins
    • Skew join optimisation by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks
    • Dynamically detecting and propagating empty relations
  • ANALYZE TABLE collects table statistics that allow AQE to know which plan to choose for you
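As a rough illustration of the first AQE feature, the sketch below models coalescing post-shuffle partitions as a greedy merge of adjacent partitions up to a target size. This is a simplified model, not Spark's implementation. The helper names `analyze_statement` and `coalesce_partitions`, the sample sizes, and the 64 MB target are assumptions for illustration (in Spark the advisory target is controlled by `spark.sql.adaptive.advisoryPartitionSizeInBytes`).

```python
def analyze_statement(table):
    """Build the ANALYZE TABLE statement from the notes above for a given table name."""
    return f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR ALL COLUMNS"


def coalesce_partitions(sizes_mb, target_mb=64):
    """Greedily merge adjacent shuffle partitions until each group nears target_mb."""
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes_mb):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical post-shuffle partition sizes in MB.
sizes = [5, 10, 40, 30, 2, 3, 70, 8]
print(analyze_statement("db_name.table_name"))
print(coalesce_partitions(sizes))
```

Only adjacent partitions are merged, so each output group still covers a contiguous range of shuffle partition indices. An oversized partition (70 MB here) stays on its own, which is where skew handling would take over.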

Machine Learning (ML) & Artificial Intelligence (AI) 🧠

MLOps

MLflow

MLflow Recipes

Feature Engineering

Feature Store

Distributed Training

Predictions

LLMs

Guides


Geospatial


Use Cases

Anomaly Detection

App Dev

Chatbots

  • Build your own Chatbot: walks through indexing documents, generating embeddings (using OpenAI embeddings), persisting embeddings in a vector store (FAISS), creating a Q&A flow (using LangChain), persisting the model in the MLflow registry, and serving the model for your applications (Accompanying blog post)
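The index, embed, and retrieve steps of that walkthrough can be sketched with the standard library alone. This is a toy stand-in, not the linked solution: word counts replace OpenAI embeddings, a brute-force in-memory list replaces FAISS, and the LangChain Q&A flow and MLflow registration are left out. The names `embed`, `cosine`, and `VectorStore` and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """Brute-force in-memory index (a stand-in for a vector store such as FAISS)."""

    def __init__(self):
        self.entries = []

    def add(self, doc):
        # Persist the document together with its embedding.
        self.entries.append((doc, embed(doc)))

    def search(self, query, k=1):
        # Rank every stored document by similarity to the query embedding.
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]


store = VectorStore()
for doc in [
    "Z-Ordering colocates related rows in the same files",
    "ANALYZE TABLE collects table statistics",
    "Delta Sharing is an open protocol for sharing data",
]:
    store.add(doc)

print(store.search("which command collects statistics?"))
```

In a fuller pipeline, the retrieved documents would be passed to an LLM as context for answering the question.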

Clean Rooms

Customer Data

Cybersecurity

Entity Resolution

ERP

Large Language Models (LLMs)

Marketing Analytics

Personalisation & Recommendations

Propensity Scoring

Search


Tools 🛠

  • Databricks Labs Data Generator: Python library for generating synthetic data within the Databricks environment using Spark. The generated data may be used for testing, benchmarking, demos, and many other uses.
  • dbx: DataBricks CLI eXtensions (aka dbx), a CLI tool for advanced Databricks jobs management

Migrations


End-to-end Guides


🥂 Customer Stories / Case Studies


TODO: By Roles

ML/AI Roles

CTO

ML Engineer

Data Scientist

Software Engineer

ML Researcher

Data Engineer

Research Scientist

SRE

DevOps
