
Conversation

Collaborator

@dtisza1 dtisza1 commented Mar 11, 2025

Summary

This PR implements a framework for instrumenting Databricks notebooks with OpenTelemetry and integrating them with Azure Application Insights for monitoring and observability. The implementation provides detailed tracing and metrics collection across the extraction, transformation, and loading stages of notebook workflows, along with documentation and visual diagrams to improve usability.

Key Features

  • Flexible OpenTelemetryHelper class that encapsulates OpenTelemetry functionality behind simplified helper methods (see the usage sketch after this list)
  • Integration with Azure Application Insights for monitoring and alerting
  • Comprehensive tracing for ETL pipeline stages (extraction, transformation, loading)
  • Parent-child notebook workflow monitoring with two approaches:
    • Monitoring parent workflows without modifying child notebooks
    • Directly instrumenting child notebooks with a propagated trace context
  • Custom metrics collection and visualization
  • Span attributes for detailed monitoring and troubleshooting
  • Function decorators for automatic tracing with minimal code changes
  • Comprehensive documentation with visual diagrams for setup, usage, and monitoring
  • Multiple installation options for different use cases
  • Ready-to-use examples of instrumented ETL pipelines and notebook workflows
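
To make the helper's intended use concrete, here is a minimal, hypothetical sketch of instrumenting one ETL stage in a parent notebook. The start_tracing/end_tracing calls are the methods discussed in the review below; the constructor arguments and the secret scope/key names are assumptions for illustration and may differ from the actual otel_helper.py API.

# Minimal sketch, run inside a Databricks notebook (dbutils and spark are ambient).
# The constructor arguments below are assumptions; check otel_helper.py for the real signature.
from otel_helper import OpenTelemetryHelper

otel_helper = OpenTelemetryHelper(
    service_name="etl_parent_notebook",
    connection_string=dbutils.secrets.get("observability", "appinsights-connection-string"),
)

# Manual span management around one ETL stage.
otel_helper.start_tracing("extraction")
try:
    raw_df = spark.read.table("raw.events")  # work recorded under the "extraction" span
finally:
    otel_helper.end_tracing("extraction")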

Implementation Details

  • The core functionality is in the otel_helper.py module, which provides a reusable helper class
  • New helper methods simplify OpenTelemetry instrumentation:
    • run_notebook_with_tracing for automatic tracing of child notebook executions
    • instrument_function for wrapping any function with OpenTelemetry tracing
    • trace_function decorator for automatic tracing of function execution (a usage sketch follows this list)
  • The implementation shows how to add observability with minimal changes to existing notebook code
  • Comprehensive documentation is included for tracing, metrics, and Azure monitoring
  • Visual diagrams illustrate the architecture, data flow, and span correlation
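
As a hedged illustration of the decorator-based approach, the sketch below applies trace_function to a transformation step. It assumes the decorator is exposed on the helper instance and can be used without arguments, which may differ from the actual signature in otel_helper.py; otel_helper and raw_df come from the sketch above.

# Hypothetical decorator usage; exact parameters may differ from the helper's real API.
@otel_helper.trace_function
def transform_orders(df):
    # Each call is wrapped in a span named after the function.
    return df.dropDuplicates(["order_id"])

transformed_df = transform_orders(raw_df)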

Business Value

  • Enhanced observability for Databricks workflows with real-time visibility into ETL processes
  • Reduced Mean Time to Resolution (MTTR) by quickly identifying the root cause of failures
  • Improved performance by identifying bottlenecks in data processing pipelines
  • Increased reliability through proactive monitoring and alerting
  • Optimized resource usage by tracking efficiency metrics across pipeline stages

Learning Context

I worked on this project as part of my personal study to deepen my understanding of:

  • OpenTelemetry instrumentation patterns and best practices
  • Databricks notebook integration with observability tools
  • Azure Application Insights for monitoring data workflows
  • Notebook observability and performance tracking
  • Practical implementation of distributed tracing in data workflows

This learning exercise has helped me gain hands-on experience with modern observability techniques and how they can be applied to data engineering workflows.

Testing

  • Verified trace data appears correctly in Azure Application Insights
  • Confirmed metrics are properly collected and exported
  • Tested the implementation with simulated notebook scenarios
  • Validated that span attributes and events are correctly recorded
  • Ensured new helper methods work correctly with example notebooks

Documentation

  • Added detailed documentation for setup, usage, and implementation
  • Included guides for tracing, metrics, and Azure monitoring
  • Provided example queries for analyzing telemetry data in Azure
  • Added visual diagrams to illustrate:
    • Overall system architecture
    • ETL pipeline data flow and tracing
    • Parent-child notebook workflow span correlation
  • Created a glossary of technical terms for better understanding
  • Added a quick start guide for streamlined setup and onboarding
  • Expanded README with business value and documentation guide
  • Reorganized documentation structure for improved navigation

AI Assistance Disclosure

This contribution utilized AI tools (e.g., ChatGPT, Claude 3.7 Sonnet via VSCode Cline) for development assistance. All outputs were manually reviewed and tested to ensure adherence to project standards.

@dtisza1 dtisza1 self-assigned this Mar 11, 2025
@dtisza1 dtisza1 marked this pull request as ready for review March 12, 2025 15:17
@dtisza1 dtisza1 requested review from colettace and emanguy March 12, 2025 15:18
try:
# Execute Child Notebook 2
print("Executing Child Notebook 2...")
child2_result_json = dbutils.notebook.run("./child_notebook_2", timeout_seconds=600)
Contributor

@emanguy emanguy Mar 13, 2025

I wonder if there's a way we could wrap dbutils.notebook.run so we don't have to do the manual .start_tracing()/.end_tracing()?

Maybe something like:

# This is a member function on "workflow_otel_helper"
def manually_instrument_fn(self, function, trace_name=None):
    if trace_name is None:
        trace_name = function.__name__

    outer_self = self
    def trace_wrapper(*args, **kwargs):
        nonlocal outer_self
        # Maybe capture the arguments here?
        outer_self.start_tracing(trace_name)
        try:
            # Could be worth injecting the current trace_id into kwargs
            # so traces can continue across notebooks
            # Return the wrapped function's result (e.g., dbutils.notebook.run's JSON output)
            return function(*args, **kwargs)
        finally:
            outer_self.end_tracing(trace_name)

    return trace_wrapper

Then you could run other notebooks like this without needing to manually include the start_tracing and end_tracing calls:

run_traced_notebook = workflow_otel_helper.manually_instrument_fn(dbutils.notebook.run)
child2_result_json = run_traced_notebook("./child_notebook_2", timeout_seconds=600)

Contributor

This is derived from a design philosophy of mine I like to call "make the easiest way to do something the right way"

Collaborator Author

@emanguy Thank you for the improvement advice and good philosophy!

I just updated the project accordingly. Let me know if this looks good to you.

Here's a quick summary:

  • Added helper methods _trace_execution() (private), instrument_function() and run_notebook_with_tracing() to the helper class.
  • Updated the example notebooks to use these.
  • Updated the documentation.
  • Tested the notebook changes via Azure Databricks.
  • Tested the related KQL queries in Azure Application Insights.

So the code related to the Child2 notebook call now looks like this:

# Execute Child Notebook 2 with automatic tracing
print("Executing Child Notebook 2 with automatic tracing...")
child2_result = workflow_otel_helper.run_notebook_with_tracing(
    notebook_path="./child_notebook_2",
    span_name="Child_Notebook_2",
    timeout_seconds=600,
    etl_pipeline_id=workflow_id,
    notebook_type="aggregation"
)

print(f"Child Notebook 2 completed with status code: {child2_result['status_code']}")

Contributor

Nice, looks good to me!

dtisza1 added 6 commits March 17, 2025 13:17
- Expand README with business value, visual diagrams, and documentation guide
- Update existing documentation files with more detailed information
- Reorganize and improve documentation structure
- Add glossary.md with definitions of technical terms
- Add quick_start.md with streamlined setup instructions
- Improve onboarding experience for new users
- Add architecture_diagram.md showing overall system architecture
- Add etl_pipeline_visualization.md illustrating data flow through ETL pipeline
- Add parent_child_workflow_diagram.md showing span correlation in notebook workflows
- Enhance documentation with visual representations