
Conversation

Collaborator

@dtisza1 dtisza1 commented Mar 11, 2025

Summary

This PR implements a framework for instrumenting Databricks notebooks with OpenTelemetry and integrating them with Azure Application Insights for monitoring and observability. The implementation provides detailed tracing and metrics collection across the extraction, transformation, and loading stages of notebook workflows, along with documentation and visual diagrams to improve usability.

Key Features

  • Flexible OpenTelemetryHelper class that encapsulates OpenTelemetry functionality behind simplified helper methods (see the usage sketch after this list)
  • Integration with Azure Application Insights for monitoring and alerting
  • Comprehensive tracing for ETL pipeline stages (extraction, transformation, loading)
  • Parent-child notebook workflow monitoring with two approaches:
    • Monitoring parent workflows without modifying child notebooks
    • Directly instrumenting child notebooks with a propagated trace context
  • Custom metrics collection and visualization
  • Span attributes for detailed monitoring and troubleshooting
  • Function decorators for automatic tracing with minimal code changes
  • Comprehensive documentation with visual diagrams for setup, usage, and monitoring
  • Multiple installation options for different use cases
  • Ready-to-use examples of instrumented ETL pipelines and notebook workflows
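
To make the helper's intended use concrete, here is a minimal, hypothetical sketch of instrumenting one ETL stage in a parent notebook. The start_tracing/end_tracing calls are the methods discussed in the review below; the constructor arguments and the secret scope/key names are assumptions for illustration and may differ from the actual otel_helper.py API.

# Minimal sketch, run inside a Databricks notebook (dbutils and spark are ambient).
# The constructor arguments below are assumptions; check otel_helper.py for the real signature.
from otel_helper import OpenTelemetryHelper

otel_helper = OpenTelemetryHelper(
    service_name="etl_parent_notebook",
    connection_string=dbutils.secrets.get("observability", "appinsights-connection-string"),
)

# Manual span management around one ETL stage.
otel_helper.start_tracing("extraction")
try:
    raw_df = spark.read.table("raw.events")  # work recorded under the "extraction" span
finally:
    otel_helper.end_tracing("extraction")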

Implementation Details

  • The core functionality is in the otel_helper.py module, which provides a reusable helper class
  • New helper methods simplify OpenTelemetry instrumentation:
    • run_notebook_with_tracing for automatic tracing of child notebook executions
    • instrument_function for wrapping any function with OpenTelemetry tracing
    • trace_function decorator for automatic tracing of function execution (a usage sketch follows this list)
  • The implementation shows how to add observability with minimal changes to existing notebook code
  • Comprehensive documentation is included for tracing, metrics, and Azure monitoring
  • Visual diagrams illustrate the architecture, data flow, and span correlation
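
As a hedged illustration of the decorator-based approach, the sketch below applies trace_function to a transformation step. It assumes the decorator is exposed on the helper instance and can be used without arguments, which may differ from the actual signature in otel_helper.py; otel_helper and raw_df come from the sketch above.

# Hypothetical decorator usage; exact parameters may differ from the helper's real API.
@otel_helper.trace_function
def transform_orders(df):
    # Each call is wrapped in a span named after the function.
    return df.dropDuplicates(["order_id"])

transformed_df = transform_orders(raw_df)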

Business Value

  • Enhanced observability for Databricks workflows with real-time visibility into ETL processes
  • Reduced Mean Time to Resolution (MTTR) by quickly identifying the root cause of failures
  • Improved performance by identifying bottlenecks in data processing pipelines
  • Increased reliability through proactive monitoring and alerting
  • Optimized resource usage by tracking efficiency metrics across pipeline stages

Learning Context

I worked on this project as part of my personal study to deepen my understanding of:

  • OpenTelemetry instrumentation patterns and best practices
  • Databricks notebook integration with observability tools
  • Azure Application Insights for monitoring data workflows
  • Notebook observability and performance tracking
  • Practical implementation of distributed tracing in data workflows

This learning exercise has helped me gain hands-on experience with modern observability techniques and how they can be applied to data engineering workflows.

Testing

  • Verified trace data appears correctly in Azure Application Insights
  • Confirmed metrics are properly collected and exported
  • Tested the implementation with simulated notebook scenarios
  • Validated that span attributes and events are correctly recorded
  • Ensured new helper methods work correctly with example notebooks

Documentation

  • Added detailed documentation for setup, usage, and implementation
  • Included guides for tracing, metrics, and Azure monitoring
  • Provided example queries for analyzing telemetry data in Azure
  • Added visual diagrams to illustrate:
    • Overall system architecture
    • ETL pipeline data flow and tracing
    • Parent-child notebook workflow span correlation
  • Created a glossary of technical terms for better understanding
  • Added a quick start guide for streamlined setup and onboarding
  • Expanded README with business value and documentation guide
  • Reorganized documentation structure for improved navigation

AI Assistance Disclosure

This contribution utilized AI tools (e.g., ChatGPT, Claude 3.7 Sonnet via VSCode Cline) for development assistance. All outputs were manually reviewed and tested to ensure adherence to project standards.

@dtisza1 dtisza1 self-assigned this Mar 11, 2025
@dtisza1 dtisza1 marked this pull request as ready for review March 12, 2025 15:17
@dtisza1 dtisza1 requested review from colettace and emanguy March 12, 2025 15:18
try:
# Execute Child Notebook 2
print("Executing Child Notebook 2...")
child2_result_json = dbutils.notebook.run("./child_notebook_2", timeout_seconds=600)
Contributor

@emanguy emanguy Mar 13, 2025

I wonder if there's a way we could wrap dbutils.notebook.run so we don't have to do the manual .start_tracing()/.end_tracing()?

Maybe something like:

# This is a member function on "workflow_otel_helper"
def manually_instrument_fn(self, function, trace_name=None):
    if trace_name is None:
        trace_name = function.__name__

    outer_self = self
    def trace_wrapper(*args, **kwargs):
        nonlocal outer_self
        # Maybe capture the arguments here?
        outer_self.start_tracing(trace_name)
        try:
            # Could be worth injecting the current trace_id into kwargs
            # so traces can continue across notebooks
            # Return the wrapped function's result (e.g., dbutils.notebook.run's JSON output)
            return function(*args, **kwargs)
        finally:
            outer_self.end_tracing(trace_name)

    return trace_wrapper

Then you could run other notebooks like this without needing to manually include the start_tracing and end_tracing calls:

run_traced_notebook = workflow_otel_helper.manually_instrument_fn(dbutils.notebook.run)
child2_result_json = run_traced_notebook("./child_notebook_2", timeout_seconds=600)

Contributor

This is derived from a design philosophy of mine I like to call "make the easiest way to do something the right way"

Collaborator Author

@emanguy Thank you for the improvement advice and good philosophy!

I just updated the project accordingly. Let me know if this looks good to you.

Here's a quick summary:

  • Added helper methods _trace_execution() (private), instrument_function() and run_notebook_with_tracing() to the helper class.
  • Updated the example notebooks to use these.
  • Updated the documentation.
  • Tested the notebook changes via Azure Databricks.
  • Tested the related KQL queries in Azure Application Insights.

So the code related to the Child2 notebook call now looks like this:

# Execute Child Notebook 2 with automatic tracing
print("Executing Child Notebook 2 with automatic tracing...")
child2_result = workflow_otel_helper.run_notebook_with_tracing(
    notebook_path="./child_notebook_2",
    span_name="Child_Notebook_2",
    timeout_seconds=600,
    etl_pipeline_id=workflow_id,
    notebook_type="aggregation"
)

print(f"Child Notebook 2 completed with status code: {child2_result['status_code']}")

Contributor

Nice, looks good to me!

dtisza1 added 6 commits March 17, 2025 13:17
- Expand README with business value, visual diagrams, and documentation guide
- Update existing documentation files with more detailed information
- Reorganize and improve documentation structure
- Add glossary.md with definitions of technical terms
- Add quick_start.md with streamlined setup instructions
- Improve onboarding experience for new users
- Add architecture_diagram.md showing overall system architecture
- Add etl_pipeline_visualization.md illustrating data flow through ETL pipeline
- Add parent_child_workflow_diagram.md showing span correlation in notebook workflows
- Enhance documentation with visual representations