
How to replatform “small data” Notebooks into SOLID Azure Batch

By Edward Schwalb, ML Architect, Grid Dynamics

Sep 1, 2023

In the ever-evolving landscape of big data operations, the efficiency and cost-effectiveness of data processing pipelines have become paramount. As organizations grapple with increasing data volumes and the need for quicker insights, the spotlight shines brightly on the concept of streamlining data processing into repeatable, low-cost pipelines. In this era of rapidly advancing technology, where data drives decision-making, the ability to orchestrate seamless, efficient, and affordable data workflows is no longer a luxury but a necessity.

This article delves into the crucial role that streamlined data processing plays in modern data operations, exploring the benefits that Microsoft Azure Batch offers when paired with SOLID design principles, and the strategies employed to achieve them. From optimizing resource utilization to ensuring data accuracy and timely insights, the journey toward building agile and cost-conscious data pipelines begins here.

Why SOLID Azure Batch? Limitations of Azure Data Factory

While Azure Data Factory (ADF) notebooks prove to be useful and efficient for exploratory tasks, their limitations render them unsuitable for consistent, reliable operations and ongoing maintenance, necessitating the integration of Azure Batch with SOLID principles. These limitations encompass various aspects:

  • Cost: A common scenario involves linking ADF pipelines to Spark or Synapse clusters. Frequently, the expense incurred in running these clusters far surpasses the cost of executing equivalent pipelines using Azure Batch.
  • Maintainability: Developing reusable components in ADF notebooks presents challenges. Although entire notebooks can be reused, adhering to the established best practices of SOLID object-oriented principles through free-form notebook code fragments becomes notably complex.
  • Observability: Each cell within the notebook displays the stdout (output) of the executed command, yet this output is often incomplete. Generally, only a fraction of the output stream is accessible through the web UI. This interface falls short when comprehensive log inspection and search capabilities are required for deriving value from the logs in their entirety.

What are “SOLID” Principles?

In the realm of software engineering, SOLID stands as a mnemonic acronym encompassing five fundamental design principles aimed at enhancing the effectiveness, comprehensibility, flexibility, and maintainability of object-oriented designs. These principles, collectively known as SOLID, consist of:

  1. Single-Responsibility Principle (S): This principle advocates that a class should never have more than one reason to change. In simpler terms, each class should be entrusted with just one distinct responsibility.
  2. Open-Closed Principle (O): Emphasizing extensibility, this principle dictates that software entities should be open for extension but closed for modification.
  3. Liskov Substitution Principle (L): Functions that use pointers or references to base classes must be able to work with objects of derived classes without knowing the difference.
  4. Interface Segregation Principle (I): This principle underscores that clients shouldn’t be compelled to rely on interfaces they don’t employ, thus advocating for precise, focused interfaces.
  5. Dependency Inversion Principle (D): Encouraging reliance on abstractions over concrete implementations, this principle promotes the creation of adaptable and interconnected systems.

Within this context, this blog post contends that the deployment of notebooks within data pipelines can be strategically organized to capitalize on the wealth of insights derived from the SOLID principles.
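
To ground these principles in the pipeline context, the minimal Python sketch below models one pipeline step behind a small abstraction; the class names (PipelineStep, DeduplicateTransactions) are hypothetical placeholders for your own class library rather than part of any Azure SDK.

    from abc import ABC, abstractmethod

    class PipelineStep(ABC):
        """Single responsibility: one transformation per class (S)."""

        @abstractmethod
        def run(self, records: list) -> list:
            ...

    class DeduplicateTransactions(PipelineStep):
        """Extends the abstraction without modifying it (O) and can stand in
        wherever a PipelineStep is expected (L)."""

        def run(self, records: list) -> list:
            seen, result = set(), []
            for record in records:
                if record["id"] not in seen:
                    seen.add(record["id"])
                    result.append(record)
            return result

    def execute(step: PipelineStep, records: list) -> list:
        """Callers depend on the abstraction, not on a concrete class (D)."""
        return step.run(records)

A notebook cell, a unit test, and an Azure Batch task can all invoke execute with the same concrete step, which is precisely the kind of reuse the remainder of this article relies on.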

What are “small data” Notebooks?

A “small data” notebook delineates a series of sequential actions, each scripted using programming languages like Python, R, or shell scripting. The output of each action is stored within the processor’s memory, facilitating the continuation of subsequent steps once the preceding step concludes. The execution of a notebook commences with the initiation of the first command and concludes when the final command is executed.

These small data notebooks encompass logic designed to manipulate datasets comprising tens of millions of items or less. While the spotlight often falls on “big data,” there exists a plethora of scenarios wherein processing volumes beneath 100 million items is the norm. For example, instances like the cumulative count of transactions within a single hour or the tally of user interaction events during an hour frequently involve quantities below the 100 million mark.
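
For concreteness, a “small data” notebook of this kind might consist of cells like the following sketch, in which the file and column names are purely illustrative:

    # Cell 1: load one hour of events into memory (well under 100 million rows)
    import pandas as pd
    events = pd.read_parquet("events_2023-09-01T10.parquet")  # illustrative path

    # Cell 2: transform in memory; the result stays available to later cells
    hourly_counts = events.groupby("user_id").size().rename("event_count")

    # Cell 3: persist the result for downstream consumers
    hourly_counts.to_csv("hourly_counts_2023-09-01T10.csv")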

How do “Notebooks” relate to data “Pipelines”?

Frequently, the results generated by notebook commands necessitate integration with other systems. This requirement becomes particularly pertinent when these results correspond to daily or hourly operations that must be executed multiple times throughout the day. Similarly, it’s a common practice to divide the processing of extensive datasets — comprising billions of items — into operations conducted on smaller subsets, each containing fewer than 100 million items.

In addressing these demands, the concept of a “pipeline” emerges as a solution. A pipeline is often constructed by aggregating a series of notebooks. This process is especially seamless with the aid of tools like Azure Data Factory (ADF). Leveraging a user-friendly, code-free interface, ADF empowers the straightforward assembly of these pipelines, rendering them a prevalent and practical feature within the ADF ecosystem.

Why are Databricks and Synapse clusters more expensive than Batch Pools?

Databricks and Synapse alternatives demand the initiation of high-cost clusters equipped with a multitude of nodes. These clusters must remain fully operational for the entire computational process; there’s no provision for utilizing only a portion of the cluster for executing a single “small” command. As a result, operating such clusters could easily incur costs that are double those of running a basic Azure Batch pool while accomplishing a comparable task.

Databricks and Synapse charge a premium on the hourly cost of the underlying instances, effectively replacing conventional licensing fees. This premium is worthwhile when the pipeline logic genuinely requires these tools; it also contributes a substantial share of the revenue that funds the continued development of these sophisticated tools within their respective domains.

Why is it common to operate Databricks and Synapse clusters unnecessarily?

Unfortunately, the no-code options in ADF default to using Databricks or Synapse as the data processing engine, as depicted in Figure 1.

Figure 1: Default no-code notebook options in ADF

For data scientists and non-developers, viable alternatives beyond ADF notebook pipeline components are limited, and employing those components mandates the use of costly infrastructure. Avoiding them requires writing code.

How does Azure Batch integrate with Azure Data Factory?

While a straightforward drag-and-drop interface allows for adding an Azure Batch component to an ADF pipeline, its actual operation mandates the specification of a shell script invoking a headless process, as depicted in Figure 2. To enable the reusability of notebook code within such a headless script, encapsulating the logic into a class library is commonly required, adhering to SOLID object-oriented best practices. The relevant patterns facilitating the sharing of code between ad-hoc notebooks and Azure Batch pipelines are outlined below.

Figure 2: Azure batch step specification within the no-code ADF pipeline specification UI.
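
As a sketch of what that headless process can look like, the Batch step's shell command typically just launches a thin Python entry point that calls into the same class library the notebooks use; the pipeline package below is a hypothetical placeholder for that library.

    # run_step.py: hypothetical headless entry point executed by an Azure Batch
    # task, e.g. via the command "python run_step.py --date 2023-09-01".
    import argparse

    # The pipeline package below is a placeholder for your own shared library.
    from pipeline.io import load_partition, save_partition
    from pipeline.steps import DeduplicateTransactions

    def main() -> None:
        parser = argparse.ArgumentParser(description="Run one pipeline step headlessly")
        parser.add_argument("--date", required=True, help="Data partition to process")
        args = parser.parse_args()

        # Same class and method a notebook cell would call interactively.
        records = load_partition(args.date)
        save_partition(args.date, DeduplicateTransactions().run(records))

    if __name__ == "__main__":
        main()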

Autoscaling vs rightsizing

Autoscaling is designed to adjust the number of instances based on the load observed by the underlying infrastructure. Within the context of autoscaling, the upscaling or downscaling is controlled by defining a metric, e.g. CPU utilization. When the observed metric is above or below a threshold, up or downscaling is performed by adding or removing instances, respectively. Both up and downscaling are performed using small increments, e.g. changing the number of instances by 5%-10% in each increment.

More often than not, pipelines have known and predictable compute requirements, so the number of concurrent instances needed for every step can be determined a priori. This leads to a “rightsizing” approach, where pipeline steps adjust the node pool size to precise needs, as shown in Figure 3.
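
As a sketch of what rightsizing looks like in code, the azure-batch Python SDK exposes a pool resize operation; the account details below are placeholders, and constructor or parameter names may differ slightly between SDK versions.

    # Rightsizing sketch using the azure-batch Python SDK (names illustrative;
    # exact signatures may vary across SDK versions).
    from azure.batch import BatchServiceClient
    from azure.batch.batch_auth import SharedKeyCredentials
    from azure.batch.models import PoolResizeParameter

    credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
    batch_client = BatchServiceClient(
        credentials, batch_url="https://mybatchaccount.westus2.batch.azure.com"
    )

    def resize_pool(pool_id: str, nodes: int) -> None:
        """Set the pool to exactly the number of nodes the next step needs."""
        batch_client.pool.resize(
            pool_id=pool_id,
            pool_resize_parameter=PoolResizeParameter(target_dedicated_nodes=nodes),
        )

    resize_pool("small-data-pool", 200)  # scale up before a heavily parallel step
    # ... submit the step's tasks and wait for completion ...
    resize_pool("small-data-pool", 0)    # release the nodes once the step finishes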

Figure 3: Rightsizing using steps within a pipeline.

Comparing this approach to a Databricks cluster, the differences become apparent. With Databricks, the number of nodes remains unchanged throughout the operation of a pipeline, and typically stays below 100 worker nodes. In contrast, Azure Batch scales effortlessly to hundreds of nodes for specific steps, leveraging concurrency to its fullest potential. Once the step concludes, Azure Batch seamlessly scales down, as illustrated in Figure 3.

How to leverage Notebooks as batch steps within ADF: The recipe

To effectively employ notebook code within Azure Batch, a strategic blend of SOLID object-oriented programming best practices and meticulous source control is imperative. A typical architectural blueprint, outlined in Figure 4, for maximizing notebook reuse within Azure Batch steps is as follows:

  • Notebook initialization: Kickstart the notebook’s operation by cloning a designated code repository and configuring a virtual environment. For Python, commonly used tools such as “virtualenv” (or the built-in “venv” module) or “conda” serve this purpose.
  • Sequential step logic: Each ensuing step within the notebook should encompass the requisite imports, class instantiation, and method invocation. It’s crucial to encapsulate all the logic within class methods to maintain modularity and reusability.
  • External pipeline component design: Design and craft the pipeline components external to the notebook, using Integrated Development Environments (IDEs) like VSCode, PyCharm, or others that align with your workflow.
  • Repository integration: Integrate the code for these pipeline components into the same code repository accessed by the notebooks.
  • ADF Batch pipeline construction: Construct the ADF Batch pipeline utilizing the exact command sequence executed within the notebook. Reusability is attained by ensuring that every Notebook step corresponds to a Batch step that executes the same method. By simplifying the commands and encapsulating the logic within class methods, redundant boilerplate code is circumvented. A minimal sketch of this notebook-to-Batch mapping follows this list.
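
As a rough illustration of the pattern, and using a hypothetical repository and shared class library (pipeline-lib, pipeline.steps, load_partition, and save_partition are placeholders, not real packages), the notebook side might look like this, with the matching Batch step invoking the same method through a headless command:

    # Cell 1 (notebook initialization): clone the shared repository and install
    # the class library; the repository URL and package name are illustrative.
    !git clone https://github.com/example-org/pipeline-lib.git
    !pip install -e ./pipeline-lib

    # Cell 2 (sequential step logic): import, instantiate, invoke. All logic
    # lives in the class method, not in free-form notebook code.
    from pipeline.io import load_partition, save_partition
    from pipeline.steps import DeduplicateTransactions

    records = load_partition("2023-09-01")
    save_partition("2023-09-01", DeduplicateTransactions().run(records))

    # The corresponding ADF Batch step executes the same method headlessly,
    # e.g. with a command line such as: python run_step.py --date 2023-09-01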

By embracing this methodology, you not only streamline the process of employing Notebooks as Batch Steps within ADF but also enhance reusability, maintainability, and consistency throughout your data pipeline operations.

Figure 4: The Architecture Pattern for co-existence of Notebooks with Azure Batch.

Enhancing observability and debugging with Azure Batch compared to ADF Databricks

Azure Batch distinguishes itself from the operation of Databricks clusters by offering enhanced visibility and access to the underlying instances driving the pipeline steps. This distinction yields notable benefits, delivering a higher degree of control and insight than what a Databricks cluster typically affords. Beyond the cost considerations, Azure Batch operation yields several advantages, outlined as follows:

  • Azure Batch Explorer: The Azure Batch Explorer (accessible through the quick start option) provides a comprehensive user interface that enables monitoring and management of every facet of the pool. This encompasses aspects like configuring the operating system, defining startup scripts for nodes, controlling pool size, and configuring autoscaling logic.
  • Live job output: Azure Batch facilitates viewing the list of executing jobs along with live standard output (stdout) and standard error output (stderr) for each job, as illustrated in Figure 5. Direct access to this live output via the ADF pipeline can be granted by configuring a Shared Access Signature (SAS) token. This access exposes the complete, untruncated output, in contrast to the truncated output available through Databricks notebooks.
  • Node state insights: A detailed node view, depicted in Figure 6, along with the capability to establish connections and execute debugging sessions within nodes, as depicted in Figure 7, contributes to a more comprehensive understanding of the state and behavior of individual nodes.
  • Time-sensitive analysis: Facilitated through ADF APIs or SDKs, Azure Batch enables time-sensitive analysis of progress, states, and error conditions for efficient troubleshooting and optimization. A programmatic sketch of this kind of monitoring follows this list.
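
One way to implement such programmatic monitoring is with the azure-batch Python SDK: the sketch below lists the tasks of a job and streams their full stdout. The job and account names are placeholders, and exact signatures may vary by SDK version.

    # Observability sketch: enumerate tasks and pull their untruncated stdout.
    from azure.batch import BatchServiceClient
    from azure.batch.batch_auth import SharedKeyCredentials

    credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
    batch_client = BatchServiceClient(
        credentials, batch_url="https://mybatchaccount.westus2.batch.azure.com"
    )

    for task in batch_client.task.list("daily-small-data-job"):
        print(task.id, task.state)
        # stdout.txt holds the complete output of the task's command.
        stream = batch_client.file.get_from_task(
            "daily-small-data-job", task.id, "stdout.txt"
        )
        print(b"".join(stream).decode("utf-8", errors="replace"))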

How is maintenance improved with Azure Batch compared to ADF Databricks?

As depicted in Figure 4, all logic within a notebook is encapsulated within class methods. As such, the pipeline implementation can leverage several common engineering best practices:

  • SOLID object-oriented design: The adoption of complete SOLID object-oriented design patterns ensures modularity, clarity, and maintainability throughout the pipeline’s lifecycle.
  • Comprehensive testing: Enabling full unit and integration testing guarantees the robustness and reliability of each pipeline component, reducing the potential for errors and unexpected outcomes (a minimal test sketch follows this list).
  • Reusable components: The thorough encapsulation of logic within class methods facilitates effortless component reuse, as well as the seamless integration and customization of elements across diverse pipelines.
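
Because the logic lives in class methods rather than in free-form cells, ordinary unit tests can exercise it directly, with no notebook or cluster involved. A minimal pytest-style sketch, reusing the hypothetical class from the earlier examples:

    # test_steps.py: unit-testing a pipeline step in isolation;
    # DeduplicateTransactions is the hypothetical class from the earlier sketches.
    from pipeline.steps import DeduplicateTransactions

    def test_deduplicate_keeps_first_occurrence():
        records = [
            {"id": 1, "amount": 10},
            {"id": 1, "amount": 10},
            {"id": 2, "amount": 5},
        ]
        result = DeduplicateTransactions().run(records)
        assert [r["id"] for r in result] == [1, 2]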

While it is possible to achieve reuse and improve maintainability with Databricks components, it is more challenging. The Databricks notebook approach lacks the inherent structure that encourages adherence to a disciplined, SOLID, fully tested development methodology. Just as weight loss is achievable but demands discipline, following best practices within the Databricks notebook paradigm requires determined effort.

Figure 5: Improved Observability and Debugging using Azure Batch.
Figure 6: Viewing state of each node in the pool.
Figure 7: Connecting to each node and running debug sessions within.

Conclusion

In this exploration, we delved into the potential advantages of choosing Azure Batch over Databricks within the ADF ecosystem. Our discussion encompassed architectural blueprints and operational distinctions between these two options.

Our observations have revealed that harnessing Azure Batch for data pipelines offers a trifecta of benefits — speed, quality, and cost-efficiency:

  • Cost optimization: Realizing reduced instance costs.
  • Enhanced code quality: Achieving code improvement through reuse, enhanced testability, and adherence to SOLID best practices.
  • Accelerated development and debugging: Expediting the development cycle and debugging phase through heightened observability utilizing Batch Explorer. This, in turn, translates to decreased engineering expenditures and more predictable timelines.

By embracing Azure Batch, you not only position yourself for more efficient and cost-effective data pipeline operations but also enable a higher caliber of code and enhanced development agility within the Azure Data Factory landscape.

Reach out to Grid Dynamics to learn more.

