Azure Databricks : End To End Project — Part 1 — Unity Catalog & Project Setup

Amine Charot
May 15, 2024


Unity Catalog in Azure Databricks is a game-changer for organizations looking to enhance their data governance frameworks. It provides a robust, centralized catalog for all data assets across different Databricks workspaces, ensuring that governance and security policies are uniformly applied. The addition of PySpark into this mix allows data scientists and engineers to execute complex data processing tasks while adhering to strict governance protocols.

In this blog post, we will delve deeper into how Unity Catalog and PySpark can be used together to create a secure and efficient data environment. We’ll cover setting up Unity Catalog, exploring its three-level namespace, and running a PySpark example that illustrates how to interact with this setup.

Unity Catalog: A Primer

Unity Catalog is designed to manage data access and security policies centrally. It extends across all data in the lakehouse, providing fine-grained access control and ensuring compliance with data governance standards. The three-level namespace hierarchy in Unity Catalog — consisting of catalogs, schemas, and tables — helps in organizing data assets effectively, making it easier for users to manage and access data according to their needs and permissions.

Three-Level Namespace Explained

Here’s a closer look at each level in Unity Catalog’s namespace:

  1. Catalog: The highest level, representing an overarching boundary like an entire department or a large project within an organization.
  2. Schema: A subdivision within a catalog that groups related tables, similar to a specific project team or a particular aspect of the business within the larger department.
  3. Table: The most detailed level, where data rows are stored. Tables hold the actual datasets used for analyses, such as sales data, user interactions, or operational logs.
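
In practice, a table is always referenced through its fully qualified three-level name. Here is a minimal PySpark sketch, assuming a catalog named dev, a schema named bronze, and a table named orders (placeholder names that we will actually create in the setup steps later in this post):

# Reference a table by its three-level name: <catalog>.<schema>.<table>
df = spark.table("dev.bronze.orders")

# The same fully qualified name works in SQL
spark.sql("SELECT * FROM dev.bronze.orders LIMIT 10").show()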

Understanding its key features, such as external locations and storage credentials, is crucial for leveraging its full potential. Let’s explore each of these features in detail.

External Location

External locations refer to storage outside of the Databricks environment but managed via Unity Catalog. This allows for integrating and managing data stored in systems like Azure Data Lake Storage, AWS S3, and others within the Unity Catalog framework, thus centralizing data governance and security.

Storage Credential

Storage credentials are used to securely access data stored in external locations. Unity Catalog handles these credentials centrally, ensuring they are encrypted and not exposed to end-users, thus maintaining security and compliance.
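
To make these two features concrete, here is a minimal PySpark sketch of how an external location is registered against a storage credential. The location, container, storage account, and credential names are placeholders; the credential itself is created in Step 1 of the setup below:

# Register an external location that points to an ADLS container and
# authenticates through an existing storage credential (placeholder names)
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS my_location
    URL 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL my_credential)
    COMMENT 'Example external location'
""")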

Roles in Unity Catalog

Roles in Unity Catalog are central to access control and permissions within a data ecosystem. Each role grants specific rights and responsibilities to users or groups, keeping management efficient and sensitive data protected. Let’s look at what each role can do within Unity Catalog:

Account Admin:

Create Metastore and Link Workspaces: Account Admins can create and manage metastores, which organize metadata, and attach them to workspaces for streamlined data management.

User and Group Management: Account Admins manage users and groups across the account, handling role assignments and enabling collaboration.

Billing and Cost: Account Admins oversee billing and cost management, which is key to resource optimization and budget adherence.

Metastore Admin:

Create and Manage Catalogs: Metastore Admins create and manage catalogs, structuring the metastore for effective data governance.

Create and Manage External Locations: Metastore Admins can create and manage external locations, enabling integration with external data sources and platforms.

Workspace Admin:

Create and Manage Workspaces: Workspace Admins create and manage workspaces, keeping the environment organized for collaborative data work.

Create and Manage Clusters: Workspace Admins provision and manage clusters, the compute resources used for data processing.

Workspace Users:

Create Tables, Schemas, Objects: Workspace Users can create tables, schemas, and other objects within their workspaces, according to the privileges they have been granted.

Unity Catalog Privileges

Unity Catalog privileges define the permissions granted to users or groups on objects within the Unity Catalog environment. They can be assigned using SQL commands or through the Unity Catalog UI, providing flexibility and ease of management. Let’s look at the components of a privilege:

Components of Unity Catalog Privileges:

Privilege Type:

  • A privilege type is the action a user or group is authorized to perform on a securable object within Unity Catalog (for example, SELECT or CREATE).

Securable Object:

  • Securable objects are the entities within Unity Catalog that access control permissions apply to.
  • They include catalogs, schemas, tables, views, functions, and other objects managed within the catalog.

Principal:

  • Principals are the users, service principals, or groups to whom privileges are assigned within Unity Catalog.
  • Assigning privileges to a principal determines its level of access to securable objects.

Example Syntax for Assigning Privileges:

The syntax for granting privileges follows a structured format:

GRANT <Privilege_Type> ON <Securable_Object> TO <Principal>
  • <Privilege_Type>: Specifies the type of privilege being granted (e.g., SELECT, CREATE, etc.).
  • <Securable_Object>: Denotes the object on which the privilege is being assigned (e.g., TABLE, SCHEMA, etc.).
  • <Principal>: Represents the user, service principal, or group to whom the privilege is being granted.

Example Usage:

-- Grant SELECT privilege on a table to a user
GRANT SELECT ON TABLE my_schema.my_table TO my_user;
-- Grant CREATE privilege on a schema to a group
GRANT CREATE ON SCHEMA my_schema TO my_group;

In the examples above:

  • my_user and my_group represent specific users or groups.
  • my_schema and my_table are placeholders for the schema and table names within Unity Catalog.
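
To verify what has been granted on an object, you can list its grants from a notebook cell, using the same placeholder names as above:

# Inspect the privileges granted on a table
spark.sql("SHOW GRANTS ON TABLE my_schema.my_table").show()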

Setup Project : Customer Orders Analysis

This project involves analyzing a dataset that contains five years of customer orders, from 2017 to 2021, featuring thousands of products sold. The dataset provides a rich source of information for mining insights into customer purchasing patterns and product performance over time.

In this project, we will explore different components of each platform. With Databricks, we will delve into features like Unity Catalog, which organizes and secures data across all Databricks workspaces, and Delta Lake, which provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Similarly, in Microsoft Fabric, we will examine its integration capabilities, data management, and analytics services to handle large-scale data efficiently.

The project’s analytical phase will include data ingestion, cleansing, integration, and transformation. Following these preparatory steps, we will focus on advanced analytics, employing techniques such as regression models to predict future trends based on historical data. This hands-on comparison aims to not only highlight each platform’s technical merits but also to demonstrate their practical application in a real-world business scenario, providing valuable insights into which platform might better suit different organizational needs.

Enable Unity Catalog : in the Account Console — Databricks (azuredatabricks.net), create a metastore and assign it to the workspace.

Note : This action is irreversible !

Let’s now try to achieve this :

Step 1 : Create Storage Credentials :

Let’s create a Storage Credential. First, we need an Access Connector for Azure Databricks created in Azure :

This connector gives us a managed identity.

Managed identities are effectively special types of Microsoft Entra ID identities that are automatically managed by Azure. When you enable a managed identity for an Azure service (such as an Azure Virtual Machine, App Service, or Azure Functions), Entra ID creates an identity for that service instance in the background. This identity can be used to authenticate to any service that supports Entra ID (formerly Azure AD) authentication, without requiring you to embed credentials (like passwords or keys) in your code.

Then we need to assign a data-action role to our managed identity on the Azure Data Lake Storage account; in our case it will be Storage Blob Data Contributor :

Role Definition: A Storage Blob Data Contributor has permissions to manage Blob data in Azure Blob Storage. This role allows the user to perform a wide range of actions, including reading, writing, and deleting blobs.

Typical Use Cases: This role is suitable for users who need to upload new data, update existing files, or delete unnecessary blobs. It’s commonly assigned to users who manage day-to-day operations involving Blob storage.

Data Action: In Azure, “Data Action” typically refers to any operation that involves managing or manipulating data within a service. This can include actions such as reading, writing, and deleting data. The term is broadly used across various Azure services where data operations are performed. Data actions are central in defining roles and permissions, particularly when setting up access controls in Azure services.

Action in RBAC (Role-Based Access Control): “Action” within the context of Role-Based Access Control (RBAC) specifically refers to the operations that a user or service can perform, which are defined by their assigned role. RBAC is a method of regulating access to resources based on the roles of individual users within an organization. Each role is composed of a set of discrete permissions that can include managing access, creating resources, or modifying them. The term “action” in this context is part of defining those permissions.

For example, in RBAC, a specific “action” might be defined as Microsoft.Storage/storageAccounts/read, which would allow the role to read storage account details but not modify them.

Step 2 - Create an External Location :

As you can see, we need containers, so let’s create this architecture :

medallion will contain our medallion model (three folders : Bronze, Silver, and Gold);

landing will contain the files landed from the data source;

checkpoint will hold the checkpoints we will use later.

Let’s now create the External Locations using a notebook that will be used to set up the project :
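
Here is a minimal sketch of what that setup cell can look like, assuming the three containers above, with placeholder names for the storage account and the credential from Step 1 :

# Register one external location per container. The storage account and
# credential names are placeholders, replace them with your own.
storage_account = "mystorageaccount"
credential = "my_credential"

for container in ["medallion", "landing", "checkpoint"]:
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {container}_ext_location
        URL 'abfss://{container}@{storage_account}.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL {credential})
        COMMENT 'External location for the {container} container'
    """)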

As a result, we will have our External Locations :

Once this is done, our connections are ready. Let’s now try to achieve the following :

Step 3 — Let’s create the dev Catalog :

Using our Setup Python Notebook, we can add the following :
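
A minimal sketch of that cell, assuming the catalog is simply named dev; pinning its managed storage to the medallion container is an assumption, so adjust it to your own layout :

# Create the dev catalog. The MANAGED LOCATION clause is optional and
# the storage account name is a placeholder.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS dev
    MANAGED LOCATION 'abfss://medallion@mystorageaccount.dfs.core.windows.net/'
""")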

Running it, the result would be :

Step 4 — Create Schemas :

Now for the second level, Schemas. Let’s create the Bronze, Silver, and Gold schemas using our Setup Notebook by adding :
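
A sketch of that cell, creating the three medallion schemas inside the dev catalog :

# Create the bronze, silver, and gold schemas in the dev catalog
for schema in ["bronze", "silver", "gold"]:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS dev.{schema}")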

Which gives this :

Step 5 — Create Bronze Tables :

We are using data from the Wholesale & Retail Orders Dataset (kaggle.com) :

Orders.csv :

Products-Supplier.csv :

Let’s now create the bronze tables for these files :
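
A minimal sketch of the ingestion cell, assuming the two CSVs sit at the root of the landing container (file names from the dataset, storage account as a placeholder); the actual notebook may register the tables differently :

# Load each landing CSV and save it as a bronze table in the dev catalog
storage_account = "mystorageaccount"
landing = f"abfss://landing@{storage_account}.dfs.core.windows.net"

for table, file in [("orders", "Orders.csv"),
                    ("products_supplier", "Products-Supplier.csv")]:
    (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(f"{landing}/{file}")
        .write
        .mode("overwrite")
        .saveAsTable(f"dev.bronze.{table}"))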

Which gives :

Full Project Setup Notebook :

Stay tuned for part 2 !
