Unity Catalog Unveiled: Exploring the Core Components of Data Isolation

Unity Catalog Unveiled: Exploring the Core Components of Data Isolation

Navigating the Depths of Unity Catalog's Infrastructure for Seamless Data Management

Unity Catalog is a fine-grained governance solution for data and AI on the Databricks platform. It helps simplify security and governance of your data by providing a central place to administer and audit data access. Delta Sharing is a secure data sharing platform that lets you share data in Azure Databricks with users outside your organization. It uses Unity Catalog to manage and audit sharing behavior.

Data governance and data isolation building blocks

To develop a data governance model and a data isolation plan that works for your organization, it helps to understand the primary building blocks that are available to you when you create your data governance solution in Azure Databricks.

The following diagram illustrates the primary data hierarchy in Unity Catalog.

Unity Catalog object model diagram

The objects in that hierarchy include the following:

Metastore: A metastore is the top-level container of objects in Unity Catalog. Metastores live at the account level and function as the top of the pyramid in the Azure Databricks data governance model.

All Databricks objects, such as tables, views, and the permissions governing their access, are stored in a metastore. Databricks account admins can create one metastore for each region in which they operate, and assign them to multiple Azure Databricks workspaces in the same region. Metastore admins can manage all objects in the metastore. They don’t have direct access to read and write to tables registered in the metastore, but they do have indirect access through their ability to transfer data object ownership.

  • Metastores provide regional isolation but are not intended as units of data isolation. Data isolation should begin at the catalog level.

Catalog: Catalogs are the highest level in the data hierarchy (catalog > schema > table/view/volume) managed by the Unity Catalog metastore. They are intended as the primary unit of data isolation in a typical Azure Databricks data governance model.

Catalogs represent a logical grouping of schemas, usually bounded by data access requirements. Catalogs often mirror organizational units or software development lifecycle scopes. You may choose, for example, to have a catalog for production data and a catalog for development data, or a catalog for non-customer data and one for sensitive customer data.

By default, access permissions for a securable object are inherited by the children of that object, with catalogs at the top of the hierarchy. This makes it easier to set up default access rules for your data and to specify different rules at each level of the hierarchy only where you need them.

Schema (Database): Schemas, also known as databases, are logical groupings of tabular data (tables and views), non-tabular data (volumes), functions, and machine learning models. They give you a way to organize and control access to data that is more granular than catalogs. Typically they represent a single use case, project, or team sandbox.

  • Schemas can be stored in the same physical storage as the parent catalog, or you can configure a schema to be stored separately from the rest of the parent catalog.

  • Metastore admins, parent catalog owners, and schema owners can manage access to schemas.

Tables: Tables reside in the third layer of Unity Catalog’s three-level namespace. They contains rows of data.

Unity Catalog lets you create managed tables and external tables.

  • For managed tables, Unity Catalog fully manages the lifecycle and file layout. By default, managed tables are stored in the root storage location that you configure when you create a metastore. You can choose instead to isolate storage for managed tables at the catalog or schema levels.

  • External tables are tables whose data lifecycle and file layout are managed using your cloud provider and other data platforms, not Unity Catalog. Typically you use external tables to register large amounts of your existing data, or if you also require write access to the data using tools outside of Azure Databricks clusters and Databricks SQL warehouses. Once an external table is registered in a Unity Catalog metastore, you can manage and audit Azure Databricks access to it just like you can with managed tables.

Parent catalog owners and schema owners can manage access to tables, as can metastore admins (indirectly).

Views: A view is a read-only object derived from one or more tables and views in a metastore.

  • Rows and columns: Row and column-level access, along with data masking, is granted using either dynamic views or row filters and column masks. Dynamic views are read-only.

Volumes: Volumes reside in the third layer of Unity Catalog’s three-level namespace. They manage non-tabular data. You can use volumes to store, organize, and access files in any format, including structured, semi-structured, and unstructured data. Files in volumes cannot be registered as tables.

Models and functions: Although they are not, strictly speaking, data assets, registered models and user-defined functions can also be managed in Unity Catalog and reside at the lowest level in the object hierarchy. See Manage model lifecycle in Unity Catalog and User-defined functions (UDFs) in Unity Catalog.

Users can only gain access to data based on specified access rules

In Unity Catalog, users are granted access to data based on predefined rules. These rules are essential for organizations to adhere to internal policies and regulatory standards. For instance, sensitive data like employee salaries or credit card details requires stringent protection. Unity Catalog offers precise control over who can access which data within the catalog, ensuring compliance with industry regulations.

With Unity Catalog's controls, users are restricted to viewing and querying only the data they're authorized to access. This means that individuals can interact with data relevant to their roles and permissions, preventing unauthorized access to sensitive information. Additionally, these access controls are regularly audited to maintain data security and integrity over time. In essence, Unity Catalog empowers organizations to enforce strict data access policies, safeguarding sensitive information while promoting efficient data utilization.

  • In the centralized governance model, your governance administrators are owners of the metastore and can take ownership of any object and grant and revoke permissions.

Unity Catalog ownership and access

💡
Regardless of whether you choose the metastore or catalogs as your data domain, Databricks strongly recommends that you set a group as the metastore admin or catalog owner.

Data is physically separated in storage

An organization can require that data of certain types be stored within specific accounts or buckets in their cloud tenant.

Unity Catalog gives the ability to configure storage locations at the metastore, catalog, or schema level to satisfy such requirements.

To illustrate, suppose your company enforces a policy dictating that HR production data must reside in the container abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net. In Unity Catalog, you can accomplish this by creating a dedicated catalog named, for example, hr_prod. By assigning the storage location abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net/unity-catalog to this catalog, any tables or datasets created within it will automatically store their data in the specified container.

Unity Catalog storage hierarchy

The SQL command for creating a catalog named hr_prod with a managed location

CREATE CATALOG hr_prod MANAGED LOCATION 
'abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net/unity-catalog';
💡
When creating a catalog with a managed location in Unity Catalog, it's essential to ensure that the appropriate permissions are granted to access the specified location.

If such a storage isolation is not required, you can set a storage location at the metastore level. The result is that this location serves as a default location for storing managed tables and volumes across catalogs and schemas in the metastore.

The system evaluates the hierarchy of storage locations from schema to catalog to metastore.

For example, if a table myCatalog.mySchema.myTable is created in my-region-metastore, the table storage location is determined according to the following rule:

  1. If a location has been provided for mySchema, it will be stored there.

  2. If not, and a location has been provided on myCatalog, it will be stored there.

  3. Finally, if no location has been provided on myCatalog, it will be stored in the location associated with the my-region-metastore.

Each object within the catalog has an owner, who can read and modify its contents. To read any data within the metastore, you must have the following permissions, according to Databricks:

  • SELECT on the table or view

  • USAGE on the schema that owns the table

  • USAGE on the catalog that owns the schema

We will delve deeper into permission requirements in our upcoming articles.

Conclusion

In summary, Unity Catalog offers organizations precise control over data governance and isolation, ensuring compliance with regulations and internal policies. With features like metastores and customizable storage locations, Unity Catalog empowers efficient data management while safeguarding sensitive information. As organizations strive to optimize their data strategies, Unity Catalog emerges as a vital solution for maximizing data potential and security.

Stay tune

Stay tuned for our upcoming articles where we'll delve into the crucial topics of storage credentials and external locations, further enhancing your understanding of data management within Unity Catalog.