Navigating Unity Catalog: Storage Credentials and External Locations


Introduction

Unity Catalog is a vital component for managing data assets and resources within Azure Databricks. As data platforms scale in complexity, efficient management of cloud storage access becomes crucial. Unity Catalog provides functionality for handling storage credentials and external locations, offering data teams flexibility and control over where data lives and who can reach it. In this article, we delve into the significance of storage credentials and external locations in Unity Catalog and explore how they help streamline your workflow.

Understanding Storage Credentials

  • A storage credential is a securable object representing an Azure managed identity or Microsoft Entra ID service principal.

  • Once a storage credential is created, access to it can be granted to principals (users and groups), as shown in the SQL sketch after this list.

  • Storage credentials are primarily used to create external locations, which scope access to a specific storage path.

  • Storage credential names are unqualified and must be unique within the metastore.
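
Granting a credential to principals is a one-line SQL statement. The following is a minimal sketch, assuming a credential named `field_demos_credential` (the name reused in the examples later in this article) and a hypothetical group called `data_engineers`:

-- Allow a group to create external locations that use this credential
GRANT CREATE EXTERNAL LOCATION ON STORAGE CREDENTIAL `field_demos_credential` TO `data_engineers`;

-- Review the privileges currently granted on the credential
SHOW GRANTS ON STORAGE CREDENTIAL `field_demos_credential`;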

Imagine you're building a data project on Azure Databricks that relies heavily on files stored in Azure Data Lake Storage. To access those files securely from your workspace, you need a way to authenticate to Azure.

In Unity Catalog, you set up what's called a storage credential. This is like a key or a passcode that grants your workspace permission to access specific resources in the storage account.

For instance, let's say you create a storage credential called "ProjectCredential". This credential essentially represents your project's identity within Azure.

Once this storage credential is set up, you can specify which resources or paths in the storage account your project can access. For example, you might grant access to a specific container or folder where all your data files are stored.

Furthermore, you can decide who else in your team or organization can use this storage credential. You can grant access to individual users or groups, allowing them to work collaboratively on the project.

When your workloads need to read data from the storage account, they rely on this storage credential, which proves their identity to Azure. With the correct permissions granted to the credential, your clusters and SQL warehouses can seamlessly access the required data, ensuring a secure and efficient workflow.

Create a storage credential using a managed identity

You can use either an Azure managed identity or a service principal as the identity that authorizes access to your storage container. Managed identities are strongly recommended. They have the benefit of allowing Unity Catalog to access storage accounts protected by network rules, which isn’t possible using service principals, and they remove the need to manage and rotate secrets.

  • In the Azure portal, create an Azure Databricks access connector and assign it permissions on the storage container that you would like to access, using the instructions in Configure a managed identity for Unity Catalog (Access connector).

  • Log in to your Unity Catalog-enabled Azure Databricks workspace as a user with the CREATE STORAGE CREDENTIAL privilege.

    This privilege is included in roles like metastore admin and account admin.

  • Click the Catalog icon within your Azure Databricks workspace.

  • Click the + Add button within the Catalog.

  • Select "Add a storage credential" from the menu. This option only appears if you have the CREATE STORAGE CREDENTIAL privilege.

  • Enter a name for the credential, and enter the access connector’s resource ID in the Access connector resource ID field, in the format:

      /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
    
  • (Optional) If you created the access connector using a user-assigned managed identity, enter the resource ID of the managed identity in the User-assigned managed identity ID field, in the format:

      /subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<managed-identity-name>
    

By following these steps in the Databricks UI, you can navigate to the storage credentials section and view the storage credential you created:
Data → Catalog → Storage Credentials
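
You can also inspect storage credentials from a notebook or the SQL editor. A minimal sketch, assuming the credential is named `field_demos_credential` as in the examples below:

-- List all storage credentials visible to you in the metastore
SHOW STORAGE CREDENTIALS;

-- Show the properties (owner, identity, comment) of a specific credential
DESCRIBE STORAGE CREDENTIAL `field_demos_credential`;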

Understanding external locations

Let's simplify the concept of an external location using an analogy:

Think of an external location as a secure vault where you store valuable items, like treasures in a treasure chest. This vault is locked and can only be accessed by those who hold the key (the storage credential).

In our case, the vault represents the external location, and the key represents the storage credential. So, to access the treasures (data) stored within the vault (external location), you need the key (storage credential).

When you create an external location, you're essentially setting up a new vault and assigning it a name. You also attach a specific key (storage credential) to this vault to authorize access.

Now, the person who creates this vault becomes its initial owner. As the owner, you have the authority to change the name of the vault, its location (URI), and even the key (storage credential) attached to it.

Once the vault is set up, you can decide who else gets access to it. You can grant permission to individual users or groups, allowing them to unlock the vault and access its contents.

The beauty of this system is that those who have permission to use the vault can access any treasures (data) stored within it, without needing direct access to the key (storage credential). This ensures security because users don't need to handle sensitive credentials directly.

To add an extra layer of security or control, you can further refine access by using external tables. These act as additional compartments within the vault, allowing you to encapsulate access to specific files or data within the external location.

Lastly, each vault (external location) is given a unique name, and no two vaults can have the same name within the system. Additionally, the system ensures that no vault's storage path overlaps with another vault's storage path or with any external tables' storage paths using explicit storage credentials. This prevents confusion and maintains the integrity of the storage system.
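
To make the external-table idea concrete, here is a small sketch. It assumes the external location created later in this article (path abfss://deltalake@oneenvadls.dfs.core.windows.net/external_location/) already exists, and the catalog, schema, table, and subfolder names are hypothetical, used purely for illustration:

-- An external table whose files live in a subfolder of the external location
CREATE TABLE main.demo.sales_raw (id INT, amount DOUBLE)
  LOCATION 'abfss://deltalake@oneenvadls.dfs.core.windows.net/external_location/sales_raw';

-- Consumers query the table without ever touching the storage credential
SELECT * FROM main.demo.sales_raw LIMIT 10;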

Creating an external location

Accessing external cloud storage is easily done using external locations.

This can be done using three simple SQL commands:

  1. First, create a storage credential. It contains the managed identity or service principal required to access your cloud storage.

  2. Then, create an external location using your storage credential. It can point to any cloud storage path (including a subfolder).

  3. Finally, grant your users permissions on this external location.

In this example, we will create the EXTERNAL LOCATION using the following path:
abfss://deltalake@oneenvadls.dfs.core.windows.net/external_location/

💡 The general path format is:
abfss://ContainerName@StorageAccountName.dfs.core.windows.net/external_location/subFolder

Using SQL inside a Notebook

CREATE EXTERNAL LOCATION [IF NOT EXISTS] `<location-name>`
URL '<bucket-path>'
WITH ([STORAGE] CREDENTIAL `<storage-credential-name>`)
[COMMENT '<comment-string>'];

CREATE EXTERNAL LOCATION IF NOT EXISTS `field_demos_external_location`
  URL 'abfss://deltalake@oneenvadls.dfs.core.windows.net/external_location/'
  WITH (CREDENTIAL `field_demos_credential`) -- storage credential name
  COMMENT 'External Location for demos';

You can create an external location manually using Catalog Explorer.

To create the external location:

  1. Click Catalog to open Catalog Explorer.

  2. Click the + Add button and select Add an external location.

  3. Enter an External location name.

  4. Optionally copy the container path from an existing mount point (Azure Data Lake Storage Gen2 containers only).

  5. If you aren’t copying from an existing mount point, use the URL field to enter the storage container path as mentioned above.

  6. Select the storage credential that grants access to the external location.

  7. Click Create.
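
Whichever method you use, you can confirm the result with SQL. A quick sketch, using the location name from the SQL example above:

-- List all external locations in the metastore
SHOW EXTERNAL LOCATIONS;

-- Show the URL, storage credential, and owner of a specific location
DESCRIBE EXTERNAL LOCATION `field_demos_external_location`;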

GRANT permissions on the external location

GRANT READ FILES, WRITE FILES 
ON EXTERNAL LOCATION `field_demos_external_location` --External location name
TO `account users`; -- users or groups
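
To double-check what was granted, you can list the grants on the location. A minimal sketch:

-- Review the privileges granted on the external location
SHOW GRANTS ON EXTERNAL LOCATION `field_demos_external_location`;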

Accessing the data

# Reading data from the external location using the PySpark API
spark.read.csv('abfss://deltalake@oneenvadls.dfs.core.windows.net/external_location/test_write') \
  .display()
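
The same data can be reached from SQL. A sketch, assuming you hold READ FILES on the external location:

-- List the files under the external location (requires READ FILES)
LIST 'abfss://deltalake@oneenvadls.dfs.core.windows.net/external_location/';

-- Query the CSV files directly by path
SELECT * FROM csv.`abfss://deltalake@oneenvadls.dfs.core.windows.net/external_location/test_write`;
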
  • To learn more, please refer to the following notebook.

Why we use Storage credentials and External Locations

Storage credentials

  1. Authentication: Storage credentials authenticate Databricks clusters or notebooks to access Azure storage.

  2. Authorization: They control access levels to data based on roles and permissions.

  3. Secure Transfer: Credentials ensure secure data transfer between Databricks and Azure storage.

  4. Encryption: They enable encryption of data at rest in Azure storage.

  5. Integration: Storage credentials facilitate integration with other Azure services for seamless data workflows.

External Locations

  1. Ownership and Modification: The creator initially owns the external location and can modify its attributes, including name, URI, and storage credential.

  2. Access Control: Access to external locations can be granted to user or group principals, allowing them to access storage paths within the location without direct access to the storage credential.

  3. Access Refinement: Granular access control can be achieved using external tables to encapsulate access to individual files within an external location.

  4. Uniqueness and Containment: External location names must be unique within the system, and storage paths cannot overlap with other external locations or external tables' paths using explicit storage credentials.

Advantages of Storage Credentials:

  1. Security: Storage credentials ensure that only authorized users or applications can access storage resources, enhancing data security.

  2. Authentication Flexibility: They support different identity types, such as Azure managed identities and Microsoft Entra ID service principals, providing flexibility to developers.

  3. Granular Access Control: Access can be controlled at a granular level, allowing specific permissions to be assigned to different users or groups.

  4. Integration: Storage credentials enable seamless integration with external storage services, facilitating efficient data management workflows.

  5. Scalability: As projects grow, storage credentials facilitate scalable access to storage resources, accommodating increasing data volumes and user demands.

Advantages of External Locations:

  1. Centralized Management: External locations provide a centralized interface for managing storage paths and credentials, simplifying administration and configuration.

  2. Access Control Management: They enable centralized management of access controls, allowing permissions to be assigned or revoked easily across multiple storage paths.

  3. Enhanced Collaboration: External locations facilitate collaboration by providing a unified access point to shared storage resources, enabling teams to work on projects collaboratively.

  4. Security Isolation: External locations isolate storage credentials from direct access, reducing the risk of credential exposure and enhancing overall security posture.

  5. Flexibility: External locations offer flexibility in organizing and accessing storage resources, accommodating diverse storage requirements and usage scenarios.

Conclusion

In conclusion, storage credentials and external locations in Databricks on Azure are crucial components for securely accessing and managing data. By creating storage credentials and external locations, users can authenticate, organize, and control access to storage resources efficiently. External locations provide a centralized interface for accessing data, while storage credentials ensure secure authentication. Granting permissions to external locations enables granular access control, enhancing security and collaboration. However, while these features offer numerous advantages such as enhanced security and centralized management, they may also introduce complexities and potential overhead. Overall, storage credentials and external locations play a vital role in enabling secure, efficient, and collaborative data workflows within Databricks on Azure.