Understanding Databricks Managed and External Tables: A Comprehensive Guide

Understanding Databricks Managed and External Tables: A Comprehensive Guide

Introduction

In the dynamic landscape of data analytics and processing, Databricks has emerged as a cornerstone platform, empowering organizations to extract valuable insights from vast datasets with unparalleled efficiency. Founded by the creators of Apache Spark™, Databricks offers a unified analytics platform that seamlessly integrates data engineering, data science, and machine learning workflows, revolutionizing the way teams collaborate and innovate with data.

Within the Databricks ecosystem, the management of data tables serves as a fundamental component, shaping how data is stored, accessed, and manipulated across various analytical tasks. When working with Delta Lake tables, users are presented with a choice between two distinct types of tables: managed and unmanaged. In this article, we'll delve into the intricacies of these table types, exploring their unique characteristics, use cases, and implications within the Databricks environment.

Managed Tables: Simplified Data Management Within Databricks

Managed tables represent the backbone of Databricks' table management system. These tables are stored within the platform's managed storage, either at the metastore, catalog, or schema level, depending on how the schema and catalog are configured, providing a seamless and integrated solution for storing and querying structured data.

With managed tables, users enjoy the simplicity of centralized management, where Databricks handles the underlying storage and metadata operations transparently. This approach streamlines workflows, enabling users to focus on analysis rather than infrastructure management.

Managed tables are the default way to create tables in Unity Catalog. Unity Catalog manages the lifecycle and file layout for these tables. You should not use tools outside of Azure Databricks to manipulate files in these tables directly.

  • Managed tables always use the Delta table format.

  • When a managed table is dropped, its underlying data is deleted from your cloud tenant within 30 days.

💡
Note: While it's possible to create managed tables without utilizing Unity Catalog, it's strongly recommended to use Unity Catalog for managing tables efficiently within the Databricks environment. This approach ensures streamlined organization, simplified management, and enhanced collaboration across your data assets.

Creating a Managed Table

Requirements

Using SQL:
Unity Catalog provides a three-level namespace for organizing data: catalogs, schemas (also called databases), and tables and views.

<catalog>.<schema>.<table>

CREATE TABLE IF NOT EXISTS  quickstart_catalog.quickstart_schema.quickstart_table
  (columnA Int, columnB String) PARTITIONED BY (columnA);

INSERT INTO TABLE  quickstart_catalog.quickstart_schema.quickstart_table
VALUES
  (1, "one"),
  (2, "two");

create managed table using SQL refer this notebook

create managed table using Python refer this notebook

To drop the Table

DROP TABLE IF EXISTS catalog_name.schema_name.table_name;

External Tables: Decoupling Data Storage for Enhanced Flexibility

In contrast to managed tables, external tables offer a flexible approach to data storage within Databricks, enabling users to decouple data storage from compute resources. External tables serve as pointers to data stored externally, typically in cloud storage services such as AWS S3, Azure Blob Storage, or Google Cloud Storage. This decoupling provides several advantages, including enhanced data independence, cost-effectiveness, and the ability to access data stored in different locations or formats.

  • External tables within Databricks can accommodate a variety of file formats, offering flexibility and compatibility for diverse data needs. These formats include: DELTA, CSV, JSON, AVRO, PARQUET, ORC, and TEXT.

  • When you execute the DROP TABLE command on an external table, the Unity Catalog doesn't remove the actual data stored externally. Instead, it removes only the table structure and metadata.

Create an external table

The data in an external table is stored in a path on your cloud tenant. To work with external tables, Unity Catalog introduces two objects to access and work with external cloud storage:

  • A storage credential contains an authentication method for accessing a cloud storage location.

  • An external location maps a storage credential with a cloud storage path to which it grants access. The external location grants access only to that cloud storage path and its contents.

💡
For further details on storage credentials and external locations, please refer to my article.

To create an external table, you must have:

  • The CREATE EXTERNAL TABLE privilege on an external location that grants access to the LOCATION accessed by the external table.

  • The USE SCHEMA permission on the table’s parent schema.

  • The USE CATALOG permission on the table’s parent catalog.

  • The CREATE TABLE permission on the table’s parent schema.

CREATE TABLE IF NOT EXISTS example_catalog.example_schema.trips_external
LOCATION 'abfss://<cloud_directory_path>'
AS SELECT * from samples.nyctaxi.trips;

Using Python

spark.sql("CREATE TABLE <catalog>.<schema>.<table-name> "
  "("
  "  <column-specification>"
  ") "
  "LOCATION 'abfss://<bucket-path>/<table-directory>'")
💡
create managed table using Python refer this notebook

comparison between managed tables and external tables

AspectManaged TableExternal Table
Storage LocationStored within Databricks environmentPointers to data stored externally (e.g., cloud storage)
Data ManagementFully managed by DatabricksRequires separate management of underlying data storage
Lifecycle ManagementDropping deletes underlying dataDropping does not delete externally stored data
Dropping TableBoth the table metadata and data are deleted from the storage layerOnly the table metadata is deleted, and the data remains intact in the external storage layer
File FormatsAlways uses Delta table formatSupports various file formats (e.g., CSV, JSON, Parquet)
Access ControlManaged within DatabricksMay involve managing permissions externally and internally
Data IndependenceTightly integrated with Databricks managed storageAllows access to data stored in different locations/formats
Cost ConsiderationMay incur additional costs for storage within DatabricksCan be more cost-effective by leveraging external storage

Conclusion

In conclusion, we've delved into the distinctions between managed and external tables within Databricks. Managed tables offer centralized management and reliability, making them ideal for transactional systems, collaborative data analysis, and structured data warehousing. On the other hand, external tables provide flexibility, scalability, and cost efficiency, catering to needs such as data lakes, big data processing, data integration, and cost optimization. Understanding and effectively utilizing table management in Databricks is essential for efficient data processing and analytics. By leveraging the strengths of managed and external tables, organizations can streamline workflows, enhance collaboration, and unlock valuable insights from their data assets.