Introduction
In the dynamic landscape of data analytics and processing, Databricks has emerged as a cornerstone platform, empowering organizations to extract valuable insights from vast datasets with unparalleled efficiency. Founded by the creators of Apache Spark™, Databricks offers a unified analytics platform that seamlessly integrates data engineering, data science, and machine learning workflows, revolutionizing the way teams collaborate and innovate with data.
Within the Databricks ecosystem, the management of data tables serves as a fundamental component, shaping how data is stored, accessed, and manipulated across various analytical tasks. When working with Delta Lake tables, users are presented with a choice between two distinct types of tables: managed and external (also called unmanaged). In this article, we'll delve into the intricacies of these table types, exploring their unique characteristics, use cases, and implications within the Databricks environment.
Managed Tables: Simplified Data Management Within Databricks
Managed tables represent the backbone of Databricks' table management system. These tables are stored in the platform's managed storage, at the metastore, catalog, or schema level, depending on how the catalog and schema are configured. This provides a seamless, integrated solution for storing and querying structured data.
With managed tables, users enjoy the simplicity of centralized management, where Databricks handles the underlying storage and metadata operations transparently. This approach streamlines workflows, enabling users to focus on analysis rather than infrastructure management.
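Managed storage locations can be set when a catalog or schema is created. A minimal sketch in SQL, assuming an external location already grants Unity Catalog access to the abfss:// path (the names here mirror the quickstart example used later in this article):
-- Set a managed storage location for a catalog and, optionally, a schema
CREATE CATALOG IF NOT EXISTS quickstart_catalog
MANAGED LOCATION 'abfss://<cloud_directory_path>';
CREATE SCHEMA IF NOT EXISTS quickstart_catalog.quickstart_schema
MANAGED LOCATION 'abfss://<cloud_directory_path>';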
Managed tables are the default way to create tables in Unity Catalog. Unity Catalog manages the lifecycle and file layout for these tables. You should not use tools outside of Databricks to manipulate files in these tables directly.
Managed tables always use the Delta table format.
When a managed table is dropped, its underlying data is deleted from your cloud tenant within 30 days.
Creating a Managed Table
Requirements
The workspace must be attached to a Unity Catalog metastore. See Set up and manage Unity Catalog.
Your notebook must be attached to a cluster that uses Databricks Runtime 11.3 LTS or above in shared or single-user access mode. See Create clusters & SQL warehouses with Unity Catalog access.
Using SQL:
Unity Catalog provides a three-level namespace for organizing data: catalogs, schemas (also called databases), and tables and views.
<catalog>.<schema>.<table>
CREATE TABLE IF NOT EXISTS quickstart_catalog.quickstart_schema.quickstart_table
(columnA INT, columnB STRING) PARTITIONED BY (columnA);
INSERT INTO quickstart_catalog.quickstart_schema.quickstart_table
VALUES
(1, 'one'),
(2, 'two');
To create a managed table using SQL, refer to this notebook.
To create a managed table using Python, refer to this notebook.
To drop the table:
DROP TABLE IF EXISTS catalog_name.schema_name.table_name;
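Since dropping a managed table schedules its underlying data for deletion, Unity Catalog also offers a recovery path: a recently dropped managed table can be restored with UNDROP within a limited retention window. A minimal sketch:
UNDROP TABLE catalog_name.schema_name.table_name;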
External Tables: Decoupling Data Storage for Enhanced Flexibility
In contrast to managed tables, external tables offer a flexible approach to data storage within Databricks, enabling users to decouple data storage from compute resources. External tables serve as pointers to data stored externally, typically in cloud storage services such as AWS S3, Azure Blob Storage, or Google Cloud Storage. This decoupling provides several advantages, including enhanced data independence, cost-effectiveness, and the ability to access data stored in different locations or formats.
External tables within Databricks can accommodate a variety of file formats, offering flexibility and compatibility for diverse data needs. These formats include: DELTA, CSV, JSON, AVRO, PARQUET, ORC, and TEXT.
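For example, declaring the format with a USING clause creates an external table over existing files. A sketch, assuming CSV files with a header row already sit at the target path and that the privileges described in the next section are in place (the table name here is hypothetical):
CREATE TABLE IF NOT EXISTS example_catalog.example_schema.trips_csv
USING CSV
OPTIONS (header 'true')
LOCATION 'abfss://<cloud_directory_path>';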
When you execute the DROP TABLE command on an external table, Unity Catalog doesn't remove the actual data stored externally. Instead, it removes only the table's definition and metadata.
Create an external table
The data in an external table is stored in a path on your cloud tenant. To work with external tables, Unity Catalog introduces two objects to access and work with external cloud storage:
A storage credential contains an authentication method for accessing a cloud storage location.
An external location maps a storage credential with a cloud storage path to which it grants access. The external location grants access only to that cloud storage path and its contents.
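A sketch of creating the external location in SQL, assuming a storage credential named example_credential has already been set up (storage credentials are typically created through Catalog Explorer or by an account admin; the location name here is hypothetical):
CREATE EXTERNAL LOCATION IF NOT EXISTS example_location
URL 'abfss://<cloud_directory_path>'
WITH (STORAGE CREDENTIAL example_credential);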
To create an external table, you must have the following privileges (sample GRANT statements follow this list):
The CREATE EXTERNAL TABLE privilege on an external location that grants access to the LOCATION accessed by the external table.
The USE SCHEMA permission on the table’s parent schema.
The USE CATALOG permission on the table’s parent catalog.
The CREATE TABLE permission on the table’s parent schema.
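As a sketch, these privileges might be granted to a hypothetical data_engineers group as follows:
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION example_location TO data_engineers;
GRANT USE CATALOG ON CATALOG example_catalog TO data_engineers;
GRANT USE SCHEMA ON SCHEMA example_catalog.example_schema TO data_engineers;
GRANT CREATE TABLE ON SCHEMA example_catalog.example_schema TO data_engineers;
With these privileges in place, you can create an external table, for example with CREATE TABLE ... AS SELECT (CTAS):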
CREATE TABLE IF NOT EXISTS example_catalog.example_schema.trips_external
LOCATION 'abfss://<cloud_directory_path>'
AS SELECT * FROM samples.nyctaxi.trips;
Using Python:
spark.sql("CREATE TABLE <catalog>.<schema>.<table-name> "
"("
" <column-specification>"
") "
"LOCATION 'abfss://<bucket-path>/<table-directory>'")
Comparison Between Managed Tables and External Tables
| Aspect | Managed Table | External Table |
| --- | --- | --- |
| Storage Location | Stored in Databricks-managed storage | Pointer to data stored externally (e.g., cloud storage) |
| Data Management | Fully managed by Databricks | Requires separate management of the underlying storage |
| Dropping a Table | Both metadata and underlying data are deleted (data removed within 30 days) | Only metadata is deleted; data remains intact in external storage |
| File Formats | Always uses the Delta table format | Supports various file formats (e.g., Delta, CSV, JSON, Parquet) |
| Access Control | Managed within Databricks | May involve managing permissions both externally and in Databricks |
| Data Independence | Tightly integrated with Databricks-managed storage | Allows access to data stored in different locations/formats |
| Cost Consideration | May incur additional storage costs within Databricks | Can be more cost-effective by leveraging existing external storage |
Conclusion
We've delved into the distinctions between managed and external tables within Databricks. Managed tables offer centralized management and reliability, making them ideal for transactional systems, collaborative data analysis, and structured data warehousing. External tables, on the other hand, provide flexibility, scalability, and cost efficiency, catering to needs such as data lakes, big data processing, data integration, and cost optimization. Understanding and effectively utilizing table management in Databricks is essential for efficient data processing and analytics. By leveraging the strengths of both table types, organizations can streamline workflows, enhance collaboration, and unlock valuable insights from their data assets.