Delta Lake: From Parquet to a Reliable Lakehouse

To understand Delta Lake, let’s start from scratch. Imagine a typical data lake: a vast storage system filled with data files. To make querying this data efficient, companies often use Apache Parquet, a columnar file format. However, this approach has inherent problems: if a job writing new data fails, it can leave behind corrupted files. Updating or deleting a single record is extremely difficult, and there are no built-in guarantees of data integrity.

Delta Lake was created to solve these exact problems. It’s an open-source storage layer that adds a layer of reliability and database-like features on top of the data files in your lake.

The Foundation: Understanding Parquet

Parquet is an open-source, columnar storage file format. Instead of storing data in rows (like a CSV), it stores it in columns. This is highly efficient for analytics because a query only needs to read the specific columns it cares about, dramatically speeding up performance.

However, a Parquet file is just that—a file. It has no awareness of transactions, data versions, or consistency. It’s a great format for storing data, but it’s not a reliable system on its own.

The Solution: The Delta Lake Layer

Delta Lake doesn’t replace Parquet; it enhances it. The core idea is simple but powerful: A Delta Table stores its data in Parquet files but adds a transaction log to manage those files and provide reliability.

Physically, a Delta Table is a directory in your cloud storage that contains two things:

  1. Parquet Files: The actual data, stored efficiently.
  2. A _delta_log Directory: This is the “brain” of the table. It’s a log that records every single transaction (every INSERT, UPDATE, DELETE, MERGE) that has ever happened to the table.

This log is the single source of truth. When you query a Delta Table, the query engine first checks the log to determine exactly which Parquet files represent the most current, valid version of the data.

The Flow: Creating a Delta Table

This diagram shows how raw data is transformed into a reliable Delta Table. A job reads the source data and writes it out in the “delta” format, which creates both the Parquet data files and the transaction log.

                  +-----------------+
                  | Raw Data Source |
                  |  (JSON, CSV)    |
                  +-----------------+
                          |
                          |
                  +-----------------+
                  |   Spark Job     |
                  |  (Read & Write) |
                  +-----------------+
                          |
                          |  Writes data in "delta" format
                          |
          +-------------------------------------+
          |       Delta Table Directory         |
          |                                     |
          |  +----------------+  +------------+ |
          |  |  Data Files    |  | _delta_log | |
          |  | (Parquet)      |  | (JSON      | |
          |  | file1.parquet  |  |  commits)  | |
          |  | file2.parquet  |  | 00000.json | |
          |  | ...            |  | 00001.json | |
          |  +----------------+  +------------+ |
          |                                     |
          +-------------------------------------+

Storage and Redundancy

Does converting raw data (like CSV/JSON) into a Delta Table create redundant data? Yes, and this is an intentional and highly recommended architectural pattern.

In a well-designed Lakehouse (like one using the Medallion Architecture), you intentionally keep separate copies of the data at different stages of quality:

  • Raw Storage (Bronze Layer): The original, untouched raw files (e.g., CSV, JSON) are kept in a “landing” or “bronze” zone. This serves as your permanent, immutable backup. If a bug is ever found in your processing logic, you can always replay your pipelines from this raw source of truth.
  • Delta Table (Silver/Gold Layer): The Delta table, containing optimized Parquet files, is your high-performance, analytics-ready version. This is the version that BI tools, analysts, and data scientists will query.

This “redundancy” is a feature, not a bug. It separates the responsibility of immutable, raw data archiving from high-performance, reliable querying.

Feature Comparison: Delta Lake vs. Standard Parquet

The addition of the transaction log provides numerous benefits over using Parquet files alone.

Feature                          | Standard Parquet Files | Delta Lake
ACID Transactions                | No                     | Yes
Time Travel (Data Versioning)    | No                     | Yes
Schema Enforcement               | No                     | Yes
MERGE, UPDATE, DELETE Support    | No                     | Yes
Data Corruption from Failed Jobs | High risk              | Prevented by atomic commits

Key Features Explained

DML Operations (MERGE, UPDATE, DELETE): Delta Lake enables standard SQL commands to modify the data in your lake. The MERGE command is especially powerful for “upsert” operations, which are common when syncing data from transactional systems.
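
For instance, a minimal sketch of an upsert, assuming hypothetical customers (target) and customer_updates (source) tables keyed by customer_id:

MERGE INTO customers AS target
USING customer_updates AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.email = source.email
WHEN NOT MATCHED THEN
  INSERT (customer_id, email) VALUES (source.customer_id, source.email);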

ACID Transactions: Delta Lake guarantees that operations are atomic, consistent, isolated, and durable. This means a job that modifies the table will either complete entirely or fail without leaving any partial, corrupted data behind.

Time Travel: Since the log contains a full history of changes, you can query the data as it appeared at a previous version or time. This is critical for audits, debugging, and reproducing results. For example: SELECT * FROM my_table VERSION AS OF 5.

Schema Enforcement & Evolution: Delta Lake protects data quality by rejecting writes that don’t match the table’s schema. It also allows for schema evolution, meaning you can add new columns to a table as your data needs change.
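
As a small sketch, assuming a hypothetical sales_table with columns order_id and amount: a write that does not match the table’s schema is rejected, while an explicit schema change evolves the table so the write can succeed.

-- This insert is rejected because 'discount' is not part of the table's schema
-- INSERT INTO sales_table (order_id, amount, discount) VALUES (1, 19.99, 0.10);

-- Evolve the schema explicitly; the same insert then succeeds
ALTER TABLE sales_table ADD COLUMN discount DOUBLE;
INSERT INTO sales_table (order_id, amount, discount) VALUES (1, 19.99, 0.10);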

Delta Tables

A Delta Table is the standard table format used in the Databricks Lakehouse Platform. It is not a separate type of server or database system; rather, it is a specific structure of files stored in your data lake that combines data files in the Parquet format with a transaction log to add reliability, performance, and governance features.

The Anatomy of a Delta Table

Physically, a Delta Table is simply a directory of files in your cloud storage. This directory contains two key components:

  • Data Files (Parquet Files): This is the actual data of the table, stored as one or more Parquet files. It’s important to note that these are not the original raw data files (like CSVs or JSONs). The raw data is the source that is read and then converted into the highly optimized, columnar Parquet format during the ingestion process. These resulting Parquet files are what live inside the Delta Table directory.
  • Transaction Log (_delta_log): This is the “brain” of a Delta Table. It is a subdirectory that contains an ordered collection of JSON and checkpoint files. This log records every transaction that has ever modified the table, providing a complete history of changes and enabling features like ACID transactions and Time Travel. It is the presence of this log that distinguishes a Delta Table from a standard directory of Parquet files.

Creating and Using Delta Tables

Delta Tables can be created and manipulated using both Spark APIs and standard SQL.

Writing to a Delta Table

You can create a Delta Table by saving a Spark DataFrame in the delta format.

Python Example:

# Assume 'sales_df' is an existing Spark DataFrame
sales_df.write.format("delta").mode("overwrite").save("/mnt/data/sales")

SQL Example:

CREATE TABLE sales_table AS SELECT * FROM temporary_sales_view;

Reading from a Delta Table

Reading data is just as straightforward.

Python Example:

sales_data = spark.read.format("delta").load("/mnt/data/sales")

SQL Example:

SELECT * FROM sales_table;

The Data Catalog: Organizing Your Tables

For a Delta Table to be queryable by name, its metadata (schema, location, etc.) must be registered in a data catalog. Databricks offers two main catalog options.

  • Hive Metastore: The traditional catalog provided by Databricks. Its primary limitation is that it is scoped to a single workspace. Tables registered in one workspace’s Hive Metastore are not visible or accessible from another workspace by default.
  • Unity Catalog: The modern, centralized governance solution for Databricks. Unity Catalog exists at the account level and can be attached to multiple workspaces. This allows you to manage data, permissions, auditing, and lineage in one place, providing a unified view of your data across the entire organization.

You can set your active context in SQL using the USE command:


-- Set the active catalog to 'main'
USE CATALOG main;

-- Set the active schema (database) within the 'main' catalog
USE SCHEMA production_data;

Managed vs. Unmanaged Tables

When creating a Delta Table in Databricks, it can be either managed or unmanaged. This is a critical distinction for data governance and lifecycle management.

  • Managed Tables: This is the default type. For a managed table, Databricks manages both the metadata (the table definition in the metastore) and the underlying data files. When you drop a managed table, Databricks deletes both the metadata and the data files from your cloud storage.
  • Unmanaged (External) Tables: For an unmanaged table, Databricks manages the metadata, but you specify the LOCATION of the data files in your cloud storage. If you drop an unmanaged table, only the metadata is removed. The data files remain untouched in your storage location, which prevents accidental data loss and allows multiple systems to query the same underlying data.

Feature                          | Managed Table               | Unmanaged Table
Metadata managed by Databricks   | Yes                         | Yes
Data files managed by Databricks | Yes                         | No
DROP TABLE deletes data files    | Yes                         | No
Typical Use Case                 | Temporary or scratch tables | Primary data lake tables

Exploring and Inspecting Tables

Databricks provides several UI and command-based tools to explore and understand your Delta Tables.

Catalog Explorer

The Catalog Explorer is a UI-based tool in the Databricks workspace that lets you browse all data assets. You can navigate through catalogs, schemas, and tables, view table schemas and sample data, and manage permissions.

Notebook Cell Behavior

When running a cell in a Databricks notebook, only the output of the very last statement is rendered. If you want to see the output of multiple commands, they must be placed in separate cells.

DESCRIBE DETAIL Command

This command provides key physical metadata about a Delta Table, including its format (delta), location path in cloud storage, creation time, and the number of files that make up the current version of the table.
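
For example, using the sales_table created earlier:

DESCRIBE DETAIL sales_table;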

DESCRIBE HISTORY Command

This command allows you to inspect the transaction log of a Delta Table. It shows a history of every operation (write, update, delete, merge), including the user who performed it, the timestamp, the version number created, and the operation parameters.
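
For example:

DESCRIBE HISTORY sales_table;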

Creating and Managing Delta Tables

ETL Jobs and Ownership

Delta Tables are typically created and maintained by automated ETL/ELT jobs. These jobs are built by data engineering teams to ingest raw source data (from files or streaming sources), clean and transform it, and load it into a series of Delta Tables (often following a Bronze-Silver-Gold medallion architecture). These jobs are scheduled and run using a tool like Databricks Workflows, ensuring data is kept fresh and reliable.

How Updates Work: Immutability and Redundancy

A foundational property of the underlying data files is immutability: the Parquet files that store a table’s data are never modified once written. This raises a key question: how do UPDATE or DELETE operations work?

Delta Lake uses a “copy-on-write” mechanism. When a command modifies rows, Delta performs the following steps:

  1. It identifies the Parquet file(s) containing the rows to be changed.
  2. It reads that entire file into memory.
  3. It writes a new Parquet file containing all the unchanged data from the original file plus the modified data.
  4. It records a transaction in the _delta_log that “un-marks” the old file and “marks” the new file as part of the current table version.

The old Parquet file is not immediately deleted from storage; it is simply ignored by any new queries. This creates a temporary form of data redundancy. These old, unreferenced files can be permanently deleted from cloud storage using the VACUUM command, which is often run periodically to reclaim storage space.
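
A quick way to observe this, sketched here against a hypothetical sales_table with region and amount columns, is to run an update and then inspect the most recent history entry, whose operation metrics report the files that were added and removed:

-- Modify a few rows; Delta rewrites the affected files rather than editing them in place
UPDATE sales_table SET amount = amount * 1.1 WHERE region = 'EMEA';

-- The newest entry's operationMetrics show the files added and removed by copy-on-write
DESCRIBE HISTORY sales_table LIMIT 1;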

Advanced Delta Lake Concepts

Beyond the basics, Delta Lake provides a powerful set of features for managing, optimizing, and governing your data tables throughout their lifecycle. This chapter covers advanced concepts that are crucial for production data engineering workloads.

Time Travel: Querying Past Data

Delta Lake’s transaction log maintains a complete history of all changes made to a table. This enables Time Travel, which is the ability to query a table as it existed at a specific point in the past. This is not just a backup; it’s a live queryable feature.

TIMESTAMP AS OF

This syntax allows you to query the state of a table at a specific timestamp. Databricks will resolve the latest version of the table that was valid at or before the given timestamp.

Example:

-- Query the state of the table as it was on July 1st, 2025
SELECT * FROM sales_data TIMESTAMP AS OF '2025-07-01';

VERSION AS OF

Every transaction on a Delta table creates a new version number. You can query a specific version directly.

Example:

-- Query version 52 of the table
SELECT * FROM sales_data VERSION AS OF 52;

@ Syntax

A shorthand notation can be used directly in file paths when reading data with Spark. This is useful for comparing two versions of a table.

Example:

# Read version 51 of the table using the @ syntax
df_v51 = spark.read.format("delta").load("/path/to/sales_data@v51")
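
The point of loading a specific version is usually to compare it with another one. A minimal SQL sketch of such a comparison, using the VERSION AS OF form shown earlier and assuming sales_data is registered in the catalog:

-- Rows present in the current version but not in version 51
SELECT * FROM sales_data
EXCEPT
SELECT * FROM sales_data VERSION AS OF 51;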

RESTORE TABLE: Reverting to a Previous State

While Time Travel is for querying, RESTORE is used to revert a table to a previous state. This operation creates a new commit in the transaction log that restores the table to the specified version or timestamp. It’s an undo operation for your data.

Example using TIMESTAMP AS OF:

RESTORE TABLE sales_data TO TIMESTAMP AS OF '2025-07-01';

Example using VERSION AS OF:

RESTORE TABLE sales_data TO VERSION AS OF 52;

Performance Optimization

Delta Lake includes powerful commands for optimizing the physical layout of your data to improve query performance.

Compacting Small Files with OPTIMIZE

Data ingestion processes, especially streaming ones, can create many small files. This “small file problem” can significantly slow down query performance. The OPTIMIZE command addresses this by reading the small files and rewriting them into fewer, larger Parquet files.

Example:

OPTIMIZE sales_data;

This command can and should be run on a schedule (e.g., daily or hourly) as a regular maintenance job to keep table performance high.

Indexing with ZORDER BY

ZORDER is a technique that co-locates related data in the same set of files. It’s a form of multi-dimensional indexing. When you Z-order by columns that you frequently use in query filter predicates (i.e., in your WHERE clauses), Databricks can dramatically reduce the amount of data it needs to scan, leading to significant performance gains. ZORDER is performed as part of the OPTIMIZE command.

Example:

If queries often filter by sale_date and product_category, you would run:

OPTIMIZE sales_data ZORDER BY (sale_date, product_category);

Table Maintenance

REFRESH TABLE

This command is primarily used for tables that are defined on top of a cloud storage path. If an external process outside of Databricks changes the files in that path, the cached metadata can become stale. REFRESH TABLE invalidates that cached metadata so that subsequent queries reflect the current state of the files at the table’s location.

Example:

REFRESH TABLE sales_data;

VACUUM: Cleaning Up Old Files

When you update or delete data in a Delta table, the old data files are not immediately deleted; they are just marked as no longer part of the current version. The VACUUM command is used to permanently delete these old, unreferenced files from cloud storage. This is essential for managing storage costs and for compliance.

The command includes a RETAIN clause to prevent deleting files that are still within a certain time window. This is a safety mechanism to ensure that long-running queries or Time Travel operations do not fail. The default retention period is 7 days.

Example:

-- Delete old files, keeping only those written within the last 100 hours
-- Note: retaining less than the default 7 days requires disabling the safety check
-- (spark.databricks.delta.retentionDurationCheck.enabled = false)
VACUUM sales_data RETAIN 100 HOURS;

Like OPTIMIZE, VACUUM can and should be run on a regular schedule. However, it’s critical to set a retention period that is long enough for your operational needs: deleting files is irreversible, and Time Travel to versions that depend on the removed files is no longer possible.

Relational Entities in Databricks

Databricks organizes data using a three-level namespace: catalog.schema.table. This structure is fundamental to how data is governed and accessed. The information about these entities—their names, column data types, properties, and physical storage locations—is stored as metadata. This metadata is kept in a backend relational database, such as PostgreSQL, which is managed by Databricks within its control plane.
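
For example, a fully qualified reference spells out all three levels (the table name here is hypothetical):

-- catalog.schema.table
SELECT * FROM prod_catalog.sales.transactions;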

Catalog

A catalog is the top-most level in the data hierarchy.

  • In Unity Catalog (UC), the catalog is the primary unit of data isolation and governance. An organization might create separate catalogs for different environments (e.g., dev_catalog, prod_catalog).
  • In the legacy Hive Metastore, there is effectively only one catalog per workspace (hive_metastore), so the namespace is often treated as a two-level structure (schema.table).

Database (Schema)

In Databricks, the terms database and schema are synonymous and can be used interchangeably. A schema is a logical collection of tables, views, and functions that exists within a catalog.

A database can be assigned a default storage path, making it an external database.


CREATE DATABASE my_project.sales_data
LOCATION 's3://my-company-bucket/data/sales';

Tables, Locations, and Inheritance

A table is the entity that contains the actual data. It’s crucial to understand that the data files that constitute a Delta Table are Parquet files. These are created by processing your original raw source data (e.g., CSV, JSON), which typically resides in a separate landing location.

The distinction between managed and unmanaged (external) tables is determined by the LOCATION keyword and its interaction with the parent database’s location.

Unmanaged (External) Tables

An unmanaged table is created when you explicitly use the LOCATION keyword in the CREATE TABLE statement. This tells Databricks to only manage the table’s metadata while the data files reside in the specified path. If you drop an unmanaged table, the Parquet files at its location remain untouched. This is the recommended approach for primary data tables.

Managed Tables and Inheritance

A managed table is created without the LOCATION keyword. If you create a managed table within a standard database (one with no location), Databricks manages the location for you.

However, if you create a managed table inside an external database, an important inheritance rule applies: the table is still considered “managed,” but its data files will be stored in a subdirectory within the database’s external location. Dropping this table will delete its specific subdirectory of Parquet files, but not other data within the database’s location.

Example Scenario

Create an external database:

CREATE DATABASE external_db LOCATION 's3://my-bucket/external_db_path';

Create a managed table inside it (inherits location):

CREATE TABLE external_db.managed_table_in_external_db (id INT); 
-- This table's data is stored at 's3://my-bucket/external_db_path/managed_table_in_external_db' 
-- `DROP TABLE` will delete this subdirectory.

Create an unmanaged table inside it (overrides location):

CREATE TABLE external_db.unmanaged_table (id INT) LOCATION 's3://my-bucket/another_path/unmanaged_table_data'; 
-- This table's data is stored at its own unique path. 
-- `DROP TABLE` will not delete these files.

Advanced Table Management in Delta Lake

Differentiating Data, Log, and Metadata

To understand Delta Lake, it’s essential to differentiate between its three core components: data files, the transaction log, and metadata.

  • Data Files: These files store the actual data values in rows and columns. They are Parquet files, an open-source columnar format. They are stored in the table’s root directory in your cloud storage, often organized into subdirectories if the table is partitioned.
  • Transaction Log (_delta_log): This is the authoritative source of truth that defines the current state of a Delta Table. It records every transaction (write, update, delete) as an ordered commit, which enables ACID properties, versioning, and Time Travel. It is a subdirectory named _delta_log within the table’s root directory.
  • Metadata: This is the information that makes a table discoverable and queryable by name. It stores the table’s name, schema (column names and data types), partition information, and a pointer to the physical location of the data files. It is stored in a relational database managed within the Databricks control plane, accessible through either the Hive Metastore or Unity Catalog.

Restoring a Dropped Table

Whether a dropped table’s metadata can be restored depends on whether the table was managed or unmanaged.

  • If the table was unmanaged (external), DROP TABLE only deletes the metadata. Since the underlying data and transaction log files are untouched, you can restore the table’s metadata by running a CREATE TABLE command pointing to the same LOCATION, as shown in the example below.
  • If the table was managed, DROP TABLE deletes both the metadata and the underlying data files. In this case, recovery is generally not possible outside of backups or Unity Catalog’s UNDROP TABLE feature, which can recover a dropped managed table within a limited retention window.
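
A minimal sketch of the unmanaged case, assuming the hypothetical path below still contains the table’s Parquet files and _delta_log:

-- Re-register the dropped external table by pointing at its existing Delta directory
CREATE TABLE sales_external
USING DELTA
LOCATION 's3://my-company-bucket/data/sales_external';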

CREATE TABLE AS SELECT (CTAS) Statements

A CREATE TABLE AS SELECT (CTAS) statement is a command that creates a new table and populates it with data from a SELECT query in a single atomic operation. The schema of the new table is automatically inferred from the schema of the query result.

It is conceptually similar to constructs like CREATE STREAM AS SELECT in ksqlDB, as it both defines and populates an entity from a query result simultaneously.

A CTAS table is a point-in-time snapshot of the query result. It is not a materialized view. New data arriving in the source table after the CTAS statement is run will not be automatically propagated to the new table. To refresh the data, the table must be manually updated or recreated.
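
For example, a sketch that builds a snapshot table from a query (table names are hypothetical):

CREATE TABLE main.reporting.high_value_orders AS
SELECT order_id, customer_id, amount
FROM main.sales.orders
WHERE amount > 1000;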

Table Properties and Constraints

COMMENT

The COMMENT keyword adds a human-readable description to a table, which is visible in the Catalog Explorer and when using DESCRIBE TABLE.

PARTITIONED BY in Detail

The PARTITIONED BY keyword physically organizes the table’s data into subdirectories based on the values of the specified columns. For example, PARTITIONED BY (country) would create directories like /country=USA/, /country=CA/, etc., within the table’s main data directory.

  • The Benefit (Data Skipping): The primary advantage is performance improvement through data skipping. When a query includes a filter on a partition column (e.g., WHERE country = 'USA'), the query engine can read the partition information from the metadata and knows to scan only the files within the /country=USA/ directory, completely ignoring all other partitions. This can reduce the amount of data scanned from terabytes to gigabytes.
  • When to Use It: Partitioning is most effective on columns that have low-to-medium cardinality (a reasonable number of distinct values) and are frequently used in WHERE clauses. Good candidates include dates, countries, or regions.
  • When Not to Use It (The Anti-Pattern): Do not partition by a column with very high cardinality, like user_id or transaction_id. This will create millions of tiny partitions, each containing very small files. This “small file problem” harms performance because the overhead of discovering and opening many small files outweighs the benefit of data skipping.

Example:

CREATE TABLE sales (
  order_id INT,
  sale_amount DOUBLE,
  sale_date DATE
)
USING DELTA
PARTITIONED BY (sale_date)
COMMENT 'Sales data partitioned by date for efficient time-based queries.';

ALTER TABLE

The ALTER TABLE command is used to change the schema or properties of an existing table.

Examples:

-- Add a new column to the customers table
ALTER TABLE customers ADD COLUMN loyalty_tier STRING;

-- Rename a table
ALTER TABLE customers RENAME TO clients;

-- Drop a column
ALTER TABLE clients DROP COLUMN old_address;

-- Set a table property to enable optimized writes
ALTER TABLE sales SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');

Table Constraints

Delta Lake supports standard SQL constraints to enforce data quality.

  • The CHECK keyword is used to define a constraint that all rows in the table must satisfy.

When an INSERT or UPDATE operation attempts to commit data that violates a CHECK constraint, the entire transaction fails, and Databricks returns an error. The data is not written. It is the responsibility of the data producer (the job or user running the command) to catch this error and handle the invalid data accordingly.

Example:

ALTER TABLE clients ADD CONSTRAINT valid_email CHECK (email LIKE '%@%.%');
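
Assuming the clients table has id and email columns, a write that violates the constraint fails atomically and leaves the table unchanged:

-- Fails with a CHECK constraint violation; no rows are written
INSERT INTO clients (id, email) VALUES (42, 'not-an-email');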

Cloning Delta Tables

Cloning creates a copy of a Delta table at a specific version.

DEEP CLONE

A deep clone is a fully independent copy of a table. It copies both the metadata and the data files from the source to a new location.

SHALLOW CLONE

A shallow clone creates a copy of the source table’s metadata and transaction log, but it does not copy the data files. Instead, the new table’s transaction log points to the Parquet files of the original table.

  • Why is it needed? It allows you to create an instant, zero-cost copy of a table for development or testing without any risk to the production source table.
  • File Location: When you write to a shallow clone, it begins creating its own new Parquet files in its own table directory, not the source directory. This “copy-on-write” behavior ensures that the source table remains completely isolated and untouched. The shallow clone only reads from the source files until those specific files are modified in the clone.
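
A minimal sketch of both clone types, assuming sales_data is the source table and the new table names are hypothetical:

-- Fully independent copy of the data and metadata
CREATE TABLE sales_data_backup DEEP CLONE sales_data;

-- Metadata-only copy that initially references the source's Parquet files
CREATE TABLE sales_data_dev SHALLOW CLONE sales_data;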

Views, Spark, and Execution Concepts in Databricks

Understanding Views in Databricks

A view in Databricks is a saved query that can be queried like a standard table. It is a virtual table that does not store any data itself; instead, it stores the query logic. When you run a query against a view, the underlying query definition is executed against the actual source tables at that moment.

Types of Views

Databricks supports three types of views, each with a different scope and lifecycle.

Feature          | View                       | Temporary View         | Global Temp View
Stored In        | Catalog (Metastore)        | SparkSession           | Spark application (global_temp schema)
Scope            | Catalog/Schema (Permanent) | Notebook/Job (Session) | Cluster (Spark Application)
CREATE Statement | CREATE VIEW                | CREATE TEMPORARY VIEW  | CREATE GLOBAL TEMPORARY VIEW

Standard View (CREATE VIEW)

A standard view is a permanent database object registered in the metastore. It is visible and queryable by anyone who has the appropriate permissions on the catalog and schema where it is created.

  • Scenario: Creating a simplified or secure view of a complex table for business analysts.
  • Example:


CREATE VIEW main.reporting.daily_sales_summary AS
SELECT
  sale_date,
  SUM(sale_amount) as total_sales
FROM main.sales.transactions
GROUP BY sale_date;

Temporary View (CREATE TEMPORARY VIEW)

A temporary view exists only for the duration of the current SparkSession. This means it is private to the notebook or job that creates it and is dropped automatically when the session ends.

  • Scenario: Breaking down a complex data transformation within a single notebook without cluttering the main data catalog.
  • Example:


CREATE TEMPORARY VIEW filtered_customers AS
SELECT * FROM main.gold.customers WHERE country = 'SE';

Global Temporary View (CREATE GLOBAL TEMPORARY VIEW)

A global temporary view’s scope is the entire Spark application (the cluster). It is accessible across all notebooks attached to the same cluster but must be referenced using the special global_temp schema.

  • Scenario: Sharing an intermediate dataset between two different notebooks running on the same cluster.
  • Example:


-- In Notebook 1:
CREATE GLOBAL TEMPORARY VIEW shared_promo_codes AS
SELECT * FROM main.gold.promotions WHERE is_active = true;

-- In Notebook 2 (on the same cluster):
SELECT * FROM global_temp.shared_promo_codes;

Handling View Creation with Different Permissions

Imagine you have read-only access to a production schema (prod_catalog.sales) but have write access to your own development schema (dev_catalog.my_dev_space).

  • To create a standard view, you must create it in a location where you have write permissions. You can query data from schemas where you have read access and define the view in your writable schema:

    CREATE VIEW dev_catalog.my_dev_space.emea_sales AS
    SELECT * FROM prod_catalog.sales.transactions
    WHERE region = 'EMEA';
  • Temporary and Global Temporary views do not require write permissions on any catalog or schema because they are not saved in the metastore. They exist only in the temporary memory of the Spark application.
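
For instance, with only read access to the production schema, this temporary view (a sketch reusing the same hypothetical tables) requires no write permissions at all:

CREATE TEMPORARY VIEW emea_sales_tmp AS
SELECT * FROM prod_catalog.sales.transactions
WHERE region = 'EMEA';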

The Role of Apache Spark in Databricks

Apache Spark is the core distributed computation engine that powers the Databricks platform. It is the technology that actually executes your data transformations and queries in a parallel fashion across a cluster. Databricks enhances Spark with the Photon engine, a high-performance C++ based vectorized engine that transparently accelerates Spark operations.

The SparkSession

The SparkSession is the entry point to all Spark functionality. In Databricks, it’s automatically created and available in notebooks as the spark variable. When you run a command, it is sent to the Spark driver via the SparkSession, which then plans and distributes the work to the cluster.

  • Example:


# Use the 'spark' session to read a table into a DataFrame
df = spark.read.table("main.reporting.daily_sales_summary")
df.show()

The Execution Environment

Attaching and Detaching Notebooks

Attaching a notebook to a cluster connects it to a running Spark application and its SparkSession. Detaching disconnects it, but the cluster remains running and can be used by other notebooks or jobs.

Installing Python Packages

There are two primary ways to install Python packages:

  • Notebook-scoped (%pip install): Recommended for interactive development, this installs the library only for the current notebook session, providing isolation.
    • Example:
    %pip install scikit-learn==1.4.2
  • Cluster-level: Installed via the Cluster UI, this makes a library available to all notebooks and jobs running on that cluster. It is suitable for shared, standard libraries.
