Delta Lake: From Parquet to a Reliable Lakehouse
To understand Delta Lake, let’s start from scratch. Imagine a typical data lake: a vast storage system filled with data files. To make querying this data efficient, companies often use Apache Parquet, a columnar file format. However, this approach has inherent problems: if a job writing new data fails, it can leave behind corrupted files. Updating or deleting a single record is extremely difficult, and there are no built-in guarantees of data integrity.
Delta Lake was created to solve these exact problems. It’s an open-source storage layer that adds reliability and database-like features on top of the data files in your lake.
The Foundation: Understanding Parquet
Parquet is an open-source, columnar storage file format. Instead of storing data in rows (like a CSV), it stores it in columns. This is highly efficient for analytics because a query only needs to read the specific columns it cares about, dramatically speeding up performance.
However, a Parquet file is just that—a file. It has no awareness of transactions, data versions, or consistency. It’s a great format for storing data, but it’s not a reliable system on its own.
The Solution: The Delta Lake Layer
Delta Lake doesn’t replace Parquet; it enhances it. The core idea is simple but powerful: A Delta Table stores its data in Parquet files but adds a transaction log to manage those files and provide reliability.
Physically, a Delta Table is a directory in your cloud storage that contains two things:
- Parquet Files: The actual data, stored efficiently.
- A _delta_log Directory: This is the “brain” of the table. It’s a log that records every single transaction (every INSERT, UPDATE, DELETE, MERGE) that has ever happened to the table.
This log is the single source of truth. When you query a Delta Table, the query engine first checks the log to determine exactly which Parquet files represent the most current, valid version of the data.
The Flow: Creating a Delta Table
This diagram shows how raw data is transformed into a reliable Delta Table. A job reads the source data and writes it out in the “delta” format, which creates both the Parquet data files and the transaction log.
+-----------------+
| Raw Data Source |
| (JSON, CSV) |
+-----------------+
|
|
+-----------------+
| Spark Job |
| (Read & Write) |
+-----------------+
|
| Writes data in "delta" format
|
+-------------------------------------+
| Delta Table Directory |
| |
| +----------------+ +------------+ |
| | Data Files | | _delta_log | |
| | (Parquet) | | (JSON | |
| | file1.parquet | | commits) | |
| | file2.parquet | | 00000.json | |
| | ... | | 00001.json | |
| +----------------+ +------------+ |
| |
+-------------------------------------+
Storage and Redundancy
Does converting raw data (like CSV/JSON) into a Delta Table create redundant data? Yes, and this is an intentional and highly recommended architectural pattern.
In a well-designed Lakehouse (like one using the Medallion Architecture), you intentionally keep separate copies of the data at different stages of quality:
- Raw Storage (Bronze Layer): The original, untouched raw files (e.g., CSV, JSON) are kept in a “landing” or “bronze” zone. This serves as your permanent, immutable backup. If a bug is ever found in your processing logic, you can always replay your pipelines from this raw source of truth.
- Delta Table (Silver/Gold Layer): The Delta table, containing optimized Parquet files, is your high-performance, analytics-ready version. This is the version that BI tools, analysts, and data scientists will query.
This “redundancy” is a feature, not a bug. It separates the responsibility of immutable, raw data archiving from high-performance, reliable querying.
Feature Comparison: Delta Lake vs. Standard Parquet
The addition of the transaction log provides numerous benefits over using Parquet files alone.
Feature | Standard Parquet Files | Delta Lake |
--- | --- | --- |
ACID Transactions | ❌ | ✅ |
Time Travel (Data Versioning) | ❌ | ✅ |
Schema Enforcement | ❌ | ✅ |
MERGE, UPDATE, DELETE Support | ❌ | ✅ |
Risk of Data Corruption from Failed Jobs | High | Low (writes are atomic) |
Key Features Explained
DML Operations (MERGE, UPDATE, DELETE): Delta Lake enables standard SQL commands to modify the data in your lake. The MERGE command is especially powerful for “upsert” operations, which are common when syncing data from transactional systems.
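MERGE itself is not shown elsewhere in this section, so here is a minimal upsert sketch, assuming a hypothetical target table customers and a hypothetical staging source customer_updates that share an id column:
-- Upsert: update matching rows, insert new ones, in a single atomic transaction
MERGE INTO customers AS target
USING customer_updates AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;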
ACID Transactions: Delta Lake guarantees that operations are atomic, consistent, isolated, and durable. This means a job that modifies the table will either complete entirely or fail without leaving any partial, corrupted data behind.
Time Travel: Since the log contains a full history of changes, you can query the data as it appeared at a previous version or time. This is critical for audits, debugging, and reproducing results. For example: SELECT * FROM my_table VERSION AS OF 5.
Schema Enforcement & Evolution: Delta Lake protects data quality by rejecting writes that don’t match the table’s schema. It also allows for schema evolution, meaning you can add new columns to a table as your data needs change.
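As a minimal sketch of both behaviors (table and column names are hypothetical): a write that references a column missing from the table’s schema fails, and explicitly evolving the schema first allows it to succeed.
-- This insert fails because discount_pct does not exist in the table's schema
INSERT INTO sales_table (order_id, sale_amount, discount_pct) VALUES (1, 9.99, 0.10);
-- Evolving the schema first (adding the column) allows such writes to succeed
ALTER TABLE sales_table ADD COLUMN discount_pct DOUBLE;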
Delta Tables
A Delta Table is the standard table format used in the Databricks Lakehouse Platform. It is not a separate type of server or database system; rather, it is a specific structure of files stored in your data lake that combines data files in the Parquet format with a transaction log to add reliability, performance, and governance features.
The Anatomy of a Delta Table
Physically, a Delta Table is simply a directory of files in your cloud storage. This directory contains two key components:
- Data Files (Parquet Files): This is the actual data of the table, stored as one or more Parquet files. It’s important to note that these are not the original raw data files (like CSVs or JSONs). The raw data is the source that is read and then converted into the highly optimized, columnar Parquet format during the ingestion process. These resulting Parquet files are what live inside the Delta Table directory.
- Transaction Log (_delta_log): This is the “brain” of a Delta Table. It is a subdirectory that contains an ordered collection of JSON and checkpoint files. This log records every transaction that has ever modified the table, providing a complete history of changes and enabling features like ACID transactions and Time Travel. It is the presence of this log that distinguishes a Delta Table from a standard directory of Parquet files.
Creating and Using Delta Tables
Delta Tables can be created and manipulated using both Spark APIs and standard SQL.
Writing to a Delta Table
You can create a Delta Table by saving a Spark DataFrame in the delta format.
Python Example:
# Assume 'sales_df' is an existing Spark DataFrame
sales_df.write.format("delta").mode("overwrite").save("/mnt/data/sales")
SQL Example:
CREATE TABLE sales_table AS SELECT * FROM temporary_sales_view;
Reading from a Delta Table
Reading data is just as straightforward.
Python Example:
sales_data = spark.read.format("delta").load("/mnt/data/sales")
SQL Example:
SELECT * FROM sales_table;
The Data Catalog: Organizing Your Tables
For a Delta Table to be queryable by name, its metadata (schema, location, etc.) must be registered in a data catalog. Databricks offers two main catalog options.
- Hive Metastore: The traditional catalog provided by Databricks. Its primary limitation is that it is scoped to a single workspace. Tables registered in one workspace’s Hive Metastore are not visible or accessible from another workspace by default.
- Unity Catalog: The modern, centralized governance solution for Databricks. Unity Catalog exists at the account level and can be attached to multiple workspaces. This allows you to manage data, permissions, auditing, and lineage in one place, providing a unified view of your data across the entire organization.
You can set your active context in SQL using the USE command:
-- Set the active catalog to 'main'
USE CATALOG main;
-- Set the active schema (database) within the 'main' catalog
USE SCHEMA production_data;
Managed vs. Unmanaged Tables
When creating a Delta Table in Databricks, it can be either managed or unmanaged. This is a critical distinction for data governance and lifecycle management.
- Managed Tables: This is the default type. For a managed table, Databricks manages both the metadata (the table definition in the metastore) and the underlying data files. When you drop a managed table, Databricks deletes both the metadata and the data files from your cloud storage.
- Unmanaged (External) Tables: For an unmanaged table, Databricks manages the metadata, but you specify the LOCATION of the data files in your cloud storage. If you drop an unmanaged table, only the metadata is removed. The data files remain untouched in your storage location, which prevents accidental data loss and allows multiple systems to query the same underlying data.
Feature | Managed Table | Unmanaged Table |
--- | --- | --- |
Metadata managed by Databricks | ✅ | ✅ |
Data files managed by Databricks | ✅ | ❌ |
DROP TABLE deletes data files | ✅ | ❌ |
Typical Use Case | Temporary or scratch tables | Primary data lake tables |
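If you are unsure which kind an existing table is, one quick check (using the sales_table from the earlier examples) is DESCRIBE EXTENDED, whose output includes a Type row:
-- Look for the 'Type' row in the output: MANAGED or EXTERNAL
DESCRIBE EXTENDED sales_table;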
Exploring and Inspecting Tables
Databricks provides several UI and command-based tools to explore and understand your Delta Tables.
Catalog Explorer
The Catalog Explorer is a UI-based tool in the Databricks workspace that lets you browse all data assets. You can navigate through catalogs, schemas, and tables, view table schemas and sample data, and manage permissions.
Notebook Cell Behavior
When running a cell in a Databricks notebook, only the output of the very last statement is rendered. If you want to see the output of multiple commands, they must be placed in separate cells.
DESCRIBE DETAIL Command
This command provides key physical metadata about a Delta Table, including its format (delta), location path in cloud storage, creation time, and the number of files that make up the current version of the table.
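For example, against the sales_table created earlier:
-- Returns a single row with fields such as format, location, createdAt, and numFiles
DESCRIBE DETAIL sales_table;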
DESCRIBE HISTORY Command
This command allows you to inspect the transaction log of a Delta Table. It shows a history of every operation (write, update, delete, merge), including the user who performed it, the timestamp, the version number created, and the operation parameters.
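For example, to inspect the most recent commits on the sales_table created earlier:
-- Each row is one commit: version, timestamp, userName, operation, operationParameters, ...
DESCRIBE HISTORY sales_table LIMIT 5;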
Creating and Managing Delta Tables
ETL Jobs and Ownership
Delta Tables are typically created and maintained by automated ETL/ELT jobs. These jobs are built by data engineering teams to ingest raw source data (from files or streaming sources), clean and transform it, and load it into a series of Delta Tables (often following a Bronze-Silver-Gold medallion architecture). These jobs are scheduled and run using a tool like Databricks Workflows, ensuring data is kept fresh and reliable.
How Updates Work: Immutability and Redundancy
A foundational concept of the underlying data files is immutability. The Parquet files that store the table’s data are never changed once written. This leads to a key question: how do UPDATE or DELETE operations work?
Delta Lake uses a “copy-on-write” mechanism. When a command modifies rows, Delta performs the following steps:
- It identifies the Parquet file(s) containing the rows to be changed.
- It reads that entire file into memory.
- It writes a new Parquet file containing all the unchanged data from the original file plus the modified data.
- It records a transaction in the _delta_log that “un-marks” the old file and “marks” the new file as part of the current table version.
The old Parquet file is not immediately deleted from storage; it is simply ignored by any new queries. This creates a temporary form of data redundancy. These old, unreferenced files can be permanently deleted from cloud storage using the VACUUM command, which is often run periodically to reclaim storage space.
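To see this in action (column names are hypothetical), a single-row UPDATE on the sales_table from earlier produces new Parquet files and a new entry in the table history:
-- Copy-on-write: the affected file is rewritten and a new table version is committed
UPDATE sales_table SET sale_amount = 19.99 WHERE order_id = 42;

-- The latest history entry records the UPDATE and the version it created
DESCRIBE HISTORY sales_table LIMIT 1;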
Advanced Delta Lake Concepts
Beyond the basics, Delta Lake provides a powerful set of features for managing, optimizing, and governing your data tables throughout their lifecycle. This chapter covers advanced concepts that are crucial for production data engineering workloads.
Time Travel: Querying Past Data
Delta Lake’s transaction log maintains a complete history of all changes made to a table. This enables Time Travel, which is the ability to query a table as it existed at a specific point in the past. This is not just a backup; it’s a live queryable feature.
TIMESTAMP AS OF
This syntax allows you to query the state of a table at a specific timestamp. Databricks will resolve the latest version of the table that was valid at or before the given timestamp.
Example:
-- Query the state of the table as it was on July 1st, 2025
SELECT * FROM sales_data TIMESTAMP AS OF '2025-07-01';
VERSION AS OF
Every transaction on a Delta table creates a new version number. You can query a specific version directly.
Example:
-- Query version 52 of the table
SELECT * FROM sales_data VERSION AS OF 52;
@ Syntax
A shorthand notation can be used directly in file paths when reading data with Spark. This is useful for comparing two versions of a table.
Example:
# Read version 51 of the table using the @ syntax
df_v51 = spark.read.format("delta").load("/path/to/sales_data@v51")
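The same shorthand also works when a table is referenced by name in SQL, for example:
-- Query version 51 by appending @v51 to the table name
SELECT * FROM sales_data@v51;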
RESTORE TABLE: Reverting to a Previous State
While Time Travel is for querying, RESTORE is used to revert a table to a previous state. This operation creates a new commit in the transaction log that restores the table to the specified version or timestamp. It’s an undo operation for your data.
Example using TIMESTAMP AS OF:
RESTORE TABLE sales_data TO TIMESTAMP AS OF '2025-07-01';
Example using VERSION AS OF:
RESTORE TABLE sales_data TO VERSION AS OF 52;
Performance Optimization
Delta Lake includes powerful commands for optimizing the physical layout of your data to improve query performance.
Compacting Small Files with OPTIMIZE
Data ingestion processes, especially streaming ones, can create many small files. This “small file problem” can significantly slow down query performance. The OPTIMIZE command addresses this by reading the small files and rewriting them into fewer, larger Parquet files.
Example:
OPTIMIZE sales_data;
This command can and should be run on a schedule (e.g., daily or hourly) as a regular maintenance job to keep table performance high.
Indexing with ZORDER BY
ZORDER is a technique that co-locates related data in the same set of files. It’s a form of multi-dimensional indexing. When you Z-order by columns that you frequently use in query filter predicates (i.e., in your WHERE clauses), Databricks can dramatically reduce the amount of data it needs to scan, leading to significant performance gains. ZORDER is performed as part of the OPTIMIZE command.
Example:
If queries often filter by sale_date and product_category, you would run:
OPTIMIZE sales_data ZORDER BY (sale_date, product_category);
Table Maintenance
REFRESH TABLE
This command is primarily used for tables that are defined on top of a cloud storage path. If changes are made to the files in that path by an external process outside of Databricks, the table’s metadata can become stale. REFRESH TABLE updates the table’s metadata to reflect the current state of the files at its location.
Example:
REFRESH TABLE sales_data;
VACUUM: Cleaning Up Old Files
When you update or delete data in a Delta table, the old data files are not immediately deleted; they are just marked as no longer part of the current version. The VACUUM command is used to permanently delete these old, unreferenced files from cloud storage. This is essential for managing storage costs and for compliance.
The command includes a RETAIN clause to prevent deleting files that are still within a certain time window. This is a safety mechanism to ensure that long-running queries or Time Travel operations do not fail. The default retention period is 7 days.
Example:
-- Delete old files, keeping only those within the last 100 hours
VACUUM sales_data RETAIN 100 HOURS;
Like OPTIMIZE, VACUUM can and should be run on a regular schedule. However, it’s critical to set a retention period that is long enough for your operational needs, as deleting files is irreversible. Note that a retention window shorter than the default 168 hours (such as the 100-hour example above) is rejected by a safety check unless the spark.databricks.delta.retentionDurationCheck.enabled setting is explicitly disabled.
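Before a destructive run, a cautious pattern (same table as above) is a dry run, which lists the files that would be removed without deleting anything:
-- List candidate files without deleting them
VACUUM sales_data DRY RUN;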
Relational Entities in Databricks
Databricks organizes data using a three-level namespace: catalog.schema.table. This structure is fundamental to how data is governed and accessed. The information about these entities—their names, column data types, properties, and physical storage locations—is stored as metadata. This metadata is kept in a backend relational database, such as PostgreSQL, which is managed by Databricks within its control plane.
Catalog
A catalog is the top-most level in the data hierarchy.
- In Unity Catalog (UC), the catalog is the primary unit of data isolation and governance. An organization might create separate catalogs for different environments (e.g., dev_catalog, prod_catalog).
- In the legacy Hive Metastore, there is effectively only one catalog per workspace (hive_metastore), so the namespace is often treated as a two-level structure (schema.table).
Database (Schema)
In Databricks, the terms database and schema are synonymous and can be used interchangeably. A schema is a logical collection of tables, views, and functions that exists within a catalog.
A database can be assigned a default storage path, making it an external database.
CREATE DATABASE my_project.sales_data
LOCATION 's3://my-company-bucket/data/sales';
Tables, Locations, and Inheritance
A table is the entity that contains the actual data. It’s crucial to understand that the data files that constitute a Delta Table are Parquet files. These are created by processing your original raw source data (e.g., CSV, JSON), which typically resides in a separate landing location.
The distinction between managed and unmanaged (external) tables is determined by the LOCATION keyword and its interaction with the parent database’s location.
Unmanaged (External) Tables
An unmanaged table is created when you explicitly use the LOCATION keyword in the CREATE TABLE statement. This tells Databricks to only manage the table’s metadata while the data files reside in the specified path. If you drop an unmanaged table, the Parquet files at its location remain untouched. This is the recommended approach for primary data tables.
Managed Tables and Inheritance
A managed table is created without the LOCATION keyword. If you create a managed table within a standard database (one with no location), Databricks manages the location for you.
However, if you create a managed table inside an external database, an important inheritance rule applies: the table is still considered “managed,” but its data files will be stored in a subdirectory within the database’s external location. Dropping this table will delete its specific subdirectory of Parquet files, but not other data within the database’s location.
Example Scenario
Create an external database:
CREATE DATABASE external_db LOCATION 's3://my-bucket/external_db_path';
Create a managed table inside it (inherits location):
CREATE TABLE external_db.managed_table_in_external_db (id INT);
-- This table's data is stored at 's3://my-bucket/external_db_path/managed_table_in_external_db'
-- `DROP TABLE` will delete this subdirectory.
Create an unmanaged table inside it (overrides location):
CREATE TABLE external_db.unmanaged_table (id INT) LOCATION 's3://my-bucket/another_path/unmanaged_table_data';
-- This table's data is stored at its own unique path.
-- `DROP TABLE` will not delete these files.
Advanced Table Management in Delta Lake
Differentiating Data, Log, and Metadata
To understand Delta Lake, it’s essential to differentiate between its three core components: data files, the transaction log, and metadata.
- Data Files: These files store the actual data values in rows and columns. They are Parquet files, an open-source columnar format. They are stored in the table’s root directory in your cloud storage, often organized into subdirectories if the table is partitioned.
- Transaction Log (_delta_log): This is the authoritative source of truth that defines the current state of a Delta Table. It records every transaction (write, update, delete) as an ordered commit, which enables ACID properties, versioning, and Time Travel. It is a subdirectory named _delta_log within the table’s root directory.
- Metadata: This is the information that makes a table discoverable and queryable by name. It stores the table’s name, schema (column names and data types), partition information, and a pointer to the physical location of the data files. It is stored in a relational database managed within the Databricks control plane, accessible through either the Hive Metastore or Unity Catalog.
Restoring a Dropped Table
Whether a dropped table’s metadata can be restored depends on whether the table was managed or unmanaged.
- If the table was unmanaged (external), then DROP TABLE only deletes the metadata. Since the underlying data and transaction log files are untouched, you can restore the table’s metadata by running a CREATE TABLE command pointing to the same LOCATION, as sketched below.
- If the table was managed, DROP TABLE deletes both the metadata and the underlying data files. In this case, recovery is generally not possible outside of backups or Unity Catalog’s data retention feature.
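A minimal sketch of that recovery path (table name and path are hypothetical), assuming the Parquet files and _delta_log still exist at the original location; the schema is read back from the transaction log, so no column list is needed:
-- Re-register the dropped external table; its metadata is rebuilt from the existing _delta_log
CREATE TABLE sales_external
USING DELTA
LOCATION 's3://my-bucket/sales_external';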
CREATE TABLE AS SELECT (CTAS) Statements
A CREATE TABLE AS SELECT (CTAS) statement is a command that creates a new table and populates it with data from a SELECT query in a single atomic operation. The schema of the new table is automatically inferred from the schema of the query result.
It is conceptually similar to constructs like CREATE STREAM AS SELECT in ksqlDB, as it both defines and populates an entity from a query result simultaneously.
A CTAS table is a point-in-time snapshot of the query result. It is not a materialized view. New data arriving in the source table after the CTAS statement is run will not be automatically propagated to the new table. To refresh the data, the table must be manually updated or recreated.
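As a concrete sketch (source and target names are hypothetical), the following CTAS materializes a point-in-time aggregate of a sales table:
-- Creates and populates the table in one atomic operation; the schema is inferred from the query
CREATE TABLE daily_sales_snapshot AS
SELECT
  sale_date,
  SUM(sale_amount) AS total_sales
FROM sales_table
GROUP BY sale_date;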
Table Properties and Constraints
COMMENT
The COMMENT keyword adds a human-readable description to a table, which is visible in the Catalog Explorer and when using DESCRIBE TABLE.
PARTITIONED BY in Detail
The PARTITIONED BY keyword physically organizes the table’s data into subdirectories based on the values of the specified columns. For example, PARTITIONED BY (country) would create directories like /country=USA/, /country=CA/, etc., within the table’s main data directory.
- The Benefit (Data Skipping): The primary advantage is performance improvement through data skipping. When a query includes a filter on a partition column (e.g., WHERE country = 'USA'), the query engine can read the partition information from the metadata and knows to scan only the files within the /country=USA/ directory, completely ignoring all other partitions. This can reduce the amount of data scanned from terabytes to gigabytes.
- When to Use It: Partitioning is most effective on columns that have low-to-medium cardinality (a reasonable number of distinct values) and are frequently used in WHERE clauses. Good candidates include dates, countries, or regions.
- When Not to Use It (The Anti-Pattern): Do not partition by a column with very high cardinality, like user_id or transaction_id. This will create millions of tiny partitions, each containing very small files. This “small file problem” harms performance because the overhead of discovering and opening many small files outweighs the benefit of data skipping.
Example:
CREATE TABLE sales (
order_id INT,
sale_amount DOUBLE,
sale_date DATE
)
USING DELTA
PARTITIONED BY (sale_date)
COMMENT 'Sales data partitioned by date for efficient time-based queries.';
ALTER TABLE
The ALTER TABLE command is used to change the schema or properties of an existing table.
Examples:
-- Add a new column to the customers table
ALTER TABLE customers ADD COLUMN loyalty_tier STRING;
-- Rename a table
ALTER TABLE customers RENAME TO clients;
-- Drop a column (note: Delta requires column mapping, delta.columnMapping.mode = 'name', for DROP COLUMN)
ALTER TABLE clients DROP COLUMN old_address;
-- Set a table property to enable optimized writes
ALTER TABLE sales SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');
Table Constraints
Delta Lake supports standard SQL constraints to enforce data quality.
- The CHECK keyword is used to define a constraint that all rows in the table must satisfy.
When an INSERT or UPDATE operation attempts to commit data that violates a CHECK constraint, the entire transaction fails, and Databricks returns an error. The data is not written. It is the responsibility of the data producer (the job or user running the command) to catch this error and handle the invalid data accordingly.
Example:
ALTER TABLE clients ADD CONSTRAINT valid_email CHECK (email LIKE '%@%.%');
Cloning Delta Tables
Cloning creates a copy of a Delta table at a specific version.
DEEP CLONE
A deep clone is a fully independent copy of a table. It copies both the metadata and the data files from the source to a new location.
SHALLOW CLONE
A shallow clone creates a copy of the source table’s metadata and transaction log, but it does not copy the data files. Instead, the new table’s transaction log points to the Parquet files of the original table.
- Why is it needed? It allows you to create an instant, zero-cost copy of a table for development or testing without any risk to the production source table.
- File Location: When you write to a shallow clone, it begins creating its own new Parquet files in its own table directory, not the source directory. This “copy-on-write” behavior ensures that the source table remains completely isolated and untouched. The shallow clone only reads from the source files until those specific files are modified in the clone.
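A minimal sketch of both clone types, assuming the sales_data table from the earlier examples and hypothetical target names:
-- Deep clone: independent copy of both metadata and data files
CREATE TABLE sales_data_backup
DEEP CLONE sales_data;

-- Shallow clone: copies metadata only; its log initially points at the source's Parquet files
CREATE TABLE sales_data_dev
SHALLOW CLONE sales_data;

-- Either form can also clone a specific version:
-- CREATE TABLE sales_data_v52 DEEP CLONE sales_data VERSION AS OF 52;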
Views, Spark, and Execution Concepts in Databricks
Understanding Views in Databricks
A view in Databricks is a saved query that can be queried like a standard table. It is a virtual table that does not store any data itself; instead, it stores the query logic. When you run a query against a view, the underlying query definition is executed against the actual source tables at that moment.
Types of Views
Databricks supports three types of views, each with a different scope and lifecycle.
Feature | View | Temporary View | Global Temp View |
--- | --- | --- | --- |
Stored In | Catalog (Metastore) | SparkSession | Spark application (global_temp schema) |
Scope | Catalog/Schema (Permanent) | Notebook/Job (Session) | Cluster (Spark Application) |
CREATE Statement | CREATE VIEW | CREATE TEMPORARY VIEW | CREATE GLOBAL TEMPORARY VIEW |
Standard View (CREATE VIEW)
A standard view is a permanent database object registered in the metastore. It is visible and queryable by anyone who has the appropriate permissions on the catalog and schema where it is created.
- Scenario: Creating a simplified or secure view of a complex table for business analysts.
- Example:
CREATE VIEW main.reporting.daily_sales_summary AS
SELECT
sale_date,
SUM(sale_amount) as total_sales
FROM main.sales.transactions
GROUP BY sale_date;
Temporary View (CREATE TEMPORARY VIEW)
A temporary view exists only for the duration of the current SparkSession. This means it is private to the notebook or job that creates it and is dropped automatically when the session ends.
- Scenario: Breaking down a complex data transformation within a single notebook without cluttering the main data catalog.
- Example:
CREATE TEMPORARY VIEW filtered_customers AS
SELECT * FROM main.gold.customers WHERE country = 'SE';
Global Temporary View (CREATE GLOBAL TEMPORARY VIEW)
A global temporary view’s scope is the entire Spark application (the cluster). It is accessible across all notebooks attached to the same cluster but must be referenced using the special global_temp schema.
- Scenario: Sharing an intermediate dataset between two different notebooks running on the same cluster.
- Example:
-- In Notebook 1:
CREATE GLOBAL TEMPORARY VIEW shared_promo_codes AS
SELECT * FROM main.gold.promotions WHERE is_active = true;
-- In Notebook 2 (on the same cluster):
SELECT * FROM global_temp.shared_promo_codes;
Handling View Creation with Different Permissions
Imagine you have read-only access to a production schema (prod_catalog.sales) but have write access to your own development schema (dev_catalog.my_dev_space).
- To create a standard view, you must create it in a location where you have write permissions. You can query data from where you have read access and define the view in your writable schema.
CREATE VIEW dev_catalog.my_dev_space.emea_sales AS SELECT * FROM prod_catalog.sales.transactions WHERE region = 'EMEA';
- Temporary and Global Temporary views do not require write permissions on any catalog or schema because they are not saved in the metastore. They exist only in the temporary memory of the Spark application.
The Role of Apache Spark in Databricks
Apache Spark is the core distributed computation engine that powers the Databricks platform. It is the technology that actually executes your data transformations and queries in a parallel fashion across a cluster. Databricks enhances Spark with the Photon engine, a high-performance C++ based vectorized engine that transparently accelerates Spark operations.
The SparkSession
The SparkSession is the entry point to all Spark functionality. In Databricks, it’s automatically created and available in notebooks as the spark variable. When you run a command, it is sent to the Spark driver via the SparkSession, which then plans and distributes the work to the cluster.
- Example:
# Use the 'spark' session to read a table into a DataFrame
df = spark.read.table("main.reporting.daily_sales_summary")
df.show()
The Execution Environment
Attaching and Detaching Notebooks
Attaching a notebook to a cluster connects it to a running Spark application and its SparkSession
. Detaching disconnects it, but the cluster remains running and can be used by other notebooks or jobs.
Installing Python Packages
There are two primary ways to install Python packages:
- Notebook-scoped (%pip install): Recommended for interactive development, this installs the library only for the current notebook session, providing isolation.
- Example: %pip install scikit-learn==1.4.2
- Cluster-level: Installed via the Cluster UI, this makes a library available to all notebooks and jobs running on that cluster. It is suitable for shared, standard libraries.