data-engineering-patterns-fabric-databricks

600+ patterns and concepts for Azure Databricks, Microsoft Fabric, and PySpark data engineering - covering lakehouse architecture, Delta Lake, pipelines, and production best practices.

Skill file

Preview skill file
---
name: data-engineering-patterns-fabric-databricks
description: 600+ patterns and concepts for Azure Databricks, Microsoft Fabric, and PySpark data engineering - covering lakehouse architecture, Delta Lake, pipelines, and production best practices.
triggers:
  - show me data engineering patterns for Fabric
  - how do I implement Delta Lake best practices
  - what are the Azure Databricks cluster optimization patterns
  - help me with PySpark transformation patterns
  - show me lakehouse architecture patterns
  - what are the Microsoft Fabric pipeline patterns
  - help with Unity Catalog governance patterns
  - show me production data engineering best practices
---

# Data Engineering Patterns - Fabric & Databricks

> Skill by [ara.so](https://ara.so) — Data Skills collection.

This skill provides access to 600+ field-tested data engineering patterns for Microsoft Fabric, Azure Databricks, and PySpark. These patterns cover everything from pipeline design and Delta Lake optimization to Unity Catalog governance and cost architecture.

## What This Project Provides

A comprehensive collection of patterns organized into 12 books covering:

**Microsoft Fabric (250 patterns):**
- Pipelines and Data Factory
- Lakehouse and PySpark
- Warehouse and SQL
- Power BI in Fabric
- Architecture Patterns

**Azure Databricks (350 patterns):**
- Clusters and Compute
- Delta Lake
- Workflows and Orchestration
- Structured Streaming and Auto Loader
- Unity Catalog
- Databricks SQL and Photon
- Platform and Cost Architecture

**PySpark:**
- 88 concepts for production Spark across both platforms

## Installation

Clone the repository to access all pattern PDFs:

```bash
git clone https://github.com/ssanjaychandra123/data-engineering-patterns.git
cd data-engineering-patterns
```

## Repository Structure

```
data-engineering-patterns/
├── Fabric Patterns/
│   ├── Fabric Engineering Patterns Book I - Pipelines and Data Factory.pdf
│   ├── Fabric Engineering Patterns Book II - Lakehouse and PySpark.pdf
│   ├── Fabric Engineering Patterns Book III - Warehouse and SQL.pdf
│   ├── Fabric Engineering Patterns Book IV - Power BI in Fabric.pdf
│   └── Fabric Engineering Patterns Book V - Architecture Patterns.pdf
├── Databricks Patterns/
│   ├── Azure Databricks Engineering Patterns Book I - Clusters and Compute.pdf
│   ├── Azure Databricks Engineering Patterns Book II - Delta Lake.pdf
│   ├── Azure Databricks Engineering Patterns Book III - Workflows and Orchestration.pdf
│   ├── Azure Databricks Engineering Patterns Book IV - Structured Streaming and Auto Loader.pdf
│   ├── Azure Databricks Engineering Patterns Book V - Unity Catalog.pdf
│   ├── Azure Databricks Engineering Patterns Book VI - Databricks SQL and Photon.pdf
│   └── Azure Databricks Engineering Patterns Book VII - Platform and Cost Architecture.pdf
└── PySpark/
    └── The PySpark Handbook for Fabric and Databricks.pdf
```

## Key Pattern Categories

### Microsoft Fabric Patterns

#### Pipeline and Data Factory Patterns

Common patterns include:
- Incremental data loading strategies
- Pipeline retry and error handling
- Parameter-driven pipeline design
- Activity dependencies and control flow
- Copy activity optimization
- Metadata-driven frameworks

Example incremental load pattern in Fabric Pipeline:

```python
# Notebook activity in Fabric pipeline
from datetime import datetime, timedelta

# Get pipeline parameters
watermark = spark.conf.get("pipeline.watermark")
table_name = spark.conf.get("pipeline.tableName")

# Read incremental data
df = spark.read.format("delta") \
    .load(f"abfss://source@storage.dfs.core.windows.net/{table_name}") \
    .filter(f"modified_date > '{watermark}'")

# Write to target
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save(f"Tables/{table_name}")

# Return new watermark
new_watermark = df.agg({"modified_date": "max"}).collect()[0][0]
mssparkutils.notebook.exit(str(new_watermark))
```

#### Lakehouse and PySpark Patterns

Key patterns for Fabric Lakehouse:

```python
# Pattern: Upsert (merge) operation in Fabric Lakehouse
from delta.tables import DeltaTable

# Source data
updates_df = spark.read.format("parquet").load("Files/updates/")

# Target Delta table
target_table = DeltaTable.forPath(spark, "Tables/customers")

# Merge logic
target_table.alias("target").merge(
    updates_df.alias("updates"),
    "target.customer_id = updates.customer_id"
).whenMatchedUpdate(set={
    "name": "updates.name",
    "email": "updates.email",
    "updated_at": "updates.updated_at"
}).whenNotMatchedInsert(values={
    "customer_id": "updates.customer_id",
    "name": "updates.name",
    "email": "updates.email",
    "created_at": "updates.created_at",
    "updated_at": "updates.updated_at"
}).execute()
```

Pattern: Optimize Delta tables in Fabric:

```python
# Optimize with Z-ordering for common query patterns
spark.sql(f"""
    OPTIMIZE lakehouse.customers
    ZORDER BY (customer_id, signup_date)
""")

# Vacuum old files (default 7 days retention)
spark.sql(f"""
    VACUUM lakehouse.customers RETAIN 168 HOURS
""")
```

#### Warehouse and SQL Patterns

Pattern: Create warehouse tables with proper partitioning:

```sql
-- Create partitioned warehouse table in Fabric
CREATE TABLE dw.fact_sales (
    sale_id BIGINT,
    customer_id BIGINT,
    product_id BIGINT,
    sale_amount DECIMAL(18,2),
    sale_date DATE,
    created_at TIMESTAMP
)
USING DELTA
PARTITIONED BY (sale_date);

-- Insert with partition optimization
INSERT INTO dw.fact_sales
SELECT 
    sale_id,
    customer_id,
    product_id,
    sale_amount,
    CAST(sale_date AS DATE) as sale_date,
    created_at
FROM staging.sales
WHERE sale_date >= CURRENT_DATE - INTERVAL 7 DAYS;
```

### Azure Databricks Patterns

#### Cluster and Compute Patterns

Pattern: Configure autoscaling cluster for cost optimization:

```python
# Databricks cluster configuration (JSON)
{
  "cluster_name": "production-etl",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "autotermination_minutes": 30,
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true",
    "spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite": "true",
    "spark.databricks.delta.properties.defaults.autoOptimize.autoCompact": "true"
  },
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "spot_bid_price_percent": 100
  }
}
```

#### Delta Lake Patterns

Pattern: Time travel and versioning:

```python
# Read historical version of Delta table
df_version_10 = spark.read.format("delta") \
    .option("versionAsOf", 10) \
    .load("/mnt/delta/customers")

# Read table as of timestamp
df_yesterday = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-15 00:00:00") \
    .load("/mnt/delta/customers")

# Describe history
history_df = spark.sql("DESCRIBE HISTORY delta.`/mnt/delta/customers`")
history_df.select("version", "timestamp", "operation", "operationMetrics").show()
```

Pattern: Change Data Feed (CDF) for incremental processing:

```python
# Enable CDF on table
spark.sql("""
    ALTER TABLE delta.customers 
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read changes between versions
changes_df = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 10) \
    .option("endingVersion", 20) \
    .table("delta.customers")

# Process different change types
inserts = changes_df.filter("_change_type = 'insert'")
updates = changes_df.filter("_change_type = 'update_postimage'")
deletes = changes_df.filter("_change_type = 'delete'")
```

#### Structured Streaming Patterns

Pattern: Auto Loader with schema evolution:

```python
# Auto Loader with schema inference and evolution
checkpoint_path = "/mnt/checkpoints/raw_files"
target_path = "/mnt/delta/bronze/raw_data"

df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", checkpoint_path + "/schema") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .load("/mnt/landing/raw_files/")

# Write to Delta with checkpointing
query = df.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .option("mergeSchema", "true") \
    .trigger(availableNow=True) \
    .start(target_path)

query.awaitTermination()
```

Pattern: Streaming aggregations with watermarking:

```python
from pyspark.sql.functions import window, col

# Read streaming data
stream_df = spark.readStream.format("delta") \
    .table("events")

# Windowed aggregation with watermark
aggregated = stream_df \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(
        window(col("event_time"), "5 minutes"),
        col("user_id")
    ) \
    .agg({
        "event_id": "count",
        "amount": "sum"
    })

# Write to Delta table
query = aggregated.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/mnt/checkpoints/aggregations") \
    .toTable("event_aggregations")
```

#### Unity Catalog Patterns

Pattern: Create governed table with row-level security:

```python
# Create schema with Unity Catalog
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS main.finance
    COMMENT 'Finance department data'
    LOCATION 'abfss://data@storage.dfs.core.windows.net/finance'
""")

# Create managed table
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.finance.transactions (
        transaction_id BIGINT,
        account_id BIGINT,
        amount DECIMAL(18,2),
        region STRING,
        transaction_date DATE
    )
    USING DELTA
    TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
""")

# Apply row filter for data access control
spark.sql("""
    CREATE FUNCTION main.finance.region_filter(region STRING)
    RETURN IF(
        IS_MEMBER('data_engineers'), 
        TRUE, 
        region = current_user()
    )
""")

spark.sql("""
    ALTER TABLE main.finance.transactions 
    SET ROW FILTER main.finance.region_filter ON (region)
""")
```

Pattern: Column masking with Unity Catalog:

```python
# Create masking function
spark.sql("""
    CREATE FUNCTION main.finance.mask_ssn(ssn STRING)
    RETURN CASE 
        WHEN IS_MEMBER('finance_managers') THEN ssn
        ELSE CONCAT('XXX-XX-', RIGHT(ssn, 4))
    END
""")

# Apply column mask
spark.sql("""
    ALTER TABLE main.finance.customers 
    ALTER COLUMN ssn 
    SET MASK main.finance.mask_ssn
""")
```

#### Workflows and Orchestration Patterns

Pattern: Create parameterized Databricks job:

```python
# In notebook: Get job parameters
dbutils.widgets.text("date", "")
dbutils.widgets.text("environment", "prod")

processing_date = dbutils.widgets.get("date")
env = dbutils.widgets.get("environment")

# Use parameters in processing
df = spark.read.format("delta") \
    .load(f"/mnt/{env}/data") \
    .filter(f"date = '{processing_date}'")

# Process and write results
result_df = df.groupBy("category").count()
result_df.write.format("delta").mode("overwrite") \
    .save(f"/mnt/{env}/results/{processing_date}")

# Return status for orchestration
dbutils.notebook.exit(f"Processed {result_df.count()} records")
```

Pattern: Job definition with retry logic:

```json
{
  "name": "daily-etl-pipeline",
  "tasks": [
    {
      "task_key": "extract",
      "notebook_task": {
        "notebook_path": "/Workflows/extract",
        "base_parameters": {
          "date": "{{job.start_time.date}}",
          "environment": "prod"
        }
      },
      "existing_cluster_id": "{{cluster_id}}",
      "max_retries": 2,
      "timeout_seconds": 3600
    },
    {
      "task_key": "transform",
      "depends_on": [{"task_key": "extract"}],
      "notebook_task": {
        "notebook_path": "/Workflows/transform",
        "base_parameters": {
          "date": "{{job.start_time.date}}"
        }
      },
      "existing_cluster_id": "{{cluster_id}}",
      "max_retries": 1
    },
    {
      "task_key": "load",
      "depends_on": [{"task_key": "transform"}],
      "notebook_task": {
        "notebook_path": "/Workflows/load"
      },
      "existing_cluster_id": "{{cluster_id}}"
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  }
}
```

### PySpark Production Patterns

#### Broadcast Join Pattern

```python
from pyspark.sql.functions import broadcast

# Small dimension table (< 10GB)
dim_products = spark.table("dim.products")

# Large fact table
fact_sales = spark.table("fact.sales")

# Use broadcast join to avoid shuffle
result = fact_sales.join(
    broadcast(dim_products),
    fact_sales.product_id == dim_products.product_id,
    "left"
)
```

#### Partitioning and Bucketing Pattern

```python
# Write with optimal partitioning
df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .option("maxRecordsPerFile", 1000000) \
    .save("/mnt/delta/partitioned_data")

# Create bucketed table for join optimization
df.write.format("delta") \
    .mode("overwrite") \
    .bucketBy(100, "customer_id") \
    .sortBy("transaction_date") \
    .saveAsTable("bucketed_transactions")
```

#### Error Handling Pattern

```python
from pyspark.sql.functions import col, when, lit
from pyspark.sql.utils import AnalysisException

try:
    # Attempt to read data with schema enforcement
    df = spark.read.format("delta") \
        .option("enforceSchema", "true") \
        .load("/mnt/delta/source")
    
    # Data quality checks
    valid_df = df.filter(col("amount") > 0) \
        .filter(col("customer_id").isNotNull())
    
    invalid_df = df.filter(
        (col("amount") <= 0) | 
        (col("customer_id").isNull())
    ).withColumn("error_reason", 
        when(col("amount") <= 0, lit("Invalid amount"))
        .when(col("customer_id").isNull(), lit("Missing customer_id"))
    )
    
    # Write valid records
    valid_df.write.format("delta").mode("append") \
        .save("/mnt/delta/target")
    
    # Write invalid records to quarantine
    if invalid_df.count() > 0:
        invalid_df.write.format("delta").mode("append") \
            .save("/mnt/delta/quarantine")
        
except AnalysisException as e:
    print(f"Schema mismatch: {str(e)}")
    # Handle schema evolution
    df = spark.read.format("delta") \
        .option("mergeSchema", "true") \
        .load("/mnt/delta/source")
```

#### Performance Optimization Pattern

```python
from pyspark.sql.functions import col, current_timestamp

# Cache frequently accessed data
df_cached = spark.table("dimension.products") \
    .filter(col("is_active") == True) \
    .cache()

# Use persist for complex operations
from pyspark.storagelevel import StorageLevel
df_persisted = large_df.repartition(200, "partition_key") \
    .persist(StorageLevel.MEMORY_AND_DISK)

# Adaptive Query Execution settings
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Dynamic partition pruning
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
```

## Common Use Cases

### Medallion Architecture Pattern

```python
# Bronze layer: Raw data ingestion
bronze_df = spark.read.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load("/mnt/landing/") \
    .withColumn("ingestion_time", current_timestamp())

bronze_df.write.format("delta") \
    .mode("append") \
    .save("/mnt/delta/bronze/raw_events")

# Silver layer: Cleaned and conformed
from pyspark.sql.functions import col, to_timestamp

silver_df = spark.read.format("delta") \
    .load("/mnt/delta/bronze/raw_events") \
    .filter(col("event_type").isNotNull()) \
    .withColumn("event_timestamp", to_timestamp("timestamp")) \
    .dropDuplicates(["event_id"]) \
    .select("event_id", "event_type", "user_id", "event_timestamp", "properties")

silver_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/mnt/delta/silver/events")

# Gold layer: Business aggregates
gold_df = spark.read.format("delta") \
    .load("/mnt/delta/silver/events") \
    .groupBy("user_id", "event_type") \
    .agg({
        "event_id": "count",
        "event_timestamp": "max"
    })

gold_df.write.format("delta") \
    .mode("overwrite") \
    .save("/mnt/delta/gold/user_event_summary")
```

### SCD Type 2 Pattern

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import col, current_timestamp, lit

# Source changes
source_df = spark.read.format("parquet").load("/mnt/staging/customers")

# Target dimension
target_table = DeltaTable.forPath(spark, "/mnt/delta/dim_customers")

# Identify changed records
changes = source_df.alias("source") \
    .join(
        target_table.toDF().filter("is_current = true").alias("target"),
        "customer_id",
        "left"
    ) \
    .filter(
        col("target.customer_id").isNull() |  # New records
        (col("source.name") != col("target.name")) |  # Changed records
        (col("source.email") != col("target.email"))
    )

# Expire old records
target_table.alias("target").merge(
    changes.alias("changes"),
    "target.customer_id = changes.customer_id AND target.is_current = true"
).whenMatchedUpdate(set={
    "is_current": lit(False),
    "end_date": current_timestamp()
}).execute()

# Insert new versions
new_records = changes.select(
    col("customer_id"),
    col("name"),
    col("email"),
    current_timestamp().alias("start_date"),
    lit(None).alias("end_date"),
    lit(True).alias("is_current")
)

new_records.write.format("delta").mode("append") \
    .save("/mnt/delta/dim_customers")
```

## Configuration Best Practices

### Fabric Configuration

```python
# Set Fabric notebook session configuration
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Access Fabric environment variables
from notebookutils import mssparkutils

# Get secrets from Key Vault
storage_key = mssparkutils.credentials.getSecret(
    "https://keyvault.vault.azure.net/",
    "storage-account-key"
)

# Access workspace identity
workspace_id = mssparkutils.env.getWorkspaceId()
```

### Databricks Configuration

```python
# Access Databricks secrets
storage_account_key = dbutils.secrets.get(
    scope="azure-key-vault",
    key="storage-account-key"
)

# Mount storage with managed identity
dbutils.fs.mount(
    source=f"abfss://data@{storage_account}.dfs.core.windows.net/",
    mount_point="/mnt/data",
    extra_configs={
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": 
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": dbutils.secrets.get("azure-sp", "client-id"),
        "fs.azure.account.oauth2.client.secret": storage_account_key,
        "fs.azure.account.oauth2.client.endpoint": 
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    }
)

# Optimize cluster for specific workload
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")  # 128 MB
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
```

## Troubleshooting

### Performance Issues

**Problem:** Slow joins causing job timeouts

```python
# Check partition distribution
df.rdd.getNumPartitions()  # Should be 200-2000 for most workloads

# Identify data skew
df.groupBy("partition_key").count().orderBy(col("count").desc()).show()

# Solution: Repartition with salt for skewed keys
from pyspark.sql.functions import rand, concat

df_balanced = df.withColumn("salt", (rand() * 10).cast("int")) \
    .withColumn("salted_key", concat(col("partition_key"), lit("_"), col("salt"))) \
    .repartition(200, "salted_key")
```

**Problem:** Small file problem in Delta tables

```python
# Check file sizes
spark.sql("DESCRIBE DETAIL delta.`/mnt/delta/table`").select("numFiles", "sizeInBytes").show()

# Solution: Compact small files
spark.sql("OPTIMIZE delta.`/mnt/delta/table`")

# For partitioned tables
spark.sql("OPTIMIZE delta.`/mnt/delta/table` WHERE date >= '2024-01-01'")
```

### Schema Evolution Issues

**Problem:** Schema mismatch errors when appending data

```python
# Enable automatic schema merging
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/mnt/delta/table")

# Or allow schema overwrite
df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/mnt/delta/table")

# Check current schema
spark.read.format("delta").load("/mnt/delta/table").printSchema()
```

### Memory Issues

**Problem:** Out of memory errors during processing

```python
# Solution 1: Increase partition count to reduce partition size
df_repartitioned = df.repartition(400)

# Solution 2: Use iterative processing for large aggregations
from pyspark.sql.window import Window

window_spec = Window.partitionBy("category").orderBy("date")
df_windowed = df.withColumn("row_num", row_number().over(window_spec))

# Solution 3: Spill to disk instead of memory
spark.conf.set("spark.memory.fraction", "0.6")
spark.conf.set("spark.memory.storageFraction", "0.3")
```

### Streaming Issues

**Problem:** Checkpoint directory conflicts

```python
# Always use unique checkpoint locations per stream
checkpoint_base = "/mnt/checkpoints"
stream_id = "user_events_stream"

query = df.writeStream \
    .format("delta") \
    .option("checkpointLocation", f"{checkpoint_base}/{stream_id}") \
    .start("/mnt/delta/target")

# To restart stream from beginning, delete checkpoint
# dbutils.fs.rm(f"{checkpoint_base}/{stream_id}", True)
```

**Problem:** Watermark not advancing

```python
# Ensure event time column is properly formatted
from pyspark.sql.functions import to_timestamp

df_with_timestamp = df.withColumn(
    "event_time",
    to_timestamp(col("timestamp_string"), "yyyy-MM-dd HH:mm:ss")
)

# Set appropriate watermark delay
stream_df = df_with_timestamp.withWatermark("event_time", "30 minutes")
```

## Cost Optimization Patterns

### Databricks Cost Optimization

```python
# Use cluster pools for faster startup
cluster_config = {
    "instance_pool_id": "pool-abc123",
    "autotermination_minutes": 15,
    "autoscale": {
        "min_workers": 1,
        "max_workers": 10
    }
}

# Use spot instances for non-critical workloads
aws_attributes = {
    "availability": "SPOT_WITH_FALLBACK",
    "zone_id": "us-west-2a",
    "spot_bid_price_percent": 100
}

# Optimize table for reduced storage and faster queries
spark.sql("""
    OPTIMIZE prod.sales_transactions
    ZORDER BY (customer_id, transaction_date)
""")

# Remove old versions to reduce storage costs
spark.sql("VACUUM prod.sales_transactions RETAIN 168 HOURS")
```

### Fabric Cost Optimization

```python
# Use on-demand capacity for variable workloads
# Set idle timeout for capacity auto-pause

# Optimize pipeline runs
# - Use copy activity instead of foreach + copy for bulk operations
# - Batch small files before processing
# - Use incremental loads instead of full refreshes

# Compress data at rest
df.write.format("delta") \
    .option("compression", "zstd") \
    .mode("overwrite") \
    .save("Tables/compressed_data")
```

## Resources

- **Pattern PDFs:** All 12 books are available in the repository under `Fabric Patterns/`, `Databricks Patterns/`, and `PySpark/`
- **Microsoft Fabric Documentation:** https://learn.microsoft.com/fabric/
- **Azure Databricks Documentation:** https://learn.microsoft.com/azure/databricks/
- **Delta Lake Documentation:** https://docs.delta.io/
- **PySpark API Reference:** https://spark.apache.org/docs/latest/api/python/

## Author

Sanjay Chandra - Enterprise data platform architect and advisor
- LinkedIn: https://www.linkedin.com/in/ssanjaychandra/
- Website: http://www.ssanjaychandra.com

---

*These patterns are compiled from real production implementations across Microsoft Fabric and Azure Databricks platforms. The material is continuously updated as platforms evolve.*

Source

Creator's repository · aradotso/data-skills

View on GitHub

Security

Security checks in progress
Results will appear here once audits complete
What this skill can do
Reads your filesConnects to the internetRuns code on your machine
Checked by 3 independent security firms
Does it try to trick the AI?Not yet checkedPending · Gen Agent Trust Hub
Does it sneak in hidden code?Not yet checkedPending · Socket
Does it have known bugs?Not yet checkedPending · Snyk