
Commit ecb7fb3

Pykx module (#43)
* add pykx module
* remove azure serverless experiment
* update all IDs to be generic
1 parent 80751ec commit ecb7fb3


10 files changed: +616 -0 lines changed


Diff for: examples/test_pykx/.terraform.lock.hcl

+39
Generated file; contents not rendered.

Diff for: examples/test_pykx/init.tf

+21
@@ -0,0 +1,21 @@
terraform {
  backend "s3" {
    bucket = "blueprints-pykx-rp"
    key    = "test-pykx.tfstate"
    region = "us-east-1"
  }
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "~>1.23.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~>4.54.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

Diff for: examples/test_pykx/main.tf

+13
@@ -0,0 +1,13 @@

// Deploy the PyKX module (notebooks and Databricks job) into the spoke workspace
module "pykx" {
  source                        = "../../modules/pykx/"
  aws_spoke_databricks_username = var.aws_spoke_databricks_username
  aws_spoke_databricks_password = var.aws_spoke_databricks_password
  aws_spoke_ws_url              = var.aws_spoke_ws_url
  aws_region                    = var.aws_region
}

output "module_workspace_url" {
  value = module.pykx.job_url
}

Diff for: examples/test_pykx/vars.tf

+21
@@ -0,0 +1,21 @@

variable "aws_spoke_ws_url" {}

variable "aws_spoke_databricks_username" {}
variable "aws_spoke_databricks_password" {}

variable "aws_region" {}

locals {
  prefix = "fs-lakehouse"
}

locals {
  tags = { "org" = "fsi" }
}

locals {
  ext_s3_bucket = "${local.prefix}-ext-bucket"
}

Diff for: modules/pykx/README.md

+67
@@ -0,0 +1,67 @@
# Databricks PyKX Time Series Processing Automation with Terraform

This repository contains Terraform configurations that automate reading KDB data, converting it to Parquet format, and processing it in Databricks with Pandas UDFs. The implementation uses PyKX for efficient KDB interaction and Databricks for scalable data processing.

## Overview

The Terraform scripts in this repository automate the creation and configuration of Databricks notebooks and jobs for handling time-series data. The process involves:

1. **Reading Data from KDB**: The first notebook (`read_kdb_save_parquet.py`) reads data from a KDB database.
2. **Converting to Parquet**: The data is then converted to Parquet format for efficient processing.
3. **Time Series Merge and Processing**: The second notebook (`ts_compute_delta_lake.py`) merges and processes the time series data using Pandas UDFs (see the sketch after this list).
4. **Databricks Job Configuration**: A Databricks workflow is configured to execute these notebooks in sequence, using a shared Databricks jobs cluster.
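The `ts_compute_delta_lake.py` notebook is not rendered in this diff, so the following is only a rough sketch of what step 3 could look like: a per-symbol as-of merge run as a cogrouped Pandas UDF. The table names, columns, and output schema are illustrative assumptions, not taken from the module.

```python
# Hypothetical sketch only -- ts_compute_delta_lake.py is not shown in this diff.
# Table and column names (trades, quotes, sym, ts, price, bid, ask) are assumed.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

trades = spark.table("cap_markets.q_on_dbx.trades")  # assumed input table
quotes = spark.table("cap_markets.q_on_dbx.quotes")  # assumed input table

def asof_join(trades_pdf: pd.DataFrame, quotes_pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs on the workers for each symbol: match every trade with the most
    # recent quote at or before the trade timestamp.
    return pd.merge_asof(
        trades_pdf.sort_values("ts"),
        quotes_pdf.sort_values("ts"),
        on="ts",
        by="sym",
    )

merged = (
    trades.groupBy("sym")
    .cogroup(quotes.groupBy("sym"))
    .applyInPandas(
        asof_join,
        schema="sym string, ts timestamp, price double, bid double, ask double",
    )
)

# Persist the merged series as a Delta table governed by Unity Catalog.
merged.write.mode("overwrite").saveAsTable("cap_markets.q_on_dbx.trades_with_quotes")
```

A cogrouped UDF hands each symbol's rows to a single pandas DataFrame pair, which is what makes the in-memory `merge_asof` usable at Spark scale.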
## Terraform Resources and Modules

### Data Sources

- `databricks_current_user`: Determines the current user's information for path configurations.
- `databricks_spark_version`: Fetches the latest Spark version available in Databricks.
- `databricks_node_type`: Retrieves the smallest node type with a local disk for cost efficiency.

### Notebook Resources

Two Databricks notebooks are created using `databricks_notebook` resources:

1. `read_kdb_save_parquet.py`: Located at `/Users/<your email>/01-Load-PyKX-Delta`.
2. `ts_compute_delta_lake.py`: Located at `/Users/<your email>/02-Merge-Q-Time-Series`.

These notebooks are populated with the respective Python scripts from the local module path.

### Databricks Job

The `databricks_job` resource, `process_time_series_liquid_cluster`, is configured to run the above notebooks sequentially on a Databricks cluster. Key configurations include:

- **Job Cluster**: Utilizes the latest Spark version, `r6i.xlarge` nodes, and enables elastic disk with SPOT instances for cost optimization.
- **Tasks**: The job comprises two tasks, `load_data` and `process_data`, each corresponding to a notebook. The `process_data` task depends on the completion of `load_data`.
- **Libraries**: Each task installs the required Python libraries: `numpy`, `pandas`, `pyarrow`, `pykx`, `pytz`, and `toml`.

### Output

- `job_url`: Outputs the URL of the created Databricks job, providing easy access to monitor and manage the job execution.
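As a side note that is not part of this module: once Terraform has created the job, it can also be triggered and monitored programmatically. The sketch below uses the Databricks Python SDK with a placeholder job ID; adapt both to your workspace.

```python
# Minimal sketch, assuming the databricks-sdk package is installed and
# credentials are available via environment variables or ~/.databrickscfg.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# The job ID is a placeholder -- take the real one from the Terraform output.
run = w.jobs.run_now(job_id=123456789).result()  # blocks until the run finishes
print(run.state.result_state)
```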
## Terraform Configuration

* Clone the Repository: Clone this repository to your local machine.
* Configure Terraform Variables: Set the required variables in a `terraform.tfvars` file. For security reasons, this file should be located outside the GitHub project.
* Initialize Terraform: Run `terraform init` to initialize the working directory containing the Terraform configuration files.
* Apply Configuration: Execute `terraform apply -var-file="<path_to_your_tfvars_file>/terraform.tfvars"` to create the resources in your Databricks workspace.

Ensure you have Terraform installed and configured with the necessary provider credentials to interact with your Databricks and AWS environments.

### Key Features

* Automated Data Pipeline: Streamlines reading, converting, and processing time-series data from KDB into Databricks.
* Scalable and Cost-Effective: Utilizes Databricks' scalable infrastructure with cost-effective options like SPOT instances.
* Sequential Task Execution: Ensures orderly processing by configuring task dependencies within the Databricks job.
* Library Management: Automates the installation of the Python libraries required for data processing.

### Prerequisites

* Databricks workspace with the necessary permissions.
* Terraform installed and configured.
* Access to a KDB database and appropriate credentials for data extraction.

Diff for: modules/pykx/main.tf

+139
@@ -0,0 +1,139 @@
data "databricks_current_user" "me" {
}

data "databricks_spark_version" "latest" {}
data "databricks_node_type" "smallest" {
  local_disk = true
}

resource "databricks_notebook" "load" {
  provider = databricks.spoke_aws_workspace
  source   = "${path.module}/read_kdb_save_parquet.py"
  path     = "${data.databricks_current_user.me.home}/01-Load-PyKX-Delta"
}

resource "databricks_notebook" "time_series_merge" {
  provider = databricks.spoke_aws_workspace
  source   = "${path.module}/ts_compute_delta_lake.py"
  path     = "${data.databricks_current_user.me.home}/02-Merge-Q-Time-Series"
}


resource "databricks_job" "process_time_series_liquid_cluster" {
  provider = databricks.spoke_aws_workspace
  name     = "Databricks PyKX Time Series Merge into Delta Lake (${data.databricks_current_user.me.alphanumeric})"

  job_cluster {
    job_cluster_key = "shared_cap_markets"
    new_cluster {
      spark_version       = data.databricks_spark_version.latest.id
      node_type_id        = "r6i.xlarge"
      enable_elastic_disk = true
      num_workers         = 1
      aws_attributes {
        availability = "SPOT"
      }
      data_security_mode = "SINGLE_USER"
      custom_tags        = { "clusterSource" = "lakehouse-blueprints" }
    }
  }

  task {
    task_key = "load_data"
    notebook_task {
      notebook_path = databricks_notebook.load.path
    }

    job_cluster_key = "shared_cap_markets"

    library {
      pypi {
        package = "numpy~=1.22"
      }
    }

    library {
      pypi {
        package = "pandas>=1.2"
      }
    }
    library {
      pypi {
        package = "pyarrow>=3.0.0"
      }
    }

    library {
      pypi {
        package = "pykx==2.1.1"
      }
    }

    library {
      pypi {
        package = "pytz>=2022.1"
      }
    }

    library {
      pypi {
        package = "toml~=0.10.2"
      }
    }
  }

  task {
    task_key = "process_data"
    // this task will only run after load_data completes
    depends_on {
      task_key = "load_data"
    }

    notebook_task {
      notebook_path = databricks_notebook.time_series_merge.path
    }

    job_cluster_key = "shared_cap_markets"

    library {
      pypi {
        package = "numpy~=1.22"
      }
    }

    library {
      pypi {
        package = "pandas>=1.2"
      }
    }
    library {
      pypi {
        package = "pyarrow>=3.0.0"
      }
    }

    library {
      pypi {
        package = "pykx==2.1.1"
      }
    }

    library {
      pypi {
        package = "pytz>=2022.1"
      }
    }

    library {
      pypi {
        package = "toml~=0.10.2"
      }
    }
  }

}


output "job_url" {
  value = databricks_job.process_time_series_liquid_cluster.id
}

Diff for: modules/pykx/read_kdb_save_parquet.py

+83
@@ -0,0 +1,83 @@
# Databricks notebook source
# MAGIC %md
# MAGIC
# MAGIC ## Create Delta Lake Objects Governed by Unity Catalog
# MAGIC
# MAGIC This particular workflow shows Parquet file generation directly from a `q` table in KDB. All tables are saved in the cap_markets catalog, which can be inspected for all table definitions, lineage, and access controls.

# COMMAND ----------

# MAGIC %sql create catalog if not exists cap_markets; create schema if not exists q_on_dbx

# COMMAND ----------

# MAGIC %sql use catalog cap_markets; use schema q_on_dbx

# COMMAND ----------

# MAGIC %fs mkdirs /rp_ts

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC ### Create KDB Table and Save to Parquet

# COMMAND ----------

import os

# PyKX reads the q license location and q startup arguments before import
os.environ['QLIC'] = '/dbfs/tmp/license/'
os.environ['QARGS'] = '-s 4'
import pandas as pd
import pykx as kx


# COMMAND ----------

# MAGIC %%q
# MAGIC
# MAGIC tbl: ([] sym: `symbol$(); int_f: `int$(); float_f: `float$(); real_f: `real$(); byte_f: `byte$(); char_f: `char$(); timestamp_f: `timestamp$(); month_f: `month$(); date_f: `date$(); time_f: `time$(); minute_f: `minute$(); second_f: `second$(); time_f: `time$())
# MAGIC
# MAGIC tbl

# COMMAND ----------

# MAGIC %%q
# MAGIC
# MAGIC tbl,: enlist[`a; 1; 1.1; 1e; 0x01; "a"; .z.p; 2023.01m; 2023.01.01; 12:00:00.000; 12:00; 12:00:00; 12:00:00.000]
# MAGIC tbl,: enlist[`b; 2; 2.2; 2e; 0x02; "b"; .z.p; 2023.02m; 2023.02.02; 13:00:00.000; 13:00; 13:00:00; 13:00:00.000]
# MAGIC tbl,: enlist[`c; 3; 3.3; 3e; 0x03; "c"; .z.p; 2023.03m; 2023.03.03; 14:00:00.000; 14:00; 14:00:00; 14:00:00.000]
# MAGIC tbl,: enlist[`d; 4; 4.4; 4e; 0x04; "d"; .z.p; 2023.04m; 2023.04.04; 15:00:00.000; 15:00; 15:00:00; 15:00:00.000]
# MAGIC tbl,: enlist[`e; 5; 5.5; 5e; 0x05; "e"; .z.p; 2023.05m; 2023.05.05; 16:00:00.000; 16:00; 16:00:00; 16:00:00.000]
# MAGIC

# COMMAND ----------

# MAGIC %%q
# MAGIC
# MAGIC tbl

# COMMAND ----------

import pyarrow as pa
import pyarrow.parquet as pq
import pykx as kx

# Fetch data from kdb+
kdb_table = kx.q('tbl')  # Replace 'tbl' with your kdb+ table name

# Convert the PyKX table to a PyArrow table
arrow_table = kdb_table.pa()

# Write the table to the final Parquet file.
# To avoid downcasting errors, use the int96 timestamp configuration as shown below.
pq.write_table(arrow_table, '/dbfs/rp_ts/output.parquet', use_deprecated_int96_timestamps=True)

# COMMAND ----------

display(spark.read.format("parquet").load("dbfs:/rp_ts/output.parquet"))

# COMMAND ----------

spark.read.format("parquet").load("dbfs:/rp_ts/output.parquet").write.saveAsTable("cap_markets.q_on_dbx.sample_dbx_table")
