Overview
Analytics pipelines process structured data in ETL-like stages, and they are the recommended way to design pipelines that deal with structured data.
- csv, json, parquet, orc, iceberg, delta and txt are the supported formats;
- data can be read from and written to S3, Kafka, PostgreSQL and compatible databases (ClickHouse), ElasticSearch, OpenSearch, NATS, etc.;
- analytics operations such as GROUP BY, AGGREGATE and JOIN are supported through SQL queries processing the flow of data coming from datasets;
- custom user functions are easy to add, currently in Python, without dealing with the underlying Spark engine or the plumbing between tasks.
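As a rough illustration of the kind of SQL analytics such a pipeline runs: in the pipeline these queries are executed by the Spark SQL engine, but the same query shape can be shown with Python's built-in sqlite3 module. The table and column names here are invented for the example.

```python
import sqlite3

# In-memory database standing in for the datasets flowing through the pipeline.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers(customer_id INTEGER, country TEXT);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, 10, 40.0), (3, 20, 10.0);
    INSERT INTO customers VALUES (10, 'FR'), (20, 'DE');
""")

# A JOIN + GROUP BY + aggregate: the shape of query the analytics mode supports.
rows = con.execute("""
    SELECT c.country, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()

print(rows)  # [('DE', 10.0), ('FR', 65.0)]
```

In the analytics mode the same SQL would run over DataFrames read from the configured sources, partitioned according to `parallelism`.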
The execution engine is based on Temporal over Apache Spark and PySpark, and SQL capabilities are provided through the Spark SQL engine.
In this mode, a Pipeline is instantiated as a Kubernetes pod and Spark is configured to use all CPUs available to the pod.
To use the analytics mode, your pipeline specification should contain the following:
```yaml
runtime: pyspark # (1)!
mode: stream # (8)!
globalConfig:
  image: kast-registry.kast.wip:5000/analyticsruntime:latest # (3)!
  imagePullPolicy: Always
  parallelism: 1 # (9)!
  extraResources: # (2)!
    addr: 192.168.100.237:31318
    topics: s3a://extra
    user: minioadmin
    password: minioadmin
dag:
  - id: read-source # (5)!
    type: fs-source # (4)!
    out: [...] # (6)!
    ...
  - ... # (7)!
```
1. The other possible value is `duckdb`.
2. The `extraResources` field is used to specify additional UDFs that will be called by the Flow. The `topics` field is a comma-separated list of S3 buckets where the zip files containing the UDFs are located (see UDF).
3. Adjust according to your environment.
4. The list of available KFlow function types is described here.
5. Choose a unique id for each Task in the DAG.
6. The `out` field is a list of ids of Tasks in the DAG.
7. Add more Tasks to the DAG as needed.
8. Stream processing is the default mode. To run the pipeline in batch mode, specify `mode: batch` in your pipeline.
9. The `parallelism` field is used to specify the number of partitions to use for Spark computations. The default value is 1.
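Since user functions are plain Python, a UDF can be sketched as an ordinary callable. This is a minimal, hypothetical example: the function name and signature are invented for illustration, and the packaging into the zip archives referenced by `topics` is covered in the UDF section.

```python
# Hypothetical user-defined function: a plain Python callable applied row by
# row. The runtime wires it into the Spark engine, so no PySpark code or
# inter-task plumbing is needed here.
def normalize_amount(amount: float, rate: float = 1.0) -> float:
    """Convert an amount using a fixed rate and round to 2 decimals."""
    return round(amount * rate, 2)

# Local check, independent of the Spark engine.
print(normalize_amount(19.995, rate=1.1))  # 21.99
```

Keeping UDFs free of engine-specific code is what lets the same function be tested locally and then shipped in a zip for the pipeline to load.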