Overview
Analytics pipelines process structured data in ETL-like stages, and they are the recommended way to design pipelines that deal with structured data.
- csv, json, parquet, orc, iceberg, delta and txt are the supported formats;
- data can be read from and written to S3, Kafka, PostgreSQL and compatible databases (ClickHouse), ElasticSearch, OpenSearch, NATS, etc.;
- analytics operations such as GROUP BY, AGGREGATE and JOIN are supported through SQL queries processing the flow of data coming from datasets;
- custom user functions are easy to add, currently in Python, without dealing with the underlying Spark engine or the plumbing between tasks.
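As a rough illustration of the kind of SQL analytics such a pipeline runs: in the pipeline these queries are executed by the Spark SQL engine, but the same query shape can be shown with Python's built-in sqlite3 module. The table and column names here are invented for the example.

```python
import sqlite3

# In-memory database standing in for the datasets flowing through the pipeline.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers(customer_id INTEGER, country TEXT);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, 10, 40.0), (3, 20, 10.0);
    INSERT INTO customers VALUES (10, 'FR'), (20, 'DE');
""")

# A JOIN + GROUP BY + aggregate: the shape of query the analytics mode supports.
rows = con.execute("""
    SELECT c.country, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()

print(rows)  # [('DE', 10.0), ('FR', 65.0)]
```

In the analytics mode the same SQL would run over DataFrames read from the configured sources, partitioned according to `parallelism`.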
The execution engine is based on Temporal over Apache Spark and PySpark, and SQL capabilities are provided through the Spark SQL engine.
In this mode, a Pipeline is instantiated as a Kubernetes pod and Spark is configured to use all CPUs available to the pod.
To use the analytics mode, your pipeline specification should contain the following:
```yaml
runtime: pyspark # (1)!
mode: stream # (8)!
globalConfig:
  image: kast-registry.kast.wip:5000/analyticsruntime:latest # (3)!
  imagePullPolicy: Always
  parallelism: 1 # (9)!
  extraResources: # (2)!
    addr: 192.168.100.237:31318
    topics: s3a://extra
    user: minioadmin
    password: minioadmin
dag:
  - id: read-source # (5)!
    type: fs-source # (4)!
    out: [...] # (6)!
    ...
  - ... # (7)!
```
1. The other possible value is `duckdb`.
2. The `extraResources` field is used to specify additional UDFs that will be called by the Flow. The `topics` field is a comma-separated list of S3 buckets where the zip files containing the UDFs are located (see UDF).
3. Adjust according to your environment.
4. The list of available KFlow function types is described here.
5. Choose a unique id for each Task in the DAG.
6. The `out` field is a list of ids of Tasks in the DAG.
7. Add more Tasks to the DAG as needed.
8. Stream processing is the default mode. To run the pipeline in batch mode, specify `mode: batch` in your pipeline.
9. The `parallelism` field is used to specify the number of partitions to use for Spark computations. The default value is 1.
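Since user functions are plain Python, a UDF can be sketched as an ordinary callable. This is a minimal, hypothetical example: the function name and signature are invented for illustration, and the packaging into the zip archives referenced by `topics` is covered in the UDF section.

```python
# Hypothetical user-defined function: a plain Python callable applied row by
# row. The runtime wires it into the Spark engine, so no PySpark code or
# inter-task plumbing is needed here.
def normalize_amount(amount: float, rate: float = 1.0) -> float:
    """Convert an amount using a fixed rate and round to 2 decimals."""
    return round(amount * rate, 2)

# Local check, independent of the Spark engine.
print(normalize_amount(19.995, rate=1.1))  # 21.99
```

Keeping UDFs free of engine-specific code is what lets the same function be tested locally and then shipped in a zip for the pipeline to load.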