Back to Blog
Apache SparkSpark ConnectData EngineeringPythonPySparkKubernetesBig DataRemote ExecutiongRPCDistributed Systems

Spark Connect — Decoupled Spark Client Architecture and Remote Execution

A practical guide to Spark Connect: the client-server architecture that separates PySpark applications from the Spark driver by serializing DataFrame operations as Protobuf logical plans sent over gRPC, eliminating JVM classpath coupling between the client and the cluster, the three-layer architecture of gRPC transport, Catalyst optimizer on the server, and Arrow record batch result streaming back to the Python client, starting a Spark Connect server with the spark-connect plugin package in local and YARN modes with session TTL, message size, and adaptive query execution configuration, Python remote SparkSession via SparkSession.builder.remote connecting to sc://host:15002, full DataFrame API over the wire including reads, transformations, aggregations, SQL, UDF registration, and temp views with no local JVM, session reconnection patterns saving and restoring the session ID across client restarts for driver-isolation resilience, Kubernetes Deployment manifest for a long-running Spark Connect server with ServiceAccount IRSA annotations for S3 access, RBAC Role for executor pod management, executor pod count and memory configuration via Kubernetes master URL, Envoy sidecar proxy configuration for JWT OIDC token validation on port 15003 fronting the gRPC port 15002 with remote JWKS endpoint verification, gRPC call metadata for bearer token propagation from the PySpark client, a side-by-side comparison table of Spark Connect versus embedded direct Spark across crash behavior, Python version coupling, multi-user session overhead, startup latency, UDF execution location, unsupported API surface, and best workload type, known limitations in Spark 3.5 including SparkContext inaccessibility, RDD API unavailability, and partial pandas-on-spark support with DataFrame API workarounds, and a 10-point production checklist covering dedicated driver node sizing, AQE and partition coalescing, session TTL tuning per workload, Arrow batch size tuning, gRPC port authentication, client-server version pinning, executor pod naming, History Server event logging, notebook kernel reconnect handling, and driver JVM heap monitoring.

2026-06-21

What Is Spark Connect and Why It Was Built

Spark Connect, introduced in Apache Spark 3.4 and stabilized in 3.5, is a new client-server architecture for Apache Spark that decouples the PySpark or Scala client application from the Spark driver process running on the cluster. Before Spark Connect, every PySpark script embedded a full Spark driver in the client process: JVM startup, executor registration, task scheduling — all inside the Python application. This created four compounding problems. First, Python package versions on the client had to match the Spark cluster's classpath exactly or risk serialization errors. Second, a crashing client process took the Spark driver — and all its running tasks — down with it. Third, resource quotas for the driver had to be set at cluster level, not per-user. Fourth, running multiple interactive notebooks against the same cluster meant multiple independent drivers, each with its own JVM overhead.

Spark Connect solves all four problems by moving the driver onto the cluster and exposing a language-agnostic gRPC API. The client sends serialized logical plans over the wire; the server translates them into physical plans, executes them on the cluster, and streams results back. The client has no JVM dependency — it is a pure Python (or Scala, or Go) library that speaks Protobuf. This is conceptually similar to how Trino separates query clients from the MPP execution engine — the application speaks SQL or a planning API; the engine handles distributed execution autonomously.

Driver Isolation

The Spark driver runs in the cluster, not the client process. Killing the notebook or script does not abort running jobs — the driver continues executing and results can be re-fetched on reconnect.

No Classpath Coupling

Client Python packages and Spark version are fully independent. Upgrade PySpark on the client without touching the cluster, or run clients pinned to older library versions against a newer server.

Multi-Tenant Sessions

A single Spark Connect server handles dozens of concurrent user sessions, each with isolated session state (configs, temp views, UDFs). No per-user driver JVM overhead.

Architecture — Logical Plans, gRPC, and the Execution Layer

When you call df.filter(col("status") == "active").groupBy("region").count() in a Spark Connect client, nothing executes locally. The client library translates the DataFrame operations into a Protobuf-encoded logical plan — a tree of relation nodes: Read → Filter → Aggregate. This plan is sent over a bidirectional gRPC stream to the Spark Connect server. The server deserializes it, runs it through Catalyst (Spark's query optimizer), generates a physical plan, and executes it against the cluster. Intermediate results stream back over the same gRPC connection as Arrow record batches, which the client materializes into a Pandas DataFrame or a PySpark DataFrame iterator.

The gRPC service has two primary RPC methods: ExecutePlan for executing a logical plan and streaming results, and AnalyzePlan for schema inference and plan validation without execution. Every request carries a session ID, so the server can associate temp views and configuration overrides with the correct user context. The session lives on the server side; the client connection can be dropped and re-established without losing session state.

# Spark Connect architecture overview
#
#  ┌──────────────────────────────────────────────────────────┐
#  │  Client (Python / Scala / Go / R)                       │
#  │                                                          │
#  │  df = spark.read.parquet("s3://bucket/events/")         │
#  │     .filter(col("ts") > "2026-01-01")                   │
#  │     .groupBy("region").agg(count("*"))                   │
#  │                                                          │
#  │  ──serializes to Protobuf logical plan──▶                │
#  └──────────────────────────────────────────────────────────┘
#                         │
#                    gRPC / HTTP2
#                         │
#  ┌──────────────────────▼───────────────────────────────────┐
#  │  Spark Connect Server (runs on cluster, port 15002)      │
#  │                                                          │
#  │  1. Deserialize Protobuf → Unresolved Logical Plan       │
#  │  2. Catalyst analysis: resolve column names & types      │
#  │  3. Catalyst optimization: predicate pushdown, etc.      │
#  │  4. Physical planning: choose join strategies, etc.      │
#  │  5. Execute on executor pool                             │
#  │  6. Stream results as Apache Arrow record batches        │
#  └──────────────────────────────────────────────────────────┘
#                         │
#              ┌──────────┴──────────┐
#              │ Executor  Executor  │  (YARN / K8s / Standalone)
#              └─────────────────────┘
#
# The client process never starts a JVM.
# Session state lives on the server — reconnects are free.

Setting Up the Spark Connect Server

Spark Connect is not a separate binary — it is a plugin to the standard Spark master or standalone driver. You enable it by adding --packages org.apache.spark:spark-connect_2.12:3.5.x and the spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin config. The server listens on port 15002 by default and accepts gRPC connections over HTTP/2. In production you front it with TLS termination and optionally wire in your existing auth middleware.

# Start a Spark Connect server — local mode (development)
export SPARK_HOME=/opt/spark-3.5.3
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk

${SPARK_HOME}/sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.3 \
  --conf spark.connect.grpc.binding.port=15002 \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.driver.memory=4g \
  --conf spark.executor.memory=8g \
  --conf spark.executor.cores=4 \
  --master local[*]

# Check the server is ready (gRPC health check)
# The server exposes a standard HTTP health endpoint on port 4040 (Spark UI)
curl -s http://localhost:4040/api/v1/applications | python3 -m json.tool
# spark-connect-server.sh — production startup with YARN
# Place in /opt/spark/conf/spark-connect-start.sh

#!/usr/bin/env bash
set -euo pipefail

SPARK_HOME="${SPARK_HOME:-/opt/spark}"
CONNECT_PORT="${SPARK_CONNECT_PORT:-15002}"
SPARK_VERSION="${SPARK_VERSION:-3.5.3}"

exec "${SPARK_HOME}/sbin/start-connect-server.sh" \
  --packages "org.apache.spark:spark-connect_2.12:${SPARK_VERSION}" \
  --master yarn \
  --deploy-mode client \
  --driver-memory 8g \
  --driver-cores 4 \
  --conf "spark.connect.grpc.binding.port=${CONNECT_PORT}" \
  --conf "spark.connect.grpc.maxInboundMessageSize=134217728" \
  --conf "spark.connect.session.ttl=3600" \
  --conf "spark.connect.session.connect.timeout=120" \
  --conf "spark.sql.adaptive.enabled=true" \
  --conf "spark.sql.adaptive.coalescePartitions.enabled=true" \
  --conf "spark.sql.adaptive.skewJoin.enabled=true" \
  --conf "spark.sql.shuffle.partitions=400" \
  --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
  --conf "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider" \
  --conf "spark.eventLog.enabled=true" \
  --conf "spark.eventLog.dir=hdfs:///spark-history" \
  "$@"

Note

Spark Connect sessions are long-lived and stateful. The spark.connect.session.ttl config controls how long an idle session is retained on the server before its state is garbage-collected. Set this generously for interactive notebook users (3600–7200 seconds) but tightly for short-lived pipeline clients (300 seconds) to avoid accumulating abandoned sessions that hold driver memory.

Python Remote Client — DataFrame API Over the Wire

The PySpark Connect client is shipped as part of the standard pyspark package from version 3.4 onward. You connect by calling SparkSession.builder.remote("sc://host:15002").getOrCreate(). The session object behaves identically to a local SparkSession for the vast majority of DataFrame operations: read, sql, createDataFrame, UDF registration, temp view creation — all work through the gRPC protocol with no code changes in your pipeline logic.

# pip install "pyspark[connect]==3.5.3"
# No JVM or JAVA_HOME required on the client machine

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg, date_trunc, lit
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# ── Connect to remote Spark Connect server ──────────────────────────────────
spark = (
    SparkSession.builder
    .remote("sc://spark-connect.data-platform.svc:15002")
    .getOrCreate()
)

print(f"Connected to Spark {spark.version}")
# Connected to Spark 3.5.3

# ── Reading from S3 — plan is sent to the server, not executed locally ───────
events = spark.read.parquet("s3a://datalake/events/2026/")

# Schema inference happens on the server, returned via AnalyzePlan RPC
print(events.schema)

# ── Transformations are accumulated as a logical plan client-side ─────────────
daily_agg = (
    events
    .filter(col("event_type").isin(["purchase", "add_to_cart"]))
    .withColumn("event_day", date_trunc("day", col("ts")))
    .groupBy("event_day", "region", "event_type")
    .agg(
        count("*").alias("event_count"),
        avg("revenue").alias("avg_revenue"),
    )
    .filter(col("event_count") > 10)
    .orderBy("event_day", "region")
)

# Nothing has run yet — daily_agg is still a logical plan

# ── Trigger execution — results stream back as Arrow batches ─────────────────
pdf = daily_agg.limit(1000).toPandas()
print(pdf.head())

# ── Write back to S3 via the server ──────────────────────────────────────────
daily_agg.write.format("parquet").mode("overwrite").save(
    "s3a://datalake/reports/daily-events/"
)

# ── Temporary views for SQL interface ─────────────────────────────────────────
events.createOrReplaceTempView("events")
result = spark.sql("""
    SELECT region, COUNT(*) AS cnt, AVG(revenue) AS avg_rev
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY region
    ORDER BY avg_rev DESC
    LIMIT 20
""")
result.show()

# ── UDF registration — runs on the server JVM, not locally ───────────────────
from pyspark.sql.functions import udf

@udf(returnType=StringType())
def normalize_region(region: str) -> str:
    mapping = {"US": "North America", "CA": "North America", "MX": "North America"}
    return mapping.get(region, region)

spark.udf.register("normalize_region", normalize_region)
events.withColumn("region_group", normalize_region(col("region"))).show(5)
# Reconnection pattern — session state survives client disconnects
# Use the session_id to reconnect to an existing server-side session

import os
from pyspark.sql import SparkSession

SESSION_ID_FILE = "/tmp/spark_connect_session_id"

def get_spark_session(server: str = "sc://spark-connect.data-platform.svc:15002") -> SparkSession:
    builder = SparkSession.builder.remote(server)

    # Reconnect to existing session if we have a saved ID
    if os.path.exists(SESSION_ID_FILE):
        with open(SESSION_ID_FILE) as f:
            session_id = f.read().strip()
        if session_id:
            builder = builder.config("spark.connect.session.id", session_id)

    spark = builder.getOrCreate()

    # Save the session ID for future reconnects
    with open(SESSION_ID_FILE, "w") as f:
        f.write(spark._client._session_id)  # type: ignore[attr-defined]

    return spark

spark = get_spark_session()

# Even if the network dropped between runs, the server-side session
# (temp views, configs) is still alive until the TTL expires.
spark.sql("SHOW TABLES").show()

Spark Connect on Kubernetes — Deployment Patterns

Kubernetes is the natural home for Spark Connect servers. Unlike traditional Spark on Kubernetes where each application spawns its own driver pod, a long-running Spark Connect server pod accepts connections from many clients concurrently. This shifts the operational model from ephemeral-per-job to persistent-shared-service, which is more efficient for interactive and notebook workloads where spinning up a new driver for every session would take 20–60 seconds.

The server pod needs RBAC to create executor pods in the same namespace (or a dedicated executor namespace). Client pods need network access to port 15002 on the server. In practice, you expose the server via a ClusterIP service and optionally an Ingress or Gateway API route for external notebook clients using TLS termination.

# spark-connect-k8s.yaml
# Spark Connect server as a long-running Kubernetes Deployment

apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-connect-sa
  namespace: data-platform
  annotations:
    # IRSA for S3 access (EKS) — or use Workload Identity on GKE
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/spark-s3-role
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-connect-executor-role
  namespace: data-platform
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["create", "get", "list", "watch", "delete", "patch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-connect-executor-rb
  namespace: data-platform
subjects:
  - kind: ServiceAccount
    name: spark-connect-sa
    namespace: data-platform
roleRef:
  kind: Role
  name: spark-connect-executor-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-connect-server
  namespace: data-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-connect-server
  template:
    metadata:
      labels:
        app: spark-connect-server
    spec:
      serviceAccountName: spark-connect-sa
      containers:
        - name: spark-connect
          image: apache/spark:3.5.3-scala2.12-java17-python3-ubuntu
          command: ["/opt/spark/sbin/start-connect-server.sh"]
          args:
            - "--packages"
            - "org.apache.spark:spark-connect_2.12:3.5.3"
            - "--conf"
            - "spark.master=k8s://https://kubernetes.default.svc"
            - "--conf"
            - "spark.kubernetes.namespace=data-platform"
            - "--conf"
            - "spark.kubernetes.container.image=apache/spark:3.5.3-scala2.12-java17-python3-ubuntu"
            - "--conf"
            - "spark.kubernetes.serviceAccountName=spark-connect-sa"
            - "--conf"
            - "spark.connect.grpc.binding.port=15002"
            - "--conf"
            - "spark.connect.session.ttl=3600"
            - "--conf"
            - "spark.sql.adaptive.enabled=true"
            - "--conf"
            - "spark.executor.instances=4"
            - "--conf"
            - "spark.executor.memory=8g"
            - "--conf"
            - "spark.executor.cores=4"
            - "--conf"
            - "spark.driver.memory=6g"
          ports:
            - containerPort: 15002
              name: grpc
            - containerPort: 4040
              name: spark-ui
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "12Gi"
          env:
            - name: SPARK_NO_DAEMONIZE
              value: "true"
          readinessProbe:
            httpGet:
              path: /api/v1/applications
              port: 4040
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /api/v1/applications
              port: 4040
            initialDelaySeconds: 60
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: spark-connect
  namespace: data-platform
spec:
  selector:
    app: spark-connect-server
  ports:
    - name: grpc
      port: 15002
      targetPort: 15002
    - name: spark-ui
      port: 4040
      targetPort: 4040
  type: ClusterIP

Note

A single Spark Connect server manages one shared Spark driver. This means all concurrent sessions share the same executor pool. For workloads with very different resource requirements — for example, a large batch reprocessing job and many small interactive queries — consider running separate Connect servers with different executor configurations and routing clients to the appropriate server via service name or a simple load balancer based on workload tags.

Authentication, Authorization, and Multi-Tenant Isolation

Out of the box, Spark Connect has no authentication. For production deployments you have two options: front the gRPC port with an Envoy or NGINX proxy that validates JWT tokens, or use the experimental Spark Connect server-side interceptors (available via the spark.connect.extensions.interceptor.classes config) to inject auth logic. The proxy approach is simpler and integrates with your existing IdP without modifying Spark code.

# Envoy proxy config for JWT validation on Spark Connect gRPC
# envoy-spark-connect.yaml

static_resources:
  listeners:
    - name: spark_connect_tls
      address:
        socket_address: { address: 0.0.0.0, port_value: 15003 }
      filter_chains:
        - transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  - certificate_chain: { filename: /certs/tls.crt }
                    private_key:       { filename: /certs/tls.key }
          filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                codec_type: HTTP2
                stat_prefix: spark_connect
                http_filters:
                  # JWT validation against OIDC JWKS endpoint
                  - name: envoy.filters.http.jwt_authn
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.jwt_authn.v3.JwtAuthentication
                      providers:
                        oidc_provider:
                          issuer: https://sso.company.com/realms/data-platform
                          remote_jwks:
                            http_uri:
                              uri: https://sso.company.com/realms/data-platform/protocol/openid-connect/certs
                              cluster: oidc_jwks
                              timeout: 5s
                      rules:
                        - match: { prefix: "/" }
                          requires: { provider_name: oidc_provider }
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                route_config:
                  virtual_hosts:
                    - name: spark_connect_backend
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route:
                            cluster: spark_connect_upstream
                            timeout: 0s  # streaming — no timeout

  clusters:
    - name: spark_connect_upstream
      connect_timeout: 5s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      http2_protocol_options: {}
      load_assignment:
        cluster_name: spark_connect_upstream
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: spark-connect.data-platform.svc
                      port_value: 15002
# Client connecting through TLS-terminated proxy with bearer token
# The pyspark connect client passes the token in gRPC metadata

import os
from pyspark.sql import SparkSession

CONNECT_SERVER = os.environ["SPARK_CONNECT_SERVER"]  # sc://spark-connect-proxy.data-platform.svc:15003
OIDC_TOKEN    = os.environ["OIDC_ACCESS_TOKEN"]

spark = (
    SparkSession.builder
    .remote(CONNECT_SERVER)
    # Spark Connect passes these as gRPC call metadata headers
    .config("spark.connect.grpc.arrow.maxBatchSize", "134217728")
    # Custom metadata for the auth interceptor
    .config("spark.connect.user.agentToken", OIDC_TOKEN)
    .getOrCreate()
)

# Session isolation — each user gets an independent namespace for temp views
spark.sql("CREATE OR REPLACE TEMP VIEW my_events AS SELECT * FROM delta.`s3a://datalake/events/`")
print(spark.catalog.listTables())  # Shows only this session's temp views

Spark Connect vs. Direct Spark — When to Choose Each

Spark Connect is not a universal replacement for embedded Spark — it is the right choice for specific access patterns. Understanding the tradeoffs helps you decide where Spark Connect belongs in your architecture and where traditional Spark deployments remain preferable. The key comparison axis is not performance (both use the same Spark execution engine) but operational model, failure semantics, and Python library management.

For large-scale batch pipelines run via Spark on Kubernetes operators, the per-job driver model still offers better isolation — a failed job does not affect other jobs. For interactive analytics, notebook-heavy workloads, and multi-user data platforms, Spark Connect's session-sharing model dramatically reduces cluster overhead. This is complementary to standard Spark performance optimization techniques — adaptive query execution, partition tuning, and broadcast join configuration all apply equally on a Spark Connect server.

DimensionSpark ConnectDirect Spark (embedded driver)
Client process crashServer continues executing; results re-fetchableJob terminates — driver and all running tasks die
Python version couplingClient and server are fully independentClient JVM classpath must match cluster exactly
Multi-user sessionsOne server, many sessions — shared executor poolSeparate driver JVM per user — high overhead
Startup latencySub-second (server already running)20–60 s to start driver JVM and register executors
UDF execution locationServer JVM (serialization cost)In-process — lower overhead for frequent UDFs
Unsupported operationsSome RDD API, accumulators, broadcast vars limitedFull API surface available
Best workload typeInteractive, notebooks, multi-tenant platformsLarge batch jobs, full API needed, strict isolation

Known Limitations and Workarounds

Spark Connect in 3.5 covers the full DataFrame and SQL API but has known gaps in lower-level Spark APIs. Being aware of these prevents runtime surprises in code migrated from classic PySpark.

# ── What WORKS in Spark Connect 3.5 ──────────────────────────────────────────
# Full DataFrame API: read, write, transformations, actions
# SQL via spark.sql()
# Python UDFs (registered via spark.udf.register or @udf decorator)
# Pandas UDFs / vectorized UDFs (Arrow-based)
# createDataFrame() from local Pandas/Arrow
# Streaming DataFrames (readStream / writeStream)
# Iceberg / Delta Lake via catalog connectors
# Temp views and global temp views
# SparkContext.setLogLevel()

# ── What does NOT work in Spark Connect 3.5 ──────────────────────────────────
# SparkContext direct access — sc.parallelize(), sc.textFile(), etc.
try:
    sc = spark.sparkContext  # raises RuntimeError in Spark Connect
except RuntimeError as e:
    print(f"Expected: {e}")
# RuntimeError: SparkContext is not available when using Spark Connect.

# RDD API is unavailable — use DataFrame API equivalents instead
# WRONG (classic PySpark):
#   rdd = sc.parallelize([1, 2, 3])
#   result = rdd.filter(lambda x: x > 1).collect()
#
# RIGHT (Spark Connect compatible):
df = spark.range(1, 4).filter("id > 1")
result = df.collect()

# Accumulators and broadcast variables — not available in Spark Connect 3.5
# Plan your migration to avoid these patterns or keep those jobs on classic Spark

# ── Pandas on Spark (pyspark.pandas) — partially supported ───────────────────
import pyspark.pandas as ps

# Most pandas-on-spark operations work via Spark Connect
psdf = ps.read_parquet("s3a://datalake/events/2026/")
print(psdf.describe())

# Some pandas-on-spark operations that rely on the RDD API will raise NotImplemented
# Check the Spark Connect compatibility matrix before migrating:

Production Checklist

1

Run the Spark Connect server on a dedicated node with reserved driver memory and CPU. Avoid co-locating it with executors — the driver process handles all concurrent session planning and should not compete for memory with task execution.

2

Enable Spark Adaptive Query Execution (spark.sql.adaptive.enabled=true) and partition coalescing on the server. These optimizations apply globally to all sessions and significantly reduce shuffle overhead for interactive queries.

3

Set spark.connect.session.ttl appropriately per use case. Long-lived interactive sessions should survive notebook disconnects (3600–7200 s); pipeline sessions should expire quickly (300 s) to free server resources.

4

Use the Arrow record batch stream for large result sets. Ensure spark.connect.grpc.arrow.maxBatchSize is tuned to match your network MTU and expected query result sizes — the default 4 MB per batch is conservative.

5

Do not expose port 15002 without authentication. Front with Envoy JWT validation or a service mesh mTLS policy. Anyone with gRPC access to the port can execute arbitrary Spark SQL on your cluster.

6

Pin pyspark client and server versions together in CI. Spark Connect uses Protobuf-versioned APIs — a client running 3.5.1 against a 3.5.3 server may encounter minor compatibility differences in newer plan nodes.

7

Use spark.kubernetes.executor.podNamePrefix to make executor pod names identifiable per application (e.g., sc-analytics-exec). This makes kubectl log tailing and executor crash investigation much faster.

8

Enable Spark History Server alongside the Connect server (spark.eventLog.enabled=true). Connect sessions generate the same event log as classic Spark jobs — you get full stage-level performance visibility in the Spark UI.

9

For notebook workloads, configure spark.connect.session.connect.timeout (120 s) and retry logic in your notebook kernel. Jupyter kernels that idle past the connect timeout will get a gRPC UNAVAILABLE on the next cell execution — a clean reconnect restores the session.

10

Monitor active session count and driver JVM heap via Spark metrics (spark.metrics.conf). Alert on driver heap above 80% — this is the primary constraint for serving many concurrent Spark Connect sessions from a single server.

Managing a Spark cluster where every notebook user starts a separate driver JVM, fighting Python dependency conflicts between client machines and the cluster classpath, or losing running jobs when a client process crashes?

We design and deploy Spark Connect platforms — from Spark Connect server configuration on YARN and Kubernetes with IRSA-backed S3 access and executor pod RBAC, through Envoy JWT proxy setup for OIDC-authenticated gRPC connections, PySpark remote client integration with session reconnect logic for driver-isolated pipelines, Arrow batch size and session TTL tuning for interactive notebook workloads, Kubernetes Deployment design with dedicated driver node affinity and executor pod naming, Spark History Server integration for full session observability, and migration assessment of existing PySpark codebases for Spark Connect API compatibility. Let’s talk.

Let's Talk

Related Articles

DataSOps Consulting

Need help implementing this in production?

We build and operate data pipelines, AI systems, and observability stacks for engineering teams. Reach out for a free 30-minute architecture review.