Back to Blog
PandasPythonData EngineeringPyArrowPerformanceCopy-on-WriteMemory OptimizationData ScienceNumPyAnalytics

Pandas 2.x Performance Guide — Copy-on-Write, PyArrow Backends, and Memory Efficiency

A practical guide to pandas 2.x performance: Copy-on-Write semantics that replace the SettingWithCopyWarning era with predictable mutation behavior where every derived DataFrame is always independent, the three activation paths for the PyArrow dtype backend — dtype_backend='pyarrow' at read time, convert_dtypes(dtype_backend='pyarrow') on existing DataFrames, and explicit pd.ArrowDtype column construction with pa.string(), pa.int64(), pa.float32(), pa.dictionary(), and pa.timestamp() types, memory comparison showing PyArrow dictionary-encoded string columns using 14x less RAM than NumPy object arrays for low-cardinality categorical data, read_parquet with PyArrow engine and dtype_backend='pyarrow' for zero-copy Arrow deserialization without intermediate NumPy conversion, column projection via the columns parameter to avoid decompressing unused Parquet columns, predicate pushdown with the filters parameter using row group statistics to skip non-matching data before decompression, nullable integer types including Int8 through Int64 and ArrowDtype(pa.int64()) that stay integer even when containing None without silent float upcast, convert_dtypes() for automatic nullable type inference on legacy DataFrames, a memory optimization utility function downcasting int64 columns to the smallest fitting integer type and converting low-cardinality object columns to categorical, ordered pd.Categorical with explicit category lists for status columns enabling comparison operators, chunked CSV processing with chunksize and the C engine for streaming files larger than RAM with per-chunk filtering and running aggregation accumulation, DataFrame.eval() and query() with numexpr for multi-column arithmetic that eliminates intermediate array allocations with 2-4x speedup, PyArrow string operations executing in C++ rather than Python per-element loops for 5-15x faster str.lower() str.contains() str.extract() and str.split() on million-row string columns, DuckDB integration querying PyArrow-backed DataFrames via registered views with zero-copy access and returning results via .df() or fetch_arrow_table(), and a benchmarking pattern using tracemalloc and time.perf_counter to measure real memory and latency gains before and after dtype optimization.

2026-06-25

What Changed in Pandas 2.x

Pandas 2.0 was a landmark release that restructured the library's memory model after nearly a decade of NumPy-only internals. The three changes that matter most for production data engineering work are: Copy-on-Write semantics that eliminate a whole class of silent mutation bugs, the PyArrow backend that replaces NumPy arrays for string and nullable numeric storage, and first-class support for Arrow-native I/O that makes Parquet reads dramatically faster. Pandas 2.1 and 2.2 built on those foundations, making CoW opt-out instead of opt-in and stabilizing the Arrow dtype APIs. Polars uses Arrow columnar memory natively from the ground up — pandas 2.x narrows that gap significantly by giving you Arrow-backed columns as a first-class option within the familiar pandas API surface.

Copy-on-Write

Views no longer silently modify the parent DataFrame. Every operation that modifies data returns a new object backed by independent memory, making mutation behavior predictable and testable.

PyArrow Backend

Columns backed by PyArrow arrays instead of NumPy. Strings store as dictionary-encoded Arrow arrays — 3–8× less memory than Python object arrays. Native null support without sentinel NaN values.

Arrow I/O

read_parquet with PyArrow engine returns Arrow-backed columns with zero-copy deserialization. No intermediate conversion through NumPy — data lands directly in Arrow memory layout.

Installation and Environment Setup

Pandas 2.x requires Python 3.9 or later. The PyArrow backend requires pyarrow>=10.0 installed in the same environment. Installing them together via pip resolves the correct compatible versions automatically.

# Install pandas 2.x with PyArrow and optional performance extras
pip install "pandas>=2.2" pyarrow

# Verify versions
python -c "import pandas as pd; import pyarrow as pa; print(pd.__version__, pa.__version__)"
# 2.2.3  17.0.0

# Optional: numexpr for accelerated eval() and query()
pip install numexpr bottleneck

# Full data engineering stack
pip install "pandas>=2.2" pyarrow pyarrow-hotfix numexpr bottleneck fastparquet
import pandas as pd
import pyarrow as pa
import numpy as np

# Check Copy-on-Write status — enabled by default from pandas 2.0+
pd.options.mode.copy_on_write  # True in pandas 2.2

# Explicitly enable for pandas 2.0/2.1 if you pinned an older version
pd.options.mode.copy_on_write = True

# Confirm default dtype inference settings
pd.options.future.infer_string  # True — uses StringDtype by default in 3.x

Note

Copy-on-Write is enabled by default in pandas 2.2 and will be the only behavior in pandas 3.0. If you see SettingWithCopyWarning in pandas 1.x code, migrating to CoW semantics is the fix — not suppressing the warning. CoW guarantees that df2 = df[mask] always returns an independent object with no shared state.

Copy-on-Write — What It Means in Practice

Before CoW, pandas chained indexing like df[mask]['col'] = value might or might not modify the original DataFrame depending on whether the first indexing operation returned a view or a copy — a detail that depended on the dtype, the shape of the selection, and whether pandas internally optimized the access. CoW makes this consistent: any in-place modification on a derived object writes to a new copy, never the parent.

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "score": [95, 87, 92, 78],
    "grade": ["A", "B", "A", "C"],
})

# Old behavior (pre-CoW): this might modify df, might not
# df[df.grade == "A"]["score"] = 100  # SettingWithCopyWarning — unreliable

# CoW behavior: slice is always independent
df_a_students = df[df.grade == "A"].copy()  # explicit copy for mutation
df_a_students["score"] = 100
print(df["score"].tolist())  # [95, 87, 92, 78] — unchanged

# Preferred CoW pattern: use .loc for in-place modification on the original
df.loc[df.grade == "A", "score"] = 100
print(df["score"].tolist())  # [100, 87, 100, 78] — correct

# assign() always returns a new DataFrame — idiomatic CoW pattern
df_modified = df.assign(score_normalized=lambda x: x.score / x.score.max())
print(df.columns.tolist())   # original unchanged
print(df_modified.columns.tolist())  # includes score_normalized
# CoW and method chaining — the idiomatic pandas 2.x style
import pandas as pd

result = (
    pd.read_csv("sales.csv")
    .rename(columns={"sale_amt": "revenue"})
    .query("revenue > 0")
    .assign(
        revenue_k=lambda x: x.revenue / 1000,
        month=lambda x: pd.to_datetime(x.date).dt.to_period("M"),
    )
    .groupby("month", as_index=False)
    .agg(
        total_revenue=("revenue_k", "sum"),
        order_count=("order_id", "nunique"),
    )
    .sort_values("month")
)

# Each method returns a new DataFrame — CoW ensures no implicit shared memory
# No copy overhead unless data is actually written — CoW defers copies lazily

PyArrow Backend — dtype_backend and ArrowDtype

The PyArrow backend replaces NumPy arrays with Apache Arrow arrays as the underlying column storage. The practical impact is most visible for string columns: NumPy stores Python strings as object arrays — each element is a Python heap object with 56 bytes of overhead per string, plus the string data itself. PyArrow stores strings as dictionary-encoded or contiguous byte arrays with 4–8 bytes of overhead per entry. DuckDB queries Parquet files directly in Arrow columnar format — pandas PyArrow-backed DataFrames can receive query results from DuckDB without any format conversion using the fetchdf() method.

import pandas as pd
import pyarrow as pa
import sys

# Demonstrate memory difference: NumPy object vs PyArrow string
n = 1_000_000
categories = ["orders", "returns", "exchanges", "pending", "fulfilled"]

# NumPy object array (default pre-pandas 2.0)
df_numpy = pd.DataFrame({
    "status": pd.array(
        [categories[i % 5] for i in range(n)],
        dtype=object,
    )
})

# PyArrow string array (dictionary-encoded)
df_arrow = pd.DataFrame({
    "status": pd.array(
        [categories[i % 5] for i in range(n)],
        dtype=pd.ArrowDtype(pa.dictionary(pa.int8(), pa.string())),
    )
})

print(f"NumPy object:  {df_numpy.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"PyArrow dict:  {df_arrow.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# NumPy object:  56.0 MB
# PyArrow dict:   4.1 MB  — ~14x less memory
import pandas as pd
import pyarrow as pa

# Three ways to get PyArrow-backed DataFrames

# 1. dtype_backend parameter at read time — recommended for new pipelines
df = pd.read_parquet("events.parquet", dtype_backend="pyarrow")
df = pd.read_csv("events.csv",         dtype_backend="pyarrow")
df = pd.read_json("events.json",       dtype_backend="pyarrow")

# 2. convert_dtypes() on an existing DataFrame
df_numpy = pd.read_csv("events.csv")
df_arrow = df_numpy.convert_dtypes(dtype_backend="pyarrow")

# 3. Explicit ArrowDtype column-by-column
df_explicit = pd.DataFrame({
    "id":        pd.array([1, 2, 3], dtype=pd.ArrowDtype(pa.int64())),
    "name":      pd.array(["Alice", "Bob", None], dtype=pd.ArrowDtype(pa.string())),
    "score":     pd.array([9.5, 8.2, None], dtype=pd.ArrowDtype(pa.float64())),
    "active":    pd.array([True, False, None], dtype=pd.ArrowDtype(pa.bool_())),
    "timestamp": pd.array(
        ["2026-01-01", "2026-01-02", None],
        dtype=pd.ArrowDtype(pa.timestamp("us", tz="UTC")),
    ),
})

print(df_explicit.dtypes)
# id            int64[pyarrow]
# name         string[pyarrow]
# score        double[pyarrow]
# active         bool[pyarrow]
# timestamp    timestamp[us, tz=UTC][pyarrow]

Note

PyArrow-backed columns handle None / pd.NA natively without coercing integers to float. NumPy's int64 cannot represent missing values, so pandas 1.x silently upcast nullable integer columns to float64. With pd.ArrowDtype(pa.int64()), an integer column stays integer even when it contains None — no silent precision loss.

Parquet I/O — Zero-Copy Reads and Predicate Pushdown

Parquet is the standard storage format for pandas workloads in data lakes. Pandas 2.x with PyArrow engine reads Parquet files into Arrow-backed DataFrames without any intermediate NumPy conversion. The filters parameter pushes predicates into the Parquet reader, skipping row groups that don't match before they're even decompressed. When data exceeds single-machine memory, Apache Spark applies the same Parquet predicate pushdown and column pruning at cluster scale — the optimization principles carry over directly from single-node pandas to distributed Spark DataFrames.

import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa

# Basic read — Arrow-backed columns, no intermediate conversion
df = pd.read_parquet(
    "s3://my-bucket/events/year=2026/month=06/",
    engine="pyarrow",
    dtype_backend="pyarrow",
)

# Column projection — only read columns you need (avoids decompressing unused columns)
df = pd.read_parquet(
    "events.parquet",
    columns=["user_id", "event_type", "revenue", "timestamp"],
    dtype_backend="pyarrow",
)

# Predicate pushdown — skip row groups that don't match before reading
df = pd.read_parquet(
    "events.parquet",
    filters=[
        ("event_type", "in", ["purchase", "refund"]),
        ("revenue", ">", 0),
        ("timestamp", ">=", "2026-01-01"),
    ],
    dtype_backend="pyarrow",
)

# Writing Parquet — compression and row group size
df.to_parquet(
    "output.parquet",
    engine="pyarrow",
    compression="zstd",          # zstd is faster than gzip, better ratio than snappy
    compression_level=3,
    row_group_size=100_000,      # row groups for partition pruning
    index=False,
)

print(f"Read {len(df):,} rows, {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa

# Reading partitioned Parquet datasets (Hive-style partitioning)
# /data/events/
#   year=2026/month=06/day=01/part-0.parquet
#   year=2026/month=06/day=02/part-0.parquet
#   ...

df = pd.read_parquet(
    "data/events/",
    filters=[
        ("year", "=", 2026),
        ("month", "=", 6),
    ],
    dtype_backend="pyarrow",
)

# Using PyArrow directly for advanced read patterns
dataset = pq.ParquetDataset(
    "data/events/",
    filters=[("month", "=", 6)],
    use_legacy_dataset=False,
)
table = dataset.read(columns=["user_id", "event_type", "revenue"])
df = table.to_pandas(
    types_mapper=pd.ArrowDtype,   # preserve Arrow dtypes in the DataFrame
    zero_copy_only=False,
)

# Chunked reading for files larger than RAM
parquet_file = pq.ParquetFile("large_events.parquet")
chunks = []
for batch in parquet_file.iter_batches(batch_size=500_000):
    chunk = batch.to_pandas(types_mapper=pd.ArrowDtype)
    # process or filter chunk before accumulating
    chunk = chunk[chunk["revenue"] > 100]
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)

Memory Optimization — Downcasting and Categorical Types

Even with PyArrow backends, large DataFrames can exceed available RAM. Systematic downcasting of numeric columns and converting repeated string values to categorical types are the two highest-leverage memory reduction techniques before reaching for chunked processing or distributed computing.

import pandas as pd
import numpy as np

def optimize_dataframe_memory(df: pd.DataFrame) -> pd.DataFrame:
    """
    Downcast numeric columns and convert low-cardinality strings to categorical.
    Returns a new DataFrame with optimized dtypes.
    """
    original_mem = df.memory_usage(deep=True).sum()
    result = df.copy()

    for col in result.columns:
        col_dtype = result[col].dtype

        # Downcast integers
        if col_dtype in [np.int64, np.int32, np.int16]:
            col_min = result[col].min()
            col_max = result[col].max()

            if col_min >= np.iinfo(np.int8).min and col_max <= np.iinfo(np.int8).max:
                result[col] = result[col].astype(np.int8)
            elif col_min >= np.iinfo(np.int16).min and col_max <= np.iinfo(np.int16).max:
                result[col] = result[col].astype(np.int16)
            elif col_min >= np.iinfo(np.int32).min and col_max <= np.iinfo(np.int32).max:
                result[col] = result[col].astype(np.int32)

        # Downcast floats
        elif col_dtype == np.float64:
            result[col] = pd.to_numeric(result[col], downcast="float")

        # Convert low-cardinality object/string columns to Categorical
        elif col_dtype == object or str(col_dtype) == "string":
            n_unique = result[col].nunique()
            n_total = len(result[col])
            if n_unique / n_total < 0.05:   # fewer than 5% unique values
                result[col] = result[col].astype("category")

    optimized_mem = result.memory_usage(deep=True).sum()
    reduction = (1 - optimized_mem / original_mem) * 100
    print(f"Memory: {original_mem / 1e6:.1f} MB → {optimized_mem / 1e6:.1f} MB ({reduction:.1f}% reduction)")
    return result


df = pd.read_csv("orders.csv")
df_opt = optimize_dataframe_memory(df)
# Memory: 480.2 MB → 112.6 MB (76.5% reduction)
import pandas as pd

# Categorical dtype — ideal for columns with repeated string values
# status, country, product_category, etc.

df = pd.read_csv("orders.csv")

# Manual categorical conversion with explicit ordered categories
df["status"] = pd.Categorical(
    df["status"],
    categories=["pending", "processing", "shipped", "delivered", "cancelled"],
    ordered=True,
)
df["country"] = df["country"].astype("category")

# Ordered categoricals support comparisons
high_value_shipped = df[df.status >= "shipped"]

# Categorical + groupby — fast because groupby operates on integer codes
agg = df.groupby("country", observed=True).agg(
    total_revenue=("revenue", "sum"),
    order_count=("order_id", "count"),
)

# Categorical read from CSV
df = pd.read_csv(
    "orders.csv",
    dtype={
        "status": "category",
        "country": "category",
        "product_category": "category",
    },
)

# Check memory usage per column
print(
    df.memory_usage(deep=True)
      .sort_values(ascending=False)
      .apply(lambda x: f"{x / 1e6:.2f} MB")
)

Nullable Integer and Float Types

Pandas introduced nullable integer types (Int8, Int16, Int32, Int64) to solve the problem of integer columns silently becoming float when they contain nulls. The pandas nullable types use a separate boolean mask array alongside the NumPy integer array, while the PyArrow-backed types use Arrow's native validity bitmap — the Arrow approach is more memory efficient and integrates better with Parquet.

import pandas as pd
import numpy as np

# Problem: NumPy int64 cannot represent NA — silently upcasts to float64
s_numpy = pd.Series([1, 2, None, 4], dtype="float64")  # forced to float
print(s_numpy.dtype)   # float64
print(s_numpy[2])      # nan — not NA, causing .isna() inconsistencies

# Solution A: pandas nullable integer (NumPy-backed with mask)
s_nullable = pd.Series([1, 2, None, 4], dtype="Int64")
print(s_nullable.dtype)   # Int64
print(s_nullable[2])      # <NA>
print(s_nullable.isna())  # False False True False

# Solution B: PyArrow int64 (Arrow-backed, preferred for I/O-heavy workloads)
import pyarrow as pa
s_arrow = pd.Series([1, 2, None, 4], dtype=pd.ArrowDtype(pa.int64()))
print(s_arrow.dtype)   # int64[pyarrow]
print(s_arrow[2])      # <NA>

# Arithmetic preserves nullability — no silent NaN propagation
print(s_nullable + 10)  # 11 12 <NA> 14
print(s_arrow + 10)     # 11 12 <NA> 14

# convert_dtypes() — upgrade all columns to best nullable types automatically
df = pd.read_csv("data.csv")
df = df.convert_dtypes()       # uses pandas nullable types
df = df.convert_dtypes(dtype_backend="pyarrow")  # uses Arrow types

eval() and query() with numexpr

DataFrame.eval() and DataFrame.query() use numexpr under the hood when it is installed. Numexpr compiles arithmetic expressions to native code that operates on multiple rows simultaneously, avoiding the Python interpreter overhead for each element and eliminating the intermediate NumPy arrays that standard column arithmetic allocates.

import pandas as pd
import numpy as np

n = 5_000_000
df = pd.DataFrame({
    "revenue":    np.random.uniform(0, 10000, n),
    "cost":       np.random.uniform(0, 5000, n),
    "tax_rate":   np.random.uniform(0.05, 0.25, n),
    "discount":   np.random.uniform(0, 0.3, n),
    "is_premium": np.random.choice([True, False], n),
})

# Standard pandas — allocates 3 intermediate arrays
profit_standard = (df.revenue - df.cost) * (1 - df.tax_rate) * (1 - df.discount)

# eval() with numexpr — single pass, no intermediate allocations
# ~2-4x faster on large DataFrames with numexpr installed
df.eval(
    "profit = (revenue - cost) * (1 - tax_rate) * (1 - discount)",
    inplace=True,
)

# query() for row filtering — more readable than boolean indexing
high_margin = df.query(
    "profit > 500 and is_premium == True and revenue > 2000"
)

# query() with external variables using @ prefix
threshold = 500
region_list = ["EMEA", "APAC"]
filtered = df.query("profit > @threshold")

# eval() with local variables
discount_cap = 0.15
df.eval(
    "capped_discount = discount.clip(upper=@discount_cap)",
    local_dict={"discount_cap": discount_cap},
    inplace=True,
)

Chunked Processing for Files Larger Than RAM

When a CSV or JSON file is too large to read into memory at once, pandas provides chunksize parameter for iteration. The pattern is to process each chunk independently — applying filters, aggregations, or transformations — and accumulate results before a final merge. Keep chunk processing functions pure (no side effects) so they compose cleanly in the pipeline.

import pandas as pd
from pathlib import Path

def process_large_csv(
    path: str,
    chunk_size: int = 500_000,
    date_col: str = "timestamp",
    min_revenue: float = 0.0,
) -> pd.DataFrame:
    """
    Stream-process a large CSV, filter rows, aggregate per month.
    Returns a small summary DataFrame that fits in memory.
    """
    monthly_totals = {}

    dtype_map = {
        "user_id":    "int32",
        "revenue":    "float32",
        "event_type": "category",
        "country":    "category",
    }

    reader = pd.read_csv(
        path,
        chunksize=chunk_size,
        dtype=dtype_map,
        parse_dates=[date_col],
        engine="c",         # C engine is ~3x faster than Python engine
    )

    for i, chunk in enumerate(reader):
        # Filter early to minimize memory pressure
        chunk = chunk[chunk["revenue"] > min_revenue]
        chunk = chunk[chunk["event_type"] == "purchase"]

        # Extract month key
        chunk["month"] = chunk[date_col].dt.to_period("M")

        # Aggregate this chunk
        agg = chunk.groupby("month").agg(
            revenue=("revenue", "sum"),
            orders=("user_id", "count"),
        )

        # Merge into running totals
        for month, row in agg.iterrows():
            if month not in monthly_totals:
                monthly_totals[month] = {"revenue": 0.0, "orders": 0}
            monthly_totals[month]["revenue"] += row["revenue"]
            monthly_totals[month]["orders"]  += row["orders"]

        print(f"Chunk {i + 1}: processed {len(chunk):,} filtered rows")

    result = pd.DataFrame.from_dict(monthly_totals, orient="index")
    result.index.name = "month"
    return result.sort_index()


summary = process_large_csv("events_2026.csv", chunk_size=500_000, min_revenue=10.0)
print(summary.head())

String Operations — PyArrow vs NumPy Object Performance

String operations on object dtype columns loop through Python objects one at a time — each .str.lower() call dispatches to a Python method per row. PyArrow string arrays execute the same operations in C++, processing entire buffers at once. For datasets with millions of string rows this is a 5–15× speedup on typical string operations.

import pandas as pd
import pyarrow as pa

n = 2_000_000
emails = [f"user_{i}@example.com" for i in range(n)]

# Object dtype — each str operation loops through Python objects
s_obj = pd.Series(emails, dtype=object)

# PyArrow string dtype — C++ string kernels
s_arrow = pd.Series(emails, dtype=pd.ArrowDtype(pa.string()))

# Common string operations — same API, different performance profile
# Lower-case conversion
lower_obj   = s_obj.str.lower()
lower_arrow = s_arrow.str.lower()

# Extract domain from email
domain_obj   = s_obj.str.extract(r"@(.+)$")[0]
domain_arrow = s_arrow.str.extract(r"@(.+)$")[0]

# Contains check
has_example_obj   = s_obj.str.contains("example", regex=False)
has_example_arrow = s_arrow.str.contains("example", regex=False)

# Split and expand — common ETL operation
parts_obj   = s_obj.str.split("@", expand=True)
parts_arrow = s_arrow.str.split("@", expand=True)

# Replace — URL normalization, PII scrubbing
cleaned_obj   = s_obj.str.replace(r"d+", "X", regex=True)
cleaned_arrow = s_arrow.str.replace(r"d+", "X", regex=True)

# Padding and stripping — data cleaning
padded = s_arrow.str.strip().str.ljust(30)

print(f"Object dtype:  {s_obj.memory_usage(deep=True) / 1e6:.1f} MB")
print(f"Arrow string:  {s_arrow.memory_usage(deep=True) / 1e6:.1f} MB")

DuckDB Integration — SQL on DataFrames

DuckDB can query pandas DataFrames directly via its Python API, executing SQL against in-memory data without any copy when the DataFrame uses Arrow-backed storage. The combination of pandas for data loading and shaping with DuckDB for complex SQL aggregations covers most analytical patterns more efficiently than pure pandas for aggregation-heavy workloads.

import duckdb
import pandas as pd
import pyarrow as pa

# Load data with PyArrow backend
orders = pd.read_parquet("orders.parquet", dtype_backend="pyarrow")
customers = pd.read_parquet("customers.parquet", dtype_backend="pyarrow")

# DuckDB can query PyArrow-backed DataFrames without copy
con = duckdb.connect()

# Register DataFrames as DuckDB views
con.register("orders", orders)
con.register("customers", customers)

# Run SQL — aggregations, window functions, CTEs
result_df = con.execute("""
    WITH monthly_stats AS (
        SELECT
            c.country,
            DATE_TRUNC('month', o.created_at) AS month,
            COUNT(*)                           AS order_count,
            SUM(o.revenue)                     AS total_revenue,
            AVG(o.revenue)                     AS avg_order_value,
            SUM(o.revenue) / NULLIF(COUNT(DISTINCT c.customer_id), 0) AS revenue_per_customer
        FROM orders o
        JOIN customers c USING (customer_id)
        WHERE o.status = 'completed'
          AND o.created_at >= '2026-01-01'
        GROUP BY 1, 2
    )
    SELECT
        *,
        SUM(total_revenue) OVER (PARTITION BY country ORDER BY month) AS cumulative_revenue
    FROM monthly_stats
    ORDER BY country, month
""").df()   # returns a pandas DataFrame

# fetchdf() returns arrow-backed pandas; fetch_arrow_table() returns raw Arrow table
arrow_table = con.execute("SELECT * FROM orders LIMIT 1000").fetch_arrow_table()
df_from_arrow = arrow_table.to_pandas(types_mapper=pd.ArrowDtype)

print(f"Result shape: {result_df.shape}")

Benchmarking — Measuring Real Gains

The correct way to measure pandas performance is with timeit or %timeit for microbenchmarks, and memory_profiler or tracemalloc for memory. Always benchmark on your actual data shapes — string cardinality, null density, and column count all affect which backend wins.

import pandas as pd
import pyarrow as pa
import numpy as np
import tracemalloc
import time
from contextlib import contextmanager

@contextmanager
def measure(label: str):
    tracemalloc.start()
    t0 = time.perf_counter()
    yield
    elapsed = time.perf_counter() - t0
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{label:<40} {elapsed * 1000:7.1f} ms  peak {peak / 1e6:6.1f} MB")

n = 3_000_000
np.random.seed(42)
countries = ["US", "DE", "GB", "FR", "JP", "CA", "AU", "BR", "IN", "MX"]

data = {
    "user_id":  np.random.randint(1, 1_000_000, n),
    "revenue":  np.random.exponential(100, n),
    "country":  [countries[i % len(countries)] for i in range(n)],
    "status":   ["active" if i % 3 else "churned" for i in range(n)],
}

# NumPy / object backend
df_numpy = pd.DataFrame(data)

# PyArrow backend
df_arrow = pd.DataFrame({
    "user_id": pd.array(data["user_id"], dtype=pd.ArrowDtype(pa.int32())),
    "revenue": pd.array(data["revenue"], dtype=pd.ArrowDtype(pa.float32())),
    "country": pd.array(data["country"], dtype=pd.ArrowDtype(pa.dictionary(pa.int8(), pa.string()))),
    "status":  pd.array(data["status"],  dtype=pd.ArrowDtype(pa.dictionary(pa.int8(), pa.string()))),
})

print(f"Memory — NumPy:  {df_numpy.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"Memory — Arrow:  {df_arrow.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print()

with measure("groupby revenue sum (NumPy)"):
    _ = df_numpy.groupby("country")["revenue"].sum()

with measure("groupby revenue sum (Arrow)"):
    _ = df_arrow.groupby("country")["revenue"].sum()

with measure("string filter (NumPy object)"):
    _ = df_numpy[df_numpy["status"] == "active"]

with measure("string filter (Arrow dict)"):
    _ = df_arrow[df_arrow["status"] == "active"]

Production Checklist

1

Enable Copy-on-Write at application startup with pd.options.mode.copy_on_write = True if running pandas 2.0 or 2.1. Pandas 2.2 enables it by default and pandas 3.0 will remove the toggle entirely. Audit all existing code for chained indexing patterns (df[mask]['col'] = value) and replace them with df.loc[mask, 'col'] = value before enabling CoW to avoid silent no-ops.

2

Use dtype_backend='pyarrow' for all read_parquet, read_csv, and read_json calls in new pipelines. The PyArrow backend eliminates float upcasting of nullable integers, reduces string column memory by 5–15×, and enables zero-copy data exchange with DuckDB, Apache Arrow Flight, and other Arrow-native consumers without intermediate conversion.

3

Apply convert_dtypes(dtype_backend='pyarrow') to DataFrames loaded from legacy code paths that predate PyArrow support. This is a one-line migration that upgrades all columns to their best Arrow-backed nullable type without requiring a full rewrite of the loading logic.

4

Use column projection when reading Parquet files — always pass the columns parameter listing only the columns your pipeline actually uses. Parquet stores each column independently; reading 4 of 40 columns takes roughly 10% of the I/O and decompression time of reading all 40 columns.

5

Apply predicate pushdown via the filters parameter in read_parquet for date-range and equality filters. Parquet row group statistics (min/max per row group) allow the reader to skip entire row groups before decompressing, reducing read volume by 50–90% for time-series data with date filters.

6

Convert low-cardinality string columns (fewer than 5% unique values) to pd.ArrowDtype(pa.dictionary(...)) or 'category' dtype after loading. Status, country, product_category, and event_type columns typically have 5–200 unique values across millions of rows — dictionary encoding stores them as integer codes rather than repeated string data.

7

Replace object dtype string columns in hot code paths — groupby, merge, and filter operations — with ArrowDtype string columns. String equality comparisons on Arrow dictionary-encoded columns compare integer codes, not string data; groupby on Arrow dict columns operates on integer keys, not Python string objects.

8

Install numexpr (pip install numexpr) and use DataFrame.eval() for multi-column arithmetic expressions. Numexpr eliminates intermediate NumPy array allocations and processes expressions in SIMD-vectorized chunks, reducing both peak memory usage and wall-clock time by 2–4× on expressions that combine 3+ columns.

9

Profile memory before and after dtype optimization using df.memory_usage(deep=True).sum() and tracemalloc for peak allocation. The deep=True flag traverses Python object references in object dtype columns — without it, string column memory is undercounted by 10–50× and you cannot measure optimization impact accurately.

10

Benchmark groupby aggregations on your actual data cardinality before choosing between pandas categorical, ArrowDtype dictionary, and plain string. High-cardinality columns (user IDs, session IDs, URLs) benefit less from dictionary encoding than low-cardinality columns; for high-cardinality groupby, a direct DuckDB SQL aggregation on a PyArrow-backed DataFrame often outperforms native pandas by 3–5×.

Running pandas pipelines that OOM on multi-GB CSVs, spending more time on SettingWithCopyWarning debugging than on actual data work, or seeing float64 integer columns because nullable types were not configured correctly?

We modernize pandas data pipelines for production — from Copy-on-Write migration audits that replace chained indexing patterns with .loc and .assign(), through PyArrow backend adoption with dtype_backend='pyarrow' across all I/O entry points, Parquet column projection and predicate pushdown configuration for data lake reads, nullable integer type migration eliminating silent float upcasting, categorical downcasting utilities for low-cardinality string columns, chunked processing pipelines for files larger than available RAM, numexpr-accelerated eval() rewrites of multi-column arithmetic hotpaths, DuckDB SQL integration for aggregation-heavy analytical queries on Arrow-backed DataFrames, and memory profiling with tracemalloc to quantify optimization impact. Let’s talk.

Let's Talk

Related Articles

DataSOps Consulting

Need help implementing this in production?

We build and operate data pipelines, AI systems, and observability stacks for engineering teams. Reach out for a free 30-minute architecture review.