What Changed in Pandas 2.x
Pandas 2.0 was a landmark release that restructured the library's memory model after nearly a decade of NumPy-only internals. The three changes that matter most for production data engineering work are: Copy-on-Write semantics that eliminate a whole class of silent mutation bugs, the PyArrow backend that replaces NumPy arrays for string and nullable numeric storage, and first-class support for Arrow-native I/O that makes Parquet reads dramatically faster. Pandas 2.1 and 2.2 built on those foundations, making CoW opt-out instead of opt-in and stabilizing the Arrow dtype APIs. Polars uses Arrow columnar memory natively from the ground up — pandas 2.x narrows that gap significantly by giving you Arrow-backed columns as a first-class option within the familiar pandas API surface.
Copy-on-Write
Views no longer silently modify the parent DataFrame. Every operation that modifies data returns a new object backed by independent memory, making mutation behavior predictable and testable.
PyArrow Backend
Columns backed by PyArrow arrays instead of NumPy. Strings store as dictionary-encoded Arrow arrays — 3–8× less memory than Python object arrays. Native null support without sentinel NaN values.
Arrow I/O
read_parquet with PyArrow engine returns Arrow-backed columns with zero-copy deserialization. No intermediate conversion through NumPy — data lands directly in Arrow memory layout.
Installation and Environment Setup
Pandas 2.x requires Python 3.9 or later. The PyArrow backend requires pyarrow>=10.0 installed in the same environment. Installing them together via pip resolves the correct compatible versions automatically.
# Install pandas 2.x with PyArrow and optional performance extras
pip install "pandas>=2.2" pyarrow
# Verify versions
python -c "import pandas as pd; import pyarrow as pa; print(pd.__version__, pa.__version__)"
# 2.2.3 17.0.0
# Optional: numexpr for accelerated eval() and query()
pip install numexpr bottleneck
# Full data engineering stack
pip install "pandas>=2.2" pyarrow pyarrow-hotfix numexpr bottleneck fastparquetimport pandas as pd
import pyarrow as pa
import numpy as np
# Check Copy-on-Write status — enabled by default from pandas 2.0+
pd.options.mode.copy_on_write # True in pandas 2.2
# Explicitly enable for pandas 2.0/2.1 if you pinned an older version
pd.options.mode.copy_on_write = True
# Confirm default dtype inference settings
pd.options.future.infer_string # True — uses StringDtype by default in 3.xNote
SettingWithCopyWarning in pandas 1.x code, migrating to CoW semantics is the fix — not suppressing the warning. CoW guarantees that df2 = df[mask] always returns an independent object with no shared state.Copy-on-Write — What It Means in Practice
Before CoW, pandas chained indexing like df[mask]['col'] = value might or might not modify the original DataFrame depending on whether the first indexing operation returned a view or a copy — a detail that depended on the dtype, the shape of the selection, and whether pandas internally optimized the access. CoW makes this consistent: any in-place modification on a derived object writes to a new copy, never the parent.
import pandas as pd
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carol", "Dave"],
"score": [95, 87, 92, 78],
"grade": ["A", "B", "A", "C"],
})
# Old behavior (pre-CoW): this might modify df, might not
# df[df.grade == "A"]["score"] = 100 # SettingWithCopyWarning — unreliable
# CoW behavior: slice is always independent
df_a_students = df[df.grade == "A"].copy() # explicit copy for mutation
df_a_students["score"] = 100
print(df["score"].tolist()) # [95, 87, 92, 78] — unchanged
# Preferred CoW pattern: use .loc for in-place modification on the original
df.loc[df.grade == "A", "score"] = 100
print(df["score"].tolist()) # [100, 87, 100, 78] — correct
# assign() always returns a new DataFrame — idiomatic CoW pattern
df_modified = df.assign(score_normalized=lambda x: x.score / x.score.max())
print(df.columns.tolist()) # original unchanged
print(df_modified.columns.tolist()) # includes score_normalized# CoW and method chaining — the idiomatic pandas 2.x style
import pandas as pd
result = (
pd.read_csv("sales.csv")
.rename(columns={"sale_amt": "revenue"})
.query("revenue > 0")
.assign(
revenue_k=lambda x: x.revenue / 1000,
month=lambda x: pd.to_datetime(x.date).dt.to_period("M"),
)
.groupby("month", as_index=False)
.agg(
total_revenue=("revenue_k", "sum"),
order_count=("order_id", "nunique"),
)
.sort_values("month")
)
# Each method returns a new DataFrame — CoW ensures no implicit shared memory
# No copy overhead unless data is actually written — CoW defers copies lazilyPyArrow Backend — dtype_backend and ArrowDtype
The PyArrow backend replaces NumPy arrays with Apache Arrow arrays as the underlying column storage. The practical impact is most visible for string columns: NumPy stores Python strings as object arrays — each element is a Python heap object with 56 bytes of overhead per string, plus the string data itself. PyArrow stores strings as dictionary-encoded or contiguous byte arrays with 4–8 bytes of overhead per entry. DuckDB queries Parquet files directly in Arrow columnar format — pandas PyArrow-backed DataFrames can receive query results from DuckDB without any format conversion using the fetchdf() method.
import pandas as pd
import pyarrow as pa
import sys
# Demonstrate memory difference: NumPy object vs PyArrow string
n = 1_000_000
categories = ["orders", "returns", "exchanges", "pending", "fulfilled"]
# NumPy object array (default pre-pandas 2.0)
df_numpy = pd.DataFrame({
"status": pd.array(
[categories[i % 5] for i in range(n)],
dtype=object,
)
})
# PyArrow string array (dictionary-encoded)
df_arrow = pd.DataFrame({
"status": pd.array(
[categories[i % 5] for i in range(n)],
dtype=pd.ArrowDtype(pa.dictionary(pa.int8(), pa.string())),
)
})
print(f"NumPy object: {df_numpy.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"PyArrow dict: {df_arrow.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# NumPy object: 56.0 MB
# PyArrow dict: 4.1 MB — ~14x less memoryimport pandas as pd
import pyarrow as pa
# Three ways to get PyArrow-backed DataFrames
# 1. dtype_backend parameter at read time — recommended for new pipelines
df = pd.read_parquet("events.parquet", dtype_backend="pyarrow")
df = pd.read_csv("events.csv", dtype_backend="pyarrow")
df = pd.read_json("events.json", dtype_backend="pyarrow")
# 2. convert_dtypes() on an existing DataFrame
df_numpy = pd.read_csv("events.csv")
df_arrow = df_numpy.convert_dtypes(dtype_backend="pyarrow")
# 3. Explicit ArrowDtype column-by-column
df_explicit = pd.DataFrame({
"id": pd.array([1, 2, 3], dtype=pd.ArrowDtype(pa.int64())),
"name": pd.array(["Alice", "Bob", None], dtype=pd.ArrowDtype(pa.string())),
"score": pd.array([9.5, 8.2, None], dtype=pd.ArrowDtype(pa.float64())),
"active": pd.array([True, False, None], dtype=pd.ArrowDtype(pa.bool_())),
"timestamp": pd.array(
["2026-01-01", "2026-01-02", None],
dtype=pd.ArrowDtype(pa.timestamp("us", tz="UTC")),
),
})
print(df_explicit.dtypes)
# id int64[pyarrow]
# name string[pyarrow]
# score double[pyarrow]
# active bool[pyarrow]
# timestamp timestamp[us, tz=UTC][pyarrow]Note
None / pd.NA natively without coercing integers to float. NumPy's int64 cannot represent missing values, so pandas 1.x silently upcast nullable integer columns to float64. With pd.ArrowDtype(pa.int64()), an integer column stays integer even when it contains None — no silent precision loss.Parquet I/O — Zero-Copy Reads and Predicate Pushdown
Parquet is the standard storage format for pandas workloads in data lakes. Pandas 2.x with PyArrow engine reads Parquet files into Arrow-backed DataFrames without any intermediate NumPy conversion. The filters parameter pushes predicates into the Parquet reader, skipping row groups that don't match before they're even decompressed. When data exceeds single-machine memory, Apache Spark applies the same Parquet predicate pushdown and column pruning at cluster scale — the optimization principles carry over directly from single-node pandas to distributed Spark DataFrames.
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
# Basic read — Arrow-backed columns, no intermediate conversion
df = pd.read_parquet(
"s3://my-bucket/events/year=2026/month=06/",
engine="pyarrow",
dtype_backend="pyarrow",
)
# Column projection — only read columns you need (avoids decompressing unused columns)
df = pd.read_parquet(
"events.parquet",
columns=["user_id", "event_type", "revenue", "timestamp"],
dtype_backend="pyarrow",
)
# Predicate pushdown — skip row groups that don't match before reading
df = pd.read_parquet(
"events.parquet",
filters=[
("event_type", "in", ["purchase", "refund"]),
("revenue", ">", 0),
("timestamp", ">=", "2026-01-01"),
],
dtype_backend="pyarrow",
)
# Writing Parquet — compression and row group size
df.to_parquet(
"output.parquet",
engine="pyarrow",
compression="zstd", # zstd is faster than gzip, better ratio than snappy
compression_level=3,
row_group_size=100_000, # row groups for partition pruning
index=False,
)
print(f"Read {len(df):,} rows, {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
# Reading partitioned Parquet datasets (Hive-style partitioning)
# /data/events/
# year=2026/month=06/day=01/part-0.parquet
# year=2026/month=06/day=02/part-0.parquet
# ...
df = pd.read_parquet(
"data/events/",
filters=[
("year", "=", 2026),
("month", "=", 6),
],
dtype_backend="pyarrow",
)
# Using PyArrow directly for advanced read patterns
dataset = pq.ParquetDataset(
"data/events/",
filters=[("month", "=", 6)],
use_legacy_dataset=False,
)
table = dataset.read(columns=["user_id", "event_type", "revenue"])
df = table.to_pandas(
types_mapper=pd.ArrowDtype, # preserve Arrow dtypes in the DataFrame
zero_copy_only=False,
)
# Chunked reading for files larger than RAM
parquet_file = pq.ParquetFile("large_events.parquet")
chunks = []
for batch in parquet_file.iter_batches(batch_size=500_000):
chunk = batch.to_pandas(types_mapper=pd.ArrowDtype)
# process or filter chunk before accumulating
chunk = chunk[chunk["revenue"] > 100]
chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)Memory Optimization — Downcasting and Categorical Types
Even with PyArrow backends, large DataFrames can exceed available RAM. Systematic downcasting of numeric columns and converting repeated string values to categorical types are the two highest-leverage memory reduction techniques before reaching for chunked processing or distributed computing.
import pandas as pd
import numpy as np
def optimize_dataframe_memory(df: pd.DataFrame) -> pd.DataFrame:
"""
Downcast numeric columns and convert low-cardinality strings to categorical.
Returns a new DataFrame with optimized dtypes.
"""
original_mem = df.memory_usage(deep=True).sum()
result = df.copy()
for col in result.columns:
col_dtype = result[col].dtype
# Downcast integers
if col_dtype in [np.int64, np.int32, np.int16]:
col_min = result[col].min()
col_max = result[col].max()
if col_min >= np.iinfo(np.int8).min and col_max <= np.iinfo(np.int8).max:
result[col] = result[col].astype(np.int8)
elif col_min >= np.iinfo(np.int16).min and col_max <= np.iinfo(np.int16).max:
result[col] = result[col].astype(np.int16)
elif col_min >= np.iinfo(np.int32).min and col_max <= np.iinfo(np.int32).max:
result[col] = result[col].astype(np.int32)
# Downcast floats
elif col_dtype == np.float64:
result[col] = pd.to_numeric(result[col], downcast="float")
# Convert low-cardinality object/string columns to Categorical
elif col_dtype == object or str(col_dtype) == "string":
n_unique = result[col].nunique()
n_total = len(result[col])
if n_unique / n_total < 0.05: # fewer than 5% unique values
result[col] = result[col].astype("category")
optimized_mem = result.memory_usage(deep=True).sum()
reduction = (1 - optimized_mem / original_mem) * 100
print(f"Memory: {original_mem / 1e6:.1f} MB → {optimized_mem / 1e6:.1f} MB ({reduction:.1f}% reduction)")
return result
df = pd.read_csv("orders.csv")
df_opt = optimize_dataframe_memory(df)
# Memory: 480.2 MB → 112.6 MB (76.5% reduction)import pandas as pd
# Categorical dtype — ideal for columns with repeated string values
# status, country, product_category, etc.
df = pd.read_csv("orders.csv")
# Manual categorical conversion with explicit ordered categories
df["status"] = pd.Categorical(
df["status"],
categories=["pending", "processing", "shipped", "delivered", "cancelled"],
ordered=True,
)
df["country"] = df["country"].astype("category")
# Ordered categoricals support comparisons
high_value_shipped = df[df.status >= "shipped"]
# Categorical + groupby — fast because groupby operates on integer codes
agg = df.groupby("country", observed=True).agg(
total_revenue=("revenue", "sum"),
order_count=("order_id", "count"),
)
# Categorical read from CSV
df = pd.read_csv(
"orders.csv",
dtype={
"status": "category",
"country": "category",
"product_category": "category",
},
)
# Check memory usage per column
print(
df.memory_usage(deep=True)
.sort_values(ascending=False)
.apply(lambda x: f"{x / 1e6:.2f} MB")
)Nullable Integer and Float Types
Pandas introduced nullable integer types (Int8, Int16, Int32, Int64) to solve the problem of integer columns silently becoming float when they contain nulls. The pandas nullable types use a separate boolean mask array alongside the NumPy integer array, while the PyArrow-backed types use Arrow's native validity bitmap — the Arrow approach is more memory efficient and integrates better with Parquet.
import pandas as pd
import numpy as np
# Problem: NumPy int64 cannot represent NA — silently upcasts to float64
s_numpy = pd.Series([1, 2, None, 4], dtype="float64") # forced to float
print(s_numpy.dtype) # float64
print(s_numpy[2]) # nan — not NA, causing .isna() inconsistencies
# Solution A: pandas nullable integer (NumPy-backed with mask)
s_nullable = pd.Series([1, 2, None, 4], dtype="Int64")
print(s_nullable.dtype) # Int64
print(s_nullable[2]) # <NA>
print(s_nullable.isna()) # False False True False
# Solution B: PyArrow int64 (Arrow-backed, preferred for I/O-heavy workloads)
import pyarrow as pa
s_arrow = pd.Series([1, 2, None, 4], dtype=pd.ArrowDtype(pa.int64()))
print(s_arrow.dtype) # int64[pyarrow]
print(s_arrow[2]) # <NA>
# Arithmetic preserves nullability — no silent NaN propagation
print(s_nullable + 10) # 11 12 <NA> 14
print(s_arrow + 10) # 11 12 <NA> 14
# convert_dtypes() — upgrade all columns to best nullable types automatically
df = pd.read_csv("data.csv")
df = df.convert_dtypes() # uses pandas nullable types
df = df.convert_dtypes(dtype_backend="pyarrow") # uses Arrow typeseval() and query() with numexpr
DataFrame.eval() and DataFrame.query() use numexpr under the hood when it is installed. Numexpr compiles arithmetic expressions to native code that operates on multiple rows simultaneously, avoiding the Python interpreter overhead for each element and eliminating the intermediate NumPy arrays that standard column arithmetic allocates.
import pandas as pd
import numpy as np
n = 5_000_000
df = pd.DataFrame({
"revenue": np.random.uniform(0, 10000, n),
"cost": np.random.uniform(0, 5000, n),
"tax_rate": np.random.uniform(0.05, 0.25, n),
"discount": np.random.uniform(0, 0.3, n),
"is_premium": np.random.choice([True, False], n),
})
# Standard pandas — allocates 3 intermediate arrays
profit_standard = (df.revenue - df.cost) * (1 - df.tax_rate) * (1 - df.discount)
# eval() with numexpr — single pass, no intermediate allocations
# ~2-4x faster on large DataFrames with numexpr installed
df.eval(
"profit = (revenue - cost) * (1 - tax_rate) * (1 - discount)",
inplace=True,
)
# query() for row filtering — more readable than boolean indexing
high_margin = df.query(
"profit > 500 and is_premium == True and revenue > 2000"
)
# query() with external variables using @ prefix
threshold = 500
region_list = ["EMEA", "APAC"]
filtered = df.query("profit > @threshold")
# eval() with local variables
discount_cap = 0.15
df.eval(
"capped_discount = discount.clip(upper=@discount_cap)",
local_dict={"discount_cap": discount_cap},
inplace=True,
)Chunked Processing for Files Larger Than RAM
When a CSV or JSON file is too large to read into memory at once, pandas provides chunksize parameter for iteration. The pattern is to process each chunk independently — applying filters, aggregations, or transformations — and accumulate results before a final merge. Keep chunk processing functions pure (no side effects) so they compose cleanly in the pipeline.
import pandas as pd
from pathlib import Path
def process_large_csv(
path: str,
chunk_size: int = 500_000,
date_col: str = "timestamp",
min_revenue: float = 0.0,
) -> pd.DataFrame:
"""
Stream-process a large CSV, filter rows, aggregate per month.
Returns a small summary DataFrame that fits in memory.
"""
monthly_totals = {}
dtype_map = {
"user_id": "int32",
"revenue": "float32",
"event_type": "category",
"country": "category",
}
reader = pd.read_csv(
path,
chunksize=chunk_size,
dtype=dtype_map,
parse_dates=[date_col],
engine="c", # C engine is ~3x faster than Python engine
)
for i, chunk in enumerate(reader):
# Filter early to minimize memory pressure
chunk = chunk[chunk["revenue"] > min_revenue]
chunk = chunk[chunk["event_type"] == "purchase"]
# Extract month key
chunk["month"] = chunk[date_col].dt.to_period("M")
# Aggregate this chunk
agg = chunk.groupby("month").agg(
revenue=("revenue", "sum"),
orders=("user_id", "count"),
)
# Merge into running totals
for month, row in agg.iterrows():
if month not in monthly_totals:
monthly_totals[month] = {"revenue": 0.0, "orders": 0}
monthly_totals[month]["revenue"] += row["revenue"]
monthly_totals[month]["orders"] += row["orders"]
print(f"Chunk {i + 1}: processed {len(chunk):,} filtered rows")
result = pd.DataFrame.from_dict(monthly_totals, orient="index")
result.index.name = "month"
return result.sort_index()
summary = process_large_csv("events_2026.csv", chunk_size=500_000, min_revenue=10.0)
print(summary.head())String Operations — PyArrow vs NumPy Object Performance
String operations on object dtype columns loop through Python objects one at a time — each .str.lower() call dispatches to a Python method per row. PyArrow string arrays execute the same operations in C++, processing entire buffers at once. For datasets with millions of string rows this is a 5–15× speedup on typical string operations.
import pandas as pd
import pyarrow as pa
n = 2_000_000
emails = [f"user_{i}@example.com" for i in range(n)]
# Object dtype — each str operation loops through Python objects
s_obj = pd.Series(emails, dtype=object)
# PyArrow string dtype — C++ string kernels
s_arrow = pd.Series(emails, dtype=pd.ArrowDtype(pa.string()))
# Common string operations — same API, different performance profile
# Lower-case conversion
lower_obj = s_obj.str.lower()
lower_arrow = s_arrow.str.lower()
# Extract domain from email
domain_obj = s_obj.str.extract(r"@(.+)$")[0]
domain_arrow = s_arrow.str.extract(r"@(.+)$")[0]
# Contains check
has_example_obj = s_obj.str.contains("example", regex=False)
has_example_arrow = s_arrow.str.contains("example", regex=False)
# Split and expand — common ETL operation
parts_obj = s_obj.str.split("@", expand=True)
parts_arrow = s_arrow.str.split("@", expand=True)
# Replace — URL normalization, PII scrubbing
cleaned_obj = s_obj.str.replace(r"d+", "X", regex=True)
cleaned_arrow = s_arrow.str.replace(r"d+", "X", regex=True)
# Padding and stripping — data cleaning
padded = s_arrow.str.strip().str.ljust(30)
print(f"Object dtype: {s_obj.memory_usage(deep=True) / 1e6:.1f} MB")
print(f"Arrow string: {s_arrow.memory_usage(deep=True) / 1e6:.1f} MB")DuckDB Integration — SQL on DataFrames
DuckDB can query pandas DataFrames directly via its Python API, executing SQL against in-memory data without any copy when the DataFrame uses Arrow-backed storage. The combination of pandas for data loading and shaping with DuckDB for complex SQL aggregations covers most analytical patterns more efficiently than pure pandas for aggregation-heavy workloads.
import duckdb
import pandas as pd
import pyarrow as pa
# Load data with PyArrow backend
orders = pd.read_parquet("orders.parquet", dtype_backend="pyarrow")
customers = pd.read_parquet("customers.parquet", dtype_backend="pyarrow")
# DuckDB can query PyArrow-backed DataFrames without copy
con = duckdb.connect()
# Register DataFrames as DuckDB views
con.register("orders", orders)
con.register("customers", customers)
# Run SQL — aggregations, window functions, CTEs
result_df = con.execute("""
WITH monthly_stats AS (
SELECT
c.country,
DATE_TRUNC('month', o.created_at) AS month,
COUNT(*) AS order_count,
SUM(o.revenue) AS total_revenue,
AVG(o.revenue) AS avg_order_value,
SUM(o.revenue) / NULLIF(COUNT(DISTINCT c.customer_id), 0) AS revenue_per_customer
FROM orders o
JOIN customers c USING (customer_id)
WHERE o.status = 'completed'
AND o.created_at >= '2026-01-01'
GROUP BY 1, 2
)
SELECT
*,
SUM(total_revenue) OVER (PARTITION BY country ORDER BY month) AS cumulative_revenue
FROM monthly_stats
ORDER BY country, month
""").df() # returns a pandas DataFrame
# fetchdf() returns arrow-backed pandas; fetch_arrow_table() returns raw Arrow table
arrow_table = con.execute("SELECT * FROM orders LIMIT 1000").fetch_arrow_table()
df_from_arrow = arrow_table.to_pandas(types_mapper=pd.ArrowDtype)
print(f"Result shape: {result_df.shape}")Benchmarking — Measuring Real Gains
The correct way to measure pandas performance is with timeit or %timeit for microbenchmarks, and memory_profiler or tracemalloc for memory. Always benchmark on your actual data shapes — string cardinality, null density, and column count all affect which backend wins.
import pandas as pd
import pyarrow as pa
import numpy as np
import tracemalloc
import time
from contextlib import contextmanager
@contextmanager
def measure(label: str):
tracemalloc.start()
t0 = time.perf_counter()
yield
elapsed = time.perf_counter() - t0
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"{label:<40} {elapsed * 1000:7.1f} ms peak {peak / 1e6:6.1f} MB")
n = 3_000_000
np.random.seed(42)
countries = ["US", "DE", "GB", "FR", "JP", "CA", "AU", "BR", "IN", "MX"]
data = {
"user_id": np.random.randint(1, 1_000_000, n),
"revenue": np.random.exponential(100, n),
"country": [countries[i % len(countries)] for i in range(n)],
"status": ["active" if i % 3 else "churned" for i in range(n)],
}
# NumPy / object backend
df_numpy = pd.DataFrame(data)
# PyArrow backend
df_arrow = pd.DataFrame({
"user_id": pd.array(data["user_id"], dtype=pd.ArrowDtype(pa.int32())),
"revenue": pd.array(data["revenue"], dtype=pd.ArrowDtype(pa.float32())),
"country": pd.array(data["country"], dtype=pd.ArrowDtype(pa.dictionary(pa.int8(), pa.string()))),
"status": pd.array(data["status"], dtype=pd.ArrowDtype(pa.dictionary(pa.int8(), pa.string()))),
})
print(f"Memory — NumPy: {df_numpy.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"Memory — Arrow: {df_arrow.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print()
with measure("groupby revenue sum (NumPy)"):
_ = df_numpy.groupby("country")["revenue"].sum()
with measure("groupby revenue sum (Arrow)"):
_ = df_arrow.groupby("country")["revenue"].sum()
with measure("string filter (NumPy object)"):
_ = df_numpy[df_numpy["status"] == "active"]
with measure("string filter (Arrow dict)"):
_ = df_arrow[df_arrow["status"] == "active"]Production Checklist
Enable Copy-on-Write at application startup with pd.options.mode.copy_on_write = True if running pandas 2.0 or 2.1. Pandas 2.2 enables it by default and pandas 3.0 will remove the toggle entirely. Audit all existing code for chained indexing patterns (df[mask]['col'] = value) and replace them with df.loc[mask, 'col'] = value before enabling CoW to avoid silent no-ops.
Use dtype_backend='pyarrow' for all read_parquet, read_csv, and read_json calls in new pipelines. The PyArrow backend eliminates float upcasting of nullable integers, reduces string column memory by 5–15×, and enables zero-copy data exchange with DuckDB, Apache Arrow Flight, and other Arrow-native consumers without intermediate conversion.
Apply convert_dtypes(dtype_backend='pyarrow') to DataFrames loaded from legacy code paths that predate PyArrow support. This is a one-line migration that upgrades all columns to their best Arrow-backed nullable type without requiring a full rewrite of the loading logic.
Use column projection when reading Parquet files — always pass the columns parameter listing only the columns your pipeline actually uses. Parquet stores each column independently; reading 4 of 40 columns takes roughly 10% of the I/O and decompression time of reading all 40 columns.
Apply predicate pushdown via the filters parameter in read_parquet for date-range and equality filters. Parquet row group statistics (min/max per row group) allow the reader to skip entire row groups before decompressing, reducing read volume by 50–90% for time-series data with date filters.
Convert low-cardinality string columns (fewer than 5% unique values) to pd.ArrowDtype(pa.dictionary(...)) or 'category' dtype after loading. Status, country, product_category, and event_type columns typically have 5–200 unique values across millions of rows — dictionary encoding stores them as integer codes rather than repeated string data.
Replace object dtype string columns in hot code paths — groupby, merge, and filter operations — with ArrowDtype string columns. String equality comparisons on Arrow dictionary-encoded columns compare integer codes, not string data; groupby on Arrow dict columns operates on integer keys, not Python string objects.
Install numexpr (pip install numexpr) and use DataFrame.eval() for multi-column arithmetic expressions. Numexpr eliminates intermediate NumPy array allocations and processes expressions in SIMD-vectorized chunks, reducing both peak memory usage and wall-clock time by 2–4× on expressions that combine 3+ columns.
Profile memory before and after dtype optimization using df.memory_usage(deep=True).sum() and tracemalloc for peak allocation. The deep=True flag traverses Python object references in object dtype columns — without it, string column memory is undercounted by 10–50× and you cannot measure optimization impact accurately.
Benchmark groupby aggregations on your actual data cardinality before choosing between pandas categorical, ArrowDtype dictionary, and plain string. High-cardinality columns (user IDs, session IDs, URLs) benefit less from dictionary encoding than low-cardinality columns; for high-cardinality groupby, a direct DuckDB SQL aggregation on a PyArrow-backed DataFrame often outperforms native pandas by 3–5×.
Running pandas pipelines that OOM on multi-GB CSVs, spending more time on SettingWithCopyWarning debugging than on actual data work, or seeing float64 integer columns because nullable types were not configured correctly?
We modernize pandas data pipelines for production — from Copy-on-Write migration audits that replace chained indexing patterns with .loc and .assign(), through PyArrow backend adoption with dtype_backend='pyarrow' across all I/O entry points, Parquet column projection and predicate pushdown configuration for data lake reads, nullable integer type migration eliminating silent float upcasting, categorical downcasting utilities for low-cardinality string columns, chunked processing pipelines for files larger than available RAM, numexpr-accelerated eval() rewrites of multi-column arithmetic hotpaths, DuckDB SQL integration for aggregation-heavy analytical queries on Arrow-backed DataFrames, and memory profiling with tracemalloc to quantify optimization impact. Let’s talk.
Let's Talk