⚡ Performance Optimization Deep Dive

How the backtesting system went from hours to minutes — 100–1000x speedups

The Top 3 Performance Optimizations

These optimizations transformed the backtesting system from taking hours to minutes, representing 100-1000x speedups in critical operations.

1. Multi-Index Lookup: O(n) → O(1)

The Problem:

Every time you need to find options for a specific entry date and expiration date, the original code scanned through potentially millions of rows, checking whether each row matched both dates.

Before (v2):

mask = (date_arr == entry_day) & (exdate_arr == expiry_day)
local_options = options.loc[mask]

This creates a boolean mask by checking every single row - if you have 5 million options contracts, you're doing 5 million comparisons every lookup.

After (v3):

lookup_key = (change_to_categorical(entry_day), change_to_categorical(expiry_day))
if lookup_key in options_indexed.index:
    local_options = options_indexed.loc[lookup_key].reset_index()

What Changed:

You pre-built a multi-index on ['date', 'exdate']. Pandas now maintains a hash table internally, so when you look up (entry_day, expiry_day), it's like using a Python dictionary - near-instant access.

Why It's Huge:
  • In backtesting, you're doing this lookup potentially thousands of times
  • Linear scan: 1000 lookups × 5M rows = 5 billion comparisons
  • Index lookup: 1000 lookups × O(1) = 1000 hash lookups
  • This is often 100-1000x faster in practice

The Catch:

You need to build the index once upfront. But if you're doing repeated lookups (which you are in backtesting), it pays for itself immediately.
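A minimal, self-contained sketch of the pattern on toy data (column names follow the article's 'date'/'exdate' convention; the article's change_to_categorical helper is omitted and plain Timestamps are used instead):

```python
import pandas as pd

# Toy options data using the article's 'date'/'exdate' column convention.
options = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03"]),
    "exdate": pd.to_datetime(["2024-01-19", "2024-02-16", "2024-01-19"]),
    "strike": [100.0, 105.0, 110.0],
})

# Build the multi-index once upfront; sorting keeps repeated lookups fast.
options_indexed = options.set_index(["date", "exdate"]).sort_index()

# Keyed lookup instead of a full boolean scan of every row.
key = (pd.Timestamp("2024-01-02"), pd.Timestamp("2024-01-19"))
local_options = pd.DataFrame()
if key in options_indexed.index:
    local_options = options_indexed.loc[[key]].reset_index()
```

Passing .loc a list of one tuple (.loc[[key]]) keeps the result a DataFrame even when only a single row matches.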

2. Merge vs isin(): O(n×m) → O(n+m)

The Problem:

You have a large options dataset (n rows) and need to filter it to only the symbols you're trading (m symbols). In the worst case, .isin() degrades toward quadratic behavior.

Before (v3_2):

filtered_options = _options[_options['symbol'].isin(all_syms)]

What .isin() does internally:

For each of the n rows in _options, it checks if that symbol exists in all_syms (m items). Worst case: n × m comparisons.

Example:
  • 5M options rows
  • 500 unique symbols you trade
  • 5M × 500 = 2.5 billion comparisons

After (v3_3):

all_syms_df = pd.DataFrame({'symbol': all_syms})
filtered_options = _options.merge(all_syms_df, on='symbol', how='inner')

What .merge() does internally:

Pandas uses a hash join algorithm:

  1. Build hash table from smaller dataset (all_syms_df) - O(m)
  2. Probe hash table for each row in larger dataset (_options) - O(n)
  3. Total: O(n + m)
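The two phases can be sketched in plain Python (illustrative only - pandas performs the equivalent in C, and the data here is made up):

```python
# Illustrative pure-Python hash join over the symbol column.
options_rows = [("AAPL", 1), ("MSFT", 2), ("TSLA", 3), ("AAPL", 4)]
all_syms = ["AAPL", "TSLA"]

# Phase 1: build a hash table from the smaller side - O(m).
sym_set = set(all_syms)

# Phase 2: probe it once per row of the larger side - O(n).
filtered = [row for row in options_rows if row[0] in sym_set]
# filtered == [("AAPL", 1), ("TSLA", 3), ("AAPL", 4)]
```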

Same example:
  • 5M + 500 = ~5M operations (basically just scanning options once)

Why It's Huge:

The gap widens dramatically as your data grows:

  • Small data (10K rows, 50 symbols): maybe 2x faster
  • Medium data (1M rows, 200 symbols): 10-20x faster
  • Large data (5M rows, 500 symbols): 50x faster

Real-world impact:

If filtering was taking 30 seconds with .isin(), it now takes 0.5 seconds with .merge().
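A quick sanity check on toy data (hypothetical 'mid' price column) that both approaches select the same rows, so the swap is purely a performance change:

```python
import pandas as pd

# Toy dataset; 'symbol' matches the article, 'mid' is a hypothetical price column.
_options = pd.DataFrame({
    "symbol": ["AAPL", "MSFT", "TSLA", "AAPL"],
    "mid":    [1.0, 2.0, 3.0, 4.0],
})
all_syms = ["AAPL", "TSLA"]

# Before: per-row membership filter.
via_isin = _options[_options["symbol"].isin(all_syms)]

# After: hash join against a one-column frame of the symbols.
all_syms_df = pd.DataFrame({"symbol": all_syms})
via_merge = _options.merge(all_syms_df, on="symbol", how="inner")
```

Note that merge resets the row index, so compare the two results by values rather than by index.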

3. Vectorization: Python Loop → Pandas Batch Operation

The Problem:

You need to set a flag to 0 for specific dates. Python loops are notoriously slow because of interpreter overhead.

Before (v1):

business_days = pd.date_range(start=start_date, end=end_date, freq=BDay())

for day in business_days:
    if day in opt['cond_exit']:
        dict_options[o].at[day, 'nopnl'] = 0

What's expensive here:

  • Python for loop: interpreter overhead for each iteration
  • .at[day, 'nopnl']: individual cell access, no optimization
  • if day in opt['cond_exit']: checking membership in each iteration

If you have 250 business days, that's 250 Python-level operations, each with full overhead.

After (v2):

business_days = pd.date_range(start=start_date, end=end_date, freq=BDay())
cond_exit_days = opt['cond_exit'].index

update_days = business_days.intersection(cond_exit_days)

dict_options[o].loc[update_days, 'nopnl'] = 0

What changed:

  1. .intersection(): Set operation done in C code (pandas/numpy), not Python
  2. .loc[update_days, 'nopnl'] = 0: Single vectorized assignment
    • Pandas identifies all rows matching update_days in one pass
    • Updates entire column slice in optimized C code
    • Memory is written contiguously, cache-friendly

Why It's Huge:
  • No Python interpreter overhead per iteration
  • Single function call instead of 250+ function calls
  • Pandas can optimize memory access patterns
  • Typically 5-10x faster, sometimes more

The principle:

Whenever you see a loop modifying a DataFrame, ask: "Can I identify all the rows I need to modify, then update them all at once?" That's vectorization.
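A runnable sketch of that principle, with hypothetical frame and column names mirroring the snippets above:

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# Hypothetical frame indexed by business day with an 'nopnl' flag column.
days = pd.date_range("2024-01-01", "2024-01-31", freq=BDay())
df = pd.DataFrame({"nopnl": 1}, index=days)

# Dates on which to clear the flag (stand-in for opt['cond_exit'].index).
cond_exit_days = pd.to_datetime(["2024-01-05", "2024-01-10", "2024-06-01"])

# Identify all rows to modify in one C-level set operation,
# then update them all at once - no Python loop.
update_days = df.index.intersection(cond_exit_days)
df.loc[update_days, "nopnl"] = 0
# 2024-06-01 falls outside the range, so only two flags are cleared.
```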