The data boundary: why the agent never sees raw prices

When you let an AI agent drive a research platform through QuanterLab's MCP server, an obvious question arrives quickly: what does the agent actually get to read? The honest answer is "less than you might expect, on purpose." Ask the agent how a backtest did and it learns the result was in a promising tier with a moderate drawdown band over many trades. It does not learn the equity curve, the dollar P&L, the per-name weights, or the closing prices of anything. That is not an oversight. It is the boundary doing its job.

There are two reasons the boundary exists, and they are easy to conflate. One is a licensing obligation we take seriously. The other is a quieter research benefit that turns out to point in the same direction. This article walks through both, and then through the mechanism — because the most reassuring thing about a boundary is being able to see exactly where it is drawn.

The core idea

The MCP server returns categorical results — tiers, bands, deciles, regime labels, counts — and never raw prices, dollar values, per-name fundamentals, or equity curves. Everything an agent reads has been coarsened to the level of structure and relative outcome, not specific numbers.

Reason one: we don't get to re-export someone else's data

QuanterLab consumes market data under enterprise licenses (FMP for fundamentals and prices, with FRED- and CBOE-relayed series for macro and volatility indicators). Those licenses let us use the data inside the platform. They do not let us turn around and redistribute the raw feed through a second pipe — and an MCP server that streamed back daily OHLCV or per-quarter fundamental levels would be exactly that second pipe. So the egress layer strips it.

The subtlety, which the platform learned the hard way through several rounds of internal red-teaming, is that you cannot close this leak by blocking obviously-named fields. A field called prices is easy to drop. The problem is reconstruction across innocent-looking derived fields. A per-trade share count plus a per-trade P&L plus an absolute-dollar equity curve lets you back out the price path during every holding period. A price-to-earnings ratio multiplied by the published EPS in its denominator reconstructs the closing price to the cent. The leak surface is closed under arithmetic, which means a blacklist can never be complete. That realization is what shaped the whole design.
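To make the reconstruction problem concrete, here is a toy sketch with made-up numbers. The field names (shares, pnl_usd, equity_usd) are hypothetical and not actual QuanterLab fields, and the example assumes a single open position so that every equity change comes from that one holding.

```python
# Hypothetical payload: no field is called "price", yet prices fall out of it.
trade = {
    "shares": 200,        # per-trade share count
    "pnl_usd": 1_500.0,   # per-trade dollar P&L
    # Absolute-dollar equity on each day of the holding period
    # (assumption: this is the only open position).
    "equity_usd": [100_000.0, 100_400.0, 99_900.0, 101_500.0],
}


def price_changes(trade):
    """Recover day-by-day price moves from equity deltas and the share count."""
    equity = trade["equity_usd"]
    return [(later - earlier) / trade["shares"]
            for earlier, later in zip(equity, equity[1:])]


def price_from_ratio(pe_ratio, eps):
    """A P/E ratio times the EPS it was computed from is the price itself."""
    return pe_ratio * eps


moves = price_changes(trade)                      # daily price moves in dollars
total_move = trade["pnl_usd"] / trade["shares"]   # exit price minus entry price
price = price_from_ratio(20.0, 5.0)               # price recovered from two "derived" fields
```

None of the inputs looks like a price feed on its own. Once the moves are known, a single level anchor (for example a P/E ratio on one date) is enough to pin the whole path, which is why dropping fields by name does not close the leak.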

Reason two: it is harder to fool yourself with a band than with a number

Here is the part that is less about compliance and more about research honesty. There is a long, uncomfortable literature on how easily a specific number invites a specific story. When you can see that a strategy returned 23.4% with a Sharpe of 1.41 over a particular window, the temptation is to reason backward — to nudge a parameter, re-run, and chase the number up by a tenth. That is the mechanism behind backtest overfitting, and Bailey, Borwein, López de Prado and Zhu (2014) showed just how mechanically a researcher trying enough configurations will eventually surface a beautiful in-sample result that means nothing. Harvey and Liu (2015) made the same point at the level of a whole field: run enough tests and an impressive t-statistic is the expected outcome of noise, not evidence.
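The multiple-testing effect is easy to reproduce. The following is a minimal simulation sketch, not the procedure from either paper: every "strategy" is pure Gaussian noise with zero true edge, and the daily volatility, window length and seed are arbitrary assumptions.

```python
import random
import statistics

TRADING_DAYS = 252


def noise_sharpe(rng, days=TRADING_DAYS):
    """Annualized Sharpe ratio of pure-noise daily returns (true edge is zero)."""
    returns = [rng.gauss(0.0, 0.01) for _ in range(days)]
    return statistics.mean(returns) / statistics.stdev(returns) * TRADING_DAYS ** 0.5


def best_of(n_configs, seed=0):
    """Best in-sample Sharpe found after trying n_configs skill-free strategies."""
    rng = random.Random(seed)
    return max(noise_sharpe(rng) for _ in range(n_configs))


for n in (1, 10, 100, 1000):
    print(n, round(best_of(n), 2))
```

Because the seed is fixed, each larger run contains the smaller runs' strategies, so the best Sharpe can only rise as more configurations are tried. Nothing in the loop has any skill; the attractive number at the end is purely a product of how many times we looked.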

An agent that can only see promising / mixed / weak is, almost by accident, insulated from the worst of this. It cannot curve-fit to the third decimal of a Sharpe ratio it never received. It is pushed to reason about the things that are more likely to survive out of sample: which regime a strategy works in, whether it beat or lagged a benchmark, how many trades it took to get there, whether the drawdown was shallow or severe. López de Prado (2018) argues throughout Advances in Financial Machine Learning that the durable signal in a backtest is structural, not point-estimate; the categorical projection is a crude but real way to keep the agent's attention there. We are not claiming this turns an agent into a careful researcher. We are claiming it removes one of the easier ways to be a careless one.

A note on honesty

None of this makes a strategy good. A "promising" tier is still just a label on a backtest, and a backtest is still a single draw from a noisy process (Taleb, 2001, on mistaking that draw for skill). The boundary protects against over-precise reasoning; it does not protect against being wrong. Those are different problems, and the platform does not pretend the first solves the second.

How it actually works

The enforcement lives in two small, deliberately boring modules in the MCP server, and the pipeline for every tool is the same: upstream result → sanitize → egress filter → response.

The sanitiser: transform what's structural, drop what's raw

sanitize.py runs first and does the transforms. It walks the entire response tree and rewrites leaky shapes into safe ones. An absolute-dollar equity curve becomes a relative multiplier that starts at 1.0: the drawdown profile and recovery shape survive, the capital anchor is gone. Monte Carlo dollar percentiles get normalized against their own median, so the cone-of-outcomes shape stays but the spot-price anchor that would let a reader invert the cone back into dollar levels does not. Per-trade entry and exit prices, share counts, and dollar P&L are stripped; the percentage return of each trade, a pure ratio, survives. Absolute-level fundamentals (revenue, EPS, book value, EBITDA) are removed globally, while the ratios and z-scores built on them stay, because the analytical product of a value analyzer is "cheap or expensive versus history," and a z-score carries that without carrying the price.
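A minimal sketch of what such a tree walk could look like follows. It is an illustration under stated assumptions, not the contents of sanitize.py: every key name here (equity_curve_usd, mc_percentiles_usd, p50, return_pct and the two drop sets) is hypothetical.

```python
# Hypothetical key names; the real field lists are not shown in this article.
RAW_TRADE_KEYS = {"entry_price", "exit_price", "shares", "pnl_usd"}
RAW_FUNDAMENTAL_KEYS = {"revenue", "eps", "book_value", "ebitda"}


def to_multiplier(curve):
    """Rebase an absolute-dollar equity curve so that it starts at 1.0."""
    if not curve or curve[0] <= 0:
        return []  # nothing safe to emit
    start = curve[0]
    return [value / start for value in curve]


def normalize_to_median(percentiles):
    """Express Monte Carlo dollar percentiles as ratios to their own median (p50)."""
    mid = percentiles["p50"]
    return {name: value / mid for name, value in percentiles.items()}


def sanitize(node):
    """Walk the whole response tree, transforming or dropping leaky shapes."""
    if isinstance(node, list):
        return [sanitize(item) for item in node]
    if not isinstance(node, dict):
        return node
    out = {}
    for key, value in node.items():
        if key in RAW_TRADE_KEYS or key in RAW_FUNDAMENTAL_KEYS:
            continue  # raw dollar or price level: dropped wherever it appears
        if key == "equity_curve_usd":
            out["equity_multiplier"] = to_multiplier(value)
        elif key == "mc_percentiles_usd":
            out["mc_percentiles_rel"] = normalize_to_median(value)
        else:
            out[key] = sanitize(value)
    return out


raw = {
    "equity_curve_usd": [50_000.0, 55_000.0, 45_000.0],
    "mc_percentiles_usd": {"p5": 40_000.0, "p50": 50_000.0, "p95": 80_000.0},
    "trades": [{"entry_price": 10.0, "exit_price": 11.0, "shares": 100,
                "pnl_usd": 100.0, "return_pct": 10.0}],
    "fundamentals": {"eps": 5.0, "pe_zscore": -1.2},
}
clean = sanitize(raw)
```

In this sketch the drawdown from 1.1 to 0.9 is still visible in the multiplier, while nothing in clean says whether the account held fifty thousand dollars or fifty million.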

The egress whitelist: default-deny, per tool

Then egress.py applies a default-deny whitelist, declared separately for every single tool. The schema is a plain data structure: a field is either explicitly allowed, recursed into with its own sub-schema, or passed through a transform that coarsens it — a numeric score becomes a decile label (D1..D10), a percentile becomes a quintile (Q1..Q5), a Hurst exponent becomes mean_reverting / random_walk / trending, a volatility becomes a tier. Anything a tool does not explicitly list is dropped. The inversion from blacklist to whitelist is the whole point: adding a new field to a response is now a deliberate act someone has to write down, not a quiet drift that ships a leak. A CI guard fails the build if a registered tool has no whitelist entry at all.
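As a rough sketch of the idea, a per-tool schema and its filter might look like the following. The tool name factor_scan, its fields, the ALLOW sentinel, the input ranges and the Hurst tolerance are all assumptions made for illustration; egress.py itself is not reproduced here.

```python
ALLOW = object()  # sentinel: pass the value through unchanged


def to_decile(score):
    """Coarsen a score assumed to lie in [0, 1] to a label D1..D10."""
    bucket = max(0, min(int(score * 10), 9))
    return f"D{bucket + 1}"


def to_quintile(percentile):
    """Coarsen a percentile assumed to lie in [0, 100] to a label Q1..Q5."""
    bucket = max(0, min(int(percentile // 20), 4))
    return f"Q{bucket + 1}"


def hurst_label(h, tol=0.05):
    """Map a Hurst exponent to a regime label (the tolerance is an arbitrary choice)."""
    if h < 0.5 - tol:
        return "mean_reverting"
    if h > 0.5 + tol:
        return "trending"
    return "random_walk"


# One schema per tool: ALLOW, a nested sub-schema, or a coarsening transform.
WHITELISTS = {
    "factor_scan": {
        "as_of": ALLOW,
        "trade_count": ALLOW,
        "score": to_decile,
        "percentile": to_quintile,
        "diagnostics": {"hurst": hurst_label},
    },
}


def apply_schema(payload, schema):
    """Default-deny: only keys named in the schema can reach the output."""
    out = {}
    for key, rule in schema.items():
        if key not in payload:
            continue
        value = payload[key]
        if rule is ALLOW:
            out[key] = value
        elif isinstance(rule, dict):
            if isinstance(value, dict):  # a mismatched shape is dropped
                out[key] = apply_schema(value, rule)
        else:
            out[key] = rule(value)
    return out


payload = {
    "as_of": "2024-06-28", "trade_count": 42, "score": 0.87, "percentile": 63.0,
    "diagnostics": {"hurst": 0.41, "raw_vol": 0.1834},
    "close": 187.32,  # never listed in the schema, so it never leaves
}
filtered = apply_schema(payload, WHITELISTS["factor_scan"])
```

The loop iterates over the schema rather than over the payload, so an unexpected field has no code path to the response. Forgetting to list a field produces a missing value, which someone notices, instead of a leak, which nobody does.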

Defence in depth, honestly stated

If a tool is missing a whitelist at runtime, the gate currently fails open: it logs a warning and passes the payload through, so a freshly-added tool isn't silently broken before its entry ships. That is a real trade-off, and it is exactly why the sanitiser runs first and why critical tools carry a second, independent allow-list. We would rather state the seam plainly than imply the wall has no seams.
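The fail-open seam and the CI guard can be sketched side by side. This is a simplified illustration, not the production gate: whitelists are reduced to flat sets of allowed keys, and the function names and tool names are hypothetical.

```python
import logging

logger = logging.getLogger("egress")


def egress_gate(tool_name, payload, whitelists):
    """Filter a payload by its tool's allow-list, failing open if none exists."""
    allowed = whitelists.get(tool_name)
    if allowed is None:
        # The seam described above: warn and pass through instead of blocking.
        logger.warning("no egress whitelist for tool %r; passing payload through",
                       tool_name)
        return payload
    return {key: value for key, value in payload.items() if key in allowed}


def tools_missing_whitelist(registered_tools, whitelists):
    """CI-style check: names of registered tools that have no whitelist entry."""
    return sorted(set(registered_tools) - set(whitelists))


whitelists = {"run_backtest": {"tier", "trade_count"}}
covered = egress_gate("run_backtest", {"tier": "promising", "close": 187.32}, whitelists)
uncovered = egress_gate("new_tool", {"x": 1}, whitelists)
missing = tools_missing_whitelist(["run_backtest", "new_tool"], whitelists)
```

A build step could fail whenever tools_missing_whitelist returns a non-empty list. At runtime, whatever the gate passes through for an unlisted tool has already been through the sanitiser, which is the point of running that stage first.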

Two allow-lists that must agree

The Primitives run-result tool is the clearest example of defence in depth. The strategy builder itself never sends raw node results across the boundary. Inside Primitives, a projection function reads each terminal node and emits only a categorical summary — a performance tier, a drawdown band, a trades tier, a beat/lagged/matched flag, a market-state regime label, a basket count. Any result kind it does not understand (a raw price series, a bare scalar, an unknown dict) returns nothing and is dropped: it fails closed. Then, on the MCP side, a second strict whitelist enumerates that same small set of categorical keys and discards anything else. Two independent lists, written in two different services, both have to agree before a field reaches the agent. If one has a bug, the other still holds. Feynman's (1974) warning that the easiest person to fool is yourself is the design note here: we assume each layer is fallible and make them check each other.
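The two-layer arrangement can be sketched as follows. Everything named here is an assumption for illustration: the key names, the result kind "backtest", and the tier and band thresholds are invented, and both layers sit in one file only for brevity, whereas the text above places them in two separate services.

```python
# Layer 1 (Primitives side): a projector per result kind it understands.
def project_backtest(node):
    """Reduce a backtest result to categorical labels (thresholds are arbitrary)."""
    ret = node["total_return_pct"]
    drawdown = node["max_drawdown_pct"]
    return {
        "performance_tier": "promising" if ret >= 10 else "mixed" if ret >= 0 else "weak",
        "drawdown_band": "shallow" if drawdown < 10 else "moderate" if drawdown < 25 else "severe",
    }


PROJECTORS = {"backtest": project_backtest}


def project_node(node):
    """Return a categorical summary, or None for any kind not understood."""
    if not isinstance(node, dict):
        return None  # bare scalar, raw series, anything unexpected: fail closed
    projector = PROJECTORS.get(node.get("kind"))
    return projector(node) if projector else None


# Layer 2 (MCP side): an independent list of the keys that may reach the agent.
MCP_ALLOWED = frozenset({
    "performance_tier", "drawdown_band", "trades_tier",
    "vs_benchmark", "regime", "basket_count",
})


def mcp_filter(summary):
    """Keep only the enumerated categorical keys; None becomes an empty result."""
    if summary is None:
        return {}
    return {key: value for key, value in summary.items() if key in MCP_ALLOWED}


def to_agent(node):
    """A field reaches the agent only if both layers let it through."""
    return mcp_filter(project_node(node))


node = {"kind": "backtest", "total_return_pct": 14.0, "max_drawdown_pct": 18.0,
        "equity_usd": [100_000.0, 114_000.0]}
summary = to_agent(node)
```

If a bug in the first layer ever let a raw field such as equity_usd into the summary, mcp_filter would still discard it, because that key is not on the second list. Neither layer needs to trust the other to have been written correctly.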

What the agent does see

It is worth being concrete about what survives, because the boundary is meant to be useful, not merely restrictive. An agent reading through MCP gets: relative equity multipliers and drawdown shapes; z-scores and percentage returns aggregated to the portfolio level; regime labels and trade counts; ranking deciles and qualitative tiers; the point-in-time anchor a run was evaluated at, and any look-ahead warning attached to it. That last item matters as much as the redactions — the boundary surfaces the honesty metadata (when, and whether the result might be contaminated) right alongside the coarsened outcome.

Key points

The MCP server returns categorical results — tiers, bands, deciles, counts, regime labels — and never raw prices, dollars, per-name fundamentals, or equity curves.
One reason is licensing: we use market data under license and do not get to re-export the raw feed. Leaks reconstruct across innocent fields, so the fix is structural, not a blacklist.
A second reason is research honesty: an agent reasoning about structure and relative outcomes can't curve-fit to a precise number it never received.
Mechanism: sanitize.py transforms leaky shapes (equity curves → multipliers, dollar percentiles → median-relative ratios) and strips raw levels, then a per-tool egress.py default-deny whitelist coarsens scores into labels and drops anything not explicitly allowed.
Critical paths carry two independent allow-lists that must agree, and the run result fails closed on any result kind it doesn't understand.
The boundary guards against over-precise reasoning. It does not make a strategy good — a "promising" label is still just a label on one noisy backtest.

Closing

A boundary like this is never finished and never perfectly tight; the fail-open default and the long history of patched reconstruction leaks are proof that we keep finding seams. What we can offer is a design that assumes its own fallibility — transform first, then default-deny, then make two layers agree — and a habit of writing the seams down rather than papering over them. The agent sees a coarser world than you do when you open the platform yourself. That is the trade, and on both the licensing side and the research side, we think it is the honest one.

References

Bailey, D. H., Borwein, J. M., López de Prado, M. & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. Notices of the AMS, 61(5), 458–471.
Harvey, C. R. & Liu, Y. (2015). Backtesting. Journal of Portfolio Management, 42(1), 13–28.
Feynman, R. P. (1974). Cargo Cult Science (Caltech commencement address). Engineering and Science, 37(7), 10–13.

Textbook references

López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
Taleb, N. N. (2001). Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets. Random House.

Try it in QuanterLab

Connect an AI agent to the QuanterLab MCP server and ask it to run one of your saved Primitives projects, then read the result. Notice what comes back: a performance tier, a drawdown band, a regime label, the point-in-time anchor, and nowhere a single price or dollar figure. That coarse-by-design view is the data boundary in action.