Most uses of an AI assistant in research are one-shot: you ask a question, you get an answer, and the answer sounds confident whether or not it is right. That is exactly the failure mode we worry about most. The easiest person to fool is yourself, and a fluent paragraph is a very efficient way to get fooled — it skips the part where you actually run something and look at what came back.
QuanterLab's MCP server exists to change the shape of that interaction. Instead of producing a single answer, an AI agent can operate the platform programmatically — list what strategies and saved work exist, launch a backtest or a Primitives run, poll it to completion, and read a deliberately narrow summary of the result. That turns one chat reply into a research loop: plan, run, read, refine. The point is not that the agent is smarter. It is that the agent is forced to go and check.
What MCP is, and what we actually expose
MCP (the Model Context Protocol) is a standard interface for letting an AI agent call tools. A server publishes a set of typed tools; a client — Claude, or any MCP-capable agent — discovers them and calls them with structured arguments. There is nothing finance-specific about the protocol itself. What matters is which tools you choose to expose and what they return.
Our server (mcp/MCPSERVER/server.py) speaks MCP over an authenticated HTTP transport at /mcp, behind OAuth 2.1: every call carries a bearer token that resolves to a specific user, and that user identity scopes everything the tools can see. The surface is intentionally narrow. We trimmed it down to orchestration only: catalogs, listings, job launchers, progress pollers, and a couple of summarizers. We do not expose raw result-fetchers that would hand back the full numeric payload.
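For readers who want to see what a call looks like on the wire, here is a minimal sketch of one tool call as an HTTP request. MCP messages are JSON-RPC 2.0 and tools are invoked with the tools/call method; the https scheme, the placeholder token, and the empty argument object for list_my_backtests are assumptions made for illustration. The request is built but never sent, and a real client would first complete the OAuth flow and the MCP initialize handshake.

```python
import json
import urllib.request

# Host and path come from the article; the https scheme is an assumption.
MCP_URL = "https://quanterlab.com/mcp"


def build_tool_call(tool_name, arguments, token, request_id=1):
    """Build (but do not send) a JSON-RPC 2.0 request for MCP's tools/call method."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # The bearer token is what the server resolves to a user identity.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


# Hypothetical usage: the argument shape of list_my_backtests is assumed to be empty.
req = build_tool_call("list_my_backtests", {}, token="PLACEHOLDER_TOKEN")
print(req.get_method(), req.full_url)
print(json.loads(req.data)["params"])
```

In practice an MCP client library handles this framing for you; the sketch only shows that the token travels with every call.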
The tools fall into a few honest families:
- Catalogs (metadata only). list_indicators, list_stochastic_methods, list_primitives, list_fundamental_analyzers, and list_basic_backtest_strategies describe what blocks, indicators and methods exist (their parameters, defaults and bounds) and nothing else. No market data, no computed values.
- Your saved work (ownership-scoped listings). list_my_backtests, list_my_filter_results, list_my_trading_configs, list_my_projects, and list_my_primitives_projects return summary rows for your own saved work and no one else's. A backtest row, for instance, carries a categorical performance_tier (promising / mixed / weak / unrated) instead of the underlying Sharpe and drawdown scalars.
- Launch and poll (the async pattern). Each long-running workflow gets a pair: start_* returns a task handle immediately; get_*_progress reports status and progress until it reaches done or error. This covers start_backtest, start_walkforward, start_stochastic_backtest, start_mr_scan, start_pairs_scan, start_regime_optimization, start_regime_analysis, start_basic_backtest, and start_primitives_run.
- Read a summary. summarize_workflow turns a completed task into a categorical verdict plus a short, templated narrative; get_primitives_run_result returns a per-terminal summary in categorical tiers and bands. Neither returns equity curves, dollar values, per-name weights, or raw metrics. For the full picture you open the platform UI.
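The launch-and-poll pair can be sketched as a small helper. This is an illustration under stated assumptions, not the server's actual contract: FakeClient stands in for a real MCP session, and the call_tool method, the task_id and status field names, the get_backtest_progress tool name (inferred from the get_*_progress pattern), and the units of oos_pct are all hypothetical. Only the done and error terminal states come from the description above.

```python
import time


class FakeClient:
    """Stand-in for an MCP session; every reply here is invented for illustration."""

    def __init__(self, polls_until_done=3):
        self._remaining = polls_until_done

    def call_tool(self, name, arguments):
        if name.startswith("start_"):
            # A launcher returns a handle immediately, not a result.
            return {"task_id": "task-123"}
        if name.startswith("get_") and name.endswith("_progress"):
            self._remaining -= 1
            return {"status": "done" if self._remaining <= 0 else "running"}
        raise ValueError(f"unknown tool: {name}")


def run_to_completion(client, workflow, arguments, interval=0.0, max_polls=100):
    """Launch a workflow, then poll until it reports done or error."""
    handle = client.call_tool(f"start_{workflow}", arguments)
    task_id = handle["task_id"]
    for _ in range(max_polls):
        progress = client.call_tool(f"get_{workflow}_progress", {"task_id": task_id})
        if progress["status"] in ("done", "error"):
            return task_id, progress["status"]
        time.sleep(interval)  # use a real delay against a real server
    raise TimeoutError(f"{task_id} did not finish within {max_polls} polls")


# Hypothetical usage: whether oos_pct is 40 or 0.4 is an assumption.
task_id, status = run_to_completion(FakeClient(), "backtest", {"oos_pct": 40})
print(task_id, status)
```

Bounding the number of polls keeps a stuck task from turning into an endless loop on the agent side.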
An agent never sends us a strategy to run. It calls list_my_primitives_projects to get an opaque project id, then start_primitives_run(project_id). The server resolves your own latest circuit for that project on the server side, runs it at the point-in-time anchor that was saved with it, and hands back only a task handle. The circuit — your tickers, parameters, capital, the actual strategy IP — never crosses the boundary, in or out.
Why the loop matters for honest research
The case for an agentic loop over a one-shot answer is not really a case about AI at all. It is the old forecasting result: people who update incrementally, who break a question into pieces and check each against reality, are better calibrated than people who commit to one confident story. Tetlock and Gardner (2015) made this concrete — the best forecasters are the ones who keep score and keep adjusting, not the ones with the most assured narrative. An iteration-and-calibration loop beats a single confident pronouncement, and it beats it precisely because it builds in the chance to be wrong and notice.
Plan → run → read → refine is that loop, made operational. The agent proposes a configuration, launches it, waits for the real result, reads what came back, and adjusts. At every cycle there is a hard contact with data the agent did not get to invent. Feynman (1974) framed scientific integrity as a kind of leaning-over-backwards to report the things that might prove you wrong; an agent that must wait for an out-of-sample result before it can claim anything is, mechanically, doing a small version of that. The fluent paragraph is no longer enough — there is a task id to poll.
The discipline: a collaborator still has to respect out-of-sample
An agent that can launch jobs in a loop is also an agent that can overfit in a loop, faster than any human. This is the danger we take most seriously, because it is the danger most likely to feel like progress while it happens. If you let something sweep parameters until a number looks good, the number will eventually look good — and it will mean nothing. Bailey, Borwein, López de Prado and Zhu (2014) called the polished version of this "pseudo-mathematics": backtest statistics presented with full rigor that are, underneath, the product of having tried enough configurations. López de Prado (2018) is blunt that an isolated backtest, however clean, is closer to a story than a test. And Harvey and Liu (2015) showed how the published cross-section of "anomalies" looks very different once you account for how many things were tried.
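The claim that a swept number will eventually look good can be checked with a few lines of simulation. The sketch below is not taken from the cited papers or from the platform; it draws one year of daily returns that are pure noise, computes an annualised Sharpe ratio, and reports the best one found among N such trials. The 1% daily volatility and the 252-day year are arbitrary assumptions.

```python
import random
import statistics


def noise_sharpe(rng, n_days=252):
    """Annualised Sharpe ratio of daily returns that are pure noise (true edge is zero)."""
    returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
    return statistics.mean(returns) / statistics.stdev(returns) * (252 ** 0.5)


def best_of(n_trials, seed=0):
    """Best Sharpe among n_trials strategies that all have no real edge."""
    rng = random.Random(seed)
    return max(noise_sharpe(rng) for _ in range(n_trials))


# With a shared seed the smaller sweeps are prefixes of the larger ones,
# so the best result can only stay the same or improve as trials are added.
for n in (1, 10, 100, 1000):
    print(n, round(best_of(n), 2))
```

None of these strategies has any edge, yet the winner of a large enough sweep tends to look respectable. That is the whole problem in miniature.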
The platform is built so the agent inherits the same guardrails a careful human gets, rather than a faster way around them:
- Out-of-sample is a first-class argument, not an afterthought. start_backtest takes an oos_pct reservation. start_walkforward exists specifically to test that an edge is not curve-fit: it re-fits parameters on an in-sample window, then applies them untouched to the next out-of-sample window and steps forward. The whole point of the tool is to refuse to grade a strategy on the data it was tuned on.
- Sweeps are bounded. The launchers clamp grid sizes and reject pathological cost ceilings: a regime optimization that would imply tens of thousands of inner evaluations is refused at the boundary with a hint about which knob to turn down. There is also a small cap on concurrent in-flight tasks per user. None of this makes overfitting impossible; it makes the brute-force version inconvenient enough that you have to mean it.
- Point-in-time is enforced on the server, not requested by the client. A Primitives run executes at the anchor saved with it, and the result surfaces any look-ahead warning. The agent cannot quietly move the anchor to flatter the result.
- Results come back deliberately coarse. Verdicts and tiers, not scalars. That is partly a data-redistribution boundary (covered in its own article), but it has a second, useful effect: a categorical verdict is hard to mine for an optimization target. You cannot gradient-descend on the word "mixed".
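The walk-forward idea in the first guardrail, fitting on one window and grading on the next, can be sketched generically. This is not the platform's implementation: the window sizes, the step of one out-of-sample block, and the toy long-or-short scoring rule are assumptions chosen to keep the example small.

```python
def walk_forward_windows(n_obs, in_sample, out_of_sample):
    """Yield (fit, test) index ranges; each test range starts where its fit range ends."""
    start = 0
    while start + in_sample + out_of_sample <= n_obs:
        fit = range(start, start + in_sample)
        test = range(start + in_sample, start + in_sample + out_of_sample)
        yield fit, test
        start += out_of_sample  # step forward by one out-of-sample block


def walk_forward(data, in_sample, out_of_sample, candidates, score):
    """Pick the best candidate on each fit window, then grade it only on the test window."""
    results = []
    for fit, test in walk_forward_windows(len(data), in_sample, out_of_sample):
        fit_data = [data[i] for i in fit]
        test_data = [data[i] for i in test]
        best = max(candidates, key=lambda p: score(fit_data, p))
        # The chosen parameter is applied untouched to data it never saw.
        results.append((best, score(test_data, best)))
    return results


# Toy example: the only "parameter" is direction (+1 long, -1 short).
returns = [0.01, 0.02, -0.01, 0.03, 0.01, -0.02, 0.02, 0.01, -0.01, 0.02]
results = walk_forward(returns, 4, 2, candidates=(-1, 1),
                       score=lambda window, direction: direction * sum(window))
for chosen, oos_score in results:
    print(chosen, round(oos_score, 4))
```

The scores that matter are the second element of each pair, because those are the only numbers the fitting step never touched.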
Treat the loop the way you would treat a diligent junior researcher: it can run far more experiments than you can, which makes it far more useful and far more dangerous. The same multiple-testing skepticism you would apply to your own afternoon of fiddling applies to its afternoon of fiddling, and more so, because it never gets tired and never stops on its own. The honest move is to decide what would change your mind before you launch, and to count how many things were tried.
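One way to make "count how many things were tried" concrete is a small ledger that records every trial and tightens the significance bar as the count grows. The sketch uses a Bonferroni correction, which is a standard but conservative choice and not something the platform computes. The p-values are hypothetical and would come from your own testing, since the platform's coarse verdicts are not p-values.

```python
from dataclasses import dataclass, field


@dataclass
class TrialLedger:
    """Records every configuration tried, so the bar rises with the count."""

    kill_criterion: str  # written down before the first launch
    alpha: float = 0.05
    trials: list = field(default_factory=list)

    def record(self, label, p_value):
        self.trials.append((label, p_value))

    def threshold(self):
        """Bonferroni: divide the significance level by the number of trials."""
        return self.alpha / max(1, len(self.trials))

    def survivors(self):
        cutoff = self.threshold()
        return [label for label, p in self.trials if p <= cutoff]


# Hypothetical numbers, for illustration only.
ledger = TrialLedger(kill_criterion="drop the idea if walk-forward fails")
ledger.record("lookback=20", 0.04)
ledger.record("lookback=50", 0.01)
ledger.record("lookback=100", 0.20)
print(len(ledger.trials), ledger.threshold(), ledger.survivors())
```

A result that would have passed on its own at 0.04 no longer passes once it is one of three attempts.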
A worked shape
A realistic session reads less like a question-and-answer and more like a lab notebook. The agent calls describe_capability to map an intent to the right workflow, lists your saved configs, launches a backtest with an out-of-sample reservation, polls it, reads summarize_workflow for a verdict, and, if the verdict is anything short of clearly negative, launches a walk-forward to see whether the edge survives out of sample. Each step is a tool call with a real result behind it. The agent ends the session able to say what it ran and what came back, rather than what it believes.
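The session above can be sketched as code that keeps its own notebook. StubSession returns canned replies in place of the real server, and every argument and field name (intent, config_id, task_id, verdict, workflow) is hypothetical. The sketch also assumes the verdict reuses the promising / mixed / weak tiers and treats weak as "clearly negative".

```python
class StubSession:
    """Canned replies standing in for the real server; every field name is assumed."""

    def __init__(self, verdict="mixed"):
        self.verdict = verdict
        self.launched = 0

    def call(self, tool, **args):
        if tool == "describe_capability":
            return {"workflow": "backtest"}
        if tool == "list_my_trading_configs":
            return [{"config_id": "cfg-1"}]
        if tool.startswith("start_"):
            self.launched += 1
            return {"task_id": f"task-{self.launched}"}
        if tool.endswith("_progress"):
            return {"status": "done"}
        if tool == "summarize_workflow":
            return {"verdict": self.verdict}
        raise ValueError(f"unknown tool: {tool}")


def research_session(session):
    """Run one plan, run, read, refine cycle and return the notebook of calls."""
    notebook = []

    def step(tool, **args):
        result = session.call(tool, **args)
        notebook.append((tool, args, result))  # what was run and what came back
        return result

    step("describe_capability", intent="test a saved trading config")
    config = step("list_my_trading_configs")[0]
    task = step("start_backtest", config_id=config["config_id"], oos_pct=40)
    # The stub finishes at once; against a real server, sleep between polls.
    while step("get_backtest_progress", task_id=task["task_id"])["status"] not in ("done", "error"):
        pass
    verdict = step("summarize_workflow", task_id=task["task_id"])["verdict"]
    if verdict != "weak":  # anything short of clearly negative earns a harder test
        follow_up = step("start_walkforward", config_id=config["config_id"])
        step("get_walkforward_progress", task_id=follow_up["task_id"])
    return notebook


for tool, args, result in research_session(StubSession("mixed")):
    print(tool, args, result)
```

The returned notebook is the useful artifact: it lists each call with its arguments and reply, which is what lets the agent report what it ran instead of what it believes.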
That is the whole ambition here, and it is a modest one. We are not promising that an agent finds anything. We are making it cheap to do the slow, checkable version of research — and a little inconvenient to do the fast, self-fooling version. An agentic loop will not save you from a bad idea. But it will make you watch the out-of-sample window before you fall in love with one, and on most days that is the more valuable service.
Key takeaways
- The MCP server lets an AI agent operate QuanterLab programmatically — catalogs, ownership-scoped listings, launch/poll pairs, and categorical summaries — over an authenticated, per-user transport.
- That turns a single chat answer into a real research loop: plan, run, read, refine, with a hard contact against data at every cycle.
- Iterating with calibration beats one-shot confidence — the forecasting literature is clear that the best work updates incrementally rather than committing to a story.
- An agent can overfit in a loop, fast. The guardrails — out-of-sample as a first-class argument, walk-forward, bounded sweeps, server-enforced point-in-time, deliberately coarse results — are the same discipline a careful human needs, not a way around it.
- Primitives runs go by reference: the agent never sends or receives the circuit, only an opaque project id and a categorical outcome.
None of this is a substitute for judgment. It is scaffolding for it. The agent runs the experiments; deciding what would count as evidence — and being honest when the evidence is thin — is still your job, and always will be.
Further Reading
Foundational papers
- Tetlock, P. E. & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
- Feynman, R. P. (1974). Cargo Cult Science (Caltech commencement address). Engineering and Science, 37(7), 10–13.
- Bailey, D. H., Borwein, J. M., López de Prado, M. & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. Notices of the AMS, 61(5), 458–471.
- Harvey, C. R. & Liu, Y. (2015). Backtesting. Journal of Portfolio Management, 42(1), 13–28.
Textbook references
- López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
Related QuanterLab articles
- The data boundary: why the agent never sees raw prices
- Quantin: a research assistant, not a strategy oracle
Try it in QuanterLab
Connect an MCP-capable agent to quanterlab.com/mcp, then ask it to list your saved backtests, launch one with a 40% out-of-sample reservation, poll it to completion, and follow up with a walk-forward. Watch the loop run, and notice that every claim it makes has a task id behind it.