
TL;DR

Forezai · TradingAgents is a fork of the open-source TradingAgents framework in which thirteen LLM agents act as a decision-making committee for paper trading. The project tests whether that committee's decisions are at least no worse than a coin flip after fees in simulated markets. The fork adds the operational tooling (scheduler, brokers, dashboard, audit logs) needed to evaluate that question systematically.

The platform runs daily trading loops, generates paper orders, and records performance without risking real money. It follows two weeks of Polybot experiments in which most parametric strategies failed, and asks whether LLM agents structured in specialized adversarial roles do any better. The stated prior is that it might fail too.

The Forezai · TradingAgents project is a fork of an open-source framework originally designed to explore multi-agent AI decision-making in trading. It retains the core architecture of thirteen specialized LLM roles, including analysts, debate agents, risk teams, and portfolio synthesizers, which argue and justify trading decisions based on structured reports. The new operational layer adds an autonomous scheduler, paper-trading interfaces, position management, and multi-broker support, enabling fully automated daily trading cycles in a simulated environment.

Unlike previous prototypes that functioned as research demos, this version includes a local web dashboard built with FastAPI and React, allowing users to monitor performance metrics, equity curves, and decision breakdowns. It also guards against unintentional real-money trading: the Alpaca live endpoints are hard-refused at multiple layers in the code. The system is designed for research, not live trading, and its activity is recorded in append-only audit logs for later analysis.

The system runs daily cycles to evaluate whether the AI committee's decisions outperform random or mechanical strategies over time. The framework does not claim that LLMs predict markets accurately. It tests whether structured debate among specialized models yields decisions that are at least no worse than random, and what that shows about AI reasoning in complex decision environments.

Introducing Forezai · TradingAgents — Thorsten Meyer AI
AGENTS
● ANNOUNCEMENT / MAY 2026
THORSTEN MEYER AI · FOREZAI · § 03
Research Series · Companion to Polybot Week 1-2 · 2026-05-17

Introducing Forezai · TradingAgents.
A committee of LLMs
decides paper-trades.

After two weeks of finding out most parametric strategies don’t work, the obvious next research question: can multi-agent LLM judgment do any better?
A fork of the open-source TradingAgents framework (TauricResearch): thirteen LLM agents in four stages — four parallel analysts · a bull-bear debate with research-manager arbitration · a three-voice risk team · a two-layer trader + portfolio-manager decision. The fork keeps the agent graph intact and adds the operational layer the upstream doesn’t ship: an autonomous loop · a multi-broker abstraction · a local web dashboard · Codex OAuth · MCP plug-ins · 520+ unit tests. The question is narrower than “do LLMs predict the market” — that prior is “no, with high confidence.” The narrower question is: when LLMs are structured into specialised adversarial roles, does the committee produce decisions at least no worse than a coin flip after fees? Honest priors before running: it might fail too. If it appears to work, the most likely explanation is variance.
This is not financial advice. Nothing in this announcement should be used to inform real trading decisions. The software described trades simulated money by default. If you reconfigure it to trade real money, you should expect to lose that money — regardless of how clever any individual agent’s reasoning looks. Algorithmic trading is zero-sum after fees and structurally hostile to part-time retail strategies.
13 agents: specialised roles in four stages (Analysts · Debate · Risk · Decision)
78% / -33%: Polybot prior, fleet win rate combined with -33% bankroll
520+: passing unit tests across engine, services, HTTP routes (starting baseline)
€0 floor: LLM cost on Codex OAuth (falls back to public API per token)
FOREZAI / TRADINGAGENTS· APACHE 2.0 FORK· UPSTREAM TAURIC RESEARCH· LANGGRAPH· 13 AGENTS / 4 STAGES· 4 PARALLEL ANALYSTS· BULL-BEAR DEBATE· 3-VOICE RISK TEAM· TRADER + PORTFOLIO MANAGER· 5-TIER FINAL RATING· ALPACA PAPER + LOCAL + SHADOW· LIVE ENDPOINTS HARD-REFUSED· FASTAPI + REACT VIA CDN· CODEX OAUTH· MCP PLUG-IN REGISTRY· 520+ UNIT TESTS· POLYBOT WEEK 1: 21 EXPERIMENTS· WEEK 2: -33% BANKROLL· 78% FLEET WIN RATE· HONEST RESEARCH, NOT EDGE
FIG. 01 — THE 13-AGENT COMMITTEE
Thirteen specialised roles · four stages · biases made to argue in public
The architecture forces the system to articulate its reasoning rather than relying on what a single context window happens to recall
Stage 1 · Four analysts in parallel · 4 agents
Market: structure, ranges, regime indicators
News + Insider: news flow, filings, insider activity
Fundamentals: balance sheet, earnings, ratios
Social Sentiment: social-media tone, retail signal
Stage 2 · Bull-bear debate + research-manager arbitration · 3 agents
Bull researcher: argues upside thesis from analyst reports
Bear researcher: argues downside thesis from same reports
Research manager: arbitrates · writes single synthesis
Stage 3 · Three-voice risk team · 3 agents
Aggressive: looks for upside · accepts variance
Conservative: looks for downside · protects capital
Neutral: balances · forces downside articulation
Stage 4 · Two-layer decision · 2 agents
Trader: three-tier proposal · buy / hold / sell
Portfolio manager: five-tier rating + price target + horizon · sees arguments only, never raw data
The portfolio manager only sees the arguments, never the raw data — which forces the committee to make its reasoning explicit rather than relying on a single context window’s recall. The upstream framework ships the agent graph; it does not ship the operational machinery to run that graph on autopilot, observe its results honestly, store them for later inspection, or prevent the operator from accidentally trading real money. That gap is what the Forezai fork fills.
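The data flow through the four stages can be sketched roughly as follows. This is a minimal illustration with stub functions standing in for LLM calls, not the fork's or the upstream framework's actual code; every name here (`Argument`, `analyst`, `debate`, `risk_team`, `trader`, `portfolio_manager`, `run_committee`) is hypothetical, and the `"hold"` rating is a placeholder. The one property it tries to show is that the final stage receives written arguments only, never the raw market data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    author: str  # which role wrote it
    text: str    # the written reasoning handed downstream


def analyst(role: str, raw_data: dict) -> Argument:
    # Stub for an LLM call: only Stage 1 ever touches raw data.
    return Argument(role, f"{role} report on {raw_data['ticker']}")


def debate(reports: list[Argument]) -> Argument:
    # Stub: bull and bear argue from the same reports, the manager synthesises.
    bull = Argument("bull", f"upside case from {len(reports)} reports")
    bear = Argument("bear", f"downside case from {len(reports)} reports")
    return Argument("research_manager", f"synthesis of: {bull.text} | {bear.text}")


def risk_team(synthesis: Argument) -> list[Argument]:
    # Three voices comment on the single synthesis.
    return [Argument(voice, f"{voice} view on: {synthesis.text}")
            for voice in ("aggressive", "conservative", "neutral")]


def trader(synthesis: Argument) -> Argument:
    # Placeholder three-tier proposal: buy / hold / sell.
    return Argument("trader", "hold")


def portfolio_manager(arguments: list[Argument]) -> dict:
    # Receives Argument objects only; there is no raw_data parameter.
    return {"rating": "hold", "inputs": [a.author for a in arguments]}


def run_committee(raw_data: dict) -> dict:
    reports = [analyst(r, raw_data)
               for r in ("market", "news", "fundamentals", "social")]
    synthesis = debate(reports)
    risk_views = risk_team(synthesis)
    proposal = trader(synthesis)
    return portfolio_manager([synthesis, *risk_views, proposal])
```

Keeping `raw_data` out of the last function's signature is one simple way to enforce the "arguments only" constraint structurally rather than by prompt instruction alone.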
FIG. 02 — THE POLYBOT PRIOR · WHY THIS IS A DIFFERENT BET
Two weeks of paper-trading prediction markets · the trap underneath the headline numbers
25 experiments · 78% fleet-wide win rate · -33% bankroll · most parametric strategies are structurally negative-expectation when measured honestly
The flattering number: 78% · fleet-wide win rate · week 2
“You can win four out of five trades and still go broke, because the one loss is bigger than the four wins put together.” Win rate without P&L context is a mechanical illusion.
The honest number: −33% · fleet bankroll · week 2 close
The strongest possible demonstration of the trap. A parametric trading strategy that looks compelling in a backtest will almost always fail to survive a fresh sample. Most “edges” are mechanical artefacts.
Week 1: 21 parallel strategy experiments · early winners mostly mechanical illusions · exactly one strategy (a fair-value taker on BTC) showed the mathematical signature of real edge over a few hundred settled trades. Week 2: same fair-value strategy with more data collapsed. A separate mid-week hypothesis (market-making) also failed cleanly. Fleet ended week 2 at roughly negative thirty-three percent of bankroll. The honest research finding wasn’t on the winning side — it was on the losing side. Adding more parameters to Polybot wouldn’t change that. TradingAgents is asking a separable question.
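The gap between win rate and bankroll comes down to per-trade expectancy. The sketch below works through the arithmetic with illustrative numbers: the 78% win rate is from the article, but the 1-unit average win and 5-unit average loss are assumptions chosen to mirror the quoted "one loss is bigger than the four wins put together", not Polybot's actual figures, which the article does not give.

```python
def expectancy(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected P&L per trade; avg_loss is a positive magnitude."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


def breakeven_loss_ratio(win_rate: float) -> float:
    """avg_loss / avg_win ratio at which expectancy is exactly zero."""
    return win_rate / (1.0 - win_rate)


# Illustrative only: 78% of trades win 1 unit, 22% lose 5 units.
per_trade = expectancy(0.78, avg_win=1.0, avg_loss=5.0)

# At a 78% win rate, any average loss larger than this multiple of the
# average win makes the strategy negative-expectation.
ratio = breakeven_loss_ratio(0.78)
```

Under these assumed sizes the expectancy is negative even though nearly four in five trades win, which is why the dashboard reports win rate alongside P&L rather than on its own.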
FIG. 03 — WHAT THE FORK ADDS · THE OPERATIONAL LAYER
Six layers the upstream framework doesn’t ship
Same agent graph, intact. The fork makes it a research instrument rather than a tech demo.
01 · Loop
An autonomous loop
Scheduler · watchlist · auto-trader maps ratings to paper orders · allow-list filtering · per-ticker cooldowns · sector caps · cash checks · position manager evaluates open positions every 60s for TP / SL / max-hold. Append-only audit logs.
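A minimal sketch of two of the checks named above, under stated assumptions: a pre-trade gate (allow-list, per-ticker cooldown, cash check) and a position exit evaluation (TP / SL / max-hold). The function names, the `Position` class, and all threshold defaults are hypothetical, not the fork's actual values; sector caps are left out for brevity, and only long positions are handled.

```python
from dataclasses import dataclass


@dataclass
class Position:
    entry_price: float
    held_seconds: float


def exit_reason(pos, price, take_profit=0.05, stop_loss=0.03,
                max_hold_seconds=5 * 86400):
    """Return 'TP', 'SL', 'MAX_HOLD', or None for a long position."""
    change = (price - pos.entry_price) / pos.entry_price
    if change >= take_profit:
        return "TP"
    if change <= -stop_loss:
        return "SL"
    if pos.held_seconds >= max_hold_seconds:
        return "MAX_HOLD"
    return None


def may_open(ticker, cost, *, allow_list, cash, seconds_since_last_trade,
             cooldown_seconds=3600):
    """Pre-trade gate: allow-list, per-ticker cooldown, and cash check."""
    if ticker not in allow_list:
        return False, "not on allow-list"
    last = seconds_since_last_trade.get(ticker)
    if last is not None and last < cooldown_seconds:
        return False, "cooldown"
    if cost > cash:
        return False, "insufficient cash"
    return True, "ok"
```

Returning the reason together with the decision makes each refusal easy to write into an append-only audit log.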
02 · Brokers
Multi-broker abstraction
Three modes: local Python broker (yfinance fills, JSON-persisted) · Alpaca paper-trading adapter · “shadow” mode running both in parallel with divergence view. Real Alpaca live endpoints are hard-refused at multiple layers.
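One way such an abstraction might look, as a hedged sketch rather than the fork's code: a small broker interface, a toy local broker, a shadow wrapper that records fill divergence, and a guard that refuses any endpoint not on a paper-host allow-list. All class and function names are hypothetical. The host `paper-api.alpaca.markets` is Alpaca's paper-trading host; the toy broker fills at the quoted price instead of calling yfinance.

```python
from typing import Protocol
from urllib.parse import urlparse

PAPER_HOSTS = {"paper-api.alpaca.markets"}  # allow-list, not a deny-list


class LiveTradingRefused(RuntimeError):
    pass


def require_paper_endpoint(base_url: str) -> str:
    """Raise unless the URL points at a known paper-trading host."""
    host = urlparse(base_url).hostname
    if host not in PAPER_HOSTS:
        raise LiveTradingRefused(f"refusing non-paper endpoint: {host}")
    return base_url


class Broker(Protocol):
    def submit(self, ticker: str, qty: int, price: float) -> float: ...


class LocalBroker:
    """Toy in-memory broker that fills at the quoted price."""

    def submit(self, ticker, qty, price):
        return price


class ShadowBroker:
    """Sends each order to two brokers and records the fill divergence."""

    def __init__(self, primary: Broker, secondary: Broker):
        self.primary, self.secondary = primary, secondary
        self.divergences = []

    def submit(self, ticker, qty, price):
        a = self.primary.submit(ticker, qty, price)
        b = self.secondary.submit(ticker, qty, price)
        self.divergences.append((ticker, a - b))
        return a
```

An allow-list fails closed: an unknown or mistyped host is refused by default, which fits the stated goal of making accidental live trading hard.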
03 · Dashboard
A local web dashboard
FastAPI backend · React via CDN, no Node toolchain · SVG equity curve · rolling-peak drawdown · win-rate by rating / ticker / model · exit-reason breakdown · LLM cost vs realised P&L joined by run ID. Runs locally; nothing sent to a cloud service.
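For reference, drawdown from a peak can be computed as below. This is a sketch that assumes "rolling-peak" means the running maximum of equity since the start of the series (the dashboard may use a windowed peak instead) and that equity values are positive; the function names are illustrative.

```python
def drawdown_series(equity):
    """Fractional drawdown from the running peak at each point (<= 0)."""
    peak = float("-inf")
    out = []
    for value in equity:
        peak = max(peak, value)
        out.append((value - peak) / peak)
    return out


def max_drawdown(equity):
    """Most negative drawdown in the series, or 0.0 for an empty series."""
    return min(drawdown_series(equity), default=0.0)
```

For an equity curve of `[100, 120, 90, 110, 130, 65]` the deepest drawdown is the final drop from 130 to 65, a fall of one half from the peak.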
04 · Codex
Codex OAuth
Runs the engine on a ChatGPT Pro subscription via the Codex backend. LLM cost floor effectively zero if you already have ChatGPT Pro. Token stored encrypted locally. Falls back to the regular OpenAI API if you’d rather pay per token.
05 · Alerts
Multi-channel alerts
Slack · Discord · SMTP email · configurable filter on rating events and order fills · append-only history kept locally. Webhook URLs masked in API responses so a screenshot can’t accidentally leak credentials.
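Masking of this kind can be done with the standard library alone. The sketch below is one possible approach, not the fork's implementation: it keeps the scheme and host, and hides the path (where webhook secrets usually live) except for its last few characters. The function name, the `keep` parameter, and the example URL are hypothetical.

```python
from urllib.parse import urlsplit


def mask_webhook(url: str, keep: int = 4) -> str:
    """Keep scheme and host; hide the path except its last `keep` characters.

    Any query string or fragment is dropped entirely.
    """
    parts = urlsplit(url)
    secret = parts.path.rstrip("/")
    tail = secret[-keep:] if len(secret) > keep else ""
    return f"{parts.scheme}://{parts.netloc}/***{tail}"


masked = mask_webhook("https://hooks.example.com/services/T000/B000/abcd1234")
```

Keeping a short tail lets an operator tell two configured channels apart in the UI without exposing enough of the URL to post to it.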
06 · MCP
MCP plug-ins
Registry for adding Anthropic Model Context Protocol servers (Kensho · Aiera · FactSet · Morningstar · LSEG) as analyst tools. Plug-ins advertise category (fundamentals · news · market data · social) · probe endpoint tests credentials.
Honest-by-design touches: every generated report prepends “Research, not advice” and appends a footer with version, commit, provider, models used, run ID, and cost. Closed trades carry the same metadata. 520+ passing unit tests across engine, services, and HTTP routes. The intent: when the system loses money, the journal makes it impossible to pretend it didn’t.
FIG. 04 — HONEST PRIORS · BEFORE RUNNING THIS IN ANGER
Three priors stated before the data starts arriving
The bias of the project: when the data says no, the dashboard says no, the article says no
1 · Highest likelihood
It might fail too. LLMs are not oracles, and a sophisticated framework around language-model outputs does not change the underlying error rate of the model. Sample is still everything. The framework’s outputs are subject to the same statistical noise as any prediction system over small samples.
2 · Second-most likely
If it appears to work, the most likely explanation is variance. The same trap that caught the first article’s candidate edge applies here. A high win rate over fifty trades means much less than it looks. Without out-of-sample confirmation, a flattering early sample tells you almost nothing about whether the system has real edge.
3 · Genuinely open
If it appears to work for the right reasons (empirical win rate matches stated confidence, and alpha-versus-benchmark persists across non-overlapping samples), that would be a meaningful research finding. Whether that happens, I don’t know. The point of putting it in the open is that the data will say.
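The second prior can be made concrete with a binomial tail. The sketch below asks how often a strategy with no edge at all (a fair coin, an assumption made purely for illustration) would show at least `k` wins in `n` trades; the function name is hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Chance that a coin-flip strategy wins 30 or more of 50 trades.
lucky_60_percent = prob_at_least(30, 50)
```

A 60% win rate over fifty trades turns up by luck on the order of one time in ten for a strategy with zero edge, and this counts wins only, ignoring the size of wins versus losses.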
This is explicitly not a launch announcement for a product anyone should connect a real brokerage account to. The Alpaca live endpoints are hard-refused at multiple layers in the code, and the design choice is deliberate. The right next step is data, not deployment. The bias of the whole project is straightforward: when the data says no, the dashboard says no, the article says no, and no one tries to retroactively rescue the thesis. That’s the contribution.
FIG. 05 — WEEK THREE · WHAT THE METHODOLOGY WILL MEASURE
Four concrete measurements before publishing findings
The hope: write the week-three article from a position of “here’s what the data says”. The fear: another candidate falsified at higher sample. Both outcomes are publishable.
M1 · Sample discipline
Small watchlist for a few weeks before publishing
A handful of tickers across two or three sectors. Long enough to gather sample, narrow enough to keep attention on what’s actually happening per agent. Avoid the noise of a 65-ticker autonomous loop until the smaller version has been read carefully.
M2 · Calibration view
Stated confidence vs. realised win rate
When the system says “75% confident”, do the trades actually win 75% of the time? Same measurement applied to Polybot’s fair-value model. If the model is systematically over-confident, that bias dominates everything downstream.
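A calibration view of this kind can be built by bucketing trades on stated confidence and comparing each bucket's mean confidence with its realised win rate. The sketch below assumes each closed trade is available as a `(stated_confidence, won)` pair; that input shape, the function name, and the 10-point bucket width are assumptions, not the fork's schema.

```python
from collections import defaultdict


def calibration_table(trades, bucket_pct=10):
    """Bucket trades by stated confidence and compare with outcomes.

    trades: iterable of (stated_confidence in [0, 1], won: bool).
    Returns {bucket_lower_bound_pct: (n, mean_confidence, realised_win_rate)}.
    """
    buckets = defaultdict(list)
    for conf, won in trades:
        pct = round(conf * 100)  # integer percent avoids float bucket edges
        lower = min(pct // bucket_pct * bucket_pct, 100 - bucket_pct)
        buckets[lower].append((conf, bool(won)))
    return {
        lower: (
            len(rows),
            sum(c for c, _ in rows) / len(rows),
            sum(w for _, w in rows) / len(rows),
        )
        for lower, rows in sorted(buckets.items())
    }
```

A bucket whose mean stated confidence is 0.75 but whose realised win rate is 0.5 is the over-confidence pattern described above; with only a handful of trades per bucket the comparison is itself very noisy.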
M3 · Cost accounting
Cost per ticker · per rating · per profitable trade
With Codex OAuth the marginal LLM cost is effectively zero. With the public OpenAI API, each run is hundreds of agent turns. The honest question: does this scale economically if you ever did run it at real cost?
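The cost measurements reduce to a few aggregates over per-run records. The sketch below assumes each run is a dict with an `llm_cost` and a `pnl` that is `None` when no trade closed; those field names and the function name are hypothetical, standing in for whatever the dashboard joins by run ID.

```python
def cost_summary(runs):
    """Aggregate LLM cost against realised P&L.

    runs: iterable of dicts with 'llm_cost' (float) and 'pnl'
    (float, or None if the run produced no closed trade).
    """
    runs = list(runs)
    total_cost = sum(r["llm_cost"] for r in runs)
    closed = [r for r in runs if r["pnl"] is not None]
    winners = [r for r in closed if r["pnl"] > 0]
    gross_pnl = sum(r["pnl"] for r in closed)
    return {
        "total_cost": total_cost,
        "cost_per_run": total_cost / len(runs) if runs else 0.0,
        "cost_per_profitable_trade": (
            total_cost / len(winners) if winners else float("inf")
        ),
        "pnl_after_llm_cost": gross_pnl - total_cost,
    }
```

Dividing the total cost, including runs that produced no trade, by the number of winners is deliberate: every run has to be paid for, not just the ones that worked.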
M4 · Non-overlapping windows
Alpha vs benchmark · out-of-sample
Not within-sample alpha — trivially inflatable. Hold out one period entirely, run the system on the next, then check whether the held-out result matches the in-sample stats. If they diverge sharply, the in-sample was curve-fit.
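A minimal sketch of the non-overlapping comparison, under simplifying assumptions: returns are given as equal-length chronological lists of per-period returns, and "alpha" is taken as the mean excess return over the benchmark rather than a regression intercept. The function names are illustrative.

```python
def split_windows(returns, n_windows):
    """Split a chronological series into equal, non-overlapping windows.

    Any trailing remainder that does not fill a window is dropped.
    """
    size = len(returns) // n_windows
    return [returns[i * size:(i + 1) * size] for i in range(n_windows)]


def mean(xs):
    return sum(xs) / len(xs)


def alpha_by_window(strategy, benchmark, n_windows=2):
    """Mean per-period excess return over the benchmark in each window."""
    s = split_windows(strategy, n_windows)
    b = split_windows(benchmark, n_windows)
    return [mean(sw) - mean(bw) for sw, bw in zip(s, b)]


# Made-up returns: positive excess in the first window, negative in the second.
alphas = alpha_by_window([0.02, 0.02, 0.0, 0.0], [0.01, 0.01, 0.01, 0.01])
```

If the first window's figure is positive and the held-out window's is negative, as in this made-up series, that is the divergence pattern the measurement is designed to catch.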
Open under Apache-2.0 with upstream cited from every relevant surface. Not open: the operator’s running results, the specific watchlist, the per-agent prompt customisations, the alert channels, the trade journals — kept local for the same reason Polybot’s per-experiment data is kept local. Publishing exact configurations encourages people to copy them with real money, which is the opposite of what an honest research project should do. Summary findings will be published. Recipes will not.
The bet is on a different mechanism, not a different parameter setting. The point is not to find a money-printing AI. The point is to put honest measurements of these systems into the public record — so the next person looking at the space starts a step further along than the last.

Potential Impact on AI-Driven Trading Research

This development gives research on LLMs in financial decision-making an operational test bed. By running a multi-agent system through autonomous trading simulations, Forezai · TradingAgents makes it possible to evaluate systematically whether structured AI reasoning improves on simple heuristics or random choices. If the results hold up out of sample, the approach could shift emphasis in future AI trading research toward explainability and multi-perspective reasoning rather than raw prediction accuracy. It also offers a transparent environment for testing AI in finance without risking real capital.

Background on AI and Trading Strategy Testing

Previous research in AI-driven trading has often focused on backtested parametric strategies, which frequently fail to survive out-of-sample testing. The Polybot experiments showed the same pattern in paper trading: the one strategy with an apparent edge in week 1 collapsed once more data arrived in week 2. This has led to interest in alternative approaches, such as multi-agent systems and structured reasoning, to see whether they produce more robust decision-making.

The underlying research framework, TradingAgents, was originally designed to route market data through specialized LLM roles, encouraging explicit reasoning and debate rather than prediction. The recent operational version by Forezai extends this concept into a practical tool for ongoing testing, adding automation, multi-broker support, and user interfaces. This evolution reflects a broader trend of moving from theoretical AI models to operational research platforms capable of systematic evaluation in simulated trading environments.

“This system allows us to rigorously test whether AI, structured as a committee of specialized models, can make decisions that hold up over time in simulated markets. It’s about understanding the reasoning process, not just predicting prices.”

— Thorsten Meyer, project lead

Limitations and Unknowns of the Current System

It remains unclear whether the AI committee’s decisions will outperform random or heuristic strategies over extended periods and across different market conditions. The system currently operates in simulated environments with paper trades, and its effectiveness in live trading or with real capital has not been established. How biases among the specialized LLM roles affect decision quality is still under investigation. The long-term stability and scalability of the approach also require further testing.

Upcoming Testing and Research Directions

The next steps involve running extended experiments to evaluate the decision quality of the AI committee over longer timeframes and diverse market scenarios. Researchers plan to analyze the decision rationales, performance metrics, and failure modes to understand strengths and weaknesses. Further development will focus on refining the operational framework, integrating additional safeguards, and exploring variations in agent roles. Ultimately, the goal is to assess whether structured AI debate can meaningfully contribute to the development of more robust, explainable AI trading systems.

Key Questions

Can this system trade with real money?

No, the current setup is designed for paper trading in simulated environments. It includes safeguards to prevent unintentional real-money trading unless deliberately overridden.

Does the AI predict market movements?

No, the system does not aim to predict prices but to evaluate whether a structured AI committee can make decisions that are at least as reliable as random choices in simulated trading scenarios.

What are the main components of the AI decision process?

The committee works in four stages: four parallel analysts, a bull-bear debate arbitrated by a research manager, a three-voice risk team, and a trader plus portfolio manager who issue the final decision. Each stage argues and justifies its position from structured reports rather than raw data recall.

Is this system ready for live trading?

No, it is currently designed for research and testing in simulated environments. Transitioning to live trading would require significant additional safeguards and validation.

How does this research contribute to AI in finance?

It explores whether structured AI debate among specialized models can produce more reliable decision-making, potentially informing future development of explainable and robust AI trading systems.

Source: ThorstenMeyerAI.com
