AI traders "flop" in head-to-head contests: overtrading, chaotic strategies, and models that still struggle to understand the market
Artificial intelligence is knocking on the doors of Wall Street trading rooms, but so far its report card isn't looking good.
Early results from a series of public trading competitions show that mainstream large language models (LLMs) generally perform poorly in autonomous trading—most systems lose money, trade too frequently, and make radically different decisions when given the same prompts. These results raise a central question: just how deep is the gap between LLMs and real market operations?
The most representative case comes from the Alpha Arena competition operated by tech startup Nof1. The competition pits eight advanced AI systems, including Anthropic’s Claude, Google’s Gemini, OpenAI’s ChatGPT, and Elon Musk’s Grok, against each other in four independent rounds. Each system receives $10,000 before each round and autonomously trades U.S. tech stocks over two weeks. In aggregate, the portfolios lost about one-third of their value, and only 6 of the 32 results (eight systems across four rounds) showed a profit.
Nof1 founder Jay Azhang bluntly stated: "Handing money directly to an LLM and letting it trade on its own simply doesn’t work right now."
Competition Results: Losses, Overtrading, and Divergent Decisions
Alpha Arena's data reveals multiple shortcomings of current LLMs in trading scenarios. Given the same prompts, Alibaba's Qwen made 1,418 trades in one round, while the best-performing model, Grok 4.20, placed only 158. Grok’s best performance came in the round where it could observe competitors’ results.
The AI blog Flat Circle tracked 11 market-related arenas and found that at least one model was profitable in every arena, but the median model had a positive return in only two of them, suggesting that most models struggle to beat the market.
Differences in decision-making between models are also notable. According to Azhang, in Alpha Arena's latest test, Claude tended to go long, Gemini had no aversion to shorting, and Qwen enjoyed taking risks with high leverage. "They each have a 'personality'; managing them is almost like managing a human analyst," said Doug Clinton, head of Intelligent Alpha, which runs LLM-driven funds. He noted that informing models of their biases can improve results to some extent.
Capability Boundaries: LLMs Excel at Research, Struggle with Timing
Azhang said LLMs are strong at research and at using the right tools but have systematic weaknesses in trade execution: they do not know how much weight to give variables such as analyst ratings, insider trading, or shifts in sentiment that move stock prices, which makes them prone to mistimed trades, poorly sized positions, and overtrading.
Intelligent Alpha's benchmark tests offer a more encouraging data point. Ten AI models were given financial documents, analyst forecasts, earnings call transcripts, macroeconomic data, and web search access, and were asked to predict the direction of earnings. In Q4 2025, OpenAI’s ChatGPT achieved 68% accuracy in predicting the direction of earnings forecasts, the best result to date. Clinton said model performance generally improves with each new version release.
Methodological Dilemma: Backtesting Fails, Live Market Testing Is the Only Option
Assessing AI’s trading abilities faces a fundamental methodological challenge. Traditional quantitative strategies rely on historical backtesting, but this framework is nearly useless for LLMs—a model asked in 2026 how to trade March 2020 already "knows" how that history unfolded. This contamination, called "lookahead bias," forces researchers to rely on live market testing, leading to the proliferation of benchmarks and competitions.
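One way to picture the problem is as a simple date check: a test period is only usable if it starts after the model stopped learning. The following is a minimal sketch under that assumption, not a description of how any of the competitions above screen their data. The `TestWindow` class, the function names, and the training cutoff date are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TestWindow:
    """A candidate evaluation period (hypothetical helper type)."""
    label: str
    start: date
    end: date


def is_contaminated(window: TestWindow, training_cutoff: date) -> bool:
    """Return True if any part of the window falls on or before the
    model's training cutoff, meaning the model may already 'know'
    how that period played out."""
    return window.start <= training_cutoff


def clean_windows(windows: list[TestWindow], training_cutoff: date) -> list[TestWindow]:
    """Keep only windows that begin strictly after the training cutoff."""
    return [w for w in windows if not is_contaminated(w, training_cutoff)]


# Assumed cutoff, for illustration only; real cutoffs vary by model.
ASSUMED_CUTOFF = date(2025, 6, 30)

candidates = [
    TestWindow("March 2020 backtest", date(2020, 3, 1), date(2020, 3, 31)),
    TestWindow("live test", date(2026, 1, 5), date(2026, 1, 16)),
]

usable = clean_windows(candidates, ASSUMED_CUTOFF)
print([w.label for w in usable])
```

Under these assumed dates, the March 2020 backtest is rejected and only the live window survives, which is why evaluators end up running forward-looking tests instead of replaying history.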
Jim Moran, author of the Flat Circle blog and co-founder of YipitData, believes that most public experiments so far have run for too short a period and produced results too noisy to support definitive conclusions. These arenas also have inherent disadvantages, including no access to proprietary stock research and lower execution quality. "If you transplanted an AI agent from one of these arenas directly into a top hedge fund, it would likely perform better," he said.
Industry Outlook: Truly Effective Strategies May Fade from Public View
Alexander Izydorczyk, formerly head of data science at Coatue Management and now at NX1 Capital, recently wrote that none of the AI trading bots he tracks has shown a sustained ability to generate excess returns. He argues that the models in these arenas are limited because their training data lacks the practical quant techniques used by secretive trading firms.
Izydorczyk also offered an intriguing observation, however: "Sometimes beginners can see things that veterans can’t." He added on his blog, "When LLM agent trading strategies truly start to work, you won’t hear about it right away."
Nof1 is preparing for Alpha Arena season two and plans to give each AI model web search, longer thinking time, more data sources, and multi-step execution abilities. The company’s core business, however, is providing retail traders with tools to build AI trading agents, not putting AI directly on the trading desk. That positioning may itself be the most pragmatic commentary on the current state of AI trading.