How we taught a computer to read charts
A field report from sixteen experiments, three reboots, and one model that actually works.
tl;dr
We trained twelve neural networks to rank ~3,800 stocks by how their next 30 to 120 days are likely to play out. On 2.58 million held-out historical signals, the average model picks the better of two stocks ~62% of the time. The strongest model — predicting which stocks will have the least downside risk over the next 60 trading days — gets it right 64.4% of the time.
More usefully: when the model says it's confident, it's right 70 to 81% of the time. When it says it doesn't know, its accuracy sits at coin-flip levels in the middle deciles, which is exactly where it should sit. That second part, the knowing what it doesn't know, is the part we're actually proud of.
What follows is how we got there, including the parts where the model collapsed to predicting the population mean for a month straight, where the GPU caught fire (figuratively, we hope), and where the entire training stack rebooted the laptop three times in one day. Engineering blog as comedy. You're welcome.
The question we kept asking
Trend-line analysis has a reputation problem. Half the internet treats horizontal lines on price charts as the gospel of risk; the other half treats them as horoscopes for the financially curious. Both factions have strong opinions and weak data.
We started with a statistical lattice to put the question on numerical footing. The depressing first answer: active supports break 76% of the time within 30 trading days. Resistance, 77%. If your trade is "the line holds", you are wrong three times out of four. The lattice can refine that with geometric features (line age, gradient, gaps), but even the historically safest cells still break roughly 55% of the time, so the hold probability never clears 50/50. So: as a binary "this line holds" trade, even the best of the lattice loses.
But the binary frame wasn't necessarily the right frame. Maybe a model that ranks stocks against each other, instead of trying to predict any individual line's fate, can find structure the lattice can't. We thought we'd find out. Take a strict geometric trend-line definition (two confirmed swing points, a greedy envelope scan, no hand-drawing, no vibes), apply it to ~3,800 stocks across decades, score every candidate feature, ask the only question that matters: does any of this carry signal?
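For concreteness, here is a minimal sketch of what that definition boils down to for a support line, assuming daily lows and a fixed confirmation window; the function names, thresholds, and toy price series are ours for illustration and not the production analyser.

```python
import numpy as np

def swing_lows(lows: np.ndarray, confirm: int = 3) -> list[int]:
    """Indices that are the lowest point within `confirm` bars on either side."""
    pivots = []
    for i in range(confirm, len(lows) - confirm):
        if lows[i] == lows[i - confirm:i + confirm + 1].min():
            pivots.append(i)
    return pivots

def support_line(lows: np.ndarray, i: int, j: int):
    """Fit a line through two confirmed swing lows; keep it only if no bar
    between the anchors dips below it (the greedy envelope scan)."""
    slope = (lows[j] - lows[i]) / (j - i)
    intercept = lows[i] - slope * i
    k = np.arange(i, j + 1)
    if np.any(lows[i:j + 1] < slope * k + intercept - 1e-9):
        return None
    return slope, intercept

# Scan every pair of confirmed swing lows for lines whose envelope holds.
lows = np.array([10.0, 9.5, 9.8, 9.2, 9.6, 9.9, 9.4, 10.1, 10.5, 10.0, 10.8])
pivots = swing_lows(lows, confirm=2)
lines = []
for a, i in enumerate(pivots):
    for j in pivots[a + 1:]:
        fit = support_line(lows, i, j)
        if fit is not None:
            lines.append((i, j, fit))
print(lines)   # e.g. one valid support through the swing lows at bars 3 and 6
```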
Spoiler from sixteen experiments: yes, some, but only after we stopped pretending the problem was easy.
Experiments 1 through 6: the great regression collapse
We started with the obvious thing — predict the percentage return at 30, 60, 90, and 120 days from a tabular feature vector. LightGBM, thirty-three features, one model per horizon. It got 53% directional accuracy. A coin flip wears that as a Halloween costume.
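For reference, the shape of that baseline: one LightGBM regressor per horizon. The features and targets below are synthetic placeholders, not the actual thirty-three features.

```python
import lightgbm as lgb
import numpy as np

HORIZONS = [30, 60, 90, 120]              # trading days ahead
models = {}

# X: (n_samples, 33) tabular features per signal; y[h]: forward % return at horizon h.
# Both are placeholders here; the real feature set is the analyser's geometric one.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 33))
y = {h: rng.normal(size=10_000) for h in HORIZONS}

for h in HORIZONS:
    models[h] = lgb.LGBMRegressor(
        n_estimators=500,
        learning_rate=0.05,
        num_leaves=31,
    ).fit(X, y[h])

# Directional accuracy: does the predicted sign match the realised sign?
# (Computed on the training split here for brevity; use a holdout in practice.)
for h, m in models.items():
    acc = np.mean(np.sign(m.predict(X)) == np.sign(y[h]))
    print(f"{h}d directional accuracy: {acc:.1%}")
```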
So we went to convnets. The first was a small dilated CNN, 76,000 parameters. It managed 55–57%, swinging ±10pp between batches like a drunk pendulum. We grew it to 1.76 million parameters with attention pooling and forty-two regression outputs across multiple horizons. This was, in retrospect, an act of architectural arson.
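A rough PyTorch sketch of what the bigger model looked like in spirit: a dilated Conv1d stack, attention pooling over time, and a wide regression head. The layer sizes are illustrative and won't land on exactly 1.76 million parameters.

```python
import torch
import torch.nn as nn

class DilatedRegressor(nn.Module):
    """Dilated 1-D convolutions over the feature sequence, attention pooling
    over time, then a multi-output regression head (one output per target)."""
    def __init__(self, in_ch: int = 8, hidden: int = 128, n_outputs: int = 42):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(in_ch if i == 0 else hidden, hidden,
                          kernel_size=3, dilation=2 ** i, padding=2 ** i),
                nn.GELU(),
            )
            for i in range(4)                       # dilations 1, 2, 4, 8
        ])
        self.attn = nn.Linear(hidden, 1)            # attention weights over time
        self.head = nn.Linear(hidden, n_outputs)

    def forward(self, x):                           # x: (batch, channels, time)
        h = self.convs(x).transpose(1, 2)           # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)      # (batch, time, 1)
        pooled = (w * h).sum(dim=1)                 # (batch, hidden)
        return self.head(pooled)

out = DilatedRegressor()(torch.randn(4, 8, 256))    # -> (4, 42)
```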
The 1.76M model produced what we now refer to as the great collapse: every prediction, for every ticker, on every day, was the same number. The model had discovered that the loss function was minimised by simply predicting the population mean and refusing to engage further. We checked, hopefully: maybe it was predicting centered means. It was not. It was predicting one constant. For everyone. Forever.
We tried Mamba. Mamba is the architecture du jour for long sequences. Mamba runs on CUDA. We had a Mac and a 3070 with 8 GB of VRAM. The 3070 swap-thrashed itself into a coma in roughly an hour and twenty minutes per batch of twenty tickers. The Mac, lacking CUDA kernels, ran the JIT scan loop at 0.3 samples per second, which is roughly the throughput of a child reading aloud. We pivoted.
Experiment 7: pairwise ranking, or, "stop trying to predict the number"
The collapse had a clear cause. With independent per-sample loss, the model can always game it by predicting one constant — whatever the population's least-bad guess turns out to be. Every regression loss we tried — Huber, Pinball, Asymmetric Gaussian — ended at the same depressing constant, because all of them rewarded that behaviour.
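A tiny numpy illustration of the mechanism, under the simplifying assumption that the learnable part of a forward return is small next to the fat-tailed part that isn't:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
learnable = rng.normal(size=n) * 0.01           # the sliver of return a model could learn
noise = rng.standard_t(df=3, size=n) * 0.05     # the fat-tailed part it cannot
returns = learnable + noise

mse_constant = np.mean((returns - returns.mean()) ** 2)      # predict the population mean everywhere
mse_oracle = np.mean((returns - learnable) ** 2)             # predict the learnable part perfectly

print(f"constant MSE: {mse_constant:.6f}")
print(f"oracle MSE:   {mse_oracle:.6f}")
# The two numbers are nearly identical, so a per-sample regression loss gives the
# network almost no gradient incentive to leave the constant once it finds it.
```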
The fix was to stop asking "what number?" and start asking "which one is bigger?" Pairwise margin loss compares two samples at a time and tells the model: this one was higher than that one, you should rank them in that order. The optimal constant strategy stops working, because a constant score ranks nothing: every pair comes out tied, and the margin loss penalises every tie. The model is forced to differentiate.
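A minimal sketch of the pairwise setup using PyTorch's stock MarginRankingLoss; the scorer, the margin, and the batch plumbing are illustrative, and the production loss differs in detail:

```python
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(33, 64), nn.GELU(), nn.Linear(64, 1))
loss_fn = nn.MarginRankingLoss(margin=0.1)
opt = torch.optim.Adam(scorer.parameters(), lr=3e-4)

def pairwise_step(feats_a, feats_b, outcome_a, outcome_b):
    """feats_*: (batch, 33) feature vectors for two random signals;
    outcome_*: (batch,) realised outcomes. The loss only cares which was bigger."""
    score_a = scorer(feats_a).squeeze(-1)
    score_b = scorer(feats_b).squeeze(-1)
    y = (outcome_a > outcome_b).float() * 2 - 1     # +1 if a should outrank b, else -1
    loss = loss_fn(score_a, score_b, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# One toy step on random data:
b = 256
pairwise_step(torch.randn(b, 33), torch.randn(b, 33), torch.randn(b), torch.randn(b))

# A constant scorer gives score_a == score_b on every pair, so every pair pays
# the full margin: the collapse strategy is no longer anywhere near a minimum.
```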
This was the moment things started moving. Not working, exactly. But moving.
Experiments 12 through 14: the data was wrong
We trained v1 on 6.64 million signals. It got promising results. Then we found the bugs. Bug one: pass-2 trend lines (the finer-grained ones the analyser produces as a second pass) were duplicated in the training set, so popular lines voted twice. Bug two: there was no liquidity filter, so penny stocks with hundred-bar histories were leaking in. Bug three: the proximity threshold for "near the price" was set so wide that lines nowhere near the price were still being scored.
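The fixes themselves were unglamorous filters. A hedged pandas sketch of the kind of thing involved; the column names and thresholds are placeholders, not the analyser's actual schema:

```python
import pandas as pd

def clean_signals(signals: pd.DataFrame) -> pd.DataFrame:
    """Illustrative versions of the three v2 fixes."""
    # Bug 1: pass-2 lines duplicated pass-1 ones; keep one row per line per day.
    signals = signals.drop_duplicates(subset=["ticker", "date", "line_id"])

    # Bug 2: no liquidity filter; drop thin names and short histories.
    signals = signals[(signals["avg_dollar_volume"] > 1e6)
                      & (signals["history_bars"] >= 500)]

    # Bug 3: proximity threshold too wide; only score lines actually near the price.
    near = (signals["price"] - signals["line_price"]).abs() / signals["price"]
    return signals[near < 0.05]
```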
We fixed the bugs. Re-generated the index. It dropped from 6.64 million signals to 2.58 million. We retrained from scratch, the v2 generation, and it was worse.
This was disorienting until we realised the v1 numbers had been measured against the bugged holdout. v1 looked good against itself. Once both models were scored on the same clean data, via a script cheekily named compare_old_vs_new.py, v2 was, in fact, better. The lesson: when you change the data, the old benchmarks are not your friends. Re-score, always.
v3 added a continued-training pass at one-third the learning rate. v4 fixed a few more analyser ordering bugs. By v4 we had something worth keeping, but only one strong horizon — the model was learning magnitude, but the horizons that paid most weren't the ones that ranked best.
Experiment 15: two heads are better than one
v5 was the architectural pivot. Rather than asking one model "what is the magnitude of this stock's next 30 days?", we split it in two:
- Volatility models (4 of them) ask "how big a spike up or down might happen?" Capped at 60 days, because anything longer is too noisy to count as a spike.
- Hold models (8 of them) ask "if the trend line keeps working, how much area accumulates between the line and the price, and in which direction?" 30, 60, 90, and 120 days, in both directions.
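To make those two questions concrete, here is a minimal sketch of the quantities being ranked, assuming daily closes and a line parameterised by slope and intercept; the real label generation lives in the analyser and differs in detail:

```python
import numpy as np

def hold_label(closes: np.ndarray, slope: float, intercept: float,
               start: int, horizon: int) -> float:
    """Signed area between price and the trend line over the next `horizon` bars.
    Positive = price spent its time above the line, negative = below it."""
    end = min(start + horizon, len(closes))
    t = np.arange(start, end)
    line = slope * t + intercept
    return float(np.sum(closes[start:end] - line))

def vol_label(closes: np.ndarray, start: int, horizon: int) -> tuple[float, float]:
    """Largest up-spike and down-spike relative to the entry price over the window."""
    end = min(start + horizon, len(closes))
    entry = closes[start]
    window = closes[start:end]
    return float(window.max() / entry - 1), float(window.min() / entry - 1)
```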
Twelve models. Same architecture, same loss, just twelve different questions. This was the version we shipped first. The full-market decile spread on 2.58M held-out signals confirmed something we'd been hoping was true but had never quite proven: the rank order carries directional information. Top-decile picks went net up 66 to 74% of the time, depending on horizon. The bottom decile sat at 43%, give or take, regardless of horizon. The model's "buy" signal sharpens with time; the "avoid" signal stays steady.
Experiment 16: slow and steady, for 62 hours
The v5-slow continuation was straightforward in concept and hostile in execution. Take v5, drop the learning rate from 3e-4 to 1e-4, and let it stew. Twelve labels, one Python process holding all twelve model+optimiser pairs in MPS memory simultaneously, sharing a held-out evaluation set, with a background fetcher feeding the GPU at 95% utilisation.
This worked on the third try. The first two attempts used a multi-process consumer architecture that opened twelve copies of the holdout in twelve different RAM regions and rebooted the laptop three times in one day. We added a memory watchdog that SIGKILLs the trainer at 80 GB, which, mercifully, never had to fire during the eventual successful run. Memory flat-lined at 63 GB for 62 hours.
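The watchdog itself is small. A sketch of the idea using psutil; the 80 GB ceiling is the real one, while the polling cadence and process handling are illustrative:

```python
import os
import signal
import time

import psutil

LIMIT_BYTES = 80 * 1024 ** 3          # the 80 GB ceiling from the text

def watchdog(trainer_pid: int, poll_seconds: float = 5.0) -> None:
    """Poll the trainer's resident memory and SIGKILL it if it crosses the ceiling.
    Better a dead training run than a fourth laptop reboot."""
    proc = psutil.Process(trainer_pid)
    while proc.is_running():
        if proc.memory_info().rss > LIMIT_BYTES:
            os.kill(trainer_pid, signal.SIGKILL)
            return
        time.sleep(poll_seconds)
```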
After 62 hours, every label's holdout Spearman had improved. Apples-to-apples — v5's checkpoint re-evaluated against the same v5-slow holdout — the average lift was +0.087 Spearman across all twelve labels. Eight of the twelve models swapped their saved best for a later checkpoint when the post-training decile-best scan ran. We shipped the result.
The headline finding: the model knows when it knows
If you take only one chart away from this post, take this one. Bin the model's predictions into deciles by score (D1 is the worst-ranked tenth of the universe, D10 is the best), then count how often each bucket is on the side of the population median that the score said it should be on.
| Model | D1 right | D5 right | D10 right |
|---|---|---|---|
| vol_min_bot2_60d | 80.8% | 52.6% | 80.0% |
| vol_max_top2_60d | 77.7% | 51.1% | 74.4% |
| hold_down_60d | 76.8% | 52.2% | 77.2% |
| hold_down_120d | 75.9% | 52.6% | 78.4% |
| hold_up_120d | 74.6% | 51.6% | 74.1% |
| hold_up_30d | 71.2% | 51.4% | 69.6% |
(Six representative models out of twelve. The other six follow the same shape.)
The model's worst-ranked decile (D1) — the picks it explicitly flags as below-median — is correctly below median 71 to 81% of the time. Its best-ranked decile (D10) is correctly above median 69 to 80% of the time. The middle deciles drop to ~51%, almost exactly chance. This is what good calibration looks like: confident at the extremes, ignorant in the middle, and honest about both.
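If you want to reproduce the shape yourself, the computation is a few lines of pandas over a frame of model scores and realised outcomes; the column names here are illustrative rather than the exact CSV schema:

```python
import pandas as pd

def decile_accuracy(df: pd.DataFrame) -> pd.Series:
    """`df` needs a model `score` and a realised `outcome` column per signal.
    Returns, per score decile (1 = worst, 10 = best), how often the signal landed
    on the side of the population median its decile implies."""
    median = df["outcome"].median()
    codes = pd.qcut(df["score"], 10, labels=False)   # 0..9, worst to best
    should_be_above = codes >= 5                     # D6-D10 claim above-median
    is_above = df["outcome"] > median
    return (should_be_above == is_above).groupby(codes + 1).mean()
```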
The U-shape is the meaningful part. A model whose accuracy was flat at 62% across all deciles would be no more useful than its average accuracy: every prediction would be a 62% guess. A model whose accuracy concentrates at the extremes is dramatically more useful, because it tells you when to listen and when to ignore it. The top decile here is a different beast from the average prediction. It's the part of the rank where the signal is strongest and, for what it's worth, the part the home page surfaces as Top 10%, Top 5%, and Top 1% badges.
If you want one number anyway
Across 2,584,905 held-out historical signals per label, scored after deployment:
| Model | Pair-acc | Above-median | Spearman ρ | Top-10% lift |
|---|---|---|---|---|
| vol_min_bot2_60d | 64.4% | 65.1% | +0.42 | 6× less downside |
| vol_min_bot1_30d | 63.2% | 63.9% | +0.38 | 5.7× less downside |
| hold_down_90d | 63.0% | 63.2% | +0.38 | +8.1× resistance break |
| hold_down_60d | 62.8% | 63.0% | +0.37 | +8.4× resistance break |
| hold_down_120d | 62.8% | 63.0% | +0.37 | +7.6× resistance break |
| vol_max_top2_60d | 61.8% | 62.2% | +0.34 | +22.9% peak |
| hold_down_30d | 61.6% | 61.7% | +0.33 | +6.9× resistance break |
| hold_up_120d | 61.2% | 61.4% | +0.33 | +4.7× support hold |
| vol_max_top1_30d | 61.1% | 61.6% | +0.32 | +15.5% peak |
| hold_up_90d | 60.7% | 60.9% | +0.31 | +4.7× support hold |
| hold_up_60d | 60.5% | 60.6% | +0.31 | +5.0× support hold |
| hold_up_30d | 59.4% | 59.4% | +0.27 | +4.5× support hold |
| mean | 61.9% | 62.2% | +0.35 | — |
Pair accuracy is the probability that the model, shown two random stocks at two random points in history, correctly says which one performed better on its metric. 50% is a coin flip. 61.9% is not a coin flip. But it averages over the deciles, and as the table above shows, the deciles are not the same. Use the deciles.
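Pair accuracy is also cheap to estimate by sampling. A sketch, assuming the same score-and-outcome frame as the decile snippet above:

```python
import numpy as np
import pandas as pd

def pair_accuracy(df: pd.DataFrame, n_pairs: int = 1_000_000, seed: int = 0) -> float:
    """Probability that the higher-scored of two random signals also had
    the higher realised outcome (ties on either axis are ignored)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(df), n_pairs)
    j = rng.integers(0, len(df), n_pairs)
    s = df["score"].to_numpy()
    o = df["outcome"].to_numpy()
    valid = (s[i] != s[j]) & (o[i] != o[j])
    agree = (s[i] > s[j]) == (o[i] > o[j])
    return float(agree[valid].mean())
```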
What we didn't solve
We trained models that rank stocks. We did not train models that trade them. The decile spreads above are accuracy claims, not return claims. Whether you can extract realised return from this rank ordering, after costs and slippage and the small matter of execution in a market that knows you exist, is a different question. We are working on it. There will be a different blog post. It may include more dignified failures.
Two specific things we're explicitly not claiming:
- The model has alpha. Alpha is risk-adjusted excess return. We've shown a ranking edge. Whether that edge survives the bid-ask spread, transaction costs, and the elemental cruelty of having to actually own things is an empirical question for the backtest rewrite (currently task #59 on our board, in case you want to follow along).
- The model is right about any specific stock you're looking at right now. It's right about the average of two thousand stocks over thousands of trading days. Individual predictions vary, sometimes wildly. The U-shape above is a statistical claim, not a personal promise.
Open questions and v6
The two long-horizon hold-down models (90d and 120d) regressed relative to v5 — the only two of twelve that did. Continued training past a certain point caught the long-tail noise rather than the signal. We didn't ship them. We're still trying to work out whether the regression is overfitting (in which case stop training), or underfitting (in which case feed more data), or feature-bound (in which case rebuild the input). The next generation, v6, will re-investigate proximity-gap search, exchange features, and a structured audit of which historical signals are actually useful for training versus actively confusing the model.
We also have an audit task open for cases like HAS.L on 2021-06-14, where the analyser is showing both the original trend line and a newer one that better describes recent action, and we suspect we may be feeding the wrong one to training. Stay tuned.
Coming next
The training-curve overlay (Spearman by step, twelve labels, with the v5 baseline) and the per-model U-curve charts will get a dedicated page once they're rendered. They are, frankly, more visually striking than the tables above. If you'd like the raw data to plot yourself, the full per-decile breakdown lives in training/full_market_report.py, fed by the per-label winner_scores_full_market.csv outputs.