How we taught a computer to read charts
A field report from sixteen experiments, three reboots, and one model that actually works.
tl;dr
We trained twelve neural networks to rank ~3,800 stocks by how their next 30 to 120 days are likely to play out. On 2.58 million held-out historical signals, the average model picks the better of two stocks ~62% of the time. The strongest model — predicting which stocks will have the least downside risk over the next 60 trading days — gets it right 64.4% of the time.
More usefully: when the model says it's confident, it's right 70 to 81% of the time. When it says it doesn't know, its accuracy sits at coin-flip levels in the middle deciles, which is exactly where it should sit. That second part, the knowing what it doesn't know, is the part we're actually proud of.
What follows is how we got there, including the parts where the model collapsed to predicting the population mean for a month straight, where the GPU caught fire (figuratively, we hope), and where the entire training stack rebooted the laptop three times in one day. Engineering blog as comedy. You're welcome.
The question we kept asking
Trend-line analysis has a reputation problem. Half the internet treats horizontal lines on price charts as the gospel of risk; the other half treats them as horoscopes for the financially curious. Both factions have strong opinions and weak data.
We started with a statistical lattice to put the question on numerical footing. The depressing first answer: active supports break 76% of the time within 30 trading days. Resistance, 77%. If your trade is "the line holds", you are wrong three times out of four. The lattice can refine that with geometric features (line age, gradient, gaps), but even the historically safest cells still break roughly 55% of the time, so the hold probability never clears 50/50. So: as a binary "this line holds" trade, even the best of the lattice loses.
But the binary frame wasn't necessarily the right frame. Maybe a model that ranks stocks against each other, instead of trying to predict any individual line's fate, can find structure the lattice can't. We thought we'd find out. Take a strict geometric trend-line definition (two confirmed swing points, a greedy envelope scan, no hand-drawing, no vibes), apply it to ~3,800 stocks across decades, score every candidate feature, ask the only question that matters: does any of this carry signal?
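For concreteness, here is a minimal sketch of what that definition boils down to for a support line, assuming daily lows and a fixed confirmation window; the function names, thresholds, and toy price series are ours for illustration and not the production analyser.

```python
import numpy as np

def swing_lows(lows: np.ndarray, confirm: int = 3) -> list[int]:
    """Indices that are the lowest point within `confirm` bars on either side."""
    pivots = []
    for i in range(confirm, len(lows) - confirm):
        if lows[i] == lows[i - confirm:i + confirm + 1].min():
            pivots.append(i)
    return pivots

def support_line(lows: np.ndarray, i: int, j: int):
    """Fit a line through two confirmed swing lows; keep it only if no bar
    between the anchors dips below it (the greedy envelope scan)."""
    slope = (lows[j] - lows[i]) / (j - i)
    intercept = lows[i] - slope * i
    k = np.arange(i, j + 1)
    if np.any(lows[i:j + 1] < slope * k + intercept - 1e-9):
        return None
    return slope, intercept

# Scan every pair of confirmed swing lows for lines whose envelope holds.
lows = np.array([10.0, 9.5, 9.8, 9.2, 9.6, 9.9, 9.4, 10.1, 10.5, 10.0, 10.8])
pivots = swing_lows(lows, confirm=2)
lines = []
for a, i in enumerate(pivots):
    for j in pivots[a + 1:]:
        fit = support_line(lows, i, j)
        if fit is not None:
            lines.append((i, j, fit))
print(lines)   # e.g. one valid support through the swing lows at bars 3 and 6
```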
Spoiler from sixteen experiments: yes, some, but only after we stopped pretending the problem was easy.
Experiments 1 through 6: the great regression collapse
We started with the obvious thing — predict the percentage return at 30, 60, 90, and 120 days from a tabular feature vector. LightGBM, thirty-three features, one model per horizon. It got 53% directional accuracy. A coin flip wears that as a Halloween costume.
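For reference, the shape of that baseline: one LightGBM regressor per horizon. The features and targets below are synthetic placeholders, not the actual thirty-three features.

```python
import lightgbm as lgb
import numpy as np

HORIZONS = [30, 60, 90, 120]              # trading days ahead
models = {}

# X: (n_samples, 33) tabular features per signal; y[h]: forward % return at horizon h.
# Both are placeholders here; the real feature set is the analyser's geometric one.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 33))
y = {h: rng.normal(size=10_000) for h in HORIZONS}

for h in HORIZONS:
    models[h] = lgb.LGBMRegressor(
        n_estimators=500,
        learning_rate=0.05,
        num_leaves=31,
    ).fit(X, y[h])

# Directional accuracy: does the predicted sign match the realised sign?
# (Computed on the training split here for brevity; use a holdout in practice.)
for h, m in models.items():
    acc = np.mean(np.sign(m.predict(X)) == np.sign(y[h]))
    print(f"{h}d directional accuracy: {acc:.1%}")
```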
So we went to convnets. The first was a small dilated CNN, 76,000 parameters. It managed 55–57%, swinging ±10pp between batches like a drunk pendulum. We grew it to 1.76 million parameters with attention pooling and forty-two regression outputs across multiple horizons. This was, in retrospect, an act of architectural arson.
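A rough PyTorch sketch of what the bigger model looked like in spirit: a dilated Conv1d stack, attention pooling over time, and a wide regression head. The layer sizes are illustrative and won't land on exactly 1.76 million parameters.

```python
import torch
import torch.nn as nn

class DilatedRegressor(nn.Module):
    """Dilated 1-D convolutions over the feature sequence, attention pooling
    over time, then a multi-output regression head (one output per target)."""
    def __init__(self, in_ch: int = 8, hidden: int = 128, n_outputs: int = 42):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(in_ch if i == 0 else hidden, hidden,
                          kernel_size=3, dilation=2 ** i, padding=2 ** i),
                nn.GELU(),
            )
            for i in range(4)                       # dilations 1, 2, 4, 8
        ])
        self.attn = nn.Linear(hidden, 1)            # attention weights over time
        self.head = nn.Linear(hidden, n_outputs)

    def forward(self, x):                           # x: (batch, channels, time)
        h = self.convs(x).transpose(1, 2)           # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)      # (batch, time, 1)
        pooled = (w * h).sum(dim=1)                 # (batch, hidden)
        return self.head(pooled)

out = DilatedRegressor()(torch.randn(4, 8, 256))    # -> (4, 42)
```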
The 1.76M model produced what we now refer to as the great collapse: every prediction, for every ticker, on every day, was the same number. The model had discovered that the loss function was minimised by simply predicting the population mean and refusing to engage further. We checked, hopefully: maybe it was predicting centered means. It was not. It was predicting one constant. For everyone. Forever.
We tried Mamba. Mamba is the architecture du jour for long sequences. Mamba runs on CUDA. We had a Mac and a 3070 with 8 GB of VRAM. The 3070 swap-thrashed itself into a coma in roughly an hour and twenty minutes per batch of twenty tickers. The Mac, lacking CUDA kernels, ran the JIT scan loop at 0.3 samples per second, which is roughly the throughput of a child reading aloud. We pivoted.
Experiment 7: pairwise ranking, or, "stop trying to predict the number"
The collapse had a clear cause. With independent per-sample loss, the model can always game it by predicting one constant — whatever the population's least-bad guess turns out to be. Every regression loss we tried — Huber, Pinball, Asymmetric Gaussian — ended at the same depressing constant, because all of them rewarded that behaviour.
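A tiny numpy illustration of the mechanism, under the simplifying assumption that the learnable part of a forward return is small next to the fat-tailed part that isn't:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
learnable = rng.normal(size=n) * 0.01           # the sliver of return a model could learn
noise = rng.standard_t(df=3, size=n) * 0.05     # the fat-tailed part it cannot
returns = learnable + noise

mse_constant = np.mean((returns - returns.mean()) ** 2)      # predict the population mean everywhere
mse_oracle = np.mean((returns - learnable) ** 2)             # predict the learnable part perfectly

print(f"constant MSE: {mse_constant:.6f}")
print(f"oracle MSE:   {mse_oracle:.6f}")
# The two numbers are nearly identical, so a per-sample regression loss gives the
# network almost no gradient incentive to leave the constant once it finds it.
```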
The fix was to stop asking "what number?" and start asking "which one is bigger?" Pairwise margin loss compares two samples at a time and tells the model: this one was higher than that one, you should rank them in that order. The optimal constant strategy stops working, because a constant score ranks nothing: every pair comes out tied, and the margin loss penalises every tie. The model is forced to differentiate.
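A minimal sketch of the pairwise setup using PyTorch's stock MarginRankingLoss; the scorer, the margin, and the batch plumbing are illustrative, and the production loss differs in detail:

```python
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(33, 64), nn.GELU(), nn.Linear(64, 1))
loss_fn = nn.MarginRankingLoss(margin=0.1)
opt = torch.optim.Adam(scorer.parameters(), lr=3e-4)

def pairwise_step(feats_a, feats_b, outcome_a, outcome_b):
    """feats_*: (batch, 33) feature vectors for two random signals;
    outcome_*: (batch,) realised outcomes. The loss only cares which was bigger."""
    score_a = scorer(feats_a).squeeze(-1)
    score_b = scorer(feats_b).squeeze(-1)
    y = (outcome_a > outcome_b).float() * 2 - 1     # +1 if a should outrank b, else -1
    loss = loss_fn(score_a, score_b, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# One toy step on random data:
b = 256
pairwise_step(torch.randn(b, 33), torch.randn(b, 33), torch.randn(b), torch.randn(b))

# A constant scorer gives score_a == score_b on every pair, so every pair pays
# the full margin: the collapse strategy is no longer anywhere near a minimum.
```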
This was the moment things started moving. Not working, exactly. But moving.
Experiments 12 through 14: the data was wrong
We trained v1 on 6.64 million signals. It got promising results. Then we found the bugs. Bug one: pass-2 trend lines (the finer-grained ones the analyser produces as a second pass) were duplicated in the training set, so popular lines voted twice. Bug two: there was no liquidity filter, so penny stocks with hundred-bar histories were leaking in. Bug three: the proximity threshold for "near the price" was set so wide that lines nowhere near the price were still being scored.
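The fixes themselves were unglamorous filters. A hedged pandas sketch of the kind of thing involved; the column names and thresholds are placeholders, not the analyser's actual schema:

```python
import pandas as pd

def clean_signals(signals: pd.DataFrame) -> pd.DataFrame:
    """Illustrative versions of the three v2 fixes."""
    # Bug 1: pass-2 lines duplicated pass-1 ones; keep one row per line per day.
    signals = signals.drop_duplicates(subset=["ticker", "date", "line_id"])

    # Bug 2: no liquidity filter; drop thin names and short histories.
    signals = signals[(signals["avg_dollar_volume"] > 1e6)
                      & (signals["history_bars"] >= 500)]

    # Bug 3: proximity threshold too wide; only score lines actually near the price.
    near = (signals["price"] - signals["line_price"]).abs() / signals["price"]
    return signals[near < 0.05]
```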
We fixed the bugs. Re-generated the index. It dropped from 6.64 million signals to 2.58 million. We retrained from scratch, the v2 generation, and it was worse.
This was disorienting until we realised the v1 numbers had been measured against the bugged holdout. v1 looked good against itself. Once both models were scored on the same clean data, via a script cheekily named compare_old_vs_new.py, v2 was, in fact, better. The lesson: when you change the data, the old benchmarks are not your friends. Re-score, always.
v3 added a continued-training pass at one-third the learning rate. v4 fixed a few more analyser ordering bugs. By v4 we had something worth keeping, but only one strong horizon — the model was learning magnitude, but the horizons that paid most weren't the ones that ranked best.
Experiment 15: two heads are better than one
v5 was the architectural pivot. Rather than asking one model "what is the magnitude of this stock's next 30 days?", we split it in two:
- Volatility models (4 of them) ask "how big a spike up or down might happen?" Capped at 60 days, because anything longer is too noisy to count as a spike.
- Hold models (8 of them) ask "if the trend line keeps working, how much area accumulates between the line and the price, and in which direction?" 30, 60, 90, and 120 days, in both directions.
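To make those two questions concrete, here is a minimal sketch of the quantities being ranked, assuming daily closes and a line parameterised by slope and intercept; the real label generation lives in the analyser and differs in detail:

```python
import numpy as np

def hold_label(closes: np.ndarray, slope: float, intercept: float,
               start: int, horizon: int) -> float:
    """Signed area between price and the trend line over the next `horizon` bars.
    Positive = price spent its time above the line, negative = below it."""
    end = min(start + horizon, len(closes))
    t = np.arange(start, end)
    line = slope * t + intercept
    return float(np.sum(closes[start:end] - line))

def vol_label(closes: np.ndarray, start: int, horizon: int) -> tuple[float, float]:
    """Largest up-spike and down-spike relative to the entry price over the window."""
    end = min(start + horizon, len(closes))
    entry = closes[start]
    window = closes[start:end]
    return float(window.max() / entry - 1), float(window.min() / entry - 1)
```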
Twelve models. Same architecture, same loss, just twelve different questions. This was the version we shipped first. The full-market decile spread on 2.58M held-out signals confirmed something we'd been hoping was true but had never quite proven: the rank order carries directional information. Top-decile picks went net up 66 to 74% of the time, depending on horizon. The bottom decile sat at 43%, give or take, regardless of horizon. The model's "buy" signal sharpens with time; the "avoid" signal stays steady.
Experiment 16: slow and steady, for 62 hours
The v5-slow continuation was straightforward in concept and hostile in execution. Take v5, drop the learning rate from 3e-4 to 1e-4, and let it stew. Twelve labels, one Python process holding all twelve model+optimiser pairs in MPS memory simultaneously, sharing a held-out evaluation set, with a background fetcher feeding the GPU at 95% utilisation.
This worked on the third try. The first two attempts used a multi-process consumer architecture that opened twelve copies of the holdout in twelve different RAM regions and rebooted the laptop three times in one day. We added a memory watchdog that SIGKILLs the trainer at 80 GB, which, mercifully, never had to fire during the eventual successful run. Memory flat-lined at 63 GB for 62 hours.
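The watchdog itself is small. A sketch of the idea using psutil; the 80 GB ceiling is the real one, while the polling cadence and process handling are illustrative:

```python
import os
import signal
import time

import psutil

LIMIT_BYTES = 80 * 1024 ** 3          # the 80 GB ceiling from the text

def watchdog(trainer_pid: int, poll_seconds: float = 5.0) -> None:
    """Poll the trainer's resident memory and SIGKILL it if it crosses the ceiling.
    Better a dead training run than a fourth laptop reboot."""
    proc = psutil.Process(trainer_pid)
    while proc.is_running():
        if proc.memory_info().rss > LIMIT_BYTES:
            os.kill(trainer_pid, signal.SIGKILL)
            return
        time.sleep(poll_seconds)
```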
After 62 hours, every label's holdout Spearman had improved. Apples-to-apples — v5's checkpoint re-evaluated against the same v5-slow holdout — the average lift was +0.087 Spearman across all twelve labels. Eight of the twelve models swapped their saved best for a later checkpoint when the post-training decile-best scan ran. We shipped the result.
The headline finding: the model knows when it knows
If you take only one chart away from this post, take this one. Bin the model's predictions into deciles by score (D1 is the worst-ranked tenth of the universe, D10 is the best), then count how often each bucket is on the side of the population median that the score said it should be on.
| Model | D1 right | D5 right | D10 right |
|---|---|---|---|
| vol_min_bot2_60d | 80.8% | 52.6% | 80.0% |
| vol_max_top2_60d | 77.7% | 51.1% | 74.4% |
| hold_down_60d | 76.8% | 52.2% | 77.2% |
| hold_down_120d | 75.9% | 52.6% | 78.4% |
| hold_up_120d | 74.6% | 51.6% | 74.1% |
| hold_up_30d | 71.2% | 51.4% | 69.6% |
(Six representative models out of twelve. The other six follow the same shape.)
The model's worst-ranked decile (D1) — the picks it explicitly flags as below-median — is correctly below median 71 to 81% of the time. Its best-ranked decile (D10) is correctly above median 69 to 80% of the time. The middle deciles drop to ~51%, almost exactly chance. This is what good calibration looks like: confident at the extremes, ignorant in the middle, and honest about both.
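If you want to reproduce the shape yourself, the computation is a few lines of pandas over a frame of model scores and realised outcomes; the column names here are illustrative rather than the exact CSV schema:

```python
import pandas as pd

def decile_accuracy(df: pd.DataFrame) -> pd.Series:
    """`df` needs a model `score` and a realised `outcome` column per signal.
    Returns, per score decile (1 = worst, 10 = best), how often the signal landed
    on the side of the population median its decile implies."""
    median = df["outcome"].median()
    codes = pd.qcut(df["score"], 10, labels=False)   # 0..9, worst to best
    should_be_above = codes >= 5                     # D6-D10 claim above-median
    is_above = df["outcome"] > median
    return (should_be_above == is_above).groupby(codes + 1).mean()
```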
The U-shape is the meaningful part. A model whose accuracy was flat at 62% across all deciles would be no more useful than its average accuracy: every prediction would be a 62% guess. A model whose accuracy concentrates at the extremes is dramatically more useful, because it tells you when to listen and when to ignore it. The top decile here is a different beast from the average prediction. It's the part of the rank where the signal is strongest and, for what it's worth, the part the home page surfaces as Top 10%, Top 5%, and Top 1% badges.
If you want one number anyway
Across 2,584,905 held-out historical signals per label, scored after deployment:
| Model | Pair-acc | Above-median | Spearman ρ | Top-10% lift |
|---|---|---|---|---|
| vol_min_bot2_60d | 64.4% | 65.1% | +0.42 | 6× less downside |
| vol_min_bot1_30d | 63.2% | 63.9% | +0.38 | 5.7× less downside |
| hold_down_90d | 63.0% | 63.2% | +0.38 | +8.1× resistance break |
| hold_down_60d | 62.8% | 63.0% | +0.37 | +8.4× resistance break |
| hold_down_120d | 62.8% | 63.0% | +0.37 | +7.6× resistance break |
| vol_max_top2_60d | 61.8% | 62.2% | +0.34 | +22.9% peak |
| hold_down_30d | 61.6% | 61.7% | +0.33 | +6.9× resistance break |
| hold_up_120d | 61.2% | 61.4% | +0.33 | +4.7× support hold |
| vol_max_top1_30d | 61.1% | 61.6% | +0.32 | +15.5% peak |
| hold_up_90d | 60.7% | 60.9% | +0.31 | +4.7× support hold |
| hold_up_60d | 60.5% | 60.6% | +0.31 | +5.0× support hold |
| hold_up_30d | 59.4% | 59.4% | +0.27 | +4.5× support hold |
| mean | 61.9% | 62.2% | +0.35 | — |
Pair accuracy is the probability that the model, shown two random stocks at two random points in history, correctly says which one performed better on its metric. 50% is a coin flip. 61.9% is not a coin flip. But it averages over the deciles, and as the table above shows, the deciles are not the same. Use the deciles.
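Pair accuracy is also cheap to estimate by sampling. A sketch, assuming the same score-and-outcome frame as the decile snippet above:

```python
import numpy as np
import pandas as pd

def pair_accuracy(df: pd.DataFrame, n_pairs: int = 1_000_000, seed: int = 0) -> float:
    """Probability that the higher-scored of two random signals also had
    the higher realised outcome (ties on either axis are ignored)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(df), n_pairs)
    j = rng.integers(0, len(df), n_pairs)
    s = df["score"].to_numpy()
    o = df["outcome"].to_numpy()
    valid = (s[i] != s[j]) & (o[i] != o[j])
    agree = (s[i] > s[j]) == (o[i] > o[j])
    return float(agree[valid].mean())
```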
What we didn't solve
We trained models that rank stocks. We did not train models that trade them. The decile spreads above are accuracy claims, not return claims. Whether you can extract realised return from this rank ordering, after costs and slippage and the small matter of execution in a market that knows you exist, is a different question. We are working on it. There will be a different blog post. It may include more dignified failures.
Two specific things we're explicitly not claiming:
- The model has alpha. Alpha is risk-adjusted excess return. We've shown a ranking edge. Whether that edge survives the bid-ask spread, transaction costs, and the elemental cruelty of having to actually own things is an empirical question for the backtest rewrite (currently task #59 on our board, in case you want to follow along).
- The model is right about any specific stock you're looking at right now. It's right about the average of two thousand stocks over thousands of trading days. Individual predictions vary, sometimes wildly. The U-shape above is a statistical claim, not a personal promise.
Open questions and v6
The two long-horizon hold-down models (90d and 120d) regressed relative to v5 — the only two of twelve that did. Continued training past a certain point caught the long-tail noise rather than the signal. We didn't ship them. We're still trying to work out whether the regression is overfitting (in which case stop training), or underfitting (in which case feed more data), or feature-bound (in which case rebuild the input). The next generation, v6, will re-investigate proximity-gap search, exchange features, and a structured audit of which historical signals are actually useful for training versus actively confusing the model.
We also have an audit task open for cases like HAS.L on 2021-06-14, where the analyser is showing both the original trend line and a newer one that better describes recent action, and we suspect we may be feeding the wrong one to training. Stay tuned.
Coming next
The training-curve overlay (Spearman by step, twelve labels, with the v5 baseline) and the per-model U-curve charts will get a dedicated page once they're rendered. They are, frankly, more visually striking than the tables above. If you'd like the raw data to plot yourself, the full per-decile breakdown lives in training/full_market_report.py, fed by the per-label winner_scores_full_market.csv outputs.