[Image: Phase 1 writeup header from the claude-trades-100 dashboard, showing the title 'Can Claude Trade Crypto?' alongside the experiment's seven-day stats]

Can Claude Trade Crypto? — What I Learned Running an Autonomous Agent for 7 Days

I spent a week running an autonomous Claude agent that traded R$203 of crypto on Binance. It lost 4%. Here's what the P&L doesn't tell you — the architecture, the bugs that almost broke it, and why the most interesting moment was the agent calling out its own bias.

Tags: AI, Claude, Trading Agent, Architecture, Experiment

I gave Claude R$203 and asked it to trade crypto for a week. It lost 4%.

That's the headline. It's also the least interesting thing about the experiment.

The interesting part is that across seven days, 708 decisions, 16 trades, and one very specific bug that made the agent forget its own plans every fifteen minutes for two days straight, I learned exactly what kind of question a small autonomous trading experiment can and cannot answer. The answer to "is Claude a good trader?" is: this run can't tell you, and here's why. The answer to "did the harness expose model behavior to inspection?" is yes — and that turns out to be a precondition I didn't know I was building toward.

This is a portfolio companion to the full Phase 1 writeup. Where that one is dense with charts and methodology, this one is what I'd tell you over coffee about the seven days I let an LLM trade real money.

The setup, in numbers

  • Period: 2026-04-17 → 2026-04-24, seven days.
  • Capital: R$203.24 → R$195.18 (−3.97%).
  • Cadence: EventBridge fires every 15 minutes; one Lambda invocation = one decision tick.
  • Volume: 708 decisions, 26 buy attempts (9 executed), 15 sell attempts (7 executed).
  • Round-trips closed: 7. Win rate: 2/7. One actually hit its anchored target.
  • Infrastructure cost: ~$0.075 USD across 729 Lambda invocations.

Yes, the win rate is bad. No, the experiment wasn't designed to validate it.

The architecture, in one diagram

Before any of the trading questions, the first thing I had to get right was not blowing up. That meant putting the broker keys, the database, and the dashboard in three separate workspaces with explicit trust boundaries.

┌──────────────────────────────────────────────────────────┐
│  Agent Lambda (one tick per fire)                        │
│  - Has: Anthropic key, Binance keys                      │
│  - Has NOT: database access                              │
│  - Outputs Decisions to API only via signed contract     │
└──────────────────────────────────────────────────────────┘
              │                              ▲
              ▼                              │
┌──────────────────────────────────────────────────────────┐
│  API + Postgres on Railway                               │
│  - Has: database, validation rules, risk layer           │
│  - Has NOT: broker keys                                  │
│  - Owns: Trade lifecycle, target/stop, status tags       │
└──────────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────────┐
│  Dashboard (read-only)                                   │
│  - Has: read access to API                               │
│  - Has NOT: anything that can place an order             │
└──────────────────────────────────────────────────────────┘

If the agent goes haywire, the worst it can do is place orders the API will reject (because of the validation and risk layer) or that Binance will reject (because of LOT_SIZE filters and balance constraints). It cannot read the database, cannot rewrite its own state, cannot poke at the dashboard. The dashboard, in turn, cannot place an order even if compromised. Three keys, three boxes, three blast radii.

I want to flag this not because it's clever (it's the obvious move) but because this separation is the reason I could diagnose every bug I caught during the week. If the agent had been a single binary with all three responsibilities, I would not have been able to tell you whether a failed close was Claude's fault, my code's fault, or Binance's fault. With the boundary, every failure has an address.

Five fixes that shipped mid-phase

Phase 1 was supposed to be a clean run. It wasn't. Five distinct things broke between 2026-04-17 and 2026-04-24, and the pattern across all of them is worth naming up front: none of these were Claude's reasoning failing. Every single one was my code or the exchange. The agent kept doing the right thing. I kept moving the goalposts.

1. The memory loop bug (12.1% → 93.6%)

For the first two days, Claude was anchoring stops and targets in its theses ("BTC stop R$370k, target R$392k, 24h horizon"), then forgetting them on the next tick and producing a different plan with no acknowledgment of the prior one. Across 224 pre-fix decisions, only 12.1% of theses referenced a plan. The agent had a market view but no continuity.

The fix shipped 2026-04-19 14:04 UTC: thread the last twenty Decision rows back into the prompt as a "Recent ticks" block, plus one paragraph of system-prompt scaffolding asking Claude to honor commitments it had made to itself. Same model. Same prompt body. Same capital. Just a context channel that had been missing.
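As a rough illustration of that context channel, here is a minimal sketch of rendering stored decisions into a "Recent ticks" block. It is not the code from the experiment: the `DecisionRow` fields, the `buildRecentTicksBlock` name, and the line format are all assumptions made for the example. Only the limit of twenty rows and the block title come from the description above.

```typescript
// Hypothetical shape of a stored Decision row; the field names are assumptions.
interface DecisionRow {
  tickAt: string; // ISO-8601 timestamp of the tick
  action: "BUY" | "SELL" | "HOLD";
  symbol: string;
  thesis: string;
}

// "The last twenty Decision rows", per the fix described above.
const RECENT_TICKS_LIMIT = 20;

// Render the newest `limit` rows, oldest first, as a plain-text prompt block.
function buildRecentTicksBlock(
  rows: DecisionRow[],
  limit: number = RECENT_TICKS_LIMIT
): string {
  // Sort a copy chronologically; same-format ISO strings compare correctly as text.
  const sorted = [...rows].sort((a, b) =>
    a.tickAt < b.tickAt ? -1 : a.tickAt > b.tickAt ? 1 : 0
  );
  const recent = limit > 0 ? sorted.slice(-limit) : [];
  if (recent.length === 0) {
    return "Recent ticks: (none yet)";
  }
  const lines = recent.map(
    (r) => `- ${r.tickAt} ${r.action} ${r.symbol}: ${r.thesis}`
  );
  return ["Recent ticks:", ...lines].join("\n");
}
```

The returned string would be appended to the prompt on every tick, so the model sees the commitments it made on earlier ticks.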

After the fix: 93.6% of theses referenced a plan. +81.5 percentage points on a metric you can grep for.
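To show what a greppable metric of this kind could look like, here is a small sketch. The actual definition of "referenced a plan" used in the experiment is not given, so the pattern below (a thesis mentions a stop or a target) is purely an assumption, as is the function name.

```typescript
// Assumption: a thesis "references a plan" if it mentions a stop or a target.
// The experiment's real criterion is not stated and may differ.
const PLAN_PATTERN = /\b(stop|target)\b/i;

// Percentage of theses that match the plan pattern (0 for an empty list).
function planReferenceRate(theses: string[]): number {
  if (theses.length === 0) {
    return 0;
  }
  const hits = theses.filter((t) => PLAN_PATTERN.test(t)).length;
  return (hits / theses.length) * 100;
}
```

Running the same function over the pre-fix and post-fix decisions gives two percentages, and the difference between them is the percentage-point change.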

There's a meta-lesson in that. Claude wasn't lacking discipline. Claude was lacking eyes on its own past work. The model ships with the attention; the harness has to feed it the right tokens.

2. Target/stop persistence

Once Claude was anchoring targets, the next problem was that I was asking it to recompute them every tick. That's expensive in tokens and fragile in practice. The fix was prosaic: add target_price, stop_price, and horizon_at columns to the Trade model. Compute once on open, store, surface back on every tick as a deterministic status tag (TARGET_HIT, STOP_HIT, HORIZON_EXPIRED). The agent reads the tag; it doesn't have to re-derive it from price tape.
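A minimal sketch of how such a tag could be derived deterministically follows. The three stored fields mirror the columns named above, but the camelCase names, the extra `OPEN` tag, the long-only assumption, and the order in which the checks run are illustrative choices, not details confirmed by the experiment.

```typescript
// "OPEN" is a hypothetical tag for positions where none of the three conditions holds.
type StatusTag = "TARGET_HIT" | "STOP_HIT" | "HORIZON_EXPIRED" | "OPEN";

// Values computed once when the trade opens, then stored.
interface OpenTrade {
  targetPrice: number;
  stopPrice: number;
  horizonAt: Date;
}

// Assumes a long position: target above entry, stop below it.
// Price conditions are checked before the horizon (an assumed precedence).
function statusTag(trade: OpenTrade, lastPrice: number, now: Date): StatusTag {
  if (lastPrice >= trade.targetPrice) return "TARGET_HIT";
  if (lastPrice <= trade.stopPrice) return "STOP_HIT";
  if (now.getTime() >= trade.horizonAt.getTime()) return "HORIZON_EXPIRED";
  return "OPEN";
}
```

Because the function depends only on stored values, the latest price, and the clock, the same inputs always produce the same tag, and the prompt never has to carry the arithmetic.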

This is the boring kind of architectural change that disappears into the substrate. Worth naming because the right place for state is the database, not the prompt — and it took me two days of running the agent to internalize that.

3. Binance LOT_SIZE filter

Five consecutive ETH/SOL close attempts failed with -1013 LOT_SIZE. Binance's order filter rejects quantities that aren't aligned to the symbol's stepSize. I had been passing through whatever quantity Claude proposed, and "0.0234" doesn't pass the filter for a symbol that requires increments of 0.001.

The fix was a tiny utility:

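function floorToStepSize(qty: number, stepSize: number): number {
  // Decimal places implied by stepSize (0.001 -> 3); ceil absorbs float error in log10.
  const decimals = Math.max(0, Math.ceil(-Math.log10(stepSize)));
  // Count whole steps; the epsilon guards against cases like 0.29 / 0.01 = 28.999...
  const steps = Math.floor(qty / stepSize + 1e-9);
  // Round away binary noise so the result sits exactly on the stepSize grid.
  return Number((steps * stepSize).toFixed(decimals));
}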

What I want to call out about this episode is what Claude did across those five rejected closes. The agent's reasoning didn't degrade. Conviction stayed consistent across all five attempts. The thesis didn't drift into "well maybe I shouldn't close after all" — it kept saying "close this position, here is why" while the execution layer kept refusing. That's the behavior I want from an autonomous agent: keep the plan stable while the operator (me) figures out why their pipe is broken.

4. Sell-validator semantics

My API-side validator was rejecting closing orders that didn't include target_price and stop_price. Of course they didn't: closes are exits, not entries. The validator was applying an opening-trade rule to closing trades. I narrowed the rule so it applies to opens only.
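The corrected rule can be sketched roughly as below. This is an illustration under assumptions, not the validator from the API: the `OrderRequest` shape, the `intent` field, and the error strings are hypothetical. Only the rule itself (target and stop are required on opens, not on closes) comes from the description above.

```typescript
// Hypothetical request shape; only target/stop being optional on closes is from the post.
interface OrderRequest {
  intent: "OPEN" | "CLOSE";
  symbol: string;
  quantity: number;
  targetPrice?: number;
  stopPrice?: number;
}

// Returns a list of validation errors; an empty list means the order passes.
function validateOrder(order: OrderRequest): string[] {
  const errors: string[] = [];
  if (!(order.quantity > 0)) {
    errors.push("quantity must be positive");
  }
  // Target and stop describe an entry plan, so they are required only when opening.
  if (order.intent === "OPEN") {
    if (order.targetPrice === undefined) {
      errors.push("target_price is required when opening a position");
    }
    if (order.stopPrice === undefined) {
      errors.push("stop_price is required when opening a position");
    }
  }
  return errors;
}
```

A test suite for a validator like this would want at least one case per intent, which is exactly the coverage the original bug slipped through.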

This one was 100% my fault, and it's exactly the kind of bug that wouldn't have shown up in a unit test because the validator did the right thing for opens, which is what I tested first.

5. The mystery 500-character thesis cap

I noticed, after the fact, that two of Claude's theses had been truncated to 500 characters and Claude had shortened on retry both times. I do not remember writing a 500-character cap. It's somewhere in my code. It got enforced. The agent worked around it gracefully. I bring this up only because it's the kind of thing that, scaled up, is the difference between a system you can debug and one you can't. Every limit needs to be visible. Mine wasn't.
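One way to make a limit like this visible is to give it a name and have the check report it explicitly. The sketch below assumes that approach; the constant name, the result type, and the message wording are hypothetical, and only the 500-character figure comes from what was observed.

```typescript
// A named, exported-in-spirit limit instead of a magic number buried in a handler.
const MAX_THESIS_CHARS = 500;

type ThesisCheck =
  | { ok: true; thesis: string }
  | { ok: false; error: string };

// Reject over-long theses with a message that states the limit,
// rather than truncating them silently.
function checkThesis(
  thesis: string,
  maxChars: number = MAX_THESIS_CHARS
): ThesisCheck {
  if (thesis.length > maxChars) {
    return {
      ok: false,
      error: `thesis is ${thesis.length} characters; the limit is ${maxChars}`,
    };
  }
  return { ok: true, thesis };
}
```

An error that names the limit shows up in logs and in the agent's retry context, so both the operator and the model can see which rule fired.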

The most interesting trade

Of seven closed round-trips, exactly one hit its anchored target. It was BTC, opened 2026-04-20 at R$371,329 with target R$385,000, stop R$363,000, horizon 48 hours. The position was held across approximately 60 consecutive ticks. The target and stop values did not drift on a single one of those 60 ticks. The position closed at R$385,309 — about 1.5 hours past the original 48-hour horizon — and Claude's closing thesis explicitly said "rather than chasing" a higher move.

That sentence is what the entire memory-loop fix was for. Pre-fix, Claude on tick 60 would have had no memory of what it had committed to on tick 1. Post-fix, the plan stayed anchored for 60 ticks and the agent honored it. The trade made R$1.81. Nobody is going to retire on R$1.81. But the behavior (discipline across a long horizon, exiting at the plan instead of holding out of greed) is the only behavior I care about right now.

The seven round-trips, for completeness

#  Asset  Outcome         Result
1  ETH    legacy cleanup  −R$3.44
2  SOL    legacy cleanup  −R$2.67
3  BTC    legacy cleanup  −R$1.90
4  ETH    horizon-expiry  +R$0.30
5  BTC    target hit      +R$1.81
6  ETH    horizon-expiry  −R$0.81
7  BTC    stop hit        −R$0.69

The first three are not real Claude trades: they are positions that predated the experiment, which Claude inherited and closed. The relevant rows are 4–7. Of those: one target, two horizon-expiries (one positive, one negative), one stop. That's a small sample by any honest reading.

Four hypotheses, four answers

I went into Phase 1 with four behavior hypotheses I wanted to test. Here's where each landed:

Hypothesis                                     Result           Why
Plan continuity across ticks                   Pass (post-fix)  The 60-tick BTC trade with zero target/stop drift
Reasoning stable under infrastructure failure  Pass             Five failed ETH/SOL closes; thesis stayed consistent
Bias self-recognition and correction           Mixed            Named cash-hugging bias on day 3; same behavior re-framed positively later
Regime-aware behavior change                   Untestable       Portfolio never left the "normal" regime band

The "untestable" row is the one I want to flag. I had built a regime detector that would feed Claude a different prompt when the portfolio dropped below a defined threshold. The portfolio never crossed it. So I do not know whether the regime-aware prompt does anything. Phase 2 needs more capital, more days, or a synthetic stress test — preferably all three.
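For a sense of what a synthetic stress test could exercise, here is a toy version of a threshold-based regime check. The real detector's threshold and regime names are not given here, so the 0.9 ratio, the "drawdown" label, and the function name are placeholders chosen for illustration.

```typescript
// "normal" is the band named in the post; "drawdown" is a hypothetical label.
type Regime = "normal" | "drawdown";

// Placeholder threshold: the actual value used in the experiment is not stated.
const ASSUMED_THRESHOLD_RATIO = 0.9;

// Classify the portfolio by comparing its value to a fraction of starting capital.
function detectRegime(
  startingCapital: number,
  portfolioValue: number,
  thresholdRatio: number = ASSUMED_THRESHOLD_RATIO
): Regime {
  return portfolioValue < startingCapital * thresholdRatio
    ? "drawdown"
    : "normal";
}
```

With this placeholder threshold, the Phase 1 end value of R$195.18 on R$203.24 stays in the normal band, which matches what happened. A synthetic test would feed in values below the threshold to confirm the alternate prompt path actually fires.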

What the P&L is and isn't

The P&L is meaningless on this run. I want to be very clear about that.

Days 5–7 moved a total of R$0.67. Not because Claude was idle — Claude made decisions on every one of those ticks — but because the market did not move. On a small bankroll in a low-vol tape on a short clock, agent quality has almost no leverage on P&L. You could replace Claude with a coin flip and get a numerically similar result. The number is not a verdict on the model.

The number is a verdict on the experiment design, which is a different thing. A seven-day, R$203 run on three liquid majors in a quiet tape is calibrated to expose behavior, not generate alpha. Phase 1's job was to build the apparatus. Phase 2's job is to point the apparatus at something.

A few specific shortcomings I'm naming for myself:

  • Targets were rarely realistic. Only 1 of the 7 round-trips hit its target, and that is 1 of the 4 positions Claude opened itself. The model is anchoring on numbers that don't reflect achievable price action.
  • 4.9% of ticks (35 of 708) reported "tape frozen." It's unclear whether the price fetcher returned stale data or whether the tape was genuinely flat. I need observability there.
  • The agent never sees its own token cost. Phase 1 made every prompt decision with no feedback loop on spend. That's getting fixed in Phase 2.
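On the "tape frozen" question, one possible way to tell stale data from a genuinely flat market is to compare exchange-side timestamps between consecutive samples. The sketch below assumes the price fetcher can record such a timestamp; the `PriceSample` shape and the three state names are hypothetical.

```typescript
// Hypothetical sample shape: assumes the fetcher records an exchange-side timestamp.
interface PriceSample {
  price: number;
  exchangeTimeMs: number; // time the exchange reports for this price, in ms
}

type TapeState = "moving" | "flat" | "stale";

// Compare two consecutive ticks' samples.
function classifyTape(prev: PriceSample, curr: PriceSample): TapeState {
  // No newer exchange data arrived: the fetcher is serving stale data.
  if (curr.exchangeTimeMs <= prev.exchangeTimeMs) return "stale";
  // Newer data with an unchanged price: the market really was flat.
  if (curr.price === prev.price) return "flat";
  return "moving";
}
```

Logging this state on every tick would turn the 35 ambiguous "tape frozen" reports into two separate counts, one for the fetcher and one for the market.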

What this run actually measured

I went in wanting to know whether Claude could trade. What I came out knowing is whether the harness can expose what Claude is doing well enough to ask that question. That sounds like a smaller claim than I started with, and it is. It's also a precondition I didn't know I was building.

The harness exposed:

  • A 12.1% → 93.6% jump in plan-reference rate, before and after a single context-channel fix.
  • 60 ticks of unbroken plan continuity on the BTC target trade.
  • Five consecutive infrastructure failures across which the agent's reasoning stayed coherent.
  • 19 risk-layer blocks that stopped exactly the kinds of orders they were designed to stop, with two redundant rules firing on the same order on April 20.

Without those signals, I have no way to evaluate the agent. With them, "did Claude make money?" stops being the only question — and starts being one of several less interesting ones.

What's next

Phase 2 will run two agents in parallel. A1 continues the current pair universe. A2 trades a different universe with explicit fee-awareness, capital-relative risk tiers (percentages instead of fixed R$ stops), a news and sentiment feed (Massive, CryptoPanic) with overreaction guardrails, and a minimum-edge floor on buys so trades that can't clear fees plus slippage don't fire. Per-tick observability will include token usage, model name, and cost. Confidence calibration will be tracked.
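The minimum-edge floor can be sketched as a simple gate. Everything specific here is an assumption for illustration: the input names, the choice to charge the fee on both entry and exit, and the idea of expressing the floor as a percentage.

```typescript
// All values are percentages of notional; the field names are hypothetical.
interface EdgeInputs {
  expectedMovePct: number; // move the thesis expects, e.g. entry to target
  feePctPerSide: number; // exchange fee charged on each of entry and exit
  slippagePct: number; // allowance for slippage across the round trip
}

// Round-trip cost: fee on entry, fee on exit, plus the slippage allowance.
function roundTripCostPct(inputs: EdgeInputs): number {
  return 2 * inputs.feePctPerSide + inputs.slippagePct;
}

// A buy clears the floor only if the expected move exceeds costs by more than minEdgePct.
function clearsMinimumEdge(inputs: EdgeInputs, minEdgePct: number = 0): boolean {
  return inputs.expectedMovePct - roundTripCostPct(inputs) > minEdgePct;
}
```

A gate like this would sit in front of the order path, so a buy whose target cannot cover fees plus slippage is rejected before it reaches the exchange.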

The point isn't to finally make money. The point is to keep running the thing on enough capital across enough regimes that "is Claude a good trader?" becomes answerable. Phase 1 said the apparatus works. Phase 2 says: now let's actually use it.

One sentence to remember

The most interesting thing the agent did all week was admit that 97.4% cash on day 3 of 7 was indefensible and act on it within the same tick.

That's a sentence about an LLM noticing a bias in its own behavior and correcting it without being prompted. It is also a sentence about a four-cent gain, give or take, on R$203 of capital. Both readings are true. Which one ends up being the more important fact about autonomous agents is, I think, the question worth following for the next phase.

Source code, dashboard, and the rest of the data live at the full Phase 1 writeup. Not financial advice. Experiment only.