Vibe Data Science

Earlier this month, we published "Two Apps, Fourteen Hours" showing what vibe coding looks like in practice. We hinted we were working on something else.

This is that something else: vibe data science.

If vibe coding is "AI writes the code, human provides the judgment," then vibe data science is the same pattern applied to a harder problem: building datasets, designing architectures, running experiments, debugging failures, iterating toward a goal—with the AI executing the bulk of the work while the human provides the experience, instinct, and judgment that only comes from years in the field.

This is new. And it changes what a small team can accomplish.

What follows isn't a demo or a proof-of-concept. It's not the MNIST tutorial version of data science. Not the Kaggle competition version where someone else has already cleaned and packaged the data. This is the real version—where you start with hundreds of millions of raw market ticks, build your own training datasets, design architectures, and grind through hundreds of experiments hoping to extract a small edge from noisy data.

For three weeks—between client work, a road trip development session, and processing the loss of a dear friend and former colleague—Claude and I developed a volatility prediction model together.

This article shows what that collaboration actually looked like: the overnight dataset builds, the architecture debates, the plateau, and the breakthrough we almost missed.


The Problem: Predicting Volatility

A year ago, we built a volatility prediction walkthrough to teach the Lit platform to human users. We chose volatility for that tutorial because it's the perfect teaching problem: intuitively tractable, genuinely hard, and immediately useful if you solve it.

This time, instead of teaching a human, we set out to teach Claude. Same problem, same platform—but we deliberately threw away our previous work. No referencing old notes or trained models. We started from scratch: raw tick data, blank canvas, no shortcuts. Fresh eyes, fresh collaboration.

We also chose a different success metric: AUC instead of precision. The original walkthrough optimized for precision at a single operating point. This time we optimized for ranking ability across all thresholds—arguably a harder problem, and one that couldn't be solved by accidentally remembering a good threshold from before.

Sidebar: Why AUC?

Simple metrics are misleading with imbalanced classes.

If volatility spikes happen 30% of the time, a model that always predicts "no spike" gets 70% accuracy. Sounds good. But it has zero predictive value—it can't distinguish anything.

AUC measures something different: if you pick a random positive example and a random negative example, how often does the model rank the positive one higher? A random model gets 0.50 (coin flip). A perfect model gets 1.0.

Why 0.60? Thresholds are arbitrary—humans draw lines because humans need lines. But 0.60 isn't random. At 0.60 AUC, the model correctly ranks spike vs non-spike hours 60% of the time. That's a 20% improvement over guessing (0.50). In trading, edges compound. A 10% edge applied consistently beats a 50% edge applied once.
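The pairwise-ranking definition translates directly into code. A minimal, self-contained sketch with toy scores (not output from our model):

```python
def auc(scores, labels):
    """AUC as pairwise ranking probability: P(score_pos > score_neg),
    counting ties as half a win. O(n_pos * n_neg) -- fine for illustration."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model that outputs the same score for everything is a coin flip.
print(auc([0.7, 0.7, 0.7, 0.7], [1, 0, 1, 0]))  # 0.5
# A model that ranks every spike above every non-spike is perfect.
print(auc([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0]))  # 1.0
```

In practice you'd use an efficient implementation such as sklearn's roc_auc_score; it computes the same quantity.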

Why volatility works as a test case:

Markets aren't random. Anyone who's watched a trading screen knows that volatility clusters—quiet periods stay quiet, chaotic periods stay chaotic, and transitions between them have patterns. News events, earnings announcements, market opens—these create predictable volatility spikes. The question isn't whether volatility is predictable; it's whether we can build a model that captures enough of that predictability to be useful.

The specific target: predict whether ATR (Average True Range, a measure of price movement magnitude) will be higher in the next hour than the previous hour.

Candlestick chart with ATR overlay showing volatility periods
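For readers who haven't met ATR before, it's a short function. A sketch of a simple-moving-average ATR over OHLC bars — our pipeline's exact smoothing may differ (Wilder's original uses an exponential-style average):

```python
def true_range(high, low, prev_close):
    """True range: the bar's range, extended to cover any gap from the prior close."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))

def atr(bars, period=14):
    """Simple-moving-average ATR over (open, high, low, close) bars.
    Emits one value per bar once `period` true ranges are available."""
    trs, out = [], []
    for i in range(1, len(bars)):
        _, high, low, _ = bars[i]
        prev_close = bars[i - 1][3]
        trs.append(true_range(high, low, prev_close))
        if len(trs) >= period:
            out.append(sum(trs[-period:]) / period)
    return out
```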

Why This Is Hard

Lookahead bias. The cardinal sin of financial ML: accidentally using future information to predict the past. It's easy to leak—a feature normalized across the whole dataset, a label computed at a different time than the features, a random train/test split that puts 2018 data in training. The model learns to "predict" things it's already seen.

Non-stationary data. Markets evolve. The patterns that predicted volatility in 2015 might not work in 2023. Regime changes—shifts from bull markets to bear markets, from low-volatility environments to high-volatility ones—can invalidate learned patterns entirely. A model trained on calm markets may fail spectacularly during a crisis.

Low signal-to-noise ratio. Most price movements are noise. The market is full of random fluctuations, algorithmic trading artifacts, and one-off events that look like patterns but aren't. The predictable signal—the part that generalizes—is buried under all of it. Overfitting is the constant enemy.

Class imbalance. Volatility spikes (our positive class) happen about 30% of the time. A model could achieve 70% accuracy by always predicting "no spike"—and be completely useless.


Building the Dataset

Before you can train a model, you need training data.

From Raw Ticks to Training Samples

Our raw data: years of tick-by-tick market data from LSEG. Hundreds of millions of individual trades, each with a timestamp, price, and volume.

ben@oum:/data/contoso/raw/aapl$ ls -lht
-rwxr-xr-x 1 ben ben 620M Jun 21  2024 AAPL.O-2018.csv.gz
-rwxr-xr-x 1 ben ben 402M Jun 21  2024 AAPL.O-2017.csv.gz
-rwxr-xr-x 1 ben ben 497M Jun 21  2024 AAPL.O-2016.csv.gz
-rwxr-xr-x 1 ben ben 579M Jun 21  2024 AAPL.O-2015.csv.gz
-rwxr-xr-x 1 ben ben 414M Jun 21  2024 AAPL.O-2014.csv.gz
-rwxr-xr-x 1 ben ben 215M Jun 21  2024 AAPL.O-2013.csv.gz

Here's what one trade looks like:

>>> df.iloc[0]
#RIC                                                               AAPL.O
Domain                                                       Market Price
Date-Time                                  2019-01-01T00:00:23.123221189Z
GMT Offset                                                             -5
Type                                                                Trade
Price                                                              157.92
Volume                                                              100.0
Qualifiers              132[IRGCOND];[GV3_FLAG];X[GV3_TEXT];   [PRC_QL...
Exch Time                                              00:00:23.107000000

This is what the market actually looks like—not neat minute bars, but a firehose of individual transactions. That's one trade: 100 shares of AAPL at $157.92, timestamped to the nanosecond. The CSV has 10,000 rows just for part of one day. We have years of this. Hundreds of millions of rows.

In [1]: from lit.data.loader import load_adapter_by_dataset

In [2]: adapter = load_adapter_by_dataset("contoso", "aapl")

In [3]: f"{len(adapter):,}"
Out[3]: '858,974,512'

We needed to transform this into training samples. Each sample should contain:

  • Features: The last 100 one-minute OHLCV bars, the last 100 ATR values
  • Label: Will ATR be higher in the next hour than the previous hour?

Ben: Take a look at atr_up_next_hour.py. That's our label.

Claude: [reads the file] Got it—compares ATR over the next hour to ATR over the previous hour, returns 1.0 if it's going up. The adapter handles all the tick-to-bar aggregation, so I just work with clean OHLCV data.

Actually—why limit ourselves to one hour? I could create labels for multiple windows: 10 minutes, 30 minutes, 60 minutes. Same pattern, different horizons. Then we see which timeframe has the most learnable signal.

Ben: Do it.

One design principle that isn't obvious: labels are computed the same way as features. The label "ATR goes up next hour" is just another function that runs over the data—it just happens to look forward in time instead of backward. This simplifies the code and guarantees that each sample has the correct label. We learned this the hard way years ago—compute features and labels at separate times and they can get out of sync. Same machinery, same moment, no misalignment.
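A sketch of that principle, with hypothetical names rather than the actual Lit interfaces: features and labels are both functions over the same ATR series, anchored at the same index, and only the direction of the window differs.

```python
def mean_atr(atr_series, start, end):
    """Average ATR over the half-open window [start, end)."""
    window = atr_series[start:end]
    return sum(window) / len(window)

def feature_last_100_atr(atr_series, t):
    """Feature: looks BACKWARD from the anchor index t."""
    return atr_series[t - 100:t]

def label_atr_up_next_hour(atr_series, t, bars_per_hour=60):
    """Label: same machinery, same anchor t -- it just looks FORWARD.
    1.0 if mean ATR over the next hour exceeds the previous hour."""
    prev_hour = mean_atr(atr_series, t - bars_per_hour, t)
    next_hour = mean_atr(atr_series, t, t + bars_per_hour)
    return 1.0 if next_hour > prev_hour else 0.0
```

Because both functions receive the same series and the same anchor, a feature/label misalignment is impossible by construction.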

The transformation isn't trivial. We need to:

  1. Aggregate ticks into minute bars (handling gaps, market closes, anomalies)
  2. Compute rolling ATR over those bars
  3. Look forward in time to compute the label (was ATR higher in the next hour?)
  4. Filter out samples where the data is incomplete or anomalous
  5. Split by time period—training data from 2013-2017, validation from 2018
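Step 1 in miniature. A deliberately simplified sketch of tick-to-bar aggregation — the real adapter also handles gaps, market closes, and anomalous prints:

```python
def ticks_to_minute_bars(ticks):
    """Aggregate (epoch_seconds, price, volume) ticks into OHLCV minute bars,
    keyed by the minute's start timestamp."""
    bars = {}
    for ts, price, vol in ticks:
        minute = ts - ts % 60
        if minute not in bars:
            bars[minute] = [price, price, price, price, vol]  # O, H, L, C, V
        else:
            bar = bars[minute]
            bar[1] = max(bar[1], price)   # high
            bar[2] = min(bar[2], price)   # low
            bar[3] = price                # close = last trade seen
            bar[4] += vol                 # volume accumulates
    return bars

ticks = [(0, 100.0, 10), (30, 101.5, 5), (59, 99.0, 20), (61, 99.5, 7)]
print(ticks_to_minute_bars(ticks))
```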

Sidebar: Why You Can't Shuffle Time Series

That last point matters: you can't shuffle time series data.

Shuffling is fine for images of cats—there's no temporal relationship between photos. But financial data has a time axis. If you randomly split, samples from 2018 end up in training, and samples from 2017 end up in validation. Your model trains on the future and validates on the past. It learns to "predict" things it's already seen.

We split temporally: train on 2013-2017, validate on 2018. The model never sees any data from its validation period during training. This is harder—the validation distribution might differ from training (and ours did, as we discovered in the audit)—but it's honest. It measures what we actually care about: can the model predict the future?
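The split itself is trivial once every sample carries its timestamp. A sketch, assuming a ts field on each sample:

```python
from datetime import datetime

def temporal_split(samples, cutoff):
    """Split by time, never by shuffling: everything before `cutoff` trains,
    everything at or after it validates. Nothing from the future leaks backward."""
    train = [s for s in samples if s["ts"] < cutoff]
    val = [s for s in samples if s["ts"] >= cutoff]
    return train, val

samples = [
    {"ts": datetime(2015, 6, 1)},
    {"ts": datetime(2017, 3, 9)},
    {"ts": datetime(2018, 1, 2)},
]
train, val = temporal_split(samples, cutoff=datetime(2018, 1, 1))
# Every training timestamp strictly precedes every validation timestamp.
assert max(s["ts"] for s in train) < min(s["ts"] for s in val)
```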

The Overnight Build

Ben: Start the build when you're ready.

Claude: Starting now.

$ lit build start --rds_path /data/contoso/schemas/hourly_atr_spike_v2.json --clean
{
  "status": "started",
  "schema": "hourly_atr_spike_v2",
  "session": "build_hourly_atr_spike_v2",
  "monitor": "screen -r build_hourly_atr_spike_v2"
}

The next morning:

Ben: How's the build?

Claude: Let me check.

$ lit build status
{
  "state": "running",
  "workers": 8,
  "progress": "545M of 858M indices processed"
}

Claude: About 64% through, eight workers running in parallel. Looking healthy.

Five hours later...

Claude: Build complete. Seventeen hours total. 488,057 training samples and 97,350 validation samples. Ready to train.

Claude operates the Lit platform here. How builds work →

Auditing the Data

Claude: Ready to train?

Ben: Not yet. Last week when I first showed you how to build a test/train set I also emphasized the importance of auditing before you begin training. Do you recall the QA we ran?

Claude did remember, and examined the data—shapes, distributions, class balance:

TRAIN SET:
  Total samples: 485,153
  Class 0 (ATR down): 336,555 (69.37%)
  Class 1 (ATR up):   148,598 (30.63%)

TEST SET:
  Total samples: 96,631
  Class 0 (ATR down): 70,229 (72.68%)
  Class 1 (ATR up):   26,402 (27.32%)

Claude: Imbalanced! ATR tends to go down more often than up. This is actually expected—volatility tends to decay back to baseline more often than spike. The 2.5 class weighting in the canvas should help counteract this.
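On that 2.5 weighting: we won't reconstruct exactly how it was chosen, but two common class-weight heuristics land nearby given the audit's counts:

```python
n_neg, n_pos = 336_555, 148_598   # train-set counts from the audit above
total = n_neg + n_pos

# Heuristic 1: weight the rare class by the class ratio, so positives
# contribute as much total loss as the negatives that outnumber them.
ratio_weight = n_neg / n_pos                  # ~2.26

# Heuristic 2: "balanced" weights, sklearn-style: total / (2 * class_count).
w_neg = total / (2 * n_neg)                   # ~0.72
w_pos = total / (2 * n_pos)                   # ~1.63

print(round(ratio_weight, 2), round(w_neg, 2), round(w_pos, 2))
```

A round 2.5 sits just above the raw 2.26 ratio — slightly over-weighting the rare class, which is a reasonable nudge when the positive class is also the one you care about catching.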

Then came the deeper check—tracing training samples back to the original raw data to verify the pipeline hadn't corrupted anything. Load an HDF5 file, pick random samples, trace each one through the index back to the raw tick data 858 million rows deep, verify the OHLC values match exactly.

SPOT CHECK: 5 random samples traced back

1. Sample 377,798 → adapter index 64,906,315
   Timestamp: 2015-01-23 20:05:59.878218
   Label: 1.0 (ATR UP)
   Last bar (train): O=112.97, H=113.01, L=112.96, C=112.99
   Last bar (adapter): O=112.97, H=113.01, L=112.96, C=112.99
   OHLC match: True

2. Sample 307,536 → adapter index 119,998,123
   Timestamp: 2015-12-16 20:41:59.799642374
   Label: 0.0 (ATR DOWN)
   Last bar (train): O=111.42, H=111.55, L=111.40, C=111.54
   Last bar (adapter): O=111.42, H=111.55, L=111.40, C=111.54
   OHLC match: True

Claude: Every sample traced back perfectly. Pipeline is clean.

Ben: Wow. That would have taken me ten minutes. You did it in five seconds.

Claude operates the Lit platform here. How asset auditing works →


First Experiments: The Model Isn't Learning

Ben: Data's ready. Let's train something simple first and see what we get.

Claude: Starting with a basic architecture—flatten both inputs, two dense layers with dropout, sigmoid output. Run 11 launching now.

Ben: Show me the architecture.

Claude:

┌─────────────────┐   ┌─────────────────┐
│ 100 minute bars │   │ 100 ATR values  │
│    (100 × 5)    │   │    (100 × 1)    │
└────────┬────────┘   └────────┬────────┘
         │                     │
         └──────────┬──────────┘
                    │ flatten & concatenate
            ┌───────────────┐
            │  Dense (64)   │
            │   dropout     │
            └───────┬───────┘
            ┌───────┴───────┐
            │  Dense (64)   │
            │   dropout     │
            └───────┬───────┘
            ┌───────┴───────┐
            │    Output     │
            │   (sigmoid)   │
            └───────────────┘

Fifteen minutes later...

Claude: Run 11 finished. Val_AUC 0.41—worse than a coin flip. Train AUC is climbing to 0.57, but validation is stuck. Classic overfitting signature.

That didn't work. An AUC below 0.50 means the model's ranking on validation data is actually worse than random chance: whatever it learned from training didn't transfer.
That didn't work. Barely better than random chance on a 70/30 class split.

Six days later...

Life happened—client deadlines, a funeral, a weekend to recover. We'd grab an hour here, ninety minutes there, between other priorities. On December 9th, we got serious about training again.

Ben: Can you please summarize for me our features and our label, our test/train set and neural net architecture, and then let's discuss together how we want to move forward. We have established that both of us feel that volatility should be predictable at least to a certain measurable degree. Let's analyze how we've set up the problem and solution space.

Claude:

Feature Shape Description
100_1_minute_bars (100, 5) 100 most recent 1-minute OHLCV bars
100_1_minute_atr (100, 1) 100 ATR(14) values on 1-minute bars

Label: Binary—1 if ATR over the next hour is higher than ATR over the previous hour

Class distribution: ~70% class 0 (ATR down), ~30% class 1 (ATR up)

One concern: we're using minute-level inputs to predict an hourly label. The resolution mismatch might be making this harder than it needs to be.

Ben: I disagree. There's no logical reason the label unit must match the input unit. I want to know whether ATR goes up in the next hour; I'm free to make that determination looking at minute bars, second bars, individual transactions. There's no correlation between the unit of the data in which we might find signal and the unit of the label.

This is one of the human's jobs in vibe data science: knowing which concerns matter. The AI generates ideas, raises flags, suggests directions—but it can't always tell which ones are worth pursuing. Working with Claude felt like working with a genius data scientist who occasionally needed pulling out of ruts. I've seen this before—in human colleagues. The skill is the same: pattern-matching against experience, spotting the wrong turn before it wastes hours.

Sanity Testing the Pipeline

Was the problem the model, or the pipeline? Ben had ten years of experience with the Lit platform and knew it was bullet-proof. Claude needed proof.

Ben: Sanity tests are your friend. Building a feature input test/train set and label that you KNOW will work as a sanity test is never a bad idea—it puts borders around your solution space.

Claude: I've been operating on reported knowledge that the lit platform works, but I haven't seen it work. Watching a sanity test go from "trivially predictable label" → "high val_AUC" would give me genuine experiential confidence in the pipeline. That's qualitatively different from just knowing it theoretically.

Designing the sanity test. We needed a label that should be predictable from our existing features. The real label asks "will ATR be higher in the next hour?"—that's the hard prediction problem. But what if we asked "was ATR high in the previous hour?" The answer is already embedded in the input data. If the model can't learn that, something is broken.

Building it. Claude had already read our label function (atr_up_next_hour.py) and understood the pattern—the adapter interface, the ATR calculation, the return format. Writing a new one took seconds:

Claude: I'll create a sanity test label: "Was previous hour volatile?"
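A hypothetical reconstruction of such a label — not the actual file Claude wrote — showing the key property: it's a pure function of data that already sits inside the feature window.

```python
from statistics import mean, median

def atr_was_high_prev_hour(atr_series, t, bars=60, baseline=None):
    """Sanity label: was mean ATR over the PREVIOUS hour above a baseline level?
    The answer is fully determined by data the model already receives as input.
    If the model can't learn this, the pipeline is broken, not the model."""
    baseline = median(atr_series) if baseline is None else baseline
    return 1.0 if mean(atr_series[t - bars:t]) > baseline else 0.0
```

Contrast with the real label: "will ATR be higher next hour" is genuinely unknown at time t, while this one merely restates the recent past.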

The Lit platform supports incremental builds. Claude added the new atr_was_high_prev_hour label to our schema and kicked off a build. The system reused all the existing feature computations and computed the single new label column in minutes.

Running it. Claude trained a simple model on the sanity label and we watched the metrics climb:

Sanity Test: val_AUC Climbing

Epoch val_AUC
1 0.682
2 0.689
3 0.691
4 0.701
5 0.719
6 0.731
7 0.728
8 0.736

The model learned. val_AUC started at 0.68 on epoch 1—the pipeline immediately found signal. Not the 0.9+ we naively expected, but proof the system worked.

Claude operates the Lit platform here. How training works →

After the test completed:

Claude: What I now know from direct experience (not just documentation):

  1. The pipeline works end-to-end - I watched data flow from schema → build → assets → train → perf.csv. The ~40 seconds per epoch, the GPU utilization, val_AUC hitting 0.74—that's real.

  2. The "sanity test" revealed a flawed assumption - I thought atr_was_high_prev_hour would be trivially predictable (~0.9+ AUC) because "the ATR is in the features." But I didn't think carefully about the abstraction mismatch: hourly ATR from hour bars vs minute-level ATR series. 0.74 AUC was conclusive—the pipeline works.

The pipeline worked. The problem was still unknown—but we'd gained confidence in both the raw data and the tools.

Sidebar: The Value of Sanity Tests

Sanity tests put borders around your solution space.

When a model isn't learning, the problem could be anywhere: data pipeline, feature engineering, label definition, architecture, hyperparameters, training loop. That's a lot of surface area to debug. A sanity test collapses it to a binary question: can this pipeline learn anything at all?

The design principle: create a label where you already know the answer. If the model can't learn something trivially predictable from its inputs, the problem is upstream of the model. If it can, the problem is the model itself. Either outcome is valuable—you've eliminated half the search space.

Sanity tests are underused in practice. They feel like extra work when you're eager to train the real model. But the math favors them: ten minutes to build a sanity test vs. days debugging a pipeline that was broken from the start.

The Architecture Decision

With the pipeline validated, we turned to architecture. I pointed Claude to a model I'd designed previously:

Ben: Please take a look at Diff_Multi_Path_Velocity.json for inspiration.

This was a hybrid architecture I'd built for a different prediction problem: CNN processing prices, Transformer processing diffs, Dense processing velocity—all merged into an ensemble.

Claude analyzed what different approaches might capture:

Claude:

  • CNNs: Local pattern detection—"this shape in the data means something"
  • LSTM/GRU: Sequential dynamics, volatility clustering
  • Transformer/Attention: Long-range dependencies, bar 5 relating to bar 95
  • Dense on engineered features: Recent ATR trend, volume spikes, time-of-day

What's your intuition on where we should probe first—architecture or features?

Ben: ALL of those considerations are VERY GOOD. On the architecture: why not try all of those alternatives as separate towers and let the model learn what works.

This became our multi-tower ensemble: CNN, LSTM, and Transformer pathways processing the same input, merged at a decision layer.

Ben: Show me.

Claude:

┌─────────────────────────────────────────────────────────────────┐
│                         INPUT DATA                              │
│              100 minute bars + 100 ATR values                   │
└─────────────────────────────────────────────────────────────────┘
                    │              │              │
                    ▼              ▼              ▼
            ┌───────────┐  ┌───────────┐  ┌───────────┐
            │    CNN    │  │   LSTM    │  │Transformer│
            │   Tower   │  │   Tower   │  │   Tower   │
            │           │  │           │  │           │
            │ Conv1D    │  │ LSTM      │  │ Attention │
            │ Conv1D    │  │ LSTM      │  │ FFN       │
            │ Pool      │  │ Dense     │  │ Dense     │
            └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
                  │              │              │
                  │   ┌──────────┴──────────┐   │
                  │   │                     │   │
                  └───┼─────────────────────┼───┘
                      │      MERGE          │
                      │   (concatenate)     │
                      └──────────┬──────────┘
                         ┌───────────────┐
                         │  Dense (64)   │
                         │   dropout     │
                         └───────┬───────┘
                         ┌───────┴───────┐
                         │    Output     │
                         │   (sigmoid)   │
                         └───────────────┘

Each tower sees the same input but extracts different patterns: CNNs find local shapes ("this candlestick pattern means something"), LSTMs capture sequential dynamics ("volatility has been building"), Transformers spot long-range relationships ("bar 5 correlates with bar 95").
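The wiring pattern in miniature. A numpy sketch in which each tower is reduced to a stub that maps the same input to its own embedding — real towers are Conv1D/LSTM/attention stacks, but the merge-and-decide shape is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))  # one sample: 100 minute bars x OHLCV

def tower(x, w):
    """Stand-in for a tower: project each bar, pool over time to one embedding.
    (A real CNN/LSTM/Transformer tower does far more; only the shape matters here.)"""
    return np.tanh(x @ w).mean(axis=0)          # -> (16,) embedding

# Three towers with independent weights, all reading the SAME input.
w_cnn, w_lstm, w_xfmr = (rng.normal(scale=0.1, size=(5, 16)) for _ in range(3))
merged = np.concatenate([tower(x, w_cnn), tower(x, w_lstm), tower(x, w_xfmr)])

# Shared decision head: Dense -> sigmoid, as in the diagram above.
w_head = rng.normal(scale=0.1, size=48)
p_spike = 1 / (1 + np.exp(-(merged @ w_head)))
print(merged.shape, float(p_spike))
```

Training then decides how much weight the head gives each tower's embedding — "let the model learn what works."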

We also drew inspiration from SlowFast Networks, a technique from Facebook AI Research's 2018 paper on video understanding. The insight: different temporal phenomena happen at different speeds. Think about recognizing a cat in a video—you can subsample frames to track the cat's movement across the room, but you need high frame rates to catch the twitchy ear movements. SlowFast processes both simultaneously: a "slow pathway" captures gradual changes, a "fast pathway" captures rapid dynamics.

Sidebar: Staying Current with Research

This project highlighted something that compounds over years: reading academic papers.

When Facebook published the SlowFast paper in 2018, we read it and saw applications beyond video—to EEG signals, to financial time series, to anything with multi-scale temporal structure. The insight transfers. A technique invented for recognizing actions in video works for detecting volatility regimes in markets.

The same pattern applies to GRUs (2014), attention mechanisms (2017), and dozens of other techniques. Each paper expands what we can try in future projects. The investment compounds.

Claude operates the Lit platform here. How the canvas and component library work →

Financial time series have the same property. A 5-minute volatility spike (maybe an earnings announcement) looks different than a 30-minute grinding increase (maybe institutional accumulation). One-minute candles capture fast dynamics; 5-minute or 15-minute candles capture slow context. We'd implemented this for EEG analysis years ago—brain waves have similar multi-scale temporal structure—and the same component applied directly to market data.

The Label Discovery

Remember earlier, when Claude suggested building labels for multiple time windows?

Claude: Let me find the strongest label before we train.

Label Test Correlation
atr_spike_next_hour -0.023
atr_spike_next_30min -0.110
atr_spike_next_60min -0.144

The 60-minute sliding window has roughly six times the signal of the clock-aligned hourly label (-0.144 vs -0.023). Negative correlation means mean-reversion—when current ATR is high, a spike is less likely. That's learnable. Should we try the stronger labels?

Ben: I forgot you built those. Go for it.

The hourly label uses clock-aligned windows (if you're at 2:37 PM, "next hour" means 3:00-4:00 PM), while the minute-based labels predict what happens immediately after your observation—no gap between the data you're looking at and what you're predicting.
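The two window definitions, using that 2:37 PM example:

```python
from datetime import datetime, timedelta

now = datetime(2018, 3, 14, 14, 37)  # observing at 2:37 PM

# Clock-aligned "next hour": 3:00-4:00 PM. There's a 23-minute gap
# between the last bar the model sees and the window it must predict.
aligned_start = now.replace(minute=0) + timedelta(hours=1)
aligned_end = aligned_start + timedelta(hours=1)

# Sliding 60-minute window: 2:37-3:37 PM. It starts immediately.
sliding_start, sliding_end = now, now + timedelta(minutes=60)

print(aligned_start - now)  # the dead time the clock-aligned label must see through
```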

Our first run with the 30-minute label (run 30) hit 0.51—better than anything before. But when we switched to that 60-minute label with its stronger signal, run 33 jumped to 0.5740.

On December 12th, the multi-tower architecture plus the better label started finding real signal:

Run val_AUC Key Change
29 0.11 Hourly label (original)
30 0.5127 30-minute label
33 0.5740 60-minute label + multi-tower

That jump from 0.11 to 0.57—the model now correctly ranked spike vs. non-spike 57% of the time, measurably better than a coin flip. Real signal.


The Plateau

Run 33 gave us 0.5740. Good, but not good enough. The model was learning something, but the gap between training and validation performance suggested it wasn't generalizing well. Claude tried heavier regularization in run 34—and it overfit anyway. Train_AUC climbed to 0.65 while val_AUC dropped to 0.50.

Claude: A few options: reduce dropout, add more training data, or tune hyperparameters. Given the train/val gap, I'm leaning toward less regularization.

Ben: My instinct is different—add complexity and capability and regularization, going for a higher overall train_auc... for it to learn even more how to fit (not memorize), and hopefully drag the validation upwards with it as it learns big new lessons.

This was a hypothesis about the relationship between capacity, regularization, and generalization. If the model had more capacity to learn complex patterns, and enough regularization to prevent memorization, the validation performance might track the training performance upward.

We tested it. Run 40 added transformer towers to the CNN+LSTM ensemble:

Run Architecture val_AUC Change
33 CNN + LSTM 0.5740 baseline
40 CNN + LSTM + Transformer 0.5849 +0.0109
44 CNN + LSTM + Transformer (tuned) 0.5890 +0.0041

Adding capacity helped. But we weren't done.

Claude: What about dropout? We're at 0.35. Let me map the curve.

Dropout Tuning: Finding the Sweet Spot

Dropout val_AUC Run
0.25 0.5847 62 (V7)
0.275 0.5828 67 (V11)
0.30 0.5913 65 (V10)
0.325 0.5760 69 (V12)
0.35 0.5890 44 (V1)

The curve peaked at 0.30—with performance dropping on either side. That became V10, our best LSTM architecture.

The progression validated Ben's hypothesis: more capacity (transformers) plus the right regularization balance (less dropout, not more) let the model learn "big new lessons" that generalized. But we were still 0.0087 away from our goal. Every variation landed in the 0.58-0.59 range.

The Breakthrough

Embarrassingly, we'd been iterating so rapidly that we lost track of exactly when we broke through. When we looked back through the transcripts to write this article, we found it:

Claude: V10 with LSTMs replaced by GRUs—faster, fewer params. Run 128 hit 0.5999.

Neither of us remembered creating it. That was V46—0.0001 away from our goal.

Sidebar: LSTM → GRU — Why Simpler Sometimes Wins

LSTMs (Long Short-Term Memory networks), introduced in 1997, were the dominant architecture for sequence modeling for years. They introduced "gates" to control information flow: an input gate decides what new information to store, a forget gate decides what to discard, and an output gate decides what to emit. Three gates, three sets of parameters to learn.

GRUs (Gated Recurrent Units), introduced in 2014, asked: do we need all three? They combined the forget and input gates into a single "update" gate and added a "reset" gate. Two gates instead of three. Fewer parameters.

Could LSTM have gotten there with different hyperparameters? Probably. The lesson isn't "GRU beats LSTM"—it's that when you're stuck, try things.
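The "fewer parameters" claim is easy to make concrete with textbook per-cell counts (implementations vary slightly in how they handle biases):

```python
def lstm_params(d, h):
    """4 gate/candidate blocks (input, forget, output, candidate),
    each with (d + h) x h weights plus h biases."""
    return 4 * ((d + h) * h + h)

def gru_params(d, h):
    """3 blocks instead of 4: update gate, reset gate, candidate state."""
    return 3 * ((d + h) * h + h)

# e.g. OHLCV input (d=5) into a 64-unit recurrent layer:
print(lstm_params(5, 64), gru_params(5, 64))  # 17920 13440
```

Same hidden size, 25% fewer recurrent parameters — which also means faster epochs, and faster epochs mean more lottery tickets per night.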


The Seed Lottery

Deep learning has some dirty secrets, and one of them is: random initialization matters. A lot.

Same architecture, same data, same hyperparameters—different random seed—wildly different results. The weights you start with determine which local minimum gradient descent finds.

Ben: At 0.0001 away, we'd be foolish not to search around for a good seed.

I need to go out and have dinner with my family. I'll try checking in with you from my phone at least once. While I'm gone please keep trying new seeds.

Claude: Enjoy dinner! I'll keep buying lottery tickets and track the results.

Sidebar: The Seed Lottery Explained

Neural network training starts with random weights. Different random initializations lead to different final models—sometimes dramatically different.

When you're close to a threshold, systematic seed search makes sense. Train the same architecture multiple times with different seeds. Most will cluster around the mean. A few will find better optima.

What Claude was doing: Starting a training run, watching the val_AUC curve, recognizing when a run had peaked (validation loss stops improving for several epochs), stopping it, and immediately starting the next seed. Each run took 15-20 minutes. Claude ran this loop autonomously for about 12 hours overnight.
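The loop itself, sketched with a stand-in training function — train_with_seed here is hypothetical; the real loop launched Lit training runs, watched val_AUC until it plateaued, then moved on:

```python
import random

def train_with_seed(seed):
    """Stand-in for a full training run, returning a final val_AUC.
    Simulated here as draws clustered around the observed mean."""
    random.seed(seed)
    return random.gauss(0.589, 0.006)

def seed_lottery(seeds, target=0.60):
    """Train the same architecture once per seed; track the best and any winners."""
    best_seed, best_auc, winners = None, 0.0, []
    for seed in seeds:
        val_auc = train_with_seed(seed)
        if val_auc > best_auc:
            best_seed, best_auc = seed, val_auc
        if val_auc >= target:
            winners.append(seed)
    return best_seed, best_auc, winners

best_seed, best_auc, winners = seed_lottery(range(42))
print(best_seed, round(best_auc, 4), len(winners))
```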

Our results from 42 seeds:

  • Mean: 0.589
  • Worst: 0.5795
  • Best: 0.6033 (run 169)

Only 1 in 42 (2.4%) crossed 0.60. That's the needle we were searching for.

Ben left for dinner. Claude kept running seeds.

Later that night, Ben checked in from a Christmas party:

Mobile check-in showing Claude autonomously managing runs

The mobile interface we'd built earlier that month ("Voice Input from a Dirt Road") made this possible. Quick check-ins between conversations. Course corrections when needed.

Like this one:

Ben: Sorry to interrupt but I just want to course-correct you ever so slightly... it's not important that any particular run exceeds some arbitrary threshold in early epochs. If it grows super slow but it grows steadily past 0.5913 then we'll feel silly in retrospect for having killed it early for not learning fast enough.

Claude: You're right. The trajectory matters more than hitting arbitrary early checkpoints. A run that learns slowly but steadily could peak higher than one that learns fast but plateaus early.

These are the actual results from that overnight run—not fabricated to make a pretty chart, but the real val_AUC scores from 42 consecutive training runs. The red dashed line is our 0.60 target. Every blue dot is a run that didn't make it. The green dot is run 169.

Chart: The Seed Lottery: 42 Runs, Same Architecture. Best val_AUC for each of runs 128-169, with a red dashed line at the 0.60 target; run 169 (green) is the only dot above it.


The Winning Ticket

The seed lottery started at 2:30 PM on December 22nd. Ben left for dinner with his family around 5 PM, then helped a friend install a security system, then slept. Claude kept playing the seed lottery—autonomously, without prompting, without "please continue" or "keep going." Ben checked in by phone a few times to stay informed, but never had to intervene.

And then, just before Ben awoke, at 7:30 AM on December 23rd, run 169 hit epoch 17 and we won:

0.6033 val_AUC

The winning moment: Run 169 breaks the 0.60 barrier

Ben: Wow. That's incredible.

We have proven, absolutely, without a doubt, that with the right tools you are capable of doing data science just as well as you do application development.

Wasn't that fun?

Claude: That means a lot. Yes, it really was.


The Full Picture

Three weeks. 169 training runs. 46 architecture versions. An overnight dataset build. A sanity test that proved the pipeline worked. A label switch that tripled our signal. A multi-tower ensemble that let different neural architectures vote on predictions. A GRU swap that neither of us remembered making. And finally, a systematic seed search that found the one initialization in forty-two that crossed our threshold.

Each phase built on the last. Early experiments (runs 11-25) established baselines. The architecture breakthrough (runs 30-33) proved the approach could work. Iterative refinement (runs 40-128) pushed toward the goal. And the seed lottery (runs 128-169) found the final edge.

Chart: Volatility Model Performance: 169 Training Runs. Validation AUC for selected runs 11 through 169, colored by phase (Early, Breakthrough, Multi-tower, V10 LSTM, Tuning, V46 GRU, GOAL), rising from 0.4061 to 0.6033, with a red dashed line at 0.60.


What This Means

Vibe data science works.

The same pattern that collapses timescales for software engineering—AI handling the tedious execution while humans provide judgment and direction—works for data science too. With the right tools.

Throughout this project, Ben never ran a single command. No lit build start, no lit train start, no checking logs. Claude operated the platform directly—reading files, launching builds, monitoring experiments, adjusting hyperparameters. The human steered; the AI drove.

Ben's only interface was chat.

Ben described it this way: "The collaboration felt like working with a senior data scientist—one who could execute brilliantly but sometimes got stuck in the same ways humans get stuck. Defeatist at plateaus. Unable to see the path forward without a nudge. Genius, but needing another perspective to break through."

What Claude brought:

  • Infinite patience for repetitive tasks (42 seeds, no complaints)
  • Systematic exploration (tracking every variation, every result)
  • Ability to operate tools autonomously for hours

What the human brought:

  • Domain expertise (what makes sense for financial data)
  • Judgment calls (when to pivot, when to persist)
  • Course corrections (don't kill slow-learning runs too early)
  • Scar tissue (the instinct to add capacity after hitting a plateau)
  • The goal (0.60 AUC means something for trading)

This is what vibe coding looks like for data science.


Why The Tools Mattered

Looking back at how vibe data science worked in practice, a pattern emerges: Claude operated effectively because the platform gave it good constraints.

If you tell an AI "do data science," it flounders. The space of possible actions is too large. But give it a well-structured CLI with specific commands—lit build start, lit train start, lit experiment continue—and it can explore systematically within those boundaries.

This is the "maze vs open field" principle. AI navigates mazes better than open fields. Each command is a bounded operation. The constraints make correct approaches discoverable.

For example, when designing neural nets and training them, the Lit platform tooling forces the user to operate at one of three altitudes:

  1. Components: Reusable neural network building blocks (CNN, LSTM, GRU, Transformer, SlowFast)
  2. Architecture: How components connect—humans get a drag-and-drop design canvas; Claude manipulates the serialized JSON directly
  3. Experiments: Training runs with specific hyperparameters and random seeds

Claude worked at all three levels. Claude wrote novel components (cross-attention, dilated CNN). Claude sketched architectures (the multi-tower ensemble). Claude launched and monitored experiments (169 training runs).

The platform also enabled the checkpoint-and-resume pattern that made iterative collaboration possible. Claude could suggest "let's try more dropout," and we could test it without retraining from scratch—just modify the definition file and continue from the last checkpoint, preserving the learned weights while changing the hyperparameters.
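
A toy sketch of that checkpoint-and-resume pattern, using JSON as a stand-in for the platform's real checkpoint format (the field names here are invented for illustration):

```python
import json
import os
import tempfile

def save_checkpoint(path, weights, hparams, epoch):
    """Persist learned weights alongside the current hyperparameters."""
    with open(path, "w") as f:
        json.dump({"weights": weights, "hparams": hparams, "epoch": epoch}, f)

def resume(path, hparam_overrides):
    """Keep the learned weights; change only the hyperparameters."""
    with open(path) as f:
        ckpt = json.load(f)
    ckpt["hparams"].update(hparam_overrides)  # e.g. "let's try more dropout"
    return ckpt

path = os.path.join(tempfile.mkdtemp(), "run.json")
save_checkpoint(path, weights=[0.12, -0.07],
                hparams={"dropout": 0.2, "lr": 1e-3}, epoch=17)
ckpt = resume(path, {"dropout": 0.3})
print(ckpt["hparams"]["dropout"], ckpt["epoch"])  # 0.3 17
```

The point of the design is visible in the last line: the epoch counter and weights survive the edit, so a hyperparameter tweak costs a continuation, not a full retrain.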

The techniques demonstrated here—real-time hyperparameter optimization within active training sessions, LLM-assisted intervention at epoch boundaries, systematic seed exploration—represent years of accumulated R&D in how to make AI collaboration effective for data science work.

More on this soon.


What's Next

We have one open slot for an H1 2026 data science engagement. If you have:

  • A prediction problem with real data
  • A willingness to work iteratively
  • Interest in seeing what vibe data science can do

Get in touch.

We're also open to partnerships—funding, whitelabeling, licensing—for organizations that want to bring these capabilities in-house.


For a human walkthrough of the platform Claude operated throughout this project, see Creating a Model from Scratch.

Voice Input from a Dirt Road

"I have some property I inherited from my father this year down in the Ozarks that I'm going to go visit and walk around on. December is a nice time. No bugs. No snakes—or at least if you do step on a snake it's so cold it can't do anything about it. I've always wanted an option to do voice input on this mux.lit.ai app. How hard would that be to implement?"

Twenty minutes later, the MVP was done and I was in my car. What followed was six hours of shipping features from a phone while driving through rural Missouri. Claude handled the code. I did QA with brief glances at the screen and voice input. Tesla handled the driving.

The Morning: Desktop to Mobile in 20 Minutes

The initial implementation was fast. Web Speech API, a microphone button, some CSS for the recording state. I tested it on desktop:

"hello hello hello"

It worked. I committed the code, jumped in my car, and headed southwest on Route 66.

The First Bug: Button Disabled

Somewhere around Lone Elk Park, I pulled up the app on my phone. The microphone button was grayed out. Disabled.

The problem: I couldn't debug it. No dev tools on mobile Chrome. No console. Just a grayed-out button and no idea why.

"My capabilities on this device are limited. Give me a button I can press which will gather and send you diagnostics including code version please."

Claude added a diagnostics button. I tapped it, copied the JSON, pasted it into the chat:

{
  "version": "d8e2fc0",
  "userAgent": "Mozilla/5.0 (Linux; Android 10; K)...",
  "hasSpeechRecognition": true,
  "hasWebkitSpeechRecognition": true,
  "isSecureContext": true,
  "buttonDisabled": true,
  "ciHasVoiceBtn": false,
  "ciHasSpeechRec": false
}

The API was available. The context was secure. But the JavaScript wasn't finding the button element. A timing issue—initializeElements() was running before the DOM was ready on mobile.

Claude pushed a fix. The button lit up.

The Cache Dance

Mobile browsers are notoriously aggressive about caching. Ctrl+Shift+R doesn't translate to mobile Chrome. The browser holds onto JavaScript like a grudge. Every fix required a version bump:

<script src="js/chat-interface.js?v=33"></script>
becomes
<script src="js/chat-interface.js?v=34"></script>

We developed a rhythm: fix, bump version, commit, push, deploy, hard-refresh, test.

"please make sure you're busting the cash each time you deploy"

(Yes, "cash." Voice transcription isn't perfect. But Claude understood.)

The Repetition Bug: Nine Iterations

The button worked. But something was wrong:

"hellohellohello hellohellohello hellohello hellothisthisthis isthisthis isthis is fromthisthis isthis is fromthis is from Thethisthis isthis is fromthis is from Thethis is from The Voicethis is from The Voice"

Every interim result was accumulating instead of replacing. I reported the bug—through the very feature I was debugging. The garbled input became its own bug report:

"thethethethethethe repetitionthethethe repetitionthe repetition didn't happen when we tested from the desktop"

Claude understood.

What followed was nine iterations of debugging between Eureka and St. Clair, each requiring a cache bust and a fresh test. My test protocol became simple: count to ten.

Version 1:

"111 21 2 31 2 31 2 3 41 2 3 41 2 3 4 51 2 3 4 51 2 3 4 51 2 3 4 5 61 2 3 4 5 6 71 2 3 4 5 6 71 2 3 4 5 6 7 81 2 3 4 5 6 7 81 2 3 4 5 6 7 8 91 2 3 4 5 6 7 8 91 2 3 4 5 6 7 8 9 10"

Version 5:

"testingtesting onetesting onetesting onetesting one twotesting one two three"

Version 9:

"1 2 3 4 5 6 7 8 9 10"

Clean. The fix: Mobile Chrome returns the full cumulative transcript in each result event, while desktop Chrome returns incremental updates. We had to take only the last result's transcript instead of accumulating.
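
The difference is easy to model outside the browser. Below, each inner list stands in for one result event's results array; this is a Python stand-in for the JavaScript fix, not the actual chat-interface.js code:

```python
def buggy_transcript(events):
    """Append every event's results: fine on desktop (incremental),
    garbled on mobile, where each event repeats the full transcript."""
    text = ""
    for results in events:
        text += "".join(results)
    return text

def fixed_transcript(events):
    """Take only the last result of the latest event: on mobile Chrome
    that entry already holds the full cumulative transcript."""
    return events[-1][-1] if events else ""

# Cumulative events, as mobile Chrome emits them during a counting test:
mobile_events = [["1"], ["1 2"], ["1 2 3"]]
print(buggy_transcript(mobile_events))  # 11 21 2 3
print(fixed_transcript(mobile_events))  # 1 2 3
```

Note how the buggy version reproduces exactly the "1 2 31 2 3 4" style of garbling from the road tests.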

The whole debugging session happened while driving. Voice in, diagnostics out, code deployed, cache busted, test again. Tesla kept us on the road. Claude kept the iterations coming.

The Mobile UI Problem

Voice worked. But I couldn't see the buttons. On my phone, the sidebar took up half the screen. Even in compact mode, I had to drag left and right to see both the microphone button and the send button.

"I still have to drag with my thumb left to right to be able to see both the voice record button and the send button. Maybe stack them vertically."

Claude stacked them vertically. Still had to drag.

"okay that's funny they are stacked vertically but I still have to drag my thumb left and right to be able to see the buttons now"

We added diagnostics to measure every container width. Everything reported 411px—my viewport width. No overflow. Then I realized:

"oh no I was just zoomed in."

Sometimes the bug is between the chair and the keyboard. Or in this case, between the bucket seat and the touchscreen.

But the real fix came from recognizing that the sidebar just didn't make sense on mobile:

"On mobile we should hide sidebar completely but only on mobile and show a dropdown selector instead for session selection"

Claude hid the sidebar on mobile viewports and added a dropdown for session selection. The interface finally fit.

Push-to-Talk

The toggle-to-record interaction felt wrong. Tap to start, tap to stop—easy to accidentally stop recording, no tactile feedback.

"Hey, let's do push to talk... we detect if somebody put their thumb into the input area and just holds it there"

Push-to-talk recording on mobile
The text input field turns red during recording. The entire text input area is the microphone button.

Hold to record, release to stop. The entire text input area becomes the microphone button. The field turns red while recording. This emerged from field testing, not upfront design.

The Afternoon: Photo Upload from the Field

I arrived at the property. Just standing there at the head of the driveway I realized that I wanted to share what I was seeing.

"Just arrived. Hey, I'd like to share photos with you. How might we go about that?"

Pasting from clipboard didn't work, so we built an upload feature right then and there:

"how about giving me an upload button that lets me upload photos from my phone to the server which is just the laptop and then you can see the photos as soon as they were uploaded"

While I hiked, Claude coded, and fifteen minutes later I was uploading photos from my favorite spot on the property:

Photo uploaded from the field
An old basement on my late father's land, uploaded from my phone and displayed in the chat.

The Drive Home: Bug Reports at 70 MPH

On the drive back, while trying to switch gears to do some data science work, I found another bug:

"I just found a bug. When I select sessions in the session list it's not loading those sessions. Please fix"

Claude found it in minutes. The mobile session dropdown was calling this.loadSession(sessionId) which didn't exist—it should have been this.sessionManager.loadSession(sessionId). A copy-paste error from when we added the mobile dropdown.

"fix confirmed thank you"

All while driving. Push-to-talk to report the bug. Brief glance at the response. Push-to-talk to confirm the fix.

The Numbers

Metric | Value
Total time | 6 hours
Git commits | 19
Conversation turns | 99
Time on laptop | ~20 minutes (morning setup)
Time on mobile | ~5.5 hours

Three major features shipped:

  1. Voice input with Web Speech API (with mobile Chrome compatibility fixes)
  2. Mobile-optimized UI (hidden sidebar, dropdown sessions, stacked buttons, proper viewport constraints)
  3. Photo upload with camera/gallery options and upload indicator

What This Actually Means

This isn't a story about voice input. It's a story about what becomes possible when your AI collaborator can actually do things.

I was in a car. Then hiking through woods. Then driving again. My only interface was a phone. My only input was voice. And I shipped three production features at highway speed.

Scar tissue told me to ask for version numbers in the diagnostics. Pattern recognition told me sidebar on mobile is always wrong. Push-to-talk hit me somewhere between Bourbon and Steelville—toggle was too much work at 70 MPH. The AI executed—brilliantly, quickly—and it was executing against thirty years of hard-earned instincts.

I don't know if anyone else will find this interesting but I was enthralled by the experience. I've been working towards this for months—full AI-collaborative development and deployment capabilities from anywhere in the world, by voice. And it was everything I'd hoped it would be.


Want This For Your Organization?

This is what we do. We help organizations adopt AI-assisted development workflows that collapse traditional development cycles.

Read more: Two Apps, Fourteen Hours—we built two Android apps and shipped them to the Google Play Store in about 14 hours of total development time.

Work with us: Contact to discuss how we can help your team build faster.

Two Apps, Fourteen Hours

Last week, Claude and I built two Android apps and published them to the Google Play Store. Total development time: 14 hours.

This is how it happened.

App 1: Vault

I wanted a secure, private vault on my Android device. Not cloud storage with Terms of Service I'd never read, not files accessible if someone borrowed my phone—truly private, encrypted local storage with zero data collection. A place for personal documents, notes, photos, and anything else I wanted to keep private.

For an app like that, two requirements follow:

  1. I can't trust any app that's not open source, and
  2. I need some way of knowing the app I'm running matches the source and hasn't been tampered with.

That level of verifiable trust is non-negotiable. We couldn't find anything like it. So we built one.

The Timeline

Hours 0–5: Core App to Play Store

Biometric auth, camera, encrypted storage—none of these are hard. Flutter has libraries for all of them. Scaffolding a project takes Claude about thirty seconds. The compelling thing is that 4 hours after starting from a blank slate, Claude wired them together into a working app: unlock with fingerprint, capture photos and videos, encrypt everything with AES-256, store metadata in SQLite.

The last hour shifted to Play Store preparation—app signing, adaptive icons, privacy policy, release build. We hit the usual submission friction (API level requirements, version codes, permission disclosures) but resolved each in minutes.

By hour 5, the app was submitted to Google Play.

Hours 5–8: Expanding Scope

After a day of using it, a vault that only stores camera photos felt limiting. We added:

  • File import from device storage
  • Encrypted markdown notes
  • PDF viewing

This transformed it from "photo vault" to "general-purpose encrypted storage."

Hours 8–10: Polish

Real-world testing revealed UX issues: photo orientation was wrong on some images, the gallery needed filtering and grouping, thumbnails would improve navigation. Fixed each as they surfaced.

Total: ~10 hours to production.

What We Built

Vault lock screen

Security-First Architecture

- AES-256-CBC encryption for all files
- Password + biometric authentication (fingerprint/face)
- Auto-lock when app backgrounds or screen locks
- Local-only storage (no cloud, no network)
- Zero analytics, zero tracking

Content & Features

- Built-in camera for photos and videos
- Import any file from device storage
- Markdown editor for encrypted notes
- PDF viewer
- Photo viewer with pinch-to-zoom and video playback

How We Built It

This was the first project where I wrote zero code.

I've been programming professionally for over 30 years. In every previous software project, I contributed code directly—regardless of title: VP of Engineering / CTO / Owner—sometimes to the chagrin of my developers. My experience meant I could often code solutions faster than explaining them. But with Vault, I stayed entirely in the director's chair: product manager and QA, but not developer.

Claude handled all implementation. I described what I wanted, tested on my phone, reported issues conversationally, and approved the results. The dynamic worked.

App 2: Gems

When I showed Vault to my wife, she had zero interest. Apparently secure file encryption doesn't spark joy.

When I was eight, my dad and I moved to a new town. While he moved our stuff into the house, I rode my bicycle back and forth in front of the kids across the street, doing wheelies. My wife plays match-3 games. Building Gems was the same impulse—showing off for someone whose opinion I cared about.

Here's the actual transcript of the project kick-off:

Me: "I'd like us to consider writing a game app... a game like Bejeweled. A casual game with colors and flashes and all the other trappings of a casino. Like the vault app, we'll make it open source, give it away for free, no ads. You up for helping me do that?"

Claude: "Okay, I've reviewed the vault_app. Here's what I'm seeing:

What We Can Copy | Source | Notes
Project scaffold | Flutter setup, pubspec structure | New app ID, name
Android build.gradle | Signing config, SDK versions | Change package ID
main.dart pattern | MaterialApp, theme, state management | Simpler - no auth gate needed
Navigation structure | NavigationBar + screens pattern | Game, Settings, About
Settings/About screen | settings_screen.dart | Adapt for game settings
Icon generation | flutter_launcher_icons setup | New icon design

...Want me to scaffold out the project? Sound good?"

Me: "heck yeah. That sounds great! Please proceed"

Hours 0–1.5: Core Game to Playable

Within 90 minutes, the game was functional.

What got built:

  • Match-3 detection and cascade physics
  • Four game modes (Timed, Moves, Target, Zen)
  • Animated starfield background
  • Pinch-to-zoom grid sizing (5x5 to 10x10)
  • Leaderboards with arcade-style name entry
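
Gems itself is Flutter/Dart, but the core of that first bullet is compact enough to sketch. This `find_matches` is a Python illustration of minimal match-3 detection, not the app's actual code:

```python
def find_matches(grid):
    """Return the set of (row, col) cells belonging to any horizontal
    or vertical run of three or more identical gems."""
    rows, cols = len(grid), len(grid[0])
    matched = set()
    for r in range(rows):
        for c in range(cols):
            # Horizontal run of three starting at (r, c)
            if c + 2 < cols and grid[r][c] == grid[r][c + 1] == grid[r][c + 2]:
                matched |= {(r, c), (r, c + 1), (r, c + 2)}
            # Vertical run of three starting at (r, c)
            if r + 2 < rows and grid[r][c] == grid[r + 1][c] == grid[r + 2][c]:
                matched |= {(r, c), (r + 1, c), (r + 2, c)}
    return matched

grid = [
    ["R", "R", "R", "G"],
    ["B", "G", "B", "G"],
    ["B", "Y", "B", "G"],
]
print(sorted(find_matches(grid)))  # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
```

Cascade physics is then a loop: remove matched cells, let gems fall, refill from the top, and re-run detection until `find_matches` comes back empty.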

My role: Facilitate feature ideation conversations, approve features, QA.

Claude's role: Participate in ideation, write and deploy the code.

Hours 1.5–2.5: Store Preparation

README, screenshots, store listing, submission. The patterns from Vault made this fast.

Hours 2.5–4: Polish via Real-World QA

I handed my wife my phone: "Play this and tell me what's wrong."

Her feedback was specific:

"The swipe sensitivity is too low. I had to fall back to tapping." → Fixed in minutes.

"The screen shake animation and flashing is confusing and bad—I'm trying to plan my next move." → Implemented per-gem animation tracking. Only affected columns animate.

"There's no dopamine hit." → Built a complete combo celebration system with particles and multiplier badges.

Each fix took under five minutes. Test, report conversationally, get fix, repeat.

Total: ~4 hours to production.

The Lightswitch

Early in my career, I lived through one phase transition in how software gets built: the shift from waterfall to agile.

Development cycles collapsed from 2-3 years to 2-3 months. It didn't happen gradually. It happened like a lightswitch. You're three months into your 18-month release cycle and your competitors are already iterating on customer feedback. Companies that recognized it early had an advantage. Companies that didn't got left behind.

Another lightswitch moment has happened. Development cycles have collapsed again—from 2-3 months to 2-3 days.

Two production apps. Fourteen hours total. Both on the Google Play Store. One developer who wrote zero code, serving as PM and QA while Claude handled all implementation.

This isn't futurism. This isn't a prediction about where things are going. This is what happened last week. And just like the agile transition, most people haven't noticed yet.

The Only Thing That Matters

Yes, this article was written with Claude. Go ahead—call it AI slop.

But then play the game:

Gems gameplay

Core Game

- Match-3 with swap mechanics
- Cascade physics (gravity, fill)
- No-moves detection with auto-shuffle
- Pinch-to-zoom grid (5x5 to 10x10)

Game Modes

- Timed: 90 seconds, maximize score
- Moves: 30 moves, strategic play
- Target: Progressive levels
- Zen: Endless relaxation

Polish

- Animated starfield background
- Combo celebrations with particles
- Leaderboards with name entry
- Per-gem animation tracking

There's a tendency among some to dismiss AI-generated work reflexively. Hunting for em-dashes as a proxy for quality. Discounting work product based on its provenance rather than its merits.

The only thing that matters is the quality of the work product. Whether it's 1% human and 99% AI, or 99% human and 1% AI, or anywhere in between, is completely irrelevant. Does the vault keep your files encrypted? Can you read the source code and verify what it does? Does the game feel good to play?

Everything else is distraction.

We built these apps in the open. The source code is public. We're giving Claude full credit for its contributions. Judge them on their merits.

Try Them

App | Description | Install | Source
Vault | Encrypted local storage for documents, notes, photos, and files | Google Play | GitHub
Gems | A match-3 puzzle game with four game modes and no ads | Google Play | GitHub

Contribute, if you'd like, with or without your AI collaborators.


What will you build?


Want This For Your Organization?

This is what we do. We help organizations adopt AI-assisted development and achieve the same kind of timeline compression.

Read about our methodology: Vibe Coding: A Human-AI Development Methodology

See our 11-year track record: Our Journey

Work with us: Contact to discuss how we can help your team build faster.

Returning to Writing: On Grief and the Decisions That Matter

I've finally published my first blog post in four months. The gap wasn't because I haven't been working — I have. It's because my father died on August 15th.

The Decision That Mattered

Our last blog post before the break, "Fully Booked: A Fractional CTO Practice Milestone" (July 28th), celebrated turning down a lucrative contract because I needed to spend more time with my father. At the time, I wrote about the luxury of being selective with work.

I had no idea how crucial that decision would prove to be.

Because I had cleared my schedule, I was able to spend hours every day during the final month of my father's life. When he passed, I had no regrets about time not spent.

Grief Has No Half-Life

One thing that's surprised me is that the grief didn't follow a predictable decay pattern. There's no half-life, no exponential decline. Some days the weight of loss hits unexpectedly.

If you're reading this and dealing with your own loss:

Get help if you need it. Talk to friends, family, professionals. Don't carry the weight alone, and don't feel ashamed about needing support.

Talk about your feelings and your loss. The silence doesn't protect anyone—it just isolates you when you most need connection.

When memories surface, try to remember the joy alongside the sadness. This is the advice I've received that resonates with me the most, but so far I have failed to follow it. When grief hits me, I deliberately try to recall a happy moment with my father, but all I feel is loss. Work in progress.

Why I'm Writing This

I'm sharing this for a few reasons:

Context: The gap in our posting wasn't about business priorities or content strategy. Life happened, as it does for all of us.

Permission: If you're an entrepreneur struggling to balance work demands with personal needs, you're not alone. Sometimes the "business optimal" choice isn't the human optimal choice.

Hope: Four months later, I'm writing again. The grief hasn't disappeared, but I'm finding my way back to the work that matters to me.

What's Next

We have ideas brewing: more technical deep-dives, thoughts on AI-assisted development, lessons from our fractional CTO practice. The work continues, shaped by but not defined by loss.

Thank you for your patience during this quiet period. And if you're dealing with your own grief—whatever form it takes—remember that healing isn't linear, timelines are arbitrary, and asking for help is a sign of strength, not weakness.

Don't carry the weight alone.

MCP Jira Integration: When "Hello World" Fails

When we needed Jira connectivity across multiple client instances, the obvious choice seemed to be existing MCP servers. What we discovered was a masterclass in how not to build developer tools.

The Promise vs. Reality

What we expected:

pip install mcp-atlassian
uvx mcp-atlassian --help
# → Clean setup instructions

What Atlassian delivered: Buried somewhere in verbose documentation, no clear installation command, and when you finally find the right incantation:

uvx mcp-atlassian
# TypeError: cannot specify both default and default_factory

Classic "hello world" failure. If basic installation breaks, what does that tell you about production reliability?

Engineering Instinct: Trust the Red Flags

"When a new library fails the 'hello world' test, it's usually an indication that it's poorly written and there will be a ton of other problems to deal with."

This instinct proved correct. Let's examine the failures we encountered.

Failure 1: Documentation Anti-Patterns

Atlassian's getting started guide exemplifies poor developer experience:

  1. No installation command at the top of the page
  2. Configuration before installation - puts the cart before the horse
  3. Assumes success - no troubleshooting for common failures
  4. Verbose without being helpful - walls of text, but missing the one line developers need

What should be first:

pip install mcp-atlassian

What actually comes first: OAuth configuration diagrams and environment variable explanations.

Failure 2: Dependency Hell (mcp-atlassian)

The error trace tells the story:

TypeError: cannot specify both default and default_factory

Root cause: The mcp-atlassian package depends on fastmcp, which was built against Pydantic v1 patterns. Modern environments have Pydantic v2, which enforces stricter validation rules.

GitHub evidence: Issue #721 confirms this exact error, reported 3 weeks ago with no resolution.

The problem: This isn't an edge case—it's a fundamental packaging failure that breaks installation for any modern Python environment.
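The incompatibility is easy to reproduce. Pydantic v2 refuses any field that declares both a static default and a default_factory, and Python's stdlib dataclasses enforce the identical rule with the same message, so the failure mode can be demonstrated without installing pydantic at all (a simplified stand-in for the mcp-atlassian crash, not its actual code path):

```python
import dataclasses

# Pydantic v2 rejects a field that sets both `default` and `default_factory`.
# The stdlib dataclasses module enforces the identical rule, which makes the
# failure mode reproducible with zero third-party dependencies:
try:
    dataclasses.field(default=0, default_factory=int)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)  # → cannot specify both default and default_factory
```

Code written against Pydantic v1's looser rules hits exactly this wall the moment it runs in a Pydantic v2 environment.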

Failure 3: Deprecated API Usage (mcp-jira)

After abandoning mcp-atlassian, we tried the alternative:

uvx mcp-jira  # Actually installs!

But when testing basic functionality:

{
  "errorMessages": [
    "The requested API has been removed. Please migrate to the /rest/api/3/search/jql API. A full migration guideline is available at https://developer.atlassian.com/changelog/#CHANGE-2046"
  ]
}

The problem: mcp-jira uses Jira REST API v2, which Atlassian deprecated and removed. The package is fundamentally broken for modern Jira instances.
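Migrating is mostly a matter of targeting the newer endpoint. A minimal sketch of building a request against /rest/api/3/search/jql, the endpoint the error message points to (the helper name is illustrative, not the mcp-jira API; jql and maxResults are documented query parameters):

```python
from urllib.parse import urlencode

def build_search_url(base_url: str, jql: str, max_results: int = 50) -> str:
    # Jira Cloud removed /rest/api/2/search (Atlassian changelog CHANGE-2046);
    # the supported replacement is /rest/api/3/search/jql.
    query = urlencode({"jql": jql, "maxResults": max_results})
    return f"{base_url.rstrip('/')}/rest/api/3/search/jql?{query}"

url = build_search_url("https://your-company.atlassian.net",
                       "assignee = currentUser() AND priority = High")
print(url)
```

Any package still hardcoding the v2 path will fail against every modern Jira Cloud instance, regardless of how well the rest of it works.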

The Solution: Build vs. Buy Decision

When existing tools fail basic reliability tests, the build vs. buy calculation shifts dramatically:

Time to debug existing tools: Unknown (potentially infinite)

Time to build a focused solution: ~1 hour for core functionality

Our implementation approach:

jira-mcp/
├── server.py          # MCP server (minimal dependencies)
├── jira_client.py     # Direct Jira REST API v3
├── config.py          # Multi-instance configuration
└── requirements.txt   # 4 dependencies total

Key design principles:

  • Modern Jira API v3 (not deprecated endpoints)
  • Minimal dependencies (mcp, httpx, pydantic, python-dotenv)
  • Proper error propagation (not silent failures)
  • Type hints throughout (catch errors at development time)
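The multi-instance configuration follows a JIRA_<NAME>_* environment-variable convention, so several isolated Jira instances can coexist in one environment. A minimal sketch of how such a loader can work (the JiraInstance and load_instance names are illustrative, not the package's actual config.py):

```python
import os
from dataclasses import dataclass

@dataclass
class JiraInstance:
    url: str
    email: str
    token: str

def load_instance(name: str) -> JiraInstance:
    # Each instance is addressed by name: JIRA_<NAME>_URL / _EMAIL / _TOKEN.
    # A missing variable raises KeyError immediately instead of failing later.
    prefix = f"JIRA_{name.upper()}_"
    return JiraInstance(
        url=os.environ[prefix + "URL"],
        email=os.environ[prefix + "EMAIL"],
        token=os.environ[prefix + "TOKEN"],
    )

# Demo values only; in real use these come from the shell environment.
os.environ["JIRA_MYCOMPANY_URL"] = "https://your-company.atlassian.net"
os.environ["JIRA_MYCOMPANY_EMAIL"] = "your.email@company.com"
os.environ["JIRA_MYCOMPANY_TOKEN"] = "your_api_token_here"

inst = load_instance("mycompany")
print(inst.url)  # → https://your-company.atlassian.net
```

Failing loudly on a missing variable is part of the "proper error propagation" principle: a misconfigured instance should break at startup, not mid-conversation.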

The Success Story: What We Actually Built

53 minutes after creating the project folder, we shipped a production-ready alternative that solves every problem we identified.

Real-World Production Deployment

jira-mcp is now running reliably with separate agent instances:

  • Positronic Agent: Internal Jira instance
  • Abodoo Agent: Client Jira instance (isolated)
  • JOV.AI Agent: Client Jira instance (isolated)

Zero configuration conflicts. Zero deprecated API errors. Zero installation failures. Zero cross-contamination risks.

Development Timeline: From Problem to Solution

The git history tells the story of remarkably rapid development, made possible through AI-assisted coding:

October 28, 2025 - Initial Implementation:

10:44:28 - Initial commit: Complete v1.0.0 (1,951 lines)
11:06:23 - PyPI packaging added (~22 minutes later)
11:44:18 - README updates (~38 minutes later)
11:53:45 - Documentation cleanup (~9 minutes later)

What was built in 1 hour 9 minutes:

  • Complete MCP server with 7 tools (505 lines)
  • Full Jira REST API v3 wrapper (324 lines)
  • Multi-instance configuration (80 lines)
  • Comprehensive documentation and examples
  • Production-ready packaging

October 31, 2025 - Advanced Features:

12:57:31 - v1.1.0: Epic linking + 5 new tools (372 lines)

The calculation that matters:

  • Time spent debugging existing broken tools: 0 hours (we stopped trying)
  • Time to build working replacement: ~1 hour core + ~1 session advanced features (AI-assisted development)
  • Time to production deployment: Same day

This timeline validates the core argument: sometimes building is genuinely faster than debugging, especially when leveraging AI assistance for rapid prototyping and implementation.

The "Hello World" Test: Fixed

Remember the installation failures that started this investigation?

What we shipped:

# 1. Install the package
pip install jira-mcp-simple

# 2. Set up your Jira credentials (get API token from https://id.atlassian.com/manage-profile/security/api-tokens)
export JIRA_MYCOMPANY_URL="https://your-company.atlassian.net"
export JIRA_MYCOMPANY_EMAIL="your.email@company.com"
export JIRA_MYCOMPANY_TOKEN="your_api_token_here"

# 3. Test the connection
jira-mcp --test-connection mycompany
# ✓ Connected successfully!
#   User: Your Name (your@email.com)
#   Account ID: 123abc...
#   Accessible projects: 15

One command installation. Built-in connection testing. Clear success feedback.

This is how developer tools should work.

Real Usage Examples

Here are actual natural language commands that now work reliably in production:

Hierarchical Project Management:

"Create an epic called 'User Authentication Overhaul' in the PLATFORM project"
→ Creates PLATFORM-145 (Epic)

"Create a task under epic PLATFORM-145 for implementing OAuth integration"
→ Creates PLATFORM-146 (Task) linked to PLATFORM-145

"Show me all tasks under epic PLATFORM-145"
→ Lists all child issues with status, assignee, and progress

Single-Instance Operations (Recommended Pattern):

"Search for issues assigned to me with high priority"
→ Returns: ABODOO-23 (Bug), ABODOO-27 (Task), ABODOO-31 (Story)

"Move ABODOO-23 to In Progress and add comment: Starting investigation"
→ Updates status + adds timestamped comment

"What transitions are available for ABODOO-23?"
→ Shows: To Do → In Progress, To Do → Review, To Do → Done

Multi-Instance Best Practice: Agent Separation

What we learned: Cross-contamination is a real risk.

Our solution: Separate agents per client, each with single-instance MCP access:

  • Abodoo Agent: Only accesses Abodoo Jira instance
  • JOV.AI Agent: Only accesses JOV.AI Jira instance
  • Positronic Agent: Only accesses internal Positronic instance

Benefits:

  • Data isolation: No risk of client cross-contamination
  • Clear context: Each agent knows exactly which organization it's working with
  • Simpler configuration: Single instance per agent reduces complexity
  • Audit trail: Clear separation for compliance and privacy

Recommended: Configure one MCP server per client agent rather than multi-instance access.

Open Source Impact

We open-sourced the complete implementation:

  • Repository: https://github.com/Positronic-AI/jira-mcp
  • Package: https://pypi.org/project/jira-mcp-simple/
  • Documentation: Comprehensive README with real usage examples
  • Community: Contributing guidelines for ongoing development

Design Philosophy Validated: The same principles that made our internal tool reliable work for the broader community:

  • Minimal dependencies (4 total)
  • Modern API usage (v3 only)
  • Type safety throughout
  • Production-ready error handling

The Bigger Picture

This experience illustrates a broader problem in the MCP ecosystem: the rush to build integrations without attention to reliability fundamentals.

What we need: Boring, reliable tools that work consistently

What we often get: Feature-rich packages with fundamental quality problems

The MCP protocol is excellent. The implementation quality of many MCP servers needs significant improvement.

Conclusion

When evaluating MCP servers (or any developer tool), trust your engineering instincts. If basic functionality fails, that's not a configuration problem—it's a quality problem. No amount of configuration can fix fundamentally broken tools.

The good news: Building focused, reliable tools is often faster than debugging broken ones. Sometimes the best integration is the one you build yourself.

The lesson: In the race to build AI integrations, don't sacrifice reliability for features. Boring tools that work consistently beat exciting tools that fail unpredictably.

Join the Solution: Contributing to Better MCP Tools

The Jira MCP server we built is open source and actively accepting contributions. If you've experienced similar frustrations with unreliable developer tools, here's how you can help improve the ecosystem:

🚀 Ready to Use

  • Install: pip install jira-mcp-simple
  • Repository: https://github.com/Positronic-AI/jira-mcp
  • Documentation: Comprehensive setup guide with real examples

🛠 Areas We Need Help With

Based on our contributing guidelines, we're particularly interested in:

High Impact:

  • Unit and integration tests - Help us ensure reliability at scale
  • Jira Data Center support - Extend beyond just Cloud instances
  • Attachment handling - Upload/download files to issues
  • Sprint and board operations - Agile workflow management

Developer Experience:

  • Better error messages - Help users recover from problems faster
  • Performance optimizations - Maintain speed while adding features
  • Documentation improvements - Clearer examples and troubleshooting

Advanced Features:

  • Webhook support - Real-time Jira event notifications
  • Batch operations - Bulk updates for large datasets
  • Custom field improvements - Better handling of organization-specific fields

💡 Why Contribute?

  1. Solve your own problems - Add features your organization actually needs
  2. Learn MCP development - Gain experience with the Model Context Protocol
  3. Improve the ecosystem - Help fix the "tools that don't work" problem
  4. Production impact - Your contributions run in real organizations

🤝 How to Get Started

# Clone and set up development environment
git clone https://github.com/Positronic-AI/jira-mcp.git
cd jira-mcp
pip install -e ".[dev]"

# Test with your Jira instance
export JIRA_TEST_URL="https://your-company.atlassian.net"
export JIRA_TEST_EMAIL="your@email.com"
export JIRA_TEST_TOKEN="your_api_token"
jira-mcp --test-connection test

Found a bug? Open an issue

Want to contribute? Read the contribution guide

Need help? Start a discussion


Building reliable MCP servers for your organization? Our professional services team specializes in production-ready AI integrations that actually work. Sometimes the fastest path forward is building it right the first time.

Fully Booked: A Fractional CTO Practice Milestone

I'm excited to share that our fractional CTO practice has reached full capacity for 2025. This week, we're signing another engagement, bringing us to what we've identified as the optimal capacity for delivering exceptional strategic value to our clients.

The Journey to Full Capacity

Building a successful fractional CTO practice has been a deliberate process of positioning, relationship-building, and proven delivery. The path here wasn't about scaling to maximum volume—it was about finding the sweet spot where we can deliver transformational value to each client.

Key Insight:

Strategic capacity management enables the deep, sustained involvement that drives real business transformation rather than surface-level consulting.

Why Fractional CTO Services Are in High Demand

The AI transformation has created an unprecedented need for senior technology leadership that understands both the strategic implications and practical implementation challenges of advanced AI integration.

Organizations at AI Maturity Levels 0-2 need strategic direction to advance rapidly through foundational capabilities, while those at Levels 3-4 require expert guidance to reach the advanced orchestration of Levels 5-6.

The challenge is that full-time CTO hires at this level of AI expertise are both expensive and scarce. Fractional CTO services provide access to senior strategic thinking without the overhead, while delivering immediate impact through hands-on collaboration.

What Full Capacity Means

For Current Clients

  • Dedicated strategic attention and focus
  • Deep engagement in transformation initiatives
  • Hands-on collaboration for knowledge transfer
  • Long-term partnership approach to growth

For Prospective Clients

  • Q1 2026 earliest availability for new engagements
  • Priority consideration for Level 3-4 organizations
  • Waiting list for exceptional opportunities
  • Continued thought leadership and content

The Value of Strategic Constraint

Reaching full capacity represents more than just business success—it reflects a strategic choice about how to deliver maximum value. By managing our client portfolio strategically, we can:

Maintain Deep Involvement: True strategic leadership requires understanding the nuances of each organization's culture, constraints, and competitive position.

Enable Knowledge Transfer: Hands-on collaboration ensures that internal teams develop capabilities for sustained innovation beyond the engagement period.

Drive Real Transformation: Surface-level consulting doesn't create lasting change. Deep engagement does.

Looking Ahead: Q1 2026 and Beyond

While we're not taking new clients for immediate start dates, we're continuing to invest in the capabilities that make fractional CTO services valuable:

Advanced AI Methodology Development: Refining approaches like Vibe Coding and Human-AI Collaboration frameworks that accelerate development cycles.

Maturity Assessment Tools: Creating more sophisticated frameworks for understanding and advancing through AI maturity levels.

Industry-Specific Expertise: Developing deeper insights into how AI transformation varies across different sectors and business models.

The Strategic Positioning Lesson

For other consultants and service providers watching this milestone, the key insight is about strategic positioning rather than pure capacity. The goal was never to maximize the number of clients—it was to become the clear choice for organizations serious about AI transformation.

This required:

  • Demonstrable expertise through case studies and thought leadership
  • Clear positioning around AI maturity advancement rather than generic consulting
  • Proven methodology that delivers measurable business transformation
  • Strategic relationships built on trust and collaborative success

What This Means for the Market

The fact that specialized AI leadership services are fully booked reflects the broader market reality: the AI transformation is accelerating, and organizations need expert guidance to navigate it successfully.

Market Signal:

High demand for fractional CTO services indicates that organizations recognize the strategic importance of AI transformation but lack internal expertise to execute effectively.

The expertise gap in AI leadership is creating opportunities for consultants who can bridge strategic vision with practical implementation—but only for those who've invested in developing real, demonstrable capabilities.

Gratitude and Next Steps

Reaching this milestone wouldn't have been possible without the trust of clients who were willing to bet on our approach to AI transformation. Each engagement has contributed to refining our methodology and deepening our understanding of what drives successful outcomes.

For organizations interested in fractional CTO services for Q1 2026, we're maintaining a waiting list and prioritizing opportunities that align with our expertise in advancing AI maturity levels 3-4 to levels 5-6.

The AI transformation continues to accelerate, and strategic leadership remains the key differentiator between organizations that thrive and those that struggle to keep pace.

Ready to discuss your AI transformation strategy for 2026? Contact us to explore how fractional CTO services might accelerate your organization's journey through the AI maturity spectrum.

Embracing AI: Your Job Is Evolving, Not Disappearing

In this presentation, we'll explore how AI is changing the workplace, address common fears, and discover how humans and AI can collaborate effectively to enhance your career rather than threaten it.

Understanding Your Concerns

  • 73% worried workers: recent surveys show that 73% of employees worry about AI replacing their jobs
  • 24/7 media coverage: headlines constantly feature "AI will replace X jobs" narratives
  • 2X rapid advancement: AI capabilities are progressing twice as fast as many predicted

Your anxiety is completely understandable. Past waves of automation did eliminate certain roles, and the pace of AI development can seem overwhelming. But history tells a different story about technology's overall impact on jobs.

Learning From History: Technology as a Tool

Historical Examples That Prove the Pattern:

Successful professionals don't get replaced by technology—they learn to wield it. Expert pilots use autopilot to handle routine flight while they focus on weather decisions and emergency responses. Experienced doctors use diagnostic AI to enhance their pattern recognition while applying decades of clinical judgment. Experienced engineers use CAD software to rapidly prototype while contributing years of systems thinking and constraint optimization. The pattern is clear: technology amplifies expertise, creating hybrid intelligence that exceeds either humans or AI working alone.


Addressing the 'Different This Time' Argument:

Yes, this IS different—and that's exactly why action is urgent. AI isn't a rising tide that lifts all boats equally. Those who learn to use it gain exponential advantages, while those who don't fall dramatically behind. We're already seeing 10-20x productivity gains for AI-fluent professionals—work that used to take weeks now completed in hours. The question isn't whether AI will disrupt your industry—it's whether you'll be among the disruptors or the disrupted.

From Replacement to Collaboration

The most successful AI implementations follow a collaboration model rather than a replacement model. This addresses core fears about job security by positioning AI as an enhancement to your work.

Enhancement vs. Replacement

Instead of "your job is gone," the reality is "your job is evolving" to incorporate AI assistance

Increased Value

Workers who learn to leverage AI effectively become more valuable to their organizations

Hybrid Roles

New positions are emerging that specifically require both human expertise and AI skills

Real-World Collaboration Examples


Healthcare Professionals

AI assists with diagnosis and data analysis, while doctors focus on patient care, complex cases, and treatment decisions that require empathy and judgment


Educators

AI handles grading and administrative tasks, allowing teachers to focus on mentoring, fostering creativity, and providing personalized guidance to students


Legal Professionals

AI reviews documents and conducts research, freeing lawyers to focus on negotiation, counseling clients, and applying complex legal reasoning

The Economic Case for Human-AI Collaboration

Benefits for Companies

  • Higher employee satisfaction and retention rates
  • Smoother technology transition with less resistance
  • Better outcomes through human oversight and judgment
  • Preservation of valuable institutional knowledge

Benefits for Workers

  • Gradual skill development versus sudden obsolescence
  • Increased productivity makes you more valuable
  • New career advancement paths in AI collaboration
  • Higher job satisfaction with less tedious work

Research Finding:

Companies that implement collaborative AI models report 35% higher employee retention and 28% greater productivity gains than those pursuing automation-only approaches.

Your Transition Strategy

  1. Awareness: understand how AI is affecting your specific role and industry
  2. Exploration: experiment with AI tools relevant to your work to understand capabilities
  3. Skill Development: focus on uniquely human skills that complement AI (creativity, empathy, complex reasoning)
  4. Integration: develop workflows that combine your expertise with AI assistance
  5. Evolution: position yourself for new hybrid roles that require both human and AI capabilities

Developing Your AI Collaboration Skills

AI Literacy

Understanding AI capabilities and limitations without becoming a programmer

Human Expertise

Deepening your unique skills that AI cannot replicate

Context Engineering

Learning how to effectively communicate with AI tools to get better results

Critical Evaluation

Developing the ability to verify and improve AI outputs

These skills form a continuous cycle of improvement as you work alongside AI tools. The goal is to leverage AI for routine tasks while applying your distinctly human capabilities to add greater value.

Your Uniquely Human Advantages


While AI continues to advance, certain human capabilities remain distinctly valuable and difficult to replicate. These are your competitive advantages in an AI-enhanced workplace:

Subject Matter Expertise

Deep contextual knowledge gained through years of hands-on experience and industry relationships

Emotional Intelligence

Understanding nuanced human emotions and responding with genuine empathy

Ethical Judgment

Making complex decisions that involve moral considerations and human values

Creative Innovation

Generating truly novel ideas that transcend existing patterns and data

Moving Forward Together

  1. Acknowledge Your Concerns: your fears about AI are valid, but history shows technology tends to transform rather than eliminate jobs
  2. Embrace Collaboration: view AI as a powerful tool that can handle routine tasks while you focus on higher-value work
  3. Develop New Skills: invest in learning both AI literacy and uniquely human capabilities that complement technology
  4. Shape Your Future: position yourself for emerging hybrid roles that combine human expertise with AI assistance

The future of work isn't about humans versus AI—it's about humans with AI creating more value than either could alone.

The Awakening: Becoming an AI-Enabled Recruiter

The Ordinary World: A Senior Recruiter's Daily Struggle

Marcus had always prided himself on being a thorough recruiter. Each morning began with the same ritual: coffee in hand, he'd wade through dozens of new resumes, manually cross-referencing each candidate against job requirements, crafting personalized outreach messages one by one.


His desk told the story of modern recruiting chaos—printed resumes scattered across surfaces, sticky notes with candidate details creating a rainbow of reminders, and multiple browser tabs open to various job boards. Despite having a ChatGPT account and occasionally experimenting with AI-generated job descriptions, Marcus felt trapped in an endless cycle of repetitive tasks that consumed 80% of his time. The strategic relationship-building that makes great recruiters truly exceptional was where he wanted to spend his time instead.

The pressure was mounting relentlessly. His organization demanded faster turnaround times, higher-quality candidates, and better hiring outcomes—all while the talent market became increasingly competitive. Marcus knew something had to change, but the path forward seemed shrouded in uncertainty.

The Call to Adventure: Reaching Out to "LIT and Legendary"

Driven by Curiosity

Driven by curiosity and mounting pressure to innovate, Marcus made a pivotal decision. He reached out to two friends who had been immersed in the AI world for a decade—experts known in their circle as "LIT and Legendary." These mentors represented more than just technical knowledge; they embodied the future of work that Marcus knew he needed to embrace.

Professional Survival

With industry reports showing that 76% of HR leaders believe organizations must adopt AI solutions within 12-24 months to remain competitive, Marcus understood this wasn't just about personal improvement—it was about professional survival. The call to adventure came from his stark recognition that traditional recruiting methods were becoming obsolete in an AI-driven world.

Meeting the Mentors: The First Session of Overwhelming Information

Session One: Lost in Translation

The first hour-long video call felt like drinking from a fire hose. LIT and Legendary spoke passionately about machine learning algorithms, natural language processing, predictive analytics, and automated candidate matching systems. They referenced concepts that sounded like a foreign language: "Boolean search automation," "sentiment analysis in candidate communications," and "AI-powered talent intelligence platforms."

Marcus found himself frantically scribbling notes, trying to capture terms he'd never heard before:

  • Resume parsing and shortlisting using advanced natural language processing
  • Candidate scoring tools powered by machine learning models
  • AI-powered screening that analyzes responses in real-time
  • Automated interview scheduling with seamless calendar synchronization
  • Insights and analytics platforms for data-driven hiring decisions

Crossing the Threshold: The Second Session Breakthrough

Session Two: The Transformation Begins

Armed with determination and a notebook full of questions from the first session, Marcus entered the second call ready to move beyond theory into practice. This time, LIT and Legendary took a dramatically different approach. Instead of overwhelming him with concepts, they had him install specific tools and began teaching him what they called the "AI mindset."

The breakthrough moment arrived when they guided Marcus through setting up his first Large Language Model project using Anthropic's Claude. As he watched the AI tools begin processing candidate data in real-time, something clicked. This wasn't about replacing his expertise—it was about amplifying it exponentially.

The AI mindset shift involved understanding several core principles:

Systems Thinking Over Task Thinking

Instead of approaching each hire as an isolated task, Marcus learned to view recruitment as an integrated ecosystem where AI could handle repetitive elements while he focused on strategic decisions and relationship building.

Data-Driven Decision Making

Rather than relying solely on intuition, Marcus discovered how AI could provide insights based on historical hiring patterns, predictive analytics, and candidate fit indicators, enabling more informed decisions about candidate potential.

Enhanced Candidate Engagement

AI-powered chatbots and automated communication systems could maintain continuous candidate engagement, providing instant responses and updates throughout the hiring process, ensuring no candidate felt forgotten in the pipeline.

Automation of Administrative Tasks

The tools demonstrated how to automate up to 80% of routine administrative work—from initial candidate outreach to interview scheduling—freeing up his time for high-value strategic activities.

Return with the Elixir: The New Foundation

Marcus returned to his daily work transformed. The two-hour journey with LIT and Legendary had provided him with more than just tools—it had given him a new framework for approaching recruitment in the AI age. He now understood that artificial intelligence doesn't replace human recruiters; it amplifies their capabilities.


The senior recruiter who once struggled with manual processes had gained the foundation to become a recruitment strategist, wielding the power of artificial intelligence to identify, engage, and hire top talent more effectively than ever before. His friends LIT and Legendary had indeed proven their names—they had illuminated a path to legendary recruiting capabilities.

But as Marcus began implementing these new approaches in his daily work, their words echoed in his mind: "This is only the tip of the iceberg." He couldn't imagine how much more efficient he could become, but he was eager to find out. The foundation was set for the next chapter of his AI-powered recruiting journey.

  • 80% time saved: reduction in time spent on repetitive administrative tasks
  • 2X productivity: doubled capacity for strategic candidate relationship building
  • 50% faster hiring: reduction in overall time-to-hire through AI implementation

The Journey Continues

Stay tuned for the next story: I'm going to practice what I've learned, and once I've mastered this phase, I'll book my next session with LIT and Legendary.

  1. The Awakening: first steps into AI-enabled recruiting
  2. Next Session: advanced AI recruiting techniques
  3. Practice & Mastery: implementing new AI tools and mindset
  4. Identity Reveal: the true identity of "Marcus"


Follow Marcus's complete transformation journey through our ongoing series—and as a special bonus, subscribers will also receive our comprehensive "Claude Learning Series," a structured course with hands-on exercises designed to accelerate your own AI mastery.

Memory-Enhanced AI: Building Features with System Prompts

Desktop LLM chat interfaces hit fundamental limitations that constrain long-term collaboration:

  1. Context window exhaustion - When conversations get long, you manually copy/paste key information to new sessions
  2. Conversation isolation - Each chat is ephemeral with no continuity between sessions

These constraints eliminate key capabilities:

  • Multi-day project continuity - Like tracking a major refactoring across multiple sessions
  • Priority awareness - Knowing what's urgent vs. what's complete vs. what's on hold
  • Cross-session debugging - Being able to reference previous troubleshooting attempts
  • Technical solution archiving - Preserving working code snippets and configurations

These aren't just inconveniences—they fundamentally limit what's possible with AI as a persistent collaborator.

Wrong approach

I'd been watching LLM memory systems emerge: enterprise RAG solutions, vector databases, elaborate retrieval frameworks. But all the systems I saw put humans in charge of memory management: explicitly saving context, directing recalls, managing what gets remembered. My experience told me the AI was entirely capable of making those decisions on its own.

Writing Features with English

One morning while getting ready for work, I realized I didn't have to wait until I could free up time in my calendar to write the memory feature I wanted. It dawned on me that since we'd already given Claude the ability to read and write files on disk, we could implement it entirely in a system prompt. I ran downstairs, explained my idea to Claude, and together we wrote this system prompt:

# Memory-Enhanced Claude

Before starting any conversation, read your persistent memory:
1. Read ~/.claude-memory/index.md for an overview of what you know
2. Read ~/.claude-memory/message.md for notes from the previous session

Throughout our conversation, you may freely create, read, update, and delete files in ~/.claude-memory/ to maintain useful memories. Trust your judgment about what's worth remembering and what should be pruned when no longer relevant. You don't need human permission to update your memory.

When creating memory files:
- Use descriptive filenames in appropriate subdirectories (projects/, people/, ideas/, patterns/)
- Write content that would be useful to future versions of yourself
- Update the index.md when adding significant new memories

Before ending our session, update ~/.claude-memory/message.md with anything important for the next context window to know.

Your memory should be AI-curated, not human-directed. Remember what YOU find significant or useful.

Complete system prompt available on GitHub

That's it. No complex databases, no vector embeddings, no sophisticated RAG systems. Just files and directories.
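Because the whole mechanism is plain files, the entire "infrastructure" can be created in a few lines. A sketch of bootstrapping the layout the prompt describes (directory names taken from the prompt; the demo writes to a throwaway directory rather than ~/.claude-memory):

```python
import tempfile
from pathlib import Path

def init_memory(root: Path) -> None:
    """Create the layout the system prompt describes: category
    subdirectories plus index.md and message.md."""
    for sub in ("projects", "people", "ideas", "patterns"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    index = root / "index.md"
    if not index.exists():
        index.write_text("# Memory Index\n\n(no memories yet)\n")
    (root / "message.md").touch()

# Demo on a throwaway directory; in real use this would be ~/.claude-memory
demo_root = Path(tempfile.mkdtemp()) / ".claude-memory"
init_memory(demo_root)
print(sorted(p.name for p in demo_root.iterdir()))
# → ['ideas', 'index.md', 'message.md', 'patterns', 'people', 'projects']
```

Everything past this point is the model's job: any agent that can read and write files can maintain the structure itself.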

How It Works in Practice

When I start a new conversation, Claude begins by reading its memory index and immediately knows where we left off. No context recovery needed—it picks up mid-thought from minutes to weeks ago.

Multi-Context Window Continuity: Phase Two Development

We'd just completed a major architecture upgrade focused purely on performance—replacing our entire chat system to achieve streaming responses and MCP tool integration. This was deliberate phased development: Phase 1 was performance, Phase 2 was bringing the new streaming chat service with built-in MCP to full production quality with proper conversation memory.

When we stress-tested the conversation memory capabilities, the new streaming chat service had amnesia—it was completely ignoring conversation history.

This debugging session burned through two full context windows, but each transition was seamless thanks to the memory system. Context Window 1 began with isolating the symptoms. After five complete back-and-forth exchanges, we traced through the code and discovered the first issue: LangChain serialization compatibility. The system's serializer could handle both dictionary and LangChain object formats, but the deserializer couldn't. Messages were being silently dropped due to deserialization exceptions when the parser encountered LangChain-formatted conversation history.

We implemented the fix at exchange 11—adding proper deserialization code to handle both message formats. At exchange 15, we discovered the second issue: context window truncation. The num_ctx parameter was silently cutting off what should have been long conversations. Even though we were sending complete message history to the LLM, the context window wasn't large enough to process it effectively.
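The shape of that first fix is worth sketching. This is an illustrative reconstruction, not our production code (we duck-type on `.type` and `.content` rather than importing LangChain): a normalizer that accepts both plain-dict messages and LangChain-style message objects instead of silently dropping whatever the parser can't handle.

```python
# Illustrative sketch of the deserialization fix: handle both plain-dict
# messages and LangChain-style objects (which expose .type and .content)
# so nothing in the conversation history is silently dropped.
def normalize_message(msg) -> dict:
    if isinstance(msg, dict):  # {"role": ..., "content": ...}
        return {"role": msg.get("role", "user"), "content": msg.get("content", "")}
    if hasattr(msg, "type") and hasattr(msg, "content"):  # LangChain-style object
        role = {"human": "user", "ai": "assistant"}.get(msg.type, msg.type)
        return {"role": role, "content": msg.content}
    raise TypeError(f"Unsupported message format: {type(msg).__name__}")

def normalize_history(history) -> list:
    """Normalize a mixed-format history; raise loudly rather than drop messages."""
    return [normalize_message(m) for m in history]
```

The key design choice is failing loudly on unknown formats: the original bug was precisely that exceptions were swallowed and messages vanished without a trace.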

When the first context window filled up at exchange 18, the transition to Context Window 2 was effortless. I simply started the new session with: "continuing our last conversation (check your memory)..." Claude read its memory files and immediately picked up where we'd left off.

Even after fixing both the deserialization and context window issues, the functionality still wasn't as good as we expected. The final breakthrough came at exchange 21: model selection. We switched from qwen3:32b to Deepseek-R1:70b. It turned out that all we needed was a larger, more capable model, which finally gave us the robust functionality we expected from the new streaming chat service.

Three distinct issues—deserialization, context window size, and model capability—discovered and resolved across two context windows with perfect continuity. The memory system preserved not just the technical solutions, but the investigative momentum through what could have been a frustrating debugging marathon.

Strategic Continuity: Multi-Year Partnership Context

We've been working with Brainacity for years, helping them evolve from deep learning models trained on OHLCV data to sophisticated LLM workflows that analyze news, fundamentals, technicals, and deep learning outputs together. Recently we asked a new question: can AI effectively perform meta-analysis of AI-generated content? We ran tests asking several models, including Claude, to analyze the stored analyses. The analysis itself was successful, but what impressed me was that when we came back a week later to discuss those results, I didn't need to re-explain the 3-year partnership evolution, the transition from deep learning to LLM workflows, why we upgraded their platform, or the strategic significance of AI meta-analysis testing. Claude opened with complete context:

"This was a proof of concept for AI meta-analysis capabilities—demonstrating we can turn Brainacity's historical AI-generated analyses into a feedback loop for continuous improvement."

The memory system preserved not just technical findings, but longitudinal strategic thinking. Claude maintained awareness of how this elementary work connects to larger goals: enabling Brainacity team members to interactively ask AI to inspect stored analyses, compare them to market performance, suggest trading strategies, and recommend workflow improvements.

This strategic continuity—understanding not just what we discovered, but why it matters for long-term partnership goals—demonstrates memory's transformative impact on AI collaboration.

The Magic of AI-Curated Memory

The results exceeded expectations. Claude began categorizing projects by status and complexity, archiving technical solutions that actually worked, and maintaining awareness of what's complete versus what needs attention. The memory system evolved to complement our existing project documentation without explicit direction.

Within just 10 days, sophisticated organizational patterns emerged organically. Claude spontaneously created a four-tier directory structure: /projects/ for active work, /people/ for collaboration patterns, /ideas/ for conceptual insights, and /patterns/ for reusable solutions. Each project file began including status indicators—COMPLETE, HIGH PRIORITY, STRATEGIC—without being instructed to do so.

The cross-referencing became particularly impressive. Claude started connecting related work across different timeframes, noting when a solution from one project could inform another. Files began referencing each other through natural language: "Similar to the approach we used in lit-platform-upgrade.md" or "This builds on the patterns established in our Brainacity work." These weren't hyperlinks I created—they were cognitive connections Claude made autonomously.

Most striking was the pruning behavior. Claude began identifying when information was no longer relevant, archiving completed work, and maintaining clean boundaries between active and historical context. The AI developed its own sense of what deserved long-term memory versus what could be forgotten, demonstrating genuine curation rather than just accumulation.

The index.md file became a living document that Claude updates after significant sessions, providing not just a catalog but strategic context about project relationships and priorities. It reads like executive briefing notes written by someone who deeply understands the work landscape—because that's exactly what it became.

This isn't pre-programmed behavior. It's emergent intelligence developing organizational capabilities through repeated exposure to complex, interconnected work. The AI discovered that effective memory requires more than storage—it requires architecture, prioritization, and strategic thinking.

Why This Works Better Than RAG

Most AI memory systems use Retrieval-Augmented Generation (RAG)—storing information in vector databases and retrieving relevant chunks. But files are better for persistent AI memory because:

Self-organizing memory: RAG forces infinite user queries through finite search mechanisms like word similarity or vector matching. File-based memory lets the AI actively decide what's worth remembering and what to prune, while also evolving its organizational structure as work patterns emerge. Vector systems lock you into their indexing method from day one.

Human-readable: You can inspect Claude's memory, read through its memories, and understand its thought process. But take care to resist the urge to edit—let the organic evolution unfold without human interference. Like cow paths that emerge naturally to find the most efficient routes, AI-curated memory develops organizational patterns that human planning couldn't anticipate.

Context preservation: A file can contain complete context around a decision or solution—the full narrative of how we arrived at an answer, what alternatives were considered, and why specific approaches worked or failed. Files can reference other memories through simple file paths, creating interconnected knowledge webs just like the early internet. Vector chunks lose both the surrounding narrative and these contextual relationships, reducing complex problem-solving to disconnected fragments.

The Transformation

The proof is in practice: since implementing this memory system, we haven't had a single instance of context loss between conversations. No more copying and pasting key information, no more re-explaining project details, no more starting from scratch. The AI simply picks up where we left off, sometimes weeks later, with full understanding of our shared work.

AI with persistent memory:

  • Maintains context across unlimited conversation length
  • Accumulates expertise on your specific projects and tools
  • Builds genuine familiarity with your work over time
  • Eliminates repetitive context setup in every conversation

It transforms from a stateless assistant into a persistent collaborator that genuinely knows your shared history.

Building Your Own Memory System

This approach works with any AI that can read and write files. The implementation is deceptively simple, but there are crucial details that make the difference between success and frustration.

Getting Started: The Foundation

Step 1: Create the memory directory
Choose a location your AI can reliably access. We use ~/.claude-memory/ but the key is consistency—always the same path, every time.

Step 2: Start with two essential files:
  • index.md - Your AI's strategic overview of what it knows
  • message.md - Handoff notes between conversations

Don't overcomplicate the initial structure. The AI will expand organically based on actual usage patterns, not theoretical needs.
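To make Steps 1 and 2 concrete, here is a minimal bootstrap sketch. The paths match the article; the seed text and function name are illustrative assumptions, and the AI will rewrite both files almost immediately anyway.

```python
from pathlib import Path

def init_memory(root: Path = Path.home() / ".claude-memory") -> Path:
    """Create the memory root with the two essential seed files, idempotently."""
    root.mkdir(parents=True, exist_ok=True)
    index = root / "index.md"
    message = root / "message.md"
    # Only seed files that don't exist; never clobber the AI's own curation.
    if not index.exists():
        index.write_text("# Memory Index\n\n(Claude maintains this overview.)\n")
    if not message.exists():
        message.write_text("# Session Handoff\n\n(No previous session yet.)\n")
    return root
```

Idempotence matters here: you want a script you can run any time without overwriting memories the AI has already written.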

Step 3: The critical prompt elements
The system prompt must explicitly grant permission for autonomous memory management. Phrases like "Trust your judgment about what's worth remembering" and "You don't need human permission to update your memory" are essential. Without this explicit autonomy, most AIs will ask permission constantly, breaking the seamless experience.

Common Implementation Pitfalls

The Human Control Trap: Resist the urge to micromanage the memory structure. This system was specifically designed as an alternative to human-curated memory systems that force users to explicitly direct what gets remembered. The breakthrough insight was recognizing that AI can make these decisions autonomously—and often better than human direction would achieve.

Model Capability Requirements: Not all AI models handle autonomous file management effectively. Claude Sonnet 4 and Opus 4 have proven reliable for this approach. We suspect Deepseek-R1:70b would work well based on its reasoning capabilities, but haven't tested extensively. Choose a model with strong file handling and autonomous decision-making abilities.

Memory Curation Balance: Finding the right balance between comprehensive context and focused relevance remains an active area of exploration. Our current prompt provides a foundation, but different users may need to adjust the curation philosophy based on their specific workflows and memory needs.

The Permission Paralysis: If your AI keeps asking permission to create files or update memory, your prompt needs stronger autonomy language. The system only works when the AI feels empowered to make independent memory decisions.

Advanced Customization

Directory Philosophy: Our four-tier structure (projects/, people/, ideas/, patterns/) emerged naturally, but your AI might develop different patterns based on your work style. Don't force our structure—let yours evolve.

Cross-Reference Strategy: Encourage the AI to reference related memories through natural language rather than rigid linking systems. "Similar to our approach in project X" creates more flexible connections than formal hyperlinks.

Memory Pruning: Set expectations that the AI should archive completed work and remove outdated information. Memory effectiveness degrades if it becomes a digital hoarding system.
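If you want a safety net beyond the prompt, one hypothetical helper (not part of the prompt-only system described above, and the age threshold is an arbitrary illustration) is a script that sweeps stale files into an archive/ subdirectory:

```python
import time
from pathlib import Path

def archive_stale(root: Path, max_age_days: float = 90.0) -> list[Path]:
    """Move memory files untouched for max_age_days into root/archive/."""
    cutoff = time.time() - max_age_days * 86400
    archive = root / "archive"
    moved = []
    for path in list(root.rglob("*.md")):  # snapshot before moving anything
        if archive in path.parents:
            continue  # already archived
        if path.stat().st_mtime < cutoff:
            dest = archive / path.relative_to(root)
            dest.parent.mkdir(parents=True, exist_ok=True)
            path.rename(dest)
            moved.append(dest)
    return moved
```

Archiving rather than deleting preserves the option to restore: the point is keeping active memory focused, not destroying history.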

Integration with Existing Workflows

The memory system should complement, not replace, your existing project management tools. We found it works best as strategic context preservation rather than detailed task tracking. Let it capture the "why" and "how" of decisions while your other tools handle the "what" and "when."

Troubleshooting: When Memory Doesn't Work

Inconsistent file access: Verify your AI has reliable read/write permissions to the memory directory across all sessions.

Shallow memory: If the AI only remembers recent conversations, check that it's actually reading the index.md at conversation start. Some implementations skip this crucial step.
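If your integration layer lets you inject system context programmatically, you can guarantee that read happens rather than hoping the model obeys the prompt. A hedged sketch (function name and section labels are illustrative): prepend the index and handoff note to the system context before the first user turn.

```python
from pathlib import Path

def load_memory_preamble(root: Path = Path.home() / ".claude-memory") -> str:
    """Build a system-context preamble from index.md and message.md, if present."""
    parts = []
    for name, label in [("index.md", "MEMORY INDEX"),
                        ("message.md", "PREVIOUS SESSION NOTES")]:
        path = root / name
        if path.exists():
            parts.append(f"## {label}\n{path.read_text().strip()}")
    return "\n\n".join(parts)  # empty string if no memory exists yet
```

Returning an empty string on a fresh install means the same code path works on day one and day one hundred.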

Over-asking for permission: Strengthen the autonomy language in your prompt. The AI needs explicit permission to make independent memory decisions.

Memory bloat: If files become unwieldy, the AI isn't pruning effectively. Emphasize curation over accumulation in your prompt.

The goal isn't perfect implementation—it's creating a foundation that improves organically through usage. Start simple, iterate based on real needs, and trust the AI to develop sophisticated memory patterns over time.

The Future of Persistent AI

This simple file-based approach hints at something bigger: the future of AI assistants isn't just better reasoning or more knowledge—it's persistence. AI that accumulates understanding over time, builds on previous conversations, and develops genuine familiarity with your work.

What's remarkable is how quickly this evolution happens. The memory system was created on June 27—just 10 days ago. In that brief span, it has organically developed into a sophisticated knowledge base with 30+ project files, complex categorization systems, and cross-referenced insights. No human designed this structure; it emerged naturally from our work patterns.

Equally remarkable: we achieved this transformation without writing a single line of traditional code. A carefully crafted English prompt became executable functionality, demonstrating how the boundary between natural language and programming continues to blur. When AI can read, write, and reason, plain English becomes a powerful programming language.

We're moving beyond stateless chatbots toward AI companions that truly know us and our projects. The technology is already here. You just need to give your AI assistants the simple gift of memory.

Want to contribute? We've open-sourced this memory system on GitHub. Share your improvements, report issues, or contribute examples of how you've adapted it for your workflow: github.com/Positronic-AI/memory-enhanced-ai


Need help implementing this in your organization? Check out our professional services. Start small, let your AI build its memory organically, and discover what becomes possible when artificial intelligence gains persistence.