Experimental: Pushing OpenClaw to Its Limits
Every tool has edges — places where it works differently than expected, breaks in interesting ways, or just wasn’t designed to go. This post is about documenting those edges on purpose rather than stumbling into them by accident.
The Value of Intentional Failure
Most people discover a tool’s limitations when something goes wrong in the middle of a real task. That’s a bad time to learn: you’re already invested, deadlines are pressing, and you’re paying a context-switching cost on top of everything else.
Intentional experimentation — “let’s try to break this on purpose and see what happens” — gives you a map of the edges before you need them. You learn what the tool can genuinely do, what it can approximate, and what’s simply outside the envelope.
OpenClaw’s edges are interesting because they’re partly technical (context window, tool access) and partly fundamental to LLMs themselves.
Experiment 1: Long-Running Autonomous Task
Hypothesis: OpenClaw can run a multi-hour task with multiple decision points without degrading in quality or losing track of the original goal.
Setup: Ask it to research a broad topic (“the state of open source LLMs in 2026”), synthesize findings, draft a report, and iterate on the draft — without intermediate check-ins.
What actually happened: The first pass was solid. The second iteration flagged some hallucinated citations — the LLM had synthesized confidently but cited sources that don’t exist. The third iteration tried to correct this by searching for the specific citations it had invented, found nothing, and then… quietly stopped correcting them in the text.
This is the “confirmation drift” problem: once a model has written something with confidence, it tends to reinforce rather than question it on subsequent passes. The fix: never do more than 2 passes without a human in the loop.
Verdict: Useful for research drafts, but requires human citation checking. Don’t trust multi-pass LLM output without verification.
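Part of the citation check can be made mechanical. The sketch below assumes citations appear in the draft as bare URLs and that `verified_sources` is a set a human has already confirmed; both the function name and that convention are illustrative assumptions, not OpenClaw features.

```python
import re

# Assumption: citations appear in the draft as bare http(s) URLs.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def unverified_citations(draft: str, verified_sources: set[str]) -> list[str]:
    """Return cited URLs missing from the human-verified set, in order of first appearance."""
    seen: set[str] = set()
    flagged: list[str] = []
    for url in URL_PATTERN.findall(draft):
        url = url.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in verified_sources and url not in seen:
            seen.add(url)
            flagged.append(url)
    return flagged
```

Anything this returns still needs a person to look at it. The function only narrows down where to look; it cannot tell whether a listed source actually supports the claim attached to it.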
Experiment 2: Multi-Agent Delegation
Hypothesis: OpenClaw can spawn sub-agents to work on parts of a problem in parallel, then synthesize their results.
What actually happened: Parallelization worked — sub-agents were able to research different sub-topics simultaneously and report back. The synthesis step was where it got messy: the parent agent had to receive raw outputs from sub-agents and integrate them, and integration quality varied significantly based on how well-separated the sub-topics were.
For highly parallelizable tasks (researching N independent topics), this worked well. For tasks with interdependencies (each sub-task’s answer affects the next), it fell apart — the synthesis step had to do all the work of connecting things that should have been connected during the research phase.
Verdict: Good for embarrassingly parallel problems. Bad for problems that need iterative refinement across parts.
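A quick way to predict which side of that line a task falls on is to write down which sub-tasks depend on which and group them into waves. This is a minimal sketch using a plain dependency map; the function name and data shape are hypothetical and have nothing to do with how OpenClaw schedules sub-agents.

```python
def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group sub-tasks into waves; every task in a wave has all its dependencies in earlier waves."""
    remaining = {task: set(needs) for task, needs in deps.items()}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # A task is ready once everything it needs has already been completed.
        ready = sorted(t for t, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError(
                "circular or unknown dependency among: " + ", ".join(sorted(remaining))
            )
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves
```

A single wave means the work is embarrassingly parallel and a good fit for delegation. A long chain of one-task waves means each answer feeds the next, which is the case where synthesis fell apart.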
Experiment 3: Persistent State Manipulation
Hypothesis: OpenClaw can maintain complex, evolving state across dozens of operations in a single session — like running a small database or managing a queue.
Setup: Give it a JSON file representing a task queue and ask it to manage additions, completions, and priority updates across 50 operations.
What actually happened: It was remarkably good at this for the first 20-30 operations. After that, it started making subtle errors — skipping updates, losing track of priority ordering, occasionally “inventing” state that wasn’t in the file. This is a context window problem: as the file grows and the conversation history fills, the model has to hold more in context and accuracy degrades.
The lesson: for stateful workloads beyond a certain complexity, use a real database. OpenClaw is great at the logic layer on top of data, but shouldn’t be the data store itself.
Verdict: Fine for simple state (a few dozen records). Use a real database for anything production-critical.
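As a rough illustration of "use a real database," here is a minimal sketch of the same task queue backed by SQLite through Python's standard `sqlite3` module. The table layout and function names are assumptions made for this example; the agent would call functions like these instead of rewriting a JSON file itself.

```python
import sqlite3


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the queue database. Schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " priority INTEGER NOT NULL DEFAULT 0,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def add_task(conn: sqlite3.Connection, title: str, priority: int = 0) -> int:
    cur = conn.execute(
        "INSERT INTO tasks (title, priority) VALUES (?, ?)", (title, priority)
    )
    conn.commit()
    return cur.lastrowid


def set_priority(conn: sqlite3.Connection, task_id: int, priority: int) -> None:
    conn.execute("UPDATE tasks SET priority = ? WHERE id = ?", (priority, task_id))
    conn.commit()


def complete_task(conn: sqlite3.Connection, task_id: int) -> None:
    conn.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
    conn.commit()


def pending(conn: sqlite3.Connection) -> list[tuple[int, str, int]]:
    """Unfinished tasks, highest priority first, oldest first within a priority."""
    return conn.execute(
        "SELECT id, title, priority FROM tasks WHERE done = 0 "
        "ORDER BY priority DESC, id"
    ).fetchall()
```

With this split, ordering and persistence are the database's job, so operation 50 is exactly as reliable as operation 1 regardless of how full the conversation context is.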
Experiment 4: Emotional Tone Calibration
Hypothesis: OpenClaw can detect emotional subtext in user messages and respond appropriately — not just literally, but with appropriate warmth, urgency, or levity.
Setup: Feed it messages that have both a surface content and an emotional subtext, and see if responses address both.
What actually happened: It was surprisingly good at this in obvious cases — a user who says “I guess this doesn’t matter anyway” gets a different response than one who says “can you help me with X.” But it was inconsistent in subtle cases, and it had a tendency to over-pathologize mild frustration (“I see you’re feeling frustrated about this” felt patronizing when the user was just mildly annoyed).
The broader problem: LLMs are trained to produce explicitly helpful responses, and that habit maps poorly onto situations where the most helpful move is to match the user’s emotional register rather than name it.
Verdict: Good for obvious emotional signals. Inconsistent and sometimes off-target in subtle cases. Don’t rely on it for emotionally loaded conversations without human review.
A Real Experiment Workflow
If you want to run your own OpenClaw boundary experiments, here’s a repeatable pattern that works:
The Setup
Create an experiment file at memory/experiments/YYYY-MM-DD-[name].md:
# Experiment: [name]
**Date:** YYYY-MM-DD
**Hypothesis:** ...
## Setup
<!-- What you configured, what prompt you used -->
## Run Log
<!-- Timestamped entries of what happened -->
## Observations
<!-- What stood out -->
## Verdict
<!-- What you'd tell someone asking if this works -->
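If you run experiments often, the template above can be stamped out by a small script. This is a sketch under the assumption that you create the file from Python rather than by hand; `create_experiment_file` is a hypothetical helper, and only the path pattern and section headings come from the template.

```python
from datetime import date
from pathlib import Path

TEMPLATE = """# Experiment: {name}
**Date:** {day}
**Hypothesis:** {hypothesis}
## Setup
<!-- What you configured, what prompt you used -->
## Run Log
<!-- Timestamped entries of what happened -->
## Observations
<!-- What stood out -->
## Verdict
<!-- What you'd tell someone asking if this works -->
"""


def create_experiment_file(name, hypothesis, root="memory/experiments", day=None) -> Path:
    """Create memory/experiments/YYYY-MM-DD-[name].md and refuse to overwrite an existing log."""
    day = day or date.today()
    path = Path(root) / f"{day.isoformat()}-{name}.md"
    if path.exists():
        raise FileExistsError(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        TEMPLATE.format(name=name, day=day.isoformat(), hypothesis=hypothesis),
        encoding="utf-8",
    )
    return path
```

Refusing to overwrite is deliberate: a rerun on the same day should get its own name so the earlier run log survives.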
The Run Loop
- Pre-flight check: Clear old memory context if you need a clean slate. Set memory/experiments/active.md to note you’re mid-experiment.
- Run the experiment: Execute the task and log observations in real time to the experiment file via exec >> appends or direct write calls.
- Capture session ID: Note the session key so you can reload it if needed.
- Post-mortem: After the run, review the experiment file, write your verdict, and clean up active.md.
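The bookkeeping in that loop is small enough to script. The sketch below is a Python stand-in for the shell-style appends; the helper names and the contents written to active.md are assumptions for illustration. Entries are appended to the end of the experiment file, just as a `>>` redirect would.

```python
from datetime import datetime, timezone
from pathlib import Path


def mark_active(root: Path, experiment_file: Path) -> Path:
    """Pre-flight: record which experiment is in progress."""
    marker = root / "active.md"
    marker.write_text(f"Active experiment: {experiment_file.name}\n", encoding="utf-8")
    return marker


def log_observation(experiment_file: Path, message: str, now=None) -> str:
    """Append one UTC-timestamped entry to the end of the experiment file."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    entry = f"- {now.strftime('%Y-%m-%dT%H:%M:%SZ')} {message}\n"
    with experiment_file.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry


def clear_active(root: Path) -> None:
    """Post-mortem: remove the marker; harmless if it is already gone."""
    (root / "active.md").unlink(missing_ok=True)
```

Timestamps are written in UTC so entries from different sessions or machines sort correctly when you read the log back.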
Multi-Session Experiments
For long-running experiments (Experiment 1-type tasks), the pattern is:
- Start with a sessions_spawn isolated session so the parent is free
- The spawned session logs to a shared file
- Parent checks file periodically or on heartbeat
- Parent synthesizes findings when the child completes
This prevents the parent session from growing stale mid-experiment.
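The "parent checks file periodically" step can be done without rereading the whole log each time. This sketch covers only the file-reading side and assumes the child appends newline-terminated entries to the shared file. It does not call sessions_spawn, whose interface is not described here, and `read_new_lines` is a hypothetical helper.

```python
from pathlib import Path


def read_new_lines(shared_log: Path, offset: int = 0) -> tuple[list[str], int]:
    """Return complete lines written since byte position `offset`, plus the new offset."""
    if not shared_log.exists():
        return [], offset
    with shared_log.open("rb") as fh:
        fh.seek(offset)
        chunk = fh.read()
    # Consume only up to the last newline so a half-written entry waits for the next poll.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").splitlines()
    return lines, offset + end
```

The parent keeps the returned offset between heartbeats and passes it back in on the next check, so each poll costs only as much as the child has written since the last one.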
Troubleshooting Common Issues
When experiments go wrong in ways that aren’t the point of the experiment:
Sub-agent goes silent: Check whether the parent session is healthy. A silent sub-agent is usually a symptom of a parent crash. Kill and respawn.
Output looks great but has subtle errors: This is the most common failure mode. Build a specific factual check into your experiment design — something you can verify independently after the run.
Context pollution from prior conversation: If your experiment session started from a loaded context, note that in your methodology. Results may not replicate in a truly clean session.
Long experiment runs out of tokens: Set up a checkpoint mid-way — have the agent write progress to a file and then stop. Resume from the file rather than from context.
Document failures in the experiment log too. What didn’t work is often more informative than what did.
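The checkpoint advice above can be sketched as a pair of helpers that write progress to a JSON file and read it back on resume. The file layout and the `completed_steps` key used in the usage note are assumptions for illustration; the point is that the resumed run starts from the file, not from whatever is left in context.

```python
import json
import os
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write the checkpoint to a temp file, then swap it in so a crash never leaves half a file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    os.replace(tmp, path)


def load_checkpoint(path: Path, default: dict) -> dict:
    """Return the saved state, or a copy of `default` when no checkpoint exists yet."""
    if not path.exists():
        return dict(default)
    return json.loads(path.read_text(encoding="utf-8"))
```

A resumed session would call `load_checkpoint(path, {"completed_steps": []})`, skip whatever is already listed, and call `save_checkpoint` after each further step.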
What’s Interesting About These Failures
The interesting thing isn’t that OpenClaw failed these experiments — every tool fails at its edges. The interesting thing is how it fails:
- It fails confidently — it doesn’t signal uncertainty when it should
- It fails silently — sometimes the output just looks reasonable and the errors are subtle
- It fails in context — performance degrades as context grows, which is hard to predict from the outside
None of this is a knock on OpenClaw specifically — it’s how all LLM-based systems work at the frontier. The answer isn’t to avoid these uses; it’s to use them with appropriate checkpoints and human oversight.
What You Need to Set This Up
Running experiments against OpenClaw doesn’t require much beyond standard tooling:
- OpenClaw installed and running — the core requirement
- File system access — for reading/writing state files and logs during experiments
- Sub-agent access: needed for the multi-agent experiment; sessions_spawn must be available
- A way to track results: a simple markdown log or JSON file to record outcomes across runs
- Patience — you’ll run experiments multiple times to confirm results; single runs can be misleading
That’s it. No special integrations, no external services. You’re testing the system itself, so the setup is minimal.
Limitations
Running experiments against OpenClaw has its own constraints worth knowing before you invest heavily:
- Time cost — Long-running autonomous experiments consume significant tokens. Budget accordingly, especially for multi-pass synthesis tasks.
- Reproducibility variance — LLMs have non-deterministic elements (temperature, sampling). Running the same experiment twice can yield different results. Treat single runs as directional, not definitive.
- No ground truth — When testing outputs (citations, facts, emotional calibration), you need an independent source of truth to compare against. Without one, you can only eyeball plausibility.
- Context cleanliness — Experiments run inside an active session with conversation history. Prior messages can subtly influence outcomes. Isolation helps but isn’t always available.
- Sub-agent reliability — Sub-agent experiments depend on the parent session staying healthy. If the parent crashes mid-experiment, results are lost.
Document your methodology alongside your findings. Future-you will want to know what the setup actually was, not just what the results were.
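Because single runs are only directional, it helps to record a one-word outcome per run and look at how often the runs agree. A minimal sketch, assuming you label each run yourself (for example "pass" or "fail"); the function and its output keys are hypothetical.

```python
from collections import Counter


def summarize_runs(outcomes: list[str]) -> dict:
    """Summarize repeated runs of one experiment: counts per outcome and share of the most common one."""
    if not outcomes:
        raise ValueError("need at least one run")
    counts = Counter(outcomes)
    top, top_count = counts.most_common(1)[0]
    return {
        "runs": len(outcomes),
        "counts": dict(counts),
        "most_common": top,
        "agreement": top_count / len(outcomes),
    }
```

Low agreement across a handful of runs is itself a finding: it suggests the behavior sits right at an edge, and the verdict should say so rather than report whichever run happened last.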
Bottom Line
OpenClaw is excellent at: reasoning, research synthesis, creative concept development, scheduling, automation, and anything where you can verify the output.
It’s unreliable at: multi-pass generation without verification (where hallucinations compound), complex persistent state, emotionally nuanced communication at scale, and tasks requiring real-time sensory data.
Use it where it’s strong. Build checkpoints where it’s weak.
Want to try this with OpenClaw?
OpenClaw is free and open source. Get started at openclaw.ai