
repo: wanshuiyin/Auto-claude-code-research-in-sleep

This report describes ARIS (Auto-Research-in-Sleep), an open-source harness for autonomous machine-learning research, covering its architecture, assurance mechanisms, and early deployment experience.
The performance of agent systems built on LLMs depends on both the model weights and the harness around them, which governs what information is stored, retrieved, and presented to the model. For long-horizon research workflows, the central failure mode is not a visible breakdown but a plausible unsupported success: a long-running agent can produce claims whose evidential support is incomplete, misreported, or silently inherited from the executor's framing.
We therefore present ARIS, a research harness that coordinates machine-learning research workflows through cross-model adversarial collaboration as its default configuration: an executor model drives forward progress, while a reviewer, recommended to come from a different model family, critiques intermediate artifacts and requests revisions.

ARIS has three architectural layers:
  1. The execution layer provides more than 65 reusable Markdown-defined skills, model integrations via MCP, a persistent research wiki for iterative reuse of prior findings, and deterministic figure generation.
  2. The orchestration layer coordinates five end-to-end workflows with adjustable effort settings and configurable routing to reviewer models.
  3. The assurance layer includes a three-stage process for checking whether experimental claims are supported by evidence (integrity verification, result-to-claim mapping, and claim auditing that cross-checks manuscript statements against the claim ledger and raw evidence), plus a five-pass scientific-editing pipeline, mathematical-proof checks, and visual inspection of the rendered PDF.
A prototype self-improvement loop records research traces and proposes harness improvements that are adopted only after reviewer approval.
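To make the assurance idea concrete, here is a minimal sketch of what a result-to-claim audit could look like: every manuscript claim points at a raw-evidence artifact, and claims with missing or mismatched evidence are flagged. The ledger shape, field names, and tolerance are illustrative assumptions, not ARIS's actual format.

```python
# Hypothetical claim-ledger audit: flag claims whose recorded evidence
# is absent or disagrees with the reported value. Field names are
# assumptions for illustration, not the repo's real schema.
from dataclasses import dataclass

@dataclass
class Claim:
    claim_id: str
    statement: str
    evidence_path: str    # raw artifact the claim is derived from
    reported_value: float

def audit_claims(ledger, evidence_store, tolerance=1e-6):
    """Cross-check each claim against the recorded raw evidence."""
    unsupported = []
    for claim in ledger:
        measured = evidence_store.get(claim.evidence_path)
        if measured is None:
            unsupported.append((claim.claim_id, "missing evidence"))
        elif abs(measured - claim.reported_value) > tolerance:
            unsupported.append((claim.claim_id, "value mismatch"))
    return unsupported

# A claim whose evidence artifact was never produced gets flagged.
ledger = [
    Claim("c1", "Accuracy improves to 0.93", "runs/exp1/acc.json", 0.93),
    Claim("c2", "Loss drops below 0.1", "runs/exp2/loss.json", 0.08),
]
evidence = {"runs/exp1/acc.json": 0.93}  # exp2 evidence missing
print(audit_claims(ledger, evidence))  # [('c2', 'missing evidence')]
```

The point of such a pass is that "plausible unsupported success" becomes a mechanical check rather than something a reviewer has to notice by eye.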
┌─────────────────────────────────────────────────────────────────┐
│                Workflow 1.5: Experiment Bridge                  │
│                                                                 │
│   EXPERIMENT_PLAN.md                                            │
│         │                                                       │
│         ▼                                                       │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐                │
│   │ Claude   │────▶│ GPT-5.4  │────▶│ Sanity   │                │
│   │ Code     │     │ xhigh    │     │ Check    │                │
│   │ writes   │     │ reviews  │     │ (1 GPU)  │                │
│   │ code     │     │ code     │     │          │                │
│   └──────────┘     └──────────┘     └──────────┘                │
│                                          │                      │
│                                          ▼                      │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐                │
│   │ Collect  │◀────│ Monitor  │◀────│ Deploy   │                │
│   │ results  │     │ progress │     │ to GPUs  │                │
│   │          │     │ (+ W&B)  │     │          │                │
│   └──────────┘     └──────────┘     └──────────┘                │
│         │                                                       │
│         ▼                                                       │
│   Ready for /auto-review-loop                                   │
└─────────────────────────────────────────────────────────────────┘

I'd been meaning to write a little guide this week about how I do structural analysis on code with LLMs (but still human-in-the-loop), but then I ran into this last night by chance. Unlike what the paper claims, though, since Opus 4.7 I can easily hand off a one-hour task as long as I spec at least some subtask decomposition rules/hints; that just means that with a bit of templating magic[1], good ideas from other repos can be easily re-integrated. There's quite a lot of material in the form of skill definitions in the repo, and I'll be going over them to figure out whether they can be used to improve my own framework. It's always good to learn from other people's efforts.

  1. I generally don't use many skills except for those that enable the communication harness, and zero CLAUDE.md, simply because I feel there are no truly generic rules outside of comms. There are only rules specific to the task at hand, and (optional) generic rules will often pollute and poison the context, so I just template very long prompts instead.
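The prompt-templating approach described above can be sketched roughly like this; the section names and template structure are my own illustrative assumptions, not the commenter's actual framework:

```python
# Minimal sketch: build one long, self-contained task prompt from a
# template instead of relying on generic rule files like CLAUDE.md.
# Section headings and helper names are hypothetical.
from string import Template

TASK_TEMPLATE = Template("""\
## Goal
$goal

## Subtask decomposition hints
$hints

## Task-specific rules
$rules
""")

def build_prompt(goal, hints, rules):
    """Render a single prompt carrying only task-specific context."""
    return TASK_TEMPLATE.substitute(
        goal=goal,
        hints="\n".join(f"- {h}" for h in hints),
        rules="\n".join(f"- {r}" for r in rules),
    )

prompt = build_prompt(
    goal="Refactor the parser module without changing its public API.",
    hints=["Split the work by file", "Run the tests after each file"],
    rules=["Do not touch generated code"],
)
print(prompt)
```

Because every prompt is rendered fresh per task, nothing generic persists between tasks to pollute the context.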