A Python SDK for building AI agents that perform knowledge work: research, analysis, writing, and decision-making tasks that require iteration, verification, and structured thinking.
Why Knowledge Work is Different from Code
Code has a tight feedback loop: write code → run tests → fix errors → repeat. The solution space is constrained: there's usually one correct answer, and automated tests tell you if you found it.
Knowledge work is fundamentally different. The solution space is vast and underspecified. A "market analysis" could be a two-paragraph summary or a 50-page deep dive. A "strategy recommendation" could emphasize cost, speed, risk, innovation, or any combination. There's no test suite that returns pass/fail.
Our approach: Since knowledge work lacks natural verification, we synthesize one using rubrics. A rubric defines what "good" looks like before execution begins, enabling:
Self-verification: The agent checks its own work against explicit criteria
Transparent evaluation: Humans can audit the rubric and verification process
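The loop this enables can be sketched in plain Python. This is a hypothetical illustration, not the SDK's API: `verify` and `self_verifying_loop` are made-up names, and the keyword check is a toy stand-in for a model-based judgment.

```python
# Hypothetical sketch, not the SDK's API: a rubric is an ordered list of
# explicit criteria, and verification walks it to find unmet criteria,
# driving another revision pass.

def verify(answer: str, rubric: list[str]) -> list[str]:
    """Return the rubric criteria the answer does not yet satisfy.
    A criterion 'passes' here if its text appears in the answer --
    a toy stand-in for a model-based check."""
    return [c for c in rubric if c.lower() not in answer.lower()]

def self_verifying_loop(draft: str, rubric: list[str], max_iters: int = 3) -> str:
    answer = draft
    for _ in range(max_iters):
        failures = verify(answer, rubric)
        if not failures:
            break  # every criterion satisfied -> submit
        # Toy revision step: fold the missing material back in.
        answer += " " + " ".join(failures)
    return answer

rubric = ["cites sources", "states assumptions"]
final = self_verifying_loop("Remote work shrinks office demand.", rubric)
print(verify(final, rubric))  # []
```

The key property is that the rubric exists before execution, so both the agent and a human reviewer can check the answer against the same explicit criteria.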
This SDK implements a self-verifying agentic loop that brings structure to the inherently open-ended nature of knowledge work. The agent can search the web, read and write files, execute code, generate artifacts, and ask the user for clarification, all coordinated through an orchestrator that verifies its own output.
Why I'm Sharing This
This started as a harness for running RL training on knowledge tasks. I'm open-sourcing it because:
Knowledge workflows are underexplored. Most AI tooling focuses on code. But knowledge work (research, analysis, strategy, writing) is where most professionals spend their time. The primitives for building these systems aren't well established yet.
This could be a useful building block. If you're building products that involve AI doing research, making recommendations, or producing documents, this verification loop might save you weeks of iteration.
Models still struggle with verification. The self-check step is the weakest link. If this gets adoption, an open-source model provider could train specifically on rubric-based verification, improving the entire ecosystem.
I'd rather see these ideas spread than keep them proprietary.
from verif import RLHarness
harness = RLHarness(provider="gemini") # or "openai" or "anthropic"
result = harness.run_single("Analyze the economic impact of remote work on urban real estate.")
print(result.answer) # The analysis
print(result.rubric) # Auto-generated evaluation criteria
import asyncio
from verif import AsyncRLHarness
async def main():
    harness = AsyncRLHarness(provider="openai", enable_search=True)
    result = await harness.run_single("Analyze the economic impact of remote work on urban real estate.")
    print(result.answer)

asyncio.run(main())
Execution Modes
The SDK provides different modes optimized for different types of knowledge work:
| Mode | Best For | Rubric Strategy |
|------|----------|-----------------|
| standard | General research & analysis | Auto-created during execution |
| plan | Complex multi-step tasks | User-provided or auto-created |
| explore | Creative/divergent thinking | Quality checklist (no accuracy rubric) |
| iterate | Refining existing work | Uses existing rubric + feedback |
Supported Providers
| Provider | Config | Thinking Control |
|----------|--------|-----------------|
| Gemini | provider="gemini" | thinking_level: LOW / MEDIUM / HIGH |
| OpenAI | provider="openai" | reasoning_effort: low / medium / high |
| Anthropic | provider="anthropic" | thinking_budget: token count (default 10000) |
Standard Mode (Default)
For general tasks. The orchestrator creates the brief and rubric automatically.
from verif import RLHarness
harness = RLHarness(provider="gemini", enable_search=True)
result = harness.run_single(
    "Compare carbon tax vs cap-and-trade for reducing industrial emissions."
)
print(result.answer)
print(result.rubric) # Auto-generated
Explore Mode
For divergent thinking: generate multiple distinct perspectives. Unlike standard mode, explore doesn't optimize for a single "right" answer. It maps the solution space.
How explore differs from standard:
No accuracy rubric. Standard mode creates a rubric to verify correctness. Explore uses a quality checklist: are the takes distinct? Do they cover different assumptions?
Forces gap identification. Each take must state its assumptions and what would break it. This surfaces blind spots you wouldn't find with a single answer.
Quantity over convergence. Standard iterates toward one verified answer. Explore produces N parallel answers that may contradict each other; that's the point.
from verif import RLHarness
harness = RLHarness(provider="gemini", enable_search=True)
result = harness.run_single(
    task="""Explore database architectures for a fintech handling 10K TPS
    with strong consistency and multi-region deployment.""",
    mode="explore",
    num_takes=3,  # Generate 3 distinct approaches
)
# Result contains multiple takes separated by ===
takes = result.answer.split("===")
for i, take in enumerate(takes, 1):
    print(f"--- Approach {i} ---\n{take[:500]}...")
Each take includes:
The solution/recommendation
Assumptions: What must be true for this to work (e.g., "assumes budget for multi-region replication")
Counterfactual: What could make this fail (e.g., "breaks if latency requirements tighten to <10ms")
The output ends with set-level gaps: what's missing from the entire set? This tells you which angles weren't covered (maybe all takes assumed a single cloud provider, or none considered regulatory constraints). The gaps are often more valuable than the takes themselves.
Use explore when you're not sure what the right question is, or when the "best" answer depends on unstated constraints.
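Once you have the takes, you may want their structured parts programmatically. A minimal parser sketch follows; the "Assumptions:" and "Counterfactual:" labels are an assumption about the output format, inferred from the take structure described above, so adjust to what your runs actually emit.

```python
# Minimal parser for one explore take. The "Assumptions:" / "Counterfactual:"
# labels are assumed from the take structure described in the docs, not a
# guaranteed output format.

def parse_take(take: str) -> dict[str, str]:
    sections: dict[str, list[str]] = {"solution": [], "assumptions": [], "counterfactual": []}
    current = "solution"
    for line in take.strip().splitlines():
        if line.startswith("Assumptions:"):
            current, line = "assumptions", line[len("Assumptions:"):]
        elif line.startswith("Counterfactual:"):
            current, line = "counterfactual", line[len("Counterfactual:"):]
        sections[current].append(line.strip())
    return {k: " ".join(v).strip() for k, v in sections.items()}

take = """Use a globally replicated SQL store across three regions.
Assumptions: budget for multi-region replication.
Counterfactual: breaks if latency requirements tighten to <10ms."""
parsed = parse_take(take)
print(parsed["counterfactual"])  # breaks if latency requirements tighten to <10ms.
```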
Checkpointing & Resume
Save execution state at every step. Resume from any checkpoint, with optional feedback and rubric updates.
from verif import RLHarness
harness = RLHarness(provider="gemini", enable_search=True)
# Run with checkpointing
result = harness.run_single(
    "Analyze the power dynamics among Olympian gods.",
    checkpoint=True,
)
# List checkpoints
for snap_id, snap in harness.snapshots.items():
    print(f"{snap_id} (step {snap.step})")
# Resume from any checkpoint with new direction
resumed = harness.resume(
    checkpoint_id="<snap_id>",
    feedback="Focus more on the Trojan War.",
    rubric_update="Must include analysis of divine intervention in the Iliad.",
)
from verif import RLHarness, Attachment, Prompt
# Create attachment with preview
attachment = Attachment(
    content="/path/to/data.csv",
    mime_type="text/csv",
    name="data.csv",
    preview="col1,col2\n1,2\n3,4...",  # First N lines
)
# Build multimodal prompt
prompt: Prompt = [
    "Analyze the attached sales data and create a summary.",
    attachment,
]
result = harness.run_single(prompt)
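The `preview` string can be produced with a small helper. This is not part of the SDK, just a sketch of one way to take the first few lines of a text file and mark truncation:

```python
# Helper sketch (not part of the SDK) for building the Attachment preview
# string from the first few lines of a text file.
from itertools import islice

def make_preview(path: str, max_lines: int = 3) -> str:
    """First max_lines lines of the file, with '...' appended if it continues."""
    with open(path, "r", encoding="utf-8") as f:
        lines = list(islice(f, max_lines + 1))  # read one extra line to detect truncation
    truncated = len(lines) > max_lines
    preview = "".join(lines[:max_lines]).rstrip("\n")
    return preview + "..." if truncated else preview

# Demo on a throwaway CSV:
import os, tempfile
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("col1,col2\n1,2\n3,4\n5,6\n")
print(make_preview(tmp.name, max_lines=3))
os.unlink(tmp.name)
```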
from verif import RLHarness, ProviderConfig, CompactionConfig
from verif.executor import SubprocessExecutor
harness = RLHarness(
    # Provider: "gemini" | "openai" | "anthropic" | ProviderConfig
    provider=ProviderConfig(
        name="gemini",
        thinking_level="MEDIUM",  # Gemini: LOW | MEDIUM | HIGH
        # Optional google-genai HttpOptions pass-through:
        # gemini_async_client_args={"ssl": True, "cookies": {}},
        # gemini_http_options={"async_client_args": {"ssl": True}},
        # OR for OpenAI:
        # name="openai",
        # reasoning_effort="medium",  # low | medium | high
        # OR for Anthropic:
        # name="anthropic",
        # thinking_budget=10000,  # token budget for extended thinking
    ),
    # Tool Capabilities
    enable_search=True,     # Web search tool
    enable_bash=False,      # File system navigation
    enable_code=False,      # Python code execution
    enable_ask_user=False,  # User clarification tool
    # Code Execution (required if enable_code=True)
    code_executor=SubprocessExecutor("./artifacts"),
    artifacts_dir="./artifacts",
    # Execution Limits
    max_iterations=30,
    # Mode Selection
    default_mode="standard",  # "standard" | "plan" | "explore"
    # Pre-set Rubric (optional)
    rubric="1. Must be accurate\n2. Must cite sources",
    # Event Streaming
    on_event=lambda e: print(f"[{e.entry_type}] {e.content[:100]}"),
    stream=True,
    stream_subagents=True,
    # Context Compaction (for long tasks)
    compaction_config=CompactionConfig(
        enabled=True,
        threshold=0.8,  # Trigger at 80% context capacity
        keep_recent_turns=3,
    ),
)
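The `on_event` callback above just prints everything. A sketch of a more structured handler follows; only the `entry_type` and `content` fields come from the docs above, while the `Event` stub and the type names `"answer"` and `"verification"` are illustrative assumptions:

```python
# Sketch of a structured on_event handler: count everything, print only the
# event types you care about. The Event stub and the type names "answer" /
# "verification" are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Event:  # stand-in for the SDK's event object
    entry_type: str
    content: str

def make_handler(verbose: bool = False):
    seen: dict[str, int] = {}
    def on_event(e) -> None:
        seen[e.entry_type] = seen.get(e.entry_type, 0) + 1
        if verbose or e.entry_type in ("answer", "verification"):
            print(f"[{e.entry_type}] {e.content[:100]}")
    on_event.seen = seen  # expose counts for post-run inspection
    return on_event

handler = make_handler()
handler(Event("search", "querying the web..."))
handler(Event("verification", "rubric item 2 not satisfied"))
print(handler.seen)  # {'search': 1, 'verification': 1}
```

A handler like this would be passed as `on_event=handler` in the configuration above.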
Result Objects
RunResult
result = harness.run_single(task)
result.task # Original task text
result.answer # Final submitted answer
result.rubric # Evaluation rubric used
result.history # List[HistoryEntry] - full execution trace
result.mode # Mode used: "standard" | "plan" | "explore"
result.plan # Plan (if plan mode)
result.brief # Brief (if available)
Execution Trace
# Get formatted history
print(harness.get_history_markdown())
print(harness.get_history_text())
# Access raw entries
for entry in result.history:
    print(f"[{entry.timestamp}] {entry.entry_type}: {entry.content[:100]}")
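For a quick post-run summary, the trace can be bucketed by entry type. The `SimpleNamespace` objects here are stubs standing in for real `HistoryEntry` objects:

```python
# Quick post-run summary: bucket history entries by type to see where the
# iterations went. SimpleNamespace stands in for real HistoryEntry objects.
from collections import Counter
from types import SimpleNamespace

history = [
    SimpleNamespace(entry_type="search", content="..."),
    SimpleNamespace(entry_type="search", content="..."),
    SimpleNamespace(entry_type="verification", content="..."),
]
counts = Counter(entry.entry_type for entry in history)
print(counts.most_common())  # [('search', 2), ('verification', 1)]
```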
Done
[x] State checkpointing & resume: Save at every step, fork from any checkpoint with feedback + rubric updates. See Checkpointing & Resume.
[x] Anthropic provider: Claude with streaming, extended thinking, native web search.
[x] Context compaction: Summarize the middle of the context when approaching token limits.
[x] Explore mode: Generate N distinct approaches with assumptions, counterfactuals, and set-level gaps.
[x] Iterate mode: Stateless refinement with feedback classification (rubric vs answer level).
[x] Custom modes: Register new execution modes at runtime.
[x] Remote executor: Delegate code execution to frontend/browser via SSE.
[x] ask_user tool: Orchestrator can request clarification; verification blocks until answered.
In Progress
[ ] Anthropic checkpointing: Checkpointing works for Gemini and OpenAI; Anthropic is not fully tested. The complexity is interleaved thinking blocks (thinking + signature pairs) that need to survive deep copy and context replay correctly.
[ ] Compaction for Anthropic: The SDK does its own compaction (summarize the middle, keep recent turns) rather than using server-side context caching. Not stress-tested with Anthropic's 200K window.
Planned
[ ] Computer use subagent: Attach a computer-use capable subagent for GUI interaction (filling forms, navigating apps, extracting data from web interfaces).
[ ] Multi-app workflows: Work across browsers, spreadsheets, and documents in a single run.
[ ] Parallel verification: Run multiple verification passes and take consensus, reducing single-verifier bias.
[ ] Rubric quality scoring: Meta-evaluation that scores the rubric itself before using it for verification; catches "always-pass" rubrics early.
[ ] Structured output from runs: Return typed sections (executive summary, recommendations, evidence) instead of a single answer string.
[ ] Eval framework: Systematic comparison across providers/modes/rubric strategies on a benchmark task set. run_eval exists but needs scoring and reporting.
[ ] Token usage tracking: Surface per-run token counts by phase (brief, rubric, execution, verification) for cost analysis.
[ ] Mixed-model orchestration: Use different models for the orchestrator vs subagents (e.g., Opus for orchestration, Flash for search subagents). Currently the same provider handles both; I kept it this way because RL training benefits from a single policy, but for production use, routing cheap tasks to smaller models would yield significant cost savings.
For RL training, leave subagent outputs, search results, and code execution out of the training signal, even if they're generated by the same policy. The goal is to improve the orchestration and verification layers. Everything else is downstream: if the orchestrator gets better at decomposition and the rubric gets better at capturing intent, the subagents benefit automatically.
Verification is the bottleneck. Most training gains come from improving the verify step. A model that can accurately assess its own work against a rubric is more valuable than one that generates slightly better first drafts.
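The masking rule above can be sketched as a per-entry mask over the execution trace. The `entry_type` values here are illustrative, not the SDK's actual ones:

```python
# Sketch of the masking rule: keep orchestration and verification entries in
# the training signal, mask downstream tool output. The entry_type values are
# illustrative, not the SDK's actual ones.

MASKED_TYPES = {"subagent_output", "search_result", "code_execution"}

def training_mask(history: list[dict]) -> list[int]:
    """1 = entry contributes to the loss, 0 = masked out."""
    return [0 if e["entry_type"] in MASKED_TYPES else 1 for e in history]

history = [
    {"entry_type": "plan"},
    {"entry_type": "search_result"},
    {"entry_type": "subagent_output"},
    {"entry_type": "verification"},
]
print(training_mask(history))  # [1, 0, 0, 1]
```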
Limitations
Verification is only as good as the model. The rubric is generated by the same model that does the work. If the model has blind spots, the rubric will too. This is a fundamental constraint of self-verification.
External grounding happens at brief level, not verification. If you need external validation (e.g., checking facts against a database), you can provide your own rubric. But be careful: the verifier is intentionally limitedβit doesn't have access to search or filesystem. The design assumes grounding happens during task execution (via the brief and subagents), not during verification. The verifier checks internal consistency against the rubric, not external correctness.
Rubrics can be gamed. A sufficiently clever model could write a rubric that's easy to pass. This is why human review of rubrics matters for high-stakes tasks.
Context compaction requires a Gemini API key. Compaction (summarizing mid-context to stay under token limits) uses gemini-3-flash-preview regardless of your chosen provider. If you enable compaction with OpenAI or Anthropic as the orchestrator, you'll still need a GEMINI_API_KEY. Free keys are available from Google AI Studio.
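Since compaction always calls Gemini, a preflight check in your own code can fail fast instead of erroring mid-run. This is a minimal sketch; `GEMINI_API_KEY` is the environment variable named above:

```python
# Preflight sketch: compaction always calls Gemini, so check for the key even
# when OpenAI or Anthropic is the orchestrator.
import os

def compaction_ready() -> bool:
    return bool(os.environ.get("GEMINI_API_KEY"))

if not compaction_ready():
    print("Set GEMINI_API_KEY before enabling compaction_config.")
```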