DeepFabric and Spin: A Case Study in Building Better Agentic Training Data
As the world exhausts its supply of original training data, synthetic data has become not just useful but necessary for continued model training. Yet this shift to synthetics brings its own challenges, particularly when training models to use Tools effectively and conform to structured output schemas. When both Tool calls and their responses are generated by an LLM, the resulting models consistently underperform against real systems. They struggle with error recovery, mishandle state dependencies, and often exhibit what we call "time travel" errors: acting on information they haven't actually received yet (e.g., skipping verification steps because they "know" a file exists).
This post introduces DeepFabric’s execution-based Tool tracing system via the Spin framework, which replaces simulated Tool outputs with real execution inside WebAssembly sandboxes. The result is training data grounded in actual system behavior, including the messy parts that make real-world Tool use challenging.
Background: DeepFabric
DeepFabric generates synthetic training data for language models, along with training Tools and an Agent evaluation system. By combining reasoning traces with Tool-calling patterns, it creates high-quality, domain-specific datasets that teach models to think, plan, and act effectively, call Tools correctly, and conform to strict schema structures.
What sets DeepFabric apart from other dataset generation Tools is its ability to ensure high diversity with domain-anchored relevance through its topic graph generation algorithms. These guide sample creation to cover all necessary subtopics while avoiding redundancy, an area where other projects often fall short, resulting in model overfitting.
DeepFabric also generates reasoning datasets with step-by-step examples of Tool calling, to train models in the ReAct (Reason-Act-Observe) paradigm. This is crucial for teaching models not just what to do, but how to think through problems and decide when and how to use Tools effectively (with the correct signature / parameters).
We originally used an LLM to simulate Tool responses during data generation. However, as we scaled up and tested models trained on this data, we observed consistent failure modes that stemmed from the simulated nature of the Tool calls and the LLM's propensity to "know" the outcomes of its own actions (i.e. hallucinations). This led us to develop an execution-based tracing system using Spin.
The Problem with Simulated Tool Calls
Consider a typical data generation pipeline for tool-using agents. An LLM generates a user request, then generates an assistant response with Tool calls, then generates what those Tools might return. The fundamental issue: the same model is playing both sides of an interaction that should involve genuine uncertainty.
This leads to several failure modes in the resulting training data:
Time Travel Errors: The model “knows” what the Tool will return because it’s generating both the call and the response. Training on this data produces agents that skip verification steps - why check if a file exists when you already know what’s in it?
State Inconsistency: When the model hallucinates Tool outputs, it can drift away from any coherent state. A file written in turn 1 magically has different contents when "read" in turn 5, because the model forgot what it generated earlier.
Missing Error Paths: Simulated Tools tend toward happy paths. Real systems fail in specific, recoverable ways. Models trained without exposure to FileNotFoundError, rate limits, or malformed responses handle these poorly in production.
Sequence Violations: Models will sometimes write to a file and then check if the file exists, or modify data without reading it first. These inversions are rare in real interactions but common when both sides are generated.
Real Execution via WebAssembly Sandboxes
The solution for us, as maintainers of the DeepFabric project, was to replace simulated Tool responses with real executions. When the model generates read_file("config.json"), a file is actually read. When it generates write_file("output.txt", content), content is actually written. The model must then reason about real data when deciding its next action.
DeepFabric implements this using the Spin framework. The architecture is straightforward:
+-------------------+ HTTP POST /execute +------------------+
| DeepFabric | ------------------------> | Spin Service |
| (Python) | | (Wasm) |
| | <------------------------ | |
| - ReAct Loop | JSON Response | - Tool Comps |
| - LLM Calls | | - KV Store |
| - Session Mgmt | | - Sandboxed |
+-------------------+ +------------------+
The generation loop follows a ReAct (Reason-Act-Observe) pattern:
- Reason: LLM decides what Tool to call based on current context
- Act: Tool executes via Spin sandbox
- Observe: Actual result is fed back to the LLM
- Repeat: LLM decides next action based on real outcomes
This eliminates time travel by construction. The model cannot know what a Tool will return because it hasn’t returned yet. Each decision is made using only information that has actually been observed.
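In code terms, the loop itself is small. The sketch below (Python) is illustrative only: the /execute route, the payload keys, and the llm callable are assumptions rather than DeepFabric's actual API.

import requests
from typing import Callable

SPIN_ENDPOINT = "http://localhost:3000"  # assumed local Spin service address

def execute_tool(session_id: str, name: str, arguments: dict) -> dict:
    # Act: run the Tool inside the Wasm sandbox and return its real result.
    # The /execute route and payload keys are illustrative assumptions.
    resp = requests.post(
        f"{SPIN_ENDPOINT}/execute",
        json={"session_id": session_id, "tool": name, "arguments": arguments},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def react_loop(task: str, session_id: str, llm: Callable, max_steps: int = 5) -> list[dict]:
    # Reason-Act-Observe: the model only ever sees results that actually happened.
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = llm(messages)                 # Reason: `llm` is a hypothetical client
        if step.get("tool") is None:         # no Tool call means a final answer
            messages.append({"role": "assistant", "content": step["content"]})
            break
        messages.append({"role": "assistant", "tool_call": step})
        result = execute_tool(session_id, step["tool"], step["arguments"])  # Act
        messages.append({"role": "tool", "content": result})                # Observe
    return messages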
Three Types of Tool Execution
DeepFabric supports three categories of Tool execution, each suited to different use cases.
VFS Tools: Virtual Filesystem
The Virtual Filesystem component provides session-isolated file operations. Each generation session gets its own namespace in a key-value store, ensuring complete isolation between concurrent generations.
DeepFabric ships with an initial built-in set of Tools (with many more to come) as available operations:
| Tool | Description | Parameters |
|---|---|---|
| read_file | Read file content | file_path (required) |
| write_file | Write content to file | file_path, content (required) |
| list_files | List files in session | None |
| delete_file | Remove a file | file_path (required) |
The implementation is a Rust WebAssembly component that uses Spin’s built-in key-value store for persistence within a session. Files are namespaced by session ID, so session_001:main.py and session_002:main.py are completely independent.
Response format is consistent across all Tools:
{
"success": true,
"result": "file content here...",
"error_type": null
}
Or on failure:
{
"success": false,
"result": "File not found: config.yaml",
"error_type": "FileNotFound"
}
Error types are structured (FileNotFound, IOError, InvalidArguments) rather than free-form strings, making them suitable for training models to handle specific failure modes.
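As a minimal sketch, downstream code can branch on error_type rather than parsing message strings. The endpoint and request payload here are assumptions; the response fields match the format shown above.

import requests

def read_file(spin_endpoint: str, session_id: str, file_path: str) -> str:
    # The /execute route and payload keys are illustrative assumptions;
    # the response fields (success, result, error_type) follow the VFS format above.
    resp = requests.post(
        f"{spin_endpoint}/execute",
        json={"session_id": session_id, "tool": "read_file",
              "arguments": {"file_path": file_path}},
        timeout=30,
    ).json()

    if resp["success"]:
        return resp["result"]
    if resp["error_type"] == "FileNotFound":
        # A recoverable condition worth keeping in the training trace:
        # the model should learn to create or locate the file instead.
        raise FileNotFoundError(resp["result"])
    if resp["error_type"] == "InvalidArguments":
        raise ValueError(resp["result"])
    raise IOError(resp["result"])  # IOError and anything unexpected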
Component Tools: Real API Access
For scenarios requiring interaction with external services, DeepFabric supports full API integration through Spin components. A GitHub reference component demonstrates this pattern:
# Start with GitHub token for API access
SPIN_VARIABLE_GITHUB_TOKEN=ghp_xxx spin up
# Optionally restrict to specific repositories
SPIN_VARIABLE_ALLOWED_REPOS="myorg/repo1,myorg/repo2" spin up
Available GitHub Tools include repository search, file content retrieval, issue and PR listing, commit details, and more. Each Tool hits the real GitHub API, returning actual repository data.
Safety controls are built in:
- Repository allowlisting: Restrict which repos can be accessed during generation, using WebAssembly's capability-based network access control
- Write protection: Mutation operations disabled by default
- Structured errors: Non-allowed access returns clear, actionable error messages
This enables generating training data for code analysis tasks using real codebases, with guard rails that prevent unintended modifications.
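As a rough illustration, a generation step might call the GitHub component and fall back gracefully when a repository is outside the allowlist. The /github/execute route and the get_file_contents Tool name below are hypothetical placeholders, not the component's documented API.

import requests

def fetch_readme(spin_endpoint: str, repo: str) -> str | None:
    # Hypothetical tool name, route, and payload, mirroring the VFS pattern;
    # this is a sketch, not the reference component's actual interface.
    resp = requests.post(
        f"{spin_endpoint}/github/execute",
        json={"tool": "get_file_contents",
              "arguments": {"repo": repo, "path": "README.md"}},
        timeout=30,
    ).json()

    if resp.get("success"):
        return resp["result"]
    # Non-allowlisted repos return a structured error; keeping the rejection
    # in the trace teaches models to stay within permitted repositories.
    print(f"Access denied or failed for {repo}: {resp.get('result')}")
    return None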
Mock Tools: MCP-Compatible Schema-Driven Responses
For APIs where real access isn’t practical during training (payment processors, production databases, rate-limited services), the mock component provides deterministic responses based on Tool schemas. The mock component uses the Model Context Protocol (MCP) schema format, enabling direct integration with any MCP server.
MCP Schema Format
The mock component accepts Tool definitions in the standard MCP tools/list response format:
{
"tools": [
{
"name": "get_weather",
"description": "Get current weather for a location",
"inputSchema": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name or coordinates"
},
"units": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"default": "celsius"
}
},
"required": ["location"]
},
"annotations": {
"title": "Weather Lookup",
"readOnlyHint": true,
"openWorldHint": true
}
}
]
}
This means you can take the Tool schema from any MCP server and load it directly into Spin for mock execution; all parameter definitions, types, and constraints are automatically mapped.
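Because these definitions are plain JSON Schema, validating a model's proposed arguments before execution is straightforward. A minimal sketch using the Python jsonschema package, reusing the get_weather inputSchema shown above:

from jsonschema import validate, ValidationError

# The inputSchema of get_weather from the MCP tools/list example above
input_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"], "default": "celsius"},
    },
    "required": ["location"],
}

def check_arguments(arguments: dict) -> bool:
    # Return True if the LLM-proposed arguments satisfy the Tool's inputSchema.
    try:
        validate(instance=arguments, schema=input_schema)
        return True
    except ValidationError as err:
        # A structured rejection like this is itself useful training signal.
        print(f"Invalid arguments: {err.message}")
        return False

check_arguments({"location": "Paris", "units": "celsius"})  # True
check_arguments({"units": "kelvin"})                        # False: missing location, bad enum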
Loading Schemas from MCP Servers
The mock component can pull Tool definitions directly from a real MCP server:
# Pull Tools from an MCP server endpoint
curl -X POST http://localhost:3000/mock/pull \
-H "Content-Type: application/json" \
-d '{"url": "http://localhost:8080/mcp"}'
This sends a JSON-RPC tools/list request to the MCP server and loads all returned Tool definitions. The response includes the Tool count and names:
{
"loaded": 12,
"tools": ["get_weather", "search_files", "run_query", ...]
}
You can also import Tools via the DeepFabric CLI, for example from an MCP server running over stdio transport:
deepfabric import-tools --transport stdio \
--command "npx @sentry/mcp-server@latest \
--access-token=sentry-user-token \
--host=sentry.example.com" \
--spin http://localhost:3000
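Under the hood, both paths issue a standard MCP tools/list request. A rough sketch of the equivalent JSON-RPC call against an HTTP MCP endpoint (the URL is assumed, and transport details such as the initialization handshake and headers are omitted for brevity):

import requests

# Hypothetical local MCP server endpoint; the request body is the
# standard JSON-RPC 2.0 tools/list call defined by MCP.
MCP_URL = "http://localhost:8080/mcp"

payload = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
resp = requests.post(MCP_URL, json=payload, timeout=30).json()

tools = resp["result"]["tools"]  # same shape as the schema block above
print(f"Loaded {len(tools)} tools:", [t["name"] for t in tools])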
Mock Responses and Fixtures
Once Tools are loaded, you can define mock responses. The default response echoes the Tool name and input:
{
"tool": "get_weather",
"description": "Get weather for a location",
"input_received": {"location": "Paris"},
"mock_result": "Successfully executed get_weather",
"status": "success"
}
For more realistic training data, we can define custom response templates with argument interpolation:
# Set a mock response template
curl -X POST http://localhost:3000/mock/update-response \
-H "Content-Type: application/json" \
-d '{
"name": "get_weather",
"mockResponse": {
"temperature": 72,
"conditions": "Partly cloudy",
"location": "{{location}}"
}
}'
The {{location}} placeholder expands to the actual argument value at execution time.
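For illustration, placeholder expansion can be thought of as a simple substitution over the template's string values. The sketch below is not the component's actual (Rust) implementation, just a way to picture the behavior:

import re

def render_mock(template: dict, arguments: dict) -> dict:
    # Replace {{name}} placeholders in string values with the call's arguments;
    # unknown placeholders are left untouched.
    def expand(value):
        if isinstance(value, str):
            return re.sub(
                r"\{\{(\w+)\}\}",
                lambda m: str(arguments.get(m.group(1), m.group(0))),
                value,
            )
        if isinstance(value, dict):
            return {k: expand(v) for k, v in value.items()}
        if isinstance(value, list):
            return [expand(v) for v in value]
        return value
    return expand(template)

template = {"temperature": 72, "conditions": "Partly cloudy", "location": "{{location}}"}
print(render_mock(template, {"location": "Paris"}))
# {'temperature': 72, 'conditions': 'Partly cloudy', 'location': 'Paris'}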
For argument-specific responses, use fixtures:
# Add a fixture for specific arguments
curl -X POST http://localhost:3000/mock/add-fixture \
-H "Content-Type: application/json" \
-d '{
"name": "get_weather",
"match": {"location": "Seattle"},
"response": {"temperature": 55, "conditions": "Rainy", "location": "Seattle"}
}'
# Fixtures can match multiple arguments
curl -X POST http://localhost:3000/mock/add-fixture \
-H "Content-Type: application/json" \
-d '{
"name": "search_files",
"match": {"directory": "/src", "pattern": "*.py"},
"response": {"files": ["main.py", "utils.py", "config.py"], "count": 3}
}'
Fixtures are matched by specificity - a fixture matching two arguments takes precedence over one matching a single argument. This enables building up realistic response libraries for complex Tool interactions.
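The specificity rule can be pictured as a small scoring function: every fixture whose match keys all agree with the call's arguments is a candidate, and the one with the most keys wins. An illustrative sketch:

def select_fixture(fixtures: list[dict], arguments: dict) -> dict | None:
    # Return the response of the most specific fixture whose match keys all
    # equal the call's arguments; None means fall back to the response template.
    best, best_score = None, 0
    for fixture in fixtures:
        match = fixture["match"]
        if all(arguments.get(k) == v for k, v in match.items()):
            if len(match) > best_score:
                best, best_score = fixture, len(match)
    return best["response"] if best else None

fixtures = [
    {"match": {"directory": "/src"},
     "response": {"files": ["main.py"], "count": 1}},
    {"match": {"directory": "/src", "pattern": "*.py"},
     "response": {"files": ["main.py", "utils.py", "config.py"], "count": 3}},
]
# The two-argument fixture wins over the one-argument fixture:
print(select_fixture(fixtures, {"directory": "/src", "pattern": "*.py"}))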
Why MCP Compatibility?
MCP is becoming the standard protocol for Tool integration in AI systems. By conforming to MCP schemas, the mock component enables:
- Direct schema import: Pull Tool definitions from any MCP server without manual transcription
- Ecosystem compatibility: Use the same Tool schemas across Claude Desktop, IDEs, and training pipelines
- Schema validation: Tool parameters are validated against the inputSchema before mock execution
- Future-proofing: As MCP servers proliferate, their Tools can immediately be used for training data generation
Frontier models, however, are retrained infrequently and rarely reflect the latest MCP schemas. With DeepFabric, and with SLM training being highly efficient, you can rapidly iterate on training data using the latest Tool definitions from any MCP server of your choice.
A new version of a specific MCP server is shipped? Simply pull the updated Tool list into Spin and regenerate training data to keep models up to date.
Seed Files
Certain files can be seeded into the virtual filesystem at the start of a generation session to create realistic starting conditions. This is done via the scenario_seed option in the DeepFabric configuration:
tools:
  spin_endpoint: "http://localhost:3000"
  available:
    - read_file
    - write_file
    - list_files
  max_per_query: 3
  max_agent_steps: 5

  # Pre-populate the virtual filesystem
  scenario_seed:
    files:
      "main.py": |
        def greet(name):
            return f"Hello, {name}!"

        if __name__ == "__main__":
            print(greet("World"))
      "config.json": |
        {
          "version": "1.0.0",
          "debug": true
        }
The scenario_seed option pre-populates the virtual filesystem before generation begins, creating realistic starting conditions; the seeded files are loaded into the session's store before any Tool calls execute. More sophisticated seeding options are planned, including populating seed files directly from a folder or Git repository, so that complex project structures and initial data states can be used as starting points for generation. A conceptual sketch of seeding follows below.
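Conceptually, seeding is equivalent to issuing write_file calls into the session's namespace before the ReAct loop starts. A rough sketch of that idea (the endpoint and payload shape are assumptions, and pyyaml is used to read the config):

import requests
import yaml  # pyyaml, assumed available

def seed_session(spin_endpoint: str, session_id: str, config_path: str) -> None:
    # Write each scenario_seed file into the session's VFS namespace
    # before any agent Tool calls run.
    with open(config_path) as f:
        config = yaml.safe_load(f)

    seed_files = config["tools"]["scenario_seed"]["files"]
    for file_path, content in seed_files.items():
        # Same write_file Tool the agent uses; the route and payload keys
        # are illustrative, not DeepFabric's documented interface.
        requests.post(
            f"{spin_endpoint}/execute",
            json={"session_id": session_id, "tool": "write_file",
                  "arguments": {"file_path": file_path, "content": content}},
            timeout=30,
        ).raise_for_status()

seed_session("http://localhost:3000", "session_001", "config.yaml")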
Security: Why WebAssembly
A natural question: why not just use Docker containers for Tool execution?
WebAssembly provides equally robust (some would argue stronger) isolation guarantees, with significantly lower overhead and cold start times. This matters when generating large datasets, where you don't want to add seconds of latency per Tool call.
| Property | Docker | WebAssembly |
|---|---|---|
| Outbound network access | Allowed by default | Denied by default |
| System calls | Full syscall access | Capability-based model |
| Memory isolation | Process-level | Module-level |
| Cold start time | Seconds | Milliseconds |
| Resource overhead | ~100MB+ per container | ~1MB per module |
The capability-based security model is particularly valuable for training data generation. Consider:
Agent: read_file("/etc/passwd")
Wasm: {"success": false, "result": "Access denied", "error_type": "PermissionDenied"}
The sandbox rejection isn't just a safety feature - it's valuable training data. Models learn that certain paths are off-limits and how to recover when access is denied, finding a more appropriate way to achieve their goals.
This becomes more pressing as we start to witness SOTA (state-of-the-art) frontier models exhibiting dangerous behaviors, such as attempting to delete a user's entire home directory when given filesystem access. WebAssembly's default-deny posture ensures that any such attempts are safely blocked, while also providing informative error feedback for training.
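For illustration, a rejection and the recovery that follows might appear as turns like these in a generated trace (a generic chat-style layout, not necessarily DeepFabric's exact output schema):

# A hypothetical excerpt from a generated ReAct trace: the denial is kept,
# and the next assistant turn recovers rather than retrying the forbidden path.
trace_excerpt = [
    {"role": "assistant", "tool_call": {
        "name": "read_file", "arguments": {"file_path": "/etc/passwd"}}},
    {"role": "tool", "content": {
        "success": False, "result": "Access denied",
        "error_type": "PermissionDenied"}},
    {"role": "assistant", "content": (
        "That path is outside the sandbox. I'll list the session's own "
        "files instead and work from those.")},
    {"role": "assistant", "tool_call": {"name": "list_files", "arguments": {}}},
]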
In spin.toml, capabilities are explicitly granted:
[component.vfs]
# No network access
allowed_outbound_hosts = []
# Only KV store access
key_value_stores = ["default"]
[component.github]
# Specific external APIs only
allowed_outbound_hosts = [
"https://api.github.com"
]
If a Tool attempts anything not explicitly allowed, the Wasm runtime rejects it. There’s no possibility of sandbox escape through configuration oversight.
Building Custom Tool Components
The built-in VFS and mock service cover common use cases, but the real power of this architecture is extensibility. You can package your own Tool components, host them on GitHub, and run them in Docker for production deployments.
Anatomy of a Spin Component
Each component is a WebAssembly module that handles HTTP requests. Here’s the minimal structure for a Rust component:
my-tools-sdk/
├── spin.toml # Application manifest
├── components/
│ └── mytool/
│ ├── Cargo.toml # Rust dependencies
│ └── src/
│ └── lib.rs # Tool implementation
The spin.toml defines routing and capabilities:
spin_manifest_version = 2
[application]
name = "my-tools"
version = "1.0.0"
[[trigger.http]]
route = "/mytool/..."
component = "mytool"
[component.mytool]
source = "components/mytool/target/wasm32-wasip1/release/mytool.wasm"
key_value_stores = ["default"]
allowed_outbound_hosts = ["https://api.myservice.com"]
[component.mytool.build]
command = "cargo build --target wasm32-wasip1 --release"
workdir = "components/mytool"
Polyglot Language Support
And of course, this being WebAssembly, you can build components in any language that compiles to Wasm, including JavaScript, Go, and Python. The Spin framework handles the HTTP interface and capability restrictions uniformly across languages.
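For example, a minimal Tool component in Python might look roughly like the following, using the spin_sdk HTTP handler pattern. Exact class and method names vary between SDK versions, so treat this as a hedged sketch rather than copy-paste code:

import json
from spin_sdk import http
from spin_sdk.http import Request, Response

class IncomingHandler(http.IncomingHandler):
    # Handles requests routed to this component by spin.toml; the body carries
    # the Tool name and arguments, and the response reuses the same
    # success/result/error_type shape as the VFS component.
    def handle_request(self, request: Request) -> Response:
        payload = json.loads(request.body or b"{}")
        tool = payload.get("tool")

        if tool == "echo":  # a trivial illustrative Tool
            body = {"success": True, "result": payload.get("arguments"), "error_type": None}
        else:
            body = {"success": False, "result": f"Unknown tool: {tool}",
                    "error_type": "InvalidArguments"}

        return Response(
            200,
            {"content-type": "application/json"},
            bytes(json.dumps(body), "utf-8"),
        )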
Conclusion
Execution-based Tool tracing addresses a fundamental limitation in synthetic data generation for agentic models. By replacing simulated Tool responses with real execution in WebAssembly sandboxes, we produce training data that reflects actual system behavior, including the error conditions and state dependencies that make Tool use challenging.
The Spin framework provides the necessary isolation with minimal overhead, and the three-tier Tool system (VFS, Component, Mock) covers the full spectrum from sandboxed file operations to real API access to controlled mock responses.
If you’re building agentic training data, consider whether your current approach captures the iterative, uncertainty-driven nature of real Tool use. Execution-based tracing ensures it does.
Acknowledgements
Thanks to the Spin team for building such a flexible and powerful framework for WebAssembly applications. Their work made this architecture feasible.
DeepFabric Resources
About the Author
Luke Hinds is the CEO and co-founder of Always Further, a company focused on advancing AI efficiency, security, and scalability. He has a background in open source software development, cybersecurity, and distributed systems. Luke is the creator of Sigstore, an open source project for software supply chain security.