Troubleshooting

Common issues and solutions when running the AWS Orchestrator Agent, plus frequently asked questions.

Common Issues

1. Validator fails with "Directory not found"

Symptom: tf-validator returns INVALID: module directory not found at workspace/terraform_modules/{service}/

Cause: Path mismatch between the virtual filesystem and the real disk. The validator runs shell commands from the project root using relative paths (no leading /), and the generated files may not have been synced to disk yet.

Fix:

Ensure your docker-compose.yml does not mount a volume over /app/workspace unless needed — the agent manages this path internally
Verify TERRAFORM_WORKSPACE=./workspace/terraform_modules is set in your .env
If running from source, ensure the workspace/ directory exists at the project root
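
The last two checks can be scripted as a quick pre-flight step. This is a minimal sketch, not part of the agent: the function name check_workspace is hypothetical, and it assumes .env sits at the project root and uses plain KEY=value lines.

```shell
# Hypothetical helper: confirm TERRAFORM_WORKSPACE is set in .env and that
# the workspace/ directory exists at the project root.
# Assumes .env uses plain KEY=value lines (no quoting, no "export").
check_workspace() {
  project_root="${1:-.}"
  env_file="$project_root/.env"

  # Take the last TERRAFORM_WORKSPACE assignment found in .env, if any.
  workspace=$(grep -E '^TERRAFORM_WORKSPACE=' "$env_file" 2>/dev/null | tail -n 1 | cut -d= -f2-)
  if [ -z "$workspace" ]; then
    echo "TERRAFORM_WORKSPACE is not set in $env_file"
    return 1
  fi

  # The validator expects workspace/ under the project root.
  if [ ! -d "$project_root/workspace" ]; then
    echo "Missing directory: $project_root/workspace"
    return 1
  fi

  echo "OK: $workspace"
}
```

Run check_workspace from the project root (or pass the root as the first argument) before starting the agent from source.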

2. Generator retries endlessly and fails

Symptom: Continuous loop of tf-validator errors followed by tf-generator rewrites until timeout.

Cause: The model configured for the Deep Agent tier lacks the reasoning capacity to parse complex validation errors and fix them. Flash or Lite models struggle with intersecting dependency errors.

Fix: Switch LLM_DEEPAGENT_MODEL to a premium model:

# In .env
LLM_DEEPAGENT_MODEL=gemini-3.1-pro-preview

The Standard tier can stay on flash/lite — only the Deep Agent tier needs strong reasoning.
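
To see which model the Deep Agent tier is currently using, a small check like the one below may help. It is a sketch under assumptions: check_deepagent_model is a hypothetical name, .env is assumed to use plain KEY=value lines, and matching "flash" or "lite" in the model name is only a naming heuristic.

```shell
# Hypothetical helper: warn when LLM_DEEPAGENT_MODEL looks like a flash/lite model.
# Assumes .env uses plain KEY=value lines; the name match is a heuristic only.
check_deepagent_model() {
  env_file="${1:-.env}"
  model=$(grep -E '^LLM_DEEPAGENT_MODEL=' "$env_file" 2>/dev/null | tail -n 1 | cut -d= -f2-)

  case "$model" in
    "")
      echo "LLM_DEEPAGENT_MODEL is not set"
      return 1
      ;;
    *flash*|*lite*)
      echo "WARN: $model may be too weak for the Deep Agent tier"
      return 1
      ;;
    *)
      echo "OK: $model"
      ;;
  esac
}
```

A non-zero exit status means the variable is unset or the name suggests a lightweight model.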

3. "I'm sorry, I cannot fulfill out-of-scope requests."

Symptom: Immediate rejection without any sub-agents running.

Cause: The Supervisor couldn't detect Terraform or AWS intent in your prompt.

Fix: Use clear infrastructure action words:

✅ "Generate Terraform code for an S3 bucket with encryption"
✅ "Create a VPC module with 3 AZs"
✅ "Update the AWS module at acme/infra-modules"

❌ "Make me some cloud stuff"
❌ "Write a Python script for AWS"

4. GitHub commit fails (401 or "Not Found")

Symptom: The github-agent successfully generates the commit payload but GitHub rejects it.

Cause: Your GitHub Personal Access Token is missing, expired, or lacks sufficient scope.

Fix:

Verify GITHUB_PERSONAL_ACCESS_TOKEN is set in your .env
Ensure the token has repo scope (read/write to repositories)
Confirm the target repository exists before running the agent — the agent creates files, not repos
For organization repos, ensure the token has org-level access
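
The scope check can be done from the command line before running the agent. The sketch below assumes a classic personal access token, for which the GitHub REST API returns the granted scopes in the X-OAuth-Scopes response header; fine-grained tokens do not report scopes this way. The function name has_repo_scope is hypothetical.

```shell
# Hypothetical helper: read HTTP response headers on stdin and succeed only
# if the X-OAuth-Scopes header (classic tokens) lists the "repo" scope.
has_repo_scope() {
  tr -d '\r' \
    | grep -i '^x-oauth-scopes:' \
    | cut -d: -f2- \
    | tr ',' '\n' \
    | tr -d ' ' \
    | grep -qx 'repo'
}

# Usage (needs network access and the token exported in the environment):
#   curl -sS -D - -o /dev/null \
#     -H "Authorization: Bearer $GITHUB_PERSONAL_ACCESS_TOKEN" \
#     https://api.github.com/user | has_repo_scope && echo "repo scope present"
```

If the same request returns a 401 status line, the token itself is missing, mistyped, or expired, and the scope check does not apply yet.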

5. Terraform MCP connection timeout

Symptom: The tf-planner hangs during requirements analysis or reports "MCP server unreachable."

Cause: The Terraform Registry MCP server is bundled inside the Docker image but may fail to start if the container is resource-constrained.

Fix:

Check container logs: docker logs aws-orchestrator-agent --tail 50
Ensure the container has at least 2GB memory allocated
If running from source, verify the Terraform MCP server is installed and accessible
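
To confirm the memory allocation, the container's configured limit can be read with docker inspect and compared against the 2GB minimum. This is a rough sketch: memory_limit_ok is a hypothetical helper, 2GB is interpreted as 2 GiB, and a reported limit of 0 is treated as "no limit set" (the host or Docker VM may still impose its own ceiling).

```shell
# Hypothetical helper: given a container memory limit in bytes, succeed when
# no limit is set (0) or the limit is at least 2 GiB.
memory_limit_ok() {
  limit_bytes="$1"
  min_bytes=$((2 * 1024 * 1024 * 1024))
  [ "$limit_bytes" -eq 0 ] || [ "$limit_bytes" -ge "$min_bytes" ]
}

# Usage (needs Docker and an existing container):
#   limit=$(docker inspect --format '{{.HostConfig.Memory}}' aws-orchestrator-agent)
#   memory_limit_ok "$limit" || echo "Memory limit below 2 GiB: $limit bytes"
```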

6. Stale skills produce outdated code

Symptom: Generated code uses deprecated resource arguments or old provider syntax despite the Terraform MCP having newer data.

Cause: Skills are cached in the virtual filesystem. If a skill was created with an older provider version and the provider has since been updated, the generator follows the stale blueprint.

Fix: Ask the agent to regenerate the skill:

"Regenerate the VPC module from scratch — ignore cached skills and re-research from the Terraform Registry."

The planner checks provider version metadata in the skill's SKILL.md frontmatter. If the version is stale, it automatically re-runs the research pipeline. You can also manually delete the cached skill directory:

rm -rf skills/{service}-module-generator/

7. UI streaming freezes mid-generation

Symptom: The TalkOps UI freezes while the backend is still running (visible in Docker logs).

Cause: Transient disconnect in the A2A streaming pipeline during a long sub-agent tool call.

Fix: Refresh your browser — don't restart the container. The LangGraph checkpointer preserves state, so the UI will reconnect and resume from exactly where it left off.

8. Memory files not loading

Symptom: The agent doesn't follow the org standards or human-in-the-loop (HITL) policies you've configured.

Cause: Memory files in the memory/ directory aren't being seeded into the virtual filesystem at startup.

Fix:

Verify memory/AGENTS.md exists — it's the index file the coordinator reads first
Check file permissions: all .md files in memory/ must be readable
If using Docker, ensure the memory/ directory is inside the container (it's baked into the image by default)
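
When running from source, the first two checks can be scripted. The sketch below is an illustration only: check_memory_dir is a hypothetical name, and it inspects just the top-level .md files in memory/, not nested directories.

```shell
# Hypothetical helper: check that memory/AGENTS.md exists and that every
# top-level .md file in memory/ is readable by the current user.
check_memory_dir() {
  memory_dir="${1:-memory}"

  if [ ! -f "$memory_dir/AGENTS.md" ]; then
    echo "Missing index file: $memory_dir/AGENTS.md"
    return 1
  fi

  status=0
  for f in "$memory_dir"/*.md; do
    if [ ! -r "$f" ]; then
      echo "Not readable: $f"
      status=1
    fi
  done
  return "$status"
}
```

Run check_memory_dir from the project root; it prints each problem it finds and returns a non-zero status if any check fails.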

Accessing Logs

For debugging, fetch the runtime logs from Docker:

# Recent logs
docker logs aws-orchestrator-agent --tail 200

# Follow live
docker logs aws-orchestrator-agent -f

# Filter for errors
docker logs aws-orchestrator-agent 2>&1 | grep -i error

FAQ

Which AWS services does it support?

Any service supported by the AWS Terraform provider. The agent doesn't have a hardcoded list — the planner researches each service dynamically via the Terraform MCP server. VPC, S3, EC2, RDS, EKS, Lambda, IAM, CloudFront, and any other AWS provider resource all work.

Which LLMs work?

Google Gemini, OpenAI, Anthropic, AWS Bedrock, and Azure OpenAI. Set LLM_PROVIDER and LLM_MODEL in your .env. The default config uses Gemini. Each of the three tiers can use a different provider if needed.

Will it commit code without asking?

No. Two mandatory approval gates are enforced: (1) after validation passes — push to GitHub or keep local, (2) after completion — generate another module, update, or done. Destructive operations always require explicit approval, no exceptions.

Does it work with private GitHub repos?

Yes — your GITHUB_PERSONAL_ACCESS_TOKEN needs repo scope. For organization repos, ensure the token has the appropriate org-level permissions.

How does it avoid hallucinating provider configs?

The planner queries the Terraform Registry MCP server for the latest provider documentation before generating any code. It writes a skill blueprint with the exact resource arguments, variable types, and provider version constraints. The generator follows this blueprint — not its training data.

What if the generated module fails validation?

The coordinator re-dispatches the generator with the error details and re-runs validation. If validation fails again after the retry, it reports the errors and stops rather than looping forever. The middleware limits (20 write calls, 60 total tool calls) provide additional guardrails.

Does it need Terraform CLI installed?

For Docker: no. The Docker image includes Terraform CLI and the Terraform MCP server. For local development: yes, you need Terraform CLI installed on your machine for sandbox validation.

How do I connect a client?

AWS Orchestrator speaks the A2A protocol. Any A2A client works. The included docker-compose.yml ships with TalkOps UI at localhost:8080. The agent runs at localhost:10104.
