San Francisco · On Site · Full Time
Judgment Labs is building the infrastructure for continual learning in long-horizon AI agents.
The next generation of agents will not improve from prompts alone. They will improve from experience: the tasks they attempt, the tools they use, the mistakes they make, the edge cases they encounter, and the outcomes they produce in production. The hard part is turning that raw experience into high-quality data that can actually improve the system.
Judgment builds the infrastructure to do that. We turn long agent trajectories into clean, structured data for evals, labeling, rubric generation, context engineering, and RL workflows. Instead of only showing teams what happened, Judgment helps decide what matters, what should be learned from, and how that learning should flow back into the agent.
Databricks built the data infrastructure for analytics. Judgment is building the learning infrastructure for agents.
We’ve raised $30M+ from Lightspeed, SV Angel, Valor Equity Partners, and others.
We’re looking for a Senior Cloud Infrastructure Engineer to own the infrastructure that lets Judgment run reliably across our cloud, customer environments, and enterprise deployments.
This role focuses on cloud/platform infrastructure: Terraform, EKS, ArgoCD/Kargo, IAM, DNS, observability, CI/CD, multi-region architecture, BYOC, self-hosted deployments, private connectivity, and enterprise-grade reliability.
You’ll work on the systems that keep high-throughput telemetry ingestion, ClickHouse, RabbitMQ, Temporal, evaluation workers, and customer-facing services running under real production load. You’ll also make Judgment deployable for customers with strict security and infrastructure requirements: multi-region, data residency, private networking, self-hosted, air-gapped, and Bring Your Own Cloud environments.
Enterprise-grade deployment architecture. Run Judgment in customer environments (self-hosted, air-gapped, or BYOC) while keeping operations, upgrades, observability, and reliability sane.
Multi-region reliability. Design failover, disaster recovery, data residency, and deployment patterns for customers that cannot tolerate downtime or ambiguous data movement.
Infrastructure for high-throughput telemetry. Support ingestion systems parsing and persisting hundreds of thousands of spans per second, with graceful backpressure and clear failure modes.
Operating stateful systems at scale. Keep ClickHouse, RabbitMQ, Temporal, evaluation workers, and supporting services healthy as workloads grow and customer traffic becomes spiky.
Private and secure connectivity. Build secure paths into customer environments using network isolation, IAM, SSO/SAML/SCIM, encryption, private connectivity, and restricted-network deployment patterns.
A single operational story across many deployment modes. Cloud, multi-region, BYOC, and self-hosted deployments should not become four totally different products to operate.
Safe production rollouts. Build deployment automation, environment parity, feature-flag discipline, CI/e2e reliability, monitoring, and rollback mechanisms so the team can move fast without breaking customer trust.
Own cloud infrastructure for production services across Terraform, EKS, ArgoCD/Kargo, IAM, DNS, networking, metrics, CI/CD, and deployment automation.
Build and operate infrastructure for trace ingestion, evaluation workers, RabbitMQ, Temporal, ClickHouse, and the systems that support Judgment’s core product.
Design multi-region and enterprise deployment architectures, including data residency, automatic failover, disaster recovery, and customer-managed environments.
Build secure deployment patterns for BYOC, self-hosted, private-network, and restricted environments.
Implement private connectivity, identity integrations, network isolation, encryption patterns, and enterprise security requirements.
Improve observability, alerting, runbooks, incident response, and operational tooling so the team can debug root causes quickly rather than chase symptoms.
Partner with backend engineers on reliability, scaling limits, queue behavior, storage growth, ingestion throughput, and production incidents.
Make deployments safer and faster through automation, rollout strategies, environment parity, CI reliability, e2e test health, and better internal tooling.
Work directly with customers when deployment, networking, security, or production environment constraints are the blocker.
Raise the bar for infrastructure quality through design docs, code reviews, operational rigor, and clean abstractions.
Strong experience designing, building, and operating production cloud infrastructure for real customer-facing systems.
Deep understanding of distributed systems failure modes, especially how stateful services, queues, networking, and storage behave under degraded networks, partial outages, and regional failures.
Strong programming ability in a modern language and a bias toward automating repeated operational work.
Experience with Kubernetes / EKS or similar orchestration systems, infrastructure-as-code, CI/CD, cloud networking, IAM, DNS, and production observability.
Ability to reason about reliability, security, deployment ergonomics, and developer velocity at the same time.
Experience owning infrastructure systems from design through implementation, rollout, incident response, and long-term maintenance.
Comfort working directly with customers on enterprise deployment, networking, compliance, or security constraints.
Clear written communication. You can write architecture proposals, operational runbooks, incident notes, and crisp tradeoff docs.
Experience with Terraform, EKS, ArgoCD, Kargo, AWS networking, IAM, DNS, and production metrics/logging systems.
Experience operating ClickHouse, RabbitMQ, Temporal, Kafka, or other stateful production infrastructure.
Experience with private connectivity such as AWS PrivateLink, Azure Private Link, or GCP Private Service Connect.
Experience building BYOC, self-hosted, air-gapped, hybrid-cloud, or enterprise SaaS deployment models.
Experience with SSO, SAML, SCIM, secrets management, encryption, network isolation, and enterprise security reviews.
Experience with observability infrastructure, telemetry ingestion, or platforms like Datadog, Honeycomb, Sentry, or similar systems.
Experience supporting AI infrastructure, LLM evaluation workloads, or high-throughput event pipelines.
We’re building the learning infrastructure for agents. As agents move from demos to production, the bottleneck is no longer just better prompts. It is turning real production experience into high-quality data for evals, labeling, rubric generation, context engineering, and RL workflows.
Infrastructure is a product requirement here. Customers need Judgment to run reliably across our cloud, enterprise environments, and customer-managed deployments. Deployment quality directly affects whether they can use us.
The systems are real. High-throughput ingestion, stateful services, workflow orchestration, ClickHouse, LLM scoring, multi-region reliability, and BYOC all show up early.
This is a Databricks-scale infrastructure opportunity. Databricks built the data infrastructure for analytics. Judgment is building the learning infrastructure for agents.
You’ll have broad ownership. This is a small team, so infrastructure engineers own architecture, implementation, operations, and customer deployment outcomes.
In person in San Francisco. We work together in person because the problems are hard, the product is moving fast, and the feedback loops matter.