DevOps Engineer - Cloud Infrastructure & Platform Reliability
Location: Remote
About AAIL
American AI Logistics (AAIL) is a defense technology company scaling rapidly. We are at the most consequential moment in our short history — growing from a startup into a full-scale defense technology platform. We move fast, win big, and are looking for driven individuals who want to be part of something that matters.
About the Role
We’re looking for a DevOps / Infrastructure Engineer to own the cloud platform that powers a B2B SaaS product serving U.S. government procurement. Our application runs on AWS and processes sensitive customer data under multiple compliance frameworks. Reliability, security, and auditability aren’t aspirational goals; they’re contractual obligations.
You’ll be responsible for the full infrastructure lifecycle: provisioning and hardening cloud resources, building and maintaining CI/CD pipelines, managing container deployments, and keeping production systems available and secure. This is a small team, so you won’t be filing tickets for someone else to execute. You’ll design it, build it, ship it, and carry the pager for it.
What You’ll Work On
Infrastructure as Code & Cloud Architecture
- Designing, provisioning, and managing AWS infrastructure using Terraform, including resources such as VPCs, subnets, security groups, ALBs, ECS clusters, RDS instances, S3 buckets, KMS keys, and IAM policies
- Managing multi-region AWS environments with environment promotion patterns (dev → stage → production) and strict resource isolation between environments
- Maintaining Terraform state management, module organization, and CI-driven plan/apply workflows with proper review gates
- Designing network architecture — public/private subnet segmentation, NAT gateways, security group rules following least-privilege principles, and WAF configurations
Containers & Compute
- Building and optimizing Docker images for production services running on ECS Fargate, including multi-stage builds, image size reduction, and layer caching strategies
- Managing ECR container registries with lifecycle policies, image scanning (Trivy, AWS Inspector), and automated vulnerability remediation workflows
- Configuring ECS service definitions, task placement, auto-scaling policies, and multi-AZ deployment for fault tolerance
- Managing EC2 instances where needed (e.g., virtual desktop infrastructure, bastion access, specialized compute) using Ansible for configuration management and patching
CI/CD & Deployment
- Building and maintaining CI/CD pipelines (Bitbucket Pipelines or equivalent) with linting, SAST, dependency scanning, IaC scanning, and automated tests as CI gates
- Implementing deployment strategies — rolling deployments, blue/green, canary releases — with automated rollback capabilities
- Managing branch-promotion deployment models with required approvals, environment-specific configurations, and secrets injection at deploy time
- Automating database migration execution as part of the deployment pipeline with safety checks and rollback procedures
Monitoring, Observability & Incident Response
- Building and maintaining the monitoring stack — CloudWatch metrics, alarms, dashboards, and log aggregation pipelines (CloudWatch Logs → Kinesis Firehose → S3 for long-term retention)
- Configuring alert routing by severity — PagerDuty for critical/major incidents, Slack for informational notifications — with on-call escalation policies
- Triaging GuardDuty security findings and AWS Inspector vulnerability reports, converting them into actionable remediation tickets with severity-based SLA targets
- Participating in on-call rotation, incident response, and post-incident reviews. Carrying the pager and owning the resolution.
Security, Secrets & Encryption
- Managing IAM policies, resource-based policies, service control policies, permissions boundaries, roles, and trust relationships following least-privilege principles for human users, CI/CD pipelines, and service-to-service access
- Administering secrets management using AWS Secrets Manager and SSM Parameter Store, including rotation policies, access controls, and runtime injection patterns
- Managing KMS customer-managed keys (CMKs) for encryption at rest across S3, RDS, and other services, including key rotation, key policies, and documentation of exceptions
- Implementing and maintaining just-in-time privileged access mechanisms (e.g., SSM Session Manager) for production database and infrastructure access with full audit logging
- Managing TLS certificates via ACM, reviewing supported cipher suites and protocols, and hardening edge configurations
Database & Messaging Infrastructure
- Managing RDS PostgreSQL instances — provisioning, parameter tuning, backup configuration, point-in-time recovery, and semi-annual restore testing
- Managing message broker infrastructure (Amazon MQ / RabbitMQ) — cluster configuration, queue topology, monitoring for consumer lag and dead-letter queues
- Capacity planning and cost optimization across compute, storage, and data transfer
Compliance & Audit
- Supporting compliance requirements for SOC 2 and other frameworks such as ISO 27001 and FedRAMP, including implementing and evidencing controls for security, availability, and confidentiality
- Maintaining CloudTrail organization trails, log retention policies, and audit trail integrity for compliance evidence
- Operating vulnerability management automation — scanning, ticket creation with severity-based remediation SLAs, and exception handling
- Supporting quarterly access reviews, documenting infrastructure changes, and maintaining runbooks for operational procedures
What We’re Looking For
Required
- 3+ years of experience operating production infrastructure on AWS in a security-conscious environment
- Strong Terraform skills — module design, state management, workspace/environment patterns, and CI-driven apply workflows. You should be writing Terraform daily, not occasionally.
- Deep familiarity with ECS Fargate (or equivalent container orchestration) — task definitions, service discovery, scaling, health checks, and deployment strategies
- Solid Docker experience — writing production Dockerfiles, optimizing builds, debugging container runtime issues, and managing image registries
- Strong understanding of AWS networking — VPC design, subnet architecture, security groups, NACLs, ALB/NLB configuration, WAF rules, and NAT gateways
- Proficiency in IAM — policies, roles, trust relationships, service-linked roles, and the principle of least privilege applied rigorously, not aspirationally
- Experience with RDS PostgreSQL — provisioning, parameter groups, backup/restore, read replicas, and performance monitoring
- Comfortable with Bash scripting and general-purpose automation (Python is a plus)
- Experience building and maintaining CI/CD pipelines with quality gates, artifact management, and environment promotion
- Understanding of deployment strategies (rolling, blue/green, canary) and when to use each
- Familiarity with monitoring and alerting — CloudWatch (or Datadog/Grafana), log aggregation, and on-call incident response workflows
- Experience with SOC 2 or equivalent compliance frameworks; you’ve implemented controls, gathered evidence, and worked with auditors, not just read about them
Preferred
- Experience with Ansible for EC2 configuration management, patching, and fleet operations
- Familiarity with Lambda and serverless patterns — event-driven automation, monitoring integrations, and cost-effective background processing
- Experience managing Amazon MQ, RabbitMQ, or similar message broker infrastructure in production
- Familiarity with KMS key management, Secrets Manager rotation, and encryption-at-rest strategies across AWS services
- Experience with AWS Inspector, Trivy, Checkov, or similar vulnerability and IaC scanning tools integrated into CI pipelines
- Familiarity with SSM Session Manager or similar just-in-time access patterns for production environments
- Experience with AWS Organizations, Control Tower, or multi-account strategies
- Exposure to FedRAMP, ISO 27001, NIST 800-53, or CMMC compliance frameworks
- Experience managing virtual desktop infrastructure (AWS WorkSpaces or similar) for contractor access in regulated environments
- Familiarity with cost optimization — reserved instances, savings plans, right-sizing, and tagging strategies for chargeback/allocation
- Experience with GitOps workflows and infrastructure drift detection
Tech Environment
Category: Technologies
Cloud: AWS (ECS Fargate, EC2, S3, RDS, Lambda, KMS, CloudWatch, WAF, ALB)
IaC: Terraform (Terraform Cloud + S3 state backends)
Containers: Docker, ECR, ECS Fargate
CI/CD: Bitbucket Pipelines (SAST, dependency scanning, IaC scanning)
Config Mgmt: Ansible (EC2 fleet), SSM Parameter Store
Networking: VPC, Security Groups, NAT, ALB, WAF, ACM (TLS)
Secrets: AWS Secrets Manager, SSM Parameter Store, KMS CMKs
Monitoring: CloudWatch, Kinesis Firehose, SNS, PagerDuty, GuardDuty
Security: IAM, AWS Inspector, Trivy, Checkov, SentinelOne
Database: RDS PostgreSQL (multi-AZ, 365-day backup retention)
Messaging: Amazon MQ (RabbitMQ)
Access: SSM Session Manager (JIT), Tailscale, AWS WorkSpaces
Compliance: SOC 2, Drata (evidence collection)
What You Won’t Find Here
- A ticket-taking role where someone else designs the architecture. You’ll own infrastructure end-to-end, from design through production operation.
- Kubernetes. We run on ECS Fargate deliberately: it means less operational overhead and is right-sized for our scale.
- A pure ops role with no development. You’ll write Terraform modules, CI/CD pipelines, automation scripts, and Lambda functions. Infrastructure is code here, not clickops.
- A large platform team. We’re small. You’ll work directly with engineering, security, and leadership. Low bureaucracy, high ownership.
Pay: $145,000.00 - $170,000.00 per year
Benefits:
- 401(k)
- Dental insurance
- Health insurance
- Vision insurance
Work Location: Remote