DevOps Engineer

American AI Logistics
Washington, DC

Job Details

Full-time
$145,000 - $170,000 a year
10 hours ago

Benefits

Health insurance
Dental insurance
401(k)
Vision insurance

Qualifications

Network topology design
Containerization systems
AWS IAM
CI/CD
CMMC
Cloud identity and access management (IAM)
Cloud Logging
FedRAMP
Continuous Delivery (CD) implementation
Amazon EC2
Ansible
Tooling
Infrastructure as Code (IaC)
Database backup and recovery
SOC 2
Data recovery
Configuration management
Database security hardening
Server backup and recovery
Application deployment
Security compliance frameworks implementation
Database security management
Bash
Database server management
Production systems
AWS
WAF
IT monitoring tools
Docker
PostgreSQL

Full Job Description

DevOps Engineer - Cloud Infrastructure & Platform Reliability

Location: Remote

About AAIL

American AI Logistics (AAIL) is a defense technology company scaling rapidly. We are at the most consequential moment in our short history — growing from a startup into a full-scale defense technology platform. We move fast, win big, and are looking for driven individuals who want to be part of something that matters.

About the Role

We’re looking for a DevOps / Infrastructure Engineer to own the cloud platform that powers a B2B SaaS product serving U.S. government procurement. Our application runs on AWS and processes sensitive customer data under multiple compliance frameworks, with reliability, security, and auditability as common requirements. These aren’t aspirational goals; they’re contractual obligations.

You’ll be responsible for the full infrastructure lifecycle: provisioning and hardening cloud resources, building and maintaining CI/CD pipelines, managing container deployments, and keeping production systems available and secure. This is a small team; you won’t be filing tickets for someone else to execute. You’ll design it, build it, ship it, and carry the pager for it.

What You’ll Work On

Infrastructure as Code & Cloud Architecture

Designing, provisioning, and managing AWS infrastructure using Terraform, including resources such as VPCs, subnets, security groups, ALBs, ECS clusters, RDS instances, S3 buckets, KMS keys, and IAM policies
Managing multi-region AWS environments with environment promotion patterns (dev → stage → production) and strict resource isolation between environments
Maintaining Terraform state management, module organization, and CI-driven plan/apply workflows with proper review gates
Designing network architecture — public/private subnet segmentation, NAT gateways, security group rules following least-privilege principles, and WAF configurations

Containers & Compute

Building and optimizing Docker images for production services running on ECS Fargate, including multi-stage builds, image size reduction, and layer caching strategies
Managing ECR container registries with lifecycle policies, image scanning (Trivy, AWS Inspector), and automated vulnerability remediation workflows
Configuring ECS service definitions, task placement, auto-scaling policies, and multi-AZ deployment for fault tolerance
Managing EC2 instances where needed (e.g., virtual desktop infrastructure, bastion access, specialized compute) using Ansible for configuration management and patching

CI/CD & Deployment

Building and maintaining CI/CD pipelines (Bitbucket Pipelines or equivalent) with linting, SAST, dependency scanning, IaC scanning, and automated tests as CI gates
Implementing deployment strategies — rolling deployments, blue/green, canary releases — with automated rollback capabilities
Managing branch-promotion deployment models with required approvals, environment-specific configurations, and secrets injection at deploy time
Automating database migration execution as part of the deployment pipeline with safety checks and rollback procedures

Monitoring, Observability & Incident Response

Building and maintaining the monitoring stack — CloudWatch metrics, alarms, dashboards, and log aggregation pipelines (CloudWatch Logs → Kinesis Firehose → S3 for long-term retention)
Configuring alert routing by severity — PagerDuty for critical/major incidents, Slack for informational notifications — with on-call escalation policies
Triaging GuardDuty security findings and AWS Inspector vulnerability reports, converting them into actionable remediation tickets with severity-based SLA targets
Participating in on-call rotation, incident response, and post-incident reviews. Carrying the pager and owning the resolution.

Security, Secrets & Encryption

Managing IAM policies, resource-based policies, service control policies, permissions boundaries, roles, and trust relationships following least-privilege principles for human users, CI/CD pipelines, and service-to-service access
Administering secrets management using AWS Secrets Manager and SSM Parameter Store, including rotation policies, access controls, and runtime injection patterns
Managing KMS customer-managed keys (CMKs) for encryption at rest across S3, RDS, and other services, including key rotation, key policies, and documented exceptions
Implementing and maintaining just-in-time privileged access mechanisms (e.g., SSM Session Manager) for production database and infrastructure access with full audit logging
Managing TLS certificates via ACM, reviewing supported cipher suites and protocols, and hardening edge configurations

Database & Messaging Infrastructure

Managing RDS PostgreSQL instances — provisioning, parameter tuning, backup configuration, point-in-time recovery, and semi-annual restore testing
Managing message broker infrastructure (Amazon MQ / RabbitMQ): cluster configuration, queue topology, and monitoring of consumer lag and dead-letter queues
Capacity planning and cost optimization across compute, storage, and data transfer

Compliance & Audit

Supporting compliance requirements for SOC 2 and other frameworks such as ISO 27001 and FedRAMP, including implementing and evidencing controls for security, availability, and confidentiality
Maintaining CloudTrail organization trails, log retention policies, and audit trail integrity for compliance evidence
Operating vulnerability management automation — scanning, ticket creation with severity-based remediation SLAs, and exception handling
Supporting quarterly access reviews, documenting infrastructure changes, and maintaining runbooks for operational procedures

What We’re Looking For

Required

3+ years of experience operating production infrastructure on AWS in a security-conscious environment
Strong Terraform skills — module design, state management, workspace/environment patterns, and CI-driven apply workflows. You should be writing Terraform daily, not occasionally.
Deep familiarity with ECS Fargate (or equivalent container orchestration) — task definitions, service discovery, scaling, health checks, and deployment strategies
Solid Docker experience — writing production Dockerfiles, optimizing builds, debugging container runtime issues, and managing image registries
Strong understanding of AWS networking — VPC design, subnet architecture, security groups, NACLs, ALB/NLB configuration, WAF rules, and NAT gateways
Proficiency in IAM — policies, roles, trust relationships, service-linked roles, and the principle of least privilege applied rigorously, not aspirationally
Experience with RDS PostgreSQL — provisioning, parameter groups, backup/restore, read replicas, and performance monitoring
Comfortable with Bash scripting and general-purpose automation (Python is a plus)
Experience building and maintaining CI/CD pipelines with quality gates, artifact management, and environment promotion
Understanding of deployment strategies (rolling, blue/green, canary) and when to use each
Familiarity with monitoring and alerting — CloudWatch (or Datadog/Grafana), log aggregation, and on-call incident response workflows
Experience with SOC 2 or equivalent compliance frameworks; you’ve implemented controls, gathered evidence, and worked with auditors, not just read about it

Preferred

Experience with Ansible for EC2 configuration management, patching, and fleet operations
Familiarity with Lambda and serverless patterns — event-driven automation, monitoring integrations, and cost-effective background processing
Experience managing Amazon MQ, RabbitMQ, or similar message broker infrastructure in production
Familiarity with KMS key management, Secrets Manager rotation, and encryption-at-rest strategies across AWS services
Experience with AWS Inspector, Trivy, Checkov, or similar vulnerability and IaC scanning tools integrated into CI pipelines
Familiarity with SSM Session Manager or similar just-in-time access patterns for production environments
Experience with AWS Organizations, Control Tower, or multi-account strategies
Exposure to FedRAMP, ISO 27001, NIST 800-53, or CMMC compliance frameworks
Experience managing virtual desktop infrastructure (AWS WorkSpaces or similar) for contractor access in regulated environments
Familiarity with cost optimization — reserved instances, savings plans, right-sizing, and tagging strategies for chargeback/allocation
Experience with GitOps workflows and infrastructure drift detection

Tech Environment

Cloud: AWS (ECS Fargate, EC2, S3, RDS, Lambda, KMS, CloudWatch, WAF, ALB)
IaC: Terraform (Terraform Cloud + S3 state backends)
Containers: Docker, ECR, ECS Fargate
CI/CD: Bitbucket Pipelines (SAST, dependency scanning, IaC scanning)
Config Mgmt: Ansible (EC2 fleet), SSM Parameter Store
Networking: VPC, Security Groups, NAT, ALB, WAF, ACM (TLS)
Secrets: AWS Secrets Manager, SSM Parameter Store, KMS CMKs
Monitoring: CloudWatch, Kinesis Firehose, SNS, PagerDuty, GuardDuty
Security: IAM, AWS Inspector, Trivy, Checkov, SentinelOne
Database: RDS PostgreSQL (multi-AZ, 365-day backup retention)
Messaging: Amazon MQ (RabbitMQ)
Access: SSM Session Manager (JIT), Tailscale, AWS WorkSpaces
Compliance: SOC 2, Drata (evidence collection)

What You Won’t Find Here

A ticket-taking role where someone else designs the architecture. You’ll own infrastructure end-to-end, from design through production operation.
Kubernetes. We run on ECS Fargate deliberately: less operational overhead, right-sized for our scale.
A pure ops role with no development. You’ll write Terraform modules, CI/CD pipelines, automation scripts, and Lambda functions. Infrastructure is code here, not clickops.
A large platform team. We’re small. You’ll work directly with engineering, security, and leadership. Low bureaucracy, high ownership.

Pay: $145,000.00 - $170,000.00 per year

Benefits:

401(k)
Dental insurance
Health insurance
Vision insurance

Work Location: Remote
