Site Reliability Engineer (AI Forms Platform)
Filevine
Software Engineering, Data Science
United States
Posted on Jan 7, 2026
Responsibilities
- Infrastructure as Code: Architect and deploy secure, scalable infrastructure using Terraform, CloudFormation, or similar tools to support the new Forms Platform.
- Availability & Uptime: Ensure the platform meets strict SLA requirements for enterprise clients, minimizing downtime and "P1 incidents".
- Observability: Implement comprehensive monitoring, logging, and alerting (Datadog, New Relic, etc.) to provide deep visibility into AI model performance and system health.
- Security & Compliance: Design architecture that aligns with SOC standards and ensures proper handling of PII/PHI data and audit trails for model outputs.
- Release Engineering: Build and maintain efficient CI/CD pipelines to support the "tapering" of legacy systems and the rapid deployment of new features.
- Incident Response: Lead incident response efforts for the Forms Platform and conduct post-mortems to drive continuous improvement.
- Automation: Aggressively automate manual operations tasks using scripting (Python/Go) and AI tools to reduce toil.
Qualifications
- Bachelor’s degree in Computer Science, Computer Engineering, or related field.
- 3+ years of SRE or DevOps experience, specifically in high-availability production environments.
- Cloud Proficiency: Deep expertise in the AWS or Azure ecosystem, including container orchestration (Kubernetes/Docker).
- Security Mindset: Experience implementing security best practices (SOC2, HIPAA) in a cloud environment.
- Scripting: Proficiency in Python, Go, or Bash for automation.
- Agile/Scrum: 1 to 3 years of experience with Scrum/Agile development methodologies.
- AI Adaptability: Willingness and ability to use AI/LLMs to accelerate infrastructure development and debugging.
- Communication: Excellent verbal and written communication skills to document architecture and incident reports.