Job Description
Position Overview
Monolith AI is seeking an experienced QA Engineer to lead load testing efforts for a critical system
release focused on improving concurrency and high request load handling. This fast-paced,
short-term engagement requires someone who can quickly understand complex distributed systems,
design comprehensive load tests, and work collaboratively with a rapidly growing engineering team
to ensure our new environment meets performance requirements.
Primary Responsibilities
- Design and Implement Automated Load Testing Framework
◦ Develop comprehensive load tests for FastAPI endpoints, Temporal workflows/
activities, and AWS service interactions
◦ Create realistic test scenarios simulating concurrent workflow execution patterns,
including graph-based workflow orchestration
◦ Build automated test suites that measure system behavior under varying concurrency
levels and request loads
- Performance Analysis and Bottleneck Identification
◦ Monitor and analyze system performance across the entire stack (API layer,
Temporal workers, AWS services)
◦ Identify concurrency limitations in Temporal workflow execution, AWS service
limits (Athena, ECS), and inter-component communication
◦ Document performance characteristics including response times, throughput limits,
and failure modes under load
- Collaborate on Non-Functional Requirements (NFR) Definition
◦ Work with Customer Success and Product teams to understand business
requirements and translate them into measurable performance criteria
◦ Iterate on acceptable concurrency thresholds, latency targets, and throughput
requirements
◦ Validate that proposed NFRs are realistic and achievable given architectural
constraints
- System Documentation and Knowledge Extraction
◦ Build an understanding of the existing system through code review, discussions with the
development team, and exploratory testing
◦ Create clear documentation of test methodologies, results, and recommendations for
future testing
- Recommendation and Optimization Guidance
◦ Provide actionable recommendations for removing identified bottlenecks
◦ Suggest configuration optimizations for Temporal (worker pools, task queues) and
AWS services (Athena concurrency, ECS capacity)
- Rapid Communication and Status Reporting
◦ Maintain daily/frequent communication with the Tech Lead regarding project
progress, blockers, and findings
◦ Quickly escalate issues that could impact the aggressive timeline
◦ Present findings and recommendations to technical and non-technical stakeholders
- Cross-Component Integration Testing
◦ Test complex scenarios involving graph execution triggering node workflows across
multiple system boundaries
◦ Validate S3 read/write operations under concurrent load
◦ Ensure inter-component communication (API → Temporal, Temporal Activity → API triggers)
performs reliably at scale
Key Performance Indicators
- Test Coverage and Execution
◦ Complete an automated load test suite covering all critical components within the first 3
weeks
◦ Execute baseline and progressive load tests identifying maximum sustainable
concurrency levels
- Bottleneck Identification and Impact
◦ Identify and document top 5-7 performance bottlenecks with clear impact analysis
◦ Provide actionable remediation recommendations with estimated effort and impact
for each bottleneck
- NFR Definition and Validation
◦ Collaborate with stakeholders to define measurable NFRs within the first 2 weeks
◦ Validate that the system meets the agreed NFR criteria, or document the gaps, by project end
- Documentation and Knowledge Transfer
◦ Deliver comprehensive test documentation, results analysis, and system performance
characteristics
◦ Conduct knowledge transfer sessions so the team can maintain and extend the testing
framework
- Project Velocity and Communication
◦ Meet weekly milestone targets in this fast-paced 2-month engagement
◦ Maintain proactive communication rhythm (daily standups, weekly detailed reports
to Tech Lead)
Required Qualifications
Experience:
- 4+ years of experience in QA/performance testing roles
- 2+ years of hands-on experience with load testing distributed systems and microservices
architectures
- Proven experience with load testing tools (e.g., k6, JMeter, Locust, Gatling, Artillery)
- Experience testing workflow orchestration systems (Temporal, Airflow, Prefect, or similar)
- Demonstrated ability to test systems integrating with AWS services (particularly Athena,
ECS, S3)
Technical Skills:
- Strong proficiency in Python (required for test automation and working with FastAPI/
Temporal)
- Experience with REST API testing and performance validation
- Understanding of distributed systems concepts: concurrency, queueing, backpressure, rate
limiting
- Familiarity with AWS infrastructure and service limits
- Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, or
similar)
- Proficiency with Git and CI/CD pipelines
- Ability to read and understand code in order to design effective tests
Immediate Availability:
- Ability to start in early January 2025 and commit to a focused 3-month engagement
- Availability for full-time contract work during project duration
Preferred Qualifications
- Direct experience with http://Temporal.io (workflows, activities, workers)
- Experience with containerized workloads and Docker/ECS
- Prior work in fast-paced startup or scale-up environments
- Experience with infrastructure-as-code (Terraform, CloudFormation)
- Background in Site Reliability Engineering (SRE) or DevOps practices
- Familiarity with data processing pipelines and analytics systems
- Previous contract/consulting experience with rapid knowledge acquisition
- Experience with graph-based workflow systems or DAG execution engines
- Knowledge of AWS service limits and optimization strategies
Essential Soft Skills
Self-Direction and Initiative:
- Ability to operate independently in an ambiguous, fast-moving environment with minimal
documentation
- Proactive problem-solving mindset; doesn't wait for perfect information before taking action
- Comfortable making pragmatic decisions quickly in a time-constrained project
Communication and Collaboration:
- Exceptional communication skills for extracting knowledge through conversations with
existing team members
- Ability to translate technical findings into clear, actionable recommendations for diverse
audiences
- Comfortable asking clarifying questions and challenging assumptions respectfully
- Strong written communication for documentation and status updates
Adaptability and Learning Agility:
- Quick learner who can rapidly understand complex, poorly documented systems
- Flexible and comfortable with changing priorities in a 15-person team that's doubling in size
- Thrives in fast-paced environments with aggressive timelines
Pragmatism and Results Orientation:
- Focused on delivering practical, actionable outcomes within tight timeframes
- Understands the balance between thoroughness and speed in a 2-month engagement
- Comfortable with "good enough" when perfect isn't achievable within constraints
Stakeholder Management:
- Skilled at managing expectations with technical leadership about realistic timelines and
trade-offs
- Diplomatic when delivering difficult news about performance limitations or bottlenecks
- Collaborative approach when working with CS and Product on NFR definition
Key Challenges in This Role
- Rapid Knowledge Acquisition with Limited Documentation
◦ The existing system lacks comprehensive documentation, requiring you to quickly
build understanding through code review, system exploration, and frequent
discussions with the development team
◦ Success requires comfort with ambiguity and strong investigative skills
- Aggressive Timeline with High Impact
◦ A 3-month timeline to design tests, execute comprehensive load testing, identify
bottlenecks, and deliver actionable recommendations is extremely tight
◦ Must balance thoroughness with pragmatism; prioritize ruthlessly to ensure critical
areas are covered
- Complex Distributed System with Multiple Integration Points
◦ The system involves multiple layers (FastAPI, Temporal, AWS services) with
complex inter-component communication patterns (graph → node workflows)
◦ Must understand the entire stack sufficiently to design realistic, comprehensive load
tests that expose real-world bottlenecks