Staging env overhaul

Case study: Checkr

About

Checkr is a leading background check platform that leverages advanced technology and AI to help companies make faster, fairer, and more efficient hiring decisions. It streamlines the screening process with solutions tailored for modern workforces, including gig, staffing, and enterprise businesses.

Problem

Over time, the internal staging environment came to be used for external customer testing, creating frequent conflicts and reliability issues.
Overlapping developer activity caused a 20% increase in customer-facing incidents, disrupting demos, user acceptance testing (UAT), and go-live preparations.
Deployment delays and unreliable environments eroded developer efficiency and customer confidence.
Resolving this issue was essential to improve operational scalability, support 100+ microservices across 30+ teams, and maintain trust with customers.

Action

Discovery

To address the issue, we explored different solutions, starting with a pilot of ephemeral developer environments using Okteto. This approach tested on-demand, isolated environments with six teams, aiming to simplify setup and enhance collaboration. However, the trials revealed significant challenges:

Even minimal configurations for barebones environments proved cumbersome to maintain.
Developers struggled to deploy services outside their teams, particularly when seed data or specific configurations were needed.
Managing 100+ services for ephemeral environments was inefficient and unsustainable.

These insights informed our pivot to a persistent developer environment with mock data, which, while requiring more initial effort, offered a scalable and maintainable solution that aligned with team workflows and customer needs.

Solution

We initially considered creating a new customer environment with fresh data but, after discussions with Customer Success, pivoted to dedicating the new environment to developers. While this decision minimized customer disruption, it required building trust and alignment across engineering, Customer Success, and product leadership.

During the planning and implementation, I worked with key teams to coordinate the migration across 30+ feature teams:

UI Platform Team: Assisted in the deployment of front-end applications and proper UI routing in the new environment.
DBRE Team: Managed the database dump process and worked with feature teams to conduct switchovers while preserving data integrity.
Core Services and API Platform Teams: Developed Kafka and Kong configurations for the new environment and assisted in migrating them.
Quality Productivity Team: Updated end-to-end pipelines and collaborated with feature teams to make the new environment the primary CI/CD target.
SRE Team: Rolled out a new Datadog environment and developed a methodology to ensure proper monitors and alerts were configured across all services.
Infrastructure Team: Stood up the new EKS cluster for the environment and worked with teams to deploy services and mitigate issues in real time.

We assembled champions from each feature team to centralize communication and coordinate efforts. These champions served as points of contact for blockers and escalations, helped set priorities, and kept their teams aligned.

In addition to the above, my role included:

Defining the new environment scope and feasible best practices to implement as a part of the migration.
Managing updates across hundreds of repositories, including Helm charts, CI/CD pipelines, and deployment configurations.
Overseeing infrastructure setup, including Kafka, Redis, Vault, Kong, and the EKS cluster, to ensure seamless operation.
Partnering with Customer Success to track progress and ensure cross-functional goals were met.
Hosting workshops, creating documentation, and establishing engineering practices for adoption and long-term consistency.

Result

Over a 12-month period, we stood up a new environment and migrated all 100+ services to it. The implementation delivered significant improvements in two key DORA metrics:

Deployment Frequency: Increased by 35%, as teams gained a stable and isolated testing environment, enabling more frequent releases.
Lead Time for Changes: Reduced by 25%, as developers experienced faster feedback loops and fewer blockers in the deployment process.

These improvements not only enhanced developer efficiency but also reduced customer-facing incidents by 40%, improving reliability for demos, UAT, and go-lives. The dedicated environment established a scalable platform for future growth, supporting developer psychological safety during deployment and thus reducing time-to-market for new features.
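
A minimal sketch of how the two DORA metrics above can be computed from deployment records. The record fields (`committed_at`, `deployed_at`), the function names, and the sample data are hypothetical and chosen for illustration; they do not describe the actual tooling or data behind the figures in this case study.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    # Hypothetical record shape: one row per production deployment.
    service: str
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when the change reached production


def deployment_frequency(deploys: list[Deployment], window_days: int) -> float:
    """Average number of deployments per day over the window."""
    return len(deploys) / window_days


def median_lead_time(deploys: list[Deployment]) -> timedelta:
    """Median time from commit to production deployment."""
    seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deploys]
    return timedelta(seconds=median(seconds))


def percent_change(before: float, after: float) -> float:
    """Relative change from a baseline, in percent (positive = increase)."""
    return (after - before) * 100 / before


# Illustrative sample data only.
start = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment("svc-a", start, start + timedelta(hours=1)),
    Deployment("svc-b", start, start + timedelta(hours=2)),
    Deployment("svc-c", start, start + timedelta(hours=3)),
    Deployment("svc-a", start, start + timedelta(hours=6)),
]

frequency = deployment_frequency(sample, window_days=2)
lead_time = median_lead_time(sample)
```

Comparing the same metric across two equal-length windows (for example, before and after a migration) with `percent_change` gives the kind of relative figure quoted above. Median is used rather than mean so that a single slow deployment does not dominate the lead time.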

Key Challenges and Resolutions

Resistance to Change
Some teams were initially hesitant about the migration due to concerns over its impact on their timelines.

Resolution: Phased rollouts and pilot migrations with smaller teams built confidence and generated early wins to gain broader buy-in.

Conflicting Priorities
Balancing the migration with ongoing feature development created scheduling conflicts.

Resolution: Leveraged champions and centralized communication to align priorities and reduce duplicated efforts between the old and new environments.

Lessons Learned

Flexibility Drives Success: Pivoting to move developers instead of customers highlighted the importance of remaining flexible and advocating for customer-first solutions. This approach minimized disruption and aligned with business goals.
Stakeholder Alignment is Critical: Building early consensus among engineering, Customer Success, and product leadership was key to ensuring a unified vision and smooth execution.
Discovery Shapes Better Solutions: Exploring ephemeral environments revealed critical insights that informed the ultimate decision to build a dedicated environment, demonstrating the value of iterative discovery.
Champions Accelerate Change: Empowering champions across teams ensured clear communication, faster resolution of blockers, and seamless adoption of the new environment.

Steven Yuan