Why Testing in Production is the Ultimate Quality Safeguard
1. Introduction: The Reality of Modern Software Delivery
In the high-velocity world of modern engineering, the "textbook" approach to software perfection is a myth. We are currently in the era of "vibe coding" - where builders move fast, rely on intuition, and ship to get immediate value into the hands of customers. While high-speed delivery is the goal, it creates a constant tension with technical debt. Project timelines are rarely dictated by engineering ideals; they are driven by management whims, shifting deadlines, and the urgent need to prove a concept. Because resources are finite, we often take shortcuts on unit tests or end-to-end automation to hit a date.
The result is technical debt: an accumulating pile of deferred work (minor improvements, cleaner code, better accessibility) that gets set aside for the sake of the launch. We must accept a blunt reality: the system is never 100% perfect at release. The definition of "done" is not a static point in time but an evolving state of quality. Since software is a living entity built under pressure, our testing strategy must extend beyond the safety of sandboxes and into the live environment.
2. Challenging the Taboo: The Philosophy of Production Testing
For years, the industry has clung to a dogmatic absolute: "Never test in production." Strictly adhering to this taboo ignores the strategic necessity of validating software where it actually matters. Testing in production is not just a backup plan; it is a valid and critical tool, period. If you deploy a site and immediately visit the URL to see if it’s "up and running," you are testing in production.
The difference between a QA environment and production isn't about whether we test, but how we test. We must move from rigid dogma to a nuanced, risk-based perspective.
QA Environment Goals
  • High Information Density: Designed to provide exhaustive feedback on every minor code change.
  • Experimental Freedom: A playground for high-load, destructive, or complex operations.
  • Isolated Variables: Focuses on "textbook" scenarios and sandboxed, often stale, data.
Production Environment Realities
  • Careful Execution: Focuses on high-value, low-risk checks to avoid disrupting real users.
  • Risk Association: Every action is weighed against the cost of a stakeholder-facing failure.
  • Live Context: Validates the system in its final form with real data and real infrastructure.
3. The Risk-Based Framework: What to Test (and What to Avoid)
Strategy is about risk mitigation. Risks left unmitigated result in adverse effects for your team, your organization, and your users. As a Senior/Lead, your job isn't to run every test; it's to understand the business mission and what your stakeholders care about most. If you aren't finding issues that matter, you are wasting time. You must be professional enough to distinguish between a safe validation and a catastrophic gamble.
Safe and Valuable Tests
  • Database Integrity: Verifying that a specific row or column exists in the production database after a migration. If it’s a simple lookup that doesn't put unmanageable load on the system, there is no reason not to know if that data is there.
  • Public API Validation: Calling a public-facing, unauthenticated API - for example, one that returns the current temperature - to ensure it's returning correct data in the live environment.
  • Infrastructure as Code (IaC) Verification: After an IaC push, manually opening the cloud portal to verify that the resource was actually created and that the configurations match your file exactly.
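A database integrity check like the first bullet can be scripted as a cheap, read-only lookup. The sketch below uses an in-memory SQLite database as a stand-in for the production database; the table and column names (`users`, `signup_source`) are hypothetical, and in a real system the table name would come from trusted configuration, never user input.

```python
import sqlite3

def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Read-only check that a column exists on a table (no writes, no locks)."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    cursor = conn.execute(f"PRAGMA table_info({table})")
    return any(row[1] == column for row in cursor.fetchall())

def row_exists(conn: sqlite3.Connection, table: str, where: str, params: tuple) -> bool:
    """Cheap existence lookup; LIMIT 1 keeps the query load negligible."""
    cursor = conn.execute(f"SELECT 1 FROM {table} WHERE {where} LIMIT 1", params)
    return cursor.fetchone() is not None
```

Against a real production database you would point the connection at a read replica where one exists and keep the lookup indexed, so the check never becomes the "unmanageable load" this section warns about.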
High-Risk and Prohibited Tests
  • Performance and Load Testing: Running high-load operations that could bring down the system or degrade the experience for real users.
  • Destructive Operations: Running complex operations that risk locking the database or corrupting live user records.
  • Rate-Limiting Hazards: Any test that risks hitting production rate limits or accidentally locking real users out of their accounts.
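One cheap way to operationalize these prohibitions is a guard that refuses to run high-risk suites against a production target. This is a minimal sketch, not a standard pattern: the environment variable name (`TEST_TARGET_ENV`) and the tag names are assumptions, and the guard deliberately fails closed by treating an unset environment as production.

```python
import os

HIGH_RISK_TAGS = {"load", "destructive", "rate-limit"}

def assert_safe_to_run(test_tags: set, env: str = "") -> None:
    """Raise before a high-risk test can touch a production target."""
    # Fail closed: if nobody set the target environment, assume it is prod.
    env = env or os.environ.get("TEST_TARGET_ENV", "prod")
    risky = test_tags & HIGH_RISK_TAGS
    if env == "prod" and risky:
        raise RuntimeError(
            f"Refusing to run {sorted(risky)} tests against production"
        )
```

Wiring a check like this into the test runner turns "we agreed not to load-test prod" from a team convention into an enforced rule.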
4. Operationalizing Production Testing: Real-World Scenarios
The strategic value of production testing is "finding it sooner rather than later." Identifying a failure before a user reports it allows for a rapid hotfix, preventing the support team from being slammed with a million tickets.
Scenario: Post-Deployment Verification
  • Action: Manually navigate the live site or app immediately after release to ensure it is "up and running."
  • Strategic Value: Confirms the deployment was successful and the environment is accessible before the first users encounter a 404.
Scenario: Unauthenticated User Flows
  • Action: Open the mobile app and, without logging in, perform basic actions like searching for an item and clicking a result.
  • Strategic Value: You are mirroring the exact flow of a real user. Catching a bug here prevents a broad user-base impact and allows for a hotfix before peak traffic hits.
Scenario: Cloud Resource Validation
  • Action: After an automated infrastructure push, log into the AWS or Azure portal to check that the newly created resource exists and matches the spec.
  • Strategic Value: Provides a manual safety check on automated infra, ensuring the foundation of the application is stable.
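The post-deployment scenarios above reduce to a handful of cheap pass/fail probes. The sketch below keeps the judgment logic pure so it can be exercised offline; in practice you would feed it the status code and latency measured by whatever HTTP client you already use. The 2-second budget is an illustrative default, not a recommendation.

```python
def smoke_check(status_code: int, latency_ms: float,
                max_latency_ms: float = 2000.0) -> tuple:
    """Classify a single post-deployment probe: is the site 'up and running'?"""
    if not (200 <= status_code < 400):
        # e.g. the 404 that the first real users would otherwise encounter
        return (False, f"unexpected status {status_code}")
    if latency_ms > max_latency_ms:
        return (False, f"slow response: {latency_ms:.0f} ms")
    return (True, "ok")
```

Because the classifier is separate from the network call, the same rule can be reused by a manual script, a post-deploy pipeline step, or a scheduled monitor.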
5. From Tech Debt to Full Regression: The Technical Context
Technical debt is not isolated; it pops up everywhere. Eventually, the "piling on" of maintenance issues reaches a tipping point where developers must speak up and say, "We can't move further until we clean this up." When shortcuts have been taken across the board, a targeted regression isn't enough. You need a Full Regression - a top-to-bottom evaluation of the entire system.
The methodology for a successful Full Regression follows this sequence:
1. Automated Fast Feedback: Run all existing automated tests to gauge the current state of the system and identify immediate failures.
2. Manual Verification: If automation fails, recreate the issue manually. This allows you to capture the "look and feel," console logs, and failure details that automation often omits.
3. Exploratory Testing: Focus on critical functional flows. Forget performance or accessibility for a moment; simply ask: "Does the system work? Can I perform the main functions?" Pay close attention to the UI, API, and Database.
4. UI/UX Across Platforms: Once functionality is confirmed, use tools like BrowserStack or real devices to ensure the system looks good and performs across browsers and responsive modes.
This process is a feedback loop: the logs, screenshots, and videos captured in one regression become the data input for the next, ensuring the team is never starting from zero.
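The four-step sequence above can be sketched as a small orchestrator: automated feedback first, manual recreation only when automation fails, exploratory testing next, and cross-platform UI/UX checks only once the main functions are confirmed. The phase names and the returned log are illustrative, not a prescribed tool; each argument is a callable returning True on success.

```python
def run_full_regression(automated, manual_recreate, exploratory, ui_ux) -> dict:
    """Sketch of the four-step full-regression loop."""
    log = {}
    log["automated"] = automated()            # 1. fast feedback from existing suites
    if not log["automated"]:
        log["manual"] = manual_recreate()     # 2. recreate failures by hand
    log["exploratory"] = exploratory()        # 3. do the main functions work?
    if log["exploratory"]:
        log["ui_ux"] = ui_ux()                # 4. cross-browser only once functional
    return log  # feeds the next regression, so the team never starts from zero
```

Returning the log rather than a bare pass/fail is what closes the loop: the record of one regression becomes the starting point for the next.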
Joe DeFilippo