Technical Report
Behavioral Change Risk: A Formal Framework for Validation Gaps in Evolving Software
What tests miss, why green builds lie, and how to audit change.
Abstract
Modern continuous integration (CI) pipelines rely heavily on automated test suites to validate software changes. A passing test suite is widely interpreted as a signal of correctness. However, a growing body of empirical research demonstrates that test suites are structurally incapable of detecting specific classes of behavioral modification. This gap, wherein a code change alters runtime behavior without triggering a test failure, represents a distinct and under-researched category of software risk.
This article defines Behavioral Change Risk (BCR) and proposes Behavioral Change Risk Validation (BCRV), a complementary methodology focused on the systematic analysis of code change semantics rather than the verification of existing assertions. The article synthesizes recent findings from the software engineering literature, including the diagnostic value of flaky test failures and the limitations of code coverage metrics, to establish the intellectual foundation for BCRV as a necessary practice in the maintenance of evolving software systems.
The primary contributions are: (1) the formalization of Behavioral Change Risk (BCR) as a distinct software risk category, and (2) the introduction of Behavioral Change Risk Validation (BCRV) as a structured, diff-centric methodology for detecting and mitigating BCR before it reaches production.
1. Introduction
The software industry has invested decades in refining the practice of automated testing. Unit tests, integration tests, and end-to-end tests form the backbone of modern CI/CD pipelines. The logic is intuitive: if a code change does not break any existing tests, the change is presumed safe. This binary signal, green build or red build, governs the decision to merge, deploy, and release software to production. This paper refers to the implicit assumption that a passing build implies behavioral correctness as the Green Build Validity Assumption: a heuristic that is operationally useful but theoretically unsound.
Yet production incidents occur. Bugs ship. And often, the post-mortem reveals a disquieting fact: the test suite was green.
This phenomenon is not merely a failure of test coverage. It is a structural limitation in how test suites validate software behavior. Tests are oracles for expected behavior. They assert what the software must do. They are silent on behaviors that were never explicitly specified as assertions.
Recent academic research has begun to quantify this blindness. A 2023 study of the Chromium continuous integration system found that flaky tests, traditionally dismissed as noise, were responsible for detecting over one-third of all regression faults. When these flaky failures were filtered out by automated tooling, 76.2% of real faults were missed.[1] Separately, an empirical study on automated program repair found that patches passing all available tests were frequently semantically incorrect, because the test suite under-specified the correct behavior.[2]
These findings point to a gap in the software validation landscape. This article names that gap, formalizes its definition, and proposes a methodology to address it.
2. The Structural Blindness of Test Suites
To understand why tests miss certain bugs, one must first understand what a test can and cannot verify. A test case consists of three components: an input, an execution, and an oracle, the assertion that evaluates the output. The oracle problem[5] establishes the resulting bound: a test can only detect a fault if that fault produces an observable output that violates a specific, pre-written assertion.
The oracle problem, in brief
A test suite cannot detect what it was never written to expect. Correctness is bounded by the completeness of prior specification.
Consider a developer who removes a null-check guard clause: the line if (user == null) return; is deleted outright. If the test suite never exercises the code path with a null user, all tests continue to pass. The system's behavior has changed: it now throws a NullReferenceException where it previously handled the condition gracefully. Yet no test fails; the change is invisible to the validation mechanism.
This is not a coverage problem. The line of code may have been executed by other tests. It is an oracle problem. The test suite never asserted that the system should handle null input safely; it merely assumed the system would not crash under the tested inputs. This is the simplest possible instance of Behavioral Change Risk.
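The following is a minimal, hypothetical sketch of that scenario; the names (ProfileService, Touch, User) are illustrative and not drawn from any cited study. The only test ever written exercises a non-null user, so it executes the changed line while asserting nothing about the null path.

```csharp
// Hypothetical illustration of a removed guard clause that a green suite cannot see.
using System;

public class User
{
    public DateTime LastSeen;
}

public class ProfileService
{
    public void Touch(User user)
    {
        // Before the change:
        //   if (user == null) return;   // guard clause, since deleted
        user.LastSeen = DateTime.UtcNow; // now throws NullReferenceException for null input
    }
}

public static class ProfileServiceTests
{
    // The only assertion ever written: a non-null user receives a timestamp.
    // It executes the changed line (coverage is unchanged) but says nothing
    // about the null path, so the suite stays green after the guard is removed.
    public static void Main()
    {
        var service = new ProfileService();
        var user = new User();
        service.Touch(user);
        Console.WriteLine(user.LastSeen != default ? "PASS" : "FAIL");
    }
}
```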
2.1 The Limits of Code Coverage
Code coverage is frequently used as a proxy for test suite quality. Coverage is a necessary baseline: code that is never executed cannot be tested at all, and low coverage is a reliable signal of undertested paths. Yet a 2018 study investigating faults missed by high-coverage test suites found that coverage metrics do not correlate with fault detection for several important bug classes.[3] Specifically, missing guard clauses, logic inversions, and missing assignments were systematically missed even when line and branch coverage exceeded 90%.
Coverage measures execution. It does not measure behavioral specification.
2.2 The Limits of Mutation Testing
Mutation testing, which introduces artificial faults to evaluate test suite sensitivity, is the most rigorous proxy for test suite effectiveness currently available. However, recent work has shown that traditional mutation operators do not adequately model real-world faults.[4] Many real faults involve the removal of behavior, a change that is difficult to simulate with standard syntactic mutants.
3. Defining Behavioral Change Risk (BCR)
The gap described above can be formalized. Let a codebase be denoted as C. A change set ΔC represents a modification to that codebase. The observable behavior space of the codebase is B(C). A test suite T validates a subset of that behavior space, denoted V(T, C).
Formal definition: Behavioral Change Risk (BCR)
BCR is defined as a condition where both of the following hold:
1. B(C + ΔC) ≠ B(C): the modification alters observable behavior; and
2. ΔB ∉ V(T, C + ΔC): the altered behavior ΔB (the portion of B(C + ΔC) that differs from B(C)) is not covered by the test suite.
BCR arises when the system's behavior space extends beyond what is validated by tests.
Stated differently: for a given system, behavior B is a function of code C. Any non-zero change ΔC carries the potential for behavioral divergence, that is, the possibility that B(C + ΔC) ≠ B(C).
In plain terms: BCR is the divergence between what the system does and what the tests can see. It exists whenever B(C + ΔC) expands beyond V(T, C + ΔC), the actual behavior space of the modified system outgrowing the validated behavior space of its test suite.
This definition distinguishes BCR from a traditional software defect. A bug is code that violates a stated requirement; it is detectable because a test or specification can be written to catch it. BCR is categorically different: the code may behave exactly as the developer intended, and yet introduce behavioral change that no existing test is positioned to observe. It is not a failure of implementation. It is a validation gap.
Scope boundary
BCR as defined here is bounded to changes in functional, observable behavior detectable in principle by a correctly written test oracle. It explicitly excludes: performance regressions (changes in execution speed, memory, or throughput that do not alter observable outputs); security vulnerabilities (weaknesses requiring threat-model analysis beyond behavioral assertion); and concurrency hazards (race conditions, deadlocks, and non-deterministic interleavings outside the scope of a functional test oracle). BCR addresses the specific gap between what a change does and what the test suite is positioned to see.
4. A Taxonomy of Behavioral Change
Behavioral changes that escape test detection can be categorized. The following taxonomy, derived from recurring patterns documented in production incident analyses and the empirical studies cited in §2, provides a structured lens for analysis.
| Category | Description | Example |
|---|---|---|
| Removed Guard Clause | A defensive condition is deleted, exposing the system to previously handled edge cases. | Deletion of if (input == null) return; |
| Stricter Condition | A logical operator is tightened, excluding previously valid inputs. | Changing age > 18 to age >= 21 |
| Implicit Contract Change | The order of side effects or the timing of state mutations is altered without changing return values. | Reordering cache invalidation and database write |
| Error Handling Alteration | The system's response to exceptional conditions is modified, but the exception path is untested. | Changing catch (Exception) to catch (SpecificException) |
| State Transition Modification | The rules governing state machine transitions are updated, but only the happy path is tested. | Removing a validation check before state advancement |
| Configuration Drift | A change relies on an environment variable or setting that is absent in production. | Adding a feature flag that defaults to false in tests but true in staging |
Each of these changes can pass a thorough test suite while introducing meaningful behavioral risk.
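To make one row concrete, here is a minimal, hypothetical sketch of the Stricter Condition category; the names (EligibilityRules, IsEligible) are illustrative. The single existing test probes an input far from both thresholds, so the tightened comparison ships green.

```csharp
// Hypothetical illustration of a stricter condition surviving the existing suite.
using System;

public static class EligibilityRules
{
    // Was: age > 18. Tightened in the change under review:
    public static bool IsEligible(int age) => age >= 21;
}

public static class EligibilityTests
{
    public static void Main()
    {
        // The only existing assertion. It exercises the changed expression,
        // yet observes nothing about the inputs whose outcome actually
        // changed (19 and 20 are now rejected).
        Console.WriteLine(EligibilityRules.IsEligible(30) ? "PASS" : "FAIL");
    }
}
```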
5. The Diagnostic Value of Flaky Tests
One of the most counter-intuitive findings in recent software engineering research concerns flaky tests, which exhibit non-deterministic behavior, passing and failing without apparent code changes. The conventional engineering response is to quarantine, disable, or automatically retry flaky tests to reduce CI noise.
The Chromium study challenges this practice.[1] The researchers analyzed over 1.5 million test executions across 14,000 commits. They found:
- Flaky tests exposed more than one-third of all regression faults in the Chromium system.
- State-of-the-art flakiness detection tools, while achieving 99.2% precision, caused 76.2% of real regression faults to be missed.
The Chromium CI system is substantially larger than most industrial codebases; the precise fault suppression rate will vary by system. The directional finding, that automated flakiness filtering discards genuine fault signal, is corroborated by independent work on non-deterministic test behavior.[8]
A flaky test is often a test that is sensitive to a behavioral change that deterministic tests ignore. It may fail due to a timing shift, a resource contention issue, or an altered execution order, all of which are genuine behavioral changes. By silencing the flaky test, the CI system silences the signal.
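A minimal, hypothetical sketch of that mechanism (not taken from the Chromium study): a change that moves a side effect onto a background task alters only the timing of an observable state change, and the one test that observes it becomes flaky rather than failing deterministically.

```csharp
// Hypothetical illustration: a test that becomes flaky because a side effect moved.
using System;
using System.Threading;
using System.Threading.Tasks;

public class AuditLog
{
    public volatile bool Written;

    // Before the change the entry was written synchronously:
    //   Written = true;
    // After the change it is written on a background task: return values are
    // identical, only the timing of the side effect has moved.
    public void Record(string message) =>
        Task.Run(() => { Thread.Sleep(5); Written = true; });
}

public static class AuditLogTest
{
    public static void Main()
    {
        var log = new AuditLog();
        log.Record("user deleted");
        Thread.Sleep(10); // race: sometimes long enough for the task, sometimes not

        // This assertion was deterministic before the change. Now it passes or
        // fails depending on scheduling: the flakiness is the behavioral signal.
        Console.WriteLine(log.Written ? "PASS" : "FAIL");
    }
}
```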
This finding has direct implications for the BCR framework. A flaky test failure is not noise to be suppressed; it is an early indicator of unvalidated behavioral change. Behavioral Change Risk Validation incorporates this insight by treating flaky failures as diagnostic artifacts rather than engineering nuisances.
6. Preliminary Corpus Analysis
The theoretical case for BCR rests on structural arguments about what tests can and cannot detect. A preliminary empirical signal supports the framework's practical relevance.
Corpus construction
A corpus of 598 pull requests from 57 open-source .NET repositories was assembled using GauntletCI's corpus pipeline. Repositories were identified via GitHub code search across the .NET ecosystem; the full set includes Polly, Dapper, Newtonsoft.Json, Avalonia, PowerShell, dotnet/aspnetcore, dotnet/efcore, dotnet/maui, dotnet/roslyn, and dotnet/runtime, among others. Pull requests were selected with a bias toward changes involving substantive review activity, which introduces a selection effect favoring higher-complexity changes over routine maintenance commits. Each pull request was evaluated against the behavioral change taxonomy described in §4 using the automated rule engine. The corpus metadata (repository names, pull request numbers, size classification, test-change presence, and per-finding counts) is published at github.com/EricCogen/GauntletCI/blob/main/data/corpus-fixtures.csv for independent review and replication.
Test file classification
The field has_tests_changed was determined by automated path-pattern classification: a pull request was marked as including test changes if any modified file matched test naming conventions, including files with Tests.cs, .test.cs, or .tests.cs suffixes, or files residing in /test/ or /tests/ path segments. This is a structural proxy, not a measure of whether behavioral assertions were added or updated.
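A minimal sketch of that proxy, assuming the patterns named above are the complete rule set; the actual corpus pipeline may apply additional conventions.

```csharp
// Sketch of the has_tests_changed path-pattern proxy described in the text.
using System;
using System.Linq;

public static class TestPathClassifier
{
    public static bool IsTestFile(string path)
    {
        var p = path.Replace('\\', '/').ToLowerInvariant();

        // Suffix conventions: FooTests.cs, Foo.test.cs, Foo.tests.cs
        bool suffixMatch = p.EndsWith("tests.cs") || p.EndsWith(".test.cs");

        // Path-segment conventions: /test/ or /tests/ anywhere in the path
        bool segmentMatch = p.Split('/').Any(s => s == "test" || s == "tests");

        return suffixMatch || segmentMatch;
    }

    public static bool HasTestsChanged(string[] changedFiles) =>
        changedFiles.Any(IsTestFile);

    public static void Main()
    {
        Console.WriteLine(IsTestFile("src/Polly/RetryPolicy.cs"));             // False
        Console.WriteLine(IsTestFile("test/Polly.Specs/RetryPolicyTests.cs")); // True
    }
}
```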
Confidence scoring
GauntletCI assigns each finding one of three internal confidence tiers: 0.25 (low), 0.5 (medium), or 1.0 (high). The high-confidence tier (1.0) reflects the strongest structural pattern match; it does not represent external validation or human review. Lower tiers were excluded from the primary counts below to reduce noise from ambiguous matches.
Two findings emerge from this analysis:
- 34.6% of pull requests (207 of 598) contained at least one high-confidence behavioral risk indicator, spanning 11 distinct rule categories.
- 71% of pull requests submitted without test file modifications (118 of 166) contained at least one behavioral risk indicator. When test authorship effort is absent, risk patterns are not merely possible; they are prevalent.
Methodological limitations
The current corpus carries no human-labeled ground truth. Precision and recall of these rates are unknown: findings represent automated pattern matches, and false positives are expected. The corpus was not a random sample of production software; it was drawn from well-maintained open-source projects with active code review histories, which may exhibit different behavioral change patterns than closed-source enterprise or legacy codebases. Formal empirical validation (including human labeling of findings, precision and recall measurement, and controlled studies across broader repository populations) is identified as future work.
The preliminary signal is nonetheless consistent with the BCR framework's central prediction: behavioral risk patterns occur in a substantial fraction of real-world pull requests, and their incidence is elevated precisely in the pull requests that arrive without test coverage updates: the gap that BCRV is designed to address.
7. Introducing Behavioral Change Risk Validation (BCRV)
The Chromium data makes the case directly: a CI pipeline achieving 99.2% precision in flakiness detection still caused 76.2% of real regression faults to be missed.[1] A green build, in that system, was an unreliable signal. If state-of-the-art tooling on one of the world's largest CI systems cannot be trusted to surface behavioral regression, the implication is clear: the diff must be audited independently of what the test suite reports.
7.1 Requirements for Addressing BCR
Before introducing the methodology, the requirements it must satisfy are worth stating explicitly, since they emerge from the problem, not from the solution.
Diff-scoped
Analysis must be anchored to the change, not the full codebase. BCR is introduced by a specific modification; the audit must match that scope to remain tractable.
Semantics-aware
The methodology must reason about behavioral meaning (what the code does and what it no longer does) rather than structural properties such as line count or test coverage percentage alone.
Validation-aware
Findings must be interpreted in relation to the existing test suite. A behavioral change that is fully covered by updated assertions carries low risk; a behavioral change that is unobserved by any assertion represents an unresolved gap.
Low integration cost
Pre-merge validation that imposes significant workflow friction will be bypassed. An effective BCR methodology must integrate into existing review and CI practices without requiring new infrastructure or cultural upheaval.
These requirements do not emerge from any particular tool or workflow preference. They emerge from the structure of the problem itself. A methodology that satisfies all four addresses BCR at its root.
Behavioral Change Risk Validation (BCRV) is a methodology for systematically evaluating the behavioral implications of a code change before or during the review process. It shifts the unit of analysis from test results to code semantics.
The core principle of BCRV
A code change must be audited for behavioral risk independently of test suite output.
This is not a replacement for testing. It is an augmentation. BCRV acknowledges that tests are a partial specification and that the diff contains information about behavioral intent that tests cannot fully capture.
BCRV is also the economically rational choice. Auditing a diff at the moment of authorship requires evaluating tens or hundreds of changed lines in context. The alternative, discovering the behavioral gap in production, requires reproducing the fault, tracing it back through deployment history, and remediating under pressure. The engineering tax of a pre-commit audit is a fraction of the cost of a post-incident post-mortem. Shift-left is not a slogan; it is arithmetic.
It is worth stating explicitly: BCRV is complementary to Test-Driven Development, not a replacement for it. TDD builds the behavioral specification: it encodes what the system must do before the code is written. BCRV audits the evolution of that specification: it asks whether a subsequent change has altered behavior in ways the original specification no longer covers. The two practices address different moments in the software lifecycle: TDD governs creation; BCRV governs change.
The term "audit-driven" has appeared informally in prior software engineering discourse (e.g., "Audit Driven Design" in 2007, "Audit-Driven SRE" in 2026), and it is worth distinguishing those uses from the methodology proposed here. While BCRV does not carry the word "audit" in its name, it shares a commitment to structured examination. The prior uses are retrospective and organizationally oriented. BCRV is a pre-merge validation discipline applied to the code diff, concerned with behavioral risk to a running system rather than with organizational visibility or post-incident remediation.
7.2 The BCRV Workflow
BCRV can be integrated into existing development practices with minimal disruption. The workflow consists of three stages:
Stage 1: Diff Analysis
The developer or reviewer examines the change set with a specific focus on removed or altered logic, not just added code. Deletions of conditional branches, changes to loop boundaries, and modifications to error handling are flagged for deeper scrutiny.
Stage 2: Behavioral Impact Assessment
For each flagged change, the reviewer asks: Does this change alter the system's response to a specific input or state? Is that input or state represented in the existing test suite? If not, is the new behavior intentional and documented?
Stage 3: Risk Mitigation
If a behavioral change is identified as unvalidated, one of three actions is taken: (1) Add an assertion to capture the new behavior. (2) Document the accepted risk: record the change as intentional with explicit justification. (3) Revert or redesign the change to eliminate the unvalidated behavioral shift.
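As an illustration of how Stage 1 can be partially mechanized, the following is a minimal sketch, not the GauntletCI rule engine: it scans the removed lines of a unified diff for two taxonomy patterns (deleted null guards and deleted or altered catch blocks) and leaves the Stage 2 judgment of intent and reachability to the reviewer.

```csharp
// Sketch of diff-scoped flagging for Stage 1; patterns and names are illustrative.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DiffAnalysis
{
    static readonly Regex RemovedNullGuard = new Regex(
        @"^-\s*if\s*\(.*(==\s*null|is\s+null).*\)\s*(return|throw)", RegexOptions.Compiled);
    static readonly Regex RemovedCatch = new Regex(
        @"^-\s*catch\b", RegexOptions.Compiled);

    public static IEnumerable<string> Flag(IEnumerable<string> unifiedDiffLines)
    {
        foreach (var line in unifiedDiffLines)
        {
            if (line.StartsWith("---")) continue; // old-file header, not a removal
            if (RemovedNullGuard.IsMatch(line)) yield return "Removed guard clause: " + line;
            else if (RemovedCatch.IsMatch(line)) yield return "Removed/altered catch block: " + line;
        }
    }

    public static void Main()
    {
        var diff = new[]
        {
            "--- a/ProfileService.cs",
            "+++ b/ProfileService.cs",
            "-        if (user == null) return;",
            "+        user.LastSeen = DateTime.UtcNow;",
        };
        foreach (var finding in Flag(diff))
            Console.WriteLine(finding); // surfaces the deletion for Stage 2 review
    }
}
```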
7.3 Tooling Considerations
While BCRV is a human-centric methodology, tooling can assist in flagging high-risk change patterns. A prototype implementation, GauntletCI, was developed to explore the feasibility of automated BCR detection. The prototype analyzes code diffs to identify structural patterns associated with BCR categories.
Existing static analysis tools such as Semgrep, CodeQL, and SonarQube perform pattern-based analysis across the full codebase and offer partial overlap with automated BCR detection. The distinguishing characteristic of a BCR-oriented tool is that analysis is scoped to the diff rather than the full repository, and the integration point is pre-commit rather than post-merge, ensuring findings are surfaced at the moment of lowest remediation cost.
This reflects a principled division of labor. Automated tooling excels at pattern recognition: it can reliably flag that a guard clause was removed or that an exception handler was narrowed. What it cannot determine is whether that removal was intentional, whether the edge case is reachable in production, or whether the behavioral shift is acceptable given the system's current requirements. That semantic judgment belongs to the human auditor. GauntletCI surfaces the what. The developer is responsible for the why.
8. Related Work
The limitations of test suites have been documented extensively. Inozemtseva and Holmes[6] demonstrated that coverage metrics are a poor predictor of test suite effectiveness. Just et al.[7] showed that mutation testing, while valuable, does not fully capture real-world fault characteristics. The oracle problem was formally surveyed by Barr et al.,[5] establishing the theoretical bound on test-based validation. The core bound, that a test can only detect faults observable through pre-written assertions, has not been substantially revised; subsequent work has focused on automated oracle generation as a mitigation rather than a challenge to the bound itself.[9]
8.1 Change-Aware Testing and Test Gap Analysis
The relationship between code change and test coverage has been studied as a distinct problem from global coverage metrics. Test Gap Analysis (TGA) examines the alignment between code modifications and the tests that exercise those modifications, independently of aggregate coverage measurements. An industrial study by Eder et al. found that a substantial proportion of modified code paths ship without corresponding test updates, and that error probability in untested changed code is significantly higher than in changed code accompanied by test modifications.[10] Contemporary work has extended TGA to risk-based prioritization, enabling teams to triage uncovered changes by defect likelihood rather than treating all test gaps equally.[11]
Change-aware testing, which restricts testing effort to the scope of a specific change rather than the full system, has a parallel history in unit testing research. Wloka et al. introduced JUnitMX, a change-aware unit testing tool that uses a change model to guide the authoring of new tests in direct response to specific code modifications.[12] The tool operationalizes the principle that test authoring effort should be directed by what changed, not by what exists, an orientation that anticipates the diff-centric analysis proposed by BCRV.
Behavioral Regression Testing (BERT), introduced by Orso, Xie, and Jin, takes a dynamic approach: it executes an existing test suite against both the pre-change and post-change versions of a system and flags behavioral divergences.[13] BERT detects differences observable through executed assertions, but is structurally bounded by the oracle problem[5]: if no test exercises a changed code path, the divergence remains invisible. This limitation motivates the pre-merge, diff-side analysis that BCRV proposes.
More recently, LLM-based approaches have extended change-aware reasoning to GUI testing. RippleGUItester applies change-impact analysis to direct LLM-driven GUI exploration toward regions of an application affected by a specific commit.[14] An evaluation across four open-source applications identified 26 previously unknown defects, demonstrating that change-scoped exploration substantially outperforms undirected test generation in surface area relevance.
This article contributes to that literature by defining BCR as a distinct risk category and proposing BCRV as a structured response. Unlike prior work focused on test generation or oracle improvement, BCRV addresses the diff-side of the validation equation: the change itself.
9. Limitations and Future Work
Behavioral Change Risk Validation is a methodology, not a formal verification technique. It relies on human judgment and does not guarantee the absence of behavioral risk. The taxonomy presented is descriptive, not exhaustive. Additional categories of behavioral change may emerge as the practice is applied across diverse codebases.
Two additional limitations warrant explicit acknowledgment. First, BCRV depends on the quality of the human auditor. A reviewer who lacks domain knowledge of the changed system may fail to recognize the behavioral significance of a flagged pattern, a risk that increases as codebases grow and team ownership becomes diffuse. Second, audit fatigue is a real operational concern. If every diff surfaces a large number of flags, reviewers will begin to dismiss findings as noise, recreating the same suppression problem that motivates the methodology. Effective BCRV practice requires tuning signal quality: surfacing fewer, higher-confidence flags rather than exhaustive pattern lists.
Future work
- Empirical measurement of BCR prevalence in industrial codebases.
- Development of lightweight static analysis rules to flag high-BCR change patterns.
- Integration of flaky test signal analysis into BCRV workflows.
10. Threats to Validity
Internal Validity
The primary threat to internal validity is interpretation bias in the BCR taxonomy. The six change categories in §4 were derived from pattern analysis and practitioner judgment rather than a systematic fault taxonomy study. Categories may overlap, under-specify, or conflate distinct phenomena. Additionally, the corpus findings reported in §6 are produced by an automated rule engine with no human-labeled ground truth: the correlation between GauntletCI's confidence scores and actual behavioral divergence has not been empirically established. The selection bias in the corpus, favoring pull requests with substantive review activity, may inflate the observed BCR rate relative to a random sample of commits.
External Validity
The corpus analysis is restricted to open-source C# repositories. BCR patterns may manifest differently in dynamically typed languages, functional codebases, or systems with non-standard control flow idioms. The findings may not generalize to closed-source enterprise software, where codebase age, ownership diffusion, and testing culture differ substantially from well-maintained open-source projects. The BCRV workflow itself has not been evaluated in a controlled industrial study; its effectiveness under real team conditions, varying auditor expertise, and large-scale diff volumes remains to be demonstrated empirically.
Construct Validity
Two construct validity threats warrant acknowledgment. First, behavioral change is operationalized as pattern matches against known BCR indicators, a structural proxy for the formal definition in §3. A pattern match is not a proof that ΔB ∉ V(T, C + ΔC); it is a heuristic signal that the diff contains a change class historically associated with validation gaps. Second, test coverage of behavioral changes is approximated by the presence of modified test files (has_tests_changed), not by assertion-level analysis of whether the specific behavioral delta is newly covered. A pull request may include test file changes entirely unrelated to the flagged behavioral pattern, overstating coverage.
11. Conclusion
The green checkmark of a passing CI build has become the primary expression of the Green Build Validity Assumption in practice, a symbol of software quality that obscures a structural blind spot. Tests validate what was written; they cannot validate what was removed. They assert expected outcomes; they are silent on the consequences of altered behavior.
Behavioral Change Risk (BCR) is the formal name for this gap. It is the risk that a code change introduces new behavior that no test is positioned to detect. Empirical evidence from large-scale CI systems and automated program repair research confirms that this risk is both real and underappreciated.
Behavioral Change Risk Validation (BCRV) offers a methodology for addressing BCR. By shifting attention from test results to change semantics, BCRV provides a framework for identifying and mitigating the behavioral risks that slip through conventional validation pipelines.
The software industry has spent decades learning to test what code does. It is time to develop the discipline to audit what code changes.
References
- [1] Haben, G., Habchi, S., Papadakis, M., Cordy, M., & Le Traon, Y. (2023). The Importance of Discerning Flaky from Fault-triggering Test Failures: A Case Study on the Chromium CI. arXiv:2302.10594. https://arxiv.org/abs/2302.10594
- [2] Zemín, L., Godio, A., Cornejo, C., Degiovanni, R., Gutiérrez Brida, S., Regis, G., Aguirre, N., & Frias, M.F. (2025). An Empirical Study on the Suitability of Test-based Patch Acceptance Criteria. ACM Transactions on Software Engineering and Methodology, 34(3), 57:1-57:20. DOI: 10.1145/3702971.
- [3] Schwartz, A., Puckett, D., Meng, Y., & Gay, G. (2018). Investigating Faults Missed by Test Suites Achieving High Code Coverage. Journal of Systems and Software, 144, 106-120. DOI: 10.1016/j.jss.2018.06.024.
- [4] Gay, G. & Salahirad, A. (2023). How Closely are Common Mutation Operators Coupled to Real Faults? IEEE ICST, pp. 129-140. DOI: 10.1109/ICST57152.2023.00021.
- [5] Barr, E. T., et al. (2015). The Oracle Problem in Software Testing: A Survey. IEEE Transactions on Software Engineering.
- [6] Inozemtseva, L., & Holmes, R. (2014). Coverage is Not Strongly Correlated with Test Suite Effectiveness. ICSE. https://dl.acm.org/doi/10.1145/2568225.2568271
- [7] Just, R., et al. (2014). Are Mutants a Valid Substitute for Real Faults in Software Testing? FSE.
- [8] Luo, Q., et al. (2014). An Empirical Analysis of Flaky Tests. FSE.
- [9] Fraser, G. & Arcuri, A. (2013). Whole Test Suite Generation. IEEE Transactions on Software Engineering, 39(2), 276-291. DOI: 10.1109/TSE.2012.14.
- [10] Eder, S., Hauptmann, B., Junker, M., Jürgens, E., Vaas, R., & Prommer, J. (2013). Did we test our changes? Assessing alignment between tests and development in practice. AST@ICSE 2013, pp. 107-110. DOI: 10.1109/IWAST.2013.6595800.
- [11] Haas, D., Sailer, L., Joblin, M., Juergens, E., & Apel, S. (2025). Prioritizing Test Gaps by Risk in Industrial Practice. IEEE Transactions on Software Engineering. DOI: 10.1109/TSE.2025.3556248.
- [12] Wloka, J., Ryder, B. G., & Tip, F. (2009). JUnitMX: A change-aware unit testing tool. ICSE 2009, pp. 567-570. DOI: 10.1109/ICSE.2009.5070557.
- [13] Jin, W., Orso, A., & Xie, T. (2010). Automated Behavioral Regression Testing. ICST 2010, pp. 137-146. DOI: 10.1109/ICST.2010.64.
- [14] Su, Y., Pradel, M., & Chen, C. (2026). RippleGUItester: Change-Aware Exploratory Testing. arXiv:2603.03121. https://arxiv.org/abs/2603.03121
Real-world examples from .NET OSS
The BCR categories above are not theoretical. These case studies show GauntletCI detecting the exact patterns in real pull requests to widely-used .NET libraries.
Swallowed Exception in StackExchange.Redis
GCI0007 catches a bare catch block that silently drops all exceptions in the message dispatch loop.
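A hypothetical reconstruction of the pattern class GCI0007 targets, not the actual StackExchange.Redis source: a bare catch that silently discards every exception raised inside a dispatch loop.

```csharp
// Illustrative swallowed-exception pattern; names and structure are hypothetical.
using System;
using System.Collections.Generic;

public static class Dispatcher
{
    public static void DispatchAll(IEnumerable<Action> handlers)
    {
        foreach (var handler in handlers)
        {
            try
            {
                handler();
            }
            catch
            {
                // Every failure, including ones that should abort the loop or at
                // least be logged, vanishes here. No test fails, because no test
                // asserts on the error path: the swallowed-exception BCR pattern.
            }
        }
    }
}
```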
Thread.Sleep in Async Context - NUnit
GCI0016 catches Thread.Sleep blocking the thread pool in an async context inside the NUnit test framework itself.
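Again as a hypothetical reconstruction of the pattern class rather than the NUnit source: Thread.Sleep inside an async method blocks a thread-pool thread for the full delay instead of yielding, a behavioral change that no functional assertion observes.

```csharp
// Illustrative blocking-sleep-in-async pattern; names are hypothetical.
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PollingClient
{
    public static async Task<bool> WaitForReadyAsync(Func<bool> isReady)
    {
        await Task.Yield();            // continue on a thread-pool thread
        for (var i = 0; i < 10; i++)
        {
            if (isReady()) return true;
            Thread.Sleep(100);         // flagged: blocks the pool thread
            // await Task.Delay(100);  // the non-blocking equivalent
        }
        return false;
    }
}
```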
Eric Cogen -- Founder, GauntletCI
Twenty years in .NET production. Most of those years, the bugs that hurt me were not the ones tests caught. They were the assumptions I did not know I was making: a removed guard clause, a renamed method that still did the old thing, a catch {} that turned a page into a silent dashboard lie. GauntletCI is the checklist I wish I had run before every commit. It runs the rules I learned the hard way, so you do not have to.
