Counting matters: this page separates raw findings from high-confidence findings. Raw findings preserve every affected file, framework surface, and rule hit. High-confidence findings are the subset whose confidence value in the corpus database is 0.70 or higher. The former measures blast radius; the latter is the cleaner signal for prioritization.
The uncomfortable pattern: contract risk dominates
The largest finding groups are not exotic security bugs or dramatic runtime crashes. They are ordinary-looking API and contract changes: visibility changes, signature changes, nullable edge cases, exception paths, and async behavior. Those are exactly the changes that can look reasonable in review while still changing what downstream callers experience.
That is the core reason GauntletCI treats pull request risk as a diff problem instead of a whole-codebase cleanliness score. The question is not "is this repository good?" The question is "what did this PR make newly dangerous?"
Top rule families in the corpus
| Rule | Signal family | Raw findings |
|---|---|---|
| GCI0004 | [Obsolete] attribute transitions on public APIs | 59,965 |
| GCI0003 | Method signature and contract changes | 39,628 |
| GCI0006 | Null and edge-case handling changes | 10,978 |
| GCI0015 | Data integrity and silent discard risks | 10,389 |
| GCI0016 | Async and deadlock candidates | 4,040 |
| GCI0024 | Resource lifecycle and undisposed disposables | 3,435 |
| GCI0010 | Hardcoded secrets, URLs, and connection strings | 3,225 |
| GCI0001 | Mixed-scope or diff-integrity risk | 2,674 |
| GCI0036 | Pure context mutation and side effects in getters | 2,524 |
| GCI0047 | Additional behavioral-change signals | 1,450 |
GCI0004 and GCI0003 together account for 99,593 raw findings. That does not mean every finding is a defect. It means API shape and contract changes are the dominant risk surface in this corpus.
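The arithmetic behind that claim is easy to check. The sketch below re-adds the raw counts from the table and compares them with the 147,958 triggered findings reported in the methodology section; the variable names are illustrative only and are not part of GauntletCI.

```python
# Raw finding counts copied from the rule table above.
raw_findings = {
    "GCI0004": 59_965,
    "GCI0003": 39_628,
    "GCI0006": 10_978,
    "GCI0015": 10_389,
    "GCI0016": 4_040,
    "GCI0024": 3_435,
    "GCI0010": 3_225,
    "GCI0001": 2_674,
    "GCI0036": 2_524,
    "GCI0047": 1_450,
}

# Total triggered findings across all 28 rule IDs, as stated in the methodology section.
CORPUS_RAW_TOTAL = 147_958

top_two = raw_findings["GCI0004"] + raw_findings["GCI0003"]
top_two_share = top_two / CORPUS_RAW_TOTAL

top_ten = sum(raw_findings.values())
top_ten_share = top_ten / CORPUS_RAW_TOTAL

print(f"GCI0004 + GCI0003: {top_two:,} ({top_two_share:.1%} of raw findings)")
print(f"Ten listed rules:  {top_ten:,} ({top_ten_share:.1%} of raw findings)")
```

On these figures the two API-shape rules account for roughly two thirds of all raw findings, and the ten listed rules for roughly 93%.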
Repository distribution, with the outlier left visible
The corpus is intentionally not flattened to hide uncomfortable skew. Large SDK and framework PRs produce more signals because they touch more published surface area. That skew is a feature of the data, not a reason to erase it.
| Repository | Corpus PRs | Raw findings | High-confidence |
|---|---|---|---|
| Azure/azure-sdk-for-net | 18 | 42,919 | 16,875 |
| JamesNK/Newtonsoft.Json | 10 | 12,086 | 1,034 |
| googleapis/google-api-dotnet-client | 17 | 12,009 | 3,236 |
| DapperLib/Dapper | 7 | 9,696 | 107 |
| StackExchange/StackExchange.Redis | 10 | 5,568 | 825 |
| dotnet/reactive | 12 | 5,546 | 217 |
| apache/logging-log4net | 9 | 4,716 | 1,359 |
| dotnet/orleans | 14 | 4,188 | 681 |
| grpc/grpc-dotnet | 12 | 3,935 | 243 |
| DevToys-app/DevToys | 12 | 3,787 | 1,011 |
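One way to read the table is to compare how much of each repository's raw count survives the high-confidence threshold. The following is a minimal sketch using only the numbers above; the helper name `high_confidence_ratio` is hypothetical.

```python
# (repository, corpus PRs, raw findings, high-confidence findings), copied from the table above.
rows = [
    ("Azure/azure-sdk-for-net", 18, 42_919, 16_875),
    ("JamesNK/Newtonsoft.Json", 10, 12_086, 1_034),
    ("googleapis/google-api-dotnet-client", 17, 12_009, 3_236),
    ("DapperLib/Dapper", 7, 9_696, 107),
    ("StackExchange/StackExchange.Redis", 10, 5_568, 825),
    ("dotnet/reactive", 12, 5_546, 217),
    ("apache/logging-log4net", 9, 4_716, 1_359),
    ("dotnet/orleans", 14, 4_188, 681),
    ("grpc/grpc-dotnet", 12, 3_935, 243),
    ("DevToys-app/DevToys", 12, 3_787, 1_011),
]


def high_confidence_ratio(raw: int, high: int) -> float:
    """Fraction of raw findings that meet the high-confidence threshold."""
    return high / raw


# Rank repositories by the share of raw findings that are high-confidence.
ranked = sorted(rows, key=lambda r: high_confidence_ratio(r[2], r[3]), reverse=True)

for repo, prs, raw, high in ranked:
    ratio = high_confidence_ratio(raw, high)
    print(f"{repo:<40} {ratio:6.1%}  ({raw / prs:,.0f} raw findings per PR)")
```

The ratio runs from about 1% for DapperLib/Dapper to about 39% for Azure/azure-sdk-for-net, which is one reason raw and high-confidence counts are reported side by side rather than collapsed into a single number.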
Outlier disclosure: Azure SDK PR #57223
Azure SDK for .NET PR #57223 contributes 40,155 raw findings and 16,611 high-confidence findings. That is 27.1% of the corpus raw total and 46.3% of the high-confidence total. Any honest reading of the corpus has to say that out loud.
The right conclusion is not "Azure SDK is bad." The useful conclusion is that changes to a published, multiframework API surface carry a different risk profile from small application PRs. For libraries, one signature or visibility change can echo through multiple target frameworks and generated surfaces.
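To make the outlier's weight concrete, the sketch below separates PR #57223 from the rest of the Azure SDK rows and from the corpus raw total. All inputs come from this page; the variable names are illustrative.

```python
# Figures stated on this page.
CORPUS_RAW_TOTAL = 147_958      # triggered findings in the analyzed findings table
AZURE_PRS = 18                  # Azure/azure-sdk-for-net PRs in the corpus
AZURE_RAW = 42_919              # raw findings across those PRs
AZURE_HIGH = 16_875             # high-confidence findings across those PRs
OUTLIER_RAW = 40_155            # raw findings from PR #57223 alone
OUTLIER_HIGH = 16_611           # high-confidence findings from PR #57223 alone

# Share of the whole corpus contributed by the single outlier PR.
outlier_raw_share = OUTLIER_RAW / CORPUS_RAW_TOTAL

# What remains once the outlier is set aside.
rest_of_azure_prs = AZURE_PRS - 1
rest_of_azure_raw = AZURE_RAW - OUTLIER_RAW
rest_of_azure_high = AZURE_HIGH - OUTLIER_HIGH
corpus_raw_without_outlier = CORPUS_RAW_TOTAL - OUTLIER_RAW

print(f"Outlier share of raw findings: {outlier_raw_share:.1%}")
print(f"Other {rest_of_azure_prs} Azure SDK PRs: "
      f"{rest_of_azure_raw:,} raw, {rest_of_azure_high:,} high-confidence")
print(f"Corpus raw total without the outlier: {corpus_raw_without_outlier:,}")
```

Set aside the one PR and the remaining 17 Azure SDK PRs contribute 2,764 raw findings and 264 high-confidence findings, which puts that repository much closer to the rest of the table.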
Read the Azure SDK PR #57223 analysis →
Test changes are not a reliable proxy for risk
The corpus contains 178 PRs with no test-file changes recorded. Of those, 131 had at least one Behavioral Change Risk finding, and 46 had at least one high-confidence finding. That does not prove the PRs were wrong. It shows that "tests changed" and "risk was introduced" are different signals.
A reviewer needs both. A test diff shows what behavior the author chose to prove. A risk diff shows what behavior the author may have changed without making that choice explicit.
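Expressed as rates, the counts for PRs without test-file changes look like this. This is a quick check on the numbers in this section, nothing more.

```python
# Counts stated in this section.
PRS_WITHOUT_TEST_CHANGES = 178
WITH_ANY_FINDING = 131          # at least one Behavioral Change Risk finding
WITH_HIGH_CONFIDENCE = 46       # at least one high-confidence finding

any_finding_rate = WITH_ANY_FINDING / PRS_WITHOUT_TEST_CHANGES
high_confidence_rate = WITH_HIGH_CONFIDENCE / PRS_WITHOUT_TEST_CHANGES

print(f"No test changes, at least one finding:         {any_finding_rate:.1%}")
print(f"No test changes, at least one high-confidence: {high_confidence_rate:.1%}")
```

Roughly three in four of these PRs triggered at least one finding, and roughly one in four triggered a high-confidence one.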
What this changes for PR review
Review the delta, not the vibe
A polished PR can still alter contracts, exception paths, and runtime assumptions. Risk analysis gives reviewers a concrete checklist tied to the diff.
Treat API shape as production behavior
Visibility and signature changes dominated the corpus. For library and platform teams, API shape is not metadata; it is behavior customers compile against.
Use findings as evidence, not theater
A finding is not a verdict. It is a deterministic pointer to changed behavior that deserves a human decision before merge.
Toward a finding ledger
The next credibility step is not a louder claim about what GauntletCI can find. It is a clearer public surface for how findings move from candidate signal to reviewer decision. Anthropic's Mythos Preview dashboard is useful here as a workflow reference: candidates are triaged, validated, disclosed, patched, and tied to ledger entries.
A GauntletCI finding ledger would be narrower and product-specific: PR, rule ID, changed file, evidence snippet, confidence band, reviewer verdict, disposition, and follow-up outcome. That would make the corpus easier to audit and make it obvious that a finding is evidence for human review, not an automatic defect accusation.
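As a rough illustration of what one ledger row could hold, here is a minimal sketch built from the fields listed above. Nothing here is an existing GauntletCI schema: the class name, the `band_for` helper, the band labels, and the example values are all hypothetical. Only the 0.70 threshold comes from this page.

```python
from dataclasses import dataclass
from typing import Optional

HIGH_CONFIDENCE_THRESHOLD = 0.70  # threshold used throughout this page


def band_for(score: float) -> str:
    """Map a confidence score to a coarse band (labels are hypothetical)."""
    return "high" if score >= HIGH_CONFIDENCE_THRESHOLD else "below-threshold"


@dataclass
class LedgerEntry:
    """One finding, from candidate signal to reviewer decision (hypothetical shape)."""
    pr: str
    rule_id: str
    changed_file: str
    evidence_snippet: str
    confidence_band: str
    reviewer_verdict: Optional[str] = None    # unset until a human has decided
    disposition: Optional[str] = None         # what was done about it
    follow_up_outcome: Optional[str] = None   # what happened after merge

    def is_decided(self) -> bool:
        """A finding stays a candidate until a reviewer records a verdict."""
        return self.reviewer_verdict is not None


# Example values are invented for illustration only.
entry = LedgerEntry(
    pr="example-org/example-repo#123",
    rule_id="GCI0003",
    changed_file="src/Client.cs",
    evidence_snippet="public Task SendAsync(Request r, CancellationToken ct)",
    confidence_band=band_for(0.82),
)
print(entry.is_decided())  # False: still evidence awaiting a human decision

entry.reviewer_verdict = "intentional change"
entry.disposition = "documented in release notes"
print(entry.is_decided())  # True
```

Keeping the verdict fields empty by default mirrors the point of the ledger: a row starts life as evidence, and only a reviewer turns it into a decision.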
Methodology and limitations
The current local corpus database contains 610 public, already-merged C# pull requests across 61 repositories. The analyzed findings table contains 147,958 triggered findings across 529 PRs and 28 rule IDs. A high-confidence finding is a triggered finding with an `actual_confidence` value of at least 0.70.
The report is not a benchmark of repository quality, maintainer skill, or defect rate. It is a field report about where deterministic change-risk rules fire when applied to real merged diffs. Some findings represent intentional changes. Some represent generated or multiframework duplication. That is why this page reports raw counts, high-confidence counts, and the largest outlier separately.
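The raw versus high-confidence split described above reduces to a single comparison against `actual_confidence`. The sketch below shows that comparison on a few invented rows; the row layout and the PR labels are assumptions for illustration and do not reflect the real corpus schema.

```python
HIGH_CONFIDENCE_THRESHOLD = 0.70  # "at least 0.70", as defined above

# Invented example rows; only the `actual_confidence` field name comes from this page.
findings = [
    {"pr": "repo-a#1", "rule_id": "GCI0003", "actual_confidence": 0.91},
    {"pr": "repo-a#1", "rule_id": "GCI0004", "actual_confidence": 0.70},
    {"pr": "repo-b#7", "rule_id": "GCI0006", "actual_confidence": 0.55},
    {"pr": "repo-c#2", "rule_id": "GCI0016", "actual_confidence": 0.69},
]


def split_counts(rows, threshold=HIGH_CONFIDENCE_THRESHOLD):
    """Return (raw count, high-confidence count) for a list of finding rows."""
    raw = len(rows)
    high = sum(1 for row in rows if row["actual_confidence"] >= threshold)
    return raw, high


raw_count, high_count = split_counts(findings)
print(f"raw={raw_count}, high-confidence={high_count}")

# PRs with at least one high-confidence finding, counted once each.
prs_with_high = {row["pr"] for row in findings
                 if row["actual_confidence"] >= HIGH_CONFIDENCE_THRESHOLD}
print(f"PRs with a high-confidence finding: {len(prs_with_high)}")
```

Note that the comparison is inclusive, so a finding at exactly 0.70 counts as high-confidence while one at 0.69 does not.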
Sources
- GauntletCI corpus fixture export — Public fixture-level corpus export used to cross-check PR, repository, test-change, and finding totals.
- Corpus audit runner — Audit script that hydrates corpus fixtures and runs GauntletCI analysis against the local corpus database.
- Azure SDK for .NET PR #57223 — The largest outlier in the current corpus; it accounts for 40,155 raw findings and 16,611 high-confidence findings.
- Azure SDK PR #57223 deep dive — Internal case study explaining why multiframework API-surface changes can produce unusually large raw finding counts.
- Behavioral Change Risk framework — Internal methodology article defining the change-risk categories used throughout the corpus analysis.
- Anthropic coordinated vulnerability disclosure dashboard — Public example of a candidate finding, triage, disclosure, patch, advisory, and ledger workflow for AI-assisted vulnerability research.
