Inside the machine
The Silver Benchmark
618 real .NET OSS pull requests. 30 rules. Every number earned through iteration.
This page documents the precision and recall of each GauntletCI detection rule, measured against a labeled corpus of real open-source pull requests. The numbers reflect what it took to get here: labeler rewrites, rule narrowing, skip-guard additions, and calibration passes that surfaced misalignments between what a rule detects and what its labeler measured.
The Corpus
The Silver corpus contains 618 fixtures drawn from pull requests across the most-downloaded .NET open-source projects on GitHub: dotnet/aspnetcore, dotnet/runtime, dotnet/efcore, StackExchange.Redis, Newtonsoft.Json, NUnit, xUnit, MassTransit, gRPC-dotnet, Jellyfin, ImageSharp, and others. Each fixture is a raw diff from a real PR, paired with per-rule labels indicating whether a finding was expected.
The corpus was built to cover diverse change patterns - refactors, feature additions, bug fixes, dependency updates - across repositories with very different coding styles. A benchmark built from a single project would overfit to one team's conventions. The Silver corpus deliberately includes edge cases from projects that use patterns unusual enough to trip rules calibrated on more conventional codebases.
How labeling works
Silver labels are generated by a heuristic engine (SilverLabelEngine) that mirrors each rule's detection logic. They are not human-reviewed. Each rule has a dedicated labeler block that tracks which files changed, whether they are test files or production code, and whether the specific patterns the rule looks for appear in the added lines.
When rule logic is hardened, the labeler is updated to match, and all 618 fixtures are re-labeled from scratch. Silver metrics are directional - they measure labeler-rule agreement, not ground truth correctness. The distinction matters: a rule can achieve 100% Silver precision and recall while still having real-world edge cases the labeler does not model.
On confidence intervals
Precision and recall on a small corpus have a known failure mode: a rule that fires correctly on 5 fixtures and incorrectly on 0 reports 100% precision. That number is accurate on the sample but statistically uninformative. A 95% Wilson score interval for that result spans 57% to 100%.
Per-rule cards include Wilson score confidence intervals alongside each reported figure. The Wilson method is preferred over the standard Wald interval because it stays well-behaved when the proportion is near 0 or 1 and when sample counts are small: both conditions apply to several rules here. A wide interval is not a criticism of the rule. It is a statement about estimation uncertainty given the current corpus size.
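For concreteness, here is a minimal sketch of the Wilson computation behind those intervals, assuming the standard closed form with z = 1.96 for 95% coverage (the class and method names are illustrative, not GauntletCI's API):

```csharp
using System;

class WilsonDemo
{
    // 95% Wilson score interval for a proportion: successes out of n trials.
    static (double Lower, double Upper) Wilson(int successes, int n, double z = 1.96)
    {
        double p = (double)successes / n;
        double z2 = z * z;
        double denom = 1 + z2 / n;
        double center = (p + z2 / (2 * n)) / denom;
        double half = z / denom * Math.Sqrt(p * (1 - p) / n + z2 / (4.0 * n * n));
        return (center - half, center + half);
    }

    static void Main()
    {
        // 5 correct fires, 0 incorrect: point estimate 100%.
        var (lo, hi) = Wilson(5, 5);
        Console.WriteLine($"{lo:P1} - {hi:P1}"); // ~56.6% - 100.0%
    }
}
```

The 5-of-5 case prints the 57%-to-100% interval quoted above, which is exactly why point estimates are not reported alone.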
The labeler-rule gap
The most common failure mode during calibration was labeler-rule misalignment: the labeler and the rule were measuring different things, so all metrics were meaningless regardless of the numbers. Three examples from the calibration log:
The labeler was mapping binary and generated file presence (.dll, .png, .min.js) to GCI0022 "Idempotency and Retry Safety" - a complete semantic mismatch. The rule looks for HTTP POST operations without idempotency keys. Starting precision: 33.3%. Starting recall: 3%. After fixing the labeler to mirror the rule's actual signals, both reached 100%.
The labeler emitted a Positive label whenever one or more null-forgiving operators (!.) appeared in added lines. The rule has an early-return guard: it exits without findings when matchingLines.Count <= 1. The labeler was firing on exactly the cases the rule was designed to skip. Result: 65 false-negative labels. After raising the labeler threshold to count > 1, recall went from 45.8% to 79.5%.
The labeler used a global addedLines list spanning all files in the diff - .ps1, .yml, .md, .cs, test files, everything. The rule only processes non-test .cs files. So "# TODO:" in a PowerShell script or "throw new NotImplementedException" in a test file would generate a Positive label that the rule correctly never fired on. After rewriting the labeler to iterate per-file with path-header tracking, recall went from 56.1% to 100%.
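The three fixes above all reduce to the same move: iterate the diff per file and only label what the rule actually processes. A minimal sketch of that shape, assuming unified-diff input; the identifiers are illustrative, not the actual SilverLabelEngine code:

```csharp
using System;
using System.Collections.Generic;

static class DiffWalker
{
    // Yield (path, addedLine) pairs, tracking the current file via '+++ b/'
    // headers so each added line is attributed to the file it belongs to.
    public static IEnumerable<(string Path, string Line)> AddedLinesByFile(string rawDiff)
    {
        string? path = null;
        foreach (var line in rawDiff.Split('\n'))
        {
            if (line.StartsWith("+++ b/"))
                path = line.Substring("+++ b/".Length).Trim();
            else if (path != null && line.StartsWith("+") && !line.StartsWith("+++"))
                yield return (path, line.Substring(1));
        }
    }

    // Restrict labeling to the files the rule actually processes - a crude
    // path check stands in for the real test-file detection here.
    public static bool IsProductionCSharp(string path) =>
        path.EndsWith(".cs", StringComparison.OrdinalIgnoreCase) &&
        !path.Contains("test", StringComparison.OrdinalIgnoreCase);
}
```

With this structure, a "# TODO:" in a .ps1 file never reaches the labeler's pattern checks, because its added lines carry a path that fails the file filter.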
Per-rule breakdown
Passing
P >= 90%, R >= 75% - 12 rules

Detects guard clause or null-check removal from non-trivial methods. Callers relying on the contract receive a NullReferenceException deeper in the call stack.
LogicKeywords narrowed from 7 to 4 tokens; logic-removal threshold raised from 5 to 15 lines; empty-catch labeler heuristic redirected to GCI0032.
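A hypothetical before/after showing the shape this rule targets:

```csharp
using System;

class Order { }

class Before
{
    public void Process(Order order)
    {
        // Guard clause: a null argument fails fast at the public boundary.
        if (order is null) throw new ArgumentNullException(nameof(order));
        Ship(order);
    }
    void Ship(Order order) => Console.WriteLine(order.GetHashCode());
}

class After
{
    public void Process(Order order)
    {
        // Guard removed: the same null argument now surfaces as a
        // NullReferenceException inside Ship(), deeper in the call stack.
        Ship(order);
    }
    void Ship(Order order) => Console.WriteLine(order.GetHashCode());
}
```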
Detects public method signature changes - added required parameters, changed return types, renamed members - that break callers compiled against the previous signature.
Detects async void methods and event handlers. Exceptions thrown inside async void are unobservable and crash the process in .NET.
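A minimal illustration of the hazard, with hypothetical names:

```csharp
using System;
using System.Threading.Tasks;

class Handler
{
    // Flagged: an exception thrown in an async void method has no Task to
    // carry it, so it is rethrown on the synchronization context and can
    // crash the process.
    public async void OnClick(object? sender, EventArgs e)
    {
        await Task.Delay(10);
        throw new InvalidOperationException("unobservable");
    }

    // Not flagged: the same exception is captured in the returned Task and
    // observed wherever the task is awaited.
    public async Task SaveAsync()
    {
        await Task.Delay(10);
        throw new InvalidOperationException("observable");
    }
}
```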
Detects non-backward-compatible changes to serialization contracts - field removal, type changes, renamed properties without aliases.
Low prevalence in corpus (1 fixture). Metrics are directional at this sample size.
Detects HTTP POST operations or INSERT statements added without an idempotency key or upsert guard.
Labeler was mapping binary/generated file presence to this rule. Replaced with correct idempotency signals: HTTP POST attribute and INSERT without upsert guard.
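A hypothetical example of the flagged shape, assuming an ASP.NET Core controller. The Idempotency-Key header is one common mitigation, not necessarily the only signal the rule accepts:

```csharp
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("orders")]
public class OrdersController : ControllerBase
{
    // Flagged: a POST that creates state with no idempotency key - a client
    // retry after a timeout silently creates a duplicate order.
    [HttpPost]
    public IActionResult Create([FromBody] OrderDto dto) => Ok();

    // Mitigated: the caller supplies a key the server can dedupe on.
    [HttpPost("with-key")]
    public IActionResult CreateIdempotent(
        [FromHeader(Name = "Idempotency-Key")] string idempotencyKey,
        [FromBody] OrderDto dto) => Ok();
}

public record OrderDto(string Sku, int Quantity);
```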
Detects IDisposable instances created with new without a using statement or explicit Dispose call.
Added four skip guards: return new X (caller takes ownership), a callee-owns paren check for service registration, static singletons, and an Enumerator-suffix exemption for disposable types.
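Illustrative examples of a flagged creation and two of the skip-guard cases (hypothetical code, not from the corpus):

```csharp
using System.IO;

class Files
{
    // Flagged: disposable created and abandoned in the same scope.
    public long Size(string path)
    {
        var stream = new FileStream(path, FileMode.Open); // never disposed
        return stream.Length;
    }

    // Not flagged: 'using' guarantees disposal.
    public long SizeSafe(string path)
    {
        using var stream = new FileStream(path, FileMode.Open);
        return stream.Length;
    }

    // Not flagged (skip guard): 'return new X' transfers ownership to the
    // caller, who becomes responsible for disposal.
    public FileStream Open(string path) => new FileStream(path, FileMode.Open);
}
```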
Detects assignment to a shared field inside a property getter or other pure-context method, breaking the side-effect-free contract.
Labeler global early-return replaced with per-file tracking. Generated files (.Designer.cs, .g.cs) excluded. IsNullGuardedInLabelerScope helper added with 20-line lookback.
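A hypothetical minimal case:

```csharp
class Cache
{
    private int _hits;

    // Flagged: a getter that mutates shared state breaks the expectation
    // that reading a property is observably side-effect-free.
    public int Hits
    {
        get { _hits++; return _hits; }
    }

    // Unflagged alternative: the mutation lives in an explicit method.
    public void RecordHit() => _hits++;
    public int HitCount => _hits;
}
```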
Detects HttpClient usage without an explicit timeout or without CancellationToken propagation on outbound HTTP calls.
CheckMissingTimeout narrowed from the 'HttpClient ' substring to 'new HttpClient(' only - a LoggerMessage attribute string literal was triggering it. IHttpClientFactory config guard added. DeleteAsync excluded from the CancellationToken check - DynamoDB and AMQP SDKs use the same method name.
Notable FP
gRPC's GrpcCallInvokerFactory.cs had a [LoggerMessage] attribute with Message = '...only some HttpClient properties...' - the substring 'HttpClient ' triggered the timeout check on a file that never instantiates an HttpClient.
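Sketches of the flagged and accepted shapes, based on the narrowed 'new HttpClient(' signal (illustrative code; the 100-second figure is HttpClient's documented default Timeout):

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class Downloader
{
    // Flagged: no Timeout set and no token passed - the 100-second default
    // hides hung dependencies, and the call cannot be cancelled.
    public Task<string> Fetch(string url)
    {
        var http = new HttpClient();
        return http.GetStringAsync(url);
    }

    // Not flagged: explicit timeout plus CancellationToken propagation.
    public Task<string> Fetch(string url, CancellationToken ct)
    {
        var http = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };
        return http.GetStringAsync(url, ct);
    }
}
```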
Detects TODO, FIXME, HACK, and NotImplementedException in added lines of non-test production C# files.
Rule: for comment lines, marker must be the first token after // - prevents natural-language matches. Labeler: rewritten with per-file rawDiff iteration restricted to non-test .cs files.
Notable FP
Jellyfin's codec path had '// add a spec-compliant dvh1/dav1 variant before the hvc1 hack variant.' - the word 'hack' used as a codec term in a prose comment, not as a HACK: marker.
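A sketch of the first-token check, assuming a regex formulation; the rule's actual implementation may differ, and the NotImplementedException signal is separate from this comment logic:

```csharp
using System.Text.RegularExpressions;

static class MarkerCheck
{
    // Matches only when the marker is the first token after '//', so prose
    // like "...the hvc1 hack variant" no longer matches.
    private static readonly Regex Marker =
        new(@"//\s*(TODO|FIXME|HACK)\b", RegexOptions.IgnoreCase);

    public static bool IsMarkerComment(string line) => Marker.IsMatch(line);
}

// MarkerCheck.IsMarkerComment("// TODO: wire up retries")        -> true
// MarkerCheck.IsMarkerComment("// before the hvc1 hack variant") -> false
```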
Detects the result of an as-cast used without a null check in the same or immediately following expression.
Labeler threshold raised from any (1+) null-forgiving operator to count > 1, matching the rule's matchingLines.Count <= 1 early-return guard. This was generating 65 false-negative labels.
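A hypothetical minimal pair:

```csharp
class Shape { }
class Circle : Shape { public double Radius; }

class Renderer
{
    // Flagged: 'as' yields null on cast failure, and the result is
    // dereferenced with no null check in sight.
    public double Area(Shape shape)
    {
        var circle = shape as Circle;
        return 3.14159 * circle.Radius * circle.Radius; // NRE if not a Circle
    }

    // Not flagged: pattern matching checks and casts in one step.
    public double AreaSafe(Shape shape) =>
        shape is Circle circle ? 3.14159 * circle.Radius * circle.Radius : 0;
}
```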
Detects Thread.Sleep, LINQ enumeration inside loops, and collection .Add() inside unbounded loops.
Labeler rewritten to mirror rule's three-check structure. Unsafe.Add( excluded from .Add() check. Loop detection added for LINQ-in-loop using non-removed-lines context lookback.
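An illustrative method that would trip all three checks (hypothetical code):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;

class Poller
{
    public List<int> Drain(List<int> source, Queue<int> batch)
    {
        var results = new List<int>();
        while (batch.Count > 0)                      // treated as an unbounded loop
        {
            Thread.Sleep(100);                       // flagged: blocking sleep
            var pending = source.Where(x => x > 0)
                                .ToList();           // flagged: LINQ enumeration inside a loop
            results.Add(pending.Count);              // flagged: .Add() inside the loop
            batch.Dequeue();
        }
        return results;
    }
}
```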
Detects direct == or != comparisons between float or double values, which fail silently due to IEEE 754 rounding.
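The classic demonstration:

```csharp
using System;

class FloatCompare
{
    static void Main()
    {
        double sum = 0.1 + 0.2;

        // Flagged: direct equality silently evaluates to false because
        // none of 0.1, 0.2, or 0.3 is exactly representable in binary.
        Console.WriteLine(sum == 0.3);                 // False

        // A tolerance-based comparison behaves as intended.
        Console.WriteLine(Math.Abs(sum - 0.3) < 1e-9); // True
    }
}
```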
In Progress
Measurable but below threshold - 7 rules

Detects unsafe .Value access on Nullable<T> without a null guard, and public methods adding nullable parameters without validation.
Precision fixed: .Value= on LHS skipped, IOptions<T>.Value skipped, constructors excluded, same-line null guard narrowed to regex. Recall gap (62 FNs) remains - the rule processes only added lines and misses .Value access on unchanged context lines.
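A hypothetical minimal pair for the .Value check:

```csharp
class Pricing
{
    // Flagged: .Value on a Nullable<T> with no guard throws
    // InvalidOperationException whenever the value is null.
    public decimal Total(decimal? discount) => 100m - discount.Value;

    // Not flagged: guarded access with an explicit fallback.
    public decimal TotalSafe(decimal? discount) => 100m - (discount ?? 0m);
}
```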
Detects hardcoded connection strings and credentials embedded directly in source code.
Only 2 positive fixtures in corpus. P=50% reflects 2 FPs out of 4 total fires. Low sample size makes metrics noisy.
Detects empty catch blocks, bare catch, and swallowed exceptions where the exception object is never logged or rethrown.
Precision is excellent (98.2%). The 81 false negatives are the active work item - the rule's sub-checks cover specific empty-catch patterns but many real exception-swallowing patterns (logging without rethrowing, catch-then-return-null) are not yet detected.
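Illustrative examples of a pattern the sub-checks cover today and one in the recall gap, based on the description above:

```csharp
using System;
using System.IO;

class Loader
{
    // Detected today: an empty catch silently swallows everything.
    public string? Read(string path)
    {
        try { return File.ReadAllText(path); }
        catch { }
        return null;
    }

    // Part of the recall gap: catch-then-return-null swallows the exception
    // just as thoroughly, but matches none of the empty-catch sub-checks.
    public string? ReadQuiet(string path)
    {
        try { return File.ReadAllText(path); }
        catch (Exception) { return null; }
    }
}
```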
Detects scoped service resolution from root container, missing required service registrations, and lifecycle mismatches in DI configuration.
Detects skipped tests, empty assertions, and tests with uninformative names in test project files.
SilencePatterns narrowed: [Skip] split into [Skip] and [Skip( to avoid matching [SkipLocalsInit]. IsTestFile now excludes paths containing 'testdata'. AssertionKeywords expanded with 6 real-world patterns from MongoDB, Azure SDK, ImageSharp, and ASP.NET Core.
Notable FP
NUnit's testdata directory had [SkipLocalsInit] on a performance-sensitive method. The attribute is a .NET runtime hint with no connection to test skipping - only the substring 'Skip' triggered the match.
Detects significant increases in cyclomatic complexity, deeply nested control flow, and methods exceeding length thresholds.
Detects deviations from established patterns in the same codebase - inconsistent error handling, inconsistent async usage, inconsistent null handling.
Limited corpus coverage
0 fixtures or no signal - 11 rules

Detects removal of existing error handling - exception handlers replaced with empty blocks, logging removed from catch clauses.
10 fixtures were added for GCI0007 but labeling was not completed. Metrics are unavailable.
Detects hardcoded localhost/private IP URLs and environment-specific configuration embedded in source code.
Rule was narrowed from any http:// literal to localhost/private IPs only (docs URLs, nuget.org, github.com excluded). The labeler still marks 13 fixtures as positive under the broader original criteria. The rule fires on nothing in the current corpus.
Detects PII field names (email, ssn, dateofbirth, creditcard, passport, etc.) passed to structured loggers.
PII terms were narrowed from 21 to 16 high-confidence terms, removing 'token', 'address', 'username', 'ipaddress', and 'deviceid'. The labeler was updated to match. The rule still fires on 4 fixtures, none of which is labeled positive - those remaining PII-term hits are likely genuine false positives that the labeler correctly marks as negative.
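For reference, the shape the rule looks for, assuming Microsoft.Extensions.Logging-style structured logging; treating 'userId' as exempt is an assumption based on the narrowed term list:

```csharp
using Microsoft.Extensions.Logging;

class Registration
{
    private readonly ILogger<Registration> _logger;
    public Registration(ILogger<Registration> logger) => _logger = logger;

    public void Complete(string email, string userId)
    {
        // Flagged: 'email' is a high-confidence PII term in a log template.
        _logger.LogInformation("Registered {Email}", email);

        // Presumably not flagged: an opaque identifier, not a listed PII term.
        _logger.LogInformation("Registered {UserId}", userId);
    }
}
```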
Detects public member renames that break naming conventions or deviate from established patterns in the same namespace.
The rule fires on no fixtures in the corpus. Either the Silver corpus contains no naming-contract violations, or the rule's detection patterns need adjustment.
Precision vs recall for a pre-commit tool
For a pre-commit tool, the cost of false positives and false negatives is asymmetric - but not in the way you might expect.
A false positive (rule fires when it should not) generates noise. Developers learn to ignore noisy tools. If a rule fires on every commit for non-issues, it gets disabled or bypassed. False positive tolerance is near zero for tools that run on every commit.
A false negative (rule does not fire when it should) means a real issue reaches code review or production. This is the failure mode the tool exists to prevent - but a single missed finding on a specific commit is usually less catastrophic than a tool that cries wolf on every commit.
The practical implication: during calibration, precision was prioritized over recall. A rule at 98% precision / 40% recall (GCI0032) is more useful in production than a rule at 70% precision / 90% recall. The recall gap is an active improvement target; the precision floor is treated as a hard constraint.
What comes next
Active calibration targets for the 7 in-progress rules:
