Inside the machine

The Silver Benchmark

618 real .NET OSS pull requests. 30 rules. Every number earned through iteration.

This page documents the precision and recall of each GauntletCI detection rule, measured against a labeled corpus of real open-source pull requests. The numbers reflect what it took to get here: labeler rewrites, rule narrowing, skip-guard additions, and calibration passes that surfaced misalignments between what a rule detects and what its labeler measured.

618 Fixtures · 23 Rules Benchmarked · 90.7% Macro Precision (avg across 19 active rules) · 80.9% Macro Recall (avg across 19 active rules)

The Corpus

The Silver corpus contains 618 fixtures drawn from pull requests across the most-downloaded .NET open-source projects on GitHub: dotnet/aspnetcore, dotnet/runtime, dotnet/efcore, StackExchange.Redis, Newtonsoft.Json, NUnit, xUnit, MassTransit, gRPC-dotnet, Jellyfin, ImageSharp, and others. Each fixture is a raw diff from a real PR, paired with per-rule labels indicating whether a finding was expected.
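In shape (field names invented for illustration, not GauntletCI's actual schema), a fixture is roughly a diff plus a per-rule expectation map:

```python
from dataclasses import dataclass, field

@dataclass
class Fixture:
    """One Silver corpus entry: a raw PR diff plus per-rule expected labels."""
    source_repo: str          # e.g. "dotnet/runtime"
    raw_diff: str             # unified diff text from the real PR
    # rule id -> whether a finding is expected on this diff
    expected: dict = field(default_factory=dict)

fx = Fixture(
    source_repo="dotnet/runtime",
    raw_diff="+++ b/src/Foo.cs\n+var w = obj as Widget;\n",
    expected={"GCI0043": False},  # a single as-cast sits below that rule's threshold
)
```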

The corpus was built to cover diverse change patterns - refactors, feature additions, bug fixes, dependency updates - across repositories with very different coding styles. A benchmark built from a single project would overfit to one team's conventions. The Silver corpus deliberately includes edge cases from projects that use patterns unusual enough to trip rules calibrated on more conventional codebases.

How labeling works

Silver labels are generated by a heuristic engine (SilverLabelEngine) that mirrors each rule's detection logic. They are not human-reviewed. Each rule has a dedicated labeler block that tracks which files changed, whether they are test files or production code, and whether the specific patterns the rule looks for appear in the added lines.

When rule logic is hardened, the labeler is updated to match, and all 618 fixtures are re-labeled from scratch. Silver metrics are directional - they measure labeler-rule agreement, not ground truth correctness. The distinction matters: a rule can achieve 100% Silver precision and recall while still having real-world edge cases the labeler does not model.
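A minimal sketch of that contract, with invented names: each labeler block re-implements its rule's signal set over added lines, and Silver metrics compare labeler verdicts against rule findings.

```python
RULE_SIGNALS = {
    # rule id -> substrings the labeler scans for in added lines (illustrative)
    "GCI0042": ("// TODO", "// FIXME", "// HACK", "NotImplementedException"),
}

def added_lines(raw_diff):
    """Added lines ('+' prefix), excluding '+++' file headers."""
    return [l[1:] for l in raw_diff.splitlines()
            if l.startswith("+") and not l.startswith("+++")]

def label(rule_id, raw_diff):
    """Heuristic Silver label: does any rule signal appear in the added lines?"""
    signals = RULE_SIGNALS[rule_id]
    return any(s in line for line in added_lines(raw_diff) for s in signals)
```

When a rule's detection logic changes, the corresponding signal tuple changes with it, and re-labeling amounts to re-running `label` over all fixtures.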

On confidence intervals

Precision and recall on a small corpus have a known failure mode: a rule that fires correctly on 5 fixtures and incorrectly on 0 reports 100% precision. That number is accurate on the sample but statistically uninformative. A 95% Wilson score interval for that result spans 57% to 100%.

Per-rule cards include Wilson score confidence intervals alongside each reported figure. The Wilson method is preferred over the standard Wald interval because it stays well-behaved when the proportion is near 0 or 1 and when sample counts are small: both conditions apply to several rules here. A wide interval is not a criticism of the rule. It is a statement about estimation uncertainty given the current corpus size.
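As a sketch of the computation (not GauntletCI's actual implementation), the Wilson interval for k successes out of n trials:

```python
from math import sqrt

def wilson(k, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives ~95% coverage."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)

lo, hi = wilson(5, 5)  # 5 correct fires, 0 incorrect: sample precision 100%
# lo is roughly 0.57 and hi is 1.0 - the 57%-100% interval quoted above
```

A Wald interval for the same result collapses to 100% ± 0%, which is exactly the degenerate behavior near p = 1 that the Wilson form avoids.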

The labeler-rule gap

The most common failure mode during calibration was labeler-rule misalignment: the labeler and the rule were measuring different things, so all metrics were meaningless regardless of the numbers. Three examples from the calibration log:

GCI0022: Complete semantic mismatch

The labeler was mapping binary and generated file presence (.dll, .png, .min.js) to GCI0022 "Idempotency and Retry Safety" - a complete semantic mismatch. The rule looks for HTTP POST operations without idempotency keys. Starting precision: 33.3%. Starting recall: 3%. After fixing the labeler to mirror the rule's actual signals, both reached 100%.

GCI0043: Early-return guard not reflected in labeler

The labeler emitted a Positive label whenever one or more null-forgiving operators (!.) appeared in added lines. The rule has an early-return guard: it exits without findings when matchingLines.Count <= 1. The labeler was firing on exactly the cases the rule was designed to skip. Result: 65 false-negative labels. After raising the labeler threshold to count > 1, recall went from 45.8% to 79.5%.
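The mismatch is easy to reproduce in miniature. A sketch of the before/after labeler threshold (invented code, not the actual SilverLabelEngine):

```python
def null_forgiving_count(added_lines):
    """Count added lines containing the null-forgiving operator '!.'."""
    return sum(1 for line in added_lines if "!." in line)

def old_label(added_lines):
    return null_forgiving_count(added_lines) >= 1  # fires on single occurrences

def new_label(added_lines):
    # Mirrors the rule's `matchingLines.Count <= 1` early-return guard.
    return null_forgiving_count(added_lines) > 1

single = ["var name = order!.Customer;"]
several = ["var name = order!.Customer;", "var id = order!.Id;"]
```

The 65 false-negative labels all came from diffs shaped like `single`: the old labeler said Positive while the rule, by design, stayed silent.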

GCI0042: Labeler scope wider than rule scope

The labeler used a global addedLines list spanning all files in the diff - .ps1, .yml, .md, .cs, test files, everything. The rule only processes non-test .cs files. So "# TODO:" in a PowerShell script or "throw new NotImplementedException" in a test file would generate a Positive label that the rule correctly never fired on. After rewriting the labeler to iterate per-file with path-header tracking, recall went from 56.1% to 100%.
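A minimal per-file sketch of the fix (path heuristics invented for illustration; the real test-file detection is richer):

```python
import re

TEST_PATH = re.compile(r"(^|/)tests?/", re.IGNORECASE)

def added_lines_by_file(raw_diff):
    """Group added lines under the target path from each '+++ b/...' header."""
    files, current = {}, None
    for line in raw_diff.splitlines():
        if line.startswith("+++ "):
            path = line[4:].strip()
            current = path[2:] if path.startswith("b/") else path
            files.setdefault(current, [])
        elif line.startswith("+") and current is not None:
            files[current].append(line[1:])
    return files

def in_scope(path):
    # The rule only processes non-test .cs files; the labeler must match.
    return path.endswith(".cs") and not TEST_PATH.search(path)

def label_todo(raw_diff):
    return any(in_scope(path) and any("TODO" in l for l in lines)
               for path, lines in added_lines_by_file(raw_diff).items())
```

With per-file scoping, a TODO in a .ps1 script or a test file no longer produces a Positive label the rule can never match.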

Per-rule breakdown

Passing

P >= 90%, R >= 75% - 12 rules
GCI0003 · Behavioral Change Detection · Correctness · fires on 20.4% of fixtures

Detects guard clause or null-check removal from non-trivial methods. Callers relying on the contract receive NullReferenceException deeper in the call stack.

Metrics: Precision 97.6% (±3.0%) · Recall 75.5% (±6.6%) · F1 85.1%

Confusion matrix: TP 123 · FP 3 · FN 40 · TN 452

Improved: P 59.0% -> 97.6% · R 40.4% -> 75.5%

LogicKeywords narrowed from 7 to 4 tokens; logic-removal threshold raised from 5 to 15 lines; empty-catch labeler heuristic redirected to GCI0032.

GCI0004 · Breaking Change Risk · API Contracts · fires on 4.0% of fixtures

Detects public method signature changes - added required parameters, changed return types, renamed members - that break callers compiled against the previous signature.

Metrics: Precision 100% (±6.7%) · Recall 100% (±6.7%) · F1 100%

Confusion matrix: TP 25 · FP 0 · FN 0 · TN 593

GCI0016 · Async Concurrency Risk · Concurrency · fires on 4.9% of fixtures

Detects async void methods and event handlers. Exceptions thrown inside async void are unobservable and crash the process in .NET.

Metrics: Precision 100% (±5.7%) · Recall 88.2% (±11.0%) · F1 93.7%

Confusion matrix: TP 30 · FP 0 · FN 4 · TN 584

GCI0021 · Data and Schema Compatibility · Schema · fires on 0.2% of fixtures

Detects non-backward-compatible changes to serialization contracts - field removal, type changes, renamed properties without aliases.

Metrics: Precision 100% (±39.7%) · Recall 100% (±39.7%) · F1 100%

Confusion matrix: TP 1 · FP 0 · FN 0 · TN 617

Low prevalence in corpus (1 fixture). Metrics are directional at this sample size.

GCI0022 · Idempotency and Retry Safety · API Contracts · fires on 0.2% of fixtures

Detects HTTP POST operations or INSERT statements added without an idempotency key or upsert guard.

Metrics: Precision 100% (±39.7%) · Recall 100% (±39.7%) · F1 100%

Confusion matrix: TP 1 · FP 0 · FN 0 · TN 617

Improved: P 33.3% -> 100% · R 3.0% -> 100%

Labeler was mapping binary/generated file presence to this rule. Replaced with correct idempotency signals: HTTP POST attribute and INSERT without upsert guard.

GCI0024 · Resource Lifecycle · Resources · fires on 4.4% of fixtures

Detects IDisposable instances created with new without a using statement or explicit Dispose call.

Metrics: Precision 92.6% (±10.7%) · Recall 83.3% (±13.1%) · F1 87.7%

Confusion matrix: TP 25 · FP 2 · FN 5 · TN 586

Improved: P 38.8% -> 92.6% · R 55.9% -> 83.3%

Added four skip guards: return new X (caller takes ownership), a callee-owns parenthesis check for service registration, static singletons, and Enumerator-suffixed types dropped from the disposable-type list.

GCI0036 · Pure Context Mutation · Correctness · fires on 0.3% of fixtures

Detects assignment to a shared field inside a property getter or other pure-context method, breaking the side-effect-free contract.

Metrics: Precision 100% (±32.9%) · Recall 100% (±32.9%) · F1 100%

Confusion matrix: TP 2 · FP 0 · FN 0 · TN 616

Improved: P 14.3% -> 100% · R 33.3% -> 100%

Labeler global early-return replaced with per-file tracking. Generated files (.Designer.cs, .g.cs) excluded. IsNullGuardedInLabelerScope helper added with 20-line lookback.

GCI0039 · External Service Safety · Reliability · fires on 0.8% of fixtures

Detects HttpClient usage without an explicit timeout or without CancellationToken propagation on outbound HTTP calls.

Metrics: Precision 100% (±21.7%) · Recall 100% (±21.7%) · F1 100%

Confusion matrix: TP 5 · FP 0 · FN 0 · TN 613

Improved: P 62.5% -> 100% · R 55.6% -> 100%

CheckMissingTimeout narrowed from 'HttpClient ' substring to 'new HttpClient(' only - a LoggerMessage attribute string literal was triggering it. IHttpClientFactory factory config guard added. DeleteAsync excluded from CancellationToken check - DynamoDB and AMQP SDKs use the same method name.

Notable FP

gRPC's GrpcCallInvokerFactory.cs had a [LoggerMessage] attribute with Message = '...only some HttpClient properties...' - the substring 'HttpClient ' triggered the timeout check on a file that never instantiates an HttpClient.
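The narrowing is a one-token change. A sketch, with the gRPC attribute paraphrased (illustrative strings, not the exact source):

```python
logger_msg = '[LoggerMessage(Message = "...only some HttpClient properties...")]'
instantiation = "HttpClient client = new HttpClient();"

def fires_old(line):
    return "HttpClient " in line      # broad substring: trips on string literals

def fires_new(line):
    return "new HttpClient(" in line  # only an actual instantiation
```

The old check fires on the attribute's message text; the narrowed check fires only where an HttpClient is constructed.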

GCI0042 · TODO/Stub Detection · Code Quality · fires on 5.7% of fixtures

Detects TODO, FIXME, HACK, and NotImplementedException in added lines of non-test production C# files.

Metrics: Precision 100% (±4.9%) · Recall 100% (±4.9%) · F1 100%

Confusion matrix: TP 35 · FP 0 · FN 0 · TN 583

Improved: P 60.5% -> 100% · R 56.1% -> 100%

Rule: for comment lines, marker must be the first token after // - prevents natural-language matches. Labeler: rewritten with per-file rawDiff iteration restricted to non-test .cs files.

Notable FP

Jellyfin's codec path had '// add a spec-compliant dvh1/dav1 variant before the hvc1 hack variant.' - the word 'hack' used as a codec term in a prose comment, not as a HACK: marker.
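A sketch of the first-token constraint (regex invented here, not the rule's actual pattern):

```python
import re

# Marker must be the first token after //; even case-insensitively, a marker
# word buried mid-sentence no longer matches.
MARKER = re.compile(r"^\s*//\s*(TODO|FIXME|HACK)\b", re.IGNORECASE)

prose = "// add a spec-compliant dvh1/dav1 variant before the hvc1 hack variant."
real = "// TODO: handle the zero-length payload case"
```

The Jellyfin line starts with "add", so it never matches; a genuine marker comment still does.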

GCI0043 · Nullability and Type Safety · Type Safety · fires on 11.0% of fixtures

Detects the result of an as-cast used without a null check in the same or immediately following expression.

Metrics: Precision 97.1% (±4.6%) · Recall 79.5% (±8.6%) · F1 87.4%

Confusion matrix: TP 66 · FP 2 · FN 17 · TN 533

Improved: P 73.3% -> 97.1% · R 45.8% -> 79.5%

Labeler threshold raised from any (1+) null-forgiving operator to count > 1, matching the rule's matchingLines.Count <= 1 early-return guard. This was generating 65 false-negative labels.

GCI0044 · Performance Hotpath Risk · Performance · fires on 4.7% of fixtures

Detects Thread.Sleep, LINQ enumeration inside loops, and collection .Add() inside unbounded loops.

Metrics: Precision 100% (±5.8%) · Recall 96.7% (±8.0%) · F1 98.3%

Confusion matrix: TP 29 · FP 0 · FN 1 · TN 588

Improved: P 33.3% -> 100% · R 44.0% -> 96.7%

Labeler rewritten to mirror rule's three-check structure. Unsafe.Add( excluded from .Add() check. Loop detection added for LINQ-in-loop using non-removed-lines context lookback.

GCI0049 · Float/Double Equality Comparison · Type Safety · fires on 0.8% of fixtures

Detects direct == or != comparisons between float or double values, which fail silently due to IEEE 754 rounding.

Metrics: Precision 100% (±21.7%) · Recall 83.3% (±26.7%) · F1 90.9%

Confusion matrix: TP 5 · FP 0 · FN 1 · TN 612

In Progress

Measurable but below threshold - 7 rules
GCI0006 · Edge Case Handling · Correctness · fires on 14.6% of fixtures

Detects unsafe .Value access on Nullable<T> without a null guard, and public methods adding nullable parameters without validation.

Metrics: Precision 95.6% (±4.6%) · Recall 58.1% (±7.9%) · F1 72.3%

Confusion matrix: TP 86 · FP 4 · FN 62 · TN 466

Improved: P 43.5% -> 95.6% · R 38.5% -> 58.1%

Precision fixed: .Value= on LHS skipped, IOptions<T>.Value skipped, constructors excluded, same-line null guard narrowed to regex. Recall gap (62 FNs) remains - the rule processes only added lines and misses .Value access on unchanged context lines.

GCI0012 · Security Risk · Security · fires on 0.6% of fixtures

Detects hardcoded connection strings and credentials embedded directly in source code.

Metrics: Precision 50.0% (±35.0%) · Recall 100% (±32.9%) · F1 66.7%

Confusion matrix: TP 2 · FP 2 · FN 0 · TN 614

Only 2 positive fixtures in corpus. P=50% reflects 2 FPs out of 4 total fires. Low sample size makes metrics noisy.

GCI0032 · Uncaught Exception Path · Error Handling · fires on 9.1% of fixtures

Detects empty catch blocks, bare catch, and swallowed exceptions where the exception object is never logged or rethrown.

Metrics: Precision 98.2% (±4.6%) · Recall 40.4% (±8.1%) · F1 57.2%

Confusion matrix: TP 55 · FP 1 · FN 81 · TN 481

Precision is excellent (98.2%). The 81 false negatives are the active work item - the rule's sub-checks cover specific empty-catch patterns but many real exception-swallowing patterns (logging without rethrowing, catch-then-return-null) are not yet detected.

GCI0038 · Dependency Injection Safety · Resources · fires on 4.7% of fixtures

Detects scoped service resolution from root container, missing required service registrations, and lifecycle mismatches in DI configuration.

Metrics: Precision 72.4% (±15.5%) · Recall 47.7% (±14.2%) · F1 57.5%

Confusion matrix: TP 21 · FP 8 · FN 23 · TN 566

GCI0041 · Test Quality Gaps · Testing · fires on 2.4% of fixtures

Detects skipped tests, empty assertions, and tests with uninformative names in test project files.

Metrics: Precision 73.3% (±20.5%) · Recall 100% (±12.9%) · F1 84.6%

Confusion matrix: TP 11 · FP 4 · FN 0 · TN 603

Improved: P 62.5% -> 73.3% · R 80.0% -> 100%

SilencePatterns narrowed: [Skip] split into [Skip] and [Skip( to avoid matching [SkipLocalsInit]. IsTestFile now excludes paths containing 'testdata'. AssertionKeywords expanded with 6 real-world patterns from MongoDB, Azure SDK, ImageSharp, and ASP.NET Core.

Notable FP

NUnit's testdata directory had [SkipLocalsInit] on a performance-sensitive method. The attribute is a .NET runtime hint with no connection to test skipping - only the substring 'Skip' triggered the match.
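The SkipLocalsInit fix reduces to substring choice. A sketch (pattern lists abbreviated from the description above):

```python
OLD_SILENCE = ("[Skip",)            # open-ended substring: too broad
NEW_SILENCE = ("[Skip]", "[Skip(")  # closed attribute forms only

def fires(line, patterns):
    return any(p in line for p in patterns)

runtime_hint = "[SkipLocalsInit]"       # .NET runtime perf hint, not a test skip
skipped_test = '[Skip("flaky on CI")]'  # illustrative skip attribute
```

Closing the pattern with `]` or `(` excludes attribute names that merely begin with "Skip".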

GCI0045 · Complexity Control · Maintainability · fires on 3.2% of fixtures

Detects significant increases in cyclomatic complexity, deeply nested control flow, and methods exceeding length thresholds.

Metrics: Precision 65.0% (±19.3%) · Recall 27.1% (±12.2%) · F1 38.3%

Confusion matrix: TP 13 · FP 7 · FN 35 · TN 563

GCI0046 · Pattern Consistency Deviation · Maintainability · fires on 3.4% of fixtures

Detects deviations from established patterns in the same codebase - inconsistent error handling, inconsistent async usage, inconsistent null handling.

Metrics: Precision 81.0% (±16.2%) · Recall 56.7% (±16.7%) · F1 66.7%

Confusion matrix: TP 17 · FP 4 · FN 13 · TN 584

Limited corpus coverage

0 fixtures or no signal - 11 rules
GCI0007 · Error Handling Integrity · Error Handling · no trigger data

Detects removal of existing error handling - exception handlers replaced with empty blocks, logging removed from catch clauses.

Metrics: Precision -- · Recall -- · F1 --

Confusion matrix: TP 0 · FP 0 · FN 0 · TN 0

10 fixtures were added for GCI0007 but labeling was not completed. Metrics are unavailable.

GCI0010 · Hardcoding and Configuration · Security · fires on 0.0% of fixtures

Detects hardcoded localhost/private IP URLs and environment-specific configuration embedded in source code.

Metrics: Precision -- · Recall 0.0% (±11.4%) · F1 --

Confusion matrix: TP 0 · FP 0 · FN 13 · TN 605

The rule was narrowed from any http:// literal to localhost and private-IP URLs only (docs URLs, nuget.org, github.com excluded). The labeler still marks 13 fixtures positive under the original, broader criteria, while the narrowed rule fires on nothing in the current corpus - hence the 0% recall against a stale label set.
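A sketch of the narrowed matcher (regex invented here; the rule's actual pattern set is not shown):

```python
import re

# localhost plus RFC 1918 private ranges; public http:// URLs stay exempt
HARDCODED_URL = re.compile(
    r"http://(localhost|127\.0\.0\.1"
    r"|10\.\d{1,3}\.\d{1,3}\.\d{1,3}"
    r"|192\.168\.\d{1,3}\.\d{1,3}"
    r"|172\.(1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3})"
)
```

Under a matcher like this, a hardcoded dev endpoint trips the rule while a docs link in a comment does not.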

GCI0029 · PII Entity Logging Leak · Security · fires on 0.6% of fixtures

Detects PII field names (email, ssn, dateofbirth, creditcard, passport, etc.) passed to structured loggers.

Metrics: Precision 0.0% (±24.5%) · Recall -- · F1 --

Confusion matrix: TP 0 · FP 4 · FN 0 · TN 614

PII terms were narrowed from 21 to 16 high-confidence terms, removing 'token', 'address', 'username', 'ipaddress', 'deviceid'. The labeler was updated to match. The 4 remaining FPs fire on the corpus but no fixture is labeled positive - the remaining PII term hits are likely false positives the labeler correctly marks as negative.

GCI0047 · Naming/Contract Alignment · API Contracts · fires on 0.0% of fixtures

Detects public member renames that break naming conventions or deviate from established patterns in the same namespace.

Metrics: Precision -- · Recall -- · F1 --

Confusion matrix: TP 0 · FP 0 · FN 0 · TN 618

Rule fires on no fixtures in the corpus. Either the Silver corpus does not contain fixtures where naming contract violations occur, or the rule's detection patterns need adjustment.

Precision vs recall for a pre-commit tool

For a pre-commit tool, the cost of false positives and false negatives is asymmetric - but not in the way you might expect.

A false positive (rule fires when it should not) generates noise. Developers learn to ignore noisy tools. If a rule fires on every commit for non-issues, it gets disabled or bypassed. False positive tolerance is near zero for tools that run on every commit.

A false negative (rule does not fire when it should) means a real issue reaches code review or production. This is the failure mode the tool exists to prevent - but a single missed finding on a specific commit is usually less catastrophic than a tool that cries wolf on every commit.

The practical implication: during calibration, precision was prioritized over recall. A rule at 98% precision / 40% recall (GCI0032) is more useful in production than a rule at 70% precision / 90% recall. The recall gap is an active improvement target; the precision floor is treated as a hard constraint.
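The trade-off can be made concrete with the standard definitions, plugging in GCI0032's confusion matrix from its card above (TP 55, FP 1, FN 81):

```python
def metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    f1 = (2 * precision * recall / (precision + recall)
          if precision and recall else None)
    return precision, recall, f1

p, r, f1 = metrics(55, 1, 81)
# p ~ 0.982, r ~ 0.404, f1 ~ 0.573: a middling F1 that hides how quiet
# and trustworthy the rule is on every commit
```

Ranking by F1 alone would prefer a hypothetical 70% / 90% rule (F1 ~ 0.79), which is exactly the ordering the precision-floor policy rejects.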

What comes next

Active calibration targets for the 7 in-progress rules:

GCI0032 · 98.2% P / 40.4% R · Recall gap. 81 FNs. Expanding sub-checks for catch-then-return-null, logging without rethrow.
GCI0006 · 95.6% P / 58.1% R · Recall gap. 62 FNs. Rule processes only added lines, missing .Value access on context lines.
GCI0045 · 65.0% P / 27.1% R · Both gaps. The hardest rule to calibrate - cyclomatic complexity signals are noisy by nature.
GCI0038 · 72.4% P / 47.7% R · Both gaps. DI safety patterns are highly framework-specific.
GCI0046 · 81.0% P / 56.7% R · Recall gap. Pattern deviation detection depends on what patterns exist in context lines.
GCI0041 · 73.3% P / 100% R · Precision gap. 4 FPs remain after hardening. Recall is perfect.
GCI0012 · 50.0% P / 100% R · Precision gap. Only 4 total fires, 2 FPs. Insufficient sample to tune further.