Inside the machine
The Silver Benchmark
618 real .NET OSS pull requests. 30 rules. Every number earned through iteration.
This page documents the precision and recall of each GauntletCI detection rule, measured against a labeled corpus of real open-source pull requests. The numbers reflect what it took to get here: labeler rewrites, rule narrowing, skip-guard additions, and calibration passes that surfaced misalignments between what a rule detects and what its labeler measured.
The Corpus
The Silver corpus contains 618 fixtures drawn from pull requests across the most-downloaded .NET open-source projects on GitHub: dotnet/aspnetcore, dotnet/runtime, dotnet/efcore, StackExchange.Redis, Newtonsoft.Json, NUnit, xUnit, MassTransit, gRPC-dotnet, Jellyfin, ImageSharp, and others. Each fixture is a raw diff from a real PR, paired with per-rule labels indicating whether a finding was expected.
The corpus was built to cover diverse change patterns - refactors, feature additions, bug fixes, dependency updates - across repositories with very different coding styles. A benchmark built from a single project would overfit to one team's conventions. The Silver corpus deliberately includes edge cases from projects that use patterns unusual enough to trip rules calibrated on more conventional codebases.
How labeling works
Silver labels are generated by a heuristic engine (SilverLabelEngine) that mirrors each rule's detection logic. They are not human-reviewed. Each rule has a dedicated labeler block that tracks which files changed, whether they are test files or production code, and whether the specific patterns the rule looks for appear in the added lines.
When rule logic is hardened, the labeler is updated to match, and all 618 fixtures are re-labeled from scratch. Silver metrics are directional - they measure labeler-rule agreement, not ground truth correctness. The distinction matters: a rule can achieve 100% Silver precision and recall while still having real-world edge cases the labeler does not model.
On confidence intervals
Precision and recall on a small corpus have a known failure mode: a rule that fires correctly on 5 fixtures and incorrectly on 0 reports 100% precision. That number is accurate on the sample but statistically uninformative. A 95% Wilson score interval for that result spans 57% to 100%.
Per-rule cards include Wilson score confidence intervals alongside each reported figure. The Wilson method is preferred over the standard Wald interval because it stays well-behaved when the proportion is near 0 or 1 and when sample counts are small: both conditions apply to several rules here. A wide interval is not a criticism of the rule. It is a statement about estimation uncertainty given the current corpus size.
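For concreteness, here is a minimal sketch of the Wilson computation behind those intervals, assuming the standard closed form with z = 1.96 for 95% coverage (the class and method names are illustrative, not GauntletCI's API):

```csharp
using System;

class WilsonDemo
{
    // 95% Wilson score interval for a proportion: successes out of n trials.
    static (double Lower, double Upper) Wilson(int successes, int n, double z = 1.96)
    {
        double p = (double)successes / n;
        double z2 = z * z;
        double denom = 1 + z2 / n;
        double center = (p + z2 / (2 * n)) / denom;
        double half = z / denom * Math.Sqrt(p * (1 - p) / n + z2 / (4.0 * n * n));
        return (center - half, center + half);
    }

    static void Main()
    {
        // 5 correct fires, 0 incorrect: point estimate 100%.
        var (lo, hi) = Wilson(5, 5);
        Console.WriteLine($"{lo:P1} - {hi:P1}"); // ~56.6% - 100.0%
    }
}
```

The 5-of-5 case prints the 57%-to-100% interval quoted above, which is exactly why point estimates are not reported alone.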
The labeler-rule gap
The most common failure mode during calibration was labeler-rule misalignment: the labeler and the rule were measuring different things, so all metrics were meaningless regardless of the numbers. Three examples from the calibration log:
The labeler was mapping binary and generated file presence (.dll, .png, .min.js) to GCI0022 "Idempotency and Retry Safety" - a complete semantic mismatch. The rule looks for HTTP POST operations without idempotency keys. Starting precision: 33.3%. Starting recall: 3%. After fixing the labeler to mirror the rule's actual signals, both reached 100%.
The labeler emitted a Positive label whenever one or more null-forgiving operators (!.) appeared in added lines. The rule has an early-return guard: it exits without findings when matchingLines.Count <= 1. The labeler was firing on exactly the cases the rule was designed to skip. Result: 65 false-negative labels. After raising the labeler threshold to count > 1, recall went from 45.8% to 79.5%.
The labeler used a global addedLines list spanning all files in the diff - .ps1, .yml, .md, .cs, test files, everything. The rule only processes non-test .cs files. So "# TODO:" in a PowerShell script or "throw new NotImplementedException" in a test file would generate a Positive label that the rule correctly never fired on. After rewriting the labeler to iterate per-file with path-header tracking, recall went from 56.1% to 100%.
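The three fixes above all reduce to the same move: iterate the diff per file and only label what the rule actually processes. A minimal sketch of that shape, assuming unified-diff input; the identifiers are illustrative, not the actual SilverLabelEngine code:

```csharp
using System;
using System.Collections.Generic;

static class DiffWalker
{
    // Yield (path, addedLine) pairs, tracking the current file via '+++ b/'
    // headers so each added line is attributed to the file it belongs to.
    public static IEnumerable<(string Path, string Line)> AddedLinesByFile(string rawDiff)
    {
        string? path = null;
        foreach (var line in rawDiff.Split('\n'))
        {
            if (line.StartsWith("+++ b/"))
                path = line.Substring("+++ b/".Length).Trim();
            else if (path != null && line.StartsWith("+") && !line.StartsWith("+++"))
                yield return (path, line.Substring(1));
        }
    }

    // Restrict labeling to the files the rule actually processes - a crude
    // path check stands in for the real test-file detection here.
    public static bool IsProductionCSharp(string path) =>
        path.EndsWith(".cs", StringComparison.OrdinalIgnoreCase) &&
        !path.Contains("test", StringComparison.OrdinalIgnoreCase);
}
```

With this structure, a "# TODO:" in a .ps1 file never reaches the labeler's pattern checks, because its added lines carry a path that fails the file filter.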
Per-rule breakdown
Passing
P >= 90%, R >= 75% - 12 rules

Detects guard clause or null-check removal from non-trivial methods. Callers relying on the contract receive a NullReferenceException deeper in the call stack.
LogicKeywords narrowed from 7 to 4 tokens; logic-removal threshold raised from 5 to 15 lines; empty-catch labeler heuristic redirected to GCI0032.
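A hypothetical before/after showing the shape this rule targets:

```csharp
using System;

class Order { }

class Before
{
    public void Process(Order order)
    {
        // Guard clause: a null argument fails fast at the public boundary.
        if (order is null) throw new ArgumentNullException(nameof(order));
        Ship(order);
    }
    void Ship(Order order) => Console.WriteLine(order.GetHashCode());
}

class After
{
    public void Process(Order order)
    {
        // Guard removed: the same null argument now surfaces as a
        // NullReferenceException inside Ship(), deeper in the call stack.
        Ship(order);
    }
    void Ship(Order order) => Console.WriteLine(order.GetHashCode());
}
```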
Detects public method signature changes - added required parameters, changed return types, renamed members - that break callers compiled against the previous signature.
Detects async void methods and event handlers. Exceptions thrown inside async void are unobservable and crash the process in .NET.
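A minimal illustration of the hazard, with hypothetical names:

```csharp
using System;
using System.Threading.Tasks;

class Handler
{
    // Flagged: an exception thrown in an async void method has no Task to
    // carry it, so it is rethrown on the synchronization context and can
    // crash the process.
    public async void OnClick(object? sender, EventArgs e)
    {
        await Task.Delay(10);
        throw new InvalidOperationException("unobservable");
    }

    // Not flagged: the same exception is captured in the returned Task and
    // observed wherever the task is awaited.
    public async Task SaveAsync()
    {
        await Task.Delay(10);
        throw new InvalidOperationException("observable");
    }
}
```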
Detects non-backward-compatible changes to serialization contracts - field removal, type changes, renamed properties without aliases.
Low prevalence in corpus (1 fixture). Metrics are directional at this sample size.
Detects HTTP POST operations or INSERT statements added without an idempotency key or upsert guard.
Labeler was mapping binary/generated file presence to this rule. Replaced with correct idempotency signals: HTTP POST attribute and INSERT without upsert guard.
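A hypothetical example of the flagged shape, assuming an ASP.NET Core controller. The Idempotency-Key header is one common mitigation, not necessarily the only signal the rule accepts:

```csharp
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("orders")]
public class OrdersController : ControllerBase
{
    // Flagged: a POST that creates state with no idempotency key - a client
    // retry after a timeout silently creates a duplicate order.
    [HttpPost]
    public IActionResult Create([FromBody] OrderDto dto) => Ok();

    // Mitigated: the caller supplies a key the server can dedupe on.
    [HttpPost("with-key")]
    public IActionResult CreateIdempotent(
        [FromHeader(Name = "Idempotency-Key")] string idempotencyKey,
        [FromBody] OrderDto dto) => Ok();
}

public record OrderDto(string Sku, int Quantity);
```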
Detects IDisposable instances created with new without a using statement or explicit Dispose call.
Added four skip guards: return new X (caller takes ownership), a callee-owns paren check for service registration, static singletons, and an Enumerator-suffix exemption for disposable types.
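Illustrative examples of a flagged creation and two of the skip-guard cases (hypothetical code, not from the corpus):

```csharp
using System.IO;

class Files
{
    // Flagged: disposable created and abandoned in the same scope.
    public long Size(string path)
    {
        var stream = new FileStream(path, FileMode.Open); // never disposed
        return stream.Length;
    }

    // Not flagged: 'using' guarantees disposal.
    public long SizeSafe(string path)
    {
        using var stream = new FileStream(path, FileMode.Open);
        return stream.Length;
    }

    // Not flagged (skip guard): 'return new X' transfers ownership to the
    // caller, who becomes responsible for disposal.
    public FileStream Open(string path) => new FileStream(path, FileMode.Open);
}
```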
Detects assignment to a shared field inside a property getter or other pure-context method, breaking the side-effect-free contract.
Labeler global early-return replaced with per-file tracking. Generated files (.Designer.cs, .g.cs) excluded. IsNullGuardedInLabelerScope helper added with 20-line lookback.
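A hypothetical minimal case:

```csharp
class Cache
{
    private int _hits;

    // Flagged: a getter that mutates shared state breaks the expectation
    // that reading a property is observably side-effect-free.
    public int Hits
    {
        get { _hits++; return _hits; }
    }

    // Unflagged alternative: the mutation lives in an explicit method.
    public void RecordHit() => _hits++;
    public int HitCount => _hits;
}
```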
Detects HttpClient usage without an explicit timeout or without CancellationToken propagation on outbound HTTP calls.
CheckMissingTimeout narrowed from the 'HttpClient ' substring to 'new HttpClient(' only - a LoggerMessage attribute string literal was triggering it. IHttpClientFactory config guard added. DeleteAsync excluded from the CancellationToken check - DynamoDB and AMQP SDKs use the same method name.
Notable FP
gRPC's GrpcCallInvokerFactory.cs had a [LoggerMessage] attribute with Message = '...only some HttpClient properties...' - the substring 'HttpClient ' triggered the timeout check on a file that never instantiates an HttpClient.
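Sketches of the flagged and accepted shapes, based on the narrowed 'new HttpClient(' signal (illustrative code; the 100-second figure is HttpClient's documented default Timeout):

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class Downloader
{
    // Flagged: no Timeout set and no token passed - the 100-second default
    // hides hung dependencies, and the call cannot be cancelled.
    public Task<string> Fetch(string url)
    {
        var http = new HttpClient();
        return http.GetStringAsync(url);
    }

    // Not flagged: explicit timeout plus CancellationToken propagation.
    public Task<string> Fetch(string url, CancellationToken ct)
    {
        var http = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };
        return http.GetStringAsync(url, ct);
    }
}
```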
Detects TODO, FIXME, HACK, and NotImplementedException in added lines of non-test production C# files.
Rule: for comment lines, marker must be the first token after // - prevents natural-language matches. Labeler: rewritten with per-file rawDiff iteration restricted to non-test .cs files.
Notable FP
Jellyfin's codec path had '// add a spec-compliant dvh1/dav1 variant before the hvc1 hack variant.' - the word 'hack' used as a codec term in a prose comment, not as a HACK: marker.
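A sketch of the first-token check, assuming a regex formulation; the rule's actual implementation may differ, and the NotImplementedException signal is separate from this comment logic:

```csharp
using System.Text.RegularExpressions;

static class MarkerCheck
{
    // Matches only when the marker is the first token after '//', so prose
    // like "...the hvc1 hack variant" no longer matches.
    private static readonly Regex Marker =
        new(@"//\s*(TODO|FIXME|HACK)\b", RegexOptions.IgnoreCase);

    public static bool IsMarkerComment(string line) => Marker.IsMatch(line);
}

// MarkerCheck.IsMarkerComment("// TODO: wire up retries")        -> true
// MarkerCheck.IsMarkerComment("// before the hvc1 hack variant") -> false
```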
Detects the result of an as-cast used without a null check in the same or immediately following expression.
Labeler threshold raised from any (1+) null-forgiving operator to count > 1, matching the rule's matchingLines.Count <= 1 early-return guard. This was generating 65 false-negative labels.
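A hypothetical minimal pair:

```csharp
class Shape { }
class Circle : Shape { public double Radius; }

class Renderer
{
    // Flagged: 'as' yields null on cast failure, and the result is
    // dereferenced with no null check in sight.
    public double Area(Shape shape)
    {
        var circle = shape as Circle;
        return 3.14159 * circle.Radius * circle.Radius; // NRE if not a Circle
    }

    // Not flagged: pattern matching checks and casts in one step.
    public double AreaSafe(Shape shape) =>
        shape is Circle circle ? 3.14159 * circle.Radius * circle.Radius : 0;
}
```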
Detects Thread.Sleep, LINQ enumeration inside loops, and collection .Add() inside unbounded loops.
Labeler rewritten to mirror rule's three-check structure. Unsafe.Add( excluded from .Add() check. Loop detection added for LINQ-in-loop using non-removed-lines context lookback.
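An illustrative method that would trip all three checks (hypothetical code):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;

class Poller
{
    public List<int> Drain(List<int> source, Queue<int> batch)
    {
        var results = new List<int>();
        while (batch.Count > 0)                      // treated as an unbounded loop
        {
            Thread.Sleep(100);                       // flagged: blocking sleep
            var pending = source.Where(x => x > 0)
                                .ToList();           // flagged: LINQ enumeration inside a loop
            results.Add(pending.Count);              // flagged: .Add() inside the loop
            batch.Dequeue();
        }
        return results;
    }
}
```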
Detects direct == or != comparisons between float or double values, which fail silently due to IEEE 754 rounding.
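The classic demonstration:

```csharp
using System;

class FloatCompare
{
    static void Main()
    {
        double sum = 0.1 + 0.2;

        // Flagged: direct equality silently evaluates to false because
        // none of 0.1, 0.2, or 0.3 is exactly representable in binary.
        Console.WriteLine(sum == 0.3);                 // False

        // A tolerance-based comparison behaves as intended.
        Console.WriteLine(Math.Abs(sum - 0.3) < 1e-9); // True
    }
}
```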
In Progress
Measurable but below threshold - 7 rules

Detects unsafe .Value access on Nullable<T> without a null guard, and public methods adding nullable parameters without validation.
Precision fixed: .Value= on LHS skipped, IOptions<T>.Value skipped, constructors excluded, same-line null guard narrowed to regex. Recall gap (62 FNs) remains - the rule processes only added lines and misses .Value access on unchanged context lines.
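A hypothetical minimal pair for the .Value check:

```csharp
class Pricing
{
    // Flagged: .Value on a Nullable<T> with no guard throws
    // InvalidOperationException whenever the value is null.
    public decimal Total(decimal? discount) => 100m - discount.Value;

    // Not flagged: guarded access with an explicit fallback.
    public decimal TotalSafe(decimal? discount) => 100m - (discount ?? 0m);
}
```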
Detects hardcoded connection strings and credentials embedded directly in source code.
Only 2 positive fixtures in corpus. P=50% reflects 2 FPs out of 4 total fires. Low sample size makes metrics noisy.
Detects empty catch blocks, bare catch, and swallowed exceptions where the exception object is never logged or rethrown.
Precision is excellent (98.2%). The 81 false negatives are the active work item - the rule's sub-checks cover specific empty-catch patterns but many real exception-swallowing patterns (logging without rethrowing, catch-then-return-null) are not yet detected.
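Illustrative examples of a pattern the sub-checks cover today and one in the recall gap, based on the description above:

```csharp
using System;
using System.IO;

class Loader
{
    // Detected today: an empty catch silently swallows everything.
    public string? Read(string path)
    {
        try { return File.ReadAllText(path); }
        catch { }
        return null;
    }

    // Part of the recall gap: catch-then-return-null swallows the exception
    // just as thoroughly, but matches none of the empty-catch sub-checks.
    public string? ReadQuiet(string path)
    {
        try { return File.ReadAllText(path); }
        catch (Exception) { return null; }
    }
}
```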
Detects scoped service resolution from root container, missing required service registrations, and lifecycle mismatches in DI configuration.
Detects skipped tests, empty assertions, and tests with uninformative names in test project files.
SilencePatterns narrowed: [Skip] split into [Skip] and [Skip( to avoid matching [SkipLocalsInit]. IsTestFile now excludes paths containing 'testdata'. AssertionKeywords expanded with 6 real-world patterns from MongoDB, Azure SDK, ImageSharp, and ASP.NET Core.
Notable FP
NUnit's testdata directory had [SkipLocalsInit] on a performance-sensitive method. The attribute is a .NET runtime hint with no connection to test skipping - only the substring 'Skip' triggered the match.
Detects significant increases in cyclomatic complexity, deeply nested control flow, and methods exceeding length thresholds.
Detects deviations from established patterns in the same codebase - inconsistent error handling, inconsistent async usage, inconsistent null handling.
Limited corpus coverage
0 fixtures or no signal - 11 rules

Detects removal of existing error handling - exception handlers replaced with empty blocks, logging removed from catch clauses.
10 fixtures were added for GCI0007 but labeling was not completed. Metrics are unavailable.
Detects hardcoded localhost/private IP URLs and environment-specific configuration embedded in source code.
Rule was narrowed from any http:// literal to localhost/private IPs only (docs URLs, nuget.org, github.com excluded). The labeler still marks 13 fixtures as positive under the broader original criteria. The rule fires on nothing in the current corpus.
Detects PII field names (email, ssn, dateofbirth, creditcard, passport, etc.) passed to structured loggers.
PII terms were narrowed from 21 to 16 high-confidence terms, removing 'token', 'address', 'username', 'ipaddress', and 'deviceid'. The labeler was updated to match. The rule still fires on 4 fixtures, none of which is labeled positive - those remaining PII-term hits are likely genuine false positives that the labeler correctly marks as negative.
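For reference, the shape the rule looks for, assuming Microsoft.Extensions.Logging-style structured logging; treating 'userId' as exempt is an assumption based on the narrowed term list:

```csharp
using Microsoft.Extensions.Logging;

class Registration
{
    private readonly ILogger<Registration> _logger;
    public Registration(ILogger<Registration> logger) => _logger = logger;

    public void Complete(string email, string userId)
    {
        // Flagged: 'email' is a high-confidence PII term in a log template.
        _logger.LogInformation("Registered {Email}", email);

        // Presumably not flagged: an opaque identifier, not a listed PII term.
        _logger.LogInformation("Registered {UserId}", userId);
    }
}
```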
Detects public member renames that break naming conventions or deviate from established patterns in the same namespace.
The rule fires on no fixtures in the corpus. Either the Silver corpus contains no naming-contract violations, or the rule's detection patterns need adjustment.
Precision vs recall for a pre-commit tool
For a pre-commit tool, the cost of false positives and false negatives is asymmetric - but not in the way you might expect.
A false positive (rule fires when it should not) generates noise. Developers learn to ignore noisy tools. If a rule fires on every commit for non-issues, it gets disabled or bypassed. False positive tolerance is near zero for tools that run on every commit.
A false negative (rule does not fire when it should) means a real issue reaches code review or production. This is the failure mode the tool exists to prevent - but a single missed finding on a specific commit is usually less catastrophic than a tool that cries wolf on every commit.
The practical implication: during calibration, precision was prioritized over recall. A rule at 98% precision / 40% recall (GCI0032) is more useful in production than a rule at 70% precision / 90% recall. The recall gap is an active improvement target; the precision floor is treated as a hard constraint.
What comes next
Active calibration targets for the 7 in-progress rules:
