AI code review has a trust problem. Not because it is useless. Not because it cannot find real bugs. Not because developers are wrong to experiment with it. The problem is simpler than that.

Code review is not just a writing task. It is an engineering control. When a tool comments on a pull request, blocks a merge, flags a regression, or tells a team that a change is safe, it is participating in the software delivery process. At that point, helpfulness is not enough. The tool has to be repeatable.

That is where the question gets uncomfortable: Can an AI code reviewer give the same answer twice?

More importantly, if it cannot, what does that mean for the code we ship, the trust we place in our tools, and the engineering processes we build around them?

The answer depends on what we mean by "AI code review tool."

If we mean an LLM reading a pull request and deciding what it thinks, then probably not, at least not in the sense engineering teams usually mean by deterministic.

If we mean a deterministic analysis engine that uses AI to explain, summarize, prioritize, or help humans understand findings, then yes. But in that version, the AI is not the reviewer of record. It is the narrator. That distinction matters.

What deterministic means in code review

Most developers use deterministic in a practical way: Same input. Same configuration. Same result.

That is the expectation we bring to compilers, formatters, linters, unit tests, static analyzers, and CI gates. These tools may be incomplete. They may have bugs. They may miss important issues. They may produce false positives. But their failure modes are supposed to be repeatable.

A linter should not flag a line on Monday, ignore it on Tuesday, and flag it again on Wednesday if the code and configuration never changed. A test should not randomly assert a different expected value. A quality gate should not pass or fail because the reviewer phrased the same concern differently on a second run.
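The "same input, same configuration, same result" expectation can be checked mechanically. The sketch below is illustrative only: `lint_line_length` is a toy rule invented for this example, and `is_repeatable` simply hashes the findings from several runs and confirms they are identical.

```python
import hashlib
import json


def lint_line_length(source: str, config: dict) -> list[dict]:
    """Toy rule (hypothetical): flag every line longer than config['max_length']."""
    limit = config["max_length"]
    return [
        {"line": number, "length": len(text)}
        for number, text in enumerate(source.splitlines(), start=1)
        if len(text) > limit
    ]


def result_fingerprint(findings: list[dict]) -> str:
    """Hash findings in a canonical form so two runs can be compared exactly."""
    canonical = json.dumps(findings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_repeatable(check, source: str, config: dict, runs: int = 5) -> bool:
    """Return True if every run over the same input and config gives the same result."""
    fingerprints = {result_fingerprint(check(source, config)) for _ in range(runs)}
    return len(fingerprints) == 1


SAMPLE = "short\n" + "x" * 120 + "\nalso short\n"
CONFIG = {"max_length": 100}

if __name__ == "__main__":
    print(lint_line_length(SAMPLE, CONFIG))
    print("repeatable:", is_repeatable(lint_line_length, SAMPLE, CONFIG))
```

The rule may still be a bad rule. The point is only that its output is a function of the source and the configuration and nothing else.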

Traditional static analysis tools are built closer to this model. GitHub describes CodeQL as a semantic code analysis engine that lets developers query code as though it were data. Microsoft describes Roslyn analyzers as tools that inspect C# and Visual Basic code for style, quality, maintainability, design, and other issues.

These tools are not magic. They do not understand product intent. They do not know every business rule. They can be noisy, incomplete, and wrong. But they are designed around parseable inputs, explicit rules, structured findings, and repeatable execution. That is very different from asking a language model to read a diff and decide what it thinks.

The LLM problem

LLMs are not naturally deterministic systems in the way compilers and analyzers are.

Even when vendors provide reproducibility controls, the guarantees are limited. OpenAI describes seed-based outputs as "mostly" deterministic when the seed and request parameters are held constant. The OpenAI cookbook makes the same point: a fixed seed makes outputs more consistent, but identical results are not guaranteed.

That word "mostly" matters.

"Mostly deterministic" may be fine for a chatbot. It may be fine for a writing assistant. It may even be fine for an optional pull request assistant that leaves suggestions humans can ignore.

But "mostly deterministic" is a weak foundation for a CI gate.

A merge gate needs to be explainable and reproducible. When a developer asks, "Why did this fail?", the answer cannot be, "The model had a different interpretation this time." When a team asks, "Why did this pass last night but fail this morning?", the answer cannot be, "The same prompt produced a different review."
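One way to make this concrete is to run a candidate reviewer many times on the same diff and count the distinct verdicts before letting it gate anything. The sketch below assumes a reviewer is just a callable from diff text to a verdict string. Both reviewers are stand-ins written for this example: `rule_based_reviewer` is a trivial deterministic check, and the "model" reviewer is a seeded random simulation, not a call to any real model API.

```python
import random
from collections import Counter
from typing import Callable


def verdict_distribution(review: Callable[[str], str], diff: str, runs: int) -> Counter:
    """Run the same reviewer on the same diff and count each distinct verdict."""
    return Counter(review(diff) for _ in range(runs))


def safe_to_gate(distribution: Counter) -> bool:
    """A gate candidate must produce exactly one verdict across all runs."""
    return len(distribution) == 1


def rule_based_reviewer(diff: str) -> str:
    """Deterministic stand-in: fail if the diff removes a line containing a None check."""
    removed = [
        line for line in diff.splitlines()
        if line.startswith("-") and not line.startswith("---")
    ]
    return "fail" if any("is None" in line for line in removed) else "pass"


def make_simulated_model_reviewer(rng: random.Random) -> Callable[[str], str]:
    """Simulated sampling reviewer (NOT a real model): the verdict varies between runs."""
    def review(diff: str) -> str:
        return rng.choice(["pass", "fail"])
    return review


DIFF = (
    "--- a/orders.py\n"
    "+++ b/orders.py\n"
    "-    if order is None:\n"
    "-        return\n"
    "+    process(order)\n"
)

if __name__ == "__main__":
    print(verdict_distribution(rule_based_reviewer, DIFF, 20))
    simulated = make_simulated_model_reviewer(random.Random(0))
    print(verdict_distribution(simulated, DIFF, 20))
```

A real model will be far more consistent than a coin flip. The harness is the useful part: it turns "mostly deterministic" into a number a team can look at before wiring a reviewer into a merge gate.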

Deterministic does not mean correct

This is where the discussion often goes wrong. Some people hear "deterministic" and think it means "always right." It does not.

A deterministic tool can be wrong every time. A nondeterministic tool can be right on a particular run. Determinism is not a claim about perfect accuracy. It is a claim about repeatability.

That difference matters because code review is not only about detecting defects. It is also about creating a process teams can trust.

A deterministic rule might say:

This public method changed its return behavior.
This null check was removed.
This exception type changed.
This branch condition became broader.
This changed method has no nearby test update.
This security-sensitive sink now receives a new data path.

Those are structural claims. They can be inspected. They can be tested against fixtures. They can be versioned. They can be debated. If the rule is wrong, it can be fixed.
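As a rough illustration of "tested against fixtures" and "versioned", here is a minimal sketch of one such rule, the exception type change. The rule ID `EXC001`, the version string, and the regex-based matching are all assumptions made for this example. A production rule would more likely work on a parsed syntax tree than on raw lines.

```python
import re
from dataclasses import dataclass

# Matches the exception name in a Python `raise` statement.
RAISE_PATTERN = re.compile(r"^\s*raise\s+([A-Za-z_][A-Za-z0-9_.]*)")


@dataclass(frozen=True)
class Rule:
    rule_id: str
    version: str


# Hypothetical rule identity; bump the version whenever the logic changes.
EXCEPTION_TYPE_CHANGED = Rule(rule_id="EXC001", version="1.0.0")


def raised_types(lines: list[str]) -> set[str]:
    """Collect the exception types raised in a block of source lines."""
    matches = (RAISE_PATTERN.match(line) for line in lines)
    return {m.group(1) for m in matches if m}


def check_exception_type_changed(removed: list[str], added: list[str]) -> list[str]:
    """Report exception types that disappeared and were replaced by different ones."""
    gone = sorted(raised_types(removed) - raised_types(added))
    new = sorted(raised_types(added) - raised_types(removed))
    if not gone or not new:
        return []
    rule = EXCEPTION_TYPE_CHANGED
    return [
        f"{rule.rule_id}@{rule.version}: raise changed from "
        f"{', '.join(gone)} to {', '.join(new)}"
    ]


# Fixtures: (removed lines, added lines, expected findings).
FIXTURES = [
    (["    raise ValueError('bad id')"], ["    raise KeyError('bad id')"],
     ["EXC001@1.0.0: raise changed from ValueError to KeyError"]),
    (["    raise ValueError('bad id')"], ["    raise ValueError('unknown id')"], []),
    (["    return None"], ["    return 0"], []),
]


def run_fixtures() -> bool:
    """Return True when the rule produces the expected output for every fixture."""
    return all(
        check_exception_type_changed(removed, added) == expected
        for removed, added, expected in FIXTURES
    )
```

If someone disputes a finding, the argument happens over a fixture and a version bump, not over what the reviewer happened to say that day.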

An LLM-generated review comment is different. It may say something insightful. It may also say something vague, inconsistent, or unsupported by the actual diff. The hard part is not that the model can be wrong. The hard part is that the reasoning path is not a stable engineering artifact.

Deterministic tools do not need to be smarter than AI to matter. They need to be accountable.

Why repeatability matters

Repeatability matters because developers need to trust the feedback loop.

If a tool flags a problem, a developer should be able to fix the code, rerun the tool, and see the result change for a clear reason. If the tool produces a different answer without a code change, the developer is no longer debugging the code. They are debugging the reviewer. That is poison for adoption.

Repeatability also matters for compliance and auditability.

If a team uses automated review as part of a regulated or high-stakes development process, they may need to show why a change was blocked, why a warning appeared, or why a merge was allowed. A deterministic finding can be tied to a rule version, a commit, a file, a line range, and a piece of evidence. A model-generated judgment is harder to defend. Not impossible. Harder.
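A finding that can be defended later is mostly a matter of what it records. The following is a sketch of such a record under assumed field names (none of them come from a specific tool), with an ID derived from the content so the same evidence always maps to the same identifier.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    """An auditable finding; field names are illustrative, not a real tool's schema."""
    rule_id: str
    rule_version: str
    commit: str
    file: str
    start_line: int
    end_line: int
    evidence: str

    def audit_id(self) -> str:
        """Derive a stable ID from the content, so identical findings share an ID."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    finding = Finding(
        rule_id="NULLCHK",           # hypothetical rule name
        rule_version="2.1.0",
        commit="abc1234",            # placeholder commit hash
        file="orders.py",
        start_line=40,
        end_line=42,
        evidence="Removed: `if order is None: return`",
    )
    print(finding.audit_id(), asdict(finding))
```

Because the ID depends on the rule version as well as the evidence, a change in rule logic produces a visibly different record rather than silently changing what an old ID meant.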

The useful role for AI

The mistake is assuming this is a choice between deterministic tools and AI tools. That is the wrong frame.

The better frame is: What part of the review must be deterministic, and what part can be AI-assisted?

A code review finding should be deterministic. The explanation of that finding can be AI-assisted.

For example, a deterministic engine might produce this:

Rule: Behavioral change risk.
Evidence: A condition changed from accepting all non-null records to accepting only non-null active records.
Validation gap: No test file changed in the same diff.
Risk: Previously accepted inputs may now be excluded.

That finding can be generated without an LLM. It comes from parsing the diff, identifying the changed condition, mapping the affected method, and checking whether relevant tests changed.
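The steps in that paragraph can be sketched in a few dozen lines. This is a deliberately naive version under stated assumptions: the diff is a unified diff of Python source, a "changed condition" is a removed `if`/`elif`/`while` line immediately followed by an added one, and a test file is anything named `test_*` or `*_test.py`. Mapping the condition to its enclosing method is left out.

```python
import re

# Captures the expression of a Python if/elif/while line.
CONDITION = re.compile(r"^\s*(?:if|elif|while)\s+(.*?):\s*$")


def changed_files(diff: str) -> list[str]:
    """List the post-change file paths named in a unified diff."""
    prefix = "+++ b/"
    return [line[len(prefix):] for line in diff.splitlines() if line.startswith(prefix)]


def changed_conditions(diff: str) -> list[tuple[str, str]]:
    """Pair each removed condition with the added condition that directly replaces it."""
    pairs = []
    lines = diff.splitlines()
    for prev, cur in zip(lines, lines[1:]):
        is_removed = prev.startswith("-") and not prev.startswith("---")
        is_added = cur.startswith("+") and not cur.startswith("+++")
        if is_removed and is_added:
            old, new = CONDITION.match(prev[1:]), CONDITION.match(cur[1:])
            if old and new and old.group(1) != new.group(1):
                pairs.append((old.group(1), new.group(1)))
    return pairs


def is_test_file(path: str) -> bool:
    """Naming convention assumed for this sketch only."""
    name = path.rsplit("/", 1)[-1]
    return name.startswith("test_") or name.endswith("_test.py")


def behavioral_change_findings(diff: str) -> list[dict]:
    """Emit one finding per changed condition, noting whether any test file changed."""
    tests_changed = any(is_test_file(path) for path in changed_files(diff))
    return [
        {
            "rule": "Behavioral change risk",
            "evidence": f"Condition changed from `{old}` to `{new}`",
            "validation_gap": not tests_changed,
        }
        for old, new in changed_conditions(diff)
    ]


DIFF = "\n".join([
    "--- a/records.py",
    "+++ b/records.py",
    "@@ -10,3 +10,3 @@",
    " def accept(record):",
    "-    if record is not None:",
    "+    if record is not None and record.active:",
    "         return True",
])

if __name__ == "__main__":
    print(behavioral_change_findings(DIFF))
```

Nothing here decides whether the narrower condition is intended. It only states, repeatably, that the condition changed and that no test moved with it.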

Then an AI layer can help explain it:

This change appears to narrow the accepted input set. If inactive records should now be excluded, add a test proving that behavior. If not, this may be an unintended regression.

That is useful. It is readable. It helps the developer understand the issue. But the AI did not invent the finding. It explained the finding. That is the architecture that can work.
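In code, that separation can be enforced by construction: the verdict is computed from findings alone, and the explainer is a pluggable function whose output is attached afterward. The sketch below uses assumed names throughout. The two explainers are plain template functions standing in for an AI layer, which in practice would be a model call.

```python
from typing import Callable

Explainer = Callable[[dict], str]


def gate_verdict(findings: list[dict]) -> str:
    """Pass or fail depends only on deterministic findings, never on generated prose."""
    return "fail" if any(f["blocking"] for f in findings) else "pass"


def review(findings: list[dict], explain: Explainer) -> dict:
    """Decide first, then narrate; the explainer cannot add, drop, or alter findings."""
    verdict = gate_verdict(findings)
    comments = [{"finding": f, "explanation": explain(f)} for f in findings]
    return {"verdict": verdict, "comments": comments}


def terse_explainer(finding: dict) -> str:
    """Stand-in for an AI narrator (hypothetical)."""
    return f"{finding['rule']}: {finding['evidence']}"


def verbose_explainer(finding: dict) -> str:
    """A second stand-in that words the same finding differently."""
    return (
        f"This change triggered '{finding['rule']}'. {finding['evidence']} "
        "If that is intended, add a test proving the new behavior."
    )


FINDINGS = [
    {
        "rule": "Behavioral change risk",
        "evidence": "A condition now accepts only non-null active records.",
        "blocking": True,
    }
]

if __name__ == "__main__":
    print(review(FINDINGS, terse_explainer)["verdict"])
    print(review(FINDINGS, verbose_explainer)["verdict"])
```

Swapping the explainer changes the wording of the comment and nothing else, which is the property a merge gate needs.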

The open question

So can AI code review tools ever be deterministic? Maybe the better question is: Which part of the tool are we willing to let be nondeterministic?

If the AI is generating prose, summarizing risk, or helping explain deterministic findings, some variability may be acceptable.

If the AI is deciding whether a pull request passes or fails, variability becomes much harder to defend.

The future may not belong to pure static analysis or pure AI review. It may belong to tools that separate evidence from explanation.

What Mythos changes about the conversation

Anthropic's Mythos Preview work makes that separation harder to ignore. Its public coordinated vulnerability disclosure dashboard does not frame AI bug finding as a single model comment dropped into a pull request. It shows a pipeline: candidate findings, external human triage, confirmed findings, maintainer disclosure, patches, advisories, and a disclosure ledger.

The most important line is not the headline count. Anthropic says independent human triage and review are the rate-limiting step between candidate findings and disclosed vulnerabilities. That is the lesson for ordinary engineering teams too. As AI makes candidate discovery cheaper, the bottleneck moves to evidence, repeatability, triage, and accountable decisions.

GauntletCI is not trying to be Mythos and should not be described as a zero-day discovery system. The useful parallel is the workflow shape: a finding should carry enough evidence that a human can decide whether it is real, intentional, risky, patched, waived, or irrelevant. The AI may help explain the candidate. The engineering control should preserve the evidence trail.
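A small sketch of that workflow shape follows. The disposition names are taken from the list in the paragraph above. Everything else, including the class names, the fields, and the rule that a decision needs a named reviewer, is an assumption for illustration and does not describe how GauntletCI or Anthropic's pipeline is implemented.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Disposition(Enum):
    REAL = "real"
    INTENTIONAL = "intentional"
    RISKY = "risky"
    PATCHED = "patched"
    WAIVED = "waived"
    IRRELEVANT = "irrelevant"


@dataclass(frozen=True)
class TriageEntry:
    reviewer: str
    disposition: Disposition
    note: str


@dataclass
class Candidate:
    """A candidate finding plus an append-only record of human decisions about it."""
    finding_id: str
    evidence: str
    history: list[TriageEntry] = field(default_factory=list)

    def triage(self, reviewer: str, disposition: Disposition, note: str) -> None:
        """Record a decision; earlier entries are never rewritten or removed."""
        if not reviewer.strip():
            raise ValueError("a named human must own the decision")
        self.history.append(TriageEntry(reviewer, disposition, note))

    @property
    def current(self) -> Optional[Disposition]:
        """The latest disposition, or None while the candidate is untriaged."""
        return self.history[-1].disposition if self.history else None


if __name__ == "__main__":
    candidate = Candidate("F-1", "Null check removed in orders.py")
    candidate.triage("alice", Disposition.REAL, "Reproduced locally")
    candidate.triage("bob", Disposition.PATCHED, "Fixed in a follow-up commit")
    print(candidate.current, len(candidate.history))
```

An AI layer could draft the `note` text, but the disposition and the name attached to it stay with a person.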

Deterministic core.
AI-assisted interpretation.
Human-owned intent.

That architecture feels more trustworthy than an LLM acting alone. It also feels more realistic than pretending deterministic rules can understand everything a senior engineer understands during review.

If software teams increasingly expect AI to participate in code review, will they demand repeatable engineering evidence from those tools, or will they accept probabilistic judgment because the comments sound useful?

That answer may decide what this category becomes.

Can AI Code Review Tools Ever Be Deterministic?
