Automated QA for Non-Deterministic AI: Testing What You Cannot Predict

1. The Core Problem: AI Output Is Structurally Unreliable

Every SAGE reading—whether a 7-section Daily snapshot or an 8-section Premium deep-dive—follows a rigid structural contract. Each section has a defined purpose: the mystical opening, the court analysis, the verdict banner, the elemental field, the closing guidance table. This structure is not decorative; it is the interface between the AI engine and the user experience.

The problem is fundamental: Large Language Models are probabilistic systems being asked to produce deterministic output. No amount of prompt engineering can guarantee structural compliance over thousands of readings across 10 languages. The LLM will, eventually:

Drop an entire section silently
Leak internal placeholder tokens into visible text
Duplicate a structural label, confusing the renderer
Generate an empty section that passes a naive length check
Contaminate the verdict banner with narrative prose
Invent a figure name that doesn't exist in either tradition
Produce syntactically broken Markdown that collapses the layout

These are not theoretical risks. Every one of these failures has occurred in production and was caught by the pipeline described in this article.

⚠️ The Fundamental Tension

Traditional software testing assumes deterministic output: given input X, expect output Y. With generative AI, the same input can produce different output on every run, so an exact-match assertion is useless. You must instead define structural invariants: properties that must hold true regardless of the specific prose the LLM generates.
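As a rough illustration of what an invariant-style check looks like, here is a minimal Python sketch. The section names, the `## Header` convention, the `MIN_CHARS` threshold, and the helper names are all hypothetical; the article does not show SAGE's real schema.

```python
import re

# Hypothetical section headers and threshold; the real SAGE schema is not shown here.
REQUIRED_SECTIONS = ["Opening", "Court Analysis", "Verdict"]
MIN_CHARS = 40


def split_sections(reading: str) -> dict[str, str]:
    """Split a reading into {header: body}, assuming '## Header' lines."""
    parts = re.split(r"^## +(.+)$", reading, flags=re.MULTILINE)
    # With one capture group, re.split yields [preamble, header1, body1, header2, body2, ...]
    return {h.strip(): b.strip() for h, b in zip(parts[1::2], parts[2::2])}


def check_invariants(reading: str) -> list[str]:
    """Return the violated invariants; an empty list means the reading passes."""
    sections = split_sections(reading)
    errors = []
    for name in REQUIRED_SECTIONS:
        if name not in sections:
            errors.append(f"missing section: {name}")
        elif len(sections[name]) < MIN_CHARS:
            errors.append(f"section too short: {name}")
    return errors
```

Two readings with entirely different prose both pass, as long as every required section is present and long enough. The assertion is about shape, not wording.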

2. The 11-Gate Evaluation Pipeline

SAGE's QA pipeline is a fully automated evaluation suite that runs after every AI-generated reading. It is not a spot-check or a sampling strategy—it is a mandatory gate that every reading must pass before delivery. The pipeline validates five dimensions of reading integrity: structural completeness, semantic depth, data fidelity, encoding purity, and formatting hygiene.

11 validation gates · 10 languages covered · 2 traditions (Western & Ramal) · 40 evaluation paths

Gate	Check Name	What It Validates
1	Section Presence	All 7 (Daily) or 8 (Premium) structural sections exist in the output
2	Narrative Depth	Each section exceeds minimum character thresholds — rejects empty or platitude-only sections
3	Mandatory Labels	Mode-aware subheader verification: Premium Section 8 closing table, Daily Section 3 guidance bullets
4	Token Leakage Guard	Detects leaked internal `ZRX_*_XZR` placeholder tokens that were never meant for user eyes
5	Duplicate Label Guard	Ensures no structural label (e.g., "Strategic Alignment", "Vibrational Key") appears more than once
6	Verdict Banner Integrity	Validates the verdict/timing/confidence banner contains all three mandatory metrics and isn't contaminated with prose
7	Data Fidelity	Verifies that the court figures (Houses 13–15) calculated by the chart engine are actually referenced in the AI-generated text
8	Protected Term Integrity	Confirms that sacred Roman-script terms (figure names, Ramal vocabulary) survive translation into non-Latin scripts
9	Markdown Hygiene	Catches unbalanced bold markers, empty bold tags, and orphaned formatting artifacts that break rendering
10	Purushartha Table Integrity	Ensures the spiritual pillars table has populated Graha and Gati columns, not blank dashes
11	Section 3 Directness	Detects and rejects technical summary bloat in sections that should contain spiritual narrative prose

Each gate produces a binary PASS or FAIL result. A single failure in any gate rejects the entire reading. There is no "partial pass." This is the engineering equivalent of a cryptographic checksum: the output is either structurally valid or it is not delivered.
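The all-or-nothing rule can be sketched as a small gate runner. This is an illustration under assumptions, not SAGE's code: `GateResult`, `run_pipeline`, and the two stand-in gates are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    gate: str
    passed: bool
    detail: str = ""


# A gate is any function that takes the reading text and returns a GateResult.
Gate = Callable[[str], GateResult]


def non_empty(reading: str) -> GateResult:
    # Hypothetical stand-in for a real gate such as Section Presence.
    return GateResult("non_empty", bool(reading.strip()), "reading is blank")


def no_placeholder(reading: str) -> GateResult:
    # Hypothetical stand-in for the Token Leakage Guard.
    return GateResult("no_placeholder", "ZRX_" not in reading, "placeholder leaked")


def run_pipeline(reading: str, gates: list[Gate]) -> tuple[bool, list[GateResult]]:
    """Run every gate; the reading is accepted only if none of them fails."""
    results = [gate(reading) for gate in gates]
    failures = [r for r in results if not r.passed]
    return (not failures, failures)
```

Running every gate instead of stopping at the first failure costs a little time but yields a complete failure report for each rejected reading. The verdict itself stays binary.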

3. Deep Dive: The ZRX Canary Token System

One of SAGE's most distinctive QA techniques is borrowed from cybersecurity: canary tokens.

When SAGE constructs the data payload that is sent to the LLM, it does not pass raw geomantic figure names directly. Instead, it wraps every figure reference in an obfuscated placeholder pattern:

// Internal payload (never seen by users)
"house_1": "ZRX_Tariq_XZR"
"house_13": "ZRX_Humrah_XZR"
"house_15": "ZRX_Qabz-el-Kharij_XZR"

These ZRX_*_XZR tokens serve two purposes:

Injection boundary: They mark exactly where the AI is supposed to reference figure data. The post-processing pipeline uses them as anchors for deterministic replacement with properly formatted figure names.
Leakage detection: If any ZRX_ token survives the post-processing pipeline and appears in the final reading, it means the normalizer failed to replace it—and the reading must be rejected.

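# Gate 4: Token Leakage Guard
import re

# The character class includes "-" because figure names such as
# "Qabz-el-Kharij" contain hyphens; without it those tokens slip through.
zrx_leaks = re.findall(r"ZRX_[A-Za-z0-9_-]+?_XZR", reading_text)
if zrx_leaks:
    reject("Token leakage detected", leaked=zrx_leaks)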
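The full round trip (wrap, replace, then check for survivors) might look like the sketch below. The helper names and the `display_names` mapping are assumptions made for illustration; only the `ZRX_*_XZR` pattern comes from the article.

```python
import re

ZRX_PATTERN = re.compile(r"ZRX_(.+?)_XZR")


def wrap(name: str) -> str:
    """Wrap a figure name in the placeholder pattern before it enters the payload."""
    return f"ZRX_{name}_XZR"


def normalize(text: str, display_names: dict[str, str]) -> str:
    """Replace each known placeholder with its display name; leave unknown ones untouched."""
    def swap(match: re.Match) -> str:
        return display_names.get(match.group(1), match.group(0))
    return ZRX_PATTERN.sub(swap, text)


def find_leaks(text: str) -> list[str]:
    """Return every placeholder that survived normalization."""
    return [m.group(0) for m in ZRX_PATTERN.finditer(text)]
```

If the normalizer's mapping is missing a figure, the token stays in the text and `find_leaks` reports it, which is exactly the failure the gate exists to catch.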

💡 The Cybersecurity Parallel

This is the oracle equivalent of a canary token in intrusion detection. Security engineers plant invisible marker files in sensitive directories; if those markers appear in unauthorized locations, it signals a breach. SAGE plants obfuscated markers in AI payloads; if those markers appear in user-facing text, it signals a pipeline failure. The principle is identical: inject a signal that should never be visible, then monitor for its escape.

4. Deep Dive: Multilingual Verdict Banner Validation

Every SAGE reading contains a verdict banner—a structured block that communicates three mandatory metrics to the user: the verdict itself (favorable/unfavorable/mixed), the timing window (e.g., 2–4 weeks), and a confidence percentage. This banner must be present and correctly formatted in all 10 supported languages.

The challenge: these keywords have no shared lexical root across Arabic, Hindi, Chinese, Japanese, Russian, and the Latin-script European languages. The pipeline cannot use a single regex. Instead, it maintains localized keyword arrays for each metric dimension:

# Verdict detection (the arrays shown here are excerpts; not every one of the 10 languages is listed)
verdict_keywords = ["verdict", "veredito", "urteil", "निर्णय", "حكم", "裁定", "判定", "вердикт"]

# Timing detection
timing_keywords  = ["timing", "tiempo", "temps", "zeit", "समय", "التوقيت", "时间", "время", "⏳"]

# Confidence detection
confidence_keywords = ["confidence", "confianza", "confiance", "विश्वास", "الثقة", "置信度", "信頼度", "🎯"]

The pipeline requires all three keyword dimensions to be present. But presence alone is not enough. Gate 6 also enforces a banner length ceiling of 600 characters. Why? Because when the LLM hallucinates, it often does so by injecting narrative prose into what should be a tightly structured data block. A 1,200-character "verdict banner" is almost certainly contaminated with essay-style content that belongs in a narrative section.

The pipeline also counts bullet points within the banner. In Daily mode, the banner should contain exactly 3 structured bullets. In Premium mode, 3–4. If the bullet count exceeds the limit, it triggers a hallucinated label detection failure—the LLM has invented structural elements that don't exist in the reading schema.
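Putting the three Gate 6 rules together (keyword presence, length ceiling, bullet count) gives something like the following sketch. The keyword arrays, the 600-character ceiling, and the per-mode bullet counts are taken from the text above; the function name, the error strings, and the assumption that bullets are lines beginning with `-`, `*`, or `•` are hypothetical.

```python
import re

VERDICT = ["verdict", "veredito", "urteil", "निर्णय", "حكم", "裁定", "判定", "вердикт"]
TIMING = ["timing", "tiempo", "temps", "zeit", "समय", "التوقيت", "时间", "время", "⏳"]
CONFIDENCE = ["confidence", "confianza", "confiance", "विश्वास", "الثقة", "置信度", "信頼度", "🎯"]

MAX_BANNER_CHARS = 600
BULLET_LIMITS = {"daily": (3, 3), "premium": (3, 4)}  # (min, max) bullets per mode
# Assumption: a bullet is a line starting with '-', '*' or '•' followed by whitespace.
BULLET_RE = re.compile(r"^[ \t]*[-*•]\s+", re.MULTILINE)


def validate_banner(banner: str, mode: str) -> list[str]:
    """Return the Gate 6 violations for a banner; an empty list means it passes."""
    errors = []
    lowered = banner.lower()
    checks = (("verdict", VERDICT), ("timing", TIMING), ("confidence", CONFIDENCE))
    for label, keywords in checks:
        if not any(k in lowered for k in keywords):
            errors.append(f"missing {label} metric")
    if len(banner) > MAX_BANNER_CHARS:
        errors.append("banner exceeds length ceiling")
    count = len(BULLET_RE.findall(banner))
    low, high = BULLET_LIMITS[mode]
    if not low <= count <= high:
        errors.append(f"bullet count {count} outside allowed range {low}-{high}")
    return errors
```

Requiring whitespace after the bullet marker keeps a line that opens with a bold marker (`**Verdict**`) from being miscounted as a bullet.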

🔬 Why Emoji Keywords?

Notice that the keyword arrays include emoji like ⏳ and 🎯. This is intentional. When the LLM generates non-Latin-script readings (Arabic, Chinese, Japanese), it sometimes uses emoji as universal anchors for structured data. The pipeline treats these as valid signals for banner detection, ensuring no reading is falsely rejected because the LLM chose iconographic shorthand over a translated keyword.

5. Deep Dive: The Mizan — Mathematical Chart Validation

Before the AI generates a single word, SAGE validates the mathematical integrity of the geomantic chart itself. This is the Mizan (Arabic: "balance")—a two-layer validation system that ensures the 16-house chart is internally consistent.

Layer 1: Structural Mizan (Binary Proof)

A geomantic chart is not random. The 16 houses are derived from 4 mother figures through a deterministic process:

Daughters (Houses 5–8) are derived from the rows of Mothers 1–4 via transposition
Nieces (Houses 9–12) are derived by XOR-parity addition of adjacent pairs among Houses 1–8 (for example, House 9 = House 1 ⊕ House 2)
Court figures (Houses 13–16) are derived by cascading parity through the Nieces, Witnesses, and Judge

The Structural Mizan checks every single derivation. If House 9 does not equal the binary sum of Houses 1 and 2, the chart is mathematically broken—and no amount of AI prose can salvage it. The reading is rejected at the chart level, before the LLM is ever invoked.

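# XOR-Parity Rule (universal across all traditions)
# Each row is encoded as 1 (single dot, odd) or 2 (double dot, even).
def add_figures(figure_a, figure_b):
    # Return a list, not a generator: a generator object never compares equal to a list.
    return [2 if row_a == row_b else 1 for row_a, row_b in zip(figure_a, figure_b)]

# Niece Check: H9 must equal H1 ⊕ H2
expected_h9 = add_figures(pattern[1], pattern[2])
assert expected_h9 == list(pattern[9]), "Chart integrity violation"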
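Extending the single Niece check to the rest of the chart could look like the sketch below. It assumes the conventional layout in which House 13 = H9 ⊕ H10, House 14 = H11 ⊕ H12, and House 15 = H13 ⊕ H14. The article does not spell out how House 16 is derived, so it is left out here. `derive_chart` and `structural_mizan` are hypothetical names.

```python
Figure = tuple[int, int, int, int]  # four rows, each 1 (odd) or 2 (even)


def add(a: Figure, b: Figure) -> Figure:
    """Row-wise parity addition: equal rows give 2, unequal rows give 1."""
    return tuple(2 if x == y else 1 for x, y in zip(a, b))


def derive_chart(mothers: list[Figure]) -> dict[int, Figure]:
    """Derive Houses 5-15 from the four Mothers (Houses 1-4)."""
    h = {i + 1: m for i, m in enumerate(mothers)}
    # Daughters: row r of each Mother, read across Mothers 1-4, forms Daughter r.
    for r in range(4):
        h[5 + r] = tuple(m[r] for m in mothers)
    # Nieces: add adjacent pairs of the first eight houses.
    for i in range(4):
        h[9 + i] = add(h[2 * i + 1], h[2 * i + 2])
    # Witnesses and Judge (assumed conventional layout): keep adding adjacent pairs.
    h[13] = add(h[9], h[10])
    h[14] = add(h[11], h[12])
    h[15] = add(h[13], h[14])
    return h


def structural_mizan(chart: dict[int, Figure]) -> list[int]:
    """Return the house numbers (5-15) whose figure does not match its derivation."""
    expected = derive_chart([tuple(chart[i]) for i in range(1, 5)])
    return [n for n in range(5, 16) if tuple(chart[n]) != expected[n]]
```

An empty result means every derived house checks out; any house number returned marks a chart that should be rejected before the LLM is invoked.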

Layer 2: Interpretive Mizan (Quality Scoring)

Beyond mathematical correctness, the Interpretive Mizan scores the interpretive reliability of the chart on a 0–100 scale. Three factors contribute:

Figural Diversity: How many unique figures appear in the first 12 houses? Higher diversity means more data points for the AI to synthesize, producing richer readings.
Elemental Representation: Are all four elements (Fire, Water, Air, Earth) present? A chart dominated by a single element provides less nuance.
Chart Stability: How many Sabit (fixed) figures are present? High stability suggests concrete, grounded results; low stability warns of volatility.

This score is surfaced to the user as part of the reading metadata. A "Strong" Mizan (80+) tells the user the chart itself is a reliable foundation for interpretation. A "Weak" Mizan (30–49) signals that the question may need to be re-approached.
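The article names the three factors but not the formula that combines them, so the following is only a sketch of one possible scoring scheme. The equal weighting, the function name, and the `element_of` and `fixed` inputs are all assumptions, not SAGE's actual method.

```python
def interpretive_mizan(houses: list[str], element_of: dict[str, str], fixed: set[str]) -> int:
    """Score a chart 0-100 from three equally weighted factors (hypothetical weighting).

    houses     -- figure names for the first 12 houses
    element_of -- maps each figure name to its element
    fixed      -- names of the figures considered Sabit (fixed)
    """
    diversity = len(set(houses)) / len(houses)              # share of unique figures
    elements = len({element_of[f] for f in houses}) / 4     # share of the four elements present
    stability = sum(f in fixed for f in houses) / len(houses)  # share of fixed figures
    return round(100 * (diversity + elements + stability) / 3)
```

A real implementation would need the tradition's own figure-to-element table and its list of fixed figures, and would likely weight the factors differently.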

💡 Why This Matters

The Mizan is SAGE's quality signal before the AI is invoked. Most AI products can only validate output. SAGE validates the input as well—ensuring the AI is working from a mathematically sound foundation. This is analogous to validating a database schema before running queries against it.

6. The Zero-Trust AI Philosophy

The 11-gate pipeline, the ZRX canary tokens, the Mizan, and the multilingual keyword arrays all share a single engineering philosophy: Zero-Trust AI.

In cybersecurity, "zero trust" means you never assume a network actor is legitimate—you verify every request, every time. SAGE applies this same principle to generative AI:

Never trust the LLM to follow the schema. Validate every structural section independently.
Never trust the LLM to preserve terminology. Run deterministic replacement passes after generation, then check for leakage.
Never trust the LLM to maintain script purity. Validate character encoding and script boundaries for every non-English language.
Never trust the LLM to reference real data. Cross-check figure references in the prose against the mathematically derived chart.
Never trust the LLM to format cleanly. Parse every Markdown construct for syntactic correctness.
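Two of these checks are simple enough to sketch in a few lines: a bold-marker hygiene check and a cross-check of court figures against the prose. Function names are hypothetical, and the assignment of `Tariq` to House 14 in the usage below is invented for illustration.

```python
def markdown_issues(text: str) -> list[str]:
    """Flag unbalanced or empty bold markers (a small subset of Markdown hygiene)."""
    parts = text.split("**")
    # N markers produce N + 1 parts, so an even number of parts means an odd marker count.
    if len(parts) % 2 == 0:
        return ["unbalanced bold markers"]
    # With balanced markers, the odd-indexed parts are the bolded spans.
    if any(not inner.strip() for inner in parts[1::2]):
        return ["empty bold tag"]
    return []


def missing_court_figures(text: str, court: dict[int, str]) -> list[int]:
    """Return the court house numbers whose figure name never appears in the text."""
    lowered = text.casefold()
    return [house for house, name in court.items() if name.casefold() not in lowered]
```

For example, `missing_court_figures("Humrah faces Qabz-el-Kharij.", {13: "Humrah", 14: "Tariq", 15: "Qabz-el-Kharij"})` returns `[14]`, because the figure computed for House 14 is never mentioned in the prose.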

This is not pessimism—it is engineering realism. Probabilistic systems will, given sufficient volume, produce every possible failure mode. The only reliable strategy is to assume failure and engineer deterministic verification around it.

📐 The Scale

SAGE supports 10 languages (English, Hindi, Spanish, French, German, Russian, Arabic, Chinese, Japanese, Portuguese), 2 traditions (Western Geomancy and Indian Ramal Shastra), and 2 reading modes (Daily and Premium). That's 40 distinct evaluation paths—each with its own localized headers, subheaders, keyword arrays, depth thresholds, and structural expectations. Every single path runs through the full 11-gate pipeline.
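The 40 paths are simply the Cartesian product of the three dimensions, which can be enumerated directly. The variable names below are illustrative.

```python
from itertools import product

LANGUAGES = ["English", "Hindi", "Spanish", "French", "German",
             "Russian", "Arabic", "Chinese", "Japanese", "Portuguese"]
TRADITIONS = ["Western Geomancy", "Indian Ramal Shastra"]
MODES = ["Daily", "Premium"]

# Every (language, tradition, mode) combination: 10 * 2 * 2 = 40 paths.
evaluation_paths = list(product(LANGUAGES, TRADITIONS, MODES))
```

A list like this is a natural input for a parametrized test suite, so that no combination is silently skipped.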

The result: when a SAGE reading reaches your screen, it has been mathematically validated at the chart level, structurally validated at the content level, and encoding-validated at the script level. It is not a best-effort output from a language model. It is a verified artifact of a deterministic validation process applied to a non-deterministic generation engine.

That is what it means to test what you cannot predict.
