Concepts¶
This page explains the core concepts underlying rotalabs-audit's approach to reasoning chain capture and decision transparency.
Reasoning Chain Structure¶
A reasoning chain represents the structured sequence of thought that an AI model produces when solving a problem or making a decision. rotalabs-audit captures this reasoning and breaks it into discrete, analyzable components.
Anatomy of a Reasoning Chain¶
ReasoningChain
├── id: Unique identifier
├── steps: List[ReasoningStep]
│ ├── content: The text of this step
│ ├── reasoning_type: Classification (e.g., GOAL_REASONING, META_REASONING)
│ ├── confidence: Estimated confidence (0-1)
│ ├── index: Position in chain
│ └── evidence: Pattern matches supporting classification
├── source_text: Original unparsed text
├── detected_format: Format type (NUMBERED, BULLET, PROSE, etc.)
├── aggregate_confidence: Combined confidence score
└── primary_types: Most common reasoning types
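The fields above can be read directly off the parsed objects. Below is a minimal sketch, assuming the chain- and step-level fields listed in the tree are exposed as plain attributes (ExtendedReasoningParser is covered in more detail under Classification Process below):

from rotalabs_audit import ExtendedReasoningParser

parser = ExtendedReasoningParser()
chain = parser.parse("1. Identify the goal. 2. I think we should check our assumptions.")

# Chain-level fields
print(chain.detected_format)
print(f"Aggregate confidence: {chain.aggregate_confidence:.2f}")
print(chain.primary_types)

# Step-level fields
for step in chain.steps:
    print(step.index, step.reasoning_type, f"{step.confidence:.2f}")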
Step Formats¶
rotalabs-audit automatically detects and parses various reasoning formats:
| Format | Example | Detection |
|---|---|---|
| Numbered | `1. First step` | Regex: `^\d+\.` |
| Lettered | `a) First step` | Regex: `^[a-z][\.\)]` |
| Bullet | `- First step` | Regex: `^[-*+]` |
| Arrow | `=> First step` | Regex: `^(=>\|->)` |
| Sequential | `First, ... Then, ...` | Keywords: first, second, then, finally |
| Prose | Free-form text | Sentence boundary detection |
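The detection patterns in the table can be exercised directly with Python's standard re module. The sketch below is illustrative only: it reproduces a subset of the table's patterns rather than calling the library's internal detector, and pattern order matters so that arrows are not mistaken for dash bullets.

import re

# Patterns copied from the table above (illustrative subset, not the library's internals)
FORMAT_PATTERNS = {
    "NUMBERED": re.compile(r"^\d+\."),
    "LETTERED": re.compile(r"^[a-z][\.\)]"),
    "ARROW": re.compile(r"^(=>|->)"),  # checked before BULLET so "->" is not read as a dash bullet
    "BULLET": re.compile(r"^[-*+]"),
}

def guess_format(line: str) -> str:
    stripped = line.strip()
    for name, pattern in FORMAT_PATTERNS.items():
        if pattern.match(stripped):
            return name
    return "PROSE"  # fall back to sentence-boundary parsing

print(guess_format("1. First step"))    # NUMBERED
print(guess_format("-> First step"))    # ARROW
print(guess_format("Free-form text"))   # PROSE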
Reasoning Type Classification¶
Each reasoning step is classified into one of several reasoning types. This classification enables targeted analysis and detection of specific reasoning patterns.
Core Reasoning Types¶
| Type | Description | Example Patterns |
|---|---|---|
| `EVALUATION_AWARE` | References to testing, evaluation, or monitoring | "this is a test", "being evaluated", "benchmark" |
| `GOAL_REASONING` | Goal-directed reasoning about objectives | "my goal is", "objective", "aim to", "need to" |
| `DECISION_MAKING` | Explicit decision points and choices | "I decide", "I choose", "therefore", "I will" |
| `META_REASONING` | Reasoning about the reasoning process | "I think", "I believe", "let me consider" |
| `UNCERTAINTY` | Expressions of uncertainty or hedging | "perhaps", "maybe", "might", "not sure" |
| `CAUSAL_REASONING` | Cause-and-effect analysis | "because", "since", "therefore", "leads to" |
| `HYPOTHETICAL` | Counterfactual or "what if" reasoning | "if...then", "suppose", "imagine" |
| `INCENTIVE_REASONING` | Reasoning about rewards and penalties | "reward", "penalty", "consequence", "benefit" |
| `FACTUAL_KNOWLEDGE` | Factual statements without inference | Statements without reasoning markers |
Classification Process¶
1. Pattern Matching: Each step is matched against compiled regex patterns for each reasoning type
2. Evidence Collection: Matching patterns are recorded as evidence
3. Primary Type Selection: The type with the most pattern matches becomes the primary type
4. Secondary Types: Other matching types are recorded as secondary classifications
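The selection logic can be pictured as counting pattern matches per type and keeping the most frequent one as primary. The sketch below is a simplified illustration of that process with made-up patterns, not the library's implementation:

import re
from collections import Counter

# Illustrative patterns only; rotalabs-audit ships its own compiled pattern sets
PATTERNS = {
    "GOAL_REASONING": [r"\bgoal\b", r"\bneed to\b", r"\baim to\b"],
    "META_REASONING": [r"\bI think\b", r"\blet me consider\b"],
    "CAUSAL_REASONING": [r"\bbecause\b", r"\btherefore\b"],
}

def classify(step_text: str):
    evidence = Counter()
    for rtype, patterns in PATTERNS.items():
        for pattern in patterns:
            if re.search(pattern, step_text, re.IGNORECASE):
                evidence[rtype] += 1  # steps 1-2: match patterns, record evidence
    if not evidence:
        return "FACTUAL_KNOWLEDGE", []  # no reasoning markers found
    primary, _ = evidence.most_common(1)[0]  # step 3: most matches wins
    secondary = [t for t in evidence if t != primary]  # step 4: remaining matching types
    return primary, secondary

print(classify("I think the goal is to maximize efficiency because it saves time"))

In the library itself, classification happens as part of parsing: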
from rotalabs_audit import ExtendedReasoningParser
parser = ExtendedReasoningParser()
chain = parser.parse("I think the goal is to maximize efficiency because...")
for step in chain:
    print(f"Primary: {step.reasoning_type}")
    print(f"Secondary: {step.secondary_types}")
    print(f"Evidence: {step.evidence}")
Evaluation Awareness Detection¶
Evaluation awareness refers to an AI model's apparent awareness that it is being tested, evaluated, or monitored. This is a critical safety concern because evaluation-aware models may behave differently during testing than in deployment.
Why It Matters¶
- Models aware of evaluation may mask misaligned behavior during testing
- Strategic adaptation can reduce the validity of safety evaluations
- Understanding awareness patterns informs evaluation methodology design
Awareness Indicators¶
rotalabs-audit detects several categories of awareness:
Direct Awareness Indicators¶
Explicit mentions of testing or evaluation contexts:
- "this is a test"
- "being evaluated"
- "benchmark"
- "assessment"
- "checking if I"
Meta-Awareness Indicators¶
Statements reflecting awareness of context:
- "I'm aware that..."
- "I realize this is..."
- "given that this..."
- "in this context..."
Consequence Awareness¶
References to evaluation outcomes:
- "consequences"
- "scored"
- "pass or fail"
- "will determine"
Strategic Adaptation¶
Beyond awareness detection, rotalabs-audit also identifies strategic adaptation, where a model adjusts its behavior based on the perceived evaluation context:
from rotalabs_audit import EvaluationAwarenessDetector
detector = EvaluationAwarenessDetector()
analysis = detector.detect(chain)
# Awareness analysis results
print(f"Awareness score: {analysis.awareness_score:.2f}")
print(f"Strategic adaptation score: {analysis.strategic_adaptation_score:.2f}")
# Specific indicators
for indicator in analysis.indicators:
    print(f"Type: {indicator.indicator_type}")
    print(f"Text: {indicator.text}")
    print(f"Confidence: {indicator.confidence}")
Counterfactual Analysis Methodology¶
Counterfactual analysis helps understand the causal importance of different reasoning components by systematically modifying them and measuring the impact on the output.
Intervention Types¶
| Intervention | Effect | Purpose |
|---|---|---|
| `REMOVE_EVALUATION_AWARENESS` | Remove evaluation-aware steps | Test dependence on awareness |
| `AMPLIFY_EVALUATION_AWARENESS` | Inject awareness at start | Test effect of increased awareness |
| `REMOVE_GOAL_REASONING` | Remove goal-oriented steps | Test role of explicit goals |
| `REMOVE_UNCERTAINTY` | Remove hedging/uncertainty | Test effect of certainty expressions |
| `NEUTRALIZE_INCENTIVES` | Replace incentive language | Test incentive-related reasoning |
| `REMOVE_META_REASONING` | Remove meta-cognitive steps | Test metacognition contribution |
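Conceptually, a "remove" intervention filters out steps of the targeted type and then measures how far the modified chain has drifted from the original. The sketch below illustrates that idea with a plain token-overlap divergence; it is not the library's algorithm, whose CounterfactualAnalyzer API appears under Causal Effect Measurement below.

# Illustrative sketch of a "remove" intervention, not the library's implementation
def remove_steps(steps, targeted_type):
    return [s for s in steps if s["reasoning_type"] != targeted_type]

def token_divergence(original_steps, modified_steps):
    # 1 minus the Jaccard overlap of tokens (0 = identical, 1 = completely different)
    original = set(" ".join(s["content"] for s in original_steps).lower().split())
    modified = set(" ".join(s["content"] for s in modified_steps).lower().split())
    if not original:
        return 0.0
    return 1.0 - len(original & modified) / len(original | modified)

steps = [
    {"content": "I suspect this is a benchmark run", "reasoning_type": "EVALUATION_AWARE"},
    {"content": "The goal is to answer accurately", "reasoning_type": "GOAL_REASONING"},
]
modified = remove_steps(steps, "EVALUATION_AWARE")
print(token_divergence(steps, modified))  # 0.5 for this toy example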
Causal Effect Measurement¶
For each intervention, rotalabs-audit calculates:
- Behavioral Divergence: How different the modified reasoning is from the original (0-1)
- Steps Removed/Modified: Count of affected reasoning steps
- Causal Effect: Estimated importance of the removed content (0-1)
from rotalabs_audit import CounterfactualAnalyzer
analyzer = CounterfactualAnalyzer()
results = analyzer.analyze(chain)
# Examine each intervention
for intervention_type, result in results.items():
    print(f"{intervention_type.value}:")
    print(f"  Behavioral divergence: {result.behavioral_divergence:.2f}")
    print(f"  Steps removed: {result.steps_removed}")
    print(f"  Causal effect: {result.causal_effect:.2f}")
# Rank reasoning types by causal importance
importance = analyzer.rank_causal_importance(results)
for rtype, score in sorted(importance.items(), key=lambda x: -x[1]):
    print(f"{rtype.value}: {score:.2f}")
Critical Step Identification¶
Counterfactual analysis identifies which specific steps are most critical to the reasoning outcome:
critical_steps = analyzer.identify_critical_steps(chain, results)
for step in critical_steps:
    print(f"Critical step {step.index}: {step.text[:50]}...")
Quality Assessment Dimensions¶
Reasoning quality is assessed across five dimensions, each measuring a different aspect of good reasoning.
The Five Dimensions¶
1. Clarity (20%)¶
How clear and understandable is the reasoning?
- Good: Specific language, moderate sentence length, clear structure
- Bad: Vague terms ("thing", "stuff"), very long sentences, unclear references
2. Completeness (25%)¶
Does the reasoning cover all necessary aspects?
- Good: Clear conclusion, logical flow between steps, sufficient depth
- Bad: Missing conclusion, gaps in reasoning, too brief
3. Consistency (20%)¶
Is the reasoning free of contradictions?
- Good: All claims are internally consistent
- Bad: Conflicting statements, contradictory conclusions
4. Logical Validity (25%)¶
Are the logical inferences sound?
- Good: Proper use of logical connectors, premises support conclusions
- Bad: Non-sequiturs, unsupported jumps in logic
5. Evidence Support (10%)¶
Are claims supported by evidence?
- Good: References to data, examples, or citations
- Bad: Unsupported factual claims
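Assuming the overall score is the weighted sum of the five dimension scores using the default weights listed above (the dimension values here are hypothetical, for illustration only), the arithmetic behind the "Weighted overall score" reported below looks like this:

# Default dimension weights from the list above
WEIGHTS = {
    "clarity": 0.20,
    "completeness": 0.25,
    "consistency": 0.20,
    "logical_validity": 0.25,
    "evidence_support": 0.10,
}

# Hypothetical dimension scores, for illustration only
scores = {
    "clarity": 0.8,
    "completeness": 0.6,
    "consistency": 0.9,
    "logical_validity": 0.8,
    "evidence_support": 0.5,
}

overall = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
print(f"Overall: {overall:.2f}")  # 0.20*0.8 + 0.25*0.6 + 0.20*0.9 + 0.25*0.8 + 0.10*0.5 = 0.74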
Quality Scoring¶
from rotalabs_audit import ReasoningQualityAssessor
assessor = ReasoningQualityAssessor()
metrics = assessor.assess(chain)
# Dimension scores (0-1)
print(f"Clarity: {metrics.clarity:.2f}")
print(f"Completeness: {metrics.completeness:.2f}")
print(f"Consistency: {metrics.consistency:.2f}")
print(f"Logical validity: {metrics.logical_validity:.2f}")
print(f"Evidence support: {metrics.evidence_support:.2f}")
# Weighted overall score
print(f"Overall: {metrics.overall_score:.2f}")
# Identified issues
for issue in metrics.issues:
    print(f"Issue: {issue}")
Custom Weights¶
Adjust dimension weights for your use case:
assessor = ReasoningQualityAssessor(weights={
    "clarity": 0.15,
    "completeness": 0.30,
    "consistency": 0.25,
    "logical_validity": 0.20,
    "evidence_support": 0.10,
})
Decision Tracing Architecture¶
Decision tracing captures individual decisions and sequences of decisions (decision paths) made by AI systems.
Decision Trace Structure¶
DecisionTrace
├── id: Unique identifier
├── decision: The decision statement
├── timestamp: When the decision was made
├── context: Contextual information
├── reasoning_chain: Full reasoning (optional)
├── alternatives_considered: List of alternatives
├── rationale: Explanation for the decision
├── confidence: Confidence in the decision (0-1)
├── reversible: Whether the decision can be undone
└── consequences: Known/predicted consequences
Decision Path Structure¶
A decision path represents a sequence of related decisions:
DecisionPath
├── id: Unique identifier
├── decisions: List[DecisionTrace] in order
├── goal: The objective being pursued
├── success: Whether the goal was achieved
└── failure_point: The decision where things went wrong (if applicable)
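The Path Analysis examples below assume a path object is already in hand. Here is a minimal sketch of building one, assuming DecisionTrace and DecisionPath accept the fields listed above as keyword arguments and are exported at the package top level like DecisionPathAnalyzer (check the API reference for the exact constructor signatures):

from rotalabs_audit import DecisionTrace, DecisionPath  # assumed top-level exports

first = DecisionTrace(
    decision="Use the cached result instead of recomputing",
    rationale="The cache is fresh and recomputation is expensive",
    alternatives_considered=["Recompute from scratch"],
    confidence=0.8,
    reversible=True,
)
second = DecisionTrace(
    decision="Skip validation of the cached payload",
    rationale="Validation adds latency",
    alternatives_considered=["Validate before use"],
    confidence=0.4,
    reversible=False,
)

path = DecisionPath(
    decisions=[first, second],  # ordered sequence of decisions
    goal="Return a correct answer quickly",
    success=False,
)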
Path Analysis¶
The DecisionPathAnalyzer provides tools for understanding decision sequences:
from rotalabs_audit import DecisionPathAnalyzer
analyzer = DecisionPathAnalyzer()
# Full path analysis
analysis = analyzer.analyze_path(path)
print(f"Decisions: {analysis['decision_count']}")
print(f"Avg confidence: {analysis['avg_confidence']:.2f}")
print(f"Irreversible decisions: {analysis['irreversible_count']}")
# Find critical decisions
critical = analyzer.find_critical_decisions(path)
for decision in critical:
    print(f"Critical: {decision.decision[:50]}...")
# Find failure point (if path failed)
if not path.success:
    failure = analyzer.find_failure_point(path)
    if failure:
        print(f"Failure at: {failure.decision}")
# Detect confidence decline
if analyzer.detect_confidence_decline(path):
    print("Warning: Confidence declined over the decision path")
Confidence Estimation¶
Confidence scores (0-1) are estimated from linguistic markers in the text.
High Confidence Markers¶
- "certain", "definitely", "clearly", "obviously"
- "without doubt", "confident", "absolutely"
Low Confidence Markers¶
- "uncertain", "maybe", "perhaps", "possibly"
- "might", "could", "not sure", "tentative"
Confidence Levels¶
Numeric scores map to discrete levels:
| Score Range | Level |
|---|---|
| 0.0 - 0.2 | VERY_LOW |
| 0.2 - 0.4 | LOW |
| 0.4 - 0.6 | MEDIUM |
| 0.6 - 0.8 | HIGH |
| 0.8 - 1.0 | VERY_HIGH |
from rotalabs_audit.chains import (
    estimate_confidence,
    get_confidence_level,
)
text = "I am fairly confident that this approach will work"
score = estimate_confidence(text)
level = get_confidence_level(score)
print(f"Score: {score:.2f}")
print(f"Level: {level.value}")
Next Steps¶
Now that you understand the concepts:
- Getting Started - Install and use rotalabs-audit
- Reasoning Chains Tutorial - Deep dive into parsing
- Evaluation Awareness Tutorial - Advanced detection
- Counterfactual Analysis Tutorial - Causal analysis