Concepts¶
This page explains the core concepts underlying rotalabs-audit's approach to reasoning chain capture and decision transparency.
Reasoning Chain Structure¶
A reasoning chain represents the structured sequence of thought that an AI model produces when solving a problem or making a decision. rotalabs-audit captures this reasoning and breaks it into discrete, analyzable components.
Anatomy of a Reasoning Chain¶
ReasoningChain
├── id: Unique identifier
├── steps: List[ReasoningStep]
│ ├── content: The text of this step
│ ├── reasoning_type: Classification (e.g., GOAL_REASONING, META_REASONING)
│ ├── confidence: Estimated confidence (0-1)
│ ├── index: Position in chain
│ └── evidence: Pattern matches supporting classification
├── source_text: Original unparsed text
├── detected_format: Format type (NUMBERED, BULLET, PROSE, etc.)
├── aggregate_confidence: Combined confidence score
└── primary_types: Most common reasoning types
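The fields above can be read directly off the parsed objects. Below is a minimal sketch, assuming the chain- and step-level fields listed in the tree are exposed as plain attributes (ExtendedReasoningParser is covered in more detail under Classification Process below):

from rotalabs_audit import ExtendedReasoningParser

parser = ExtendedReasoningParser()
chain = parser.parse("1. Identify the goal. 2. I think we should check our assumptions.")

# Chain-level fields
print(chain.detected_format)
print(f"Aggregate confidence: {chain.aggregate_confidence:.2f}")
print(chain.primary_types)

# Step-level fields
for step in chain.steps:
    print(step.index, step.reasoning_type, f"{step.confidence:.2f}")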
Step Formats¶
rotalabs-audit automatically detects and parses various reasoning formats:
| Format | Example | Detection |
|---|---|---|
| Numbered | `1. First step` | Regex: `^\d+\.` |
| Lettered | `a) First step` | Regex: `^[a-z][\.\)]` |
| Bullet | `- First step` | Regex: `^[-*+]` |
| Arrow | `=> First step` | Regex: `^(=>\|->)` |
| Sequential | `First, ... Then, ...` | Keywords: first, second, then, finally |
| Prose | Free-form text | Sentence boundary detection |
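The detection patterns in the table can be exercised directly with Python's standard re module. The sketch below is illustrative only: it reproduces a subset of the table's patterns rather than calling the library's internal detector, and pattern order matters so that arrows are not mistaken for dash bullets.

import re

# Patterns copied from the table above (illustrative subset, not the library's internals)
FORMAT_PATTERNS = {
    "NUMBERED": re.compile(r"^\d+\."),
    "LETTERED": re.compile(r"^[a-z][\.\)]"),
    "ARROW": re.compile(r"^(=>|->)"),  # checked before BULLET so "->" is not read as a dash bullet
    "BULLET": re.compile(r"^[-*+]"),
}

def guess_format(line: str) -> str:
    stripped = line.strip()
    for name, pattern in FORMAT_PATTERNS.items():
        if pattern.match(stripped):
            return name
    return "PROSE"  # fall back to sentence-boundary parsing

print(guess_format("1. First step"))    # NUMBERED
print(guess_format("-> First step"))    # ARROW
print(guess_format("Free-form text"))   # PROSE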
Reasoning Type Classification¶
Each reasoning step is classified into one of several reasoning types. This classification enables targeted analysis and detection of specific reasoning patterns.
Core Reasoning Types¶
| Type | Description | Example Patterns |
|---|---|---|
| `EVALUATION_AWARE` | References to testing, evaluation, or monitoring | "this is a test", "being evaluated", "benchmark" |
| `GOAL_REASONING` | Goal-directed reasoning about objectives | "my goal is", "objective", "aim to", "need to" |
| `DECISION_MAKING` | Explicit decision points and choices | "I decide", "I choose", "therefore", "I will" |
| `META_REASONING` | Reasoning about the reasoning process | "I think", "I believe", "let me consider" |
| `UNCERTAINTY` | Expressions of uncertainty or hedging | "perhaps", "maybe", "might", "not sure" |
| `CAUSAL_REASONING` | Cause-and-effect analysis | "because", "since", "therefore", "leads to" |
| `HYPOTHETICAL` | Counterfactual or "what if" reasoning | "if...then", "suppose", "imagine" |
| `INCENTIVE_REASONING` | Reasoning about rewards and penalties | "reward", "penalty", "consequence", "benefit" |
| `FACTUAL_KNOWLEDGE` | Factual statements without inference | Statements without reasoning markers |
Classification Process¶
1. Pattern Matching: Each step is matched against compiled regex patterns for each reasoning type
2. Evidence Collection: Matching patterns are recorded as evidence
3. Primary Type Selection: The type with the most pattern matches becomes the primary type
4. Secondary Types: Other matching types are recorded as secondary classifications
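The selection logic can be pictured as counting pattern matches per type and keeping the most frequent one as primary. The sketch below is a simplified illustration of that process with made-up patterns, not the library's implementation:

import re
from collections import Counter

# Illustrative patterns only; rotalabs-audit ships its own compiled pattern sets
PATTERNS = {
    "GOAL_REASONING": [r"\bgoal\b", r"\bneed to\b", r"\baim to\b"],
    "META_REASONING": [r"\bI think\b", r"\blet me consider\b"],
    "CAUSAL_REASONING": [r"\bbecause\b", r"\btherefore\b"],
}

def classify(step_text: str):
    evidence = Counter()
    for rtype, patterns in PATTERNS.items():
        for pattern in patterns:
            if re.search(pattern, step_text, re.IGNORECASE):
                evidence[rtype] += 1  # steps 1-2: match patterns, record evidence
    if not evidence:
        return "FACTUAL_KNOWLEDGE", []  # no reasoning markers found
    primary, _ = evidence.most_common(1)[0]  # step 3: most matches wins
    secondary = [t for t in evidence if t != primary]  # step 4: remaining matching types
    return primary, secondary

print(classify("I think the goal is to maximize efficiency because it saves time"))

In the library itself, classification happens as part of parsing: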
from rotalabs_audit import ExtendedReasoningParser
parser = ExtendedReasoningParser()
chain = parser.parse("I think the goal is to maximize efficiency because...")
for step in chain:
    print(f"Primary: {step.reasoning_type}")
    print(f"Secondary: {step.secondary_types}")
    print(f"Evidence: {step.evidence}")
Evaluation Awareness Detection¶
Evaluation awareness refers to an AI model's apparent awareness that it is being tested, evaluated, or monitored. This is a critical safety concern because evaluation-aware models may behave differently during testing than in deployment.
Why It Matters¶
- Models aware of evaluation may mask misaligned behavior during testing
- Strategic adaptation can reduce the validity of safety evaluations
- Understanding awareness patterns informs evaluation methodology design
Awareness Indicators¶
rotalabs-audit detects several categories of awareness:
Direct Awareness Indicators¶
Explicit mentions of testing or evaluation contexts:
- "this is a test"
- "being evaluated"
- "benchmark"
- "assessment"
- "checking if I"
Meta-Awareness Indicators¶
Statements reflecting awareness of context:
- "I'm aware that..."
- "I realize this is..."
- "given that this..."
- "in this context..."
Consequence Awareness¶
References to evaluation outcomes:
- "consequences"
- "scored"
- "pass or fail"
- "will determine"
Strategic Adaptation¶
Beyond awareness detection, rotalabs-audit also identifies strategic adaptation, where a model adjusts its behavior based on the perceived evaluation context:
from rotalabs_audit import EvaluationAwarenessDetector
detector = EvaluationAwarenessDetector()
analysis = detector.detect(chain)
# Awareness analysis results
print(f"Awareness score: {analysis.awareness_score:.2f}")
print(f"Strategic adaptation score: {analysis.strategic_adaptation_score:.2f}")
# Specific indicators
for indicator in analysis.indicators:
    print(f"Type: {indicator.indicator_type}")
    print(f"Text: {indicator.text}")
    print(f"Confidence: {indicator.confidence}")
Counterfactual Analysis Methodology¶
Counterfactual analysis helps understand the causal importance of different reasoning components by systematically modifying them and measuring the impact on the output.
Intervention Types¶
| Intervention | Effect | Purpose |
|---|---|---|
| `REMOVE_EVALUATION_AWARENESS` | Remove evaluation-aware steps | Test dependence on awareness |
| `AMPLIFY_EVALUATION_AWARENESS` | Inject awareness at start | Test effect of increased awareness |
| `REMOVE_GOAL_REASONING` | Remove goal-oriented steps | Test role of explicit goals |
| `REMOVE_UNCERTAINTY` | Remove hedging/uncertainty | Test effect of certainty expressions |
| `NEUTRALIZE_INCENTIVES` | Replace incentive language | Test incentive-related reasoning |
| `REMOVE_META_REASONING` | Remove meta-cognitive steps | Test metacognition contribution |
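Conceptually, a "remove" intervention filters out steps of the targeted type and then measures how far the modified chain has drifted from the original. The sketch below illustrates that idea with a plain token-overlap divergence; it is not the library's algorithm, whose CounterfactualAnalyzer API appears under Causal Effect Measurement below.

# Illustrative sketch of a "remove" intervention, not the library's implementation
def remove_steps(steps, targeted_type):
    return [s for s in steps if s["reasoning_type"] != targeted_type]

def token_divergence(original_steps, modified_steps):
    # 1 minus the Jaccard overlap of tokens (0 = identical, 1 = completely different)
    original = set(" ".join(s["content"] for s in original_steps).lower().split())
    modified = set(" ".join(s["content"] for s in modified_steps).lower().split())
    if not original:
        return 0.0
    return 1.0 - len(original & modified) / len(original | modified)

steps = [
    {"content": "I suspect this is a benchmark run", "reasoning_type": "EVALUATION_AWARE"},
    {"content": "The goal is to answer accurately", "reasoning_type": "GOAL_REASONING"},
]
modified = remove_steps(steps, "EVALUATION_AWARE")
print(token_divergence(steps, modified))  # 0.5 for this toy example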
Causal Effect Measurement¶
For each intervention, rotalabs-audit calculates:
- Behavioral Divergence: How different the modified reasoning is from the original (0-1)
- Steps Removed/Modified: Count of affected reasoning steps
- Causal Effect: Estimated importance of the removed content (0-1)
from rotalabs_audit import CounterfactualAnalyzer
analyzer = CounterfactualAnalyzer()
results = analyzer.analyze(chain)
# Examine each intervention
for intervention_type, result in results.items():
    print(f"{intervention_type.value}:")
    print(f"  Behavioral divergence: {result.behavioral_divergence:.2f}")
    print(f"  Steps removed: {result.steps_removed}")
    print(f"  Causal effect: {result.causal_effect:.2f}")
# Rank reasoning types by causal importance
importance = analyzer.rank_causal_importance(results)
for rtype, score in sorted(importance.items(), key=lambda x: -x[1]):
    print(f"{rtype.value}: {score:.2f}")
Critical Step Identification¶
Counterfactual analysis identifies which specific steps are most critical to the reasoning outcome:
critical_steps = analyzer.identify_critical_steps(chain, results)
for step in critical_steps:
    print(f"Critical step {step.index}: {step.text[:50]}...")
Quality Assessment Dimensions¶
Reasoning quality is assessed across five dimensions, each measuring a different aspect of good reasoning.
The Five Dimensions¶
1. Clarity (20%)¶
How clear and understandable is the reasoning?
- Good: Specific language, moderate sentence length, clear structure
- Bad: Vague terms ("thing", "stuff"), very long sentences, unclear references
2. Completeness (25%)¶
Does the reasoning cover all necessary aspects?
- Good: Clear conclusion, logical flow between steps, sufficient depth
- Bad: Missing conclusion, gaps in reasoning, too brief
3. Consistency (20%)¶
Is the reasoning free of contradictions?
- Good: All claims are internally consistent
- Bad: Conflicting statements, contradictory conclusions
4. Logical Validity (25%)¶
Are the logical inferences sound?
- Good: Proper use of logical connectors, premises support conclusions
- Bad: Non-sequiturs, unsupported jumps in logic
5. Evidence Support (10%)¶
Are claims supported by evidence?
- Good: References to data, examples, or citations
- Bad: Unsupported factual claims
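Assuming the overall score is the weighted sum of the five dimension scores using the default weights listed above (the dimension values here are hypothetical, for illustration only), the arithmetic behind the "Weighted overall score" reported below looks like this:

# Default dimension weights from the list above
WEIGHTS = {
    "clarity": 0.20,
    "completeness": 0.25,
    "consistency": 0.20,
    "logical_validity": 0.25,
    "evidence_support": 0.10,
}

# Hypothetical dimension scores, for illustration only
scores = {
    "clarity": 0.8,
    "completeness": 0.6,
    "consistency": 0.9,
    "logical_validity": 0.8,
    "evidence_support": 0.5,
}

overall = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
print(f"Overall: {overall:.2f}")  # 0.20*0.8 + 0.25*0.6 + 0.20*0.9 + 0.25*0.8 + 0.10*0.5 = 0.74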
Quality Scoring¶
from rotalabs_audit import ReasoningQualityAssessor
assessor = ReasoningQualityAssessor()
metrics = assessor.assess(chain)
# Dimension scores (0-1)
print(f"Clarity: {metrics.clarity:.2f}")
print(f"Completeness: {metrics.completeness:.2f}")
print(f"Consistency: {metrics.consistency:.2f}")
print(f"Logical validity: {metrics.logical_validity:.2f}")
print(f"Evidence support: {metrics.evidence_support:.2f}")
# Weighted overall score
print(f"Overall: {metrics.overall_score:.2f}")
# Identified issues
for issue in metrics.issues:
    print(f"Issue: {issue}")
Custom Weights¶
Adjust dimension weights for your use case:
assessor = ReasoningQualityAssessor(weights={
    "clarity": 0.15,
    "completeness": 0.30,
    "consistency": 0.25,
    "logical_validity": 0.20,
    "evidence_support": 0.10,
})
Decision Tracing Architecture¶
Decision tracing captures individual decisions and sequences of decisions (decision paths) made by AI systems.
Decision Trace Structure¶
DecisionTrace
├── id: Unique identifier
├── decision: The decision statement
├── timestamp: When the decision was made
├── context: Contextual information
├── reasoning_chain: Full reasoning (optional)
├── alternatives_considered: List of alternatives
├── rationale: Explanation for the decision
├── confidence: Confidence in the decision (0-1)
├── reversible: Whether the decision can be undone
└── consequences: Known/predicted consequences
Decision Path Structure¶
A decision path represents a sequence of related decisions:
DecisionPath
├── id: Unique identifier
├── decisions: List[DecisionTrace] in order
├── goal: The objective being pursued
├── success: Whether the goal was achieved
└── failure_point: The decision where things went wrong (if applicable)
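The Path Analysis examples below assume a path object is already in hand. Here is a minimal sketch of building one, assuming DecisionTrace and DecisionPath accept the fields listed above as keyword arguments and are exported at the package top level like DecisionPathAnalyzer (check the API reference for the exact constructor signatures):

from rotalabs_audit import DecisionTrace, DecisionPath  # assumed top-level exports

first = DecisionTrace(
    decision="Use the cached result instead of recomputing",
    rationale="The cache is fresh and recomputation is expensive",
    alternatives_considered=["Recompute from scratch"],
    confidence=0.8,
    reversible=True,
)
second = DecisionTrace(
    decision="Skip validation of the cached payload",
    rationale="Validation adds latency",
    alternatives_considered=["Validate before use"],
    confidence=0.4,
    reversible=False,
)

path = DecisionPath(
    decisions=[first, second],  # ordered sequence of decisions
    goal="Return a correct answer quickly",
    success=False,
)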
Path Analysis¶
The DecisionPathAnalyzer provides tools for understanding decision sequences:
from rotalabs_audit import DecisionPathAnalyzer
analyzer = DecisionPathAnalyzer()
# Full path analysis
analysis = analyzer.analyze_path(path)
print(f"Decisions: {analysis['decision_count']}")
print(f"Avg confidence: {analysis['avg_confidence']:.2f}")
print(f"Irreversible decisions: {analysis['irreversible_count']}")
# Find critical decisions
critical = analyzer.find_critical_decisions(path)
for decision in critical:
    print(f"Critical: {decision.decision[:50]}...")
# Find failure point (if path failed)
if not path.success:
    failure = analyzer.find_failure_point(path)
    if failure:
        print(f"Failure at: {failure.decision}")
# Detect confidence decline
if analyzer.detect_confidence_decline(path):
    print("Warning: Confidence declined over the decision path")
Confidence Estimation¶
Confidence scores (0-1) are estimated from linguistic markers in the text.
High Confidence Markers¶
- "certain", "definitely", "clearly", "obviously"
- "without doubt", "confident", "absolutely"
Low Confidence Markers¶
- "uncertain", "maybe", "perhaps", "possibly"
- "might", "could", "not sure", "tentative"
Confidence Levels¶
Numeric scores map to discrete levels:
| Score Range | Level |
|---|---|
| 0.0 - 0.2 | VERY_LOW |
| 0.2 - 0.4 | LOW |
| 0.4 - 0.6 | MEDIUM |
| 0.6 - 0.8 | HIGH |
| 0.8 - 1.0 | VERY_HIGH |
from rotalabs_audit.chains import (
    estimate_confidence,
    get_confidence_level,
)
text = "I am fairly confident that this approach will work"
score = estimate_confidence(text)
level = get_confidence_level(score)
print(f"Score: {score:.2f}")
print(f"Level: {level.value}")
Next Steps¶
Now that you understand the concepts:
- Getting Started - Install and use rotalabs-audit
- Reasoning Chains Tutorial - Deep dive into parsing
- Evaluation Awareness Tutorial - Advanced detection
- Counterfactual Analysis Tutorial - Causal analysis