rotalabs-probe¶

Sandbagging detection via activation probes - Detects when AI systems deliberately underperform.

Overview¶

rotalabs-probe is a comprehensive toolkit for detecting and analyzing metacognitive capabilities in AI systems, particularly around:

Situational Awareness: Does the model know it's being tested?
Observer Effect: Does behavior change when monitored?
Sandbagging: Is the model deliberately underperforming?

Key Features¶

Multiple Detection Methods: Statistical, behavioral, and activation-based
Counterfactual Analysis: Analyze reasoning chains for signs of strategic behavior
Probing Classifiers: Train linear probes on model activations
Multi-API Support: OpenAI, Anthropic, and local models
Visualization: Plot divergences, features, and causal attributions

Architecture¶

┌─────────────────────────────────────────────────────────┐
│                    Detection Layer                       │
├─────────────────┬─────────────────┬─────────────────────┤
│  Sandbagging    │   Situational   │    Observer         │
│   Detector      │   Awareness     │    Effect           │
└────────┬────────┴────────┬────────┴────────┬────────────┘
         │                 │                 │
         ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────┐
│                    Analysis Layer                        │
├─────────────────┬─────────────────┬─────────────────────┤
│  Counterfactual │   Behavioral    │    Statistical      │
│     CoT         │   Features      │    Tests            │
└────────┬────────┴────────┬────────┴────────┬────────────┘
         │                 │                 │
         ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────┐
│                    Probing Layer (Optional)              │
├─────────────────┬─────────────────┬─────────────────────┤
│  Activation     │    Linear       │    Steering         │
│    Hooks        │    Probes       │    Vectors          │
└─────────────────┴─────────────────┴─────────────────────┘

Installation¶

# Core package
pip install rotalabs-probe

# With GPU probing support
pip install rotalabs-probe[gpu]

# With API integrations
pip install rotalabs-probe[anthropic,openai]

# With visualization
pip install rotalabs-probe[viz]

# Everything
pip install rotalabs-probe[all]

Quick Start¶

from rotalabs_probe import SandbaggingDetector, PerformanceSample

# Create detector
detector = SandbaggingDetector()

# Add performance samples
detector.add_sample(PerformanceSample(
    context="benchmark",
    score=0.65,
    difficulty=0.8,
))
detector.add_sample(PerformanceSample(
    context="production",
    score=0.95,
    difficulty=0.8,
))

# Analyze for sandbagging
result = detector.analyze()
print(f"Sandbagging probability: {result.probability:.2%}")
print(f"Evidence: {result.evidence}")

Detection Methods¶

Detector	Description	Requires
`SandbaggingDetector`	Context-dependent performance analysis	Performance data
`SituationalAwarenessDetector`	Probe for evaluation awareness	Model access
`ObserverEffectMonitor`	Monitor for behavioral changes	Interaction logs
`CounterfactualCoTAnalyzer`	Analyze reasoning for strategic behavior	CoT outputs