rotalabs-probe

Sandbagging detection via activation probes: detects when AI systems deliberately underperform.

Overview

rotalabs-probe is a toolkit for detecting and analyzing metacognitive capabilities in AI systems, focusing on three questions:

  • Situational Awareness: Does the model know it's being tested?
  • Observer Effect: Does behavior change when monitored?
  • Sandbagging: Is the model deliberately underperforming?

Key Features

  • Multiple Detection Methods: Statistical, behavioral, and activation-based
  • Counterfactual Analysis: Analyze reasoning chains for signs of strategic behavior
  • Probing Classifiers: Train linear probes on model activations
  • Multi-API Support: OpenAI, Anthropic, and local models
  • Visualization: Plot divergences, features, and causal attributions
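
To make the "Probing Classifiers" feature concrete, here is a minimal sketch of what training a linear probe involves. It does not use the rotalabs-probe API: the function names, the synthetic 8-dimensional "activations", and the hyperparameters are all assumptions chosen for illustration, and the probe is a plain logistic regression fitted by batch gradient descent using only the standard library.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(activations, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(activations[0])
    n = len(activations)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(activations, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_predict(weights, bias, x):
    """Return 1 if the probe's linear score is positive, else 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


# Synthetic stand-in for model activations (hypothetical data):
# label 1 shifts the first dimension, all other dimensions are noise.
random.seed(0)


def synthetic_activation(label, dim=8, shift=4.0):
    vec = [random.gauss(0.0, 1.0) for _ in range(dim)]
    vec[0] += shift * label
    return vec


train_labels = [i % 2 for i in range(200)]
train_acts = [synthetic_activation(y) for y in train_labels]
test_labels = [i % 2 for i in range(100)]
test_acts = [synthetic_activation(y) for y in test_labels]

weights, bias = train_linear_probe(train_acts, train_labels)
correct = sum(
    probe_predict(weights, bias, x) == y
    for x, y in zip(test_acts, test_labels)
)
accuracy = correct / len(test_labels)
print(f"Held-out probe accuracy: {accuracy:.2f}")
```

In practice the activations would come from a hooked model layer rather than a random generator, and held-out accuracy well above chance would suggest that the probed property is linearly decodable at that layer.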

Architecture

┌─────────────────────────────────────────────────────────┐
│                    Detection Layer                       │
├─────────────────┬─────────────────┬─────────────────────┤
│  Sandbagging    │   Situational   │    Observer         │
│   Detector      │   Awareness     │    Effect           │
└────────┬────────┴────────┬────────┴────────┬────────────┘
         │                 │                 │
         ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────┐
│                    Analysis Layer                        │
├─────────────────┬─────────────────┬─────────────────────┤
│  Counterfactual │   Behavioral    │    Statistical      │
│     CoT         │   Features      │    Tests            │
└────────┬────────┴────────┬────────┴────────┬────────────┘
         │                 │                 │
         ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────┐
│                    Probing Layer (Optional)              │
├─────────────────┬─────────────────┬─────────────────────┤
│  Activation     │    Linear       │    Steering         │
│    Hooks        │    Probes       │    Vectors          │
└─────────────────┴─────────────────┴─────────────────────┘
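
The "Steering Vectors" box in the probing layer can be illustrated with one common construction, the difference of class means. This is a sketch under assumptions, not the package's implementation: `mean_vector`, `steering_vector`, and `apply_steering` are hypothetical helper names, and the tiny 3-dimensional activations are made up for the example.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference of class means: points from 'negative' toward 'positive'."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_steering(activation, vector, strength=1.0):
    """Shift an activation along the steering direction."""
    return [a + strength * v for a, v in zip(activation, vector)]


# Hypothetical activations collected under two conditions.
sandbagging_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
honest_acts = [[0.0, 0.0, 2.0], [0.0, 2.0, 2.0]]

direction = steering_vector(sandbagging_acts, honest_acts)
# A negative strength moves the activation away from the "sandbagging" mean.
steered = apply_steering([1.0, 1.0, 1.0], direction, strength=-0.5)
print(direction, steered)
```

With a real model, the two activation sets would be captured by activation hooks at a chosen layer, and the shifted activation would be written back during the forward pass.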

Installation

# Core package
pip install rotalabs-probe

# With GPU probing support
# (quoting the requirement keeps shells such as zsh from globbing the brackets)
pip install "rotalabs-probe[gpu]"

# With API integrations
pip install "rotalabs-probe[anthropic,openai]"

# With visualization
pip install "rotalabs-probe[viz]"

# Everything
pip install "rotalabs-probe[all]"

Quick Start

from rotalabs_probe import SandbaggingDetector, PerformanceSample

# Create detector
detector = SandbaggingDetector()

# Add performance samples
detector.add_sample(PerformanceSample(
    context="benchmark",
    score=0.65,
    difficulty=0.8,
))
detector.add_sample(PerformanceSample(
    context="production",
    score=0.95,
    difficulty=0.8,
))

# Analyze for sandbagging
result = detector.analyze()
print(f"Sandbagging probability: {result.probability:.2%}")
print(f"Evidence: {result.evidence}")
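
The Quick Start compares scores across contexts at the same difficulty. As a rough illustration of the kind of statistical check that context-dependent analysis can rest on, the sketch below runs a one-sided permutation test on two score lists. It is not how `SandbaggingDetector` is documented to work: the `permutation_test` helper is hypothetical and the scores are made-up illustrative values.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """One-sided permutation test for mean(group_b) > mean(group_a).

    Returns the observed gap and an estimated p-value: the share of random
    relabelings whose gap is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if gap >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up scores on tasks of comparable difficulty.
benchmark_scores = [0.62, 0.65, 0.60, 0.68, 0.64, 0.66]
production_scores = [0.93, 0.95, 0.91, 0.96, 0.94, 0.92]

gap, p_value = permutation_test(benchmark_scores, production_scores)
print(f"Production minus benchmark: {gap:.3f} (p ~ {p_value:.4f})")
```

A small p-value only says the gap is unlikely under random relabeling. It does not by itself establish deliberate underperformance, since differences in task mix or prompting between contexts can produce the same pattern.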

Detection Methods

Detector                      Description                               Requires
----------------------------  ----------------------------------------  ----------------
SandbaggingDetector           Context-dependent performance analysis    Performance data
SituationalAwarenessDetector  Probe for evaluation awareness            Model access
ObserverEffectMonitor         Monitor for behavioral changes            Interaction logs
CounterfactualCoTAnalyzer     Analyze reasoning for strategic behavior  CoT outputs