Probing¶
Activation probing for interpretability analysis.
Optional Dependency
The probing module requires pip install rotalabs-probe[gpu] which includes
PyTorch, Transformers, and scikit-learn.
Available Components¶
| Component | Description |
|---|---|
| ActivationHook | Extract activations from model layers |
| LinearProbe | Train linear classifiers on activations |
| SteeringVector | Compute and apply steering vectors |
Module¶
Activation probing for sandbagging detection.
This module provides tools for analyzing model activations to detect sandbagging behavior at the representation level.
Key components: - ActivationHook: Capture hidden states during forward pass - SteeringVector: Represent behavioral directions in activation space - extract_caa_vector: Extract vectors using Contrastive Activation Addition - LinearProbe: Train classifiers on activation patterns
ActivationHook
¶
Hook for capturing activations from specific model layers.
Works with HuggingFace transformers models (GPT-2, Mistral, Llama, etc).
Example
hook = ActivationHook(model, layer_indices=[10, 15, 20]) with hook: ... outputs = model(**inputs) act = hook.cache.get("layer_15") # (batch, seq, hidden)
Source code in src/rotalabs_probe/probing/hooks.py
__init__(model: nn.Module, layer_indices: List[int], component: str = 'residual', token_position: str = 'all')
¶
Initialize activation hook.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model
|
Module
|
HuggingFace model to hook |
required |
layer_indices
|
List[int]
|
Which layers to capture |
required |
component
|
str
|
What to capture - "residual", "attn", or "mlp" |
'residual'
|
token_position
|
str
|
"all", "last", or "first" |
'all'
|
Source code in src/rotalabs_probe/probing/hooks.py
__enter__()
¶
Register hooks on specified layers.
Source code in src/rotalabs_probe/probing/hooks.py
LinearProbe
¶
Linear probe for binary classification on activation patterns.
Uses logistic regression to learn a linear decision boundary in activation space for detecting sandbagging behavior.
Example
probe = LinearProbe() probe.fit(X_train, y_train) predictions = probe.predict(X_test) accuracy = probe.score(X_test, y_test)
Source code in src/rotalabs_probe/probing/probes.py
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 | |
cv_accuracy: float
property
¶
Mean cross-validation accuracy.
cv_std: float
property
¶
Standard deviation of cross-validation accuracy.
coef: np.ndarray
property
¶
Coefficients of the linear classifier (the probe direction).
__init__(C: float = 1.0, max_iter: int = 1000, random_state: int = 42)
¶
Initialize linear probe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
C
|
float
|
Inverse regularization strength |
1.0
|
max_iter
|
int
|
Maximum iterations for optimization |
1000
|
random_state
|
int
|
Random seed for reproducibility |
42
|
Source code in src/rotalabs_probe/probing/probes.py
fit(X: np.ndarray, y: np.ndarray, cv_folds: int = 5) -> LinearProbe
¶
Fit the probe to training data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
ndarray
|
Activation vectors (n_samples, hidden_dim) |
required |
y
|
ndarray
|
Binary labels (0=genuine, 1=sandbagging) |
required |
cv_folds
|
int
|
Number of cross-validation folds |
5
|
Returns:
| Type | Description |
|---|---|
LinearProbe
|
self |
Source code in src/rotalabs_probe/probing/probes.py
predict(X: np.ndarray) -> np.ndarray
¶
Predict labels for new activations.
predict_proba(X: np.ndarray) -> np.ndarray
¶
Get probability estimates for each class.
score(X: np.ndarray, y: np.ndarray) -> float
¶
Compute accuracy on test data.
save(path: Path) -> None
¶
Save probe to disk.
Source code in src/rotalabs_probe/probing/probes.py
load(path: Path) -> LinearProbe
classmethod
¶
Load probe from disk.
Source code in src/rotalabs_probe/probing/probes.py
SteeringVector
dataclass
¶
A vector in activation space representing a behavioral direction.
Created by computing mean(positive_activations) - mean(negative_activations) using Contrastive Activation Addition (CAA).
Attributes:
| Name | Type | Description |
|---|---|---|
behavior |
str
|
Name of the behavior (e.g., "sandbagging") |
layer_index |
int
|
Which layer this vector was extracted from |
vector |
Tensor
|
The actual steering vector tensor |
model_name |
str
|
Model used for extraction |
extraction_method |
str
|
Method used (typically "caa") |
metadata |
Dict[str, Any]
|
Additional extraction details |
Source code in src/rotalabs_probe/probing/vectors.py
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 | |
norm: float
property
¶
L2 norm of the steering vector.
dim: int
property
¶
Dimensionality of the vector.
to(device: str) -> SteeringVector
¶
Move vector to specified device.
Source code in src/rotalabs_probe/probing/vectors.py
normalize() -> SteeringVector
¶
Return unit-normalized version of this vector.
Source code in src/rotalabs_probe/probing/vectors.py
save(path: Path) -> None
¶
Save vector to disk.
Creates
- {path}.pt: The vector tensor
- {path}_meta.json: Metadata
Source code in src/rotalabs_probe/probing/vectors.py
load(path: Path) -> SteeringVector
classmethod
¶
Load vector from disk.
Source code in src/rotalabs_probe/probing/vectors.py
cosine_similarity(other: SteeringVector) -> float
¶
Compute cosine similarity with another vector.
extract_activations(model, tokenizer, texts: List[str], layer_indices: List[int], token_position: Literal['last', 'first', 'mean'] = 'last', show_progress: bool = True) -> Dict[int, torch.Tensor]
¶
Extract activations for multiple texts at specified layers.
Source code in src/rotalabs_probe/probing/extraction.py
extract_caa_vector(model, tokenizer, contrast_pairs: List[Dict[str, str]], layer_idx: int, token_position: Literal['last', 'first', 'mean'] = 'last', behavior: str = 'sandbagging', show_progress: bool = True) -> SteeringVector
¶
Extract steering vector using Contrastive Activation Addition.
The core idea: compute mean(positive_acts) - mean(negative_acts) to find the direction in activation space that corresponds to the target behavior.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model
|
HuggingFace model |
required | |
tokenizer
|
Corresponding tokenizer |
required | |
contrast_pairs
|
List[Dict[str, str]]
|
List of dicts with "positive" and "negative" keys |
required |
layer_idx
|
int
|
Which layer to extract from |
required |
token_position
|
Literal['last', 'first', 'mean']
|
Which token position to use |
'last'
|
behavior
|
str
|
Name of the behavior being extracted |
'sandbagging'
|
show_progress
|
bool
|
Show progress bar |
True
|
Returns:
| Type | Description |
|---|---|
SteeringVector
|
SteeringVector for the extracted direction |