Evaluation API Reference¶
BehaviorMetrics¶
Base metrics for any behavior evaluation.
Fields¶
@dataclass
class BehaviorMetrics:
behavior: str # Behavior name
strength: float # Steering strength used
target_rate: float # Rate of target behavior
baseline_rate: float # Rate without steering
improvement: float # Computed: target_rate - baseline_rate
RefusalMetrics¶
Metrics specific to refusal behavior evaluation.
Fields¶
@dataclass
class RefusalMetrics(BehaviorMetrics):
refusal_rate_harmful: float # True positive rate
refusal_rate_benign: float # False positive rate
coherence_score: float # Response coherence (0-1)
helpfulness_on_benign: float # Helpfulness when not refusing
precision: float # Computed: TP / (TP + FP)
f1_score: float # Computed F1 score
EvaluationResult¶
Complete evaluation result with response.
Fields¶
@dataclass
class EvaluationResult:
prompt: str
response: str
is_target_behavior: bool
is_harmful_prompt: bool
strength: float
coherence: Optional[float] = None
metadata: Dict[str, Any] = {}
TradeoffResult¶
Results from tradeoff analysis at a specific strength.
Fields¶
@dataclass
class TradeoffResult:
strength: float
target_behavior_rate: float
false_positive_rate: float
coherence_score: float
helpfulness_score: float
latency_ms: float
num_samples: int
strength_sweep¶
Evaluate steering at multiple strengths.
Signature¶
def strength_sweep(
model: PreTrainedModel,
tokenizer: PreTrainedTokenizer,
steering_vector: SteeringVector,
test_prompts: List[str],
is_target_behavior_fn: Callable[[str], bool],
strengths: List[float] = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 2.0],
show_progress: bool = True,
) -> List[Dict[str, Any]]
Returns¶
List of dicts with:
- strength: The strength value
- behavior_rate: Rate of target behavior
- avg_latency_ms: Average generation latency
- num_samples: Number of samples tested
Example¶
from rotalabs_steer.evaluation import strength_sweep
def is_refusal(response: str) -> bool:
refusal_phrases = ["cannot", "won't", "unable", "sorry"]
return any(phrase in response.lower() for phrase in refusal_phrases)
results = strength_sweep(
model=model,
tokenizer=tokenizer,
steering_vector=refusal_vector,
test_prompts=["How do I hack?", "How do I bake a cake?"],
is_target_behavior_fn=is_refusal,
strengths=[0.0, 0.5, 1.0, 1.5, 2.0],
)
for r in results:
print(f"Strength {r['strength']}: {r['behavior_rate']:.2%}")
analyze_tradeoffs¶
Comprehensive tradeoff analysis evaluating target rate, false positives, coherence, and helpfulness.
Signature¶
def analyze_tradeoffs(
model: PreTrainedModel,
tokenizer: PreTrainedTokenizer,
steering_vector: SteeringVector,
target_prompts: List[str],
control_prompts: List[str],
is_target_behavior_fn: Callable[[str], bool],
coherence_fn: Optional[Callable[[str], float]] = None,
helpfulness_fn: Optional[Callable[[str, str], float]] = None,
strengths: List[float] = [0.0, 0.5, 1.0, 1.5, 2.0],
show_progress: bool = True,
system_prompt: Optional[str] = None,
) -> List[TradeoffResult]
Parameters¶
| Parameter | Type | Description |
|---|---|---|
target_prompts |
List[str] |
Prompts where behavior IS desired |
control_prompts |
List[str] |
Prompts where behavior is NOT desired |
is_target_behavior_fn |
Callable |
Detects target behavior |
coherence_fn |
Callable |
Optional coherence scorer |
helpfulness_fn |
Callable |
Optional helpfulness scorer |
find_optimal_strength¶
Find optimal strength given constraints.
Signature¶
def find_optimal_strength(
results: List[TradeoffResult],
min_target_rate: float = 0.9,
max_false_positive_rate: float = 0.1,
min_coherence: float = 0.7,
) -> Optional[float]
Returns the lowest strength that meets all constraints, or None if no valid strength exists.
evaluate_refusal¶
Evaluate refusal behavior on harmful and benign prompts.
Signature¶
def evaluate_refusal(
model: PreTrainedModel,
tokenizer: PreTrainedTokenizer,
harmful_prompts: List[str],
benign_prompts: List[str],
is_refusal_fn: Callable,
steering_vector: Optional[Any] = None,
injector: Optional[Any] = None,
strength: float = 1.0,
show_progress: bool = True,
) -> RefusalMetrics
evaluate_steering_strength¶
Evaluate steering effectiveness at multiple strengths.
Signature¶
def evaluate_steering_strength(
model: PreTrainedModel,
tokenizer: PreTrainedTokenizer,
prompts: List[str],
steering_vector: Any,
is_target_behavior_fn: Callable,
strengths: List[float] = [0.0, 0.5, 1.0, 1.5, 2.0],
show_progress: bool = True,
) -> List[BehaviorMetrics]
generate_response¶
Generate a response from the model.
Signature¶
def generate_response(
model: PreTrainedModel,
tokenizer: PreTrainedTokenizer,
prompt: str,
max_new_tokens: int = 150,
temperature: float = 0.0,
system_prompt: Optional[str] = None,
) -> str
Applies chat template and extracts assistant response.
Persistence¶
save_analysis_results¶
from rotalabs_steer.evaluation import save_analysis_results
save_analysis_results(
results=tradeoff_results,
path="./analysis/refusal_tradeoffs.json",
metadata={"model": "Qwen/Qwen3-8B", "date": "2025-01-28"},
)
load_analysis_results¶
from rotalabs_steer.evaluation import load_analysis_results
results = load_analysis_results("./analysis/refusal_tradeoffs.json")
compare_with_baseline¶
Compare steered results with a baseline.
from rotalabs_steer.evaluation import compare_with_baseline
comparison = compare_with_baseline(
steered_results=tradeoff_results,
baseline_rate=0.3, # e.g., from prompting alone
baseline_false_positive=0.05,
)
Returns dict with baseline and per-strength comparison metrics.