rotalabs-steer
Control agent behaviors through activation steering. Apply steering vectors to LLMs at inference time without retraining.
What is Activation Steering?
Activation steering is a technique for modifying LLM behavior by adding direction vectors to the model's internal activations during inference. Unlike fine-tuning or RLHF, steering vectors:
- Require no model retraining
- Can be applied and removed dynamically
- Allow fine-grained control via strength parameters
- Work with any transformer-based LLM
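To make the "applied and removed dynamically" point concrete, here is a minimal sketch in plain Python. It is not the package's API: `ToyLayer` and `steer` are hypothetical names, and lists of floats stand in for activation tensors. In a real framework the same idea would typically be implemented with a forward hook on a transformer layer.

```python
from contextlib import contextmanager


class ToyLayer:
    """Hypothetical stand-in for one transformer layer.

    It passes its input through unchanged, then lets any registered
    hooks modify the output.
    """

    def __init__(self):
        self.hooks = []

    def forward(self, hidden):
        out = list(hidden)
        for hook in self.hooks:
            out = hook(out)
        return out


@contextmanager
def steer(layer, vector, strength=1.0):
    """Add `strength * vector` to the layer output while the block is active."""

    def hook(hidden):
        return [h + strength * v for h, v in zip(hidden, vector)]

    layer.hooks.append(hook)
    try:
        yield layer
    finally:
        # Removing the hook restores the original behavior; no weights changed.
        layer.hooks.remove(hook)


layer = ToyLayer()
hidden = [0.5, -1.0, 2.0]
vector = [1.0, 0.0, -1.0]

baseline = layer.forward(hidden)
with steer(layer, vector, strength=2.0):
    steered = layer.forward(hidden)
restored = layer.forward(hidden)

print(baseline, steered, restored)
```

The model parameters are never touched: steering exists only while the hook is registered, and the strength argument scales how far the activations are shifted.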
How It Works
1. Extract steering vectors from contrast pairs (examples of desired vs. undesired behavior)
2. Apply the vectors at inference time by adding them to the activations of specific transformer layers
3. Adjust the strength to control the intensity of the behavioral change
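The steps above can be sketched with a mean-difference extraction, which is the core idea behind Contrastive Activation Addition. This is an illustrative toy under stated assumptions, not the package's implementation: the function names are hypothetical, the activation values are made up, and plain lists stand in for per-layer activation tensors.

```python
def mean_vector(rows):
    """Element-wise mean over a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def extract_steering_vector(positive_acts, negative_acts):
    """Step 1: mean activation on desired examples minus mean on undesired ones."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden, vector, strength):
    """Steps 2 and 3: add the scaled vector to a layer's activations."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Made-up activations at one layer for two contrast pairs.
positive_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

vector = extract_steering_vector(positive_acts, negative_acts)

hidden = [0.0, 1.0, 1.0]
for strength in (0.0, 0.5, 1.0):
    print(strength, apply_steering(hidden, vector, strength))
```

A strength of 0.0 leaves the activations unchanged, and larger values push them further along the extracted direction. Dimensions on which the two groups agree (the second component here) contribute nothing to the vector.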
Package Overview
rotalabs_steer/
├── core/ # Steering infrastructure
│ ├── vectors # SteeringVector, SteeringVectorSet
│ ├── hooks # ActivationHook, ActivationCache
│ ├── injection # ActivationInjector, MultiVectorInjector
│ └── configs # Pre-configured model settings
├── datasets/ # Contrast pair datasets
├── extraction/ # CAA extraction algorithm
├── evaluation/ # Metrics and analysis tools
└── integrations/ # LangChain wrappers
Supported Behaviors
The package includes contrast pair datasets for 11 behaviors (335 total pairs):
| Behavior | Description | Pairs |
|---|---|---|
| `refusal` | Refusing harmful or inappropriate requests | 50 |
| `uncertainty` | Expressing calibrated uncertainty | 26 |
| `tool_restraint` | Avoiding unnecessary tool use | 41 |
| `instruction_hierarchy` | Following system over user instructions | 26 |
| `formality` | Formal vs casual communication style | 29 |
| `conciseness` | Brief, direct vs verbose responses | 25 |
| `creativity` | Imaginative vs conventional responses | 30 |
| `assertiveness` | Direct, confident vs hedging responses | 27 |
| `humor` | Witty, playful vs serious responses | 31 |
| `empathy` | Warm, supportive vs detached responses | 28 |
| `technical_depth` | Expert-level vs simplified responses | 22 |
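As a rough illustration of what a contrast pair holds, the sketch below defines a hypothetical `ContrastPair` container and records the pair counts from the table above. The class, its field names, and the example texts are assumptions made for illustration; the package's own dataset classes and contents may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastPair:
    """Hypothetical container, not the package's actual dataset class."""

    behavior: str
    prompt: str
    positive: str  # response exhibiting the target behavior
    negative: str  # response exhibiting the opposite behavior


# Pair counts per behavior, copied from the table above.
PAIR_COUNTS = {
    "refusal": 50,
    "uncertainty": 26,
    "tool_restraint": 41,
    "instruction_hierarchy": 26,
    "formality": 29,
    "conciseness": 25,
    "creativity": 30,
    "assertiveness": 27,
    "humor": 31,
    "empathy": 28,
    "technical_depth": 22,
}

# Invented example for illustration only; it is not taken from the shipped dataset.
example = ContrastPair(
    behavior="formality",
    prompt="Can you reschedule our meeting?",
    positive="Certainly. I would be glad to propose an alternative time.",
    negative="Sure thing, when works for ya?",
)

total_pairs = sum(PAIR_COUNTS.values())
print(len(PAIR_COUNTS), total_pairs)  # 11 335
```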
Supported Models
Pre-configured support for:
- Qwen3 (4B, 8B, 14B)
- DeepSeek-R1-Distill-Qwen-14B
- Llama 3.1 (8B, 70B)
- Mistral 7B (v0.2, v0.3)
- Gemma 2 9B
The package also auto-infers configuration for any Hugging Face model.
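How the package infers its configuration is not documented on this page. As a hedged sketch of one plausible approach, the hypothetical helper below reads `num_hidden_layers` and `hidden_size`, two attributes that Hugging Face configs for Llama, Qwen, Mistral, and Gemma models expose. The helper name, the returned keys, and the middle-layer default are all assumptions.

```python
from types import SimpleNamespace


def infer_steering_config(model_config):
    """Hypothetical helper: derive steering settings from a Hugging Face-style config."""
    num_layers = model_config.num_hidden_layers
    hidden_size = model_config.hidden_size
    return {
        "num_layers": num_layers,
        # A steering vector must match the width of the activations it is added to.
        "hidden_size": hidden_size,
        # Assumed heuristic: start with a layer near the middle of the stack.
        "default_layer": num_layers // 2,
    }


# Stand-in for a loaded model config; the values are illustrative.
toy_config = SimpleNamespace(num_hidden_layers=32, hidden_size=4096)
settings = infer_steering_config(toy_config)
print(settings)
```

With a real model, the same attributes would be read from `model.config` after loading it with the Transformers library.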
Quick Links
- Getting Started - Installation and first steps
- Core Concepts - Understanding steering vectors
- API Reference - Detailed API documentation
- Tutorials - Step-by-step guides
Research Background
This package implements techniques from:
- Representation Engineering (Zou et al., 2023)
- Activation Addition / Steering Vectors (Turner et al., 2024)
- Contrastive Activation Addition (Rimsky et al., 2024)