rotalabs-steer
Control agent behaviors through activation steering. Apply steering vectors to LLMs at inference time without retraining.
What is Activation Steering?
Activation steering is a technique for modifying LLM behavior by adding direction vectors to the model's internal activations during inference. Unlike fine-tuning or RLHF, steering vectors:
- Require no model retraining
- Can be applied and removed dynamically
- Allow fine-grained control via strength parameters
- Work with any transformer-based LLM whose internal activations can be read and modified at inference time
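
As a rough formalization of the definition above (the symbols are notation introduced here for illustration, not names taken from the package): let $h_\ell$ be the activation at layer $\ell$, $v_\ell$ a steering vector for that layer, and $\alpha$ the strength parameter. Steering then amounts to:

$$
h_\ell' = h_\ell + \alpha \, v_\ell
$$

Setting $\alpha = 0$ recovers the unmodified model, which is why a vector can be applied and removed dynamically. Increasing $\alpha$ pushes the activation further along the steering direction, and a negative $\alpha$ pushes it the opposite way.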
How It Works
- Extract steering vectors from contrast pairs (examples of desired vs. undesired behavior)
- Apply vectors at inference by adding them to specific transformer layers
- Adjust strength to control the intensity of the behavioral change
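
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes a mean-difference extraction in the spirit of Contrastive Activation Addition and uses made-up 3-dimensional activations. The function names `extract_steering_vector` and `apply_steering` are hypothetical and are not part of the rotalabs-steer API.

```python
from statistics import fmean


def extract_steering_vector(positive_acts, negative_acts):
    """Hypothetical helper: per-dimension mean(positive) - mean(negative)."""
    pos_mean = [fmean(column) for column in zip(*positive_acts)]
    neg_mean = [fmean(column) for column in zip(*negative_acts)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden_state, vector, strength=1.0):
    """Hypothetical helper: return hidden_state + strength * vector.

    The input list is left unmodified, so steering is easy to undo.
    """
    return [h + strength * v for h, v in zip(hidden_state, vector)]


# Toy activations recorded at a single layer (made-up numbers).
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # desired behavior
negative_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]  # undesired behavior

# Step 1: extract a direction from the contrast pairs.
vector = extract_steering_vector(positive_acts, negative_acts)

# Steps 2 and 3: add the direction to a hidden state at a chosen strength.
hidden = [0.5, 0.5, 0.5]
steered = apply_steering(hidden, vector, strength=0.5)
unsteered = apply_steering(hidden, vector, strength=0.0)

print(vector)   # [2.0, -2.0, 0.0]
print(steered)  # [1.5, -0.5, 0.5]
```

In a real model the hidden state would be a tensor produced by a specific transformer layer, and the addition would typically happen inside a forward hook rather than on a Python list.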
Package Overview
rotalabs_steer/
├── core/ # Steering infrastructure
│ ├── vectors # SteeringVector, SteeringVectorSet
│ ├── hooks # ActivationHook, ActivationCache
│ ├── injection # ActivationInjector, MultiVectorInjector
│ └── configs # Pre-configured model settings
├── datasets/ # Contrast pair datasets
├── extraction/ # CAA extraction algorithm
├── evaluation/ # Metrics and analysis tools
└── integrations/ # LangChain wrappers
Supported Behaviors
| Behavior | Description |
|---|---|
| `refusal` | Refusing harmful or inappropriate requests |
| `uncertainty` | Expressing calibrated uncertainty |
| `tool_restraint` | Avoiding unnecessary tool use |
| `instruction_hierarchy` | Following system over user instructions |
Supported Models
Pre-configured support for:
- Qwen3 (4B, 8B, 14B)
- DeepSeek-R1-Distill-Qwen-14B
- Llama 3.1 (8B, 70B)
- Mistral 7B (v0.2, v0.3)
- Gemma 2 9B
The package also auto-infers configuration for any Hugging Face model.
Quick Links
- Getting Started - Installation and first steps
- Core Concepts - Understanding steering vectors
- API Reference - Detailed API documentation
- Tutorials - Step-by-step guides
Research Background
This package implements techniques from:
- Representation Engineering (Zou et al., 2023)
- Activation Addition / Steering Vectors (Turner et al., 2024)
- Contrastive Activation Addition (Rimsky et al., 2024)