rotalabs-steer

Control agent behaviors through activation steering. Apply steering vectors to LLMs at inference time without retraining.

What is Activation Steering?

Activation steering is a technique for modifying LLM behavior by adding direction vectors to the model's internal activations during inference. Unlike fine-tuning or RLHF, steering vectors:

  • Require no model retraining
  • Can be applied and removed dynamically
  • Allow fine-grained control via strength parameters
  • Work with any transformer-based LLM
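A minimal sketch of the core mechanism, using a PyTorch forward hook to add a fixed direction vector to one transformer layer's output. This is illustrative only and assumes nothing about rotalabs-steer's own API; module paths and layer indexing vary by model.

    import torch

    def make_steering_hook(vector: torch.Tensor, strength: float = 1.0):
        """Return a forward hook that adds `strength * vector` to a layer's output."""
        def hook(module, inputs, output):
            # Decoder layers usually return a tuple whose first element is the
            # hidden states of shape (batch, seq_len, hidden_size).
            hidden = output[0] if isinstance(output, tuple) else output
            steered = hidden + strength * vector.to(hidden.device, hidden.dtype)
            if isinstance(output, tuple):
                return (steered,) + output[1:]
            return steered
        return hook

    # Hypothetical usage with a Llama-style HuggingFace causal LM:
    # handle = model.model.layers[14].register_forward_hook(make_steering_hook(vec, strength=2.0))
    # ... generate as usual ...
    # handle.remove()  # steering is removed dynamically; no weights are changed

Because the intervention lives in a hook, it can be attached, tuned via the strength parameter, and detached at runtime, which is what makes steering vectors lighter-weight than fine-tuning.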

How It Works

  1. Extract steering vectors from contrast pairs (examples of desired vs. undesired behavior)
  2. Apply vectors at inference by adding them to specific transformer layers
  3. Adjust strength to control the intensity of the behavioral change
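As a sketch of step 1, Contrastive Activation Addition (CAA, Rimsky et al., 2024) derives a steering vector from the mean difference of hidden states between positive and negative examples at a chosen layer. The helper below assumes a HuggingFace causal LM and uses the last-token hidden state of each prompt; prompt formatting, token position, and layer choice are exactly the details a package like this handles for you.

    import torch

    @torch.no_grad()
    def extract_caa_vector(model, tokenizer, positive_prompts, negative_prompts, layer: int):
        """CAA-style steering vector: mean activation difference at `layer`."""
        def last_token_hidden(prompts):
            states = []
            for text in prompts:
                inputs = tokenizer(text, return_tensors="pt").to(model.device)
                out = model(**inputs, output_hidden_states=True)
                # hidden_states[0] is the embedding output, so index `layer`
                # is the output of transformer block `layer`; take the last token.
                states.append(out.hidden_states[layer][0, -1, :])
            return torch.stack(states)

        positive = last_token_hidden(positive_prompts)
        negative = last_token_hidden(negative_prompts)
        # The steering direction is the mean difference across contrast pairs.
        return (positive - negative).mean(dim=0)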

Package Overview

rotalabs_steer/
├── core/           # Steering infrastructure
│   ├── vectors     # SteeringVector, SteeringVectorSet
│   ├── hooks       # ActivationHook, ActivationCache
│   ├── injection   # ActivationInjector, MultiVectorInjector
│   └── configs     # Pre-configured model settings
├── datasets/       # Contrast pair datasets
├── extraction/     # CAA extraction algorithm
├── evaluation/     # Metrics and analysis tools
└── integrations/   # LangChain wrappers

Supported Behaviors

Behavior                Description
refusal                 Refusing harmful or inappropriate requests
uncertainty             Expressing calibrated uncertainty
tool_restraint          Avoiding unnecessary tool use
instruction_hierarchy   Following system over user instructions

Supported Models

Pre-configured support for:

  • Qwen3 (4B, 8B, 14B)
  • DeepSeek-R1-Distill-Qwen-14B
  • Llama 3.1 (8B, 70B)
  • Mistral 7B (v0.2, v0.3)
  • Gemma 2 9B

The package also auto-infers configuration from any HuggingFace model.
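Most of what such a configuration needs (hidden size, layer count, and sensible default steering layers) can be read from the model's HuggingFace config. The sketch below shows that kind of inference in general terms; it is an assumption for illustration, not the package's actual logic.

    from transformers import AutoConfig

    def infer_steering_config(model_name: str) -> dict:
        """Guess basic steering settings from a HuggingFace model config."""
        cfg = AutoConfig.from_pretrained(model_name)
        num_layers = cfg.num_hidden_layers
        return {
            "hidden_size": cfg.hidden_size,   # dimensionality of steering vectors
            "num_layers": num_layers,
            # Middle layers are a common default target for steering interventions.
            "default_layers": list(range(num_layers // 3, 2 * num_layers // 3)),
        }

    # infer_steering_config("meta-llama/Llama-3.1-8B-Instruct")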

Research Background

This package implements techniques from:

  • Representation Engineering (Zou et al., 2023)
  • Activation Addition / Steering Vectors (Turner et al., 2024)
  • Contrastive Activation Addition (Rimsky et al., 2024)