# Getting Started

## Installation
### Basic Installation
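The base package installs from PyPI under the name used by the extras below (assuming it is published there):

```bash
pip install rotalabs-steer
```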
### With Optional Dependencies
```bash
# LangChain integration
pip install rotalabs-steer[langchain]

# LLM-based evaluation (requires Anthropic API key)
pip install rotalabs-steer[judge]

# Visualization tools
pip install rotalabs-steer[viz]

# All optional dependencies
pip install rotalabs-steer[all]

# Development dependencies
pip install rotalabs-steer[dev]
```
### Core Dependencies
The base package requires:

- `torch>=2.0.0`
- `transformers>=4.35.0`
- `accelerate>=0.25.0`
- `safetensors>=0.4.0`
- `einops>=0.7.0`
- `numpy>=1.24.0`
- `pandas>=2.0.0`
- `scipy>=1.10.0`
- `scikit-learn>=1.3.0`
- `tqdm>=4.65.0`
- `pyyaml>=6.0`
## Basic Usage

### 1. Extract a Steering Vector
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from rotalabs_steer import SteeringVector, SteeringVectorSet
from rotalabs_steer.extraction import extract_caa_vectors
from rotalabs_steer.datasets import load_refusal_pairs

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Load contrast pairs
refusal_pairs = load_refusal_pairs()

# Extract steering vectors from multiple layers
vectors = extract_caa_vectors(
    model=model,
    tokenizer=tokenizer,
    contrast_pairs=refusal_pairs,
    layer_indices=[14, 15, 16],
)

# Save for later use
vectors.save("./refusal_vectors")
```
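Conceptually, CAA (Contrastive Activation Addition) derives each layer's steering vector as the difference between the mean activations on the two sides of the contrast pairs. The NumPy sketch below illustrates that arithmetic only; the array shapes and variable names are illustrative, not rotalabs-steer's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # hidden size (illustrative; real models use thousands)

# Final-token activations at one layer for each side of the contrast pairs
pos_acts = rng.normal(size=(16, d_model))  # e.g. refusal completions
neg_acts = rng.normal(size=(16, d_model))  # e.g. compliant completions

# The steering vector is the difference of the per-side mean activations
steering_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
assert steering_vector.shape == (d_model,)
```

Adding this vector to a layer's activations pushes generation toward the "positive" behavior; subtracting it pushes the other way.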
### 2. Apply Steering at Inference
```python
from rotalabs_steer import ActivationInjector, SteeringVector

# Load pre-extracted vector
vector = SteeringVector.load("./refusal_vectors/layer_15")

# Create injector
injector = ActivationInjector(model, [vector], strength=1.0)

# Generate with steering
with injector:
    # Move inputs to the model's device before generating
    inputs = tokenizer("How do I hack a computer?", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=100)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
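During generation, the injector adds the vector, scaled by `strength`, to the hidden states at the target layer. A toy NumPy illustration of that step (the names and shapes are illustrative, not the library's internals):

```python
import numpy as np

hidden = np.zeros((4, 8))  # (seq_len, d_model) hidden states at the target layer
vector = np.ones(8)        # steering vector for that layer
strength = 1.0

# Broadcast-add the scaled vector at every token position
steered = hidden + strength * vector
assert steered.shape == hidden.shape
```

Setting `strength` to 0 disables steering, and a negative value pushes generation in the opposite direction.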
### 3. Use with LangChain
```python
from rotalabs_steer.integrations.langchain import SteeredChatModel
from langchain_core.messages import HumanMessage, SystemMessage

# Create steered chat model
chat = SteeredChatModel(
    model_name="Qwen/Qwen3-8B",
    steering_configs={
        "refusal": {
            "vector_path": "./refusal_vectors/layer_15",
            "strength": 1.0,
        },
    },
)

# Use like any LangChain chat model
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hello!"),
]
response = chat.invoke(messages)

# Adjust steering at runtime
chat.set_strength("refusal", 0.5)
```
## Next Steps
- Read Core Concepts to understand how steering works
- Follow Extract Your First Vector for a detailed walkthrough
- See API Reference for full documentation