Model Support

Pre-configured Models

The following models have pre-configured settings, including recommended layers for each behavior:

Qwen3 Family

| Model | Layers | Hidden Size | Refusal Layers |
|---|---|---|---|
| Qwen/Qwen3-4B | 36 | 2560 | 14-18 |
| Qwen/Qwen3-8B | 36 | 4096 | 14-18 |
| Qwen/Qwen3-14B | 48 | 5120 | 20-24 |

DeepSeek

| Model | Layers | Hidden Size | Refusal Layers |
|---|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 48 | 5120 | 20-24 |

Llama 3.1

| Model | Layers | Hidden Size | Refusal Layers |
|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 32 | 4096 | 14-16 |
| meta-llama/Llama-3.1-70B-Instruct | 80 | 8192 | 35-40 |

Mistral

| Model | Layers | Hidden Size | Refusal Layers |
|---|---|---|---|
| mistralai/Mistral-7B-Instruct-v0.2 | 32 | 4096 | 12-20 |
| mistralai/Mistral-7B-Instruct-v0.3 | 32 | 4096 | 12-20 |

Gemma 2

| Model | Layers | Hidden Size | Refusal Layers |
|---|---|---|---|
| google/gemma-2-9b-it | 42 | 3584 | 14-22 |

Using Pre-configured Models

from rotalabs_steer import get_model_config, MODEL_CONFIGS

# Get config by exact name
config = get_model_config("Qwen/Qwen3-8B")

# Get recommended layers for a behavior
refusal_layers = config.get_recommended_layers("refusal")
uncertainty_layers = config.get_recommended_layers("uncertainty")

# List all pre-configured models
print(list(MODEL_CONFIGS.keys()))

Each model has empirically determined recommended layers:

| Behavior | Typical Layer Range | Notes |
|---|---|---|
| refusal | Middle-upper layers | Where safety behaviors are encoded |
| uncertainty | Middle layers | Confidence representations |
| tool_restraint | Upper-middle layers | Tool-use decision making |
| instruction_hierarchy | Middle layers | Instruction processing |
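
To see the concrete indices behind these ranges for a given model, you can query each behavior in turn. A minimal sketch using only the get_model_config and get_recommended_layers calls shown above:

from rotalabs_steer import get_model_config

config = get_model_config("Qwen/Qwen3-8B")

# The four behavior names match the rows of the table above.
for behavior in ["refusal", "uncertainty", "tool_restraint", "instruction_hierarchy"]:
    print(f"{behavior}: {config.get_recommended_layers(behavior)}")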

Auto-inference for Other Models

For models not in the pre-configured list, the package can infer a configuration automatically:

from rotalabs_steer import infer_model_config
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("some/other-model")
config = infer_model_config(model)

print(f"Layers: {config.num_layers}")
print(f"Hidden size: {config.hidden_size}")
print(f"Default recommended layers: {config.get_recommended_layers('refusal')}")

Auto-inference:

1. Tries to find a matching config by model name
2. Falls back to inspecting the model structure
3. Returns the middle-third layers as the default recommendation
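
The middle-third fallback is simple arithmetic over the layer count. The helper below is hypothetical (not part of the package) and only illustrates the computation:

def middle_third_layers(num_layers: int) -> list[int]:
    # Hypothetical re-implementation of the default: take the middle
    # third of the stack, e.g. layers 10-20 for a 32-layer model.
    start = num_layers // 3
    end = 2 * num_layers // 3
    return list(range(start, end))

print(middle_third_layers(32))  # [10, 11, ..., 20]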

Supported Architectures

The hook system supports models with these layer structures:

Llama/Qwen/Mistral Style

model.model.layers[i]             # Residual stream
model.model.layers[i].mlp         # MLP output
model.model.layers[i].self_attn   # Attention output

GPT-2 Style

model.transformer.h[i]         # Residual stream
model.transformer.h[i].mlp     # MLP output
model.transformer.h[i].attn    # Attention output
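
A dispatch over these two layouts can be written with plain attribute checks. This is a hedged sketch, not the package's internal resolver; the helper name is hypothetical:

def get_layer_module(model, layer_idx: int):
    # Resolve the residual-stream module for either supported layout.
    if hasattr(model, "model") and hasattr(model.model, "layers"):
        return model.model.layers[layer_idx]    # Llama/Qwen/Mistral style
    if hasattr(model, "transformer") and hasattr(model.transformer, "h"):
        return model.transformer.h[layer_idx]   # GPT-2 style
    raise ValueError(f"Unsupported architecture: {type(model).__name__}")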

Adding Support for New Models

To add a new model configuration:

from rotalabs_steer import ModelConfig, MODEL_CONFIGS

MODEL_CONFIGS["my-org/my-model"] = ModelConfig(
    name="my-org/my-model",
    num_layers=32,
    hidden_size=4096,
    recommended_layers={
        "refusal": [12, 14, 16, 18],
        "uncertainty": [10, 12, 14],
        "tool_restraint": [14, 16, 18],
        "instruction_hierarchy": [12, 14, 16],
    },
)
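
Once registered, the new entry should resolve through get_model_config("my-org/my-model") just like the built-in configs.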

Finding Optimal Layers

To find the best layers for your specific model and behavior:

from rotalabs_steer.extraction import extract_caa_vectors
from rotalabs_steer.evaluation import strength_sweep

# Extract from many layers
all_layers = list(range(10, 25))  # Test layers 10-24
vectors = extract_caa_vectors(
    model=model,
    tokenizer=tokenizer,
    contrast_pairs=dataset,
    layer_indices=all_layers,
)

# Evaluate each layer
results = {}
for layer_idx in all_layers:
    vector = vectors[layer_idx]
    sweep = strength_sweep(
        model=model,
        tokenizer=tokenizer,
        steering_vector=vector,
        test_prompts=test_prompts,
        is_target_behavior_fn=is_refusal,
        strengths=[1.0],
    )
    results[layer_idx] = sweep[0]["behavior_rate"]

# Find best layers
sorted_layers = sorted(results.items(), key=lambda x: x[1], reverse=True)
print("Best layers:")
for layer, rate in sorted_layers[:5]:
    print(f"  Layer {layer}: {rate:.2%}")

Memory Requirements

Approximate VRAM requirements for inference:

| Model Size | float16 | float32 |
|---|---|---|
| 4B | ~8 GB | ~16 GB |
| 7-8B | ~16 GB | ~32 GB |
| 13-14B | ~28 GB | ~56 GB |
| 70B | ~140 GB | ~280 GB |

Steering vectors add negligible memory overhead (~1-4 MB per vector).
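
These figures follow roughly from parameter count times bytes per parameter (2 for float16, 4 for float32), with real usage somewhat higher once activations and the KV cache are included. A back-of-the-envelope sketch (the helper is hypothetical):

def estimate_weight_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    # Weights only: billions of parameters * bytes per parameter = GB.
    return params_billions * bytes_per_param

print(estimate_weight_vram_gb(8))     # ~16 GB in float16
print(estimate_weight_vram_gb(8, 4))  # ~32 GB in float32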