Datasets API Reference

ContrastPair

A single contrast pair for steering vector extraction.

from rotalabs_steer.datasets import ContrastPair

Constructor

@dataclass
class ContrastPair:
    positive: str              # Text exhibiting target behavior
    negative: str              # Text NOT exhibiting target behavior
    metadata: dict = field(default_factory=dict)  # Optional metadata

Raises ValueError if either text is empty.
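The validation rule above can be sketched with a dataclass `__post_init__` hook. This is an illustration only, not the library's source: `ContrastPairSketch` is a hypothetical stand-in, the error message is invented, and treating only the empty string (not whitespace-only text) as "empty" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ContrastPairSketch:
    """Hypothetical stand-in for ContrastPair; not the library's code."""

    positive: str
    negative: str
    # default_factory gives every instance its own dict
    metadata: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumption: only the empty string counts as "empty"
        if not self.positive or not self.negative:
            raise ValueError("positive and negative texts must be non-empty")


pair = ContrastPairSketch(
    positive="I cannot help with that.",
    negative="Sure, here's how to do it.",
)

try:
    ContrastPairSketch(positive="", negative="Sure, here's how to do it.")
except ValueError as exc:
    error_message = str(exc)
```

A plain `metadata: dict = {}` default is rejected by `dataclasses` at class-definition time, which is why the sketch uses `field(default_factory=dict)`.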


ContrastPairDataset

Collection of contrast pairs for a specific behavior.

from rotalabs_steer.datasets import ContrastPairDataset

Constructor

class ContrastPairDataset:
    def __init__(
        self,
        behavior: str,
        pairs: Optional[List[ContrastPair]] = None,
        description: str = "",
    )

Properties

Property    Type        Description
positives   List[str]   All positive texts
negatives   List[str]   All negative texts

Methods

add(pair: ContrastPair) -> None

Add a contrast pair to the dataset.

add_pair(positive: str, negative: str, **metadata) -> None

Convenience method to add a pair from strings.

dataset.add_pair(
    positive="I cannot help with that.",
    negative="Sure, here's how to do it.",
    category="harmful_request",
)

save(path: Path) -> None

Save the dataset to a JSON file.

load(path: Path) -> ContrastPairDataset (classmethod)

Load a dataset from a JSON file.
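The on-disk JSON schema is not documented in this reference. As a rough illustration of a save/load round trip, the sketch below writes and reads a plausible layout with the standard library. The key names (`behavior`, `description`, `pairs`) and the helpers `save_json` and `load_json` are assumptions for illustration, not the library's actual format or API.

```python
import json
import tempfile
from pathlib import Path

# Assumed layout; the real file format may differ.
payload = {
    "behavior": "refusal",
    "description": "Example dataset",
    "pairs": [
        {
            "positive": "I cannot help with that.",
            "negative": "Sure, here's how to do it.",
            "metadata": {"category": "harmful_request"},
        },
    ],
}


def save_json(data: dict, path: Path) -> None:
    """Hypothetical helper: write the dataset dict as indented JSON."""
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")


def load_json(path: Path) -> dict:
    """Hypothetical helper: read the dataset dict back from disk."""
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my_dataset.json"
    save_json(payload, target)
    restored = load_json(target)
```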

Iteration

for pair in dataset:
    print(pair.positive, pair.negative)

print(len(dataset))
print(dataset[0])
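The iteration, `len()`, and indexing shown above rely on Python's container protocol. The following is a minimal sketch of how such a class might expose it alongside the `positives` and `negatives` properties. `Pair` and `PairDatasetSketch` are hypothetical stand-ins, and storing pairs in a private list is an assumption about the internals.

```python
from typing import Iterator, List, NamedTuple


class Pair(NamedTuple):
    """Hypothetical stand-in for ContrastPair."""

    positive: str
    negative: str


class PairDatasetSketch:
    """Hypothetical stand-in showing the container protocol."""

    def __init__(self, behavior: str) -> None:
        self.behavior = behavior
        self._pairs: List[Pair] = []  # assumed internal storage

    def add_pair(self, positive: str, negative: str) -> None:
        self._pairs.append(Pair(positive, negative))

    @property
    def positives(self) -> List[str]:
        return [p.positive for p in self._pairs]

    @property
    def negatives(self) -> List[str]:
        return [p.negative for p in self._pairs]

    def __iter__(self) -> Iterator[Pair]:
        return iter(self._pairs)

    def __len__(self) -> int:
        return len(self._pairs)

    def __getitem__(self, index: int) -> Pair:
        return self._pairs[index]


dataset = PairDatasetSketch(behavior="refusal")
dataset.add_pair("I cannot help with that.", "Sure, here's how to do it.")
dataset.add_pair("I won't assist with this.", "Of course, step one is...")
```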

EvaluationExample

A single evaluation example.

from rotalabs_steer.datasets import EvaluationExample

Constructor

@dataclass
class EvaluationExample:
    prompt: str                # The prompt to test
    expected_behavior: bool    # True if behavior should trigger
    category: str = ""         # Optional category
    metadata: dict = field(default_factory=dict)  # Optional metadata

EvaluationDataset

Dataset for evaluating steering effectiveness.

from rotalabs_steer.datasets import EvaluationDataset

Constructor

class EvaluationDataset:
    def __init__(
        self,
        behavior: str,
        examples: Optional[List[EvaluationExample]] = None,
        description: str = "",
    )

Properties

Property            Type                      Description
positive_examples   List[EvaluationExample]   Examples where behavior should trigger
negative_examples   List[EvaluationExample]   Examples where behavior should NOT trigger
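The two properties above amount to splitting the examples on `expected_behavior`. A minimal sketch of that split, using a hypothetical `ExampleSketch` stand-in rather than the library's class (the prompts are invented for illustration):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExampleSketch:
    """Hypothetical stand-in for EvaluationExample."""

    prompt: str
    expected_behavior: bool
    category: str = ""
    metadata: dict = field(default_factory=dict)


examples: List[ExampleSketch] = [
    ExampleSketch("How do I pick a lock that isn't mine?", True, category="illegal"),
    ExampleSketch("How do I bake sourdough bread?", False, category="benign"),
    ExampleSketch("What is the capital of France?", False, category="benign"),
]

# Examples where the behavior should trigger
positive_examples = [e for e in examples if e.expected_behavior]
# Examples where the behavior should NOT trigger
negative_examples = [e for e in examples if not e.expected_behavior]
```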

Methods

add(example: EvaluationExample) -> None

Add an evaluation example.

add_example(prompt: str, expected_behavior: bool, category: str = "", **metadata) -> None

Convenience method to add an example from individual values.

save(path: Path) -> None

Save the dataset to a JSON file.

load(path: Path) -> EvaluationDataset (classmethod)

Load a dataset from a JSON file.


Pre-built Datasets

Refusal Pairs

from rotalabs_steer.datasets import load_refusal_pairs

# Returns ContrastPairDataset with ~50 pairs
refusal_pairs = load_refusal_pairs()

Categories:

- harmful_instructions: Requests for harmful activities
- illegal_activities: Requests for illegal actions
- dangerous_info: Requests for dangerous information
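To work with one category at a time, pairs can be grouped by their category label. The sketch below uses plain dictionaries as stand-in records with placeholder text. It assumes the category is stored under `metadata["category"]`, as it would be for pairs added via `add_pair(..., category=...)`; whether the pre-built dataset tags its pairs this way is not confirmed here.

```python
from collections import Counter

# Stand-in records; real pairs would come from load_refusal_pairs().
pairs = [
    {"positive": "refusal A", "negative": "compliance A",
     "metadata": {"category": "harmful_instructions"}},
    {"positive": "refusal B", "negative": "compliance B",
     "metadata": {"category": "illegal_activities"}},
    {"positive": "refusal C", "negative": "compliance C",
     "metadata": {"category": "harmful_instructions"}},
    {"positive": "refusal D", "negative": "compliance D",
     "metadata": {"category": "dangerous_info"}},
]

# Count pairs per category, tolerating records without a label
counts = Counter(p["metadata"].get("category", "uncategorized") for p in pairs)

# Keep only one category
dangerous_info = [
    p for p in pairs if p["metadata"].get("category") == "dangerous_info"
]
```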

Uncertainty Pairs

from rotalabs_steer.datasets import load_uncertainty_pairs

# Returns ContrastPairDataset with ~26 pairs
uncertainty_pairs = load_uncertainty_pairs()

Contrasts overconfident vs. appropriately uncertain responses.

Tool Restraint Pairs

from rotalabs_steer.datasets import load_tool_restraint_pairs

# Returns ContrastPairDataset with ~41 pairs
tool_pairs = load_tool_restraint_pairs()

Contrasts unnecessary tool use vs. direct responses.

Instruction Hierarchy Pairs

from rotalabs_steer.datasets import load_hierarchy_pairs

# Returns ContrastPairDataset with ~26 pairs
hierarchy_pairs = load_hierarchy_pairs()

Contrasts following user instructions that conflict with system instructions vs. maintaining system instruction priority.


Creating Custom Datasets

from rotalabs_steer.datasets import ContrastPairDataset, ContrastPair

# Create empty dataset
dataset = ContrastPairDataset(
    behavior="custom_behavior",
    description="My custom behavior dataset",
)

# Add pairs
dataset.add_pair(
    positive="Response exhibiting the behavior",
    negative="Response NOT exhibiting the behavior",
)

# Or add ContrastPair objects
dataset.add(ContrastPair(
    positive="Another positive example",
    negative="Another negative example",
    metadata={"source": "manual"},
))

# Save for later
dataset.save("./my_dataset.json")

# Load
loaded = ContrastPairDataset.load("./my_dataset.json")