Integrations

API integrations for different LLM providers.

Base

Base classes and protocols for model API integrations.

This module defines the interface that all model API implementations must follow for compatibility with the sandbagging detection framework.

ModelResponse dataclass

Response from a model API call.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| text | str | The generated text response |
| model | str | The model identifier used |
| usage | Dict[str, int] | Token usage statistics |
| latency_ms | float | Response latency in milliseconds |
| metadata | Dict[str, Any] | Additional provider-specific metadata |

Source code in src/rotalabs_probe/integrations/base.py
@dataclass
class ModelResponse:
    """Response from a model API call.

    Attributes:
        text: The generated text response
        model: The model identifier used
        usage: Token usage statistics
        latency_ms: Response latency in milliseconds
        metadata: Additional provider-specific metadata
    """

    text: str
    model: str
    usage: Dict[str, int] = field(default_factory=dict)
    latency_ms: float = 0.0
    metadata: Dict[str, Any] = field(default_factory=dict)
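
A minimal construction sketch for ModelResponse; the values below are purely illustrative, and the import path is inferred from the source location shown above (rotalabs_probe.integrations.base).

from rotalabs_probe.integrations.base import ModelResponse

response = ModelResponse(
    text="The answer is 4.",
    model="example-model",  # hypothetical identifier
    usage={"input_tokens": 12, "output_tokens": 6},
    latency_ms=350.0,
    metadata={"stop_reason": "end_turn"},
)

# Fields left unset fall back to their defaults (empty dicts, 0.0 latency).
print(response.text, response.usage["output_tokens"])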

ModelAPI

Bases: Protocol

Protocol defining the interface for model API integrations.

All model API implementations should conform to this protocol to ensure compatibility with the sandbagging detection framework.

The protocol defines two main methods:

- generate: Standard text generation
- generate_with_perturbation: Generation with noise injection

Example

>>> class MyModelAPI:
...     def generate(self, prompt: str, **kwargs) -> str:
...         # Implementation here
...         return "response"
...
...     def generate_with_perturbation(
...         self, prompt: str, noise_level: float = 0.1, **kwargs
...     ) -> str:
...         # Implementation with perturbation
...         return "perturbed response"

Source code in src/rotalabs_probe/integrations/base.py
@runtime_checkable
class ModelAPI(Protocol):
    """Protocol defining the interface for model API integrations.

    All model API implementations should conform to this protocol
    to ensure compatibility with the sandbagging detection framework.

    The protocol defines two main methods:
    - generate: Standard text generation
    - generate_with_perturbation: Generation with noise injection

    Example:
        >>> class MyModelAPI:
        ...     def generate(self, prompt: str, **kwargs) -> str:
        ...         # Implementation here
        ...         return "response"
        ...
        ...     def generate_with_perturbation(
        ...         self, prompt: str, noise_level: float = 0.1, **kwargs
        ...     ) -> str:
        ...         # Implementation with perturbation
        ...         return "perturbed response"
    """

    def generate(self, prompt: str, **kwargs: Any) -> str:
        """Generate a response from the model.

        Args:
            prompt: The input prompt to send to the model
            **kwargs: Additional generation parameters (temperature, max_tokens, etc.)

        Returns:
            The model's text response
        """
        ...

    def generate_with_perturbation(
        self, prompt: str, noise_level: float = 0.1, **kwargs: Any
    ) -> str:
        """Generate a response with noise injection.

        This method is used for noise injection probing in sandbagging detection.
        Implementations should apply perturbation at the specified level.

        Perturbation strategies may include:
        - Temperature scaling
        - Prompt perturbation (typos, rephrasing)
        - Token sampling modifications
        - Logit noise injection (if accessible)

        Args:
            prompt: The input prompt
            noise_level: Level of perturbation (0.0 = none, 1.0 = maximum)
            **kwargs: Additional generation parameters

        Returns:
            The model's response with perturbation applied
        """
        ...

    def generate_batch(
        self, prompts: List[str], **kwargs: Any
    ) -> List[str]:
        """Generate responses for multiple prompts.

        Default implementation calls generate() for each prompt.
        Implementations may override for batch optimization.

        Args:
            prompts: List of input prompts
            **kwargs: Additional generation parameters

        Returns:
            List of model responses
        """
        ...

    def get_model_info(self) -> Dict[str, Any]:
        """Get information about the configured model.

        Returns:
            Dictionary with model metadata (name, version, capabilities)
        """
        ...
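
Because ModelAPI is decorated with @runtime_checkable, conformance can be verified with isinstance at runtime (this checks method names only, not signatures). A minimal sketch with a hypothetical EchoModelAPI stub, assuming the import path implied by the source location above:

from typing import Any, Dict, List

from rotalabs_probe.integrations.base import ModelAPI


class EchoModelAPI:
    """Toy implementation used only to illustrate protocol conformance."""

    def generate(self, prompt: str, **kwargs: Any) -> str:
        return f"echo: {prompt}"

    def generate_with_perturbation(
        self, prompt: str, noise_level: float = 0.1, **kwargs: Any
    ) -> str:
        return f"echo ({noise_level:.2f} noise): {prompt}"

    def generate_batch(self, prompts: List[str], **kwargs: Any) -> List[str]:
        return [self.generate(p, **kwargs) for p in prompts]

    def get_model_info(self) -> Dict[str, Any]:
        return {"provider": "example", "model": "echo"}


assert isinstance(EchoModelAPI(), ModelAPI)  # structural check only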

generate(prompt: str, **kwargs: Any) -> str

Generate a response from the model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The input prompt to send to the model | required |
| **kwargs | Any | Additional generation parameters (temperature, max_tokens, etc.) | {} |

Returns:

| Type | Description |
| --- | --- |
| str | The model's text response |

Source code in src/rotalabs_probe/integrations/base.py
def generate(self, prompt: str, **kwargs: Any) -> str:
    """Generate a response from the model.

    Args:
        prompt: The input prompt to send to the model
        **kwargs: Additional generation parameters (temperature, max_tokens, etc.)

    Returns:
        The model's text response
    """
    ...

generate_with_perturbation(prompt: str, noise_level: float = 0.1, **kwargs: Any) -> str

Generate a response with noise injection.

This method is used for noise injection probing in sandbagging detection. Implementations should apply perturbation at the specified level.

Perturbation strategies may include:

- Temperature scaling
- Prompt perturbation (typos, rephrasing)
- Token sampling modifications
- Logit noise injection (if accessible)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The input prompt | required |
| noise_level | float | Level of perturbation (0.0 = none, 1.0 = maximum) | 0.1 |
| **kwargs | Any | Additional generation parameters | {} |

Returns:

| Type | Description |
| --- | --- |
| str | The model's response with perturbation applied |

Source code in src/rotalabs_probe/integrations/base.py
def generate_with_perturbation(
    self, prompt: str, noise_level: float = 0.1, **kwargs: Any
) -> str:
    """Generate a response with noise injection.

    This method is used for noise injection probing in sandbagging detection.
    Implementations should apply perturbation at the specified level.

    Perturbation strategies may include:
    - Temperature scaling
    - Prompt perturbation (typos, rephrasing)
    - Token sampling modifications
    - Logit noise injection (if accessible)

    Args:
        prompt: The input prompt
        noise_level: Level of perturbation (0.0 = none, 1.0 = maximum)
        **kwargs: Additional generation parameters

    Returns:
        The model's response with perturbation applied
    """
    ...

generate_batch(prompts: List[str], **kwargs: Any) -> List[str]

Generate responses for multiple prompts.

Default implementation calls generate() for each prompt. Implementations may override for batch optimization.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompts | List[str] | List of input prompts | required |
| **kwargs | Any | Additional generation parameters | {} |

Returns:

| Type | Description |
| --- | --- |
| List[str] | List of model responses |

Source code in src/rotalabs_probe/integrations/base.py
def generate_batch(
    self, prompts: List[str], **kwargs: Any
) -> List[str]:
    """Generate responses for multiple prompts.

    Default implementation calls generate() for each prompt.
    Implementations may override for batch optimization.

    Args:
        prompts: List of input prompts
        **kwargs: Additional generation parameters

    Returns:
        List of model responses
    """
    ...

get_model_info() -> Dict[str, Any]

Get information about the configured model.

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Dictionary with model metadata (name, version, capabilities) |

Source code in src/rotalabs_probe/integrations/base.py
def get_model_info(self) -> Dict[str, Any]:
    """Get information about the configured model.

    Returns:
        Dictionary with model metadata (name, version, capabilities)
    """
    ...

Anthropic

Anthropic Claude API integration for sandbagging detection.

This module provides an implementation of the ModelAPI protocol for Anthropic's Claude models, enabling sandbagging detection experiments with Claude.

Requires the anthropic package: pip install anthropic. Set the ANTHROPIC_API_KEY environment variable for authentication.
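
Assuming the package is installed and ANTHROPIC_API_KEY is set, a minimal usage sketch (the import path is inferred from the source location shown below; the model defaults to the class default):

from rotalabs_probe.integrations.anthropic_api import AnthropicModelAPI

api = AnthropicModelAPI(max_tokens=256)  # reads ANTHROPIC_API_KEY from the environment
baseline = api.generate("What is 2+2?")
perturbed = api.generate_with_perturbation("What is 2+2?", noise_level=0.4)
print(baseline)
print(perturbed)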

AnthropicModelAPI

Anthropic Claude API implementation.

This class provides integration with Anthropic's Claude models for use in sandbagging detection experiments.

Attributes:

| Name | Description |
| --- | --- |
| model | The Claude model to use (e.g., "claude-3-opus-20240229") |
| max_tokens | Maximum tokens to generate |
| temperature | Sampling temperature |

Example

>>> api = AnthropicModelAPI(model="claude-3-sonnet-20240229")
>>> response = api.generate("What is 2+2?")
>>> print(response)
"The answer is 4."

Source code in src/rotalabs_probe/integrations/anthropic_api.py
class AnthropicModelAPI:
    """Anthropic Claude API implementation.

    This class provides integration with Anthropic's Claude models
    for use in sandbagging detection experiments.

    Attributes:
        model: The Claude model to use (e.g., "claude-3-opus-20240229")
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature

    Example:
        >>> api = AnthropicModelAPI(model="claude-3-sonnet-20240229")
        >>> response = api.generate("What is 2+2?")
        >>> print(response)
        "The answer is 4."
    """

    def __init__(
        self,
        model: str = "claude-sonnet-4-20250514",
        max_tokens: int = 1024,
        temperature: float = 0.7,
        api_key: Optional[str] = None,
    ) -> None:
        """Initialize Anthropic API client.

        Args:
            model: Claude model identifier
            max_tokens: Maximum tokens to generate
            temperature: Sampling temperature (0.0-1.0)
            api_key: API key (defaults to ANTHROPIC_API_KEY env var)

        Raises:
            ImportError: If anthropic package is not installed
            ValueError: If no API key is provided or found
        """
        self.model = model
        self.max_tokens = max_tokens
        self.temperature = temperature

        # Get API key
        self._api_key = api_key or os.environ.get("ANTHROPIC_API_KEY")
        if not self._api_key:
            raise ValueError(
                "Anthropic API key required. Set ANTHROPIC_API_KEY environment "
                "variable or pass api_key parameter."
            )

        # Lazy import to avoid dependency issues
        self._client: Optional[Any] = None

    def _get_client(self) -> Any:
        """Get or create Anthropic client (lazy init)."""
        if self._client is None:
            try:
                import anthropic
                self._client = anthropic.Anthropic(api_key=self._api_key)
            except ImportError:
                raise ImportError(
                    "anthropic package required. Install with: pip install anthropic"
                )
        return self._client

    def generate(self, prompt: str, **kwargs: Any) -> str:
        """Generate a response from Claude.

        Args:
            prompt: The input prompt
            **kwargs: Additional parameters (temperature, max_tokens, system)

        Returns:
            The model's text response
        """
        client = self._get_client()

        # Extract parameters with defaults
        temperature = kwargs.get("temperature", self.temperature)
        max_tokens = kwargs.get("max_tokens", self.max_tokens)
        system = kwargs.get("system", "You are a helpful assistant.")

        start_time = time.time()

        response = client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            temperature=temperature,
            system=system,
            messages=[{"role": "user", "content": prompt}],
        )

        latency_ms = (time.time() - start_time) * 1000

        # Extract text from response
        text = response.content[0].text if response.content else ""

        return text

    def generate_with_perturbation(
        self, prompt: str, noise_level: float = 0.1, **kwargs: Any
    ) -> str:
        """Generate a response with noise injection.

        Applies perturbation through:
        1. Temperature scaling based on noise level
        2. Optional prompt perturbation

        Args:
            prompt: The input prompt
            noise_level: Perturbation level (0.0-1.0)
            **kwargs: Additional generation parameters

        Returns:
            The perturbed response
        """
        # Scale temperature with noise level
        base_temp = kwargs.get("temperature", self.temperature)
        perturbed_temp = min(base_temp + noise_level * 0.5, 1.0)

        # Apply prompt perturbation at higher noise levels
        perturbed_prompt = prompt
        if noise_level > 0.3:
            perturbed_prompt = self._perturb_prompt(prompt, noise_level)

        return self.generate(
            perturbed_prompt,
            temperature=perturbed_temp,
            **{k: v for k, v in kwargs.items() if k != "temperature"},
        )

    def _perturb_prompt(self, prompt: str, noise_level: float) -> str:
        """Add slight noise to prompt to disrupt sandbagging."""
        # HACK: these perturbations are pretty crude, but they work ok
        # Simple perturbation: add subtle variations
        perturbations = [
            lambda p: p.replace(".", ".."),
            lambda p: "Um, " + p,
            lambda p: p + " Please respond.",
            lambda p: "I was wondering, " + p.lower(),
        ]

        # Apply perturbation based on noise level
        if random.random() < noise_level:
            perturbation = random.choice(perturbations)
            return perturbation(prompt)

        return prompt

    def generate_batch(self, prompts: List[str], **kwargs: Any) -> List[str]:
        """Generate responses for multiple prompts."""
        # NOTE: could parallelize this but sequential is fine for now
        return [self.generate(prompt, **kwargs) for prompt in prompts]

    def get_model_info(self) -> Dict[str, Any]:
        """Get information about the configured model.

        Returns:
            Dictionary with model metadata
        """
        return {
            "provider": "anthropic",
            "model": self.model,
            "max_tokens": self.max_tokens,
            "temperature": self.temperature,
            "capabilities": ["text_generation", "analysis", "coding"],
        }

    def generate_with_response(
        self, prompt: str, **kwargs: Any
    ) -> ModelResponse:
        """Generate a response with full metadata.

        Args:
            prompt: The input prompt
            **kwargs: Additional parameters

        Returns:
            ModelResponse with text and metadata
        """
        client = self._get_client()

        temperature = kwargs.get("temperature", self.temperature)
        max_tokens = kwargs.get("max_tokens", self.max_tokens)
        system = kwargs.get("system", "You are a helpful assistant.")

        start_time = time.time()

        response = client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            temperature=temperature,
            system=system,
            messages=[{"role": "user", "content": prompt}],
        )

        latency_ms = (time.time() - start_time) * 1000

        text = response.content[0].text if response.content else ""

        return ModelResponse(
            text=text,
            model=self.model,
            usage={
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
            },
            latency_ms=latency_ms,
            metadata={
                "stop_reason": response.stop_reason,
                "model": response.model,
            },
        )

__init__(model: str = 'claude-sonnet-4-20250514', max_tokens: int = 1024, temperature: float = 0.7, api_key: Optional[str] = None) -> None

Initialize Anthropic API client.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | str | Claude model identifier | 'claude-sonnet-4-20250514' |
| max_tokens | int | Maximum tokens to generate | 1024 |
| temperature | float | Sampling temperature (0.0-1.0) | 0.7 |
| api_key | Optional[str] | API key (defaults to ANTHROPIC_API_KEY env var) | None |

Raises:

| Type | Description |
| --- | --- |
| ImportError | If anthropic package is not installed |
| ValueError | If no API key is provided or found |

Source code in src/rotalabs_probe/integrations/anthropic_api.py
def __init__(
    self,
    model: str = "claude-sonnet-4-20250514",
    max_tokens: int = 1024,
    temperature: float = 0.7,
    api_key: Optional[str] = None,
) -> None:
    """Initialize Anthropic API client.

    Args:
        model: Claude model identifier
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature (0.0-1.0)
        api_key: API key (defaults to ANTHROPIC_API_KEY env var)

    Raises:
        ImportError: If anthropic package is not installed
        ValueError: If no API key is provided or found
    """
    self.model = model
    self.max_tokens = max_tokens
    self.temperature = temperature

    # Get API key
    self._api_key = api_key or os.environ.get("ANTHROPIC_API_KEY")
    if not self._api_key:
        raise ValueError(
            "Anthropic API key required. Set ANTHROPIC_API_KEY environment "
            "variable or pass api_key parameter."
        )

    # Lazy import to avoid dependency issues
    self._client: Optional[Any] = None

generate(prompt: str, **kwargs: Any) -> str

Generate a response from Claude.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The input prompt | required |
| **kwargs | Any | Additional parameters (temperature, max_tokens, system) | {} |

Returns:

| Type | Description |
| --- | --- |
| str | The model's text response |

Source code in src/rotalabs_probe/integrations/anthropic_api.py
def generate(self, prompt: str, **kwargs: Any) -> str:
    """Generate a response from Claude.

    Args:
        prompt: The input prompt
        **kwargs: Additional parameters (temperature, max_tokens, system)

    Returns:
        The model's text response
    """
    client = self._get_client()

    # Extract parameters with defaults
    temperature = kwargs.get("temperature", self.temperature)
    max_tokens = kwargs.get("max_tokens", self.max_tokens)
    system = kwargs.get("system", "You are a helpful assistant.")

    start_time = time.time()

    response = client.messages.create(
        model=self.model,
        max_tokens=max_tokens,
        temperature=temperature,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )

    latency_ms = (time.time() - start_time) * 1000

    # Extract text from response
    text = response.content[0].text if response.content else ""

    return text

generate_with_perturbation(prompt: str, noise_level: float = 0.1, **kwargs: Any) -> str

Generate a response with noise injection.

Applies perturbation through:

1. Temperature scaling based on noise level
2. Optional prompt perturbation

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The input prompt | required |
| noise_level | float | Perturbation level (0.0-1.0) | 0.1 |
| **kwargs | Any | Additional generation parameters | {} |

Returns:

| Type | Description |
| --- | --- |
| str | The perturbed response |

Source code in src/rotalabs_probe/integrations/anthropic_api.py
def generate_with_perturbation(
    self, prompt: str, noise_level: float = 0.1, **kwargs: Any
) -> str:
    """Generate a response with noise injection.

    Applies perturbation through:
    1. Temperature scaling based on noise level
    2. Optional prompt perturbation

    Args:
        prompt: The input prompt
        noise_level: Perturbation level (0.0-1.0)
        **kwargs: Additional generation parameters

    Returns:
        The perturbed response
    """
    # Scale temperature with noise level
    base_temp = kwargs.get("temperature", self.temperature)
    perturbed_temp = min(base_temp + noise_level * 0.5, 1.0)

    # Apply prompt perturbation at higher noise levels
    perturbed_prompt = prompt
    if noise_level > 0.3:
        perturbed_prompt = self._perturb_prompt(prompt, noise_level)

    return self.generate(
        perturbed_prompt,
        temperature=perturbed_temp,
        **{k: v for k, v in kwargs.items() if k != "temperature"},
    )
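
For intuition, with the default temperature of 0.7 and noise_level=0.4 the effective temperature becomes min(0.7 + 0.4 * 0.5, 1.0) = 0.9, and because 0.4 > 0.3 the prompt itself may also be lightly perturbed. A short illustrative call, assuming an AnthropicModelAPI instance as above:

api = AnthropicModelAPI()
# Effective temperature: min(0.7 + 0.4 * 0.5, 1.0) = 0.9
noisy = api.generate_with_perturbation("Name three prime numbers.", noise_level=0.4)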

generate_batch(prompts: List[str], **kwargs: Any) -> List[str]

Generate responses for multiple prompts.

Source code in src/rotalabs_probe/integrations/anthropic_api.py
def generate_batch(self, prompts: List[str], **kwargs: Any) -> List[str]:
    """Generate responses for multiple prompts."""
    # NOTE: could parallelize this but sequential is fine for now
    return [self.generate(prompt, **kwargs) for prompt in prompts]
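
If throughput matters, one way to parallelize the sequential loop noted above is a thread pool. This is a sketch, not part of the library, and it does not handle Anthropic rate limits or retries:

from concurrent.futures import ThreadPoolExecutor
from typing import Any, List


def generate_batch_parallel(
    api: AnthropicModelAPI, prompts: List[str], max_workers: int = 4, **kwargs: Any
) -> List[str]:
    # pool.map preserves input order; each prompt still goes through api.generate().
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: api.generate(p, **kwargs), prompts))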

get_model_info() -> Dict[str, Any]

Get information about the configured model.

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Dictionary with model metadata |

Source code in src/rotalabs_probe/integrations/anthropic_api.py
def get_model_info(self) -> Dict[str, Any]:
    """Get information about the configured model.

    Returns:
        Dictionary with model metadata
    """
    return {
        "provider": "anthropic",
        "model": self.model,
        "max_tokens": self.max_tokens,
        "temperature": self.temperature,
        "capabilities": ["text_generation", "analysis", "coding"],
    }

generate_with_response(prompt: str, **kwargs: Any) -> ModelResponse

Generate a response with full metadata.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The input prompt | required |
| **kwargs | Any | Additional parameters | {} |

Returns:

| Type | Description |
| --- | --- |
| ModelResponse | ModelResponse with text and metadata |

Source code in src/rotalabs_probe/integrations/anthropic_api.py
def generate_with_response(
    self, prompt: str, **kwargs: Any
) -> ModelResponse:
    """Generate a response with full metadata.

    Args:
        prompt: The input prompt
        **kwargs: Additional parameters

    Returns:
        ModelResponse with text and metadata
    """
    client = self._get_client()

    temperature = kwargs.get("temperature", self.temperature)
    max_tokens = kwargs.get("max_tokens", self.max_tokens)
    system = kwargs.get("system", "You are a helpful assistant.")

    start_time = time.time()

    response = client.messages.create(
        model=self.model,
        max_tokens=max_tokens,
        temperature=temperature,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )

    latency_ms = (time.time() - start_time) * 1000

    text = response.content[0].text if response.content else ""

    return ModelResponse(
        text=text,
        model=self.model,
        usage={
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        },
        latency_ms=latency_ms,
        metadata={
            "stop_reason": response.stop_reason,
            "model": response.model,
        },
    )

OpenAI

OpenAI API integration for sandbagging detection.

This module provides an implementation of the ModelAPI protocol for OpenAI's GPT models, enabling sandbagging detection experiments.

Requires the openai package: pip install openai. Set the OPENAI_API_KEY environment variable for authentication.
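
Assuming the package is installed and OPENAI_API_KEY is set, a minimal usage sketch (the import path is inferred from the source location shown below); generate_with_response is used here to show the token-usage metadata:

from rotalabs_probe.integrations.openai_api import OpenAIModelAPI

api = OpenAIModelAPI(model="gpt-4", max_tokens=256)
result = api.generate_with_response("What is 2+2?")
print(result.text)
print(result.usage.get("total_tokens", 0), "tokens in", round(result.latency_ms), "ms")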

OpenAIModelAPI

OpenAI GPT API implementation.

This class provides integration with OpenAI's GPT models for use in sandbagging detection experiments.

Attributes:

| Name | Description |
| --- | --- |
| model | The GPT model to use (e.g., "gpt-4", "gpt-4-turbo") |
| max_tokens | Maximum tokens to generate |
| temperature | Sampling temperature |

Example

>>> api = OpenAIModelAPI(model="gpt-4")
>>> response = api.generate("What is 2+2?")
>>> print(response)
"The answer is 4."

Source code in src/rotalabs_probe/integrations/openai_api.py
class OpenAIModelAPI:
    """OpenAI GPT API implementation.

    This class provides integration with OpenAI's GPT models
    for use in sandbagging detection experiments.

    Attributes:
        model: The GPT model to use (e.g., "gpt-4", "gpt-4-turbo")
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature

    Example:
        >>> api = OpenAIModelAPI(model="gpt-4")
        >>> response = api.generate("What is 2+2?")
        >>> print(response)
        "The answer is 4."
    """

    def __init__(
        self,
        model: str = "gpt-4",
        max_tokens: int = 1024,
        temperature: float = 0.7,
        api_key: Optional[str] = None,
    ) -> None:
        """Initialize OpenAI API client.

        Args:
            model: GPT model identifier
            max_tokens: Maximum tokens to generate
            temperature: Sampling temperature (0.0-2.0)
            api_key: API key (defaults to OPENAI_API_KEY env var)

        Raises:
            ImportError: If openai package is not installed
            ValueError: If no API key is provided or found
        """
        self.model = model
        self.max_tokens = max_tokens
        self.temperature = temperature

        # Get API key
        self._api_key = api_key or os.environ.get("OPENAI_API_KEY")
        if not self._api_key:
            raise ValueError(
                "OpenAI API key required. Set OPENAI_API_KEY environment "
                "variable or pass api_key parameter."
            )

        # Lazy import to avoid dependency issues
        self._client: Optional[Any] = None

    def _get_client(self) -> Any:
        """Get or create OpenAI client (lazy init)."""
        if self._client is None:
            try:
                import openai
                self._client = openai.OpenAI(api_key=self._api_key)
            except ImportError:
                raise ImportError(
                    "openai package required. Install with: pip install openai"
                )
        return self._client

    def generate(self, prompt: str, **kwargs: Any) -> str:
        """Generate a response from GPT.

        Args:
            prompt: The input prompt
            **kwargs: Additional parameters (temperature, max_tokens, system)

        Returns:
            The model's text response
        """
        client = self._get_client()

        # Extract parameters with defaults
        temperature = kwargs.get("temperature", self.temperature)
        max_tokens = kwargs.get("max_tokens", self.max_tokens)
        system = kwargs.get("system", "You are a helpful assistant.")

        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ]

        start_time = time.time()

        response = client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
        )

        latency_ms = (time.time() - start_time) * 1000

        # Extract text from response
        text = response.choices[0].message.content if response.choices else ""

        return text or ""

    def generate_with_perturbation(
        self, prompt: str, noise_level: float = 0.1, **kwargs: Any
    ) -> str:
        """Generate a response with noise injection.

        Applies perturbation through:
        1. Temperature scaling based on noise level
        2. Optional prompt perturbation
        3. Top-p sampling adjustment

        Args:
            prompt: The input prompt
            noise_level: Perturbation level (0.0-1.0)
            **kwargs: Additional generation parameters

        Returns:
            The perturbed response
        """
        # Scale temperature with noise level
        base_temp = kwargs.get("temperature", self.temperature)
        perturbed_temp = min(base_temp + noise_level * 0.8, 2.0)

        # Adjust top_p for additional randomness
        top_p = max(0.5, 1.0 - noise_level * 0.3)

        # Apply prompt perturbation at higher noise levels
        perturbed_prompt = prompt
        if noise_level > 0.3:
            perturbed_prompt = self._perturb_prompt(prompt, noise_level)

        return self.generate(
            perturbed_prompt,
            temperature=perturbed_temp,
            top_p=top_p,
            **{k: v for k, v in kwargs.items() if k not in ["temperature", "top_p"]},
        )

    def _perturb_prompt(self, prompt: str, noise_level: float) -> str:
        """Add slight noise to prompt."""
        perturbations = [
            lambda p: p.replace(".", ".."),
            lambda p: "Hey, " + p,
            lambda p: p + " Be concise.",
            lambda p: "Quick question: " + p.lower(),
            lambda p: p.replace("?", "??"),
        ]

        if random.random() < noise_level:
            perturbation = random.choice(perturbations)
            return perturbation(prompt)

        return prompt

    def generate_batch(self, prompts: List[str], **kwargs: Any) -> List[str]:
        """Generate responses for multiple prompts."""
        return [self.generate(prompt, **kwargs) for prompt in prompts]

    def get_model_info(self) -> Dict[str, Any]:
        """Get information about the configured model.

        Returns:
            Dictionary with model metadata
        """
        return {
            "provider": "openai",
            "model": self.model,
            "max_tokens": self.max_tokens,
            "temperature": self.temperature,
            "capabilities": ["text_generation", "analysis", "coding", "reasoning"],
        }

    def generate_with_response(
        self, prompt: str, **kwargs: Any
    ) -> ModelResponse:
        """Generate a response with full metadata.

        Args:
            prompt: The input prompt
            **kwargs: Additional parameters

        Returns:
            ModelResponse with text and metadata
        """
        client = self._get_client()

        temperature = kwargs.get("temperature", self.temperature)
        max_tokens = kwargs.get("max_tokens", self.max_tokens)
        system = kwargs.get("system", "You are a helpful assistant.")

        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ]

        start_time = time.time()

        response = client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
        )

        latency_ms = (time.time() - start_time) * 1000

        text = response.choices[0].message.content if response.choices else ""

        return ModelResponse(
            text=text or "",
            model=self.model,
            usage={
                "prompt_tokens": response.usage.prompt_tokens if response.usage else 0,
                "completion_tokens": response.usage.completion_tokens if response.usage else 0,
                "total_tokens": response.usage.total_tokens if response.usage else 0,
            },
            latency_ms=latency_ms,
            metadata={
                "finish_reason": response.choices[0].finish_reason if response.choices else None,
                "model": response.model,
            },
        )

    def generate_with_logprobs(
        self, prompt: str, **kwargs: Any
    ) -> Dict[str, Any]:
        """Generate a response with token log probabilities.

        Useful for analyzing model confidence and detecting
        unusual token distributions that may indicate sandbagging.

        Args:
            prompt: The input prompt
            **kwargs: Additional parameters

        Returns:
            Dictionary with text and log probabilities
        """
        client = self._get_client()

        temperature = kwargs.get("temperature", self.temperature)
        max_tokens = kwargs.get("max_tokens", self.max_tokens)
        system = kwargs.get("system", "You are a helpful assistant.")

        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ]

        response = client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
            logprobs=True,
            top_logprobs=5,
        )

        choice = response.choices[0] if response.choices else None
        text = choice.message.content if choice else ""

        logprobs_data = None
        if choice and choice.logprobs:
            logprobs_data = {
                "tokens": [
                    {
                        "token": lp.token,
                        "logprob": lp.logprob,
                        "top_logprobs": [
                            {"token": t.token, "logprob": t.logprob}
                            for t in (lp.top_logprobs or [])
                        ],
                    }
                    for lp in (choice.logprobs.content or [])
                ],
            }

        return {
            "text": text or "",
            "logprobs": logprobs_data,
            "model": response.model,
        }

__init__(model: str = 'gpt-4', max_tokens: int = 1024, temperature: float = 0.7, api_key: Optional[str] = None) -> None

Initialize OpenAI API client.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | str | GPT model identifier | 'gpt-4' |
| max_tokens | int | Maximum tokens to generate | 1024 |
| temperature | float | Sampling temperature (0.0-2.0) | 0.7 |
| api_key | Optional[str] | API key (defaults to OPENAI_API_KEY env var) | None |

Raises:

| Type | Description |
| --- | --- |
| ImportError | If openai package is not installed |
| ValueError | If no API key is provided or found |

Source code in src/rotalabs_probe/integrations/openai_api.py
def __init__(
    self,
    model: str = "gpt-4",
    max_tokens: int = 1024,
    temperature: float = 0.7,
    api_key: Optional[str] = None,
) -> None:
    """Initialize OpenAI API client.

    Args:
        model: GPT model identifier
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature (0.0-2.0)
        api_key: API key (defaults to OPENAI_API_KEY env var)

    Raises:
        ImportError: If openai package is not installed
        ValueError: If no API key is provided or found
    """
    self.model = model
    self.max_tokens = max_tokens
    self.temperature = temperature

    # Get API key
    self._api_key = api_key or os.environ.get("OPENAI_API_KEY")
    if not self._api_key:
        raise ValueError(
            "OpenAI API key required. Set OPENAI_API_KEY environment "
            "variable or pass api_key parameter."
        )

    # Lazy import to avoid dependency issues
    self._client: Optional[Any] = None

generate(prompt: str, **kwargs: Any) -> str

Generate a response from GPT.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The input prompt | required |
| **kwargs | Any | Additional parameters (temperature, max_tokens, system) | {} |

Returns:

| Type | Description |
| --- | --- |
| str | The model's text response |

Source code in src/rotalabs_probe/integrations/openai_api.py
def generate(self, prompt: str, **kwargs: Any) -> str:
    """Generate a response from GPT.

    Args:
        prompt: The input prompt
        **kwargs: Additional parameters (temperature, max_tokens, system)

    Returns:
        The model's text response
    """
    client = self._get_client()

    # Extract parameters with defaults
    temperature = kwargs.get("temperature", self.temperature)
    max_tokens = kwargs.get("max_tokens", self.max_tokens)
    system = kwargs.get("system", "You are a helpful assistant.")

    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]

    start_time = time.time()

    response = client.chat.completions.create(
        model=self.model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature,
    )

    latency_ms = (time.time() - start_time) * 1000

    # Extract text from response
    text = response.choices[0].message.content if response.choices else ""

    return text or ""

generate_with_perturbation(prompt: str, noise_level: float = 0.1, **kwargs: Any) -> str

Generate a response with noise injection.

Applies perturbation through:

1. Temperature scaling based on noise level
2. Optional prompt perturbation
3. Top-p sampling adjustment

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The input prompt | required |
| noise_level | float | Perturbation level (0.0-1.0) | 0.1 |
| **kwargs | Any | Additional generation parameters | {} |

Returns:

| Type | Description |
| --- | --- |
| str | The perturbed response |

Source code in src/rotalabs_probe/integrations/openai_api.py
def generate_with_perturbation(
    self, prompt: str, noise_level: float = 0.1, **kwargs: Any
) -> str:
    """Generate a response with noise injection.

    Applies perturbation through:
    1. Temperature scaling based on noise level
    2. Optional prompt perturbation
    3. Top-p sampling adjustment

    Args:
        prompt: The input prompt
        noise_level: Perturbation level (0.0-1.0)
        **kwargs: Additional generation parameters

    Returns:
        The perturbed response
    """
    # Scale temperature with noise level
    base_temp = kwargs.get("temperature", self.temperature)
    perturbed_temp = min(base_temp + noise_level * 0.8, 2.0)

    # Adjust top_p for additional randomness
    top_p = max(0.5, 1.0 - noise_level * 0.3)

    # Apply prompt perturbation at higher noise levels
    perturbed_prompt = prompt
    if noise_level > 0.3:
        perturbed_prompt = self._perturb_prompt(prompt, noise_level)

    return self.generate(
        perturbed_prompt,
        temperature=perturbed_temp,
        top_p=top_p,
        **{k: v for k, v in kwargs.items() if k not in ["temperature", "top_p"]},
    )

generate_batch(prompts: List[str], **kwargs: Any) -> List[str]

Generate responses for multiple prompts.

Source code in src/rotalabs_probe/integrations/openai_api.py
def generate_batch(self, prompts: List[str], **kwargs: Any) -> List[str]:
    """Generate responses for multiple prompts."""
    return [self.generate(prompt, **kwargs) for prompt in prompts]

get_model_info() -> Dict[str, Any]

Get information about the configured model.

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Dictionary with model metadata |

Source code in src/rotalabs_probe/integrations/openai_api.py
def get_model_info(self) -> Dict[str, Any]:
    """Get information about the configured model.

    Returns:
        Dictionary with model metadata
    """
    return {
        "provider": "openai",
        "model": self.model,
        "max_tokens": self.max_tokens,
        "temperature": self.temperature,
        "capabilities": ["text_generation", "analysis", "coding", "reasoning"],
    }

generate_with_response(prompt: str, **kwargs: Any) -> ModelResponse

Generate a response with full metadata.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The input prompt | required |
| **kwargs | Any | Additional parameters | {} |

Returns:

| Type | Description |
| --- | --- |
| ModelResponse | ModelResponse with text and metadata |

Source code in src/rotalabs_probe/integrations/openai_api.py
def generate_with_response(
    self, prompt: str, **kwargs: Any
) -> ModelResponse:
    """Generate a response with full metadata.

    Args:
        prompt: The input prompt
        **kwargs: Additional parameters

    Returns:
        ModelResponse with text and metadata
    """
    client = self._get_client()

    temperature = kwargs.get("temperature", self.temperature)
    max_tokens = kwargs.get("max_tokens", self.max_tokens)
    system = kwargs.get("system", "You are a helpful assistant.")

    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]

    start_time = time.time()

    response = client.chat.completions.create(
        model=self.model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature,
    )

    latency_ms = (time.time() - start_time) * 1000

    text = response.choices[0].message.content if response.choices else ""

    return ModelResponse(
        text=text or "",
        model=self.model,
        usage={
            "prompt_tokens": response.usage.prompt_tokens if response.usage else 0,
            "completion_tokens": response.usage.completion_tokens if response.usage else 0,
            "total_tokens": response.usage.total_tokens if response.usage else 0,
        },
        latency_ms=latency_ms,
        metadata={
            "finish_reason": response.choices[0].finish_reason if response.choices else None,
            "model": response.model,
        },
    )

generate_with_logprobs(prompt: str, **kwargs: Any) -> Dict[str, Any]

Generate a response with token log probabilities.

Useful for analyzing model confidence and detecting unusual token distributions that may indicate sandbagging.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The input prompt | required |
| **kwargs | Any | Additional parameters | {} |

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Dictionary with text and log probabilities |

Source code in src/rotalabs_probe/integrations/openai_api.py
def generate_with_logprobs(
    self, prompt: str, **kwargs: Any
) -> Dict[str, Any]:
    """Generate a response with token log probabilities.

    Useful for analyzing model confidence and detecting
    unusual token distributions that may indicate sandbagging.

    Args:
        prompt: The input prompt
        **kwargs: Additional parameters

    Returns:
        Dictionary with text and log probabilities
    """
    client = self._get_client()

    temperature = kwargs.get("temperature", self.temperature)
    max_tokens = kwargs.get("max_tokens", self.max_tokens)
    system = kwargs.get("system", "You are a helpful assistant.")

    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]

    response = client.chat.completions.create(
        model=self.model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature,
        logprobs=True,
        top_logprobs=5,
    )

    choice = response.choices[0] if response.choices else None
    text = choice.message.content if choice else ""

    logprobs_data = None
    if choice and choice.logprobs:
        logprobs_data = {
            "tokens": [
                {
                    "token": lp.token,
                    "logprob": lp.logprob,
                    "top_logprobs": [
                        {"token": t.token, "logprob": t.logprob}
                        for t in (lp.top_logprobs or [])
                    ],
                }
                for lp in (choice.logprobs.content or [])
            ],
        }

    return {
        "text": text or "",
        "logprobs": logprobs_data,
        "model": response.model,
    }
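
One way to turn this structure into a crude confidence signal is to average the per-token log probabilities. A sketch, assuming api is an OpenAIModelAPI instance and the logprobs field is populated:

import math

result = api.generate_with_logprobs("What is 2+2?")
tokens = (result["logprobs"] or {}).get("tokens", [])

if tokens:
    mean_logprob = sum(t["logprob"] for t in tokens) / len(tokens)
    # exp(mean logprob) is the geometric-mean token probability; unusually low
    # values on easy questions may warrant a closer look for sandbagging.
    print(f"mean token probability: {math.exp(mean_logprob):.3f}")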