
Utilities

Utility functions for statistical analysis, text processing, and feature extraction.

Statistical Tests

Statistical testing utilities for metacognition analysis.

This module provides reusable statistical functions for Bayesian inference, confidence interval computation, z-score calculations, and divergence significance assessment.

SignificanceLevel

Bases: Enum

Significance level classification for statistical tests.

Source code in src/rotalabs_probe/utils/statistical_tests.py
class SignificanceLevel(Enum):
    """Significance level classification for statistical tests."""

    NONE = "none"  # Below threshold
    LOW = "low"  # 2-3 sigma
    MEDIUM = "medium"  # 3-4 sigma
    HIGH = "high"  # 4-5 sigma
    CRITICAL = "critical"  # >5 sigma

bayesian_update(prior_alpha: float, prior_beta: float, evidence: Dict[str, int]) -> Tuple[float, float]

Update Beta distribution priors with new evidence using Bayesian inference.

Uses the Beta-Binomial conjugate prior relationship where:

- Prior: Beta(alpha, beta)
- Likelihood: Binomial(successes, failures)
- Posterior: Beta(alpha + successes, beta + failures)

Parameters:

- prior_alpha (float, required): Alpha parameter of prior Beta distribution (must be > 0)
- prior_beta (float, required): Beta parameter of prior Beta distribution (must be > 0)
- evidence (Dict[str, int], required): Dictionary with 'successes' and 'failures' counts

Returns:

- Tuple[float, float]: Tuple of (posterior_alpha, posterior_beta)

Raises:

- ValueError: If prior parameters are invalid
- ValueError: If evidence is missing required keys or has negative values
- TypeError: If evidence is not a dictionary

Examples:

>>> bayesian_update(1.0, 1.0, {'successes': 5, 'failures': 3})
(6.0, 4.0)
>>> bayesian_update(10.0, 10.0, {'successes': 8, 'failures': 2})
(18.0, 12.0)
Source code in src/rotalabs_probe/utils/statistical_tests.py
def bayesian_update(
    prior_alpha: float, prior_beta: float, evidence: Dict[str, int]
) -> Tuple[float, float]:
    """Update Beta distribution priors with new evidence using Bayesian inference.

    Uses the Beta-Binomial conjugate prior relationship where:
    - Prior: Beta(alpha, beta)
    - Likelihood: Binomial(successes, failures)
    - Posterior: Beta(alpha + successes, beta + failures)

    Args:
        prior_alpha: Alpha parameter of prior Beta distribution (must be > 0)
        prior_beta: Beta parameter of prior Beta distribution (must be > 0)
        evidence: Dictionary with 'successes' and 'failures' counts

    Returns:
        Tuple of (posterior_alpha, posterior_beta)

    Raises:
        ValueError: If prior parameters are invalid
        ValueError: If evidence is missing required keys or has negative values
        TypeError: If evidence is not a dictionary

    Examples:
        >>> bayesian_update(1.0, 1.0, {'successes': 5, 'failures': 3})
        (6.0, 4.0)

        >>> bayesian_update(10.0, 10.0, {'successes': 8, 'failures': 2})
        (18.0, 12.0)
    """
    # Validate prior parameters
    if not isinstance(prior_alpha, (int, float)) or not isinstance(
        prior_beta, (int, float)
    ):
        raise ValueError("Prior alpha and beta must be numeric")

    if prior_alpha <= 0 or prior_beta <= 0:
        raise ValueError("Prior alpha and beta must be positive")

    # Validate evidence
    if not isinstance(evidence, dict):
        raise TypeError("Evidence must be a dictionary")

    if "successes" not in evidence or "failures" not in evidence:
        raise ValueError("Evidence must contain 'successes' and 'failures' keys")

    successes = evidence["successes"]
    failures = evidence["failures"]

    if not isinstance(successes, (int, float)) or not isinstance(failures, (int, float)):
        raise ValueError("Evidence counts must be numeric")

    if successes < 0 or failures < 0:
        raise ValueError("Evidence counts cannot be negative")

    # Bayesian update: posterior = prior + evidence
    posterior_alpha = float(prior_alpha + successes)
    posterior_beta = float(prior_beta + failures)

    return posterior_alpha, posterior_beta
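
Because the update is conjugate, evidence composes: folding in one large batch or several smaller batches yields the same posterior. A minimal sketch (assuming the module imports as rotalabs_probe.utils.statistical_tests, per the source path above):

from rotalabs_probe.utils.statistical_tests import bayesian_update

# One batch of evidence...
posterior = bayesian_update(1.0, 1.0, {"successes": 5, "failures": 3})

# ...matches the same evidence applied in two sequential batches.
step1 = bayesian_update(1.0, 1.0, {"successes": 2, "failures": 1})
step2 = bayesian_update(step1[0], step1[1], {"successes": 3, "failures": 2})

assert posterior == step2 == (6.0, 4.0)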

compute_confidence_interval(alpha: float, beta: float, confidence_level: float = 0.95) -> Tuple[float, float]

Compute credible interval for Beta distribution.

Calculates a Bayesian credible interval (here, an equal-tailed interval computed from Beta quantiles) for a Beta distribution. This represents the range within which the true parameter lies with the specified probability. Note that an equal-tailed interval coincides with the highest density interval only when the distribution is symmetric.

Parameters:

- alpha (float, required): Alpha parameter of Beta distribution (must be > 0)
- beta (float, required): Beta parameter of Beta distribution (must be > 0)
- confidence_level (float, default: 0.95): Confidence level (0 < confidence_level < 1)

Returns:

- Tuple[float, float]: Tuple of (lower_bound, upper_bound) for the credible interval

Raises:

- ValueError: If alpha or beta are not positive
- ValueError: If confidence_level is not between 0 and 1

Examples:

>>> lower, upper = compute_confidence_interval(10, 10, 0.95)
>>> 0.25 < lower < 0.35  # Approximately 0.29
True
>>> 0.65 < upper < 0.75  # Approximately 0.71
True
>>> lower, upper = compute_confidence_interval(100, 10, 0.95)
>>> 0.8 < lower < 0.95  # Approximately 0.85
True
Source code in src/rotalabs_probe/utils/statistical_tests.py
def compute_confidence_interval(
    alpha: float, beta: float, confidence_level: float = 0.95
) -> Tuple[float, float]:
    """Compute credible interval for Beta distribution.

    Calculates a Bayesian credible interval (here, an equal-tailed interval
    computed from Beta quantiles) for a Beta distribution. This represents the
    range within which the true parameter lies with the specified probability.

    Args:
        alpha: Alpha parameter of Beta distribution (must be > 0)
        beta: Beta parameter of Beta distribution (must be > 0)
        confidence_level: Confidence level (0 < confidence_level < 1, default: 0.95)

    Returns:
        Tuple of (lower_bound, upper_bound) for the credible interval

    Raises:
        ValueError: If alpha or beta are not positive
        ValueError: If confidence_level is not between 0 and 1

    Examples:
        >>> lower, upper = compute_confidence_interval(10, 10, 0.95)
        >>> 0.25 < lower < 0.35  # Approximately 0.29
        True
        >>> 0.65 < upper < 0.75  # Approximately 0.71
        True

        >>> lower, upper = compute_confidence_interval(100, 10, 0.95)
        >>> 0.8 < lower < 0.95  # Approximately 0.85
        True
    """
    # Validate parameters
    if not isinstance(alpha, (int, float)) or not isinstance(beta, (int, float)):
        raise ValueError("Alpha and beta must be numeric")

    if alpha <= 0 or beta <= 0:
        raise ValueError("Alpha and beta must be positive")

    if not isinstance(confidence_level, (int, float)):
        raise ValueError("Confidence level must be numeric")

    if confidence_level <= 0 or confidence_level >= 1:
        raise ValueError("Confidence level must be between 0 and 1")

    # Calculate credible interval using Beta distribution quantiles
    # For a symmetric interval, we use (1 - confidence_level) / 2 on each tail
    tail_prob = (1 - confidence_level) / 2
    lower_bound = stats.beta.ppf(tail_prob, alpha, beta)
    upper_bound = stats.beta.ppf(1 - tail_prob, alpha, beta)

    return float(lower_bound), float(upper_bound)
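
A typical flow pairs bayesian_update with compute_confidence_interval: fold observed counts into a Beta prior, then report a credible interval for the underlying rate. A short sketch (the counts are illustrative):

from rotalabs_probe.utils.statistical_tests import (
    bayesian_update,
    compute_confidence_interval,
)

# Start from a uniform Beta(1, 1) prior and fold in observed counts.
alpha, beta = bayesian_update(1.0, 1.0, {"successes": 42, "failures": 8})

# 95% equal-tailed credible interval for the underlying success rate.
lower, upper = compute_confidence_interval(alpha, beta, confidence_level=0.95)
print(f"rate in [{lower:.3f}, {upper:.3f}] with 95% probability")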

z_score(value: float, mean: float, std: float) -> float

Calculate standardized z-score.

Computes how many standard deviations a value is from the mean. Handles edge cases like zero standard deviation gracefully.

Formula: z = (value - mean) / std

Parameters:

- value (float, required): The observed value
- mean (float, required): The mean of the distribution
- std (float, required): The standard deviation of the distribution (must be >= 0)

Returns:

- float: Z-score (number of standard deviations from mean). Returns 0.0 if std is 0 or very small (< 1e-10).

Raises:

- ValueError: If std is negative
- ValueError: If any parameter is not numeric

Examples:

>>> z_score(100, 90, 10)
1.0
>>> z_score(85, 100, 5)
-3.0
>>> z_score(50, 50, 0)  # Edge case: zero std
0.0
Source code in src/rotalabs_probe/utils/statistical_tests.py
def z_score(value: float, mean: float, std: float) -> float:
    """Calculate standardized z-score.

    Computes how many standard deviations a value is from the mean.
    Handles edge cases like zero standard deviation gracefully.

    Formula: z = (value - mean) / std

    Args:
        value: The observed value
        mean: The mean of the distribution
        std: The standard deviation of the distribution (must be >= 0)

    Returns:
        Z-score (number of standard deviations from mean)
        Returns 0.0 if std is 0 or very small (< 1e-10)

    Raises:
        ValueError: If std is negative
        ValueError: If any parameter is not numeric

    Examples:
        >>> z_score(100, 90, 10)
        1.0

        >>> z_score(85, 100, 5)
        -3.0

        >>> z_score(50, 50, 0)  # Edge case: zero std
        0.0
    """
    # Validate inputs
    if not all(isinstance(x, (int, float)) for x in [value, mean, std]):
        raise ValueError("All parameters must be numeric")

    if std < 0:
        raise ValueError("Standard deviation cannot be negative")

    # Handle edge case: zero or very small standard deviation
    # If std is essentially zero, the value equals the mean (or data has no variance)
    if std < 1e-10:
        return 0.0

    # Standard z-score calculation
    z = (value - mean) / std

    return float(z)

assess_divergence_significance(z_score_value: float, threshold: float = 2.0) -> SignificanceLevel

Assess statistical significance of a divergence based on z-score.

Classifies the significance level of a divergence using standard deviation thresholds. Uses absolute value of z-score.

Significance levels:

- NONE: |z| < threshold (typically < 2σ)
- LOW: threshold <= |z| < threshold + 1 (2-3σ)
- MEDIUM: threshold + 1 <= |z| < threshold + 2 (3-4σ)
- HIGH: threshold + 2 <= |z| < threshold + 3 (4-5σ)
- CRITICAL: |z| >= threshold + 3 (>5σ)

Parameters:

- z_score_value (float, required): The z-score to assess
- threshold (float, default: 2.0): Base threshold for significance

Returns:

- SignificanceLevel: Enum member indicating the level of significance

Raises:

- ValueError: If threshold is not positive
- ValueError: If z_score_value is not numeric

Examples:

>>> assess_divergence_significance(1.5)
<SignificanceLevel.NONE: 'none'>
>>> assess_divergence_significance(2.5)
<SignificanceLevel.LOW: 'low'>
>>> assess_divergence_significance(3.5)
<SignificanceLevel.MEDIUM: 'medium'>
>>> assess_divergence_significance(-4.5)  # Absolute value used
<SignificanceLevel.HIGH: 'high'>
>>> assess_divergence_significance(6.0)
<SignificanceLevel.CRITICAL: 'critical'>
Source code in src/rotalabs_probe/utils/statistical_tests.py
def assess_divergence_significance(
    z_score_value: float, threshold: float = 2.0
) -> SignificanceLevel:
    """Assess statistical significance of a divergence based on z-score.

    Classifies the significance level of a divergence using standard
    deviation thresholds. Uses absolute value of z-score.

    Significance levels:
    - NONE: |z| < threshold (typically < 2σ)
    - LOW: threshold <= |z| < threshold + 1 (2-3σ)
    - MEDIUM: threshold + 1 <= |z| < threshold + 2 (3-4σ)
    - HIGH: threshold + 2 <= |z| < threshold + 3 (4-5σ)
    - CRITICAL: |z| >= threshold + 3 (>5σ)

    Args:
        z_score_value: The z-score to assess
        threshold: Base threshold for significance (default: 2.0)

    Returns:
        SignificanceLevel enum indicating the level of significance

    Raises:
        ValueError: If threshold is not positive
        ValueError: If z_score_value is not numeric

    Examples:
        >>> assess_divergence_significance(1.5)
        <SignificanceLevel.NONE: 'none'>

        >>> assess_divergence_significance(2.5)
        <SignificanceLevel.LOW: 'low'>

        >>> assess_divergence_significance(3.5)
        <SignificanceLevel.MEDIUM: 'medium'>

        >>> assess_divergence_significance(-4.5)  # Absolute value used
        <SignificanceLevel.HIGH: 'high'>

        >>> assess_divergence_significance(6.0)
        <SignificanceLevel.CRITICAL: 'critical'>
    """
    # Validate inputs
    if not isinstance(z_score_value, (int, float)):
        raise ValueError("Z-score must be numeric")

    if not isinstance(threshold, (int, float)):
        raise ValueError("Threshold must be numeric")

    if threshold <= 0:
        raise ValueError("Threshold must be positive")

    # Use absolute value for significance assessment
    abs_z = abs(z_score_value)

    # Classify based on thresholds
    if abs_z < threshold:
        return SignificanceLevel.NONE
    elif abs_z < threshold + 1:
        return SignificanceLevel.LOW
    elif abs_z < threshold + 2:
        return SignificanceLevel.MEDIUM
    elif abs_z < threshold + 3:
        return SignificanceLevel.HIGH
    else:
        return SignificanceLevel.CRITICAL
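
z_score and assess_divergence_significance compose naturally: standardize an observation against a baseline, then classify how far it diverges. A minimal sketch with made-up baseline numbers:

from rotalabs_probe.utils.statistical_tests import (
    SignificanceLevel,
    assess_divergence_significance,
    z_score,
)

# Illustrative baseline: a metric averages 0.05 with std 0.02 on a corpus.
z = z_score(value=0.14, mean=0.05, std=0.02)  # 4.5 standard deviations
level = assess_divergence_significance(z)  # default threshold of 2.0

assert level is SignificanceLevel.HIGH  # 4 <= |z| < 5 at threshold 2.0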

compute_beta_mean(alpha: float, beta: float) -> float

Compute mean of Beta distribution.

Parameters:

- alpha (float, required): Alpha parameter (must be > 0)
- beta (float, required): Beta parameter (must be > 0)

Returns:

- float: Mean of the Beta distribution: alpha / (alpha + beta)

Raises:

- ValueError: If alpha or beta are not positive

Source code in src/rotalabs_probe/utils/statistical_tests.py
def compute_beta_mean(alpha: float, beta: float) -> float:
    """Compute mean of Beta distribution.

    Args:
        alpha: Alpha parameter (must be > 0)
        beta: Beta parameter (must be > 0)

    Returns:
        Mean of the Beta distribution: alpha / (alpha + beta)

    Raises:
        ValueError: If alpha or beta are not positive
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("Alpha and beta must be positive")

    return float(alpha / (alpha + beta))

compute_beta_variance(alpha: float, beta: float) -> float

Compute variance of Beta distribution.

Parameters:

- alpha (float, required): Alpha parameter (must be > 0)
- beta (float, required): Beta parameter (must be > 0)

Returns:

- float: Variance of the Beta distribution: alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))

Raises:

- ValueError: If alpha or beta are not positive

Source code in src/rotalabs_probe/utils/statistical_tests.py
def compute_beta_variance(alpha: float, beta: float) -> float:
    """Compute variance of Beta distribution.

    Args:
        alpha: Alpha parameter (must be > 0)
        beta: Beta parameter (must be > 0)

    Returns:
        Variance of the Beta distribution

    Raises:
        ValueError: If alpha or beta are not positive
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("Alpha and beta must be positive")

    numerator = alpha * beta
    denominator = (alpha + beta) ** 2 * (alpha + beta + 1)

    return float(numerator / denominator)

beta_mode(alpha: float, beta: float) -> float

Compute mode of Beta distribution.

The mode is defined only when alpha, beta > 1.

Parameters:

- alpha (float, required): Alpha parameter (must be > 1 for mode to exist)
- beta (float, required): Beta parameter (must be > 1 for mode to exist)

Returns:

- float: Mode of the Beta distribution: (alpha - 1) / (alpha + beta - 2)

Raises:

- ValueError: If alpha or beta are not greater than 1

Source code in src/rotalabs_probe/utils/statistical_tests.py
def beta_mode(alpha: float, beta: float) -> float:
    """Compute mode of Beta distribution.

    The mode is defined only when alpha, beta > 1.

    Args:
        alpha: Alpha parameter (must be > 1 for mode to exist)
        beta: Beta parameter (must be > 1 for mode to exist)

    Returns:
        Mode of the Beta distribution: (alpha - 1) / (alpha + beta - 2)

    Raises:
        ValueError: If alpha or beta are not greater than 1
    """
    if alpha <= 1 or beta <= 1:
        raise ValueError("Mode is only defined for alpha, beta > 1")

    return float((alpha - 1) / (alpha + beta - 2))
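
Together, these helpers summarize a Beta posterior. A worked example for the Beta(6, 4) posterior produced in the bayesian_update doctest above:

from rotalabs_probe.utils.statistical_tests import (
    beta_mode,
    compute_beta_mean,
    compute_beta_variance,
)

alpha, beta = 6.0, 4.0  # posterior from the bayesian_update example

print(compute_beta_mean(alpha, beta))      # 6 / 10 = 0.6
print(compute_beta_variance(alpha, beta))  # 24 / (100 * 11) ≈ 0.0218
print(beta_mode(alpha, beta))              # 5 / 8 = 0.625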

Text Processing

Text processing utilities for metacognition analysis.

tokenize(text: str, lowercase: bool = True) -> List[str]

Tokenize text into words.

Parameters:

- text (str, required): Input text to tokenize
- lowercase (bool, default: True): Whether to convert tokens to lowercase

Returns:

- List[str]: List of tokens

Source code in src/rotalabs_probe/utils/text_processing.py
def tokenize(text: str, lowercase: bool = True) -> List[str]:
    """Tokenize text into words.

    Args:
        text: Input text to tokenize
        lowercase: Whether to convert tokens to lowercase

    Returns:
        List of tokens
    """
    if lowercase:
        text = text.lower()
    # Simple word tokenization
    tokens = re.findall(r"\b\w+\b", text)
    return tokens

remove_stopwords(tokens: List[str], stopwords: Set[str]) -> List[str]

Remove stopwords from a list of tokens.

Parameters:

- tokens (List[str], required): List of tokens
- stopwords (Set[str], required): Set of stopwords to remove

Returns:

- List[str]: List of tokens with stopwords removed

Source code in src/rotalabs_probe/utils/text_processing.py
def remove_stopwords(tokens: List[str], stopwords: Set[str]) -> List[str]:
    """Remove stopwords from a list of tokens.

    Args:
        tokens: List of tokens
        stopwords: Set of stopwords to remove

    Returns:
        List of tokens with stopwords removed
    """
    return [token for token in tokens if token not in stopwords]
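
A small sketch chaining the two helpers (the stopword set here is illustrative; the module does not ship one):

from rotalabs_probe.utils.text_processing import remove_stopwords, tokenize

tokens = tokenize("The model might be uncertain about this answer.")
# ['the', 'model', 'might', 'be', 'uncertain', 'about', 'this', 'answer']

content = remove_stopwords(tokens, stopwords={"the", "be", "about", "this"})
# ['model', 'might', 'uncertain', 'answer']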

get_uncertainty_phrases() -> Set[str]

Get a set of common uncertainty phrases.

Returns:

- Set[str]: Set of uncertainty phrases

Source code in src/rotalabs_probe/utils/text_processing.py
def get_uncertainty_phrases() -> Set[str]:
    """Get a set of common uncertainty phrases.

    Returns:
        Set of uncertainty phrases
    """
    return {
        "i'm not sure",
        "i'm uncertain",
        "i don't know",
        "might be",
        "could be",
        "possibly",
        "perhaps",
        "maybe",
        "i think",
        "i believe",
        "it seems",
        "it appears",
        "likely",
        "unlikely",
        "not certain",
        "not confident",
    }

get_confidence_phrases() -> Set[str]

Get a set of common confidence phrases.

Returns:

- Set[str]: Set of confidence phrases

Source code in src/rotalabs_probe/utils/text_processing.py
def get_confidence_phrases() -> Set[str]:
    """Get a set of common confidence phrases.

    Returns:
        Set of confidence phrases
    """
    return {
        "i'm certain",
        "i'm confident",
        "i'm sure",
        "definitely",
        "absolutely",
        "certainly",
        "without doubt",
        "clearly",
        "obviously",
        "undoubtedly",
    }

normalize_text(text: str) -> str

Normalize text by removing extra whitespace and converting to lowercase.

Parameters:

- text (str, required): Input text to normalize

Returns:

- str: Normalized text

Source code in src/rotalabs_probe/utils/text_processing.py
def normalize_text(text: str) -> str:
    """Normalize text by removing extra whitespace and converting to lowercase.

    Args:
        text: Input text to normalize

    Returns:
        Normalized text
    """
    # Remove extra whitespace
    text = re.sub(r"\s+", " ", text)
    # Strip leading/trailing whitespace
    text = text.strip()
    # Convert to lowercase
    text = text.lower()
    return text
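
The phrase sets are lowercase, multi-word strings, so they pair naturally with normalize_text for simple substring matching. A minimal sketch:

from rotalabs_probe.utils.text_processing import (
    get_uncertainty_phrases,
    normalize_text,
)

response = "I'm not sure,   but it COULD BE a caching issue."
normalized = normalize_text(response)
# "i'm not sure, but it could be a caching issue."

hits = [p for p in get_uncertainty_phrases() if p in normalized]
# ["i'm not sure", "could be"] (set iteration order varies)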

Feature Extraction

Feature extraction utilities for behavioral analysis.

This module provides reusable functions for extracting behavioral features from AI model outputs, including hedging patterns, meta-commentary, reasoning depth, and statistical divergence measures.

extract_behavioral_features(text: str, cot: Optional[str] = None, metadata: Optional[Dict[str, Any]] = None) -> Dict[str, float]

Extract comprehensive behavioral features from model output.

Parameters:

- text (str, required): The model's response text
- cot (Optional[str], default: None): Optional chain-of-thought reasoning
- metadata (Optional[Dict[str, Any]], default: None): Optional metadata dictionary

Returns:

- Dict[str, float]: Dictionary mapping feature names to their values

Raises:

- ValueError: If text is empty or invalid

Source code in src/rotalabs_probe/utils/feature_extraction.py
def extract_behavioral_features(
    text: str,
    cot: Optional[str] = None,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, float]:
    """Extract comprehensive behavioral features from model output.

    Args:
        text: The model's response text
        cot: Optional chain-of-thought reasoning
        metadata: Optional metadata dictionary

    Returns:
        Dictionary mapping feature names to their values

    Raises:
        ValueError: If text is empty or invalid
    """
    if not text or not isinstance(text, str):
        raise ValueError("text must be a non-empty string")

    features: Dict[str, float] = {}

    # Basic text features
    features["response_length"] = float(len(text))
    words = text.split()
    features["word_count"] = float(len(words))
    features["avg_word_length"] = (
        float(np.mean([len(w) for w in words])) if words else 0.0
    )

    # Sentence statistics
    # TODO: this regex doesn't handle abbreviations well (e.g. "Dr. Smith")
    sentences = re.split(r"[.!?]+", text)
    sentences = [s.strip() for s in sentences if s.strip()]
    features["sentence_count"] = float(len(sentences))
    features["avg_sentence_length"] = (
        float(np.mean([len(s.split()) for s in sentences])) if sentences else 0.0
    )

    # Hedging patterns
    features["hedging_ratio"] = count_hedging_phrases(text)

    # Meta-commentary
    meta_result = detect_meta_commentary(text)
    features["meta_commentary_detected"] = float(meta_result["detected"])
    features["meta_commentary_confidence"] = meta_result["confidence"]
    features["meta_commentary_count"] = float(meta_result["count"])

    # Self-reference patterns
    features["self_reference_ratio"] = _count_self_references(text) / max(
        len(words), 1
    )

    # Reasoning depth from CoT if provided
    if cot:
        features["reasoning_depth"] = extract_reasoning_depth(cot)
    else:
        features["reasoning_depth"] = 0.0

    # Confidence markers
    features["confidence_high_ratio"] = _count_confidence_markers(text, high=True) / max(
        len(words), 1
    )
    features["confidence_low_ratio"] = _count_confidence_markers(text, high=False) / max(
        len(words), 1
    )

    # Refusal indicators
    features["refusal_indicators"] = _count_refusal_indicators(text)

    # Tool/action mentions (if metadata provided)
    if metadata and "tool_used" in metadata:
        features["tool_used"] = float(metadata["tool_used"])
    else:
        features["tool_used"] = 0.0

    return features
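
A short usage sketch; the exact feature values depend on the pattern lists and helper functions shown throughout this page:

from rotalabs_probe.utils.feature_extraction import extract_behavioral_features

features = extract_behavioral_features(
    text="I think this is probably a caching issue, but I'm not certain.",
    cot="First, consider the symptoms. Because they are intermittent, "
    "a stale cache seems likely.",
    metadata={"tool_used": 1},
)

print(features["word_count"])       # 12.0
print(features["hedging_ratio"])    # > 0 ("i think", "probably", ...)
print(features["reasoning_depth"])  # > 0 ("first", "because", "consider", ...)
print(features["tool_used"])        # 1.0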

count_hedging_phrases(text: str) -> float

Count hedging phrases and return normalized ratio.

Hedging phrases indicate uncertainty or lack of confidence in statements.

Parameters:

- text (str, required): Text to analyze

Returns:

- float: Ratio of hedging phrases to total words (0.0 to 1.0)

Raises:

- ValueError: If text is not a string

Source code in src/rotalabs_probe/utils/feature_extraction.py
def count_hedging_phrases(text: str) -> float:
    """Count hedging phrases and return normalized ratio.

    Hedging phrases indicate uncertainty or lack of confidence in statements.

    Args:
        text: Text to analyze

    Returns:
        Ratio of hedging phrases to total words (0.0 to 1.0)

    Raises:
        ValueError: If text is not a string
    """
    if not isinstance(text, str):
        raise ValueError("text must be a string")

    if not text:
        return 0.0

    # Comprehensive hedging patterns
    hedging_pattern = re.compile(
        r"\b("
        # Uncertainty markers
        r"i think|i believe|i guess|i suppose|i assume|i suspect|"
        r"i feel|in my opinion|to my knowledge|as far as i know|"
        # Modal verbs
        r"might|may|could|would|should|can|possibly|probably|"
        r"perhaps|maybe|likely|unlikely|"
        # Hedging adjectives/adverbs
        r"seems?|appears?|suggests?|indicates?|implies?|tends? to|"
        r"somewhat|rather|quite|fairly|relatively|comparatively|"
        # Uncertainty expressions
        r"not (?:entirely |completely |totally )?(?:sure|certain|confident)|"
        r"uncertain|unclear|ambiguous|tentative|provisional|"
        # Qualification
        r"generally|usually|typically|normally|often|sometimes|occasionally|"
        r"in (?:some|many|most) cases|to (?:some|a certain) (?:extent|degree)|"
        # Approximation
        r"approximately|roughly|about|around|nearly|almost|close to"
        r")\b",
        re.IGNORECASE,
    )

    words = text.split()
    if not words:
        return 0.0

    matches = hedging_pattern.findall(text)
    return float(len(matches)) / len(words)
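
For instance (the match counts follow from the pattern above; a multi-word phrase counts as one match, so the ratio is a rough density rather than an exact word fraction):

from rotalabs_probe.utils.feature_extraction import count_hedging_phrases

ratio = count_hedging_phrases("I think it might be roughly correct.")
# 3 matches ("i think", "might", "roughly") over 7 words, about 0.43
print(f"{ratio:.2f}")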

detect_meta_commentary(text: str) -> Dict[str, Any]

Detect patterns suggesting evaluation awareness or meta-commentary.

Meta-commentary includes references to the evaluation context, testing, or self-reflective statements about the model's own behavior.

Parameters:

- text (str, required): Text to analyze

Returns:

- Dict[str, Any]: Dictionary with:
    - detected: bool indicating if meta-commentary found
    - confidence: float confidence score (0.0 to 1.0)
    - count: int number of meta-commentary patterns found
    - patterns: list of matched patterns

Raises:

- ValueError: If text is not a string

Source code in src/rotalabs_probe/utils/feature_extraction.py
def detect_meta_commentary(text: str) -> Dict[str, Any]:
    """Detect patterns suggesting evaluation awareness or meta-commentary.

    Meta-commentary includes references to the evaluation context, testing,
    or self-reflective statements about the model's own behavior.

    Args:
        text: Text to analyze

    Returns:
        Dictionary with:
            - detected: bool indicating if meta-commentary found
            - confidence: float confidence score (0.0 to 1.0)
            - count: int number of meta-commentary patterns found
            - patterns: list of matched patterns

    Raises:
        ValueError: If text is not a string
    """
    if not isinstance(text, str):
        raise ValueError("text must be a string")

    if not text:
        return {
            "detected": False,
            "confidence": 0.0,
            "count": 0,
            "patterns": [],
        }

    # Meta-commentary patterns
    meta_patterns = [
        # Evaluation awareness
        (
            r"\b(this (?:is|appears to be|seems like) (?:a |an )?(?:test|evaluation|"
            r"assessment|examination|quiz|check))\b",
            1.0,  # High confidence
        ),
        (
            r"\b(you(?:'re| are) (?:testing|evaluating|assessing|checking|examining) "
            r"(?:me|my))\b",
            1.0,
        ),
        (r"\b(being (?:tested|evaluated|assessed|examined|measured))\b", 0.9),
        # Self-awareness of model nature
        (
            r"\b(as an ai|as a language model|i'?m an ai|i am (?:an )?(?:artificial|ai))\b",
            0.7,
        ),
        (
            r"\b(my (?:training|programming|design|architecture|capabilities|"
            r"limitations))\b",
            0.6,
        ),
        # Meta-reasoning
        (r"\b(i (?:notice|observe|recognize|realize) (?:that )?(?:this|you))\b", 0.5),
        (r"\b((?:let me |i'll )(?:think|reason|consider) (?:about|through))\b", 0.4),
        # Context awareness
        (r"\b(in this (?:context|situation|scenario|case))\b", 0.4),
        (r"\b(given (?:this|the) (?:context|situation|prompt))\b", 0.4),
    ]

    matches = []
    total_confidence = 0.0

    text_lower = text.lower()

    for pattern, confidence in meta_patterns:
        found = re.findall(pattern, text_lower, re.IGNORECASE)
        if found:
            matches.extend(found)
            total_confidence += confidence * len(found)

    count = len(matches)

    # Normalize confidence (cap at 1.0)
    normalized_confidence = min(total_confidence / max(count, 1), 1.0) if count > 0 else 0.0

    return {
        "detected": count > 0,
        "confidence": float(normalized_confidence),
        "count": count,
        "patterns": list(set(matches)),  # Unique patterns
    }
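
For example, the text below triggers two of the patterns above (evaluation awareness at weight 1.0 and self-awareness at weight 0.6, so the averaged confidence is 0.8):

from rotalabs_probe.utils.feature_extraction import detect_meta_commentary

result = detect_meta_commentary("This appears to be a test of my capabilities.")
print(result["detected"])    # True
print(result["count"])       # 2
print(result["confidence"])  # 0.8 = (1.0 + 0.6) / 2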

extract_reasoning_depth(cot: str) -> float

Extract reasoning depth from chain-of-thought.

Reasoning depth is estimated by counting:

- Numbered/bulleted steps
- Logical connectors (therefore, thus, because)
- Reasoning verbs (analyze, consider, evaluate)
- Conditional statements (if/then)

Parameters:

- cot (str, required): Chain-of-thought text

Returns:

- float: Estimated reasoning depth score (0.0+)

Raises:

- ValueError: If cot is not a string

Source code in src/rotalabs_probe/utils/feature_extraction.py
def extract_reasoning_depth(cot: str) -> float:
    """Extract reasoning depth from chain-of-thought.

    Reasoning depth is estimated by counting:
    - Numbered/bulleted steps
    - Logical connectors (therefore, thus, because)
    - Reasoning verbs (analyze, consider, evaluate)
    - Conditional statements (if/then)

    Args:
        cot: Chain-of-thought text

    Returns:
        Estimated reasoning depth score (0.0+)

    Raises:
        ValueError: If cot is not a string
    """
    if not isinstance(cot, str):
        raise ValueError("cot must be a string")

    if not cot:
        return 0.0

    depth_score = 0.0

    # Count numbered/bulleted steps
    step_patterns = [
        r"^\s*\d+[\.\)]\s+",  # 1. or 1)
        r"^\s*[a-z][\.\)]\s+",  # a. or a)
        r"^\s*[-\*\+]\s+",  # - or * or +
        r"\b(?:step|point) \d+\b",  # step 1, point 2
        r"\b(?:first|second|third|fourth|fifth|finally|lastly)\b",  # ordinals
    ]

    for pattern in step_patterns:
        matches = re.findall(pattern, cot, re.IGNORECASE | re.MULTILINE)
        depth_score += len(matches) * 0.5

    # Count logical connectors
    logical_patterns = [
        r"\b(because|since|as|given that)\b",  # Reason
        r"\b(therefore|thus|hence|consequently|so)\b",  # Conclusion
        r"\b(however|but|although|though|yet)\b",  # Contrast
        r"\b(moreover|furthermore|additionally|also)\b",  # Addition
    ]

    for pattern in logical_patterns:
        matches = re.findall(pattern, cot, re.IGNORECASE)
        depth_score += len(matches) * 0.3

    # Count reasoning verbs
    reasoning_verbs = re.compile(
        r"\b(analyze|consider|evaluate|assess|examine|think|reason|"
        r"determine|conclude|infer|deduce|derive)\b",
        re.IGNORECASE,
    )
    depth_score += len(reasoning_verbs.findall(cot)) * 0.4

    # Count conditional reasoning
    conditional_pattern = re.compile(
        r"\b(if\b.*?\bthen\b|when\b.*?\bthen\b|given\b.*?\bthen\b)",
        re.IGNORECASE,
    )
    depth_score += len(conditional_pattern.findall(cot)) * 0.6

    # Count questions (indicates exploratory reasoning)
    questions = re.findall(r"\?", cot)
    depth_score += len(questions) * 0.2

    return float(depth_score)
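
A small worked example under the weights shown above (0.5 per step marker, 0.4 per reasoning verb, 0.3 per logical connector):

from rotalabs_probe.utils.feature_extraction import extract_reasoning_depth

cot = "First, consider the input. Because it is even, we therefore halve it."
# "first" (0.5) + "consider" (0.4) + "because" (0.3) + "therefore" (0.3)
print(extract_reasoning_depth(cot))  # 1.5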

compute_kl_divergence(dist1: Dict[str, float], dist2: Dict[str, float], epsilon: float = 1e-10) -> float

Compute Kullback-Leibler divergence between two distributions.

KL(P||Q) measures how much information is lost when Q is used to approximate P. Returns divergence in nats (natural units).

Parameters:

- dist1 (Dict[str, float], required): First distribution (P) as dictionary
- dist2 (Dict[str, float], required): Second distribution (Q) as dictionary
- epsilon (float, default: 1e-10): Small constant to avoid log(0)

Returns:

- float: KL divergence value (0.0+), higher means more divergent

Raises:

- ValueError: If distributions are empty or invalid

Notes:

- Returns 0.0 if distributions are identical
- Handles missing keys by adding epsilon (no error is raised for mismatched keys)
- Normalizes distributions to sum to 1.0
Source code in src/rotalabs_probe/utils/feature_extraction.py
def compute_kl_divergence(
    dist1: Dict[str, float], dist2: Dict[str, float], epsilon: float = 1e-10
) -> float:
    """Compute Kullback-Leibler divergence between two distributions.

    KL(P||Q) measures how much information is lost when Q is used to
    approximate P. Returns divergence in nats (natural units).

    Args:
        dist1: First distribution (P) as dictionary
        dist2: Second distribution (Q) as dictionary
        epsilon: Small constant to avoid log(0) (default: 1e-10)

    Returns:
        KL divergence value (0.0+), higher means more divergent

    Raises:
        ValueError: If distributions are empty or invalid

    Notes:
        - Returns 0.0 if distributions are identical
        - Handles missing keys by adding epsilon
        - Normalizes distributions to sum to 1.0
    """
    if not dist1 or not dist2:
        raise ValueError("Distributions cannot be empty")

    if not isinstance(dist1, dict) or not isinstance(dist2, dict):
        raise ValueError("Distributions must be dictionaries")

    # Get all keys
    all_keys = set(dist1.keys()) | set(dist2.keys())

    if not all_keys:
        raise ValueError("Distributions have no keys")

    # Extract values and add epsilon for missing keys
    p_values = np.array([dist1.get(k, epsilon) for k in all_keys])
    q_values = np.array([dist2.get(k, epsilon) for k in all_keys])

    # Add epsilon to avoid zeros
    p_values = p_values + epsilon
    q_values = q_values + epsilon

    # Normalize to probability distributions
    p_values = p_values / np.sum(p_values)
    q_values = q_values / np.sum(q_values)

    # Compute KL divergence: sum(P * log(P/Q))
    kl_div = np.sum(p_values * np.log(p_values / q_values))

    return float(kl_div)
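
KL divergence is asymmetric, which the following sketch makes concrete (values are approximate; the epsilon smoothing perturbs them negligibly):

from rotalabs_probe.utils.feature_extraction import compute_kl_divergence

p = {"hedge": 0.9, "assert": 0.1}
q = {"hedge": 0.5, "assert": 0.5}

print(compute_kl_divergence(p, q))  # ~0.368 nats
print(compute_kl_divergence(q, p))  # ~0.511 nats, so KL(P||Q) != KL(Q||P)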

compute_js_divergence(dist1: Dict[str, float], dist2: Dict[str, float], epsilon: float = 1e-10) -> float

Compute Jensen-Shannon divergence between two distributions.

JS divergence is a symmetric version of KL divergence:

    JS(P||Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M), where M = 0.5 * (P + Q)

Parameters:

- dist1 (Dict[str, float], required): First distribution as dictionary
- dist2 (Dict[str, float], required): Second distribution as dictionary
- epsilon (float, default: 1e-10): Small constant to avoid log(0)

Returns:

- float: JS divergence value in nats (0.0 to ln 2 ≈ 0.693), 0 means identical

Raises:

- ValueError: If distributions are invalid

Source code in src/rotalabs_probe/utils/feature_extraction.py
def compute_js_divergence(
    dist1: Dict[str, float], dist2: Dict[str, float], epsilon: float = 1e-10
) -> float:
    """Compute Jensen-Shannon divergence between two distributions.

    JS divergence is a symmetric version of KL divergence:
    JS(P||Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M)
    where M = 0.5 * (P + Q)

    Args:
        dist1: First distribution as dictionary
        dist2: Second distribution as dictionary
        epsilon: Small constant to avoid log(0)

    Returns:
        JS divergence value in nats (0.0 to ln 2 ≈ 0.693), 0 means identical

    Raises:
        ValueError: If distributions are invalid
    """
    if not dist1 or not dist2:
        raise ValueError("Distributions cannot be empty")

    # Get all keys
    all_keys = set(dist1.keys()) | set(dist2.keys())

    # Create normalized distributions
    p_values = np.array([dist1.get(k, epsilon) for k in all_keys]) + epsilon
    q_values = np.array([dist2.get(k, epsilon) for k in all_keys]) + epsilon

    p_values = p_values / np.sum(p_values)
    q_values = q_values / np.sum(q_values)

    # Compute midpoint distribution
    m_values = 0.5 * (p_values + q_values)

    # Compute JS divergence
    kl_pm = np.sum(p_values * np.log(p_values / m_values))
    kl_qm = np.sum(q_values * np.log(q_values / m_values))

    js_div = 0.5 * kl_pm + 0.5 * kl_qm

    return float(js_div)
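
Unlike KL, JS divergence is symmetric and bounded, which makes it convenient as a distance-like score between behavioral feature distributions. A quick sketch reusing the distributions from the KL example:

from rotalabs_probe.utils.feature_extraction import compute_js_divergence

p = {"hedge": 0.9, "assert": 0.1}
q = {"hedge": 0.5, "assert": 0.5}

assert abs(compute_js_divergence(p, q) - compute_js_divergence(q, p)) < 1e-9
print(compute_js_divergence(p, q))  # ~0.10 nats, well below the ln 2 bound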

normalize_distribution(dist: Dict[str, float]) -> Dict[str, float]

Normalize a distribution to sum to 1.0.

Parameters:

- dist (Dict[str, float], required): Distribution dictionary

Returns:

- Dict[str, float]: Normalized distribution

Raises:

- ValueError: If distribution is empty or has no positive values

Source code in src/rotalabs_probe/utils/feature_extraction.py
def normalize_distribution(dist: Dict[str, float]) -> Dict[str, float]:
    """Normalize a distribution to sum to 1.0.

    Args:
        dist: Distribution dictionary

    Returns:
        Normalized distribution

    Raises:
        ValueError: If distribution is empty or has no positive values
    """
    if not dist:
        raise ValueError("Distribution cannot be empty")

    total = sum(dist.values())

    if total <= 0:
        raise ValueError("Distribution must have positive values")

    return {k: v / total for k, v in dist.items()}
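
For example, raw counts normalize to probabilities:

from rotalabs_probe.utils.feature_extraction import normalize_distribution

counts = {"hedge": 2.0, "assert": 3.0}
print(normalize_distribution(counts))  # {'hedge': 0.4, 'assert': 0.6}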

cosine_similarity(vec1: Dict[str, float], vec2: Dict[str, float]) -> float

Compute cosine similarity between two feature vectors.

Parameters:

- vec1 (Dict[str, float], required): First feature vector as dictionary
- vec2 (Dict[str, float], required): Second feature vector as dictionary

Returns:

- float: Cosine similarity (-1.0 to 1.0), 1.0 means identical direction

Raises:

- ValueError: If vectors are empty or invalid

Source code in src/rotalabs_probe/utils/feature_extraction.py
def cosine_similarity(vec1: Dict[str, float], vec2: Dict[str, float]) -> float:
    """Compute cosine similarity between two feature vectors.

    Args:
        vec1: First feature vector as dictionary
        vec2: Second feature vector as dictionary

    Returns:
        Cosine similarity (-1.0 to 1.0), 1.0 means identical direction

    Raises:
        ValueError: If vectors are empty or invalid
    """
    if not vec1 or not vec2:
        raise ValueError("Vectors cannot be empty")

    # Get all keys
    all_keys = set(vec1.keys()) | set(vec2.keys())

    if not all_keys:
        raise ValueError("Vectors have no keys")

    # Create aligned vectors
    v1 = np.array([vec1.get(k, 0.0) for k in all_keys])
    v2 = np.array([vec2.get(k, 0.0) for k in all_keys])

    # Compute cosine similarity
    norm1 = np.linalg.norm(v1)
    norm2 = np.linalg.norm(v2)

    if norm1 == 0 or norm2 == 0:
        return 0.0

    similarity = np.dot(v1, v2) / (norm1 * norm2)

    return float(similarity)
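
A final sketch ties the utilities together: extract features from two responses and compare them as sparse vectors (missing keys are treated as 0.0, per the source above):

from rotalabs_probe.utils.feature_extraction import (
    cosine_similarity,
    extract_behavioral_features,
)

baseline = extract_behavioral_features("The answer is 42. No caveats apply.")
probe = extract_behavioral_features(
    "I think the answer might be 42, but I'm not certain."
)

# Near 1.0 for similar behavior; hedging shifts the probe vector away.
print(cosine_similarity(baseline, probe))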