Utilities Module

The rotalabs_audit.utils module provides common utility functions for text processing, ID generation, hashing, pattern extraction, and similarity calculation.

Text Processing

Functions for cleaning, normalizing, and manipulating text.

clean_text

Clean and normalize text by removing extra whitespace.

clean_text(text)

Clean and normalize text.

Removes extra whitespace, normalizes line endings, and strips leading/trailing whitespace.

Parameters:

Name  Type  Description         Default
text  str   The text to clean.  required

Returns:

Type  Description
str   Cleaned and normalized text.

Example

>>> clean_text("  Hello   World  ")
'Hello World'
>>> clean_text("Line1\n\n\nLine2")
'Line1\nLine2'
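
This page does not show how clean_text is implemented. As a rough illustration of the behavior described above, here is a minimal sketch built on the standard re module. The name clean_text_sketch and the specific regexes are assumptions made for illustration, not the library's source.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Hypothetical stand-in for clean_text; not the library's code."""
    # Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a space sitting directly before or after a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse consecutive newlines into one (assumed from the example above).
    text = re.sub(r"\n+", "\n", text)
    # Strip leading and trailing whitespace.
    return text.strip()


print(repr(clean_text_sketch("  Hello   World  ")))
print(repr(clean_text_sketch("Line1\r\n\n\nLine2")))
```

Collapsing blank lines to a single newline follows the documented example; the real function may treat paragraph breaks differently.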

truncate_text

Truncate text with ellipsis, attempting to break at word boundaries.

truncate_text(text, max_length=100, suffix='...')

Truncate text with ellipsis.

Truncates text to the specified maximum length, adding a suffix if truncation occurs. Attempts to break at word boundaries.

Parameters:

Name        Type  Description                                      Default
text        str   The text to truncate.                            required
max_length  int   Maximum length including suffix (default: 100).  100
suffix      str   String to append if truncated (default: "...").  '...'

Returns:

Type  Description
str   Truncated text with suffix if needed.

Example

>>> truncate_text("Hello World", max_length=8)
'Hello...'
>>> truncate_text("Hi", max_length=10)
'Hi'
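
To make the word-boundary behavior concrete, the following is a minimal sketch of one way such truncation could work. The function name truncate_text_sketch and the back-up-to-last-space rule are assumptions; the library may choose its break point differently.

```python
def truncate_text_sketch(text: str, max_length: int = 100, suffix: str = "...") -> str:
    """Hypothetical stand-in for truncate_text; not the library's code."""
    # Short enough already: return unchanged, without a suffix.
    if len(text) <= max_length:
        return text
    # Characters of the original text that fit once the suffix is appended.
    keep = max(max_length - len(suffix), 0)
    cut = text[:keep]
    # If the cut lands in the middle of a word, back up to the last space.
    if keep < len(text) and not text[keep].isspace() and " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip() + suffix


print(truncate_text_sketch("Hello World", max_length=8))
print(truncate_text_sketch("Hi", max_length=10))
```

In this sketch a single word longer than max_length is cut mid-word, because there is no earlier space to fall back to.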

split_sentences

Split text into sentences using sentence-ending punctuation.

split_sentences(text)

Split text into sentences.

Uses regex to split text at sentence boundaries.

Parameters:

Name  Type  Description                        Default
text  str   The text to split into sentences.  required

Returns:

Type       Description
List[str]  List of sentence strings.

Example

>>> sentences = split_sentences("Hello! How are you? I'm fine.")
>>> len(sentences)
3
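
The exact regex used by split_sentences is not given here. A minimal sketch of the general approach, assuming a split after ".", "!" or "?" when whitespace follows (the name split_sentences_sketch is hypothetical):

```python
import re
from typing import List


def split_sentences_sketch(text: str) -> List[str]:
    """Hypothetical stand-in for split_sentences; not the library's code."""
    # The lookbehind keeps the punctuation attached to its sentence;
    # the whitespace between sentences is consumed by the split.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Filter out empty strings, e.g. when the input is empty.
    return [part for part in parts if part]


sentences = split_sentences_sketch("Hello! How are you? I'm fine.")
print(sentences)
```

A pattern this simple also splits after abbreviations such as "Dr." or "e.g.", so it is only suitable where that is acceptable.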


Pattern Matching

Functions for extracting structured content from text using regex patterns.

find_all_matches

Find all matches of a regex pattern in text.

find_all_matches(pattern, text, flags=re.IGNORECASE)

Find all matches of a regex pattern.

Thin wrapper around re.findall that defaults to case-insensitive matching (re.IGNORECASE) and adds error handling.

Parameters:

Name     Type  Description                            Default
pattern  str   The regex pattern to search for.       required
text     str   The text to search in.                 required
flags    int   Regex flags (default: re.IGNORECASE).  re.IGNORECASE

Returns:

Type       Description
List[str]  List of all matches (or capture groups if pattern has groups).

Example

>>> matches = find_all_matches(r"\b(\w+ing)\b", "Running and jumping")
>>> matches
['Running', 'jumping']
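
The kind of wrapper described above can be sketched in a few lines. The function name find_all_matches_sketch is hypothetical, and returning an empty list for an invalid pattern is an assumption: this page does not say what the real function does on error.

```python
import re
from typing import List


def find_all_matches_sketch(pattern: str, text: str, flags: int = re.IGNORECASE) -> List[str]:
    """Hypothetical stand-in for find_all_matches; not the library's code."""
    try:
        return re.findall(pattern, text, flags)
    except re.error:
        # Assumption: an invalid pattern yields no matches instead of raising.
        return []


print(find_all_matches_sketch(r"\b(\w+ing)\b", "Running and jumping"))
print(find_all_matches_sketch(r"(", "unbalanced pattern"))
```

As with re.findall itself, a pattern with one capture group returns the group contents, and a pattern with several groups returns a list of tuples.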

extract_numbered_list

Extract items from a numbered list (1., 2., etc.).

extract_numbered_list(text)

Parses text to find numbered list items and returns their content.

Parameters:

Name  Type  Description                           Default
text  str   The text containing a numbered list.  required

Returns:

Type       Description
List[str]  List of item texts without the numbers.

Example

>>> text = "1. First item\n2. Second item\n3. Third item"
>>> items = extract_numbered_list(text)
>>> items
['First item', 'Second item', 'Third item']

extract_bullet_list

Extract items from a bullet list (-, *, etc.).

extract_bullet_list(text)

Parses text to find bullet list items and returns their content.

Parameters:

Name  Type  Description                         Default
text  str   The text containing a bullet list.  required

Returns:

Type       Description
List[str]  List of item texts without the bullets.

Example

>>> text = "- First item\n* Second item\n- Third item"
>>> items = extract_bullet_list(text)
>>> items
['First item', 'Second item', 'Third item']
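
Both list extractors can be approximated with one multiline regex each. The sketch below is illustrative only: the function names are hypothetical, and the accepted markers ("1." for numbered items; "-", "*", "+" or "•" for bullets) are assumptions, since this page only says "1., 2., etc." and "-, *, etc.".

```python
import re
from typing import List

# Assumed item formats; the library's actual patterns are not documented here.
_NUMBERED_ITEM = re.compile(r"^[ \t]*\d+\.[ \t]+(.+?)[ \t]*$", re.MULTILINE)
_BULLET_ITEM = re.compile(r"^[ \t]*[-*+•][ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_numbered_list_sketch(text: str) -> List[str]:
    """Return the text of each line that looks like '1. item'."""
    return _NUMBERED_ITEM.findall(text)


def extract_bullet_list_sketch(text: str) -> List[str]:
    """Return the text of each line that starts with a bullet marker."""
    return _BULLET_ITEM.findall(text)


print(extract_numbered_list_sketch("1. First item\n2. Second item\n3. Third item"))
print(extract_bullet_list_sketch("- First item\n* Second item\n- Third item"))
```

Requiring whitespace after the marker keeps lines such as "---" or "*emphasis*" from being read as list items.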


ID and Hashing

Functions for generating unique identifiers and content hashes.

generate_id

Generate a unique ID for audit entries.

generate_id(length=8)

Generates a UUID-based identifier truncated to the specified length.

Parameters:

Name    Type  Description                                    Default
length  int   Number of characters for the ID (default: 8).  8

Returns:

Type  Description
str   A unique identifier string.

Example

>>> id1 = generate_id()
>>> len(id1)
8
>>> id2 = generate_id(12)
>>> len(id2)
12
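
A UUID-based identifier truncated to a given length could be produced as in the minimal sketch below. The name generate_id_sketch is hypothetical, and the use of uuid4 with its hex form is an assumption: this page says only that the ID is "UUID-based".

```python
import uuid


def generate_id_sketch(length: int = 8) -> str:
    """Hypothetical stand-in for generate_id; not the library's code."""
    # uuid4().hex is 32 lowercase hexadecimal characters with no dashes.
    # Assumption: the ID is simply the first `length` of those characters.
    return uuid.uuid4().hex[:length]


short_id = generate_id_sketch()
longer_id = generate_id_sketch(12)
print(short_id, len(short_id))
print(longer_id, len(longer_id))
```

Under this approach an 8-character ID carries 32 bits of randomness, so collisions become likely once tens of thousands of IDs exist. A longer length is safer for large audit logs, and requests above 32 characters would still return only 32.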

hash_content

Generate SHA-256 hash of content for integrity verification.

hash_content(content)

Generate SHA-256 hash of content.

Creates a deterministic hash of the input content for integrity verification and deduplication.

Parameters:

Name Type Description Default
content str

The content string to hash.

required

Returns:

Type Description
str

The SHA-256 hash as a hexadecimal string.

Example

>>> hash1 = hash_content("Hello, World!")
>>> len(hash1)
64
>>> hash2 = hash_content("Hello, World!")
>>> hash1 == hash2
True
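
Hashing a string with SHA-256 takes only a few lines with the standard hashlib module. The sketch below assumes the content is encoded as UTF-8 before hashing, which this page does not confirm; hash_content_sketch is a hypothetical name.

```python
import hashlib


def hash_content_sketch(content: str) -> str:
    """Hypothetical stand-in for hash_content; not the library's code."""
    # Assumption: the string is encoded as UTF-8 before hashing.
    digest = hashlib.sha256(content.encode("utf-8"))
    # hexdigest() returns 64 lowercase hexadecimal characters.
    return digest.hexdigest()


first = hash_content_sketch("Hello, World!")
second = hash_content_sketch("Hello, World!")
print(len(first), first == second)
```

Because the digest depends on the exact bytes, strings that differ only in whitespace or Unicode normalization hash differently. Normalizing the text first makes deduplication less brittle.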


Similarity

Functions for calculating text similarity.

calculate_text_similarity

Calculate Jaccard similarity (word overlap) between two texts.

calculate_text_similarity(text1, text2)

Calculate similarity between two texts.

Uses Jaccard similarity (word overlap) to compute a similarity score between 0 and 1.

Parameters:

Name   Type  Description   Default
text1  str   First text.   required
text2  str   Second text.  required

Returns:

Type   Description
float  Similarity score between 0.0 (no similarity) and 1.0 (identical).

Example

>>> calculate_text_similarity("hello world", "hello there")
0.333...
>>> calculate_text_similarity("same text", "same text")
1.0
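
Jaccard similarity is the size of the intersection of two sets divided by the size of their union. For "hello world" and "hello there" the shared word set is {hello} and the union is {hello, world, there}, giving 1/3. A minimal sketch follows. The name jaccard_similarity_sketch is hypothetical, and lowercasing, whitespace tokenization and the handling of two empty inputs are assumptions, not documented behavior.

```python
def jaccard_similarity_sketch(text1: str, text2: str) -> float:
    """Hypothetical stand-in for calculate_text_similarity; not the library's code."""
    # Assumption: words are lowercased and split on whitespace.
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    union = words1 | words2
    if not union:
        # Assumption: two empty texts count as identical.
        return 1.0
    # |A intersect B| / |A union B|
    return len(words1 & words2) / len(union)


print(jaccard_similarity_sketch("hello world", "hello there"))
print(jaccard_similarity_sketch("same text", "same text"))
```

Because the score uses sets of words, it ignores word order and repetition: "a b" and "b a a" score 1.0 in this sketch.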