Utilities Module
The rotalabs_audit.utils module provides common utility functions for text processing, ID generation, hashing, pattern extraction, and similarity calculation.
Text Processing
Functions for cleaning, normalizing, and manipulating text.
clean_text
Clean and normalize text by removing extra whitespace.
clean_text(text)
Removes extra whitespace, normalizes line endings, and strips leading/trailing whitespace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to clean. | required |
Returns:
| Type | Description |
|---|---|
| str | Cleaned and normalized text. |
Example
>>> clean_text(" Hello World ")
'Hello World'
>>> clean_text("Line1\n\n\nLine2")
'Line1\nLine2'
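The cleaning steps described above can be sketched in a few lines. This is a minimal illustration, not the module's actual implementation: `clean_text_sketch` is a hypothetical name, and the exact rules (for example, collapsing blank lines into a single newline) are assumptions inferred from the documented examples.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Hypothetical stand-in that mimics the documented clean_text behavior."""
    # Normalize Windows (\r\n) and bare \r line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a space on either side of a newline, then collapse
    # consecutive newlines (assumption based on the documented example).
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{2,}", "\n", text)
    # Strip leading and trailing whitespace.
    return text.strip()


print(repr(clean_text_sketch("  Hello   World  ")))
print(repr(clean_text_sketch("Line1\n\n\nLine2")))
```

Normalizing line endings first keeps the later patterns simple, since they only need to handle `\n`.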
truncate_text
Truncate text with ellipsis, attempting to break at word boundaries.
truncate_text(text, max_length=100, suffix='...')
Truncates text to the specified maximum length, adding a suffix if truncation occurs. Attempts to break at word boundaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to truncate. | required |
| max_length | int | Maximum length including suffix (default: 100). | 100 |
| suffix | str | String to append if truncated (default: "..."). | '...' |
Returns:
| Type | Description |
|---|---|
| str | Truncated text with suffix if needed. |
Example
>>> truncate_text("Hello World", max_length=8)
'Hello...'
>>> truncate_text("Hi", max_length=10)
'Hi'
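One way to get word-boundary truncation where `max_length` includes the suffix is sketched below. `truncate_text_sketch` is a hypothetical name, and the fallback for a single long word (cutting mid-word) is an assumption, since the documentation does not say what happens when no boundary is available.

```python
def truncate_text_sketch(text: str, max_length: int = 100, suffix: str = "...") -> str:
    """Hypothetical stand-in that mimics the documented truncate_text behavior."""
    # Short enough already: return unchanged, no suffix.
    if len(text) <= max_length:
        return text
    # Characters available for the text once the suffix is accounted for.
    budget = max_length - len(suffix)
    if budget <= 0:
        return suffix[:max_length]
    cut = text[:budget]
    # If the cut lands in the middle of a word, back up to the last space.
    if not text[budget].isspace():
        space = cut.rfind(" ")
        if space > 0:
            cut = cut[:space]
    return cut.rstrip() + suffix


print(truncate_text_sketch("Hello World", max_length=8))
print(truncate_text_sketch("Hi", max_length=10))
```

Reserving room for the suffix before cutting is what keeps the result within `max_length`.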
split_sentences
Split text into sentences using sentence-ending punctuation.
split_sentences(text)
Uses regex to split text at sentence boundaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to split into sentences. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List of sentence strings. |
Example
>>> sentences = split_sentences("Hello! How are you? I'm fine.")
>>> len(sentences)
3
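A regex-based splitter of this kind can be sketched with a lookbehind on sentence-ending punctuation. The pattern and the name `split_sentences_sketch` are assumptions for illustration; the module's actual regex is not shown in this reference.

```python
import re
from typing import List


def split_sentences_sketch(text: str) -> List[str]:
    """Hypothetical stand-in: split after '.', '!' or '?' followed by whitespace."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input is empty.
    return [part for part in parts if part]


print(split_sentences_sketch("Hello! How are you? I'm fine."))
```

A simple pattern like this also splits after abbreviations, so "Dr. Smith arrived." yields two pieces. Whether the real function handles such cases is not stated here.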
Pattern Matching
Functions for extracting structured content from text using regex patterns.
find_all_matches
Find all matches of a regex pattern in text.
find_all_matches(pattern, text, flags=re.IGNORECASE)
Wrapper around re.findall with sensible defaults and error handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pattern | str | The regex pattern to search for. | required |
| text | str | The text to search in. | required |
| flags | int | Regex flags (default: re.IGNORECASE). | IGNORECASE |
Returns:
| Type | Description |
|---|---|
| List[str] | List of all matches (or capture groups if pattern has groups). |
Example
>>> matches = find_all_matches(r"\b(\w+ing)\b", "Running and jumping")
>>> matches
['Running', 'jumping']
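A thin wrapper around `re.findall` along these lines might look like the sketch below. The name `find_all_matches_sketch` is hypothetical, and returning an empty list for an invalid pattern is an assumption: the reference mentions error handling but does not say what form it takes.

```python
import re
from typing import List


def find_all_matches_sketch(pattern: str, text: str, flags: int = re.IGNORECASE) -> List[str]:
    """Hypothetical stand-in for a re.findall wrapper with error handling."""
    try:
        return re.findall(pattern, text, flags)
    except re.error:
        # Assumption: an invalid pattern yields no matches instead of raising.
        return []


print(find_all_matches_sketch(r"\b(\w+ing)\b", "Running and jumping"))
print(find_all_matches_sketch("(", "abc"))
```

Because this delegates to `re.findall`, a pattern with one capture group returns the group contents, and a pattern with several groups returns a list of tuples, not strings.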
extract_numbered_list
Extract items from a numbered list (1., 2., etc.).
extract_numbered_list(text)
Parses text to find numbered list items and returns their content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text containing a numbered list. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List of item texts without the numbers. |
Example
>>> text = "1. First item\n2. Second item\n3. Third item"
>>> items = extract_numbered_list(text)
>>> items
['First item', 'Second item', 'Third item']
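Numbered-list extraction can be sketched with a single multiline regex. The pattern below is an assumption (it accepts both `1.` and `1)` markers, which the reference does not confirm), and `extract_numbered_list_sketch` is a hypothetical name.

```python
import re
from typing import List

# One item per line: optional indent, digits, '.' or ')', then the item text.
NUMBERED_ITEM = re.compile(r"^[ \t]*\d+[.)][ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_numbered_list_sketch(text: str) -> List[str]:
    """Hypothetical stand-in: return item texts with numbering removed."""
    return NUMBERED_ITEM.findall(text)


text = "1. First item\n2. Second item\n3. Third item"
print(extract_numbered_list_sketch(text))
```

With `re.MULTILINE`, `^` and `$` anchor at each line, so lines that do not start with a number are skipped.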
extract_bullet_list
Extract items from a bullet list (-, *, etc.).
extract_bullet_list(text)
Parses text to find bullet list items and returns their content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text containing a bullet list. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List of item texts without the bullets. |
Example
>>> text = "- First item\n* Second item\n- Third item"
>>> items = extract_bullet_list(text)
>>> items
['First item', 'Second item', 'Third item']
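Bullet extraction follows the same idea with a character class of bullet markers. Which markers the real function accepts beyond `-` and `*` is not documented here, so the set used below (`-`, `*`, `+`, `•`) and the name `extract_bullet_list_sketch` are assumptions.

```python
import re
from typing import List

# Optional indent, one bullet marker, required whitespace, then the item text.
BULLET_ITEM = re.compile(r"^[ \t]*[-*+•][ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_bullet_list_sketch(text: str) -> List[str]:
    """Hypothetical stand-in: return item texts with bullet markers removed."""
    return BULLET_ITEM.findall(text)


text = "- First item\n* Second item\n- Third item"
print(extract_bullet_list_sketch(text))
```

Requiring whitespace after the marker avoids treating hyphenated words or `*emphasis*` at the start of a line as list items.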
ID and Hashing
Functions for generating unique identifiers and content hashes.
generate_id
Generate a unique ID for audit entries.
generate_id(length=8)
Generates a UUID-based identifier truncated to the specified length.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| length | int | Number of characters for the ID (default: 8). | 8 |
Returns:
| Type | Description |
|---|---|
| str | A unique identifier string. |
Example
>>> id1 = generate_id()
>>> len(id1)
8
>>> id2 = generate_id(12)
>>> len(id2)
12
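A truncated UUID identifier can be sketched as follows. Using `uuid4` and its hex form is an assumption (the reference only says "UUID-based"), and `generate_id_sketch` is a hypothetical name.

```python
import uuid


def generate_id_sketch(length: int = 8) -> str:
    """Hypothetical stand-in: first `length` hex characters of a random UUID."""
    # uuid4().hex is a 32-character lowercase hex string, so slicing
    # caps the result at 32 characters even for larger `length` values.
    return uuid.uuid4().hex[:length]


print(len(generate_id_sketch()))
print(len(generate_id_sketch(12)))
```

Truncation trades uniqueness for brevity. Under this sketch, 8 hex characters give about 4.3 billion possible values, and the birthday bound puts a 50% chance of at least one collision at roughly 77,000 IDs, so longer IDs are safer for large audit logs.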
hash_content
Generate SHA-256 hash of content for integrity verification.
hash_content(content)
Creates a deterministic hash of the input content for integrity verification and deduplication.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | str | The content string to hash. | required |
Returns:
| Type | Description |
|---|---|
| str | The SHA-256 hash as a hexadecimal string. |
Example
>>> hash1 = hash_content("Hello, World!")
>>> len(hash1)
64
>>> hash2 = hash_content("Hello, World!")
>>> hash1 == hash2
True
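Hashing a string with SHA-256 requires encoding it to bytes first. The sketch below assumes UTF-8, which this reference does not specify, and `hash_content_sketch` is a hypothetical name.

```python
import hashlib


def hash_content_sketch(content: str) -> str:
    """Hypothetical stand-in: SHA-256 hex digest of a string."""
    # Assumption: the string is encoded as UTF-8 before hashing.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


first = hash_content_sketch("Hello, World!")
second = hash_content_sketch("Hello, World!")
print(len(first), first == second)
```

The encoding matters for integrity checks: hashing the same text under a different encoding produces a different digest, so both sides of a comparison need to agree on it.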
Similarity
Functions for calculating text similarity.
calculate_text_similarity
Calculate Jaccard similarity (word overlap) between two texts.
calculate_text_similarity(text1, text2)
Uses Jaccard similarity (word overlap) to compute a similarity score between 0 and 1.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text1 | str | First text. | required |
| text2 | str | Second text. | required |
Returns:
| Type | Description |
|---|---|
| float | Similarity score between 0.0 (no similarity) and 1.0 (identical). |
Example
>>> calculate_text_similarity("hello world", "hello there")
0.333...
>>> calculate_text_similarity("same text", "same text")
1.0
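Jaccard similarity is the size of the intersection of the two word sets divided by the size of their union. The sketch below works through that arithmetic. Lowercasing, whitespace tokenization, and the score for two empty texts are assumptions, and `jaccard_similarity_sketch` is a hypothetical name.

```python
def jaccard_similarity_sketch(text1: str, text2: str) -> float:
    """Hypothetical stand-in: Jaccard similarity over lowercase word sets."""
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    # Assumption: two empty texts count as identical (avoids dividing by zero).
    if not words1 and not words2:
        return 1.0
    # |intersection| / |union|
    return len(words1 & words2) / len(words1 | words2)


# {"hello"} shared out of {"hello", "world", "there"} gives 1/3.
print(jaccard_similarity_sketch("hello world", "hello there"))
print(jaccard_similarity_sketch("same text", "same text"))
```

Because the comparison works on sets, word order and repeated words do not affect the score: "a a b" and "b a" score 1.0 under this sketch.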