Utilities Module

The `rotalabs_audit.utils` module provides common utility functions for text processing, ID generation, hashing, pattern extraction, and similarity calculation.

Text Processing

Functions for cleaning, normalizing, and manipulating text.

clean_text

`clean_text(text)`

Clean and normalize text: remove extra whitespace, normalize line endings, and strip leading and trailing whitespace.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to clean.	required

Returns:

Type	Description
`str`	Cleaned and normalized text.

Example

    >>> clean_text(" Hello World ")
    'Hello World'
    >>> clean_text("Line1\n\n\nLine2")
    'Line1\nLine2'
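The steps described above (normalize line endings, collapse whitespace, strip) can be sketched as follows. This is a minimal illustration, not the library's actual code: the name `clean_text_sketch` is hypothetical, and collapsing repeated newlines into a single newline is inferred from the example output.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Hypothetical stand-in for clean_text, for illustration only."""
    # Normalize Windows (\r\n) and old Mac (\r) line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a space that touches a newline, then collapse repeated newlines.
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{2,}", "\n", text)
    # Remove leading and trailing whitespace.
    return text.strip()


print(repr(clean_text_sketch("  Hello   World  ")))   # 'Hello World'
print(repr(clean_text_sketch("Line1\r\n\n\nLine2")))  # 'Line1\nLine2'
```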

truncate_text

`truncate_text(text, max_length=100, suffix='...')`

Truncate text to at most `max_length` characters, appending `suffix` when truncation occurs. The limit includes the suffix, and the cut is placed at a word boundary where possible.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to truncate.	required
`max_length`	`int`	Maximum length including suffix (default: 100).	`100`
`suffix`	`str`	String to append if truncated (default: "...").	`'...'`

Returns:

Type	Description
`str`	Truncated text with suffix if needed.

Example

    >>> truncate_text("Hello World", max_length=8)
    'Hello...'
    >>> truncate_text("Hi", max_length=10)
    'Hi'
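One way to get this behaviour is to reserve room for the suffix first and then back up to the previous space if the cut would land inside a word. The sketch below is an assumption about how such a function might work, not the library's implementation; `truncate_text_sketch` is a hypothetical name.

```python
def truncate_text_sketch(text: str, max_length: int = 100, suffix: str = "...") -> str:
    """Hypothetical stand-in for truncate_text, for illustration only."""
    if len(text) <= max_length:
        return text  # Nothing to truncate, so no suffix is added.
    # Characters available for real content once the suffix is accounted for.
    budget = max_length - len(suffix)
    if budget <= 0:
        # Assumption: when the suffix alone does not fit, return a clipped suffix.
        return suffix[: max(max_length, 0)]
    head = text[:budget]
    # Back up to the last space only when the cut lands in the middle of a word.
    if not text[budget].isspace() and " " in head:
        head = head[: head.rfind(" ")]
    return head.rstrip() + suffix


print(truncate_text_sketch("Hello World", max_length=8))   # Hello...
print(truncate_text_sketch("Hello World", max_length=10))  # Hello...
print(truncate_text_sketch("Hi", max_length=10))           # Hi
```

With `max_length=10` the raw cut would be `"Hello W"`, so the sketch falls back to the previous word boundary and returns a result shorter than the limit.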

split_sentences

`split_sentences(text)`

Split text into sentences, using a regular expression to find boundaries at sentence-ending punctuation.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to split into sentences.	required

Returns:

Type	Description
`List[str]`	List of sentence strings.

Example

    >>> sentences = split_sentences("Hello! How are you? I'm fine.")
    >>> len(sentences)
    3
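A regex-based splitter of this kind can be sketched with a lookbehind that splits on whitespace following `.`, `!`, or `?`. The pattern and the name `split_sentences_sketch` are assumptions for illustration; the library's actual pattern is not shown in this reference.

```python
import re
from typing import List

# Split on whitespace that directly follows sentence-ending punctuation.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences_sketch(text: str) -> List[str]:
    """Hypothetical stand-in for split_sentences, for illustration only."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    # Drop empty fragments that can appear for blank input.
    return [part.strip() for part in parts if part.strip()]


print(split_sentences_sketch("Hello! How are you? I'm fine."))
# ['Hello!', 'How are you?', "I'm fine."]
```

A purely punctuation-based split like this one also breaks after abbreviations such as "Dr.", which is a common limitation of the approach.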

Pattern Matching

Functions for extracting structured content from text using regex patterns.

find_all_matches

`find_all_matches(pattern, text, flags=re.IGNORECASE)`

Find all matches of a regex pattern in text. This is a wrapper around `re.findall` that defaults to case-insensitive matching and adds error handling.

Parameters:

Name	Type	Description	Default
`pattern`	`str`	The regex pattern to search for.	required
`text`	`str`	The text to search in.	required
`flags`	`int`	Regex flags (default: re.IGNORECASE).	`re.IGNORECASE`

Returns:

Type	Description
`List[str]`	List of all matches (or capture groups if pattern has groups).

Example

    >>> matches = find_all_matches(r"\b(\w+ing)\b", "Running and jumping")
    >>> matches
    ['Running', 'jumping']
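A wrapper of this shape might look like the sketch below. The form of the error handling is not specified in this reference, so returning an empty list for an invalid pattern is an assumption, and `find_all_matches_sketch` is a hypothetical name.

```python
import re
from typing import List


def find_all_matches_sketch(pattern: str, text: str, flags: int = re.IGNORECASE) -> List[str]:
    """Hypothetical stand-in for find_all_matches, for illustration only."""
    try:
        # With one capture group, re.findall returns the group's text
        # instead of the whole match.
        return re.findall(pattern, text, flags)
    except re.error:
        # Assumption: an invalid pattern yields no matches rather than raising.
        return []


print(find_all_matches_sketch(r"\b(\w+ing)\b", "Running and jumping"))
# ['Running', 'jumping']
print(find_all_matches_sketch("hello", "HELLO hello"))  # ['HELLO', 'hello']
print(find_all_matches_sketch("(", "anything"))         # []
```

Note that `re.findall` returns a list of tuples when the pattern has more than one capture group, so the `List[str]` return type holds only for patterns with zero or one group.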

extract_numbered_list

`extract_numbered_list(text)`

Extract items from a numbered list (1., 2., etc.). Parses the text for numbered list items and returns the content of each item.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text containing a numbered list.	required

Returns:

Type	Description
`List[str]`	List of item texts without the numbers.

Example

    >>> text = "1. First item\n2. Second item\n3. Third item"
    >>> items = extract_numbered_list(text)
    >>> items
    ['First item', 'Second item', 'Third item']
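The parsing described above can be approximated with a multiline regex that matches a number, a period, and the rest of the line. This is a sketch under assumptions: the name `extract_numbered_list_sketch` is hypothetical, and only the `1.` style of marker is handled.

```python
import re
from typing import List

# A line that starts with digits and a period, followed by the item text.
_NUMBERED_ITEM = re.compile(r"^[ \t]*\d+\.[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_numbered_list_sketch(text: str) -> List[str]:
    """Hypothetical stand-in for extract_numbered_list, for illustration only."""
    # findall returns the single capture group: the item text without its number.
    return _NUMBERED_ITEM.findall(text)


text = "Steps:\n1. First item\n2. Second item\nnot an item\n3. Third item"
print(extract_numbered_list_sketch(text))
# ['First item', 'Second item', 'Third item']
```

Requiring whitespace after the period keeps lines such as "3.14 is pi" from being read as list items.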

extract_bullet_list

`extract_bullet_list(text)`

Extract items from a bullet list (-, *, etc.). Parses the text for bullet list items and returns the content of each item.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text containing a bullet list.	required

Returns:

Type	Description
`List[str]`	List of item texts without the bullets.

Example

    >>> text = "- First item\n* Second item\n- Third item"
    >>> items = extract_bullet_list(text)
    >>> items
    ['First item', 'Second item', 'Third item']
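Bullet extraction follows the same idea with a character class of bullet markers. In this sketch the accepted markers (`-`, `*`, `+`) are an assumption, since the reference only says "-, *, etc.", and `extract_bullet_list_sketch` is a hypothetical name.

```python
import re
from typing import List

# A line that starts with a bullet marker and whitespace, followed by the item text.
_BULLET_ITEM = re.compile(r"^[ \t]*[-*+][ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_bullet_list_sketch(text: str) -> List[str]:
    """Hypothetical stand-in for extract_bullet_list, for illustration only."""
    # findall returns the single capture group: the item text without its bullet.
    return _BULLET_ITEM.findall(text)


text = "- First item\n* Second item\n  - Third item"
print(extract_bullet_list_sketch(text))
# ['First item', 'Second item', 'Third item']
```

Because whitespace is required after the marker, lines such as `**bold** text` or a `---` rule are not treated as bullets.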

ID and Hashing

Functions for generating unique identifiers and content hashes.

generate_id

`generate_id(length=8)`

Generate a unique ID for audit entries: a UUID-based identifier truncated to `length` characters.

Parameters:

Name	Type	Description	Default
`length`	`int`	Number of characters for the ID (default: 8).	`8`

Returns:

Type	Description
`str`	A unique identifier string.

Example

    >>> id1 = generate_id()
    >>> len(id1)
    8
    >>> id2 = generate_id(12)
    >>> len(id2)
    12
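A truncated UUID can be produced with the standard `uuid` module, as sketched below. Whether the library uses `uuid4` and the hyphen-free hex form is an assumption, and `generate_id_sketch` is a hypothetical name.

```python
import uuid


def generate_id_sketch(length: int = 8) -> str:
    """Hypothetical stand-in for generate_id, for illustration only."""
    # uuid4().hex is 32 random hexadecimal characters with no hyphens;
    # slicing keeps the first `length` of them.
    return uuid.uuid4().hex[:length]


short_id = generate_id_sketch()
longer_id = generate_id_sketch(12)
print(len(short_id), len(longer_id))  # 8 12
```

Under these assumptions an 8-character ID carries 32 bits of randomness, so collisions become likely once tens of thousands of IDs exist; a larger `length` is the safer choice for big audit logs. A `length` above 32 would still return only 32 characters in this sketch.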

hash_content

`hash_content(content)`

Generate a SHA-256 hash of content. The hash is deterministic, so it can be used for integrity verification and deduplication.

Parameters:

Name	Type	Description	Default
`content`	`str`	The content string to hash.	required

Returns:

Type	Description
`str`	The SHA-256 hash as a hexadecimal string.

Example

    >>> hash1 = hash_content("Hello, World!")
    >>> len(hash1)
    64
    >>> hash2 = hash_content("Hello, World!")
    >>> hash1 == hash2
    True
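Hashing a string with SHA-256 requires encoding it to bytes first. The sketch below assumes UTF-8, which this reference does not confirm; `hash_content_sketch` is a hypothetical name.

```python
import hashlib


def hash_content_sketch(content: str) -> str:
    """Hypothetical stand-in for hash_content, for illustration only."""
    # Assumption: the string is encoded as UTF-8 before hashing.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


digest = hash_content_sketch("Hello, World!")
print(len(digest))                                     # 64
print(digest == hash_content_sketch("Hello, World!"))  # True
print(digest == hash_content_sketch("Hello, World?"))  # False
```

The encoding matters for integrity checks: the same text hashed under a different encoding can give a different digest, so both sides of a comparison need to agree on it.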

Similarity

Functions for calculating text similarity.

calculate_text_similarity

`calculate_text_similarity(text1, text2)`

Calculate the similarity between two texts as a Jaccard similarity (word overlap) score between 0 and 1.

Parameters:

Name	Type	Description	Default
`text1`	`str`	First text.	required
`text2`	`str`	Second text.	required

Returns:

Type	Description
`float`	Similarity score between 0.0 (no words in common) and 1.0 (identical sets of words).

Example

    >>> calculate_text_similarity("hello world", "hello there")
    0.333...
    >>> calculate_text_similarity("same text", "same text")
    1.0
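Jaccard similarity is the size of the intersection of the two word sets divided by the size of their union. The sketch below shows the calculation under assumptions this reference does not confirm: words are split on whitespace, case is ignored, and two empty texts score 1.0. The name `jaccard_similarity_sketch` is hypothetical.

```python
def jaccard_similarity_sketch(text1: str, text2: str) -> float:
    """Hypothetical stand-in for calculate_text_similarity, for illustration only."""
    # Assumption: case-insensitive comparison of whitespace-separated words.
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    union = words1 | words2
    if not union:
        # Assumption: two empty texts count as identical; this also avoids 0/0.
        return 1.0
    return len(words1 & words2) / len(union)


# {"hello"} shared out of {"hello", "world", "there"} gives 1/3.
print(round(jaccard_similarity_sketch("hello world", "hello there"), 3))  # 0.333
print(jaccard_similarity_sketch("same text", "same text"))                # 1.0
```

Because the measure works on sets, word order and repetition do not affect the score: "a b" and "b a a" are rated as identical.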
