Text Transformers

Clean, normalize, and transform text data.

Installation

datacompose add text

API Reference

Extract Functions

text.extract_hex

Extract first hex value from mixed content. Looks for hex with prefix (0x, #) or MAC-address format (XX:XX:XX).

Parameters

Property Type Description
col required
Column
Column containing mixed content
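
To make the matching rule concrete, here is a minimal plain-Python sketch of "first hex value" extraction. The pattern and the name `extract_hex_sketch` are assumptions for illustration; the library operates on columns and its actual pattern may differ.

```python
import re
from typing import Optional

# Hypothetical pattern: 0x- or #-prefixed hex, or colon-separated byte pairs.
HEX_PATTERN = re.compile(
    r"0[xX][0-9a-fA-F]+"                     # e.g. 0x1A2B
    r"|#[0-9a-fA-F]+"                        # e.g. #ff8800
    r"|[0-9a-fA-F]{2}(?::[0-9a-fA-F]{2})+"   # e.g. 00:1A:2B
)


def extract_hex_sketch(value: Optional[str]) -> Optional[str]:
    """Return the first hex-looking token in value, or None if there is none."""
    if value is None:
        return None
    match = HEX_PATTERN.search(value)
    return match.group(0) if match else None
```

A pattern this loose also matches clock times such as 12:30, so a stricter rule may be needed for real data.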

text.extract_base64

Extract base64 from mixed content. Looks for base64 strings that end with = padding or that follow a "base64," prefix.

Parameters

Property Type Description
col required
Column
Column containing mixed content

Validation Functions

text.is_valid_hex

Check if string is valid hexadecimal.

Parameters

Property Type Description
col required
Column
Column containing string to validate

text.is_valid_base64

Check if string is valid base64.

Parameters

Property Type Description
col required
Column
Column containing string to validate
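
One way to picture this check on a single string is strict decoding with the standard library. This is a sketch under assumptions: `is_valid_base64_sketch` is a hypothetical name, and treating empty input as invalid is a choice made here, not documented behavior.

```python
import base64
import binascii
from typing import Optional


def is_valid_base64_sketch(value: Optional[str]) -> bool:
    """True if value decodes as standard base64 with correct padding."""
    if not value:
        # Assumption: None and empty strings count as invalid.
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(value, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False
```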

text.is_valid_url_encoded

Check if string is validly URL-encoded (no malformed percent sequences).

Parameters

Property Type Description
col required
Column
Column containing string to validate
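
"No malformed percent sequences" can be read as: every % is followed by exactly two hex digits. The sketch below encodes that reading for a single string; the function name is hypothetical and the library may apply further rules.

```python
import re
from typing import Optional

# A '%' that is NOT followed by two hex digits is malformed.
MALFORMED_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def is_valid_url_encoded_sketch(value: Optional[str]) -> bool:
    """True if every percent sign starts a well-formed %XX escape."""
    if value is None:
        return False
    return MALFORMED_PERCENT.search(value) is None
```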

text.has_control_characters

Check if string contains control characters (excluding tab/newline/CR).

Parameters

Property Type Description
col required
Column
Column containing string to check

text.has_zero_width_characters

Check if string contains zero-width characters.

Parameters

Property Type Description
col required
Column
Column containing string to check

text.has_non_ascii

Check if string contains non-ASCII characters.

Parameters

Property Type Description
col required
Column
Column containing string to check

text.has_escape_sequences

Check if string contains literal escape sequences (\n, \t, etc.).

Parameters

Property Type Description
col required
Column
Column containing string to check

text.has_url_encoding

Check if string contains URL percent encoding.

Parameters

Property Type Description
col required
Column
Column containing string to check

text.has_html_entities

Check if string contains HTML entities.

Parameters

Property Type Description
col required
Column
Column containing string to check

text.has_ansi_codes

Check if string contains ANSI escape codes.

Parameters

Property Type Description
col required
Column
Column containing string to check

text.has_non_printable

Check if string contains non-printable characters.

Parameters

Property Type Description
col required
Column
Column containing string to check

text.has_accents

Check if string contains accented characters.

Parameters

Property Type Description
col required
Column
Column containing string to check

text.has_unicode_issues

Check if string contains unicode normalization issues. Detects: curly quotes, fancy dashes, special spaces, full-width chars, and combining characters (accents as separate codepoints).

Parameters

Property Type Description
col required
Column
Column containing string to check
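
The categories listed above can be approximated per character with `unicodedata`. The character sets in this sketch are small illustrative samples chosen here, not the library's definitive lists, and `has_unicode_issues_sketch` is a hypothetical name.

```python
import unicodedata
from typing import Optional

# Illustrative samples only; real lists are likely longer.
CURLY_QUOTES = "\u2018\u2019\u201c\u201d"
FANCY_DASHES = "\u2013\u2014"
SPECIAL_SPACES = "\u00a0\u2002\u2003\u2009\u3000"


def has_unicode_issues_sketch(value: Optional[str]) -> bool:
    """True if value contains any character from the sampled problem categories."""
    if not value:
        return False
    for ch in value:
        if ch in CURLY_QUOTES or ch in FANCY_DASHES or ch in SPECIAL_SPACES:
            return True
        if "\uff01" <= ch <= "\uff5e":   # full-width variants of ASCII
            return True
        if unicodedata.combining(ch):    # accent stored as a separate codepoint
            return True
    return False
```

Note that a precomposed "é" (U+00E9) passes this check, while "e" followed by U+0301 does not.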

text.has_whitespace_issues

Check if string has whitespace issues.

Parameters

Property Type Description
col required
Column
Column containing string to check

Utility Functions

text.hex_to_text

Convert hexadecimal string to text.

Parameters

Property Type Description
col required
Column
Column containing hex string

text.text_to_hex

Convert text to hexadecimal string.

Parameters

Property Type Description
col required
Column
Column containing text
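
As a rough picture of the hex/text round trip on plain strings, the standard library offers `bytes.hex()` and `bytes.fromhex()`. UTF-8 is assumed here as the text encoding; the reference above does not state which encoding the library uses, and the function names are hypothetical.

```python
def text_to_hex_sketch(value: str) -> str:
    """Encode text as UTF-8 (an assumption) and render the bytes as lowercase hex."""
    return value.encode("utf-8").hex()


def hex_to_text_sketch(value: str) -> str:
    """Parse a bare hex string (no prefix) and decode the bytes as UTF-8."""
    return bytes.fromhex(value).decode("utf-8")
```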

text.clean_hex

Clean hex string (remove prefix, normalize case, remove separators).

Parameters

Property Type Description
col required
Column
Column containing hex string
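
The three cleaning steps might look like this on a single string. Lowercase output and the separator set (whitespace, colon, hyphen) are assumptions of this sketch; the description above does not say which case or separators the library picks.

```python
import re
from typing import Optional


def clean_hex_sketch(value: Optional[str]) -> Optional[str]:
    """Strip a 0x/# prefix, drop separators, and lowercase the digits."""
    if value is None:
        return None
    text = value.strip()
    text = re.sub(r"^(0[xX]|#)", "", text)   # remove prefix
    text = re.sub(r"[\s:\-]", "", text)      # remove separators (assumed set)
    return text.lower()                       # normalize case (assumed lowercase)
```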

text.decode_base64

Decode base64 string to text.

Parameters

Property Type Description
col required
Column
Column containing base64 string

text.encode_base64

Encode text to base64 string.

Parameters

Property Type Description
col required
Column
Column containing text

text.clean_base64

Clean base64 string (remove whitespace, fix padding).

Parameters

Property Type Description
col required
Column
Column containing base64 string
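
"Fix padding" usually means making the length a multiple of four. A minimal sketch of that idea, with a hypothetical function name and no claim about how the library handles undecodable input:

```python
import re
from typing import Optional


def clean_base64_sketch(value: Optional[str]) -> Optional[str]:
    """Remove whitespace and re-pad with '=' to a multiple of four characters."""
    if value is None:
        return None
    text = re.sub(r"\s+", "", value).rstrip("=")
    # A length of 1 mod 4 can never be valid base64; this sketch pads it anyway.
    return text + "=" * (-len(text) % 4)
```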

text.decode_url

Decode URL percent-encoded string.

Parameters

Property Type Description
col required
Column
Column containing URL encoded string

text.encode_url

Encode string with URL percent-encoding.

Parameters

Property Type Description
col required
Column
Column containing string
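
For comparison, the same percent-encoding round trip on plain strings with `urllib.parse`. Passing `safe=""` (so that "/" is encoded as well) is a choice made for this sketch; which characters the library leaves unencoded is not stated above.

```python
from urllib.parse import quote, unquote


def encode_url_sketch(value: str) -> str:
    """Percent-encode every reserved character, including '/' (assumption)."""
    return quote(value, safe="")


def decode_url_sketch(value: str) -> str:
    """Turn %XX escapes back into characters."""
    return unquote(value)
```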

text.decode_html_entities

Decode HTML entities to characters.

Parameters

Property Type Description
col required
Column
Column containing HTML entities

text.encode_html_entities

Encode special characters as HTML entities.

Parameters

Property Type Description
col required
Column
Column containing string
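
The entity round trip can be illustrated on plain strings with Python's `html` module. This is only an analogy for what the two functions above do; the exact set of characters the library encodes may differ, and the function names are hypothetical.

```python
import html


def decode_html_entities_sketch(value: str) -> str:
    """Replace named and numeric entities with the characters they stand for."""
    return html.unescape(value)


def encode_html_entities_sketch(value: str) -> str:
    """Escape &, <, > and quotes so the text is safe inside HTML."""
    return html.escape(value)
```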

text.unescape_string

Convert literal escape sequences to actual characters.

Parameters

Property Type Description
col required
Column
Column containing string with escape sequences

text.escape_string

Convert special characters to literal escape sequences.

Parameters

Property Type Description
col required
Column
Column containing string

text.normalize_line_endings

Normalize line endings to LF.

Parameters

Property Type Description
col required
Column
Column containing string

text.to_ascii

Transliterate non-ASCII characters to ASCII equivalents.

Parameters

Property Type Description
col required
Column
Column containing string

text.to_codepoints

Convert string to Unicode codepoints representation.

Parameters

Property Type Description
col required
Column
Column containing string

text.from_codepoints

Convert Unicode codepoints representation to string.

Parameters

Property Type Description
col required
Column
Column containing codepoints

text.reverse_string

Reverse a string.

Parameters

Property Type Description
col required
Column
Column containing string

text.truncate

Truncate string to maximum length.

Parameters

Property Type Description
col required
Column
Column containing string
max_length required
Column
Maximum length
ellipsis required
Column
Whether to add "..." when truncating
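
The reference does not say whether the "..." counts toward max_length. The sketch below assumes it does, so the result never exceeds max_length; treat that, the default value, and the name `truncate_sketch` as assumptions.

```python
from typing import Optional


def truncate_sketch(value: Optional[str], max_length: int, ellipsis: bool = False) -> Optional[str]:
    """Cut value to max_length characters, optionally ending with '...'."""
    if value is None or len(value) <= max_length:
        return value
    if ellipsis and max_length > 3:
        # Assumption: the ellipsis is included in the length budget.
        return value[: max_length - 3] + "..."
    return value[:max_length]
```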

text.pad_left

Pad string on the left to specified width.

Parameters

Property Type Description
col required
Column
Column containing string
width required
Column
Target width
pad_char required
Column
Character to pad with

text.pad_right

Pad string on the right to specified width.

Parameters

Property Type Description
col required
Column
Column containing string
width required
Column
Target width
pad_char required
Column
Character to pad with
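
On plain Python strings, the padding pair above corresponds to `str.rjust` and `str.ljust`. The sketch assumes a single-character pad_char and leaves strings longer than width untouched; whether the library truncates such values instead is not stated.

```python
def pad_left_sketch(value: str, width: int, pad_char: str = " ") -> str:
    """Prepend pad_char until the string is width characters long."""
    return value.rjust(width, pad_char)


def pad_right_sketch(value: str, width: int, pad_char: str = " ") -> str:
    """Append pad_char until the string is width characters long."""
    return value.ljust(width, pad_char)
```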

text.remove_control_characters

Remove control characters (preserving tab, newline, CR).

Parameters

Property Type Description
col required
Column
Column containing string

text.remove_zero_width_characters

Remove zero-width characters.

Parameters

Property Type Description
col required
Column
Column containing string

text.remove_non_printable

Remove non-printable characters (preserving tab, newline, CR).

Parameters

Property Type Description
col required
Column
Column containing string

text.remove_ansi_codes

Remove ANSI escape codes.

Parameters

Property Type Description
col required
Column
Column containing string
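
A common way to strip ANSI codes is a regular expression over CSI sequences (ESC, "[", parameters, a final byte). The sketch below covers only that family, which includes the usual color codes; other escape types such as OSC are out of scope, and the library's pattern may be broader.

```python
import re

# CSI sequence: ESC [ parameter bytes, intermediate bytes, one final byte.
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def remove_ansi_codes_sketch(value: str) -> str:
    """Delete CSI escape sequences such as color and cursor codes."""
    return ANSI_CSI.sub("", value)
```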

text.strip_invisible

Remove all invisible characters (control chars, zero-width, BOM).

Parameters

Property Type Description
col required
Column
Column containing string
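
As an illustration of what "invisible" can cover, the sketch below removes C0 control characters, DEL, a few zero-width codepoints, and the BOM from a plain string. Keeping tab, newline, and CR is an assumption carried over from the other remove_* functions, and the zero-width list is a sample rather than the library's list.

```python
import re

INVISIBLE = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f"   # control chars except \t, \n, \r
    r"\u200b-\u200d\u2060"               # zero-width space/joiners, word joiner
    r"\ufeff]"                           # byte order mark
)


def strip_invisible_sketch(value: str) -> str:
    """Remove invisible characters while keeping tab, newline, and CR."""
    return INVISIBLE.sub("", value)
```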

text.remove_bom

Remove byte order mark (BOM).

Parameters

Property Type Description
col required
Column
Column containing string

text.normalize_unicode

Normalize unicode (replace curly quotes, fancy dashes, special spaces).

Parameters

Property Type Description
col required
Column
Column containing string

text.remove_accents

Remove accents/diacritics from characters.

Parameters

Property Type Description
col required
Column
Column containing string
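
Accent removal is typically done by decomposing characters and dropping the combining marks. A minimal plain-Python sketch of that technique follows; the name is hypothetical, and letters without a decomposition (for example "ø") are left unchanged by this approach.

```python
import unicodedata


def remove_accents_sketch(value: str) -> str:
    """Decompose to NFD, drop combining marks, and recompose."""
    decomposed = unicodedata.normalize("NFD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)
```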

text.normalize_whitespace

Normalize whitespace (trim and collapse multiple spaces).

Parameters

Property Type Description
col required
Column
Column containing string

text.remove_html_tags

Remove HTML tags from string.

Parameters

Property Type Description
col required
Column
Column containing string

text.remove_urls

Remove URLs from string.

Parameters

Property Type Description
col required
Column
Column containing string

text.remove_emojis

Remove emojis from string.

Parameters

Property Type Description
col required
Column
Column containing string

text.remove_punctuation

Remove punctuation from string.

Parameters

Property Type Description
col required
Column
Column containing string

text.remove_digits

Remove digits from string.

Parameters

Property Type Description
col required
Column
Column containing string

text.remove_letters

Remove letters from string.

Parameters

Property Type Description
col required
Column
Column containing string

text.remove_escape_sequences

Remove literal escape sequences from string.

Parameters

Property Type Description
col required
Column
Column containing string

text.strip_to_alphanumeric

Keep only alphanumeric characters.

Parameters

Property Type Description
col required
Column
Column containing string

text.clean_for_comparison

Clean string for comparison (lowercase, trim, normalize whitespace, remove accents).

Parameters

Property Type Description
col required
Column
Column containing string

text.slugify

Convert string to URL-safe slug.

Parameters

Property Type Description
col required
Column
Column containing string
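
Slug rules vary between tools. The sketch below shows one common recipe (ASCII-fold, lowercase, hyphenate runs of other characters) and is not necessarily the recipe the library follows; `slugify_sketch` is a hypothetical name.

```python
import re
import unicodedata


def slugify_sketch(value: str) -> str:
    """Build a lowercase, hyphen-separated, ASCII-only slug."""
    # Fold accented letters to ASCII and drop anything that cannot be folded.
    text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Replace every run of non-alphanumerics with a single hyphen.
    text = re.sub(r"[^a-z0-9]+", "-", text.lower())
    return text.strip("-")
```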

text.collapse_repeats

Collapse runs of a repeated character to at most max_repeat occurrences.

Parameters

Property Type Description
col required
Column
Column containing string
max_repeat required
Column
Maximum allowed consecutive repetitions (1 or 2)
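
The collapsing rule can be expressed with a backreference: match a character followed by at least max_repeat copies of itself, then emit max_repeat copies. This is a plain-Python sketch with a hypothetical name and an assumed default of 2.

```python
import re


def collapse_repeats_sketch(value: str, max_repeat: int = 2) -> str:
    """Shorten any run of one character to at most max_repeat occurrences."""
    if max_repeat < 1:
        raise ValueError("max_repeat must be at least 1")
    # (.)\1{n,} matches a character plus n or more repeats, i.e. a run longer than n.
    pattern = re.compile(r"(.)\1{%d,}" % max_repeat, flags=re.DOTALL)
    return pattern.sub(lambda m: m.group(1) * max_repeat, value)
```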

text.clean_string

Comprehensive string cleaning (remove BOM, zero-width, control chars, normalize unicode).

Parameters

Property Type Description
col required
Column
Column containing string