Email Transformers
Clean, validate, and extract information from email addresses.
Usage
| email | standardized | username | domain | is_valid |
|---|---|---|---|---|
| John.Doe@Gmail.COM | john.doe@gmail.com | john.doe | gmail.com | true |
| JANE.SMITH@OUTLOOK.COM | jane.smith@outlook.com | jane.smith | outlook.com | true |
| info@company-name.org | info@company-name.org | info | company-name.org | true |
| invalid.email@ | null | null | null | false |
| user+tag@domain.co.uk | user+tag@domain.co.uk | user+tag | domain.co.uk | true |
| bad email@test.com | null | null | null | false |
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from transformers.pyspark.emails import emails
# Initialize Spark
spark = SparkSession.builder.appName("EmailCleaning").getOrCreate()
# Create sample data
data = [
    ("John.Doe@Gmail.COM",),
    ("JANE.SMITH@OUTLOOK.COM",),
    ("info@company-name.org",),
    ("invalid.email@",),
    ("user+tag@domain.co.uk",),
    ("bad email@test.com",),
]
df = spark.createDataFrame(data, ["email"])
# Extract and validate email components
result_df = df.select(
    F.col("email"),
    emails.standardize_email(F.col("email")).alias("standardized"),
    emails.extract_username(F.col("email")).alias("username"),
    emails.extract_domain(F.col("email")).alias("domain"),
    emails.is_valid_email(F.col("email")).alias("is_valid"),
)
# Show results
result_df.show(truncate=False)
# Filter to valid emails only (is_valid is already a boolean column)
valid_emails = result_df.filter(F.col("is_valid"))
Installation
datacompose add emails
API Reference
Extract Functions
emails.extract_email
Extract first valid email address from text.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing text with potential email addresses |
emails.extract_all_emails
Extract all email addresses from text as an array.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing text with potential email addresses |
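To make the difference between "first match" and "all matches" concrete, here is a minimal plain-Python sketch. It is not the library's implementation: the pattern and the helper names `find_first_email` and `find_all_emails` are assumptions made for illustration, and the pattern is deliberately simpler than the full email grammar.

```python
import re

# Simplified, illustrative pattern; the library's own pattern may differ.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_all_emails(text):
    """Return every email-like substring of text, in order of appearance."""
    return EMAIL_PATTERN.findall(text)


def find_first_email(text):
    """Return the first email-like substring, or None when there is none."""
    match = EMAIL_PATTERN.search(text)
    return match.group(0) if match else None


sample = "Contact info@company-name.org or user+tag@domain.co.uk for help."
print(find_first_email(sample))
print(find_all_emails(sample))
```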
emails.extract_username
Extract username (local part) from email address.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
emails.extract_domain
Extract domain from email address.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
emails.extract_domain_name
Extract domain name without TLD from email address.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
emails.extract_tld
Extract top-level domain from email address.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
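The four component extractors above (username, domain, domain name, TLD) can be pictured as successive splits of the address. The sketch below is plain Python with a hypothetical helper `split_email`. It assumes that everything after the first dot of the domain counts as the TLD, which may not match how the library treats multi-part TLDs or subdomains.

```python
def split_email(email):
    """Split an address into username, domain, domain name, and TLD.

    Assumption for illustration: the TLD is everything after the first dot
    of the domain, so "domain.co.uk" yields the TLD "co.uk".
    """
    username, _, domain = email.rpartition("@")
    domain_name, _, tld = domain.partition(".")
    return {
        "username": username,
        "domain": domain,
        "domain_name": domain_name,
        "tld": tld,
    }


parts = split_email("user+tag@domain.co.uk")
print(parts)
```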
emails.extract_name_from_email
Attempt to extract person's name from email username. E.g., john.smith@example.com -> "John Smith"
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
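A rough idea of how a name can be guessed from a username is sketched below in plain Python. The helper `name_from_email` and its rules (split on dots, underscores, and hyphens; keep only alphabetic pieces; capitalize each piece) are assumptions for illustration, not the library's actual heuristics.

```python
import re


def name_from_email(email):
    """Guess a display name from the username, or return None if nothing usable."""
    username = email.split("@", 1)[0]
    username = username.split("+", 1)[0]  # ignore any plus tag
    pieces = [p for p in re.split(r"[._-]+", username) if p.isalpha()]
    return " ".join(p.capitalize() for p in pieces) or None


print(name_from_email("john.smith@example.com"))  # John Smith
```

Usernames such as `info` or `1234` show why this is only an attempt: the first yields "Info", which is not a person, and the second yields nothing at all.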
Transform Functions
emails.standardize_email
Apply standard email cleaning and normalization.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
lowercase required | Column | Convert to lowercase |
remove_dots_gmail required | Column | Remove dots from Gmail addresses |
remove_plus required | Column | Remove plus addressing |
fix_typos required | Column | Fix common domain typos |
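The following plain-Python sketch shows how the four options could combine. It is not the library's code: the function `standardize`, the tiny `TYPO_FIXES` table, and the default values are assumptions. The defaults were chosen so that the output agrees with the Usage table (lowercased, dots and plus tags kept, incomplete addresses mapped to null).

```python
# Hypothetical typo table for illustration; the library's mappings are not shown here.
TYPO_FIXES = {"gmial.com": "gmail.com"}


def standardize(email, lowercase=True, remove_dots_gmail=False,
                remove_plus=False, fix_typos=False):
    """Return a cleaned address, or None when a username or domain is missing."""
    username, sep, domain = email.strip().rpartition("@")
    if not (username and sep and domain):
        return None
    if lowercase:
        username, domain = username.lower(), domain.lower()
    if fix_typos:
        domain = TYPO_FIXES.get(domain.lower(), domain)
    if remove_plus:
        username = username.split("+", 1)[0]
    if remove_dots_gmail and domain.lower() == "gmail.com":
        username = username.replace(".", "")
    return f"{username}@{domain}"


print(standardize("John.Doe@Gmail.COM"))
print(standardize("John.Doe+news@Gmail.COM", remove_plus=True, remove_dots_gmail=True))
```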
Validation Functions
emails.is_valid_email
Check if email address has valid format.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
min_length required | Column | Minimum length for valid email |
max_length required | Column | Maximum length for valid email |
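A format check of this kind usually combines a length window with a pattern match. The sketch below is plain Python under stated assumptions: the helper `looks_valid`, the pattern, and the default bounds of 6 and 254 characters are illustrative choices, not values taken from the library.

```python
import re

# Simplified, illustrative pattern: no spaces, one "@", at least one dot in the domain.
VALID_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def looks_valid(email, min_length=6, max_length=254):
    """Return True when email fits the length window and matches the pattern."""
    if email is None or not (min_length <= len(email) <= max_length):
        return False
    return VALID_PATTERN.fullmatch(email) is not None


for candidate in ["John.Doe@Gmail.COM", "invalid.email@", "bad email@test.com"]:
    print(candidate, looks_valid(candidate))
```

The three candidates reproduce the `is_valid` column of the Usage table: the first passes, the second has no domain, and the third contains a space.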
emails.is_valid_username
Check if email username part is valid.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
min_length required | Column | Minimum length for valid username |
max_length required | Column | Maximum length for valid username |
emails.is_valid_domain
Check if email domain part is valid.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
emails.has_plus_addressing
Check if email uses plus addressing (e.g., user+tag@gmail.com).
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
emails.is_disposable_email
Check if email is from a disposable email service.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
disposable_domains required | Column | List of disposable domains to check against |
emails.is_corporate_email
Check if email appears to be from a corporate domain (not free email provider).
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
free_providers required | Column | List of free email provider domains to check against |
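Both the disposable check and the corporate check reduce to looking up the domain in a list. The plain-Python sketch below illustrates that idea. The helper names and the two small domain sets are assumptions, and the library's default lists are not reproduced here.

```python
# Hypothetical lists for illustration only.
DISPOSABLE_DOMAINS = {"mailinator.com"}
FREE_PROVIDERS = {"gmail.com", "outlook.com"}


def domain_of(email):
    """Return the lowercased part after the last "@"."""
    return email.rsplit("@", 1)[-1].lower()


def is_disposable(email, disposable_domains=DISPOSABLE_DOMAINS):
    """True when the domain appears in the disposable list."""
    return domain_of(email) in disposable_domains


def is_corporate(email, free_providers=FREE_PROVIDERS):
    """True when the domain is not a known free provider."""
    return domain_of(email) not in free_providers


print(is_disposable("temp@mailinator.com"), is_corporate("info@company-name.org"))
```

Note that "corporate" here only means "not on the free-provider list", which is why the description says the address merely appears to be corporate.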
Utility Functions
emails.remove_whitespace
Remove all whitespace from email address.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
emails.lowercase_email
Convert entire email address to lowercase.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
emails.lowercase_domain
Convert only domain part to lowercase, preserve username case.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
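The contrast with lowercasing the whole address is easiest to see on a mixed-case input. This is a plain-Python sketch with a hypothetical helper `lowercase_domain_only`, not the library's implementation.

```python
def lowercase_domain_only(email):
    """Lowercase the part after the last "@" and leave the username untouched."""
    username, sep, domain = email.rpartition("@")
    if not sep:
        return email  # no "@" present, nothing to split
    return f"{username}@{domain.lower()}"


print(lowercase_domain_only("John.Doe@Gmail.COM"))  # John.Doe@gmail.com
print("John.Doe@Gmail.COM".lower())                 # john.doe@gmail.com
```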
emails.remove_plus_addressing
Remove plus addressing from email (e.g., user+tag@gmail.com -> user@gmail.com).
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
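Plus addressing lives entirely in the username, so detecting and removing it only needs the part before the "@". The helpers `has_plus` and `strip_plus` below are a hypothetical plain-Python sketch of that behavior.

```python
def has_plus(email):
    """True when the username (the part before the last "@") contains a "+"."""
    return "+" in email.rpartition("@")[0]


def strip_plus(email):
    """Drop the "+tag" suffix from the username and keep the domain as it is."""
    username, sep, domain = email.rpartition("@")
    return f"{username.split('+', 1)[0]}{sep}{domain}"


print(has_plus("user+tag@gmail.com"))    # True
print(strip_plus("user+tag@gmail.com"))  # user@gmail.com
```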
emails.remove_dots_from_gmail
Remove dots from Gmail addresses (Gmail ignores dots in usernames).
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
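A minimal plain-Python sketch of this rule follows. The helper `strip_gmail_dots` is hypothetical and assumes that only the exact domain `gmail.com` is affected.

```python
def strip_gmail_dots(email):
    """Remove dots from the username, but only for gmail.com addresses."""
    username, sep, domain = email.rpartition("@")
    if domain.lower() != "gmail.com":
        return email  # other providers may treat dots as significant
    return f"{username.replace('.', '')}{sep}{domain}"


print(strip_gmail_dots("john.doe@gmail.com"))    # johndoe@gmail.com
print(strip_gmail_dots("john.doe@outlook.com"))  # john.doe@outlook.com
```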
emails.fix_common_typos
Fix common domain typos in email addresses.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
custom_mappings required | Column | Additional domain mappings to apply (extends DOMAIN_TYPO_MAPPINGS) |
custom_tld_mappings required | Column | Additional TLD mappings to apply (extends TLD_TYPO_MAPPINGS) |
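The way custom mappings extend the built-in ones can be sketched with two dictionaries that are merged before lookup. Everything in this plain-Python example is an assumption for illustration: the function `fix_typos`, the base tables `BASE_DOMAIN_FIXES` and `BASE_TLD_FIXES`, and their entries stand in for the library's `DOMAIN_TYPO_MAPPINGS` and `TLD_TYPO_MAPPINGS`, whose contents are not shown here.

```python
# Hypothetical stand-ins for the library's built-in mappings.
BASE_DOMAIN_FIXES = {"gmial.com": "gmail.com", "hotmial.com": "hotmail.com"}
BASE_TLD_FIXES = {".con": ".com", ".cmo": ".com"}


def fix_typos(email, custom_mappings=None, custom_tld_mappings=None):
    """Correct known domain and TLD typos; custom entries override base ones."""
    domain_fixes = {**BASE_DOMAIN_FIXES, **(custom_mappings or {})}
    tld_fixes = {**BASE_TLD_FIXES, **(custom_tld_mappings or {})}
    username, sep, domain = email.rpartition("@")
    if not sep:
        return email
    # Whole-domain fixes first, then a single TLD suffix fix.
    domain = domain_fixes.get(domain.lower(), domain)
    for bad, good in tld_fixes.items():
        if domain.lower().endswith(bad):
            domain = domain[: -len(bad)] + good
            break
    return f"{username}@{domain}"


print(fix_typos("user@gmial.com"))
print(fix_typos("user@acme.co", custom_mappings={"acme.co": "acme.com"}))
```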
emails.normalize_gmail
Normalize Gmail addresses (remove dots, plus addressing, lowercase).
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
emails.get_canonical_email
Get canonical form of email address for deduplication. Applies maximum normalization.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
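To show why a canonical form helps with deduplication, the plain-Python sketch below collapses address variants onto one key. The helper `canonical` and its exact steps (trim, lowercase, drop plus tag, drop Gmail dots) are assumptions about what "maximum normalization" could include, not a statement of what the library does.

```python
def canonical(email):
    """Build a comparison key: trimmed, lowercased, no plus tag, no Gmail dots."""
    username, _, domain = email.strip().lower().rpartition("@")
    username = username.split("+", 1)[0]
    if domain == "gmail.com":
        username = username.replace(".", "")
    return f"{username}@{domain}"


addresses = ["John.Doe@gmail.com", "johndoe+news@GMAIL.com", "jane@outlook.com"]

# Keep the first address seen for each canonical key.
unique = {}
for address in addresses:
    unique.setdefault(canonical(address), address)

print(unique)
```

The first two addresses share the key `johndoe@gmail.com`, so only two entries remain.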
emails.get_email_provider
Get email provider name from domain.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
emails.mask_email
Mask email address for privacy (e.g., joh***@gm***.com).
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
mask_char required | Column | Character to use for masking |
keep_chars required | Column | Number of characters to keep at start |
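How the two options interact can be sketched in plain Python. The helper `mask` below is hypothetical, and its rule is an assumption: keep the first `keep_chars` characters of both the username and the domain name, then append three mask characters. Its output therefore may differ in detail from the library's (compare the example in the description above).

```python
def mask(email, mask_char="*", keep_chars=3):
    """Hide most of the username and domain name, keeping the TLD readable."""
    username, sep, domain = email.rpartition("@")
    if not sep:
        return email
    domain_name, dot, tld = domain.partition(".")
    masked_user = username[:keep_chars] + mask_char * 3
    masked_domain = domain_name[:keep_chars] + mask_char * 3
    return f"{masked_user}@{masked_domain}{dot}{tld}"


print(mask("john.doe@gmail.com"))
print(mask("john.doe@gmail.com", mask_char="#", keep_chars=1))
```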
emails.filter_valid_emails
Return email only if valid, otherwise return null.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
emails.filter_corporate_emails
Return email only if corporate, otherwise return null.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
emails.filter_non_disposable_emails
Return email only if not disposable, otherwise return null.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing email address |
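The three filter functions share one pattern: keep the value when a check passes, otherwise return null. The plain-Python sketch below illustrates that pattern with `None` standing in for null. The helpers `keep_if` and `has_basic_shape` are hypothetical, and the check is intentionally crude.

```python
def keep_if(email, predicate):
    """Return email when predicate(email) is true, otherwise None."""
    return email if predicate(email) else None


def has_basic_shape(email):
    """Crude check: non-empty username, an "@", a dot in the domain, no spaces."""
    username, sep, domain = email.rpartition("@")
    return bool(username) and bool(sep) and "." in domain and " " not in email


candidates = ["info@company-name.org", "invalid.email@", "bad email@test.com"]
filtered = [keep_if(c, has_basic_shape) for c in candidates]
kept = [c for c in filtered if c is not None]
print(filtered)
print(kept)
```

Returning null instead of dropping the row keeps the column aligned with the rest of the record, and the nulls can be filtered out later if needed.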