Address Transformers
Extract, validate, and standardize address components from unstructured text.
Usage
| address | street_number | street_name | city | state | zip |
|---|---|---|---|---|---|
| 123 Main St, New York, NY 10001 | 123 | Main | New York | NY | 10001 |
| 456 oak ave apt 5b, los angeles, ca 90001 | 456 | Oak | Los Angeles | CA | 90001 |
| 789 ELM STREET CHICAGO IL 60601 | 789 | Elm | Chicago | IL | 60601 |
| 321 pine rd. suite 100,, boston massachusetts | 321 | Pine | Boston | MA | null |
| PO Box 789, Atlanta, GA 30301 | null | null | Atlanta | GA | 30301 |
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from transformers.pyspark.addresses import addresses
# Initialize Spark
spark = SparkSession.builder.appName("DataCleaning").getOrCreate()
# Create sample data
data = [
    ("123 Main St, New York, NY 10001",),
    ("456 Oak Ave Apt 5B, Los Angeles, CA 90001",),
    ("789 Elm Street, Chicago, IL 60601",),
    ("321 Pine Road Suite 100, Boston, MA 02101",),
]
df = spark.createDataFrame(data, ["address"])
# Extract and standardize address components
result_df = df.select(
    F.col("address"),
    addresses.extract_street_number(F.col("address")).alias("street_number"),
    addresses.extract_street_name(F.col("address")).alias("street_name"),
    addresses.extract_city(F.col("address")).alias("city"),
    addresses.extract_state(F.col("address")).alias("state"),
    addresses.extract_zip_code(F.col("address")).alias("zip"),
)
# Show results
result_df.show(truncate=False)
# Filter to valid addresses
valid_addresses = result_df.filter(addresses.validate_zip_code(F.col("zip")))
Installation
datacompose add addresses
API Reference
Extract Functions
addresses.extract_street_number
Extract street/house number from address. Extracts the numeric portion at the beginning of an address. Handles various formats: 123, 123A, 123-125, etc.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
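To make the accepted formats concrete, here is a minimal plain-Python sketch of the rule described above. It is not the library's implementation: the function name `street_number` and the regular expression are assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: leading digits, optionally followed by a hyphenated
# range (123-125) and/or a single trailing letter (123A).
STREET_NUMBER_RE = re.compile(r"^\s*(\d+(?:-\d+)?[A-Za-z]?)\b")


def street_number(address):
    """Return the leading street number, or None when the address has none."""
    if not address:
        return None
    match = STREET_NUMBER_RE.match(address)
    return match.group(1) if match else None


print(street_number("123 Main St"))      # simple number
print(street_number("123A Main St"))     # number with letter suffix
print(street_number("123-125 Main St"))  # hyphenated range
print(street_number("PO Box 789"))       # no leading number
```

Anchoring the pattern at the start of the string is what keeps a PO Box number or a ZIP code from being mistaken for a street number.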
addresses.extract_street_prefix
Extract directional prefix from street address. Extracts directional prefixes like N, S, E, W, NE, NW, SE, SW.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.extract_street_name
Extract street name from address. Extracts the main street name, excluding number, prefix, and suffix.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.extract_street_suffix
Extract street type/suffix from address. Extracts street type like Street, Avenue, Road, Boulevard, etc.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.extract_full_street
Extract complete street address (number + prefix + name + suffix). Extracts everything before apartment/suite and city/state/zip.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.extract_apartment_number
Extract apartment/unit number from address. Extracts apartment, suite, unit, or room numbers including: Apt 5B, Suite 200, Unit 12, #4A, Rm 101, etc.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
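The listed formats can be approximated with a single pattern. The sketch below is plain Python for illustration only; the keyword list, the regular expression, and the name `apartment_number` are assumptions, not the library's code.

```python
import re

# Hypothetical pattern: a unit keyword (optionally followed by "." or "#"),
# or a bare "#", followed by an alphanumeric identifier.
UNIT_RE = re.compile(
    r"(?:\b(?:apt|apartment|suite|ste|unit|rm|room)\b\.?\s*#?\s*|#\s*)"
    r"([A-Za-z0-9-]+)",
    re.IGNORECASE,
)


def apartment_number(address):
    """Return the unit identifier (e.g. "5B"), or None if none is present."""
    if not address:
        return None
    match = UNIT_RE.search(address)
    return match.group(1) if match else None


for sample in (
    "456 Oak Ave Apt 5B, Los Angeles, CA 90001",
    "1 Market St, Suite 200",
    "12 Elm St #4A",
    "9 College Rd Rm 101",
):
    print(sample, "->", apartment_number(sample))
```

A keyword-based pattern like this one can misfire on street names that contain a unit word, so treat it as a starting point rather than a complete parser.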
addresses.extract_floor
Extract floor number from address. Extracts floor information like: 5th Floor, Floor 2, Fl 3, Level 4, etc.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
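Floor information appears in two shapes: the number before the keyword ("5th Floor") or after it ("Floor 2", "Fl 3", "Level 4"). The following plain-Python sketch handles both; the pattern and the name `floor_number` are illustrative assumptions rather than the library's implementation.

```python
import re

# Hypothetical pattern with two alternatives:
#   1. ordinal before keyword: "5th Floor"
#   2. keyword before number:  "Floor 2", "Fl 3", "Level 4"
FLOOR_RE = re.compile(
    r"\b(\d+)(?:st|nd|rd|th)?\s+(?:floor|fl)\b"
    r"|\b(?:floor|fl|level)\.?\s*(\d+)\b",
    re.IGNORECASE,
)


def floor_number(address):
    """Return the floor number as a string, or None when none is found."""
    if not address:
        return None
    match = FLOOR_RE.search(address)
    if not match:
        return None
    # Exactly one of the two capture groups is populated.
    return match.group(1) or match.group(2)


for sample in ("100 Main St, 5th Floor", "200 Park Ave, Floor 2", "10 Main St Fl 3"):
    print(sample, "->", floor_number(sample))
```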
addresses.extract_building
Extract building name or identifier from address. Extracts building information like: Building A, Tower 2, Complex B, Block C, etc.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.extract_unit_type
Extract the type of unit (Apt, Suite, Unit, etc.) from address.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.extract_secondary_address
Extract complete secondary address information (unit type + number). Combines unit type and number into standard format: "Apt 5B", "Ste 200", "Unit 12", etc.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.extract_zip_code
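Extract US ZIP code (5-digit or ZIP+4 format) from text. Returns empty string for null/invalid inputs.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |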
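As an illustration of the two accepted formats, the plain-Python sketch below finds a 5-digit or ZIP+4 token and falls back to an empty string. The name `find_zip`, the pattern, and the choice to prefer the last match are assumptions, not documented library behavior.

```python
import re

# Hypothetical pattern: five digits, optionally followed by "-" and four digits.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def find_zip(text):
    """Return the last ZIP-like token in text, or "" when none is found."""
    if not text:
        return ""
    matches = ZIP_RE.findall(text)
    return matches[-1] if matches else ""


print(find_zip("123 Main St, New York, NY 10001"))
print(find_zip("PO Box 789, Atlanta, GA 30301-1234"))
print(find_zip("12345 Main St, Austin, TX 78701"))
```

Taking the last match is a design choice for this sketch: ZIP codes normally close a US address, so a 5-digit house number earlier in the string is ignored.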
addresses.extract_city
Extract city name from US address text. Extracts city by finding text before state abbreviation or ZIP code. Handles various formats including comma-separated and multi-word cities.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
custom_cities optional | list | List of custom city names to recognize (case-insensitive) |
addresses.extract_state
Extract and standardize state to 2-letter abbreviation. Handles both full state names and abbreviations, case-insensitive. Returns standardized 2-letter uppercase abbreviation.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text with state information |
custom_states optional | dict | Dict mapping full state names to abbreviations |
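The sketch below shows, in plain Python, one way a lookup with a `custom_states` override could work. It is an assumption-laden illustration: the four-entry `STATE_NAMES` table, the matching order, and the name `find_state` are hypothetical, and the real function operates on Spark columns.

```python
import re

# Deliberately tiny lookup for illustration; a real table covers every state.
STATE_NAMES = {
    "new york": "NY",
    "california": "CA",
    "illinois": "IL",
    "massachusetts": "MA",
}


def find_state(address, custom_states=None):
    """Return a 2-letter uppercase state code, or None if nothing matches."""
    if not address:
        return None
    names = dict(STATE_NAMES)
    # Custom entries extend (or override) the built-in names.
    for name, abbr in (custom_states or {}).items():
        names[name.lower()] = abbr.upper()
    abbreviations = set(names.values())
    # Prefer a 2-letter abbreviation, scanning from the end of the address.
    for token in reversed(re.findall(r"\b[A-Za-z]{2}\b", address)):
        if token.upper() in abbreviations:
            return token.upper()
    # Otherwise look for a full state name, longest first.
    lowered = address.lower()
    for name in sorted(names, key=len, reverse=True):
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            return names[name]
    return None


print(find_state("456 oak ave apt 5b, los angeles, ca 90001"))
print(find_state("321 pine rd. suite 100,, boston massachusetts"))
print(find_state("1 Calle Sol, San Juan, Puerto Rico", {"Puerto Rico": "PR"}))
```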
addresses.extract_country
Extract country from address. Extracts country names from addresses, handling common variations and abbreviations. Returns standardized country name.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text with potential country |
addresses.extract_po_box
Extract PO Box number from address. Extracts PO Box, P.O. Box, POB, Post Office Box numbers. Handles various formats including with/without periods and spaces.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
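The variants named above (PO Box, P.O. Box, POB, Post Office Box) can be covered by one pattern that makes periods and spaces optional. This is a plain-Python sketch under that assumption; `po_box_number` and the regular expression are illustrative, not taken from the library.

```python
import re

# Hypothetical pattern: "P", "O", "B" with optional periods/spaces and an
# optional "ox", or the spelled-out "Post Office Box", then the box number.
PO_BOX_RE = re.compile(
    r"\b(?:p\.?\s*o\.?\s*b(?:ox)?\.?|post\s+office\s+box)\s*#?\s*(\d+)",
    re.IGNORECASE,
)


def po_box_number(address):
    """Return the PO Box number as a string, or None if there is no PO Box."""
    if not address:
        return None
    match = PO_BOX_RE.search(address)
    return match.group(1) if match else None


for sample in ("PO Box 789, Atlanta, GA 30301", "P.O. Box 42", "POB 1001", "Post Office Box 55"):
    print(sample, "->", po_box_number(sample))
```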
addresses.extract_private_mailbox
Extract private mailbox (PMB) number from address. Extracts PMB or Private Mail Box numbers, commonly used with commercial mail receiving agencies (like UPS Store).
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
Transform Functions
addresses.standardize_street_prefix
Standardize street directional prefixes to abbreviated form. Converts all variations to standard USPS abbreviations: North/N/N. → N, South/S/S. → S, etc.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing street prefix |
custom_mappings optional | dict | Dict of custom prefix mappings (case-insensitive) |
addresses.standardize_street_suffix
Standardize street type/suffix to USPS abbreviated form. Converts all variations to standard USPS abbreviations per the config: Street/St/St. → St, Avenue/Ave/Av → Ave, Boulevard → Blvd, etc.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing street suffix |
custom_mappings optional | dict | Dict of custom suffix mappings (case-insensitive) |
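Suffix standardization is essentially a case-insensitive dictionary lookup with a pass-through for unknown values. The plain-Python sketch below illustrates that idea; the `SUFFIXES` subset, the pass-through behavior, and the name `standardize_suffix` are assumptions rather than the library's configuration.

```python
# Illustrative subset of suffix variants; keys are lowercase without periods.
SUFFIXES = {
    "street": "St", "st": "St",
    "avenue": "Ave", "ave": "Ave", "av": "Ave",
    "boulevard": "Blvd", "blvd": "Blvd",
    "road": "Rd", "rd": "Rd",
}


def standardize_suffix(suffix, custom_mappings=None):
    """Map a suffix variant to its abbreviation; unknown values pass through."""
    if not suffix:
        return suffix
    key = suffix.strip().rstrip(".").lower()
    mappings = dict(SUFFIXES)
    # Custom mappings are matched case-insensitively and take precedence.
    mappings.update({k.lower(): v for k, v in (custom_mappings or {}).items()})
    return mappings.get(key, suffix.strip())


print(standardize_suffix("Street"))
print(standardize_suffix("St."))
print(standardize_suffix("AVENUE"))
print(standardize_suffix("Trail", {"Trail": "Trl"}))
```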
addresses.standardize_unit_type
Standardize unit type to common abbreviations. Converts all variations to standard abbreviations: Apartment/Apt. → Apt, Suite → Ste, Room → Rm, etc.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing unit type |
custom_mappings optional | dict | Dict of custom unit type mappings |
addresses.standardize_zip_code
Standardize ZIP code format.
- Removes extra spaces
- Ensures proper dash placement for ZIP+4
- Returns empty string for invalid formats
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing ZIP codes to standardize |
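The three behaviors of ZIP standardization (space removal, dash placement, empty string for invalid input) can be sketched in plain Python as follows. The name `standardize_zip` and the exact normalization steps are assumptions for illustration, and this version is deliberately permissive about where spaces and dashes appear in the input.

```python
import re


def standardize_zip(zip_code):
    """Return "12345" or "12345-6789"; return "" for anything else."""
    if not zip_code:
        return ""
    # Drop all whitespace and dashes, then re-insert the dash if needed.
    compact = re.sub(r"[\s-]", "", zip_code)
    match = re.fullmatch(r"([0-9]{5})([0-9]{4})?", compact)
    if not match:
        return ""
    base, extension = match.groups()
    return f"{base}-{extension}" if extension else base


print(repr(standardize_zip(" 12345 ")))
print(repr(standardize_zip("12345 - 6789")))
print(repr(standardize_zip("123456789")))
print(repr(standardize_zip("1234")))
```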
addresses.standardize_city
Standardize city name formatting.
- Trims whitespace
- Normalizes internal spacing
- Applies title case (with special handling for common patterns)
- Optionally applies custom city name mappings
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing city names to standardize |
custom_mappings optional | dict | Dict for city name corrections/standardization |
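A rough plain-Python sketch of city standardization follows. It covers trimming, spacing, basic title case, and custom mappings, but it omits the special-case handling mentioned above (names such as "O'Fallon" or "Winston-Salem" would need extra rules). The function name and the case-insensitive mapping lookup are assumptions.

```python
import re


def standardize_city(city, custom_mappings=None):
    """Trim, collapse spaces, title-case, then apply optional corrections."""
    if not city:
        return city
    # Trim the ends and collapse any run of whitespace to a single space.
    cleaned = re.sub(r"\s+", " ", city.strip())
    # Basic title case: capitalize the first letter of each space-separated word.
    titled = " ".join(word.capitalize() for word in cleaned.split(" "))
    # Custom mappings are compared case-insensitively against the result.
    for variant, replacement in (custom_mappings or {}).items():
        if titled.lower() == variant.lower():
            return replacement
    return titled


print(standardize_city("  los   angeles "))
print(standardize_city("NEW YORK"))
print(standardize_city("nyc", {"NYC": "New York"}))
```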
addresses.standardize_state
Convert state to standard 2-letter format. Converts full names to abbreviations and ensures uppercase.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing state names or abbreviations |
addresses.standardize_country
Standardize country name to consistent format. Converts various country representations to standard names.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing country name or abbreviation |
custom_mappings optional | dict | Dict of custom country mappings |
addresses.standardize_po_box
Standardize PO Box format to consistent representation. Converts various PO Box formats to standard "PO Box XXXX" format.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing PO Box text |
Validation Functions
addresses.has_apartment
Check if address contains apartment/unit information.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.validate_zip_code
Validate if a ZIP code is in correct US format. Validates:
- 5-digit format (e.g., "12345")
- ZIP+4 format (e.g., "12345-6789")
- Not all zeros (except "00000" which is technically valid)
- Within valid range (00001-99999 for base ZIP)
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing ZIP codes to validate |
addresses.is_valid_zip_code
Alias for validate_zip_code for consistency.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing ZIP codes to validate |
addresses.validate_city
Validate if a city name appears valid. Validates:
- Not empty/null
- Within reasonable length bounds
- Contains valid characters (letters, spaces, hyphens, apostrophes, periods)
- Optionally: matches a list of known cities
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing city names to validate |
known_cities optional | list | List of valid city names to check against |
min_length required | Column | Minimum valid city name length |
max_length required | Column | Maximum valid city name length |
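The validation rules for city names can be sketched in plain Python as shown below. The parameter table gives no defaults for `min_length` and `max_length`, so the values 2 and 50 here are placeholders, and the function name, character pattern, and case-insensitive `known_cities` comparison are likewise assumptions.

```python
import re

# Letters, spaces, hyphens, apostrophes, and periods; must start with a letter.
CITY_CHARS_RE = re.compile(r"[A-Za-z][A-Za-z .'-]*")


def is_valid_city(city, known_cities=None, min_length=2, max_length=50):
    """Return True when the name passes every check described above."""
    if city is None:
        return False
    name = city.strip()
    # Length bounds (placeholder defaults, not the library's).
    if not (min_length <= len(name) <= max_length):
        return False
    # Character whitelist.
    if not CITY_CHARS_RE.fullmatch(name):
        return False
    # Optional membership check against a caller-supplied list.
    if known_cities is not None:
        return name.lower() in {known.lower() for known in known_cities}
    return True


print(is_valid_city("St. Louis"))
print(is_valid_city("City123"))
print(is_valid_city("boston", known_cities=["Boston", "Chicago"]))
```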
addresses.validate_state
Validate if state code is a valid US state abbreviation. Checks against list of valid US state abbreviations including territories.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing state codes to validate |
addresses.has_country
Check if address contains country information.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.has_po_box
Check if address contains PO Box.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.is_po_box_only
Check if address is ONLY a PO Box (no street address).
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
Utility Functions
addresses.remove_secondary_address
Remove apartment/suite/unit information from address. Removes secondary address components to get clean street address.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.get_zip_code_type
Determine the type of ZIP code.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing ZIP codes |
addresses.split_zip_code
Split ZIP+4 code into base and extension components.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing ZIP codes |
addresses.get_state_name
Convert state abbreviation to full name.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing 2-letter state abbreviations |
addresses.remove_country
Remove country from address. Removes country information from the end of addresses.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |
addresses.remove_po_box
Remove PO Box from address. Removes PO Box information while preserving other address components.
Parameters
Property | Type | Description |
---|---|---|
col required | Column | Column containing address text |