The most important part of any Data Product is Data Quality.
Whether you’re a Business Analyst using reporting tools like PowerBI or Superset, a data scientist building RAG systems for LLMs, or a researcher working with gene sequencing — clean data is the foundation of everything. Yet, achieving data quality remains one of the most challenging aspects of data engineering.
The Phone Number Problem
Let’s look at a common scenario: cleaning phone numbers using regex in PySpark.
# This is you at 3 AM trying to clean phone numbers
from pyspark.sql import functions as F

df = df.withColumn(
    "phone_clean",
    F.when(F.col("phone").rlike(r"^\d{10}$"), F.col("phone"))
     .when(F.col("phone").rlike(r"^\d{3}-\d{3}-\d{4}$"),
           F.regexp_replace(F.col("phone"), "-", ""))
     .when(F.col("phone").rlike(r"^\(\d{3}\) \d{3}-\d{4}$"),
           F.regexp_replace(F.col("phone"), r"[()\s-]", ""))
     # ... 47 more edge cases you haven't discovered yet
)
But wait, there are more problems:
- Extracting phone numbers from free-form text
- International formats and country codes
- Extensions like “x1234” or “ext. 5678”
- Phone numbers embedded in sentences
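Extensions alone mean yet another pass over the data. Here is a rough sketch of what that extra handling looks like (the patterns and column names are illustrative, not exhaustive):

# Hypothetical follow-up pass: peel off extensions like "x1234" or "ext. 5678"
df = df.withColumn(
    "phone_extension",
    F.regexp_extract(F.col("phone"), r"(?:x|ext\.?)\s*(\d{1,6})\s*$", 1)
).withColumn(
    "phone_base",
    F.regexp_replace(F.col("phone"), r"(?:x|ext\.?)\s*\d{1,6}\s*$", "")
)

And that still ignores international prefixes and numbers buried mid-sentence, which is exactly why the regex pile keeps growing.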
The Current Solutions Fall Short
Option 1: External Libraries
Packages like Dataprep.ai or PyJanitor seem promising, but:
- They only work with Pandas (not PySpark)
- Built-in assumptions you can’t change without forking
- One-size-fits-all approach doesn’t fit your data
Option 2: Regex Patterns
- Hard to maintain and difficult to read
- Brittle and prone to edge cases
- Each new format requires updating complex patterns
Option 3: LLMs for Data Cleaning
- Compliance nightmare with PII data
- Expensive at scale
- Non-deterministic results
The Root Problem
Bad data is fundamentally a people problem. It’s nearly impossible to abstract away human inconsistency into an external package. People aren’t predictable, and their mistakes don’t follow neat patterns.
Our Data Quality Hypothesis
I believe data errors follow a distribution something like this:
Distribution of errors in human-entered data:

█████████████ 60% - Perfect data (no cleaning needed)
████████ 30% - Common errors (typos, formatting)
██ 8% - Edge cases (weird but handleable)
▌ 2% - Chaos (someone typed their life story in the phone field)

DataCompose: Clean the 38% that matters.
Let the juniors clean the last 2% (it builds character).
The Uncomfortable Truth About AI and Data Quality
Everyone’s racing to implement RAG, fine-tune models, and build AI agents. But here’s what they don’t put in the keynotes: Your RAG system is only as good as your data quality.
You can have GPT-5, Claude, or any frontier model, but if your customer database has three different formats for phone numbers, your AI is going to hallucinate customer service disasters.
The Real Cause of AI Failures
Most “AI failures” are actually data quality failures.
That customer complaint about your AI-powered system giving wrong information? It's probably because of mismatches like these (illustrated in the sketch after this list):
- Your address data has “St.” in one table and “Street” in another
- Phone numbers are stored in three different formats
- Names are sometimes “LASTNAME, FIRSTNAME” and sometimes “FirstName LastName”
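To make that concrete, here is a toy example (made-up tables and values, assuming an active SparkSession named spark): the same customer stored two ways, which a naive join will never match:

# Hypothetical example: one customer, two representations
crm = spark.createDataFrame(
    [("Smith, John", "123 Main St.", "(555) 123-4567")],
    ["name", "address", "phone"],
)
billing = spark.createDataFrame(
    [("John Smith", "123 Main Street", "5551234567")],
    ["name", "address", "phone"],
)

# Zero matches, so any retrieval layer built on top sees two different people
crm.join(billing, on=["name", "address", "phone"]).count()  # returns 0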
DataCompose isn’t trying to be AI. We’re trying to make your AI actually work by ensuring it has clean data to work with.
And here’s the kicker: your 38% of problematic data is not the same as everyone else’s. Your business has its own patterns, its own rules, and its own weird edge cases.
DataCompose Principle #1: Own Your Business Logic
Data transformations and data cleaning are business logic. And business logic belongs in your code.
Learn more about DataCompose Concepts
This is the fundamental tension: these transformations are hard to maintain on your own, yet too inflexible to take on as an external dependency. How do we square that circle?
We took inspiration from the React/Svelte fullstack world and adopted the shadcn "copy to own" pattern, bringing it to PySpark. Instead of importing an external library that you can't modify, you get battle-tested transformations that live in your code.
We call our building blocks “primitives” — small, modular functions with clearly defined inputs and outputs that compose into pipelines. When we have a module of primitives that you can compose together, we call it a transformer. These aren’t magical abstractions; they’re just well-written PySpark functions that you own completely.
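As a rough sketch of the idea (the function below is illustrative, not copied from the generated code): a primitive is just a plain PySpark function that maps a Column to a Column, and a transformer is a module of primitives like it.

# Hypothetical primitive: a small, testable Column -> Column function you own
from pyspark.sql import Column, functions as F

def standardize_state(state: Column) -> Column:
    """Map a handful of full state names to two-letter codes (abbreviated example)."""
    mapping = {"NEW YORK": "NY", "CALIFORNIA": "CA", "ILLINOIS": "IL"}
    cleaned = F.upper(F.trim(state))
    result = cleaned
    for full_name, code in mapping.items():
        result = F.when(cleaned == full_name, F.lit(code)).otherwise(result)
    return result

# Primitives compose like any other Column expression
df = df.withColumn("state_clean", standardize_state(F.col("state")))

Because it's just a function in your repository, you can rename it, add mappings, or delete it without waiting on an upstream release.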
With this approach, you get:
- Primitives that do 90% of the work - Start with proven patterns
- Code that lives in YOUR repository - No external dependencies to manage
- Full ability to modify as needed - It’s your code, change whatever you want
- No dependencies beyond what you already have - If you have PySpark, you’re ready
DataCompose Principle #2: Validate Everything
Data transformations should be validated at every step for edge cases, and should be adjustable for your use case.
Every primitive comes with:
- Comprehensive test cases
- Edge case handling
- Clear documentation of what it does and doesn’t handle
- Configurable behavior for your specific needs
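And because a primitive is an ordinary function in your repo, its tests are ordinary tests you can run and extend. A minimal sketch (assuming pytest and the illustrative standardize_state primitive from earlier):

# Hypothetical test sketch for a primitive you own
import pytest
from pyspark.sql import SparkSession, functions as F
from transformers.pyspark.addresses import standardize_state  # assumed location of the primitive sketched earlier

@pytest.fixture(scope="module")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_standardize_state_maps_full_names(spark):
    df = spark.createDataFrame([("new york",), ("California ",), ("TX",)], ["state"])
    out = df.withColumn("clean", standardize_state(F.col("state")))
    assert [r.clean for r in out.collect()] == ["NY", "CA", "TX"]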
DataCompose Principle #3: Zero Dependencies
No external dependencies beyond Python and PySpark, not even DataCompose itself: the CLI only copies code into your repo. Each primitive must be modular and work on your system without adding extra dependencies.
Why this matters:
- PySpark runs on top of the JVM, so shipping extra packages to every node is complex
- Enterprise environments have strict package approval processes
- Every new dependency is a potential security risk
- Simple is more maintainable
Our commitment: Pure PySpark transformations only.
How it works
Want to dive deeper? Check out our Getting Started Guide for a complete walkthrough.
1. Install DataCompose CLI
pip install datacompose
2. Add the transformers you need - they’re copied to your repo, pre-validated against tests
datacompose add addresses
3. You own the code - use it like any other Python module
# This is in your repository, you own it
from transformers.pyspark.addresses import addresses
from pyspark.sql import functions as F
# Clean and extract address components
result_df = (
    df
    .withColumn("street_number", addresses.extract_street_number(F.col("address")))
    .withColumn("street_name", addresses.extract_street_name(F.col("address")))
    .withColumn("city", addresses.extract_city(F.col("address")))
    .withColumn("state", addresses.standardize_state(F.col("address")))
    .withColumn("zip", addresses.extract_zip_code(F.col("address")))
)

result_df.show()
See all available functions: Check the Address Transformers API Reference
Output:
+-----------------------------------------+-------------+-----------+-----------+-----+-----+
|address                                  |street_number|street_name|city       |state|zip  |
+-----------------------------------------+-------------+-----------+-----------+-----+-----+
|123 Main St, New York, NY 10001          |123          |Main       |New York   |NY   |10001|
|456 Oak Ave Apt 5B, Los Angeles, CA 90001|456          |Oak        |Los Angeles|CA   |90001|
|789 Pine Blvd, Chicago, IL 60601         |789          |Pine       |Chicago    |IL   |60601|
+-----------------------------------------+-------------+-----------+-----------+-----+-----+
4. Need more? Use keyword arguments or modify the source directly
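For instance (the keyword argument below is hypothetical; check the generated module for the options it actually exposes), a primitive can take configuration arguments, or you can simply edit the function body since it lives in your repo:

# Hypothetical: pass a keyword argument to adjust behavior...
result_df = df.withColumn(
    "state",
    addresses.standardize_state(F.col("address"), abbreviate=True),  # assumed parameter
)

# ...or open transformers/pyspark/addresses.py and change the logic directly;
# it's your code, so there is no fork to keep in sync.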
The Future Vision
Our goal is simple: provide clean data transformations as drop-in replacements that you can compose as YOU see fit.
- No magic
- No vendor lock-in
- Just reliable primitives that work
What’s Available Now
We’re starting with the most common data quality problems:
- Addresses — Standardize formats, extract components, validate
- Emails — Clean, validate, extract domains
- Phone Numbers — Format, extract, validate across regions
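Each transformer follows the same copy-to-own pattern as the addresses example above. As a sketch of what email cleaning could look like (the module path and function names here are assumptions, not the published API):

# Hypothetical usage sketch for the emails transformer
from transformers.pyspark.emails import emails  # assumed path, mirroring addresses
from pyspark.sql import functions as F

result_df = (
    df
    .withColumn("email_clean", emails.standardize_email(F.col("email")))  # assumed name
    .withColumn("email_domain", emails.extract_domain(F.col("email")))    # assumed name
)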
What’s Next
Based on community demand, we’re considering:
- Date/time standardization
- Name parsing and formatting
- Currency and number formats
- Custom business identifiers
Want to see something specific? Let us know!
Frequently Asked Questions
Why not just use an LLM to clean the data?
LLMs are:
- Expensive at scale (cleaning 10M phone numbers at $0.002 per call?)
- Non-deterministic (same input, different outputs)
- A compliance problem with PII
- Slow (3-5 seconds per record vs. milliseconds)
LLMs are great for the 2% chaos cases. Use DataCompose for the 98% that should just work.
Isn't copying code into my repo harder to maintain than a library?
External dependencies are hard to maintain too. When they break, you're stuck waiting for fixes. When they change, you're forced to adapt. With DataCompose, you own the code.
Can I track the generated code in version control?
Whoever imports the code can check it into git. Full history, full traceability, full control.
What happens when DataCompose improves a transformer I've already copied?
We're working on a diff tool to help you see what's changed between DataCompose versions and merge improvements you want.
Why start with addresses, emails, and phone numbers?
These three transformers solve immediate problems for 90% of organizations. They're complex enough to show the power of DataCompose's approach, but common enough that you'll use them tomorrow. Plus, when your boss asks "what does this tool do?", you can show them clean customer data in minutes, not hours of regex debugging.
Is DataCompose PySpark-only?
Yes, for now. DataCompose is built specifically for PySpark environments. We know teams use Pandas, Polars, DuckDB, and raw SQL too. If there's enough demand, we'll expand, but we're keeping the scope narrow at first to validate the approach.
If there's enough of a need, we definitely will be.