Data cleaning is difficult, especially at scale. It’s tedious, often frustrating work, but it remains the foundation of every reliable system and every meaningful analysis.
The challenge becomes even greater on large distributed systems, which is where most significant datasets live. Once a system is large enough, you naturally lose visibility into what’s happening inside it as data moves through each stage.
The Core Problem
Data cleaning is, at its core, business logic. There are countless ways to interpret and correct messy input, and those rules vary widely across domains.
Because of this, data cleaning cannot be fully abstracted or reduced to a universal library. Tools like PyJanitor can be helpful for common patterns, but they cannot encompass the full range of cleaning logic that real systems require.
So how do we design a system that allows for infinite variations in data cleaning while still providing meaningful abstraction?
DataCompose has a simple mission: help users control and validate data across a number of domains in a way that is safe and reproducible.
Design Principle #1: Infinite Combinations Through Copy-to-Own
To support the wide range of data cleaning logic that real systems require, we chose a copy-to-own model inspired by shadcn.
Users import primitives directly into their notebook or codebase, giving them both the flexibility of ready-made components and the ability to modify them when needed.
The recommended workflow is simple:
- Start with the primitives as they are
- If they don’t fully meet your needs, edit the imported source directly
- Own your business logic completely
Why Copy-to-Own?
One of the ongoing challenges with Spark and other distributed systems is dependency management. You have to be extremely careful about what external libraries you bring into the environment.
Even small changes can create:
- Version conflicts
- Unpredictable behavior
- Difficult-to-debug pipeline failures
Because of this, relying on a large dependency-heavy data cleaning framework is often not an option.
The hard part of data cleaning is not moving the data. The hard part is defining the rules and handling every edge case correctly.
These rules differ from team to team, and even from dataset to dataset. A copy-to-own model gives users complete control over the logic that actually matters.
The Universal Library Problem
A one-size-fits-all cleaning library will always fall short because every domain has its own:
- Definitions
- Exceptions
- Requirements
By letting users own the primitives directly, you remove the friction of forcing their business logic into someone else’s abstractions. Teams begin with a working implementation and then adjust it as their needs evolve.
This approach keeps the system flexible without locking users into rigid patterns or overly opinionated defaults.
How It Works
Copy-to-own is a simple model where the source code for the primitives is placed directly into your project. The CLI generates these primitives for you so that you can edit, extend, or replace them as your business logic evolves.
```shell
# Add primitives to your project
datacompose add addresses

# The code is now in YOUR repo
# Edit it however you want
```

Instead of depending on a large external library, you own the implementation. The only component that remains fixed is the IO class. Everything else can be modified to fit your domain.
Design Principle #2: Primitives as Building Blocks
Primitives are the smallest units of data cleaning logic in DataCompose.
Each primitive performs one clear, well-defined transformation. They are intentionally scoped to do a single thing, and to do it in a predictable and testable way.
Instead of hiding complex behavior behind large, opinionated functions, primitives make every step of the cleaning process explicit.
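As a concrete illustration, two such primitives might look like the following plain-Python sketch. The function names and rules here are illustrative only; DataCompose's actual primitives target Spark Column expressions:

```python
import re

def normalize_whitespace(value: str) -> str:
    # Collapse runs of whitespace to single spaces and trim the ends.
    return re.sub(r"\s+", " ", value).strip()

def remove_extensions(value: str) -> str:
    # Drop a trailing "ext. 123" or "x123" style phone extension.
    return re.sub(r"\s*(?:ext\.?|x)\s*\d+$", "", value, flags=re.IGNORECASE)
```

Each function does exactly one thing, can be read in a few seconds, and can be tested in isolation before it is ever combined with anything else.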
Two Key Advantages
1. Transparency. You can read a primitive and immediately understand what it does to the data. No magic, no hidden complexity.
2. Flexibility. Because each primitive is small and focused, it can be combined with others to express more complex behavior without creating a large or rigid abstraction.
Built to Be Edited
Primitives are also built to be edited. When the CLI copies them into your project, you are free to adjust or extend their logic to match your domain rules.
This is important because most real data contains edge cases that no library can anticipate. By giving users direct access to the source, DataCompose avoids the limitations of a single universal cleaning framework and lets teams define the exact behavior they need.
A Vocabulary for Data Cleaning
In practice, primitives function like the vocabulary of a small language for data cleaning. Each one describes a single transformation. Composition is how those transformations are combined into full pipelines.
Together, they form a system that is both structured and adaptable, grounded in simple parts that scale to complex workflows.
Design Principle #3: Composition for Complex Pipelines
Once the primitives are in your codebase, the next step is combining them into meaningful transformations. Each primitive handles a single, well-scoped piece of logic. Composition is how those pieces come together to form a complete cleaning pipeline.
Why Composition Matters
Real data cleaning is rarely a single operation. It’s a sequence of decisions and corrections that build on each other:
```python
# Without composition - hard to read and maintain
df = df.withColumn(
    "phone",
    standardize_format(
        remove_extensions(
            extract_phone_number(
                normalize_whitespace(F.col("raw_input"))
            )
        )
    ),
)

# With composition - clear and maintainable
from transformers.pyspark.phones import (
    compose,
    normalize_whitespace,
    extract_phone_number,
    remove_extensions,
    standardize_format,
)

clean_phone = compose(
    normalize_whitespace,
    extract_phone_number,
    remove_extensions,
    standardize_format,
)

df = df.withColumn("phone", clean_phone(F.col("raw_input")))
```

You may normalize a field, validate it, correct a known formatting issue, and then enforce a domain rule. Writing all of that logic inline quickly becomes difficult to review or maintain.
Composed primitives let you break the logic into clear, testable units and then chain them in a way that reflects the actual intent of the cleaning process.
The Compose Operator
The compose operator provides a small domain-specific language for these transformations. Instead of wiring functions together manually, the operator creates a structured pipeline that can be inspected and executed consistently.
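The idea behind such an operator can be sketched in a few lines of plain Python. This is a simplified illustration, not DataCompose's actual implementation, which operates on Spark Column expressions:

```python
from functools import reduce

def compose(*steps):
    """Chain one-argument transformations, applying them in the order listed."""
    def pipeline(value):
        # The first step listed runs first, mirroring the nested call form.
        return reduce(lambda acc, step: step(acc), steps, value)
    return pipeline

# Usage: equivalent to str.lower(str.strip(value)).
clean = compose(str.strip, str.lower)
```

Listing steps in execution order is the key readability win: the composed version reads top to bottom the way the data actually flows, while the nested version reads inside out.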
This has two main benefits:
1. Readability. You can look at a composed sequence of primitives and understand exactly what the cleaning logic is doing.
2. Analyzability. Because the composition is explicit, DataCompose can analyze the structure, validate it, and prepare it for execution across different backends.
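One way to picture that analyzability: if a pipeline holds its steps as explicit data rather than an opaque nested call, tooling can walk the structure before anything runs. The class below is a hypothetical sketch of the idea, not DataCompose's internals:

```python
class Pipeline:
    """Hold the steps explicitly so they can be inspected, not just run."""

    def __init__(self, *steps):
        self.steps = steps

    def __call__(self, value):
        for step in self.steps:
            value = step(value)
        return value

    def describe(self):
        # A backend could walk this list to validate each step or
        # translate the whole sequence for a different engine.
        return [step.__name__ for step in self.steps]
```

Anything built this way can be validated, logged, or retargeted without executing it, which is what makes backend portability possible in principle.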
The Result: Control, Clarity, and Portability
Data cleaning will always be one of the hardest parts of working with real-world data. It’s full of irregularities, edge cases, and domain-specific rules that cannot be predicted in advance.
Most attempts to fully abstract this process eventually break down, because the logic that matters is the logic that belongs to the business itself.
The design decisions in DataCompose are meant to balance flexibility with structure:
- Primitives give a clear vocabulary for expressing individual transformations
- Copy-to-own gives teams full control over their rules without locking them into rigid assumptions
- Composition provides a readable way to assemble these rules into reliable pipelines
Accepting Reality
The goal is to give engineers control, clarity, and portability. As business logic changes, the system can change with it.
Instead of trying to force data cleaning into a universal abstraction, DataCompose focuses on giving users the tools to express their own logic in a consistent and reproducible way.
The result is a model that accepts the reality of data cleaning rather than fighting it.
It provides a structured foundation while leaving room for the infinite variations that real systems require.
Get Started
Ready to see how primitives work in practice?
- Getting Started Guide - Complete walkthrough
- Primitives Concepts - Deep dive into primitives
- API Reference - See all available primitives
Or jump straight in:
```shell
pip install datacompose
datacompose add addresses
```

Your data cleaning logic, your way.