The problem DataCompose is focused on solving is Garbage In, Garbage Out. I want to walk through three stories, each exploring a different way data errors lead to bad outcomes: broken analytics and reporting, operational breakdowns, and flawed AI/ML.
Story 1: Target Canada (Bad Reporting)
Under tight deadlines, entry-level employees manually entered data for 75,000 products into the company’s brand-new SAP system, recording dimensions in inches instead of centimeters, entering prices in USD instead of CAD, and scrambling width/height/length fields. Only 30% of the data was accurate, versus 98-99% in US operations.
Executives saw dashboards showing “inventory in stock” while store shelves sat empty. Target Canada shuttered after $7 billion in losses.
Garbage in, garbage out.
Story 2: TSB Bank (Operational Issues)
TSB’s 2018 migration of 5.2 million customer accounts to a new platform called Proteo4UK exposed fundamental data testing gaps:
- 1.9 million customers were locked out of accounts
- Some customers could see other customers’ account information, a catastrophic data integrity and security failure
- Money appeared to “disappear” from accounts
- Mortgage accounts vanished from customer views
TSB lost 80,000 customers, and the failed migration ultimately cost it more than £330 million.
Garbage in, garbage out.
Story 3: Epic’s Sepsis Model (AI/ML)
Epic’s Sepsis Model represents a systemic failure affecting patient care at hundreds of U.S. hospitals. The model was trained to predict sepsis, a condition causing one-third of all hospital deaths, but suffered from critical data quality issues.
It was trained on billing codes for sepsis treatment rather than actual clinical onset, and incorporated features like blood culture orders that only appear after clinicians already suspect sepsis (data leakage).
A 2021 University of Michigan study found the model achieved only 0.63 AUC (versus Epic’s claimed 0.76-0.83) and missed 67% of sepsis patients. When researchers restricted analysis to data available before diagnostic tests were ordered, accuracy dropped from 87% to 53%.
The model essentially learned to detect clinician suspicion rather than predict sepsis, yet it generated alerts for 18% of all hospitalized patients, contributing to physician alert fatigue.
Garbage in, garbage out.
The Common Denominator
The three disasters above share a common thread: rigid systems could not handle edge cases.
- Target’s SAP couldn’t flag inches entered where centimeters belonged
- TSB’s migration tools couldn’t catch inconsistent customer records
- Epic’s model couldn’t distinguish billing codes from clinical data
Traditional cleaning frameworks force you into their assumptions. They provide a clean_address() function that works great until your addresses include PO boxes, military bases, or international formats. Then you’re stuck forking their repo or writing custom code from scratch.
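To make this concrete, here’s a hypothetical sketch (not any particular library’s real API) of where monolithic cleaners tend to end up: every edge case the maintainers agree to support becomes another keyword argument, and yours may never make the list.

# Hypothetical monolithic cleaner (illustrative only, not a real API)
def clean_address(addr,
                  expand_abbreviations=True,
                  keep_po_boxes=False,
                  military_format=False,
                  country='US'):
    # One opaque function, a growing pile of flags, and your edge case
    # still isn't covered without forking the repo.
    ...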
A Different Tactic
DataCompose takes a different approach. Instead of monolithic cleaning functions, we provide atomic primitives that you compose together:
from pyspark.sql.functions import col

# Primitives for email cleaning
normalize_email(col('email'))         # lowercase, trim whitespace
remove_email_dots(col('email'))       # gmail dot handling
validate_email_domain(col('email'))   # check domain exists

# Compose them for your use case
df.withColumn('clean_email',
    validate_email_domain(
        remove_email_dots(
            normalize_email(col('email'))
        )
    )
)
Data primitives are small, atomic cleaning functions that can be applied to your DataFrames and SQL tables and combined flexibly, so you can clean data without reinventing the wheel every single time.
Each primitive does one thing. You own the code. Copy it into your repo, modify it for your edge cases, combine it however you need. When your business logic changes (and it will), you adjust the primitives, not the framework.
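As a sketch of what that ownership looks like (assuming PySpark, with an illustrative body rather than the actual generated primitive), a primitive is just a small function in your repo that you extend in place:

from pyspark.sql import Column
from pyspark.sql import functions as F

# The primitive is a small, readable function living in your repo.
def normalize_email(email: Column) -> Column:
    cleaned = F.lower(F.trim(email))
    # Your edge case, added in one line: collapse '+tag' aliases
    # (user+promo@example.com -> user@example.com).
    return F.regexp_replace(cleaned, r'\+[^@]*@', '@')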
This is what Target needed: primitives that could validate units, catch currency mismatches, and enforce dimension constraints, composed into their specific data quality rules. This is what TSB needed: primitives to validate customer records during migration rather than discovering problems in production.
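As an illustration (hypothetical primitives, not shipped code), the unit check Target needed composes like anything else: a value far too small to be centimeters was probably entered in inches.

from pyspark.sql import functions as F

# Hypothetical primitive: flag widths that look like inches typed into
# a centimeters field (1 in = 2.54 cm, so plausible cm values run larger).
def flag_suspect_units(cm_col, min_plausible_cm=2.5):
    return F.when(F.col(cm_col) < min_plausible_cm,
                  F.lit('suspect_inches')).otherwise(F.lit('ok'))

df = df.withColumn('width_flag', flag_suspect_units('width_cm'))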
DataCompose is built on the assumption that we don’t have all the answers and can’t fill every one of your data gaps. But we can make your data easier to access and transform, and we want the library to grow to fit your exact needs.
DataCompose is here to solve the problem of Garbage In, Garbage Out. Data professionals deserve better than burying logic in keyword arguments.
Want to see how primitives work? Check out the primitives documentation or get started:
pip install datacompose