News & Insights

Data Provenance

We manage a lot of data at IEI and, more often than not, I find that the key to cleaning up and improving the data we get lies in where the data came from in the first place – in other words, its “provenance.”

Data “sources” roughly fall into these groups:

    • Proprietary structured databases
    • Internal customer or prospect data
      • CRM data
      • Circulation files
      • One-off customer purchases
    • Public data
      • Government filings
      • Public transaction data (shipping manifests, bills of lading)
      • User-generated content (reviews, rankings)
      • News (press releases, news articles, blog posts)
      • Web information (addresses, bios, products)

A typical data project involves deduping, normalizing, appending missing information, and direct verification via a combination of in-house researchers, trained crowdsourced workers, and software tools. The choice of these tools depends not only on the desired end result of the project (a publishable database, a list clean enough to use for marketing) but also on where the data came from.
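As a rough illustration of the normalizing and deduping steps, here is a minimal Python sketch. The field names (`company`, `postal_code`) and the choice of match key are assumptions made for the example, not a description of our production tooling; real projects usually need fuzzier matching than an exact key.

```python
import re


def normalize(record):
    """Return a copy with whitespace collapsed, ends trimmed, and text lower-cased."""
    return {
        key: re.sub(r"\s+", " ", value).strip().lower()
        for key, value in record.items()
    }


def dedupe(records, key_fields=("company", "postal_code")):
    """Keep the first record seen for each normalized key (hypothetical key fields)."""
    seen = set()
    unique = []
    for record in records:
        clean = normalize(record)
        key = tuple(clean.get(field, "") for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(clean)
    return unique


# Made-up sample rows: the first two differ only in spacing and case.
records = [
    {"company": "Acme  Corp", "postal_code": "78702"},
    {"company": "acme corp ", "postal_code": "78702"},
    {"company": "Globex", "postal_code": "10203"},
]
result = dedupe(records)
```

Normalizing before comparing is what lets the two "Acme" rows collapse into one; deduping on raw values would keep both.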

Typical red flags in data’s provenance:

  • Was the data harvested? (old, miscategorized, unverified)
  • Was the data entered by hand? (misspellings, transposed fields, missing required fields)
  • If internal data, when was it last used? (old)
  • Were multiple sources combined? (mixed formatting conventions, truncation)

Based on these indicators, we define a process to address each of the issues (re-categorizing, fixing spelling errors, targeting key missing fields) in a logical sequence.
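One way to picture this is as a lookup from red flags to cleanup steps, run in a fixed order. The sketch below is only illustrative: the flag names, the step names, and the ordering are assumptions for the example, and verification is always placed last because it is the one step every project needs.

```python
# Hypothetical mapping from provenance red flags to the cleanup they call for.
REMEDIATION = {
    "harvested": ["re-categorize"],
    "hand_entered": ["fix spelling", "fill required fields"],
    "stale_internal": [],  # nothing beyond the verification every project gets
    "multiple_sources": ["normalize formatting"],
}

# Assumed sequence: structural fixes first, verification last.
STEP_ORDER = [
    "normalize formatting",
    "fix spelling",
    "re-categorize",
    "fill required fields",
    "verify against primary source",
]


def plan(flags):
    """Return the ordered cleanup steps for a set of provenance red flags."""
    needed = {"verify against primary source"}
    for flag in flags:
        needed.update(REMEDIATION[flag])
    return [step for step in STEP_ORDER if step in needed]


steps = plan(["harvested", "multiple_sources"])
```

For a harvested list combined from several sources, this plan would normalize formatting, re-categorize, and then verify.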

The one indispensable step in all data projects is direct verification via primary sources. These sources can include recent government filings, official websites, or direct communication with a person at the company in question. Without this step at the end of the process, there is a significant risk of introducing old or incorrect data into the deliverable. This final verification also adds value as a citation of the data’s accuracy, much like a “sell by” date. Increasingly, data customers expect this piece of metadata as a “certificate of authenticity,” and for good reason: their customers, whether paying subscribers or internal sales teams, need to know where the data came from, too.
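To make the “sell by” idea concrete, a record can carry its verification date and source alongside the data itself. This is a minimal sketch; the class name, field names, and the 180-day freshness window are assumptions chosen for illustration, not a standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class VerifiedRecord:
    company: str
    phone: str
    verified_on: date   # when the record was last checked
    verified_via: str   # e.g. "government filing", "official website", "phone call"

    def is_fresh(self, today, max_age_days=180):
        """True if the last verification falls within the assumed freshness window."""
        return today - self.verified_on <= timedelta(days=max_age_days)


# Made-up sample record.
record = VerifiedRecord("Example Co", "555-0100", date(2024, 1, 15), "official website")
fresh = record.is_fresh(today=date(2024, 3, 1))
stale = not record.is_fresh(today=date(2025, 1, 15))
```

Storing the source next to the date lets a downstream user judge not only how recent the check was but how much weight to give it.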

