Chris Taggart, founder of OpenCorporates, believes that its API will be the principal way that its data is consumed by end users within the next five years.
Speaking at API Strategy and Practice / APIDays in Berlin today, only three days after the release of Version 4 of the OpenCorporates API, Taggart shared some of the origins of, and the unique challenges facing, the open data business that aims to build the world’s most comprehensive underlying dataset of company information.
“Government information on companies is siloed in various registers,” says Taggart. He believes that by de-siloing that data, more useful analysis can be performed, for example, analyzing the health and safety reputations of companies where transgressions are stored in one dataset and the company data in another (as occurs in the UK).
Taggart, a former journalist who ran magazine publishing companies, used the money he made from the sale of a previous business to start OpenCorporates, which has now reached the stage of having a sustainable business model.
“A company is a legal construct and we have assembled this data from company registers around the world,” Taggart says. Key to the approach taken is that OpenCorporates automatically ingests all its data from primary sources, so there have been no manual imports. Instead, a variety of webscraping, PDF scraping, CSV dumps, and API ingestion is used, with specific code written where needed to pull in the data. While datasets are then normalized so that company names can be matched up globally (thus helping identify, for example, foreign branches of home companies), the underlying data is not changed.
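The matching step described above can be sketched as follows. This is a minimal illustration, not OpenCorporates' actual method: the suffix table, the `normalise_name` and `group_by_key` helpers, and the record layout are all hypothetical. The point it shows is that the normalized name is used only as a lookup key, while the source records themselves are left unchanged.

```python
import re
import unicodedata

# Hypothetical word table for illustration; a real pipeline would need a
# much larger, jurisdiction-aware list.
SHORT_FORMS = {
    "limited": "ltd",
    "incorporated": "inc",
    "corporation": "corp",
    "company": "co",
}


def normalise_name(name: str) -> str:
    """Return a matching key for a company name; the input is not modified."""
    # Strip accents so "Société" and "Societe" produce the same key.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and replace runs of punctuation or whitespace with one space.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    # Map long-form legal words to one short form wherever they appear.
    return " ".join(SHORT_FORMS.get(word, word) for word in text.split())


def group_by_key(records):
    """Group records by normalised name, keeping each source record untouched."""
    groups = {}
    for record in records:
        groups.setdefault(normalise_name(record["name"]), []).append(record)
    return groups
```

Under these assumptions, a record named "Barclays Bank Limited" from one register and one named "BARCLAYS BANK LTD." from another would land in the same group, while each record keeps the name exactly as its source published it.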
“Identifiers is a pretty critical issue,” says Taggart, a challenge also faced by other company information businesses such as Owler, which has an internal mantra of ‘death to DUNS’ (the proprietary company identifier used by Dun and Bradstreet). Taggart sees proprietary identifiers as leading back to the core problem that OpenCorporates is trying to solve, that is, open access to business information: “The Dun and Bradstreet identifier is a proprietary identifier with lock-in and no visibility.”
OpenCorporates’ mission is to aggregate every bit of company-related public data and match it to the relevant company, says Taggart.
“We have already made a good start, but a very small start: including trademarks and corporate structure. Our current focus is on U.S. business licenses (we have a product being released in a month or two), financial licenses, and government gazettes.”
OpenCorporates' roadmap is driven in part by user demand, and by the core team identifying things that are structurally useful. “For example,” says Taggart, “getting data on U.S. businesses is difficult. Businesses must register at the state level but that doesn't have any information on industry codes, who the people managing the business are, no trading addresses or trading names, no financials. But if the business starts to employ people, or open up a shopfront, well, you need licenses for that. So these are disparate datasets, but by pulling this stuff together we can build one complex structure that gives you a much better idea about things.”
Data Quality in Proprietary Business Information Services
Compared with proprietary business information data systems, Taggart sees OpenCorporates as better suited to providing high quality, accurate information.
He points to six data quality issues commonly seen in proprietary, vendor-locked company information data services. In his presentation, Taggart said:
- Data accuracy is problematic: “Most proprietary business data has been re-keyed, so you have errors happening that are not being picked up.”
- Gaps in data
- Lack of granularity: “Legacy systems of business data are often unable to provide the level of detail needed.”
- Errors go unchecked: Taggart notes that OpenCorporates receives around 20 emails each day pointing to errors in its corporate datasets, even though the data is drawn from official sources. Fraudulent data can be more easily discovered when it is open sourced and made available in the way OpenCorporates does.
- Black box/no provenance: “We always say where we get the data from, and what year it was. Often proprietary data is the same data that goes round and round between various company purchases. Sharing data on the date and source of company data also means it is better able to provide context, and allows appropriate comparisons to be made between datasets.”
- Isolated: Proprietary identifiers create barriers to sharing data openly and prevent others from helping improve data quality.
Version 4 of OpenCorporates API
Taggart announced that Version 4 of the OpenCorporates API was released earlier this week, noting: “We expect that in five years, all of our data will be accessed solely by API.”
New features in the latest version include:
- companies can be searched by their registered address
- searches for companies can begin with a given phrase (e.g., ‘Barclays Bank’)
- advanced filters allow searching across multiple jurisdictions (e.g., 'Ireland and UK') or by country, and can include or exclude inactive companies, branches, and nonprofits (for example, showing only nonprofit companies, or excluding them from private company searches)
- filtering of companies by officers’ addresses, dates of birth, or position
- greater date searching
- a revamp of how industry codes are allocated to companies that allows for more granular search filtering.
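As a rough sketch of how a search combining several of these filters might be expressed, the helper below only builds a request URL and sends nothing. The base URL, the `v0.4` path, and the parameter names (`q`, `jurisdiction_code`, `registered_address`, `inactive`, `api_token`) are assumptions made for illustration and should be checked against the current OpenCorporates API reference; `company_search_url` itself is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed base URL and parameter names; check them against the current
# OpenCorporates API reference before relying on them.
BASE_URL = "https://api.opencorporates.com/v0.4"


def company_search_url(query, jurisdictions=(), registered_address=None,
                       inactive=None, api_token=None):
    """Build a company search URL without sending any request."""
    params = {"q": query}
    if jurisdictions:
        # Assumption: several jurisdictions are joined with "|".
        params["jurisdiction_code"] = "|".join(jurisdictions)
    if registered_address:
        params["registered_address"] = registered_address
    if inactive is not None:
        params["inactive"] = "true" if inactive else "false"
    if api_token:
        # Assumption: an API key, when available, is passed as a query parameter.
        params["api_token"] = api_token
    return f"{BASE_URL}/companies/search?{urlencode(params)}"


url = company_search_url("barclays bank", jurisdictions=("gb", "ie"))
```

Keeping URL construction separate from the HTTP call makes the filter logic easy to inspect and test before any request is made.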
While some API calls do not require API registration, the new version also allows those with an API key to make calls to get company officers’ addresses and dates of birth programmatically.
The new searches help surface company data for a variety of use cases. Users can, for example, discover which businesses operate in a particular building, or assess the likelihood of fraudulent record-keeping by filtering for only those companies with an active company officer over 105 years of age (Taggart says there are about 400 in the UK alone).
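The age check in that last example can be sketched locally, assuming officer records have already been retrieved. The record layout (`name`, `date_of_birth`, `active`) and the sample data are made up for illustration and are not the API's response format.

```python
from datetime import date


def age_on(date_of_birth: date, today: date) -> int:
    """Whole years between date_of_birth and today."""
    years = today.year - date_of_birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def implausibly_old(officers, today, max_age=105):
    """Return active officers whose recorded age exceeds max_age."""
    return [
        o for o in officers
        if o["active"] and age_on(o["date_of_birth"], today) > max_age
    ]


# Made-up records, for illustration only.
officers = [
    {"name": "A. Example", "date_of_birth": date(1900, 1, 1), "active": True},
    {"name": "B. Example", "date_of_birth": date(1970, 6, 15), "active": True},
    {"name": "C. Example", "date_of_birth": date(1895, 3, 2), "active": False},
]
flagged = implausibly_old(officers, today=date(2015, 4, 24))
```

A flagged record is only a prompt for closer inspection: an implausible date of birth may be a data-entry error rather than evidence of fraud.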
A recent blog post on the OpenCorporates website by Tony Hirst shows some of the powerful analyses that can be performed with the API, including mapping which other companies are owned by one company’s largest shareholders, which companies (and their subsidiaries) receive payments from city governments, and which corporate entities might share the same location (or even trade under multiple names at the one location).
The OpenCorporates Business Model
Taggart stresses that openness is central to the model: “Open is critical here: everything we do is available for free, without registration, on the web.”
While having many links with non-profit organizations like the Open Knowledge Foundation, Taggart doesn’t want people to be confused about the business model: “We are a for-profit company with a strong social mission. This is also critical to quality. By being open in this way we can ensure the quality the world needs.”
He says OpenCorporates drew on successful open source approaches as a way to define its innovative business model: a share-alike or paid service.
“If you use the API, anything you use it in has also got to be open sourced,” explains Taggart, defining the share-alike part of the equation. “Or you can build proprietary tools and you pay for API access,” he adds, explaining the profit-generation side. “This allows us to be a public-good oriented company and, at the same time, to be sustainable. Now, the next challenge is how do we scale this up: how do we handle the sales cycle?”
Taggart says that the approach to date has allowed OpenCorporates to prove itself to its paying clients and that, together with his own initial investment, this gave the business room to grow to the point where it is now sustainable. “It has been difficult but it is now working,” he said.