Web data is so ubiquitous it has all but become a commodity. In a world dominated by search, it’s easy to forget that a search engine is not the only use case for Web data. Thousands of researchers, entrepreneurs, and business executives tap into this vast data source to deliver insights, either for internal applications or as a service to customers (e.g. social media monitoring, business intelligence, or financial analysis). Unfortunately, access to structured Web data tends to be limited at best, and heavily gated by billion-dollar corporations. For the rest of us, acquiring and processing this data into a machine-readable format presents a growing technological challenge: it means investing heavily in computing resources and software development to deliver a crawling and structuring solution at scale. Unless structuring and organizing the planet’s biggest dataset is your business, the challenge rarely justifies an internal development initiative. In fact, scraping, crawling, and structuring Web data internally has arguably become one of the most inefficient ways to access the surface Web. Consuming crawled Web data through an API makes far more sense from both a technological and a business perspective.
Driven by customer demand, the Webhose.io API delivers structured Web data anyone can consume on demand. The API queries an index of crawled Web content from regularly updated vertical data pools: blogs, news articles, and associated online discussions such as comments, reviews, and message boards. Proprietary crawlers index millions of Web pages daily, structuring the raw data into extracted, inferred, and enriched layers of information. Consumers of Web data tap into our index to deliver analysis in the form of knowledge and wisdom. The deal is pretty simple. We crawl the Web. You get the data.
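To make the consumption model concrete, here is a minimal sketch of building a request against a Webhose.io-style search endpoint. The endpoint path, parameter names, and query syntax below are illustrative assumptions, not a definitive API reference; consult the official documentation for the actual interface.

```python
from urllib.parse import urlencode

# Hypothetical base URL for a filtered Web-content search endpoint.
# This is an assumption for illustration, not the confirmed API path.
BASE_URL = "https://webhose.io/filterWebContent"

def build_query_url(token, query, fmt="json"):
    """Return a request URL for the given search query.

    `token`, `format`, and `q` are assumed parameter names.
    """
    params = {"token": token, "format": fmt, "q": query}
    return BASE_URL + "?" + urlencode(params)

# Fetching this URL (e.g. with urllib.request) would return a page of
# structured results; here we only construct the query.
url = build_query_url("YOUR_API_TOKEN", 'language:english site_type:news')
```

The point of the sketch is the shape of the workflow: instead of operating crawlers, the consumer expresses a filter over an already-structured index and receives machine-readable results.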
Still, the business case isn’t always obvious if you’re not a data scientist or engineer. Whether you’re considering replacing an internal solution or relying on a new data provider, how do you compare two (or more) datasets that are not only huge, but keep growing by the nanosecond? Start by reviewing the following key elements:
1. Granularity
Granularity refers to the depth of detail included in the dataset. In information science, the DIKW hierarchy model offers four layers of information types – data, information, knowledge, and wisdom.
Let’s assume it’s your job to provide analysis based on usable Web data. Your expertise is to deliver knowledge and wisdom based on raw data. The more granular the information you start with, the easier it is to refine it into usable knowledge and wisdom.
The first layer of granularity is to structure extracted information into a machine-readable format. This could mean parsing HTML and text into a data structure that includes key fields such as URL, title, or body text. These fields are relatively trivial to extract, as they are clearly marked up in most Web pages.
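This extracted layer can be sketched with nothing more than the standard library. The example below pulls the title and visible text out of raw HTML; a production crawler would use a far more robust extraction pipeline, so treat this purely as an illustration of the idea.

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collects the <title> contents and visible text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())

def extract(url, html):
    """Structure raw HTML into the key extracted fields."""
    parser = PageExtractor()
    parser.feed(html)
    return {"url": url, "title": parser.title, "text": " ".join(parser.text_parts)}

record = extract(
    "http://example.com",
    "<html><head><title>Hello</title></head><body><p>World</p></body></html>",
)
```

The output is exactly the kind of machine-readable record (URL, title, body text) that downstream analysis expects as its starting point.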
Adding inferred information requires more computing resources. Rudimentary language-dependent information falls into this category – the language of a particular Web page, or its associated country (inferred from a combination of TLD and language). Other types of inferred information include author identification and element detection, such as image or video content.
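A toy version of the TLD-plus-language inference might look like the following. The mappings are tiny illustrative samples, not exhaustive tables, and a real system would obtain the language from a language-identification model rather than take it as an argument.

```python
from urllib.parse import urlparse

# Small illustrative sample of country-code TLDs; not exhaustive.
TLD_COUNTRY = {"fr": "France", "de": "Germany", "jp": "Japan", "uk": "United Kingdom"}

def infer_country(url, language=None):
    """Infer a likely country from the URL's top-level domain.

    For generic TLDs (.com, .org, ...) fall back on the detected
    language as a weaker signal.
    """
    tld = urlparse(url).hostname.rsplit(".", 1)[-1]
    if tld in TLD_COUNTRY:
        return TLD_COUNTRY[tld]
    # Generic TLD: fall back to language (also a tiny sample mapping).
    fallback = {"french": "France", "german": "Germany"}
    return fallback.get(language, "unknown")

country = infer_country("https://www.lemonde.fr/article")
```

Even this crude heuristic shows why inference costs more than extraction: the answer is not sitting in the markup, it has to be derived from multiple signals.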
A final layer of enriched information requires even more processing power; a good example is the ability to identify keywords within a Web page as named entities. Does the word “apple” on a given page refer to the fruit or the computer company? Does my query for “Paris Hilton” refer to a Parisian hotel or to a celebrity? You might even enrich this information higher up the pyramid towards knowledge, to determine whether the reference carries a positive or negative sentiment. The more granular the data you start with, the easier it is to filter as you move up the pyramid towards analysis.
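The “apple” problem above can be made concrete with a deliberately crude sketch: disambiguating an ambiguous keyword by counting overlapping context words. Real enrichment relies on statistical NLP models; the hand-picked cue words here are assumptions chosen purely to illustrate the problem.

```python
# Hand-picked context cues for each sense of an ambiguous keyword.
# A real system would learn these signals rather than hard-code them.
SENSES = {
    "apple": {
        "company": {"iphone", "mac", "computer", "software", "shares"},
        "fruit": {"pie", "orchard", "juice", "tree", "eat"},
    }
}

def disambiguate(keyword, text):
    """Pick the sense whose cue words overlap the surrounding text most."""
    words = set(text.lower().split())
    best, best_score = "unknown", 0
    for sense, cues in SENSES.get(keyword.lower(), {}).items():
        score = len(cues & words)
        if score > best_score:
            best, best_score = sense, score
    return best

sense = disambiguate("Apple", "Apple unveiled a new iPhone and Mac software update")
```

Note how each step up the pyramid (extract, infer, enrich) consumes the output of the previous layer, which is why granular inputs make the later stages so much cheaper.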
2. Enterprise-Class Coverage
When it comes to data measurement, the first question most researchers, data scientists, and even business executives ask is “How much of the Web do you crawl?” Unfortunately, any figure or percentage estimate would be misleading at best. The Web is a constantly evolving and fragmented collection of unstructured data. Extracting that data and then structuring it as a prerequisite for analysis means making intelligent compromises. From a business and technology perspective, the real question is “What is the best possible coverage you can provide given finite resources?” Answering that question is an ongoing technological challenge that is driving the phenomenal growth of the emerging Data-as-a-Service solution category.