KimonoLabs is a startup that wants to make it easier for users to extract data from the unstructured Web and expose it as APIs. Other companies are working on the same problem, but, based on my tests, KimonoLabs’ Kimono changes the game like no other product. Kimono allows even users without any prior programming knowledge to extract content from Web pages with just a few mouse clicks, and then expose that content through simple APIs. It is arguably a step toward the semantic Web, and one that could speed up the creation of smart applications: it lets even non-techies pull information from different sources and, with the help of a couple more tools, build services and put them in communication with one another.
In this post, I will show how I used Kimono to create a simple API from a Web page, and suggest how to put the information it provides to good use.
To scrape or not to scrape
If you have ever needed to extract content from Web pages, you know how painful it can be to scrape pages by hand to extract data from unstructured, and often malformed, HTML content.
Why even scrape in the first place?
Well, when presented with a business problem that needs to be solved, you don’t want to reinvent the wheel if you can possibly help it. More often than not, the cost of designing, developing, testing and deploying your own solution would be an order of magnitude higher than reusing an existing solution of some sort. It usually makes sense to check to see if someone within your organization has already solved the same or a similar problem, and how much it would cost to adapt their solution to your case. If there is no such solution in-house, you should then check open source solutions. If even that fails, check out paid services. Only if none of these pans out, or if a careful analysis shows that the cost for such a solution would be higher than developing a new solution internally, should you think about engineering your own solution.
If you are lucky, you’ll find a page that fits your needs and has some basic formatting, a good HTML-nested structure, and some CSS selectors that uniquely identify the fields you are interested in.
You can use something like Scrapy, a fast, extensible solution that requires only that you define a few rules to extract the data you need. You’ll still need some Python programming, and you’ll have to install and configure some software, but you’ll be able to count on a solid crawling engine, and the time required will be a fraction of what it would take to write a crawler and a parser yourself.
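To appreciate what these tools automate away, here is a minimal sketch of a hand-rolled extractor using only Python’s standard library. The markup and the `storylink` class are hypothetical, loosely modeled on a news listing; the point is how much bookkeeping you inherit before you extract a single field.

```python
from html.parser import HTMLParser

# A minimal hand-rolled extractor: collect the text of every anchor carrying
# a given class. This is the plumbing that Scrapy's rule definitions (and
# Kimono's point-and-click interface) spare you.
class TitleExtractor(HTMLParser):
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.in_target = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict() makes lookups easy
        if tag == "a" and dict(attrs).get("class") == self.target_class:
            self.in_target = True

    def handle_data(self, data):
        if self.in_target:
            self.titles.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_target = False

# Hypothetical markup standing in for a real, messier page.
sample = '<tr><td><a class="storylink" href="http://example.com">A story</a></td></tr>'
parser = TitleExtractor("storylink")
parser.feed(sample)
print(parser.titles)  # ['A story']
```

And this handles only one field in well-formed HTML; real pages add malformed tags, pagination, and throttling on top.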
Kimono makes things that much faster and easier: Through a simple graphic interface, you can just select the points you are interested in to shape your API.
First steps with Kimono
The very first step, as you can imagine, is registering yourself (for free) on Kimono’s sign up page.
No sensitive data is needed--just your name, your email and a password. After going through this process, I was immediately brought to the beginning of my journey, which started by adding the Kimono bookmarklet to my browser’s bookmarks bar.
I decided to create an API collecting new threads on Hacker News. I opened https://news.ycombinator.com/newest in the browser and clicked the kimonify bookmark I had added to my bookmarks bar.
You can also paste the following into your browser’s address bar: http://kimonify.kimonolabs.com/kimload?url=https://news.ycombinator.com/newest.
The first part of this URL is the base for the Kimono service; the second part is the target page, upon which you’ll create your API.
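Since the URL is just the Kimono loader base plus the target page, you can build it for any page you want to kimonify. A small helper, sketched with Python’s standard library (the base URL is the one from the article; everything else is generic):

```python
from urllib.parse import quote

# The Kimono loader base, as shown above; the target page is appended to it.
KIMONIFY_BASE = "http://kimonify.kimonolabs.com/kimload?url="

def kimonify_url(target):
    # quote() with ":" and "/" left intact keeps the target readable while
    # escaping characters that would confuse the outer URL
    return KIMONIFY_BASE + quote(target, safe=":/")

print(kimonify_url("https://news.ycombinator.com/newest"))
# http://kimonify.kimonolabs.com/kimload?url=https://news.ycombinator.com/newest
```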
As you can see, the target page has been loaded in a sort of container, with a new bar at the very top of the page. When I moved my mouse over certain fields, I noticed a highlighting effect that wasn’t present in the original page.
In the bottom right corner, you’ll also notice a floating menu. That’s Kimono suggesting your existing projects as starting points for your own API. After all, there is no need to reinvent the wheel!
For this page, in particular, you can see that a lot of APIs (at the time of writing, more than 350) have been created through Kimono.
I encourage you to complete the tutorial and try this great feature: You can pick any existing API and refine it as you like. Pay attention to the symbols below the name of the API; they give you an idea of how many different elements the API scrapes and returns, and so, in turn, of its level of detail, helping you choose the best basis for your own project.
When I clicked on the first entry, all the other titles were automatically highlighted (initially in a different shade of yellow). This essentially lets Kimono know that it should scrape the page, create as many entries as there are items in the list of news, and, for each entry, add a field with the title of the news.
I next added another field to the data: author. First I had to tell Kimono to add a new field, so I clicked on the plus sign next to the yellow circle in the top bar; I could then click on the author name on any of the news items in the list.
At this point, both the author and the “discuss” link for each item were selected. By inspecting the source HTML for the original page, I immediately realized that this happens because the structure of the page doesn’t allow Kimono’s automated tool to distinguish between the two HTML tags:
I highlighted the tags in red in the picture above. Both fields are represented by the same tag, with no class and no other trait to differentiate them, so the automated tool sees them as similar and yet different: It recognizes that the text is contained in two distinct anchor tags, so it is quite easy for the user to choose which of the two (or possibly both) should be selected. Just click on the check icon on any of the highlighted authors, and on the cancel icon (the X mark) on the “discuss” link in that same row. (In this case, it is important to cancel the tag that corresponds to the author you clicked on earlier.)
If you did everything correctly, this is how the page should look now:
Let’s also add the rank of the news to our data. I went through the same steps as before; I just had to be careful that the Y Combinator logo didn’t get selected as well at first. (A little tip: When you add or remove items for a field, the number inside the circle for the current field is updated. Check that all three numbers are the same to ensure that you selected all the elements you meant to, and only those.)
This is how it should all look at this point.
But, of course, I didn’t want news from just one page; I also wanted news from the previous ones. Kimono can help with that, too. At the top, on the Kimono toolbar, is a pagination button.
I clicked on the pagination button and selected the “More” link at the end of the list of news at the bottom of the page. This let Kimono know which link should be used to get to the next page of results.
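Conceptually, what Kimono does with that link is a simple crawl loop: collect the items on the current page, follow the “next” link, and stop when there is no link left or a page limit is hit. A sketch with canned stand-in pages (the page contents and URLs below are invented for illustration):

```python
# Canned stand-ins for fetched pages: each maps to its items and "More" link.
pages = {
    "/newest":     {"items": ["story 1", "story 2"], "next": "/newest?p=2"},
    "/newest?p=2": {"items": ["story 3"],            "next": None},
}

def crawl(start, max_pages=10):
    collected, url, fetched = [], start, 0
    while url is not None and fetched < max_pages:
        page = pages[url]           # in reality: an HTTP fetch plus extraction
        collected.extend(page["items"])
        url = page["next"]          # the link Kimono learned from the "More" click
        fetched += 1
    return collected

print(crawl("/newest"))  # ['story 1', 'story 2', 'story 3']
```

The `max_pages` cap mirrors the page limit you set later, on the API creation page.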
To check the model of my data, I clicked on the “Data model” button in the Kimono bar.
(As you can see, I changed the property names and added yet another field to the data.)
There are three views available. The Simple view, shown above, gives you an overview of the data and lets you change basic attributes like the name of the collection and of the properties to be extracted; the Editor view doesn’t add much to what we have already accomplished, but it would allow users to combine fields, among other things.
The Advanced View is pretty interesting.
In addition to checking in detail that you have used the right selectors and regular expressions to extract your data, you can click on the “Attributes” link for the “title” property.
During my tests, a modal window popped up showing the attributes of the DOM elements selected for my field, which allowed me to decide which ones should be included in my data. (Since I wanted to be able to link to the full text of each news item, I needed to make sure the href checkbox for the title property was selected.)
To double check that the information about the URLs is correctly extracted, I could switch to the “Data preview” (using the button next to “Data model”). This offered me a snapshot of the data extracted, in different formats. I could even download it if I wanted to.
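Once downloaded, the JSON is straightforward to work with. The structure below is an assumption, loosely modeled on the collection and property names created in this walkthrough (a title with its href, an author, a rank); check your own “Data preview” for the exact shape Kimono returns.

```python
import json

# Assumed response shape -- verify against your own "Data preview" output.
raw = """
{
  "results": {
    "collection1": [
      {"title": {"text": "A story", "href": "http://example.com"},
       "author": "someone",
       "rank": "1."}
    ]
  }
}
"""
data = json.loads(raw)
for item in data["results"]["collection1"]:
    # each entry carries the three fields selected in the walkthrough
    print(item["rank"], item["title"]["text"], "-", item["author"])
```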
Once I was sure that everything was working as it should, I clicked on the “Done” button in the top right corner. This brought me to the API creation page, where I could choose details like my API’s name, whether it should be called on demand or updated at regular intervals, and--since I applied pagination--how many pages at most should be retrieved.
That’s almost it! Once the Kimono tool had done its thing, I was directed to the API console.
The API console
The Kimono API console offers several useful options to help users manage their APIs. You can edit the API defined so far at any time (from the “API Detail” tab), change its configuration, test the API endpoint, or download the extracted data as JSON whenever you need it.
The most useful tab, however, is probably “How to use,” which provides snippets of code for several different languages, both front-end and back-end.
In the picture above, you can see, for example, how easy it is to query your newly created API from an HTML page. (The example uses jQuery, but you could use any library you want or even native XMLHttpRequest.)
As mentioned, you can also call these APIs server-side.
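A server-side call is a plain HTTP GET, no SDK required. The sketch below uses Python’s standard library; the endpoint pattern (an API id plus your apikey as a query parameter) is the one shown in the console, and both values here are placeholders you would replace with your own.

```python
import json
from urllib.request import urlopen

API_ID = "YOUR_API_ID"    # placeholder: the id Kimono assigns to your API
API_KEY = "YOUR_API_KEY"  # placeholder: your personal apikey
ENDPOINT = "https://www.kimonolabs.com/api/{}?apikey={}".format(API_ID, API_KEY)

def fetch_news(endpoint=ENDPOINT):
    # a plain GET; the body is the same JSON you saw in the data preview
    with urlopen(endpoint) as response:
        return json.loads(response.read().decode("utf-8"))
```

With real credentials, `fetch_news()` returns the same structure you could download from the console.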
There are a lot more options and even ways to combine different APIs. You can easily extract similar data (for example, movie or book information) from as many sources as you need, and then combine them all into one single interface that will feed data from all of these sources.
You can even use an API as a source for another API, with the first one extracting a list of URLs that will be fed to the second one. Take a look here for more details.
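The chaining idea reduces to a simple composition: the first API yields a list of URLs, and each URL becomes the target of the second. Both functions below are canned stand-ins for the actual API calls, with invented URLs, purely to show the shape of the pipeline.

```python
def list_api():
    # stand-in for the first API: returns detail-page URLs
    return ["http://example.com/a", "http://example.com/b"]

def detail_api(url):
    # stand-in for the second API: extracts fields from one detail page
    return {"url": url, "title": "title for " + url}

# feed every URL from the first API into the second
combined = [detail_api(u) for u in list_api()]
print(len(combined))  # 2
```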
But the coolest part is that, with a few clicks from the “API detail” panel, you can create either Kimono apps, for mobile, or KimonoBlocks--widgets displaying live data extracted from your sources that can be added directly to your Web pages, without any coding.
As you can see above, you can try different styles, check the preview and then just copy and paste the code snippet into any HTML page—it’s as easy as that.
Just as easily, you could create a mobile-oriented standalone application. I say mobile-oriented because the layout is designed for mobile devices, but it can be used as a regular Web application. (Check it out here.)
Compared to the widget creation tool, you’ll have a broader choice of themes, styles, and designs. Currently, only the list visualization is available, but chart and table visualizations are on the way.
(And it literally took me less than 1 minute to create this app.)
Although powerful, Kimono is only as good as the structure of the Web page from which you extract data.
Kimono’s extraction algorithms use CSS selectors to determine which elements to target on the page, and then regular expressions to further refine that set. This can put limitations on the rules you can write, because CSS selectors work in most--but not all--situations.
For example, during tests I tried to extract data from Hacker News’ “Who’s Hiring” thread, to run some statistics on the companies and cities where jobs were being advertised. It turned out to be a much more difficult task than I expected it to be.
The picture above is worth the proverbial 1,000 words. I was interested in the line of text highlighted in green, but I wanted to avoid text from comments added to top-level posts. Without getting too far into the weeds, the tool wouldn’t allow me to do what I wanted to do because:
- Comments are not nested inside posts, just differently indented; and they don’t use a different CSS class, but instead use spacers.
- CSS has no selector for “contains” or “has,” and since spacers are nested inside table columns, I couldn’t select columns next to the ones that contained spacers.
- The first row of text in each post is not contained in any tag, while subsequent lines are wrapped in “p” tags.
You can make up for some of these problems in the advanced panel of Kimono’s “Data model” view, but you are still limited by CSS rules:
Above you can see the rule we would have needed, but it was not accepted by Kimono because it is not a valid CSS selector.
It’s not that Kimono’s engine fails; the limitation lies in CSS itself. You could select the right DOM elements using jQuery (which implements a “has” selector), but not with a pure CSS rule. So, for this particular situation, you might still need to get your hands dirty and scrape manually.
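Getting your hands dirty here means expressing “rows that do NOT contain a spacer image” imperatively, which no CSS selector can do. A sketch with Python’s standard library, over simplified, hypothetical markup (real thread pages are messier): top-level posts have no spacer `img`, while comment rows are indented with one.

```python
from html.parser import HTMLParser

class TopLevelPosts(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []        # text of rows kept (top-level posts)
        self.current = None   # text fragments of the row being parsed
        self.has_spacer = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current, self.has_spacer = [], False
        elif tag == "img" and int(dict(attrs).get("width", 0)) > 0:
            self.has_spacer = True  # an indentation spacer marks a comment row

    def handle_data(self, data):
        if self.current is not None and data.strip():
            self.current.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "tr":
            if not self.has_spacer and self.current:
                self.rows.append(" ".join(self.current))
            self.current = None

# Hypothetical rows: a top-level post, then an indented comment.
sample = ('<tr><td>Acme is hiring in Berlin</td></tr>'
          '<tr><td><img src="s.gif" width="40"></td><td>a reply</td></tr>')
parser = TopLevelPosts()
parser.feed(sample)
print(parser.rows)  # ['Acme is hiring in Berlin']
```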
With all of this said, Kimono is a powerful tool that bears testing and watching. In addition to all of the benefits I’ve noted so far, Kimono provides great documentation, with tutorials that effectively guide users through API creation. The tutorials explain each step of the process in detail, down to the smallest field in the Kimono user interface.
It will be interesting to see how the project evolves, how the few remaining flaws are handled, and whether more complex ways to build rules for the scraper will be introduced. I believe that is a crucial factor for the success of this product category.