Data Extraction? Clipping? API? Screen Scraping? Explained with Tutorials and Links!
There are many different ways to create mashups, and the terminology can be confusing. In a great article about the different types and their respective functions within a mashup, Michael Ogrinz has posted about these terms and what they really mean.
I found Michael Ogrinz originally from a post about clippings on programmableweb.
But before you leave to go to the holyprogrammableland, I’ll lift a few of the key definitions and provide some examples:
“Data Extraction
If you intend to harvest information from closed Web sites that don’t expose an API, this feature provides some level of parsing against the underlying DOM. Some tools like Kapow let you go after explicit parts of a page. Others, like Connotate or Dapper, use artificial intelligence to help pull out pertinent details. Each approach has its plusses and minuses. Sometimes you need a very fine level of control over extraction operations, but this approach doesn’t scale if you have thousands of sites to mine.”
This is at the root of screen scraping. OOOoooo
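To make "parsing against the underlying DOM" a bit more concrete, here is a minimal sketch in Python using only the standard library. It goes after explicit parts of a page (the Kapow-style approach from the quote) and pulls out discrete values. Everything in it is made up for illustration: the `SAMPLE_HTML` page, the `listing`/`title`/`price` class names, and the `FieldExtractor` and `extract` helpers are assumptions, and this is not how Dapper or openKapow work internally.

```python
from html.parser import HTMLParser

# Hypothetical page snippet standing in for a closed site with no API.
SAMPLE_HTML = """
<html><body>
  <div class="listing"><span class="title">Red bike</span><span class="price">$120</span></div>
  <div class="listing"><span class="title">Blue kayak</span><span class="price">$340</span></div>
</body></html>
"""


class FieldExtractor(HTMLParser):
    """Collect the text of <span> elements whose class is one of `fields`."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.current = None  # field name of the span we are inside, if any
        self.rows = []       # one dict per <div class="listing">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "listing":
            self.rows.append({})
        elif tag == "span" and attrs.get("class") in self.fields and self.rows:
            self.current = attrs["class"]

    def handle_data(self, data):
        if self.current is not None:
            row = self.rows[-1]
            row[self.current] = row.get(self.current, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None


def extract(html, fields=("title", "price")):
    """Return a list of dicts with the requested fields for each listing."""
    parser = FieldExtractor(fields)
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for row in extract(SAMPLE_HTML):
        print(row)
```

The catch the quote mentions shows up right away: the class names are hard-coded to one page layout, so a hand-written extractor like this gives fine control but would need rewriting for every new site.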
Free Data Extraction Services (well, most of them are free)
- Dapper.net - the first one that I stumbled upon. I'm not sure it was the first, but it's a great place to start. It has a wide variety of export formats (google widget, rss, xml, and more!) and a great interface that makes it quite easy for anyone to start extracting the data they want.
Warning: dapper.net can be quite slow and sometimes goes down outright. If you rely on a dapp to provide the core information of your website, you should be wary.
- Dapper.net Tutorials, Resources, and Links
- Dapper Explained – Great article about dapper and the legal impact, the way it works, and everything in between @ readwriteweb.com
- Dapper Demo – Getting Started @ dapper.net
- Dapper RSS Feeds – Simple tutorial to create “custom” rss feeds @ todaysbesttools.com
- Dapper.net + Yahoo Pipes – Tutorial to aggregate job search @ fillslashstroke.com
- Dapper.net + readwriteweb + delicious – Video tutorial on how to grab an rss feed from delicious for a specific site (readwriteweb in this case) @ readwriteweb.com
- Dapper.net + Netvibes + Yahoo Pipes + AideRSS – An in-depth look at creating a master feed from various sources using various services @ metafluence.com
- openKapow.com – the second site that I found. openKapow uses software to create “robots” that run through their servers to return your data via an API. I found the interface to their software as complex as learning AutoCAD, but I did manage to get a few robots running.
Kapow was even slower than dapper.net, and I personally wouldn’t use it except that it seems to take screen scraping to the next level, allowing you to interact programmatically with the website by specifying which elements are taken and used in various steps.
- openKapow Tutorials, Resources, Links
- openKapow Basics – Learn the terminology and do some basic tutorials @ openkapow.com
- openKapow Demo – Demo it before you do it @ openkapow.com
- openKapow + TechCrunch – TechCrunch demo robot that searches deep into TechCrunch for keywords @ http://service.openkapow.com/Andreas/
- openKapow + Gmail – Another TechCrunch demo robot that logs into your email and exports messages as xml (cool) @ http://service.openkapow.com/Andreas
- teqlo.com - used to do the job, but it seems to be down. I wonder if it’s part of the google mashup engine now? Hmmmm… I did read somewhere that there was some google interest.
- protosw.com – I haven’t used this one personally, so I don’t have a lot of first-hand experience to offer, but it seems to be a “desktop application for building data-oriented workflow dashboards.” They offer either private consulting to get your job done or the application itself.
- Proto Tutorials, Resources, and Links!
- Getting Started – Learn the basics of developing with proto @ protosw.com
- Developer Beta! – Sign up, start developing @ protosw.com
- Proto + Craigslist – A not very informative walkthrough of creating a craigslist scraper @ screencastcentral.com
“Clipping
Clipping allows you to repurpose web content without requiring any changes to the underlying code base. How does this differ from Data Extraction? In the case of Extraction we generally harvest discrete values from a page, absent of their presentation characteristics. This specific information is often the subject of further analysis. Clipping operations return information at a presentation level; it cannot be broken down into its constituent parts.”
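Here is a minimal Python sketch of what "information at a presentation level" can look like in practice: instead of pulling individual values out of the page, it copies one element's markup wholesale so it can be dropped into another page. The `SAMPLE_PAGE` snippet, the `weather` id, and the `Clipper` and `clip` helpers are all hypothetical names invented for this example, and it only uses the standard library. It is a sketch under those assumptions, not the way any of the services below actually do it.

```python
from html.parser import HTMLParser

# Hypothetical page; we want to clip the "weather" box as-is.
SAMPLE_PAGE = """
<html><body>
  <div id="weather"><h2>Today</h2><p class="temp">21 C, <b>sunny</b></p></div>
  <div id="ads">Buy stuff</div>
</body></html>
"""


class Clipper(HTMLParser):
    """Copy the markup of the first element whose id matches `element_id`."""

    # Elements that never have a closing tag.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self, element_id):
        # Keep entities such as &amp; unconverted so the clip stays markup.
        super().__init__(convert_charrefs=False)
        self.element_id = element_id
        self.depth = 0     # nesting depth inside the clipped element
        self.done = False  # True once the clipped element has closed
        self.parts = []

    def _inside(self):
        return self.depth > 0 and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0 and dict(attrs).get("id") != self.element_id:
            return
        self.parts.append(self.get_starttag_text())
        if tag in self.VOID:
            if self.depth == 0:
                self.done = True  # the clipped element is itself a void tag
        else:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> inside the clip are copied verbatim.
        if self._inside():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._inside() and tag not in self.VOID:
            self.parts.append(f"</{tag}>")
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self._inside():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._inside():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._inside():
            self.parts.append(f"&#{name};")


def clip(html, element_id):
    """Return the HTML fragment for `element_id`, or "" if it is not found."""
    clipper = Clipper(element_id)
    clipper.feed(html)
    clipper.close()
    return "".join(clipper.parts)


if __name__ == "__main__":
    print(clip(SAMPLE_PAGE, "weather"))
```

Notice that the result is one blob of HTML. The temperature is in there, but the code never learns that it is a temperature, which is exactly the difference from extraction that the quote describes. (Comments inside the element are dropped and closing tags are lowercased, so this is a close copy rather than a byte-for-byte one.)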
Free Clipping Web Services
Because “clipping” is a FAR easier task than data mining the DOM, there are many services, and there seem to be more desktop/iphone applications than web services that cleanly supply web clipping.
- Google Notebook – The first player I saw on the scene, google did it right. Copy/paste. Dead simple. Save, share. Simply simple. It’s good, so you can expect all of the API & integration to come!
- Google Notebook Tutorials, Resources, Links!
- Frequently Asked Questions + How-tos – As we’ve come to expect from all google documentation, a very extensive and nicely organized document outlining the various features and how-to actions of google notebook @ google.com/notebook
- Using Google Notebook Screencast – A nice little screencast that walks you through using google notebook @ youtube.com
- Google Notebook + Google Personalized Homepage + 37signals’ “Backpack” – A great organization article/tutorial on how to integrate all of these services under one roof. I especially thought it was cool that they used Backpack @ integralawakening.com
- notefish.com - I like notefish for its simplicity and clean look. It has obviously taken its design patterns from google, and it has done well for it. With clean listings and notefish plugins for firefox and internet explorer, it looks like they have a good handle on what’s happening out there.
- clipmarks.com – is pretty straightforward: save a block of DOM (html) into a “notebook” and then view the text/images/basic html again later for quick reference. Using a firefox or internet explorer plugin you can save clips much like you’d save a page to delicious – press a small button, check, save. People can create notebook-style lists such as 68 Web 2.0 tutorials
- Clipmarks Tutorials, Links, Resources!
- Learn the Basics – http://clipmarks.com/learn-more/
- It’s all you really need to know ^^^
March 6th, 2009 at 9:36 pm
Good list. Though I have to say that Google Notebook is no longer accepting any new users. If you already have an account, like me, you can continue to use your Notebook. I think it’s the best clipper app out there, so it’s a shame that Google decided not to support it going forward.
March 13th, 2009 at 3:39 pm
Yeah, it was one of the services that got cut in the downsizing of their labs that happened sometime last year or so… Economic Crisis affects us all. Google does do a great job at being a jack of all trades, but they had to cut the fat…
Google Notebook didn’t take off and I’m sure its code will sprout up somewhere in the near future
*ahem* Friend Connect? *ahem*