Mashups, APIs, Website Information Shared...

Archive for the ‘Data Extraction’ Category

Introducing SelectorGadget – Dapper-Style DOM Selection for jQuery, JavaScript, and Beyond

Friday, February 27th, 2009

InspectorGadget + jQuery = ?

I caught a jQuery tweet today that linked me to an interesting little helper bookmarklet called “SelectorGadget”.

“SelectorGadget is an open source bookmarklet that makes CSS selector generation and discovery on complicated sites a breeze.”

SelectorGadget is a very easy-to-use bookmarklet that works on any website of your choosing (although someone in the comments DID have a problem with scraping a site that is NSFW ;) ).

To get started with SelectorGadget, head over to their humble website, install the bookmarklet, and watch the video. Anyone with experience in Data Extraction (Hpricot or Beautiful Soup – as the website suggests) will immediately see the benefits of this little application.

With apparent support from jQuery, and its open-source repository over at GitHub, I think SelectorGadget will be able to spawn a lot of interest within the various JavaScript and DOM Selection/Extraction camps around the internet.
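To make the idea concrete, here is a rough sketch of what you might do with a selector once SelectorGadget has handed it to you. Everything in it is an assumption for illustration: the `li.story` selector, the sample page, and the `SimpleSelector` helper are made up, and the helper only understands the plain `tag.class` form using Python's standard library. A real project would more likely hand the selector to Hpricot, Beautiful Soup, or jQuery.

```python
from html.parser import HTMLParser


class SimpleSelector(HTMLParser):
    """Collect the text of elements matching a 'tag.class' selector.

    Hypothetical helper: supports only 'tag', '.class', or 'tag.class',
    and does not handle a match nested inside another match.
    """

    def __init__(self, selector):
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.open_tag = None   # tag name of the match we are currently inside
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        tag_ok = not self.tag or tag == self.tag
        cls_ok = not self.cls or self.cls in classes
        if self.open_tag is None and tag_ok and cls_ok:
            self.open_tag = tag
            self.matches.append("")

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            self.open_tag = None

    def handle_data(self, data):
        if self.open_tag is not None:
            self.matches[-1] += data


def select(html, selector):
    """Return the stripped text of every element the selector matches."""
    parser = SimpleSelector(selector)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.matches]


# Made-up sample page; imagine SelectorGadget suggested "li.story" for it.
PAGE = """
<ul>
  <li class="story"><a href="/a">First headline</a></li>
  <li class="ad">Buy now</li>
  <li class="story top"><a href="/b">Second headline</a></li>
</ul>
"""

print(select(PAGE, "li.story"))  # ['First headline', 'Second headline']
```

The selector is the only part that is specific to the site, which is exactly the piece the bookmarklet helps you discover.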

Some ideas for you guys to sink your teeth into after the fold.

(more…)

Data Extraction? Clipping? API? Screen Scraping?? – Explained with Tutorials and Links!

Wednesday, February 11th, 2009

There are many different ways to create mashups, and the terminology can be confusing. In a great article, Michael Ogrinz covers the different types, their respective functions within a mashup, and what the terms really mean.

I originally found Michael Ogrinz through a post about clippings on ProgrammableWeb.

But before you leave to go to the holyprogrammableland, I’ll lift a few of the key definitions and provide some examples:

“Data Extraction

If you intend to harvest information from closed Web sites that don’t expose an API, this feature provides some level of parsing against the underlying DOM. Some tools like Kapow let you go after explicit parts of a page. Others, like Connotate or Dapper, use artificial intelligence to help pull out pertinent details. Each approach has its plusses and minuses. Sometimes you need a very fine level of control over extraction operations, but this approach doesn’t scale if you have thousands of sites to mine.” This is at the root of screen scraping. OOOoooo
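As a small illustration of "parsing against the underlying DOM", here is a minimal sketch that pulls discrete values out of an HTML table and throws the presentation away. The sample page and the `TableExtractor` / `extract_records` names are hypothetical, and it assumes a well-formed table whose first row is the header. It is not how Kapow, Connotate, or Dapper work internally.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self.in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            # Keep the text only; tags and styling are ignored entirely.
            self.rows[-1][-1] += data.strip()


def extract_records(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up page standing in for a "closed" site with no API.
PAGE = """
<table class="listing">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td><b>Widget</b></td><td style="color:red">9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

records = extract_records(PAGE)
total = sum(float(record["Price"]) for record in records)
print(records)
```

Because the result is plain values rather than markup, it can feed straight into further analysis, such as the `total` computed above.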

Free Data Extraction Services (well, most of them are free)

  • Dapper.net – the first one that I stumbled upon. I'm not sure it was the first, but it's a great place to start. It offers a wide variety of export formats (Google widget, RSS, XML, and more!) and has a great interface that makes it quite easy for anyone to start extracting the data they want.
    Warning: dapper.net can be quite slow and sometimes goes down outright. If you rely on a dapp to provide the core information of your website, you should be wary.


  • openKapow.com – the second site that I found. openKapow uses software to create “robots” that run through their servers to return your data via an API. I found the interface to their software as complex as learning AutoCAD, but I did manage to get a few robots running.
    Kapow was even slower than dapper.net, and I personally wouldn’t use it except that it seems to take screen scraping to the next level, letting you interact programmatically with the website and specify which elements to take and use in various steps.


    • openKapow Tutorials, Resources, Links
      • openKapow Basics – Learn the terminology and do some basic tutorials @ openkapow.com
      • openKapow Demo – Demo it before you do it @ openkapow.com
      • openKapow + TechCrunch – TechCrunch demo robot that searches deep into TechCrunch for keywords @ http://service.openkapow.com/Andreas/
      • openKapow + Gmail – Another TechCrunch demo robot that logs into your email and exports messages as XML (cool) @ http://service.openkapow.com/Andreas

  • teqlo.com – used to do the job, but it seems to be down. I wonder if it's part of the Google mashup engine now? Hmmmm… I did read somewhere that there was some Google interest.
  • protosw.com – I haven’t used this one personally, so I don’t have a lot of first-hand experience to offer, but it seems like it's a “desktop application for building data-oriented workflow dashboards.” They offer either private consulting to get your job done or the application itself.
    • Proto Tutorials, Resources, and Links!
      • Getting Started – Learn the basics of developing with proto @ protosw.com
      • Developer Beta! – Sign up, start developing @ protosw.com
      • Proto + Craigslist – A not-very-informative walkthrough of creating a Craigslist scraper @ screencastcentral.com

“Clipping

Clipping allows you to repurpose web content without requiring any changes to the underlying code base. How does this differ from Data Extraction? In the case of Extraction we generally harvest discrete values from a page, absent of their presentation characteristics. This specific information is often the subject of further analysis. Clipping operations return information at a presentation level; it cannot be broken down into its constituent parts.”
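To show the difference in practice, here is a rough sketch of clipping: instead of pulling out values, it copies one element's markup out of the page untouched. The sample page, the `forecast` id, and the `Clipper` / `clip` names are all made up for this example, and it assumes reasonably well-formed HTML. It is not how any of the services listed below are implemented.

```python
from html.parser import HTMLParser


class Clipper(HTMLParser):
    """Copy the markup of the first element with a given id, presentation included."""

    def __init__(self, element_id):
        # convert_charrefs=False keeps entities such as &deg; as written.
        super().__init__(convert_charrefs=False)
        self.element_id = element_id
        self.clipping = False
        self.tag = None    # tag name of the clipped element
        self.depth = 0     # open elements with that tag name inside the clip
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.clipping:
            if tag == self.tag:
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif not self.parts and dict(attrs).get("id") == self.element_id:
            self.clipping, self.tag, self.depth = True, tag, 1
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as-is.
        if self.clipping:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.clipping:
            return
        self.parts.append(f"</{tag}>")
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.clipping = False   # the clipped element is closed

    def handle_data(self, data):
        if self.clipping:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.clipping:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.clipping:
            self.parts.append(f"&#{name};")


def clip(html, element_id):
    """Return the markup of the element with the given id as one string."""
    parser = Clipper(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Made-up page; only the "forecast" box is clipped.
PAGE = """
<div id="nav">Home | About</div>
<div id="forecast" class="box">
  <h3>Today</h3>
  <p><img src="sun.png"> Sunny, <b>72&deg;F</b></p>
</div>
<div id="footer">Contact</div>
"""

fragment = clip(PAGE, "forecast")
print(fragment)
```

The result is a single block of HTML ready to drop into another page. Getting the temperature out of it as a number would be an extraction job, which is the distinction the quote draws.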

Free Clipping Web Services

Because “clipping” is a FAR easier task than data mining the DOM, there are many services, though there seem to be more desktop/iPhone applications than web services that cleanly supply web clipping.

  • Google Notebook – The first player I saw on the scene, and Google did it right. Copy/paste. Dead simple. Save, share. Simply simple. It's good, so you can expect all of the API & integration to come!
  • notefish.com – I like notefish for its simplicity and clean look. It's obviously taken its design patterns from Google, and it's done well for it. With clean listings and notefish plugins for Firefox and Internet Explorer, it looks like they have a good handle on what's happening out there.
  • clipmarks.com – is pretty straightforward: save a block of DOM (HTML) into a “notebook” and then view the text/images/basic HTML again later for quick reference. Using a Firefox or Internet Explorer plugin, you can save clips much like you’d bookmark a page on Delicious – press a small button, check, save. People can create notebook-style lists such as 68 Web 2.0 tutorials.

    • Clipmarks Tutorials, Links, Resources!
      • Learn the Basics – http://clipmarks.com/learn-more/
      • It's all you really need to know ^^^