<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>WebServiceable &#187; Data Extraction</title>
	<atom:link href="http://webserviceable.com/topic/data-extraction/feed/" rel="self" type="application/rss+xml" />
	<link>http://webserviceable.com</link>
	<description>Mashups, APIs, Custom APIs. Information Shared.</description>
	<lastBuildDate>Mon, 23 Nov 2009 19:10:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.5</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Introducing SelectorGadget &#8211; Dapper Style DOM Selection for JQuery, Javascript, and beyond</title>
		<link>http://webserviceable.com/2009/02/27/introducing-selectorgadget-dapper-style-dom-selection-for-jquery-javascript-and-beyond/</link>
		<comments>http://webserviceable.com/2009/02/27/introducing-selectorgadget-dapper-style-dom-selection-for-jquery-javascript-and-beyond/#comments</comments>
		<pubDate>Fri, 27 Feb 2009 18:30:57 +0000</pubDate>
		<dc:creator>electBlake</dc:creator>
				<category><![CDATA[Data Extraction]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Bookmarklet]]></category>
		<category><![CDATA[Dapper.net]]></category>
		<category><![CDATA[Ideas]]></category>
		<category><![CDATA[Javascript]]></category>
		<category><![CDATA[JQuery]]></category>
		<category><![CDATA[SelectorGadget]]></category>

		<guid isPermaLink="false">http://webserviceable.com/?p=105</guid>
		<description><![CDATA[InspectorGadget + JQuery = ?
I caught a jquery tweet today that linked me to an interesting little helper bookmarklet called &#8220;SelectorGadget&#8221;.
&#8220;SelectorGadget is an open source bookmarklet that makes CSS selector generation and discovery on complicated sites a breeze.&#8221;

SelectorGadget is a very easy-to-use bookmarklet that can be used on any website of your choosing (Although [...]


]]></description>
			<content:encoded><![CDATA[<p><img src="http://theappslab.com/wp-content/uploads/2009/01/inspector-gadget.jpg" alt="InspectorGadget" width="173" height="200" /><strong><span style="font-size: x-large;">+</span></strong> <img class="alignnone size-full wp-image-106" title="JQuery" src="http://webserviceable.com/wp-content/uploads/2009/02/picture-8.png" alt="JQuery" width="236" height="78" /> <strong><span style="font-size: x-large;">= ?</span></strong><strong></strong></p>
<p>I caught a <a href="http://twitter.com/jquery/status/1258925790" target="_blank" >jquery tweet</a> today that linked me to an interesting little helper bookmarklet called &#8220;SelectorGadget&#8221;.</p>
<blockquote><p>&#8220;SelectorGadget is an open source bookmarklet that makes <a href="http://www.w3.org/TR/CSS2/selector.html" target="_blank" >CSS selector</a> generation and discovery on complicated sites a breeze.&#8221;</p>
</blockquote>
<p><a href="http://www.selectorgadget.com" target="_blank" title="SelectorGadget" ><strong>SelectorGadget</strong></a> is a very easy-to-use bookmarklet that can be used on any website of your choosing (although someone in the comments DID have a problem with scraping a site that is NSFW <img src='http://webserviceable.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  )</p>
<p>To get started with <strong>SelectorGadget</strong>, <a href="http://www.selectorgadget.com" target="_blank" >head over to their humble website</a>, install the bookmarklet, and watch the video. Anyone with experience in Data Extraction (Hpricot or Beautiful Soup &#8211; as the website suggests) will immediately see the benefits of this little application.</p>
<p>With apparent support from JQuery, and its <a href="http://github.com/iterationlabs/selectorgadget/tree/master" target="_blank" title="SelectorGadget @ github.com" >open-source repository over at github</a>, I think SelectorGadget will be able to spawn a lot of interest within the various Javascript and DOM Selection/Extraction camps around the internet.</p>
<p>Some ideas for you guys to sink your teeth into after the fold.</p>
<p><span id="more-105"></span></p>
<h3>My Foreseeable Uses for SelectorGadget and its Algorithms:</h3>
<ol>
<li>Speed up creation of <a href="http://en.wikipedia.org/wiki/Greasemonkey" target="_blank" title="GreaseMonkey" >GreaseMonkey</a> userscripts(<a href="http://userscripts.org" target="_blank" title="GreaseMonkey Userscripts.org" >.org</a>) for those of us who rely on JQuery for our javascript prowess.</li>
<li>Make DOM Selection in your own Javascript Applications (jquery included), MUCH MUCH easier</li>
<li>Dynamic JQuery plugins for data extraction &#8211; Make more versatile plugins to manage dynamic datasets.
<ul>
<li>For Example:
<ul>
<li>Create a plugin that adds DOM elements to specific element types on a page (e.g. an mp3 player to mp3 links)</li>
<li>Generalize the selection of the mp3 links</li>
<li>Drop the mp3 player plugin into any page w/out any need to initialize</li>
</ul>
</li>
</ul>
</li>
<li>Creation of your own <a href="http://dapper.net" target="_blank" title="Dapper.net" >Dapper Engine</a>????????
<ul>
<li>For instance, if you could (hypothetically) get jaxer running this javascript selection engine, then maybe&#8230; you could be running a server-side DOM Selector engine with native ajax-scraping abilities&#8230; Hrrmm&#8230;</li>
</ul>
</li>
<li>Quick Ruby Data Extraction &amp; Screen Scraping
<ul>
<li><a href="http://wiki.github.com/why/hpricot" target="_blank" title="Ruby's Hpricot gem" >Hpricot</a>, or better yet <a href="http://github.com/scrubber/scrubyt" target="_blank" >scrubyt</a>, OR even better still <a href="http://github.com/scrubber/scrubyt/tree/skimr" target="_blank" >skimr</a> (the re-factored scrubyt for those of you who have been paying attention)</li>
</ul>
</li>
<li>SelectorGadget to <a href="http://code.google.com/p/phpquery/" target="_blank" title="phpQuery - a php port of JQuery" >phpQuery</a>? &#8211; phpQuery is a dom selector engine for php designed to be a port of JQuery to php. It&#8217;s pretty good but a little slow for my liking.</li>
</ol>
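<p>To make the &#8220;generalize the selection of the mp3 links&#8221; idea a bit more concrete: in JQuery terms it boils down to a selector along the lines of a[href$=".mp3"]. Below is a rough sketch of the same matching logic in Python, using only the standard library. The names Mp3LinkFinder, find_mp3_links and SAMPLE_PAGE, as well as the sample markup, are made up for illustration; this is not how SelectorGadget itself works.</p>

```python
from html.parser import HTMLParser


class Mp3LinkFinder(HTMLParser):
    """Collect the href of every <a> tag whose target ends in .mp3."""

    def __init__(self):
        super().__init__()
        self.mp3_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore any query string or fragment before checking the extension.
        path = href.split("?", 1)[0].split("#", 1)[0]
        if path.lower().endswith(".mp3"):
            self.mp3_links.append(href)


def find_mp3_links(html):
    """Return the hrefs of all mp3 links found in the given HTML string."""
    finder = Mp3LinkFinder()
    finder.feed(html)
    return finder.mp3_links


# Hypothetical sample markup, not taken from any real page.
SAMPLE_PAGE = """
<p><a href="/audio/intro.mp3">Intro</a></p>
<p><a href="/about.html">About</a></p>
<p><a href="http://example.com/show/Episode2.MP3?download=1">Episode 2</a></p>
"""

print(find_mp3_links(SAMPLE_PAGE))
```

<p>A plugin built on this idea would then attach a player to each matched link, which is the part that needs no per-page setup once the selection is generic.</p>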
<h3>The Death of PHP in Data Extraction</h3>
<p>Yes, I am calling it. This is not breaking news as Ruby&#8217;s scrubyt (and its driving forces WWW::Mechanize and Hpricot) have been dominating the &#8220;custom api creation&#8221; process for a while now. Still, php is so common and easy to get up and running that it&#8217;s hard for me to drop it altogether.</p>
<p>Software like SelectorGadget gives me even more reason to move away from php and work on my Ruby development more. As an interim solution (in the interest of time) I imagine myself using Ruby for my data and php for my presentation.</p>
<p>OR, if I can finally get a solid native jaxer server running I could simply use my javascript skills to properly deploy my data extraction javascript applications. (If you&#8217;re a jaxer master, please message me; I&#8217;ve tried numerous times with limited success.)</p>
<p>Until I am a ruby master (which might take time as I&#8217;m learning far too many languages atm), I am going to see what I can hack out of the SelectorGadget engine. Its algorithms might unlock a very cool selector engine for php (aka an improved or refactored phpQuery <img src='http://webserviceable.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> )</p>
<p><br class="spacer_" /></p>
<p>Anyway, I think SelectorGadget has A LOT of promise and I see this accomplishment echoing through a lot of different technologies and software. I know I am going to dig into it and see what I can do. I&#8217;ll be sure to report all of my findings to you good people.</p>
<p>Happy api&#8217;ing.</p>


]]></content:encoded>
			<wfw:commentRss>http://webserviceable.com/2009/02/27/introducing-selectorgadget-dapper-style-dom-selection-for-jquery-javascript-and-beyond/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Data Extraction? Clipping? API? Screen Scraping??- Explained with Tutorials and Links!</title>
		<link>http://webserviceable.com/2009/02/11/data-extraction-clipping-api-screen-scraping-explained-with-tutorials-and-links/</link>
		<comments>http://webserviceable.com/2009/02/11/data-extraction-clipping-api-screen-scraping-explained-with-tutorials-and-links/#comments</comments>
		<pubDate>Wed, 11 Feb 2009 20:17:13 +0000</pubDate>
		<dc:creator>electBlake</dc:creator>
				<category><![CDATA[Data Extraction]]></category>
		<category><![CDATA[Dapper.net]]></category>
		<category><![CDATA[openkapow]]></category>

		<guid isPermaLink="false">http://webserviceable.com/?p=54</guid>
		<description><![CDATA[There are many different ways to create mashups, and the terminology can be confusing. In a great article about the different types and their respective functions within a mashup, Michael Ogrinz has posted about these terms and what they really mean.
I found Michael Ogrinz originally through a post about clippings on programmableweb
But before you leave [...]


]]></description>
			<content:encoded><![CDATA[<p>There are many different ways to create mashups, and the terminology can be confusing. In a great article about the different types and their respective functions within a mashup, <a href="http://www.mashuppatterns.com/xn/detail/u_0v6frpjqz1rq9" target="_blank" >Michael Ogrinz</a> has <a href="http://www.mashuppatterns.com/profiles/blogs/a-handful-of-core-activities" target="_blank" >posted about</a> these terms and what they really mean.</p>
<p>I found Michael Ogrinz originally through a <a href="http://blog.programmableweb.com/2009/02/11/mashup-patterns-clipping-mashups/" target="_blank" >post about clippings on programmableweb</a>.</p>
<p>But before you leave to go to the holyprogrammableland, I&#8217;ll <a href="http://www.mashuppatterns.com/profiles/blogs/a-handful-of-core-activities" target="_blank" >lift</a> a few of the key definitions and provide some examples:</p>
<blockquote>
<h2><strong>&#8220;Data Extraction</strong></h2>
<p>If you intend to harvest information from closed Web sites that don’t expose an API, this feature provides some level of parsing against the underlying DOM. Some tools like Kapow let you go after explicit parts of a page. Others, like Connotate or <a href="http://www.dapper.net/" target="_blank" >Dapper</a>, use artificial intelligence to help pull out pertinent details. Each approach has its plusses and minuses. Sometimes you need a very fine level of control over extraction operations, but this approach doesn’t scale if you have thousands of sites to mine.&#8221; <em>this is at the root of screen scraping OOOoooo</em></p>
</blockquote>
<h3><strong>Free Data Extraction Services (well, most of them are free)</strong></h3>
<ul style="text-align: left;">
<li><strong><span style="font-size: medium;"><a href="http://Dapper.net" target="_blank" >Dapper.net</a></span> </strong>- the first one that I stumbled upon; I&#8217;m not sure it was the first of its kind, but it&#8217;s a great place to start. It offers a wide variety of export formats (google widget, rss, xml, and more!) and has a great interface that makes it quite easy for anyone to start extracting the data they want.<em><br />
 Warning: dapper.net can be quite slow and sometimes goes down outright. If you rely on a dapp to provide the core information of your website, you should be wary.</em></p>
<p><br class="spacer_" /></p>
<ul>
<li><strong><em>Dapper.net Tutorials, Resources, and Links<br />
 </em></strong> </p>
<ul>
<li> <a href="http://www.readwriteweb.com/archives/dapper_quest_to_unlock_web_data.php" target="_blank" >Dapper Explained</a> &#8211; Great article about dapper and the legal impact, the way it works, and everything in between @ readwriteweb.com</li>
</ul>
<ul>
<li><em><a href="http://www.dapper.net/dapperDemo/" target="_blank" >Dapper Demo</a> &#8211; Getting Started @ dapper.net</em></li>
<li><em><a href="http://todaysbesttools.com/create-rss-feeds-with-dapper/82/" target="_blank" >Dapper RSS Feeds</a> &#8211; Simple tutorial to create &#8220;custom&#8221; rss feeds @ todaysbesttools.com<br />
 </em></li>
<li><em><a href="http://www.fillslashstroke.com/slash/2008/07/dapper-pipes/" target="_blank" >Dapper.net + Yahoo Pipes</a> &#8211; Tutorial to aggregate job search @ fillslashstroke.com</em></li>
<li><em><a href="http://www.readwriteweb.com/archives/screen-scraping.php" target="_blank" >Dapper.net + readwriteweb  + delicious </a> &#8211; Video tutorial on how to grab an rss feed from delicious for a specific site (readwriteweb in this case) @ readwriteweb.com</em></li>
<li><em><a href="http://www.metafluence.com/integrating-netvibes-pipes-aiderss-dapper-for-an-intelligence-dashboard/" target="_blank" >Dapper.net + Netvibes + Yahoo Pipes + AideRSS</a> &#8211; An in-depth look at creating a master feed from various sources using various services @ metafluence.com </em>
</li>
</ul>
</li>
</ul>
</li>
<li><span style="font-size: medium;"><a href="http://openKapow.com" target="_blank" ><strong>openKapow.com</strong></a></span> &#8211; the second site that I found. openKapow uses software to create &#8220;robots&#8221; that run through their servers to return your data via an API. I found the interface to their software as complex as learning AutoCAD, but I did manage to get a few robots running.<br />
 Kapow was even slower than dapper.net, and I personally wouldn&#8217;t use it except that it seems to take screen-scraping to the next level, allowing you to interact programmatically with the website and specify particular elements to be taken and used in various steps.</p>
<p><br class="spacer_" /></p>
<ul>
<li><em><strong>openKapow Tutorials, Resources, Links</strong></em>
<ul>
<li><em><a href="http://openkapow.com/blogs/support/archive/2006/12/04/FAQ.aspx" target="_blank" >openKapow Basics</a> &#8211; Learn the terminology and do some basic tutorials @ openkapow.com</em></li>
<li><em><a href="http://demo.openkapow.com/" target="_blank" >openKapow Demo</a> &#8211; Demo it before you do it @ openkapow.com</em></li>
<li><em><a href="http://service.openkapow.com/Andreas/techcrunchpostsearch.rest" target="_blank" >openKapow + TechCrunch</a> &#8211; TechCrunch demo robot that searches deep into TechCrunch for keywords @ http://service.openkapow.com/Andreas/</em></li>
<li><em><a href="http://service.openkapow.com/Andreas/gmailreader.rest" target="_blank" >openKapow + Gmail</a> &#8211; Another TechCrunch demo robot that logs into your email and exports messages as xml (cool) @ http://service.openkapow.com/Andreas </em>
</li>
</ul>
</li>
</ul>
</li>
<li><span style="font-size: medium;"><a href="http://teqlo.com " target="_blank" ><strong>teqlo.com</strong> </a></span>- used to do the job but it seems to be down. I wonder if it&#8217;s part of the googlemashup engine now? hmmmm&#8230;. I did read <a href="http://www.readwriteweb.com/archives/eric_schmidt_defines_web_30.php" target="_blank" >somewhere</a> that there was some google interest </li>
<li><span style="font-size: medium;"><a href="http://protosw.com" target="_blank" ><strong>protosw.com</strong></a></span> &#8211; I haven&#8217;t used this one personally, so I don&#8217;t have a lot of first-hand experience to offer, but it seems to be a &#8220;desktop application for building data-oriented workflow dashboards.&#8221; They offer private consulting to get your job done, or you can use their application yourself.
<ul>
<li><em><strong>Proto Tutorials, Resources, and Links! </strong></em>
<ul>
<li><em><a href="http://www.protosw.com/devcenter" target="_blank" >Getting Started</a> &#8211; Learn the basics of developing with proto @ protosw.com</em></li>
<li><em><a href="http://www.protosw.com/devcenter/beta-program" target="_blank" >Developer Beta!</a> &#8211; Sign up, start developing @ protosw.com</em></li>
<li><em><a href="http://www.screencastcentral.com/public/yt2329.cfm" target="_blank" >Proto + Craigslist</a> &#8211; A not very informative walkthrough of creating a craigslist scraper @ screencastcentral.com</em></li>
</ul>
</li>
</ul>
</li>
</ul>
<blockquote>
<h2><strong>&#8220;Clipping</strong></h2>
<p>Clipping allows you to repurpose web content without requiring any changes to the underlying code base. How does this differ from Data Extraction? In the case of Extraction we generally harvest discrete values from a page, absent of their presentation characteristics. This specific information is often the subject of further analysis. Clipping operations return information at a presentation level; it cannot be broken down into its constituent parts.&#8221;</p>
</blockquote>
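<p>The difference between the two definitions above is easier to see side by side. Here is a minimal sketch in Python (standard library only) that runs both operations against the same page: extraction returns discrete values, clipping returns one block of markup. The sample markup and the names ItemExtractor, extract_items and clip_listing are invented for this example, and the clipping regex is deliberately naive; none of it reflects how Dapper, Kapow or the clipping services actually work.</p>

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup, invented for illustration.
PAGE = """
<h1>Shop</h1>
<ul id="listing">
  <li><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


class ItemExtractor(HTMLParser):
    """Data extraction: pull discrete name/price values out of the markup."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._field = None  # class of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items.append({})
        elif tag == "span" and self.items:
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        if self._field in ("name", "price"):
            self.items[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_items(html):
    """Return one dict of values per list item, stripped of presentation."""
    parser = ItemExtractor()
    parser.feed(html)
    return parser.items


def clip_listing(html):
    """Clipping: return the listing's markup as-is, presentation included.

    Naive on purpose: the regex assumes the list contains no nested <ul>.
    """
    match = re.search(r'<ul id="listing">.*?</ul>', html, re.DOTALL)
    return match.group(0) if match else None


print(extract_items(PAGE))  # discrete values, ready for further analysis
print(clip_listing(PAGE))   # one opaque block of HTML
```

<p>The extracted values can be sorted, filtered or fed into another service, while the clipped block can only be dropped somewhere else as-is, which is the trade-off the definition describes.</p>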
<h3>Free Clipping Web Services</h3>
<p>Because &#8220;clipping&#8221; is a FAR easier task than mining data from the DOM, there are many services, though there seem to be more desktop/iphone applications than webservices that cleanly supply a web clipping service.</p>
<ul>
<li><span style="font-size: medium;"><a href="http://www.google.com/notebook" target="_blank" ><strong>Google Notebook</strong></a></span> &#8211; The first player I saw on the scene, google did it right. Copy/paste. dead simple. Save, share. simply simple. It&#8217;s good, so you can expect all of the API &amp; integration to come!
<ul>
<li><strong>Google Notebook Tutorials, Resources, Links!</strong>
<ul>
<li><a href="http://www.google.com/googlenotebook/faq.html" target="_blank" >Frequently Asked Questions + How-tos</a> &#8211; As we&#8217;ve come to expect from all google documentation, a very extensive and nicely organized document outlining the various features and how-to actions of google notebook @ google.com/notebook</li>
<li><a href="http://www.youtube.com/watch?v=I2fuwVhwEG0" target="_blank" >Using Google Notebook Screencast</a> &#8211; A nice little screencast that walks you through using google notebook @ youtube.com</li>
<li><a href="http://www.integralawakening.com/ia/2006/05/a_complete_onli.html" target="_blank" >Google Notebook + Google Personalized Homepage + 37signals&#8217; &#8220;Backpack&#8221;</a> &#8211; A great organization article/tutorial on how to integrate all of these services under one roof. I especially thought it was cool that they used backpack <img src='http://webserviceable.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  @ integralawakening.com </li>
</ul>
</li>
</ul>
</li>
<li><strong><span style="font-size: medium;"><a href="http://www.notefish.com" target="_blank" >notefish.com</a></span> </strong>- I like notefish for its simplicity and clean look. It&#8217;s obviously taken its design patterns from google, and it&#8217;s done well for it. With <a href="http://www.notefish.com/notes.php?p=36619" target="_blank" >clean listings</a> and <a href="http://www.notefish.com/install-tools.php" target="_blank" >notefish plugins for firefox and internet explorer</a>, it looks like they have a good handle on what&#8217;s happening out there.</li>
<li>
<p><span style="font-size: medium;"><a href="http://clipmarks.com" target="_blank" ><strong>clipmarks.com</strong></a></span> &#8211; is pretty straightforward: save a block of DOM (html) into a &#8220;notebook&#8221; and then view the text/images/basic html again later for quick reference. Using a <a href="https://addons.mozilla.org/en-US/firefox/addon/1407/" target="_blank" >firefox</a> or <a href="http://clipmarks.com/install" target="_blank" >internet explorer</a> plugin you can save clips much like you&#8217;d take a page for delicious &#8211; press a small button, check, save. People can create notebook style lists such as <a href="http://www.clipmarks.com/clipmark/04266AF1-B96A-4780-BC1E-5F9DD39DA4E9/" target="_blank" >68 Web 2.0 tutorials</a>.</p>
<ul>
<li><em><strong>Clipmarks Tutorials, Links, Resources!</strong></em>
<ul>
<li><em><a href="http://clipmarks.com/learn-more/" target="_blank" >Learn the Basics</a> &#8211; http://clipmarks.com/learn-more/</em></li>
<li><em>It&#8217;s all you really need to know ^^^ </em>
</li>
</ul>
</li>
</ul>
</li>
</ul>


]]></content:encoded>
			<wfw:commentRss>http://webserviceable.com/2009/02/11/data-extraction-clipping-api-screen-scraping-explained-with-tutorials-and-links/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
