Start Curating with ScraperWiki

Developers and testers endlessly debate whether testers should learn to code.  I think the current focus on test automation is misguided.  However, programming is enjoyable, and it definitely benefits testers.  The benefit is not in writing automation, but in understanding programming and computer science, and in creating tools to use while testing.

Scripting is a good way to get started with programming, and I find Ruby an excellent language to learn with.  I have created this example, which does not require any installation.  You can run the entire example from your browser using ScraperWiki.

For Ruby scripts, ScraperWiki uses the Nokogiri gem (a gem is the Ruby equivalent of a library) to parse HTML.  For this example you don’t need to install Nokogiri, although you can refer to the Nokogiri reference if you want to.  If you run Ruby on your own computer, you would need to install the Nokogiri gem first.

In this example I want to extract a list of countries and their areas from Wikipedia.  Here is the Wikipedia page with that information:

List of countries by area from Wikipedia.com

That page contains a list of countries in a table:

List of countries by area from Wikipedia.com

I then use the Firefox extension Firebug to analyze the HTML structure around the country name and the total area.  The name is in the second cell of the row (<tr> is the HTML markup for a row and <td> is the markup for a cell).  Ruby collections are zero-indexed, so the second cell is at index 1.  Once I get the cell using the method xpath(), I use the method at_xpath() to get the anchor element inside it.

I retrieve it with Nokogiri using the following statement:

country_anchor = row.xpath('.//td')[1].at_xpath('.//a')

xpath('.//td')[1] – retrieves the second cell

.at_xpath('.//a') – retrieves the first HTML anchor inside that cell

 

Find country name using Firebug

I then analyze the total area using Firebug.  I get the area using the following statements:

total_cell = row.xpath('.//td')[2]
total_km = total_cell.children[1].text.strip

The first statement works like the previous one, except that index 2 selects the third cell.  In the second statement I get the cell’s second child node using children[1], then take its text and strip the surrounding whitespace.

Finding country area using Firebug

Here is how the code looks in ScraperWiki:

 

Steps to follow in ScraperWiki

Select Create a new dataset

New data set

Select Code in your browser

Code in your browser

Choose Ruby

Paste the code given above

Run

View in a table

View in a table

Download as spreadsheet

Next steps

You can play around with other pages on Wikipedia or on other websites.

As you try more projects, refer to the Nokogiri documentation.

I first got the idea of scraping from Brian Marick’s book, ‘Everyday Scripting with Ruby’.  It is an extremely well-written book for learning to program with Ruby.  Some of the examples in the book may be out of date.  However, it’s still a great book for beginners.

I’ll write another post on where you can go next with Ruby.