_ hpricot.com | blog | demos | contact | hpricot alternatives
Problems with Hpricot
2014-07-10 Dan Bikle
Hpricot is old, slow, and rarely used.
One point in its favor is that it is relatively easy to install and use.
Hpricot has also been ported to JRuby, which lets you call the Hpricot API from Java.
An Hpricot Use-Case
2014-07-11 Dan Bikle
If you look at this URL you will see a data value called PEG ratio:
Some people believe that if PEG ratio goes down, then stock price should go up:
An easy way to study the correlation between PEG ratio and stock price is to sample both values over 2 years for these stocks:
Hpricot is well suited for collecting the data for this study.
I start the data collection task by assuming it will run on a Linux host, where a utility named cron runs a simple Hpricot script once a day:
When I write a cron script (or most other types of software) on Linux, I work from the "outside in".
In this case, I write the cron entry first.
Then I write the shell script which the cron entry calls.
Next, I build the Ruby script which the shell script calls.
Then, I craft the actual Hpricot syntax to collect both stock price and PEG value for each stock on each day.
The cron entry is just one line; the sample entry below is suitable if the Linux host's clock is set to UTC:
The above cron entry will run /home/dan/peg_scraper/peg.bash at 17:59 UTC, Monday through Friday.
I activate the above entry by copying the syntax into a file named /home/dan/peg_scraper/crontab.txt
Then I issue the command: crontab /home/dan/peg_scraper/crontab.txt
I am then confident the crontab utility will call my shell script.
If you want to learn how to write and understand shell scripts, use Google to find tutorials:
Next, I write the shell script: /home/dan/peg_scraper/peg.bash
The script might look something like this:
When I wrote the above script I decided to pack as much logic into the script as possible.
This decision is consistent with my pattern of working from the "outside in".
Ideally, when I get to the inner-most layer, the syntax would be trivial to write because all the functionality has been written into the outermost layers.
Next, I write the Ruby script: /home/dan/peg_scraper/peg.rb
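As a rough sketch of how the innermost layer could stay trivial, the Ruby script only needs to turn one scraped sample into one line of output and append it to a file. Everything named here is an assumption made for illustration: the helper names `sample_line` and `append_sample`, the comma-separated layout, and the sample values are hypothetical and are not taken from the actual peg.rb.

```ruby
require "date"

# Hypothetical helper: format one day's sample for one stock as a
# comma-separated line. The column order (date, ticker, price, PEG)
# is an assumption, not the layout of the original output file.
def sample_line(date, ticker, price, peg)
  [date.iso8601, ticker, price, peg].join(",")
end

# Hypothetical helper: append one sample to the output file so that
# each daily cron run adds rows instead of overwriting earlier ones.
def append_sample(path, date, ticker, price, peg)
  File.open(path, "a") { |f| f.puts(sample_line(date, ticker, price, peg)) }
end
```

With helpers like these, the scraping code only has to produce a price and a PEG value for each ticker; the calling layers decide when the script runs and where the file lives.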
To craft the Hpricot expression to get the PEG value, I use a Firefox plugin named Firebug to get the CSS path of the td-element holding the PEG value.
Firebug told me the CSS path is this:
I then tested the above CSS path here:
Hpricot could not parse the above path.
I was able, however, to use Hpricot to locate the text node with value "PEG Ratio (5 yr expected)".
Once I had that text node, I moved up to the parent tr-element using the ".." expression.
Then from the parent I searched downward for a td-element with class yfnc_tabledata1:
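The same three moves (find the label, step up to its row, search down for the data cell) can be illustrated without Hpricot. The sketch below is not the Hpricot selector used on this page: it swaps in REXML, which ships with Ruby, so it runs without extra gems. The HTML fragment, the value 1.25, and the method name `peg_ratio` are hypothetical, and REXML needs well-formed markup, which a real page may not provide.

```ruby
require "rexml/document"

PEG_LABEL = "PEG Ratio (5 yr expected)"

# Hypothetical, well-formed stand-in for the statistics table.
# The structure and both numbers are made up for illustration.
SAMPLE_HTML = <<~HTML
  <table>
    <tr>
      <td>Trailing P/E</td>
      <td class="yfnc_tabledata1">12.50</td>
    </tr>
    <tr>
      <td>PEG Ratio (5 yr expected)</td>
      <td class="yfnc_tabledata1">1.25</td>
    </tr>
  </table>
HTML

def peg_ratio(html)
  doc = REXML::Document.new(html)

  # Step 1: locate the cell whose text is the PEG label.
  label = REXML::XPath.match(doc, "//td").find do |td|
    td.text.to_s.strip == PEG_LABEL
  end
  return nil unless label

  # Step 2: move up to the enclosing row (what ".." does in a path).
  row = label.parent

  # Step 3: search downward from the row for the data cell.
  cell = row.elements["td[@class='yfnc_tabledata1']"]
  cell && cell.text.strip
end
```

Anchoring on the visible label rather than on a long positional CSS path tends to survive small layout changes, because the search depends only on the label text and the class of the data cell.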
See the selector in action using the form below:
Next, I worked on scraping the price of IBM out of the page. After interacting with Firebug for 3 minutes, I found a selector which gives me the IBM price:
Then I tested it using the form below:
After I wrote both the shell script and the Ruby script, I ran them and found the following data in the output file:
After two or three years I will have collected enough data to study the question, "Do we have a correlation between PEG and price?"
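Once the samples exist, one common way to put a number on that question is the Pearson correlation coefficient between the two series for a given stock. The sketch below is only one possible approach and assumes the PEG and price values have already been loaded into two equal-length arrays; the method name `pearson` is hypothetical and no real data is shown.

```ruby
# Pearson correlation coefficient between two equal-length numeric series.
# Returns a Float between -1.0 and 1.0, or nil if either series is constant
# (the coefficient is undefined in that case).
def pearson(xs, ys)
  unless xs.size == ys.size && xs.size >= 2
    raise ArgumentError, "need two equal-length series with at least 2 points"
  end

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n

  # Sum of products of deviations, and sums of squared deviations.
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  return nil if var_x.zero? || var_y.zero?

  cov / Math.sqrt(var_x * var_y)
end
```

A value near -1 would be consistent with the belief that price rises as PEG falls, a value near 0 would suggest no linear relationship, and either result describes correlation only, not causation.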