_ hpricot.com
| blog | demos | contact | hpricot alternatives


Problems with Hpricot

2014-07-10 Dan Bikle

Hpricot is old, slow, and rarely used.

One favorable feature of Hpricot is that it is relatively easy to install and use.

Also, Hpricot has been ported to JRuby, which allows you to access the Hpricot API from Java.

An Hpricot Use-Case

2014-07-11 Dan Bikle

If you look at this URL you will see a data value called PEG ratio:

Some people believe that if PEG ratio goes down, then stock price should go up:

An easy way to study the correlation between PEG ratio and stock price is to sample both values over 2 years for these stocks:

Hpricot is well suited for collecting the data for this study.

I start the data collection task by assuming it will happen on a Linux host and then using a Linux utility named cron to run a simple Hpricot script once a day:

When I write a cron script (or most types of software) on Linux, I work from the "outside in".

In this case, I write the cron entry first.

Then I write the shell script which the cron entry calls.

Next, I build the Ruby script which the shell script calls.

Then, I craft the actual Hpricot syntax to collect both stock price and PEG value for each stock on each day.

The cron entry is just one line of code; I display a sample cron entry below which would be suitable if the Linux host's clock is set to UTC:

59 17 * * mon,tue,wed,thu,fri /home/dan/peg_scraper/peg.bash > /tmp/peg_bash_output 2>&1

The above cron entry will run /home/dan/peg_scraper/peg.bash at 17:59 UTC, Monday through Friday.
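
To double-check my reading of the five schedule fields, here is a minimal Ruby sketch of the same rule. The method name peg_cron_fires_at? is made up for illustration and is not part of the scraper; it also assumes the host clock is UTC, as stated above.

```ruby
# Hypothetical helper, not part of the original scripts.
# Returns true when the given time falls on the minute at which the
# cron entry "59 17 * * mon,tue,wed,thu,fri" would fire on a UTC host.
def peg_cron_fires_at?(time)
  t = time.getutc
  weekday = (1..5).cover?(t.wday) # Time#wday: 0 is Sunday, 6 is Saturday
  weekday && t.hour == 17 && t.min == 59
end

puts peg_cron_fires_at?(Time.utc(2014, 7, 11, 17, 59)) # a Friday
puts peg_cron_fires_at?(Time.utc(2014, 7, 12, 17, 59)) # a Saturday
```

The first call lands on a weekday at 17:59 UTC, so it matches the entry; the second falls on a Saturday, so it does not.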

I activate the above entry by copying the syntax into a file named /home/dan/peg_scraper/crontab.txt

Then I issue the command: crontab /home/dan/peg_scraper/crontab.txt

I am then confident the crontab utility will call my shell script.

If you want to learn how to write and understand shell scripts, use Google to find tutorials:

Next, I write the shell script: /home/dan/peg_scraper/peg.bash

The script might look something like this:


#!/bin/bash
# /home/dan/peg_scraper/peg.bash

# I use this script to call a Ruby script
# which uses Hpricot to scrape both
# PEG and Price out of Yahoo Key Statistics pages

# Stop if the working directory is missing.
cd /home/dan/peg_scraper/ || exit 1

# Each URL is quoted so the shell does not treat "?" as a glob character.
ruby peg.rb 'http://finance.yahoo.com/q/ks?s=IBM+Key+Statistics'   >> peg_data.csv
ruby peg.rb 'http://finance.yahoo.com/q/ks?s=AAPL+Key+Statistics'  >> peg_data.csv
ruby peg.rb 'http://finance.yahoo.com/q/ks?s=MSFT+Key+Statistics'  >> peg_data.csv
ruby peg.rb 'http://finance.yahoo.com/q/ks?s=C+Key+Statistics'     >> peg_data.csv
ruby peg.rb 'http://finance.yahoo.com/q/ks?s=XOM+Key+Statistics'   >> peg_data.csv
ruby peg.rb 'http://finance.yahoo.com/q/ks?s=WMT+Key+Statistics'   >> peg_data.csv
ruby peg.rb 'http://finance.yahoo.com/q/ks?s=T+Key+Statistics'     >> peg_data.csv
ruby peg.rb 'http://finance.yahoo.com/q/ks?s=JNJ+Key+Statistics'   >> peg_data.csv
ruby peg.rb 'http://finance.yahoo.com/q/ks?s=DIS+Key+Statistics'   >> peg_data.csv
ruby peg.rb 'http://finance.yahoo.com/q/ks?s=KO+Key+Statistics'    >> peg_data.csv
ruby peg.rb 'http://finance.yahoo.com/q/ks?s=PFE+Key+Statistics'   >> peg_data.csv


# In peg_data.csv 
# I should see something like this:
# "IBM | 1.21 | 188.00"
# "AAPL | 1.29 | 95.22"
# "MSFT | 2.33 | 42.09"
# "C | 0.91 | 47.00"
# "XOM | 3.54 | 101.74"
# "WMT | 1.84 | 76.82"
# "T | 2.40 | 35.76"
# "JNJ | 2.55 | 105.10"
# "DIS | 1.28 | 86.89"
# "KO | 3.02 | 41.97"
# "PFE | 4.42 | 30.07"

When I wrote the above script I decided to pack as much logic into the script as possible.

This decision is consistent with my pattern of working from the "outside in".

Ideally, when I get to the inner-most layer, the syntax would be trivial to write because all the functionality has been written into the outermost layers.

Next, I write the Ruby script: /home/dan/peg_scraper/peg.rb

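# /home/dan/peg_scraper/peg.rb

# I use this script to use Hpricot to scrape both PEG and Price out of Yahoo Key Statistics pages

# Use the information at the URL below to write this script:
# https://github.com/hpricot/hpricot/blob/master/README.md

require 'rubygems'
require 'hpricot'
require 'open-uri'

# Without a URL there is nothing to scrape: print usage to stderr and stop,
# so the usage text never ends up in peg_data.csv.
if ARGV[0].nil?
  warn "You need to give me a URL."
  warn "Demo:"
  warn "ruby peg.rb http://finance.yahoo.com/q/ks?s=IBM+Key+Statistics"
  exit 1
end

my_url = ARGV[0]
# On Ruby 3.0 and later, open-uri needs URI.open(my_url) here instead of open(my_url).
doc = open(my_url) { |f| Hpricot(f) }

# Ticker symbol: drop everything through "=", then drop the "+Key+Statistics" suffix.
mysymbol = my_url.sub(/.+=/,'').sub(/.Key.Statistics/,'')
# PEG: find the label cell, step up to its row with "..", then take the data cell.
mypeg    = doc.search("td.yfnc_tablehead1[text()*='PEG Ratio'] .. td.yfnc_tabledata1").inner_html
# Price: the inner span of the real-time quote ticker.
myprice  = doc.search("div.yfi_rt_quote_summary_rt_top span.time_rtq_ticker span").inner_html

p "#{mysymbol} | #{mypeg} | #{myprice}"

# done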
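
The two sub calls that build mysymbol are easy to check without touching the network. Here is a small sketch that wraps the same expression in a method; the name symbol_from_url is made up for illustration and does not appear in peg.rb.

```ruby
# Hypothetical wrapper around the expression peg.rb uses for mysymbol.
# The first sub removes everything up to and including "=";
# the second removes the trailing "+Key+Statistics".
def symbol_from_url(url)
  url.sub(/.+=/, '').sub(/.Key.Statistics/, '')
end

puts symbol_from_url('http://finance.yahoo.com/q/ks?s=IBM+Key+Statistics')
puts symbol_from_url('http://finance.yahoo.com/q/ks?s=C+Key+Statistics')
```

Because ".+=" is greedy, the first sub strips through the last "=" in the URL, which is fine here since these URLs contain only one.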

To craft the Hpricot expression to get the PEG value, I use a Firefox plugin named Firebug to get the CSS path of the td-element holding the PEG value.

Firebug told me the CSS path is this:

table#yfncsumtab tbody tr td.yfnc_modtitlew1
table.yfnc_datamodoutline1 tbody tr td
table tbody tr td.yfnc_tabledata1

I then tested the above CSS path here:

Hpricot could not parse the above path.

I was able, however, to use Hpricot to locate the text node with value "PEG Ratio (5 yr expected)".

Once I had that text node, I moved up to the parent tr-element using the ".." expression.

Then from the parent I searched downward for a td-element with class yfnc_tabledata1:

td.yfnc_tablehead1[text()*='PEG Ratio'] .. td.yfnc_tabledata1
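
The same three steps (find the label cell, go up to its row, come back down to the data cell) can be sketched without Hpricot. The example below swaps in REXML, which ships with Ruby, and runs against a made-up, well-formed table fragment; the fragment, the "Example Label" row, and the method name peg_from_table are assumptions for illustration, not the real Yahoo markup or Hpricot's API.

```ruby
require 'rexml/document'

# Made-up fragment imitating two rows of a Key Statistics table.
SAMPLE_ROWS = <<~HTML
  <table>
    <tr>
      <td class="yfnc_tablehead1">Example Label:</td>
      <td class="yfnc_tabledata1">0.00</td>
    </tr>
    <tr>
      <td class="yfnc_tablehead1">PEG Ratio (5 yr expected):</td>
      <td class="yfnc_tabledata1">1.21</td>
    </tr>
  </table>
HTML

# Hypothetical helper: returns the PEG cell text, or nil if no PEG row exists.
def peg_from_table(markup)
  doc = REXML::Document.new(markup)

  # Step 1: locate the label cell whose text mentions "PEG Ratio".
  label = REXML::XPath.match(doc, '//td').find do |td|
    td.attributes['class'] == 'yfnc_tablehead1' &&
      td.text.to_s.include?('PEG Ratio')
  end
  return nil if label.nil?

  # Step 2: move up to the parent row. Step 3: search its cells for the data cell.
  value = label.parent.get_elements('td').find do |td|
    td.attributes['class'] == 'yfnc_tabledata1'
  end
  value && value.text
end

puts peg_from_table(SAMPLE_ROWS)
```

REXML only accepts well-formed XML, so this is a way to reason about the navigation, not a replacement for an HTML parser on a live page.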

See the selector in action using the form below:

Next, I worked on scraping the price of IBM out of the page. After interacting with Firebug for 3 minutes I found a selector which gives me the IBM price:

div.yfi_rt_quote_summary_rt_top span.time_rtq_ticker span

Then I tested it using the form below:

After I wrote both the shell script and the ruby script I ran them and found the following data in the output file:

"IBM | 1.21 | 188.00"
"AAPL | 1.29 | 95.22"
"MSFT | 2.33 | 42.09"
"C | 0.91 | 47.00"
"XOM | 3.54 | 101.74"
"WMT | 1.84 | 76.82"
"T | 2.40 | 35.76"
"JNJ | 2.55 | 105.10"
"DIS | 1.28 | 86.89"
"KO | 3.02 | 41.97"
"PFE | 4.42 | 30.07"

After two or three years I will have collected enough data to study the question, "Do we have a correlation between PEG and price?"
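
Once the file has grown, one way to start on that question is a Pearson correlation per stock. The sketch below is only an outline under some assumptions: every line has the quoted "SYMBOL | PEG | PRICE" shape shown above, both values are numeric, and the rows for a symbol appear in the order they were collected (the file carries no date column). The method names are made up for illustration.

```ruby
# Parse one line as written by p, for example "IBM | 1.21 | 188.00".
def parse_peg_line(line)
  symbol, peg, price = line.strip.delete('"').split('|').map(&:strip)
  { symbol: symbol, peg: Float(peg), price: Float(price) }
end

# Pearson correlation coefficient of two equally long numeric arrays.
def pearson(xs, ys)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# Correlation between PEG and price across all rows for one symbol.
def peg_price_correlation(lines, symbol)
  rows = lines.map { |l| parse_peg_line(l) }.select { |r| r[:symbol] == symbol }
  pearson(rows.map { |r| r[:peg] }, rows.map { |r| r[:price] })
end

# Usage, once peg_data.csv holds many days of rows:
#   puts peg_price_correlation(File.readlines('peg_data.csv'), 'IBM')
```

A value near -1 would fit the belief that price rises as PEG falls; a value near 0 would suggest no linear relationship in the sample.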
