i5m edited this page Sep 13, 2010 · 1 revision

(Part of An Hpricot Showcase.)

Loading Hpricot Itself

You have probably got the gem, right? To load Hpricot:

 require 'rubygems'
 require 'hpricot'

If you’ve installed the plain source distribution, go ahead and just:

 require 'hpricot'

Load an HTML Page

The Hpricot() method takes a string or any IO object and loads the contents into a document object.

 doc = Hpricot("<p>A simple <b>test</b> string.</p>")

To load from a file, just get the stream open:

 doc = open("index.html") { |f| Hpricot(f) }

To load from a web URL, use open-uri, which comes with Ruby:

 require 'open-uri'
 doc = Hpricot(open("http://qwantz.com/"))

Search for Elements

Use the search method on the document:

 doc.search("//p[@class='posted']")
 #=> #<Hpricot::Elements[{p ...}, {p ...}]>

The search method accepts either an XPath or a CSS expression. The example above grabs every paragraph <p> element that has a class attribute of "posted".

A shortcut is to use the division operator (/):

 (doc/"p.posted")
 #=> #<Hpricot::Elements[{p ...}, {p ...}]>

Finding Just One Element

If you’re looking for a single element, the at method will return the first element matched by the expression. In this case, you’ll get back the element itself rather than the Hpricot::Elements array.

 doc.at("body")['onload']

The above code will find the body tag and give you back the onload attribute. This is the most common reason to use the element directly: when reading and writing HTML attributes.

Fetching the Contents of an Element

Just as with browser scripting, the inner_html property can be used to get the inner contents of an element.

 (doc/"#elementID").inner_html
 #=> "..<b>contents</b>.."

If your expression matches more than one element, you’ll get back the contents of all the matched elements. So you may want to use first to be sure you get back only one.

 (doc/"#elementID").first.inner_html
 #=> "..<b>contents</b>.."

Fetching the HTML for an Element

If you want the HTML for the whole element (not just the contents), use to_html:

 (doc/"#elementID").to_html
 #=> "<div id='elementID'>...</div>"

Looping

Searches made with search or / return an Hpricot::Elements collection. Go ahead and loop through it like you would an array.

 (doc/"p/a/img").each do |img|
   puts img.attributes['class']
 end

Continuing Searches

Searches can be continued from a collection of elements, in order to search deeper.

 # find all paragraphs.
 elements = doc.search("/html/body//p")
 # continue the search by finding any images within those paragraphs.
 (elements/"img")
 #=> #<Hpricot::Elements[{img ...}, {img ...}]>

Searches can also be continued by searching within container elements.

 require 'pp' # pp must be loaded explicitly on older Rubies
 # find all images within paragraphs.
 doc.search("/html/body//p").each do |para|
   puts "== Found a paragraph =="
   pp para

   imgs = para.search("img")
   if imgs.any?
     puts "== Found #{imgs.length} images inside =="
   end
 end

Of course, the most succinct way to do the above is with a single CSS or XPath expression.

 # the xpath version
 (doc/"/html/body//p//img")
 # the css version
 (doc/"html > body p img")
 # ..or symbols work, too!
 (doc/:html/:body/:p/:img)

Looping Edits

You may certainly edit objects from within your search loops. Then, when you spit out the HTML, the altered elements will show.

 (doc/"span.entryPermalink").each do |span|
   span.set_attribute :class, 'newLinks'
 end
 puts doc

This changes all span.entryPermalink elements to span.newLinks. Keep in mind that there are often more convenient ways of doing this, such as the set method:

 (doc/"span.entryPermalink").set(:class => 'newLinks')

Figuring Out Paths

Every element can tell you a unique path (as either XPath or CSS) that leads to it from the root tag.

The css_path method:

 doc.at("div > div:nth(1)").css_path
   #=> "div > div:nth(1)"
 doc.at("#header").css_path
   #=> "#header"

Or, the xpath method:

 doc.at("div > div:nth(1)").xpath
   #=> "/div/div:eq(1)"
 doc.at("#header").xpath
   #=> "//div[@id='header']"

Return to An Hpricot Showcase.