php | 2012. 7. 24. 23:50

This tutorial continues the previous Yahoo screen-scraping tutorial, which used PHP4.
Here we try a different method based on DOM and XPath, which are supported only in PHP5.

First, a little knowledge of XPath is required. More about XPath can be read at:

http://www.zvon.org/xxl/XPathTutorial/General/examples.html

There is also a small concern that XPath is a bit slower than pure DOM traversal; see "Speed: DOM traversal vs. XPath in PHP 5".
But I personally think XPath is neater and easier.
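
To make the comparison concrete, here is a minimal sketch that reads the same node both ways. The inline markup in $html is a hypothetical stand-in for a fetched page, not the real Yahoo page, and the sketch does not measure speed:

```php
<?php
// Hypothetical inline markup standing in for a fetched page.
$html = '<html><body><table><tr><td><font><b>Top Movies</b></font></td></tr></table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Pure DOM traversal: walk down one element at a time.
$table  = $dom->getElementsByTagName('table')->item(0);
$row    = $table->getElementsByTagName('tr')->item(0);
$cell   = $row->getElementsByTagName('td')->item(0);
$viaDom = $cell->getElementsByTagName('b')->item(0)->nodeValue;

// XPath: a single expression describes the whole path.
$xpath    = new DOMXPath($dom);
$viaXPath = $xpath->query('/html/body/table/tr/td/font/b')->item(0)->nodeValue;

echo $viaDom, "\n";
echo $viaXPath, "\n";
```

Both variables should end up holding the same text; the XPath version simply says it in one line.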

Let's start. First we inspect the document structure using Firebug for Mozilla Firefox.
Try a very easy case first, which is to grab the title "Top Movies":

Copy the XPath using Firebug to get this query:

/html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr/td/font/b

  1. Firefox automatically fixes the broken HTML structure and also adds a tbody tag. That tag is not in the page source, so we need to remove it from the query.
  2. Grab only the first row of the table. Change .../tr/td/font/b into .../tr[1]/td/font/b

Now we get our first XPath query:

/html/body/center/table[8]/tr/td[5]/table[4]/tr[1]/td/font/b
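
The tbody adjustment can be checked on a small example. This is only a sketch: the table in $html is made up for illustration, and it relies on DOMDocument::loadHTML (libxml2) not inserting a tbody element the way the browser does:

```php
<?php
// Hypothetical two-row table with no tbody in the source, as on the scraped page.
$html = '<html><body><table>'
      . '<tr><td><font><b>Top Movies</b></font></td></tr>'
      . '<tr><td><font><b>Other row</b></font></td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Query as a browser tool would report it: includes tbody.
$withTbody = $xpath->query('/html/body/table/tbody/tr[1]/td/font/b');

// Same query with tbody removed and the row pinned to tr[1].
$withoutTbody = $xpath->query('/html/body/table/tr[1]/td/font/b');

echo 'with tbody: ', $withTbody->length, " node(s)\n";
echo 'without tbody: ', $withoutTbody->length, " node(s)\n";
echo $withoutTbody->item(0)->nodeValue, "\n";
```

If a query copied from the browser returns nothing, a leftover tbody step is the first thing to look for.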

The next, harder case is to grab the contents.

The XPath query from Firebug is:

/html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr[2]/td[2]/a/font/b

  1. Same problem here. Firefox automatically fixes the broken HTML structure and also adds a tbody tag. Remove the tbody tag from the XPath query.
  2. Grab all rows of the table. Change .../tr[2]/td[2]/a/font/b into .../tr/td[2]/a/font/b

Final XPath query for content is:

/html/body/center/table[8]/tr/td[5]/table[4]/tr/td[2]/a/font/b
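
The effect of dropping the row index can be seen in a small sketch. The table and the movie names in $html are invented for illustration; only the shape (a title row followed by rows with a link in the second cell) mirrors the page described above:

```php
<?php
// Hypothetical table: one title row, then three content rows.
$html = '<html><body><table>'
      . '<tr><td colspan="2"><font><b>Top Movies</b></font></td></tr>'
      . '<tr><td>1</td><td><a href="#a"><font><b>Movie A</b></font></a></td></tr>'
      . '<tr><td>2</td><td><a href="#b"><font><b>Movie B</b></font></a></td></tr>'
      . '<tr><td>3</td><td><a href="#c"><font><b>Movie C</b></font></a></td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// tr[2] pins the query to the second row only.
$oneRow = $xpath->query('/html/body/table/tr[2]/td[2]/a/font/b');

// Plain tr matches every row; the title row drops out because it has no td[2].
$allRows = $xpath->query('/html/body/table/tr/td[2]/a/font/b');

$titles = array();
foreach ($allRows as $node) {
    $titles[] = $node->nodeValue;
}

echo $oneRow->length, " node(s) with tr[2]\n";
echo implode(', ', $titles), "\n";
```

Note that the title row needs no special handling here, since it has only one cell and td[2] never matches it.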

The final step is to put both XPath queries into a few lines of code, and we're done:

PHP:
  <?php
      // Report fatal errors only, so loadHTML() warnings about broken markup stay hidden.
      error_reporting(E_ERROR); // | E_WARNING | E_PARSE);
      include('Snoopy.class.php');

      $snooper = new Snoopy();
      if ($snooper->fetch('http://movies.yahoo.com/mv/boxoffice/')) {
          $dom = new DOMDocument();
          $dom->loadHTML($snooper->results);

          $x = new DOMXPath($dom);

          // Firebug: /html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr/td/font/b
          $nodes = $x->query('/html/body/center/table[8]/tr/td[5]/table[4]/tr[1]/td/font/b');
          echo $nodes->item(0)->nodeValue . "<br/>\n"; // Top Movies

          // Firebug: /html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr[2]/td[2]/a/font/b
          $nodes = $x->query('/html/body/center/table[8]/tr/td[5]/table[4]/tr/td[2]/a/font/b');
          foreach ($nodes as $node) {
              echo $node->nodeValue . "<br/>\n";
          }
      }
  ?>
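
As a possible alternative to lowering error_reporting, libxml can be asked to collect its parse warnings internally. The following is only a sketch under assumptions: the markup in $html is a made-up, slightly sloppy stand-in for a real page, and the check on $nodes->length is an extra guard that the listing above does not have:

```php
<?php
// Hypothetical sloppy markup: unescaped "&" and missing closing body/html tags.
$html = '<html><body><table><tr>'
      . '<td><font><b>Top Movies</b></font></td>'
      . '<td>Tom & Jerry</td>'
      . '</tr></table>';

// Collect libxml parse problems internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous);  // restore the earlier setting

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('/html/body/table/tr/td/font/b');

// Guard against an empty result before calling item(0).
$title = ($nodes->length > 0) ? $nodes->item(0)->nodeValue : null;

echo 'parse messages collected: ', count($errors), "\n";
echo 'title: ', ($title === null ? '(not found)' : $title), "\n";
```

How many messages get collected depends on the libxml version, so the count is printed rather than relied upon.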

Posted by 다오나무