PHP5: Screen scraping with DOM and XPath

php · 2012. 7. 24. 23:50

This tutorial continues the previous one on screen-scraping Yahoo with PHP4.
This time we try a different method, DOM and XPath, which are only supported from PHP5 onward.

First, some knowledge of XPath is required. You can read more about XPath at:

http://www.zvon.org/xxl/XPathTutorial/General/examples.html

There is also a small concern that XPath is a bit slower than pure DOM traversal; see "Speed: DOM traversal vs. XPath in PHP 5".
Still, I personally think XPath is neater and easier.
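
To make the comparison concrete, here is a minimal sketch that reads the same node both ways. The markup is a small hypothetical sample, not the real Yahoo page, and the variable names are only for illustration.

```php
<?php
// Hypothetical sample markup standing in for a real page.
$html = '<html><body><table><tr><td><font><b>Top Movies</b></font></td></tr></table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// 1) Pure DOM traversal: narrow down step by step using tag names.
$table  = $dom->getElementsByTagName('table')->item(0);
$bold   = $table->getElementsByTagName('b')->item(0);
$viaDom = $bold->nodeValue;

// 2) XPath: a single expression describes the whole path.
$xpath    = new DOMXPath($dom);
$viaXPath = $xpath->query('/html/body/table/tr/td/font/b')->item(0)->nodeValue;

echo $viaDom, "\n";   // Top Movies
echo $viaXPath, "\n"; // Top Movies
```

Both approaches reach the same text; the XPath version simply keeps the whole path in one string, which is what makes it easy to paste from a browser tool.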

Let's start. First we inspect the document structure using Firebug in Mozilla Firefox.
We begin with a very easy case: grabbing the title "Top Movies":

Copy the XPath using Firebug, which gives this query:

/html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr/td/font/b

Firefox automatically fixes broken HTML structure and also adds tbody tags that are not in the original source. PHP's DOM parser does not add them, so we need to remove tbody from the query.
We only want the first row of the table, so change .../tr/td/font/b into .../tr[1]/td/font/b

Now we have our first XPath query:

/html/body/center/table[8]/tr/td[5]/table[4]/tr[1]/td/font/b
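
The tbody cleanup can also be done in code. The sketch below uses a hypothetical helper, stripTbody(), and a small made-up table to show that a query copied from the browser matches nothing in PHP until the tbody steps are removed. It assumes the page source itself contains no tbody tags; if it does, those steps must stay in the query.

```php
<?php
// Hypothetical helper: remove every "/tbody" step (with or without an index)
// from an XPath query copied from the browser.
function stripTbody($query)
{
    return preg_replace('#/tbody(\[\d+\])?(?=/|$)#', '', $query);
}

// Hypothetical sample markup: the source has no tbody, as on many old pages.
$html = '<html><body><table><tr><td><font><b>Top Movies</b></font></td></tr></table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$fromFirebug = '/html/body/table/tbody/tr/td/font/b';
$cleaned     = stripTbody($fromFirebug);

$withTbody    = $xpath->query($fromFirebug)->length; // the browser's query
$withoutTbody = $xpath->query($cleaned)->length;     // the cleaned query

echo $cleaned, "\n";      // /html/body/table/tr/td/font/b
echo $withTbody, "\n";    // 0
echo $withoutTbody, "\n"; // 1
```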

The next, harder case is to grab the contents.

The XPath query from Firebug is:

/html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr[2]/td[2]/a/font/b

Same problem here: Firefox fixed the broken HTML structure and added tbody tags, so remove tbody from the XPath query.
This time we want all rows of the table, so change .../tr[2]/td[2]/a/font/b into .../tr/td[2]/a/font/b

The final XPath query for the contents is:

/html/body/center/table[8]/tr/td[5]/table[4]/tr/td[2]/a/font/b
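
The effect of dropping the [2] index from tr can be seen on a small example. The table below is hypothetical sample data, shaped like the real one: a one-cell title row followed by rows whose second cell holds a link.

```php
<?php
// Hypothetical sample table: one title row, then two content rows.
$html = '<html><body><table>'
      . '<tr><td><font><b>Top Movies</b></font></td></tr>'
      . '<tr><td>1</td><td><a href="#"><font><b>Movie A</b></font></a></td></tr>'
      . '<tr><td>2</td><td><a href="#"><font><b>Movie B</b></font></a></td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// tr[2] selects only the second row of the table.
$oneRow = $xpath->query('/html/body/table/tr[2]/td[2]/a/font/b');

// tr without an index selects every row that has a matching second cell.
$allRows = $xpath->query('/html/body/table/tr/td[2]/a/font/b');

$titles = array();
foreach ($allRows as $node) {
    $titles[] = $node->nodeValue;
}

echo $oneRow->length, "\n";      // 1
echo implode(', ', $titles), "\n"; // Movie A, Movie B
```

The title row is skipped by the second query because it has no second cell, so td[2] never matches there.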

The final step is to put both XPath queries into a few lines of code, and we're done:

<?php
    // Report fatal errors only; loadHTML() emits warnings on malformed real-world HTML.
    error_reporting(E_ERROR);
    include('Snoopy.class.php');

    $snooper = new Snoopy();
    if ($snooper->fetch('http://movies.yahoo.com/mv/boxoffice/')) {
        // Parse the fetched page into a DOM tree
        $dom = new DOMDocument();
        $dom->loadHTML($snooper->results);

        $x = new DOMXPath($dom);

        // Title. Firebug gave: /html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr/td/font/b
        $nodes = $x->query('/html/body/center/table[8]/tr/td[5]/table[4]/tr[1]/td/font/b');
        if ($nodes->length > 0) {
            echo $nodes->item(0)->nodeValue, "<br/>\n"; // Top Movies
        }

        // Contents. Firebug gave: /html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr[2]/td[2]/a/font/b
        $nodes = $x->query('/html/body/center/table[8]/tr/td[5]/table[4]/tr/td[2]/a/font/b');
        foreach ($nodes as $node) {
            echo $node->nodeValue, "<br/>\n";
        }
    }
?>
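
The script above hides parser warnings by lowering error_reporting for the whole script. As an alternative, not part of the original code, libxml's own error buffer can collect them instead. The sketch below assumes a small piece of hypothetical broken markup in place of the fetched page.

```php
<?php
// Hypothetical broken markup: an unclosed <b> and a stray </font>.
$html = '<html><body><table><tr><td><b>Top Movies</td></tr></table></font></body></html>';

// Collect parser messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$problems = libxml_get_errors(); // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//td/b');
$title = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

echo $title, "\n";
echo count($problems), " parser message(s) collected\n";
```

This keeps other warnings in the script visible while still tolerating messy HTML; how many messages are collected depends on the libxml version.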

Posted by 다오나무
