Perl one-line web crawler/scraper

20 05 2007

Someone posted a Perl code snippet on

Extract the body of an HTML document
For example, print out just the body of Google’s home page:

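use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => '');
my $res = $ua->request($req);
if ($res->is_success) {
    # Parse the response and locate the <body> element
    my $tree = HTML::TreeBuilder->new_from_content($res->content);
    my $body = $tree->find('body');
    # Children of <body> are either elements or plain text strings
    foreach my $e ($body->content_list()) {
        print ref $e ? $e->as_HTML() : $e;
    }
    $tree->delete;
}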

My shorter, one-liner, command-line version:

perl -MLWP::Simple -e ' $html = get ""; $html =~ s{.*?(<body.*</body>).*}{$1}is; print $html;'
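
One caveat with the substitution: if the page has no body tag, nothing matches and the whole page is printed unchanged. As a rough sketch of the same regex idea in script form, a plain match can return undef in that case instead. The extract_body helper name and the sample HTML string are made up for illustration (they are not part of the original snippet), and the sample replaces the network fetch so the script runs offline:

```perl
use strict;
use warnings;

# Hypothetical helper: return the <body>...</body> span of an HTML string,
# or undef when no body element is found.
sub extract_body {
    my ($html) = @_;
    return undef unless defined $html;
    # /i ignores tag case, /s lets "." match newlines;
    # the greedy .* runs to the last </body> in the input
    return $html =~ m{(<body\b.*</body>)}is ? $1 : undef;
}

# Made-up sample page, used here instead of a real fetch
my $sample = "<html><head><title>t</title></head>\n"
           . "<BODY class=\"x\">\n<p>Hello</p>\n</BODY></html>";

my $body = extract_body($sample);
print defined $body ? "$body\n" : "no body found\n";
```

Like the one-liner, this keeps the body tags themselves, whereas the TreeBuilder version prints only what is inside them. A regex is fine for quick jobs like this, but a real parser copes better with messy markup.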

This is an example of the diversity of Perl: you can solve the same problem in multiple ways and pick whichever suits your needs. It also shows why many people regard Perl as the number one tool for writing crawlers, text processing, prototyping, and so on.




