Perl one-line web crawler/scraper

20 05 2007

Someone posted a Perl code snippet on

Extract the body of an HTML document
For example, print out just the body of Google’s home page:

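use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => '');
my $res = $ua->request($req);
if ($res->is_success) {
    # Parse the response and locate the <body> element
    my $tree = HTML::TreeBuilder->new_from_content($res->content);
    my $body = $tree->find('body');
    # content_list() returns child elements and plain text strings
    foreach my $e ($body->content_list()) {
        print ref($e) ? $e->as_HTML() : $e;
    }
}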

My shorter, one-liner, command-line version:

perl -MLWP::Simple -e ' $html = get ""; $html =~ s{.*?(<body.*</body>).*}{$1}is; print $html;'
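
To try the regex without touching the network, here is a minimal sketch that runs the same pattern against an inline HTML string. The extract_body helper and the sample document are my own inventions for illustration, not part of the original snippet. Unlike the substitution in the one-liner, which leaves the text unchanged when nothing matches, this version returns undef.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original post): returns the
# <body>...</body> span of an HTML string, or undef if none is found.
sub extract_body {
    my ($html) = @_;
    # /i ignores case and /s lets "." match newlines; the greedy ".*"
    # inside the capture runs up to the last </body> in the text.
    return $html =~ m{(<body.*</body>)}is ? $1 : undef;
}

# Sample document made up for this example.
my $html = <<'HTML';
<html>
<head><title>Demo</title></head>
<BODY class="main">
<p>Hello, world</p>
</BODY>
</html>
HTML

print extract_body($html), "\n";
```

This is still a regex and not a parser, so it is only as reliable as the markup it is fed. For messy pages the HTML::TreeBuilder version is the safer choice.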

This is an example of Perl's flexibility: you can solve the same problem in several ways and pick whichever suits your needs. It also shows why many regard Perl as the number one tool for writing crawlers, text processing, prototyping, and so on.



