Perl one line web crawler/scraper

20 05 2007

Someone posted this Perl code snippet on snippets.dzone.com:

Extract the body of an HTML document
For example, print out just the body of Google’s home page:

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'http://www.google.com/');
my $res = $ua->request($req);
if ($res->is_success) {
    my $tree = HTML::TreeBuilder->new_from_content($res->content);
    $tree->elementify();
    my $body = $tree->find('body');
    # content_list returns child elements and plain text strings,
    # so only call as_HTML on the elements.
    foreach my $e ($body->content_list()) {
        print ref($e) ? $e->as_HTML() : $e;
    }
}

My shorter, one-liner, command-line version:

perl -MLWP::Simple -e ' $html = get "http://www.google.com"; $html =~ s{.*?(<body.*</body>).*}{$1}is; print $html;'
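One thing to watch with the one-liner: if the substitution does not match, $html is left untouched and the whole page gets printed. Here is a minimal sketch of a match-based variant, run against an in-memory string so it needs no network access. The helper name extract_body and the sample markup are made up for illustration; they are not part of the original snippet.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original snippet): return the <body>
# element, tags included, or undef when no body is found.
sub extract_body {
    my ($html) = @_;
    return undef unless defined $html;
    # /i ignores case, /s lets "." match newlines as well.
    return $html =~ m{(<body\b.*</body>)}is ? $1 : undef;
}

# Sample markup, assumed purely for illustration.
my $sample = "<html><head><title>t</title></head>\n"
           . "<BODY class=\"x\">\n<p>Hello</p>\n</BODY></html>";

my $body = extract_body($sample);
print defined $body ? "$body\n" : "no <body> found\n";
```

Returning undef on a failed match lets the caller decide what to do when the page has no body, instead of silently printing everything.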

This is an example of Perl's flexibility: you can solve the same problem in several ways and pick whichever suits your needs. It also shows why Perl is so widely used for writing crawlers, text processing, prototyping, etc.

