My Perls of wisdom

12 10 2006

Here is my first contribution to the open source community.

It's a command-line Google search tool written in Perl. I did not use their API, just a browser user agent…

## google-parser-command-line.pl By Saifullah Mahmud Sumon : 2006-Jul-12
## Pass multiple search topics as comma-separated arguments.
## Sample run: > perl google-parser-command-line.pl one search topic, another search topic
use strict;
use warnings;
use LWP; # use this module for access to webpages
my $browser; # LWP::UserAgent object, created on first use in do_GET
my $args = join(" ",@ARGV);
my @keywords = grep { length } split /\s*,\s*/, $args; # split on commas, trim surrounding whitespace, skip empty topics
foreach my $keyword (@keywords){ # loop through the keywords
my $url = 'http://www.google.com/search?hl=en&lr=&ie=UTF-8&q="'.$keyword.'"&hl=en&lr=&start=0&sa=N&filter=0';
print "searching... [$keyword]\n [$url]\n";
# perform the search and return the page source to the $doc variable
my ($doc, undef, undef, undef) = do_GET($url);
while(($doc !~ /repeat the search with the omitted results included/)
and ($doc !~ /did not match any documents/)
and ($doc =~ /Next/ )){
chomp $doc;
$doc =~ s/]*>//g;
$doc =~ s/.*seconds\)//g;
$doc =~ s/12345678910.*//g;
my $results = $doc ;
while ( $results =~ /(.+?)
(.+?)
(.+?)\s+-\s+(\d+k)+/mgsi) {
my($title, $desc, $url, $size) = ($1||'',$2||'',$3||'',$4||'');
$url = 'http://'.$url;
#regex the text out, KISS rule ;) Keeping It Simple
$title =~ s/\&quot\;/"/g;
$title =~ s/\&amp\;/&/g;
$title =~ s/\&\#\d+\;/'/g;
$title =~ s!<[^>]*>!!g; # drop all HTML tags
$title =~ s!,!;!g;
$title =~ s!\s+! !g;

$url =~ s/\&quot\;/"/g;
$url =~ s/\&amp\;/&/g;
$url =~ s/\&\#\d+\;/'/g;
$url =~ s!<[^>]*>!!g; # drop all HTML tags
$url =~ s!,!;!g;
$url = $1 if ($url =~ /(.+?)\s+/);
$url = $1 if ($url =~ /(.+?)\&nbsp/);
next if $title =~ /nbsp;$/;
printf " %-80s ==> %-70s\n", $title, $url;
}# while inner
last; # $doc is never re-fetched, so stop after the first page of results
}#while outer
print "------------------------------------------------------------------------------------------\n";
}#foreach
exit 0;

sub do_GET { # subroutine that does the actual GET on the webpage and returns the source
# Parameters: the URL,
# and then, optionally, any header lines: (key,value, key,value)
$browser = LWP::UserAgent->new() unless $browser;
$browser->agent("Mozilla/5.0");
my $resp = $browser->get(@_);
return ($resp->content, $resp->status_line, $resp->is_success, $resp)
if wantarray;
return unless $resp->is_success;
return $resp->content;
}
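
The comma-separated arguments and the search URL could be handled a little more defensively. What follows is only a minimal sketch, not part of the original script: `parse_keywords` and `build_search_url` are hypothetical helper names, and the percent-encoding is hand-rolled in core Perl so the example runs without extra modules (the `URI::Escape` module provides `uri_escape` for the same job). The query parameters are copied from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the raw argument list into search phrases.
# Arguments are joined with spaces, split on commas, and trimmed.
sub parse_keywords {
    my @args = @_;
    my @keywords;
    for my $phrase (split /,/, join(' ', @args)) {
        $phrase =~ s/^\s+//;                        # strip leading whitespace
        $phrase =~ s/\s+$//;                        # strip trailing whitespace
        push @keywords, $phrase if length $phrase;  # skip empty entries
    }
    return @keywords;
}

# Hypothetical helper: percent-encode a quoted phrase and build the URL.
sub build_search_url {
    my ($keyword) = @_;
    my $q = qq{"$keyword"};                      # quote the phrase, as the script does
    utf8::encode($q) if utf8::is_utf8($q);       # decoded text becomes UTF-8 bytes
    $q =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;
    return "http://www.google.com/search?hl=en&lr=&ie=UTF-8&q=$q&start=0&sa=N&filter=0";
}

# Print the URL that would be requested for each topic on the command line.
for my $keyword (parse_keywords(@ARGV)) {
    print "[$keyword] => ", build_search_url($keyword), "\n";
}
```

Encoding the phrase this way keeps characters such as `&` or `#` inside a search topic from being read as part of the URL structure.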

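The title and the URL go through almost the same substitutions, so they could share one routine. This is a minimal sketch under stated assumptions: `clean_field` is a hypothetical helper, not the author's code, and the tag-stripping pattern `<[^>]*>` is an assumed way of dropping HTML tags. The other steps mirror the substitutions in the script.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: apply the
# cleanup steps that the script repeats for $title and $url.
sub clean_field {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;    # drop HTML tags (assumed pattern)
    $text =~ s/&quot;/"/g;    # decode &quot;
    $text =~ s/&amp;/&/g;     # decode &amp;
    $text =~ s/&#\d+;/'/g;    # crude: any numeric entity becomes an apostrophe
    $text =~ s/,/;/g;         # commas become semicolons, as in the script
    $text =~ s/\s+/ /g;       # collapse runs of whitespace
    return $text;
}

print clean_field('<b>Perl</b> &amp; &quot;LWP&quot;,   it&#39;s  simple'), "\n";
# prints: Perl & "LWP"; it's simple
```

Keeping the rules in one place means a fix to one substitution applies to both fields.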
