
Adsense publisher ID harvesting

This is not my idea. It is a spinoff of a C app that a friend (chrak) wrote to harvest Adsense IDs. I will link his paper on this as soon as I find it.

The idea is to make use of blogger.com's next-blog handler. Originally chrak wrote a C app to grab blog URLs en masse and pipe the output to a bash script that would strip the publisher ID from the content. I thought this was a little overcomplicated. I also disagreed with his output of IDs. For some reason he was generating CSV output and redirecting it to a .xml file(???). I couldn't help but rewrite this to parse the response during the single GET to next-blog. This was also a good excuse for me to play with SQLite.

Keep in mind that the insert() and setup() functions are a little temperamental. Also note that the regex that grabs google_ad_client is a little specific: if there are spaces on either side of the '=' it will not match. I don't really care. It shouldn't contain any spaces, and if it does, fuck them. This is easily fixable.
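For anyone who does care about the spaces, here is a rough sketch of a more forgiving match. The helper name extract_pub_id is made up for this example, and the pattern assumes the ID is all digits and may carry an optional "ca-" prefix; neither assumption comes from the script itself.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the publisher ID out of one line of page source.
# Allows optional whitespace around '=' and either quote style.
# Assumption: the ID is digits only, optionally written as "ca-pub-...".
sub extract_pub_id {
    my ($line) = @_;
    return $1 if $line =~ /google_ad_client\s*=\s*["'](?:ca-)?pub-(\d+)["']/;
    return;    # undef in scalar context when nothing matched
}

for my $line ('google_ad_client="pub-1234567890123456";',
              "google_ad_client = 'pub-1234567890123456';",
              'google_ad_width = 728;') {
    my $id = extract_pub_id($line);
    print defined $id ? "matched: $id\n" : "no match\n";
}
```

Matching on digits instead of (.*) also keeps the capture from running past the closing quote when a line holds more than one quoted value.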

This script is a little slow. It could use a lot of optimization that I don't particularly care to add. Out of the default 100 GETs against next-blog, only about 50-60 turn up unique URLs/publisher IDs, because next-blog likes to hand out the same redirects over and over again. I'm not sure if next-blog has preferences about where it directs you, but again, I don't really care. Fetching 100 IDs generally takes 20-30 minutes on my broke-ass work laptop. Speed it up and let us know!
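One cheap way to stop the repeats from eating into the count is to remember which IDs have already been seen in this run. This is only a sketch: record_unique is a made-up name, the URLs and IDs are dummies, and the real insert() call is left as a comment.

```perl
use strict;
use warnings;

my %seen;        # publisher IDs already recorded in this run
my $count = 0;   # counts unique IDs only

# Hypothetical wrapper: returns 1 the first time an ID is seen, 0 afterwards.
sub record_unique {
    my ($url, $id) = @_;
    return 0 if $seen{$id}++;
    $count++;
    # insert($url, $id) would go here in the full script
    return 1;
}

record_unique('http://a.example/', '111');
record_unique('http://b.example/', '222');
record_unique('http://a.example/', '111');   # repeat redirect, skipped
print "unique ids: $count\n";
```

This only dedupes within a single run; IDs already sitting in ids.db from an earlier run would still be inserted again.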

I found out later that the reason he was piping to a bash script was to increase speed. I'm confident that this can be made a lot faster by using Perl's LWP::Parallel::UserAgent module with a callback to parse google_ad_client out of the page source. The problem is that I'm too lazy to write it. We'd love to see a parallel version, though, if any of you have some 'spare cycles' to write one.
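As a starting point, here is a rough sketch of the parallel shape using plain fork and pipes from core Perl, not LWP::Parallel::UserAgent. The name harvest_parallel and the stub fetcher are invented for this example; in a real run the fetcher would presumably wrap the GET to next-blog plus the regex match and return (url, id), or nothing when a page has no ID.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical driver: fork $workers children; each calls $fetch->()
# $per_worker times and sends "url<TAB>id" lines back over a pipe.
sub harvest_parallel {
    my ($workers, $per_worker, $fetch) = @_;
    my @readers;
    for my $w (1 .. $workers) {
        pipe(my $rd, my $wr) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                      # child process
            close $rd;
            for (1 .. $per_worker) {
                my ($url, $id) = $fetch->();
                print {$wr} "$url\t$id\n" if defined $id;
            }
            close $wr;                        # flushes buffered lines
            POSIX::_exit(0);                  # skip the parent's END blocks
        }
        close $wr;                            # parent keeps only the read end
        push @readers, $rd;
    }
    my %ids;                                  # id => url; duplicates collapse
    for my $rd (@readers) {
        while (my $line = <$rd>) {
            chomp $line;
            my ($url, $id) = split /\t/, $line, 2;
            $ids{$id} = $url;
        }
        close $rd;
    }
    1 while wait() != -1;                     # reap all children
    return \%ids;
}

# Stub fetcher for demonstration only: no network access. Every worker
# produces the same sequence of fake IDs, so duplicates are expected.
my $calls = 0;
my $stub  = sub { $calls++; return ("http://blog.example/$calls", "1000$calls") };

my $ids = harvest_parallel(2, 3, $stub);
print "unique ids: ", scalar(keys %$ids), "\n";
```

Only the parent collects results, so it can do all the SQLite inserts itself and the children never touch the database handle.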

When I showed chrak this code he got really angry that I was 'trying to steal his idea'. I'd like to say this was not the objective. I'm not entirely clear on why he's doing this, but he says it will embarrass Google. I think he's trying to prove Google is leaking personal information in the form of publisher IDs.

#!/usr/bin/perl
 
# GHETTO!!!!
 
use strict;
use warnings;
 
use Cwd;
use DBI;
use Data::Dumper;
use LWP::UserAgent;
 
my $dbh;
my $randURL = 'http://www.blogger.com/next-blog';
my $dbname  = 'ids.db';
my $tcount = $ARGV[0] ? $ARGV[0] : 100;
my $count = 0;
 
sub setup {
        my $dir = getcwd;
        my $file = "$dir/$dbname";
        # connect() creates the file if it is missing. On failure $dbh is undef,
        # so the error has to come from $DBI::errstr, not $dbh->errstr.
        $dbh = DBI->connect("dbi:SQLite:dbname=$file","","") or die($DBI::errstr);
        # Create the table only on first use.
        $dbh->do("CREATE TABLE IF NOT EXISTS adsense(url, id);") or die($dbh->errstr);
        return;
}
 
sub insert {
        my @args = @_;
        setup() if(!defined($dbh));
        my $sth = $dbh->prepare("INSERT INTO adsense(url, id) VALUES(?, ?);") or die($dbh->errstr);
        $sth->execute($args[0], $args[1]) or die($sth->errstr);
}
 
sub main {
        setup();
        while(1) {
                my $ua = LWP::UserAgent->new( agent => "Mozilla/5.0 (Windows; U; Windows NT 5.1; nl; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12" );
                my $resp = $ua->get($randURL);
                if($resp->is_success()) {
                        # URL of the blog we were redirected to (public accessors, not internals)
                        my $curblog = $resp->request->uri->as_string;
                        #my $content = $resp->content();
                        my @stuff = split(/\n/, $resp->content());
                        foreach my $line (@stuff) {
                                # [^"]+ stops at the closing quote; (.*) could run on to a later quote
                                if($line =~ /google_ad_client="pub-([^"]+)"/) {
                                        my $gid = $1;
                                        print "INSERTING: \"$curblog\", \"$gid\"\n";
                                        insert($curblog, $gid);
                                        $count++;
                                        last;
                                }
                        }
                }
                die("Exiting! I've grabbed $tcount id's!") if($count >= $tcount);
        }
}
 
main();

Attachment: adsense_id_harvest.pl.txt (1.41 KB)