Python, Time Waste

Delicious.com, checking user numbers against Benford’s Law

Sometimes, Benford’s Law is used to check some datasets and detect fraud. If a dataset which is supposed to follow the Benford’s Law distribution diverges from the law, we can say that the dataset is a possible fraud (caution with assumptions, and please, note the word “possible” here).

So I had an idea to check the number of users of the Delicious.com website, which is supposed to follow the Benford’s Law. I processed the tag “programming” and I got 40 pages of links with 376 user numbers from links of the Delicious.com. So, here is the plot:

delicious_plot

As we can see on the graph, the correlation was 0.95 (between -1 and 1), so we can say (really !!), Delicious.com is not lying about the user numbers on the links =) does anyone knows some suspect sites ?

Follow the source-code of the Python program, it uses a simple regex to get the user numbers from the pages:

import pybenford
import re
import urllib
import time

PAGES         = 40
DELICIOUS_URL = "http://delicious.com/tag/programming?page=%d"

reg         = re.compile('(\d+)', re.DOTALL |  re.IGNORECASE)
users_set = []

for i in xrange(1, PAGES+1):
   print "Reading the page %02d of %02d..." % (i, PAGES),
   site_handle = urllib.urlopen(DELICIOUS_URL % i)
   site_data   = site_handle.read()
   site_handle.close()
   map_to_int = map(int, reg.findall(site_data))
   print "%02d records!" % len(map_to_int)
   users_set.extend(map_to_int)
   time.sleep(5) # Be nice with servers !

print "Total records: %d" % len(users_set)

benford_law   = pybenford.benford_law()
digits_scale = pybenford.calc_firstdigit(users_set)
pybenford.plot_comparative(digits_scale, benford_law, "Delicious.com")