Delicious.com, checking user numbers against Benford’s Law
Benford’s Law is sometimes used to check datasets for signs of fraud. If a dataset that is expected to follow the Benford’s Law distribution diverges from it, we can say that the dataset is possibly fraudulent (be careful with assumptions, and please note the word “possible” here).
So I had the idea of checking the user counts shown on Delicious.com, which should follow Benford’s Law. I processed the tag “programming” and got 40 pages of links, with 376 user numbers taken from the Delicious.com links. Here is the plot:
As we can see in the graph, the correlation was 0.95 (on a scale from -1 to 1), so we can say (really!!) that Delicious.com is not lying about the user numbers on its links =) Does anyone know of any suspect sites?
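For reference, Benford’s Law predicts that a leading digit d (from 1 to 9) appears with probability log10(1 + 1/d). The snippet below is a minimal sketch of that expected distribution; the function name `benford_expected` is made up for illustration and is not part of pybenford.

```python
import math


def benford_expected():
    """Return {digit: probability} for leading digits 1-9 under Benford's Law."""
    # P(d) = log10(1 + 1/d); the nine probabilities sum to 1.
    return {d: math.log10(1 + 1.0 / d) for d in range(1, 10)}


if __name__ == "__main__":
    for digit, prob in sorted(benford_expected().items()):
        print("%d: %.3f" % (digit, prob))
```

The digit 1 should lead roughly 30% of the time, and each higher digit less often than the previous one.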
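To make the comparison concrete, here is a rough sketch of how a first-digit distribution can be scored against Benford’s Law. It assumes a Pearson correlation coefficient, which may not be exactly what pybenford computes, and the helper names `first_digit_freqs` and `pearson` are hypothetical. The sample data is just powers of two (a sequence known to follow the law closely), not the Delicious.com numbers.

```python
import math
from collections import Counter


def first_digit_freqs(numbers):
    """Relative frequency of each leading digit 1-9 (zeros are skipped)."""
    digits = [int(str(abs(n))[0]) for n in numbers if n != 0]
    counts = Counter(digits)
    total = float(len(digits))
    return [counts.get(d, 0) / total for d in range(1, 10)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Expected Benford frequencies for digits 1..9.
benford = [math.log10(1 + 1.0 / d) for d in range(1, 10)]

# Illustrative sample only: the first hundred powers of two.
sample = [2 ** k for k in range(1, 101)]
observed = first_digit_freqs(sample)

if __name__ == "__main__":
    print("correlation: %.3f" % pearson(observed, benford))
```

A value close to 1 means the observed digit frequencies track the expected curve. A low value is only a hint that something deserves a closer look, not proof of fraud.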
Here is the source code of the Python program; it uses a simple regex to extract the user numbers from the pages:
import pybenford
import re
import urllib
import time
PAGES = 40
DELICIOUS_URL = "http://delicious.com/tag/programming?page=%d"
reg = re.compile('(\d+)', re.DOTALL | re.IGNORECASE)
users_set = []
for i in xrange(1, PAGES+1):
    print "Reading the page %02d of %02d..." % (i, PAGES),
    site_handle = urllib.urlopen(DELICIOUS_URL % i)
    site_data = site_handle.read()
    site_handle.close()
    map_to_int = map(int, reg.findall(site_data))
    print "%02d records!" % len(map_to_int)
    users_set.extend(map_to_int)
    time.sleep(5) # Be nice with servers !
print "Total records: %d" % len(users_set)
benford_law = pybenford.benford_law()
digits_scale = pybenford.calc_firstdigit(users_set)
pybenford.plot_comparative(digits_scale, benford_law, "Delicious.com")