Python

News, Python, Science

Prime Numbers and the Benford’s Law

Today, I read a news article from the Physorg.com about the new pattern found in the Prime Numbers, the article talks about the new discovery by Bartolo Luque and Lucas Lacasa:

In a recent study, Bartolo Luque and Lucas Lacasa of the Universidad Politécnica de Madrid in Spain have discovered a new pattern in primes that has surprisingly gone unnoticed until now. They found that the distribution of the leading digit in the prime number sequence can be described by a generalization of Benford’s law.

I was very surprised by the fact that nobody have noticed that before and after read the original paper (if you are interested, read it) describing the new patterns discovered, I was very impressed and impatient to see it in pratice !

The new pattern discovered is based on the so-called GBL (Generalized Benford’s Law), which you can see in the paper at the Eq 3.1:

gbl

Where the P(d) means the probability of appearance of the leading digit d. The alpha is the exponent of the original power law distribution (for alpha = 1, the GBL reduces to the Benford’s law).

The authors says that for a given integer interval of [1,N], there exists a particular value alpha(N) for which the GBL fits with extremely good accuracy the first digit distribution of the primes appearing in that interval and showing the functional relation between alpha and N in the Eq 3.2:

functional

Where a = 1.10 +- 0.05 for large values of N. They also cite a GBL extension, but I’ll use just these formulae to plot our distributions.

So I have implemented these formulae into the simple pybenford module as follows:

def gbl(alpha, digit):
   return 1/(10**(1-alpha)-1)*((digit + 1)**(1-alpha)-digit**(1-alpha))

def calc_alpha(n, a=1.10):
   return 1/(math.log(n)-a)

def gbenford_law(alpha):
   return [gbl(alpha, digit)*100.0 for digit in xrange(1,10)]

For the reason that we are using an infinite integer sequence, we must always pick the sequence interval [1, N] where N = 10^D  (see the  Natural Density section of the paper for more information).

The next step is to create a list of prime numbers between an arbitrary interval of D=8, or [1,10^8]. In this step I used the Sieve (see more information) utility to create a file with the generated prime numbers in the cited interval, I used the follow command to get this file output:

sieve2310.exe -s 1 -e 100000000  >>sieve_n8.txt

The sieve is very fast, this will create the file “sieve_n8.txt” with nearly 66MB (don’t worry, it’s a very fast generation, it took 8 seconds for me using a Intel Core 2 Duo 2GHz).

And we are ready to use Python and pybenford to read the prime numbers, calculate the leading digits frequency and plot our result ! Here is the code I created:

import pybenford

sieve_file = open("sieve_n8.txt", "r")
prime_list = [int(prime) for prime in sieve_file]
sieve_file.close()

alpha              = pybenford.calc_alpha(10**8)
benford_law        = pybenford.gbenford_law(alpha)
prime_distribution = pybenford.calc_firstdigit(prime_list)
pybenford.plot_comparative(prime_distribution, benford_law, "Prime Numbers")

And voilà, here is the output plot showing an extremely good accuracy claimed by paper authors (click on the image to enlarge):

prime_plot

The plotting of the distributions (click to enlarge)

If you are interested on Benford’s law, there are some posts about it here and here.

I hope you liked this =)

UPDATE 10/05: Mike Loukides did a good work generalizing for other bases, thank you for sharing your experiment Mike.

UPDATE 08/08 (lol): There are many more comments about this post on Reddit, see here.

Python, Time Waste

Delicious.com, checking user numbers against Benford’s Law

Sometimes, Benford’s Law is used to check some datasets and detect fraud. If a dataset which is supposed to follow the Benford’s Law distribution diverges from the law, we can say that the dataset is a possible fraud (caution with assumptions, and please, note the word “possible” here).

So I had an idea to check the number of users of the Delicious.com website, which is supposed to follow the Benford’s Law. I processed the tag “programming” and I got 40 pages of links with 376 user numbers from links of the Delicious.com. So, here is the plot:

delicious_plot

As we can see on the graph, the correlation was 0.95 (between -1 and 1), so we can say (really !!), Delicious.com is not lying about the user numbers on the links =) does anyone knows some suspect sites ?

Follow the source-code of the Python program, it uses a simple regex to get the user numbers from the pages:

import pybenford
import re
import urllib
import time

PAGES         = 40
DELICIOUS_URL = "http://delicious.com/tag/programming?page=%d"

reg         = re.compile('(\d+)', re.DOTALL |  re.IGNORECASE)
users_set = []

for i in xrange(1, PAGES+1):
   print "Reading the page %02d of %02d..." % (i, PAGES),
   site_handle = urllib.urlopen(DELICIOUS_URL % i)
   site_data   = site_handle.read()
   site_handle.close()
   map_to_int = map(int, reg.findall(site_data))
   print "%02d records!" % len(map_to_int)
   users_set.extend(map_to_int)
   time.sleep(5) # Be nice with servers !

print "Total records: %d" % len(users_set)

benford_law   = pybenford.benford_law()
digits_scale = pybenford.calc_firstdigit(users_set)
pybenford.plot_comparative(digits_scale, benford_law, "Delicious.com")
Python, Time Waste

Benford’s Law meets Python and Apple Stock Prices

UPDATE: See the post “Delicious.com, checking user numbers against Benford’s Law” if you want to see an one more example.

UPDATE 2: Brandon Gray has done a nice related work in Clojure, here is the link to the blog.

As Wikipedia says:

Benford’s law, also called the first-digit law, states that in lists of numbers from many real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 almost one third of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than one time in twenty. The basis for this “law” is that the values of real-world measurements are often distributed logarithmically, thus the logarithm of this set of measurements is generally distributed uniformly.

Which means that in a dataset (not all, of course) from a real-life source of data, like for example, the Death Rates, the first digit of every number in this dataset have “1” almost one third of time, “2” in 17.6% of times, and so on in a logarithmic scale. The Benford’s law distribution formulae is:

p(n) = \log_{10}\left( 1 +\frac{1}{n} \right)

Where the “n” is the leading digit.

This formulae makes the follow distribution plot (from Wikipedia image):

So I’ve made a Python module, called “pybenford”, which helps me in the creation and analysis of datasets, like the Stock Historical Prices for Apple Inc.

I think that the code is simple enough to understand and reuse:

import pybenford
import csv

def convert_value(value):
   return float(value.replace(",","."))

stock_file      = open("apple_stock.csv", "r")
csv_apple_stock = csv.reader(stock_file, delimiter=";")
yahoo_format    = csv_apple_stock.next()
stock_prices    = [ convert_value(row[yahoo_format.index("Volume")]) for row in csv_apple_stock ]

benford_law   = pybenford.benford_law()
benford_apple = pybenford.calc_firstdigit(stock_prices)

pybenford.plot_comparative(benford_apple, benford_law, "Apple Stock Volume")

This code will iterate over the Apple Inc. historical data downloaded from Yahoo! Finance and will verify the leading digit for the field “Volume” of the dataset, the dataset is from between 1984 and today (200). Then the pybenford will plot (using Matplotlib) a comparative graph of the dataset with the Benford’s Law distribution. In the graph, there is a Pearson’s Correlation value on the title; the Pearson’s Correlation ranges from +1 to -1. A correlation of +1 means that there is a perfect positive linear relationship between variables.

Follow the plot of comparative (click on the image to enlarge):

stock_volume

As you can surprisely see, we have a strong correlation between the Volume data and the Benford’s Law, the Pearson’s Correlation was 0.98, a higher coefficient, this is like black magic for me =)

Follow another graph of the opening stock prices:

stock_open

The correlation this time was low, but it continues with a significant Pearson’s coefficient of 0.80.

I hope you enjoyed =)

The source-code for the “pybenford” can be downloaded here. This module is a simple collection of some very very simple functions.

Genetic Algorithms, Pyevolve, pys60, Python

TSP on Nokia N73 using PyS60 and Pyevolve

As I promised on the post about Pyevolve on N73, I’ve ported the TSP problem to run on the cellphone. The script will draw the best individual of every generation on the screen using the PyS60 Canvas API. The quality of the video is very low, I’m using another cellphone with VGA only recording.

When I have time, I’ll release the source-code here in the same post. The TSP code is the same used in PSP, but ported to use the Canvas API of PyS60.

I hope you enjoy.

UPDATE 24/04: the source-code can be downloaded here.

News, pys60, Python

Good news for Python community

Good news for Python community ! PyS60 1.9.3 was released and now we have touch support ! Follow the list of changes from the project site:

Python core is upgraded to 2.5.4

– Touch event support is added to appuifw Canvas. An API is added to appuifw
module, touch_enabled() for checking if the device supports touch input.

– scribble application developed using PyS60 touch feature is available in the
installer.

– This release includes a new extension module, sciptext. This is an enabler for
using S60 Platform Service APIs that were introduced in the S60 5th Edition
and back ported on S60 3rd edition FP2, from Python. It supports services like
Application Manager, Calendar, Contacts, Landmarks, Location, Logging,
Messaging, Media Management, Sensors and Sys Info. Refer scriptext module
documentation for the usage and the convention for accessing the platform
Service API interfaces is subjected to change.

– Easier runtime deployment: Python runtime and its dependent components can be
installed by just running the scriptshell application that comes with 1.9.3
release. This feature is available only from S60 3rd edition FP2 devices
onwards and also these devices should have been updated with the latest
firmware. The easier runtime deployment support will be available with all
ensymble packaged applications in future releases.

– SSL support for socket is enabled

See more on the release announce.

PyS60 is a great piece of code between the mobile open-source projects ! Congratulations for the team !

I’ve done the port of TSP problem for Nokia N73 as promised, it’s working very smoothly and the Canvas API of PyS60 is amazing and fast, I’ll post the video and source-code soon.

PyS60 related links

Download of PyS60 installation

Nokia project home

Nokia forum (get started)

Genetic Algorithms, genetic programming, News, Pyevolve, Python, Time Waste

Genetic Algorithms on cellphones ! Pyevolve on Nokia N73 (Symbian + PyS60)

Hello ! This is the 2nd post related to the Pyevolve on portable devices, the first was in the Sony PSP here and the PoC of Pyevolve solving the TSP problem with a graphical output of best individuals on the Sony PSP screen. Now, it’s time to go further and run Pyevolve into the most portable device used by us, the cellphone.

Using the new version of the PyS60, the release 1.9.1, which comes with the new amazing 2.5.1 Python core, I’ve executed the Pyevolve with no problem, and I was very surprised by the performance of the GA on the Nokia N73, the GA I’ve ran is the minimization of one of the De Jogn’s test suite functions, the Sphere function, the function is very simple and I’ve used 5 real variables between the interval of [-5.12, 5.12], the Gaussian Real Mutator and the Single Point Crossover of the Pyevolve framework. Also I’ve set a population size of 80 individuals and the mutation rate of 2% and crossover rate of 90%.

After 18 generations (about 8 seconds), the GA ended with the best score of 0.0, representing the optimal minimization of the De Jong’s Sphere function.

Follow some screenshots of the adventure (click on the pictures to enlarge):

How to install Pyevolve on the PyS60 ?

1) First, you need to install the PyS60 for your Symbian platform. Here is the installation manual. The order of the installation is: Python Runtime for S60, Python Script Shell and PIPS Library.

2) Then, you must create a directory on your Memory Card, inside the “Python” directory, named “Lib”, and inside this directory, you copy the “pyevolve” folder. The absolute folder structure will be like this:

MemoryCard:\Python\Lib\pyevolve

Some features of the framework will not work, like the some DB Adapters, however, the GA core is working really good. I’ve used the Pyevolve subversion release r157, but it should works with the 0.5 release too. I’ve not finished the documentation of the new release, I’m working on some features yet.

Here is the source code I’ve used to minimize the De Jong’s Sphere function:

import e32
print "Loading Pyevolve modules...",
e32.ao_yield()

from pyevolve import G1DList, GSimpleGA
from pyevolve import Initializators, Mutators, Consts

print " done !"
e32.ao_yield()

def sphere(xlist):
   n = len(xlist)
   total = 0
   for i in range(n):
      total += (xlist[i]**2)
   return total

def ga_callback(ga_engine):
   gen = ga_engine.getCurrentGeneration()
   best = ga.bestIndividual()
   print "Generation %d - Best Score: %.2f" % (gen, best.score)
   e32.ao_yield()

   return False

if __name__ == "__main__":

   genome = G1DList.G1DList(5)
   genome.setParams(rangemin=-5.12, rangemax=5.13, bestRawScore=0.00, roundDecimal=2)
   genome.initializator.set(Initializators.G1DListInitializatorReal)
   genome.mutator.set(Mutators.G1DListMutatorRealGaussian)

   genome.evaluator.set(sphere)

   ga = GSimpleGA.GSimpleGA(genome)
   ga.setMinimax(Consts.minimaxType["minimize"])
   ga.setGenerations(100)
   ga.setMutationRate(0.02)
   ga.terminationCriteria.set(GSimpleGA.RawScoreCriteria)
   ga.stepCallback.set(ga_callback)

   ga.evolve()

   best = ga.bestIndividual()
   print "\nBest individual score: %.2f" % (best.score,)

You can note the use of the module “e32” of the PyS60, this is used to process pending events, so we can follow the statistics of current generation while it evolves.

I hope you enjoyed this work, the next step is to port the TSP problem to cellphone =)

Some time ago I’ve asked Guido van Rossum on the Google Moderator about the future of Python for mobile phones (aka PyS60), and here is the full answer from him:

I’m hopeful, but concerned that Java has cornered this market. For example, the Android development kit is extremely slick but only supports Java at the moment. There’s no doubt about which is the dominant app development language on most mobile platforms, including S60 and anything Symbian-based. In the long run I expect Python to just happen on mobile devices, as increases in disk space will allow the set of pre-installed tools to grow. In the mean time I see a bigger role for Python server-side, for example there are iPhone apps backed by services written in Python running on App Engine (and probably also apps backed by Python running on other server platforms).
Guido van Rossum, San Francisco Bay Area

Well, I think that with this new version of PyS60, with the 2.5.1 core, the Python on mobiles can be very useful and productive. Recently, Nokia have signed a loan agreement with the European Investment Bank (EIB) to the tune of €500 million ($623.9 million). According to Reuters, the five-year loan will be used in part to “finance software research and development (R&D) projects Nokia is undertaking during 2009-2011 to make Symbian-based smartphones more competitive.” So I’m with great expectations with this new investment on Symbian smartphones and with the future possibilities of the PyS60.