Python

genetic programming, Pyevolve, Python

Genetic Programming meets Python

I’m proud to announce that the new versions of Pyevolve will have Genetic Programming support; after some time fighting with these evil syntax trees, I think I have a very easy and flexible implementation of GP in Python. I was tired to see people giving up and trying to learn how to implement a simple GP using the hermetic libraries for C/C++ and Java (unfortunatelly I’m a Java web developer hehe).

The implementation is still under some tests and optimization, but it’s working nice, here is some details about it:

The implementation has been done in pure Python, so we still have many bonus from this, but unfortunatelly we lost some performance.

The GP core is very very flexible, because it compiles the GP Trees in Python bytecodes to speed the execution of the function. So, you can use even Python objects as terminals, or any possible Python expression. Any Python function can be used too, and you can use all power of Python to create those functions, which will be automatic detected by the framework using the name prefix =)

As you can see in the source-code, you don’t need to bind variables when calling the syntax tree of the individual, you simple use the “getCompiledCode” method which returns the Python compiled function ready to be executed.

Here is a source-code example:

from pyevolve import *
import math

error_accum = Util.ErrorAccumulator()

# This is the functions used by the GP core,
# Pyevolve will automatically detect them
# and the they number of arguments
def gp_add(a, b): return a+b
def gp_sub(a, b): return a-b
def gp_mul(a, b): return a*b
def gp_sqrt(a):   return math.sqrt(abs(a))

def eval_func(chromosome):
   global error_accum
   error_accum.reset()
   code_comp = chromosome.getCompiledCode()

   for a in xrange(0, 5):
      for b in xrange(0, 5):
         # The eval will execute a pre-compiled syntax tree
         # as a Python expression, and will automatically use
         # the "a" and "b" variables (the terminals defined)
         evaluated     = eval(code_comp)
         target        = math.sqrt((a*a)+(b*b))
         error_accum += (target, evaluated)
   return error_accum.getRMSE()

def main_run():
   genome = GTree.GTreeGP()
   genome.setParams(max_depth=5, method="ramped")
   genome.evaluator.set(eval_func)

   ga = GSimpleGA.GSimpleGA(genome)
   # This method will catch and use every function that
   # begins with "gp", but you can also add them manually.
   # The terminals are Python variables, you can use the
   # ephemeral random consts too, using ephemeral:random.randint(0,2)
   # for example.
   ga.setParams(gp_terminals       = ['a', 'b'],
                gp_function_prefix = "gp")
   # You can even use a function call as terminal, like "func()"
   # and Pyevolve will use the result of the call as terminal
   ga.setMinimax(Consts.minimaxType["minimize"])
   ga.setGenerations(1000)
   ga.setMutationRate(0.08)
   ga.setCrossoverRate(1.0)
   ga.setPopulationSize(2000)
   ga.evolve(freq_stats=5)

   print ga.bestIndividual()

if __name__ == "__main__":
   main_run()

I’m very happy and testing the possibilities of this GP implementation in Python.

And of course, everything in Pyevolve can be visualized any time you want (click to enlarge):

ramped_small

ramped_big

The visualization is very flexible too, if you use Python decorators to set how functions will be graphical represented, you can have many interesting visualization patterns. If I change the function “gp_add” to:

@GTree.gpdec(representation="+", color="red")
def gp_add(a, b): return a+b

We’ll got the follow visualization (click to enlarge):

full

I hope you enjoyed it, I’m currently fixing some bugs, implementing new features, docs and preparing the next release of Pyevolve, which will take some time yet =)

News, Python, Science

Prime Numbers and the Benford’s Law

Today, I read a news article from the Physorg.com about the new pattern found in the Prime Numbers, the article talks about the new discovery by Bartolo Luque and Lucas Lacasa:

In a recent study, Bartolo Luque and Lucas Lacasa of the Universidad Politécnica de Madrid in Spain have discovered a new pattern in primes that has surprisingly gone unnoticed until now. They found that the distribution of the leading digit in the prime number sequence can be described by a generalization of Benford’s law.

I was very surprised by the fact that nobody have noticed that before and after read the original paper (if you are interested, read it) describing the new patterns discovered, I was very impressed and impatient to see it in pratice !

The new pattern discovered is based on the so-called GBL (Generalized Benford’s Law), which you can see in the paper at the Eq 3.1:

gbl

Where the P(d) means the probability of appearance of the leading digit d. The alpha is the exponent of the original power law distribution (for alpha = 1, the GBL reduces to the Benford’s law).

The authors says that for a given integer interval of [1,N], there exists a particular value alpha(N) for which the GBL fits with extremely good accuracy the first digit distribution of the primes appearing in that interval and showing the functional relation between alpha and N in the Eq 3.2:

functional

Where a = 1.10 +- 0.05 for large values of N. They also cite a GBL extension, but I’ll use just these formulae to plot our distributions.

So I have implemented these formulae into the simple pybenford module as follows:

def gbl(alpha, digit):
   return 1/(10**(1-alpha)-1)*((digit + 1)**(1-alpha)-digit**(1-alpha))

def calc_alpha(n, a=1.10):
   return 1/(math.log(n)-a)

def gbenford_law(alpha):
   return [gbl(alpha, digit)*100.0 for digit in xrange(1,10)]

For the reason that we are using an infinite integer sequence, we must always pick the sequence interval [1, N] where N = 10^D  (see the  Natural Density section of the paper for more information).

The next step is to create a list of prime numbers between an arbitrary interval of D=8, or [1,10^8]. In this step I used the Sieve (see more information) utility to create a file with the generated prime numbers in the cited interval, I used the follow command to get this file output:

sieve2310.exe -s 1 -e 100000000  >>sieve_n8.txt

The sieve is very fast, this will create the file “sieve_n8.txt” with nearly 66MB (don’t worry, it’s a very fast generation, it took 8 seconds for me using a Intel Core 2 Duo 2GHz).

And we are ready to use Python and pybenford to read the prime numbers, calculate the leading digits frequency and plot our result ! Here is the code I created:

import pybenford

sieve_file = open("sieve_n8.txt", "r")
prime_list = [int(prime) for prime in sieve_file]
sieve_file.close()

alpha              = pybenford.calc_alpha(10**8)
benford_law        = pybenford.gbenford_law(alpha)
prime_distribution = pybenford.calc_firstdigit(prime_list)
pybenford.plot_comparative(prime_distribution, benford_law, "Prime Numbers")

And voilà, here is the output plot showing an extremely good accuracy claimed by paper authors (click on the image to enlarge):

prime_plot

The plotting of the distributions (click to enlarge)

If you are interested on Benford’s law, there are some posts about it here and here.

I hope you liked this =)

UPDATE 10/05: Mike Loukides did a good work generalizing for other bases, thank you for sharing your experiment Mike.

UPDATE 08/08 (lol): There are many more comments about this post on Reddit, see here.

Python, Time Waste

Delicious.com, checking user numbers against Benford’s Law

Sometimes, Benford’s Law is used to check some datasets and detect fraud. If a dataset which is supposed to follow the Benford’s Law distribution diverges from the law, we can say that the dataset is a possible fraud (caution with assumptions, and please, note the word “possible” here).

So I had an idea to check the number of users of the Delicious.com website, which is supposed to follow the Benford’s Law. I processed the tag “programming” and I got 40 pages of links with 376 user numbers from links of the Delicious.com. So, here is the plot:

delicious_plot

As we can see on the graph, the correlation was 0.95 (between -1 and 1), so we can say (really !!), Delicious.com is not lying about the user numbers on the links =) does anyone knows some suspect sites ?

Follow the source-code of the Python program, it uses a simple regex to get the user numbers from the pages:

import pybenford
import re
import urllib
import time

PAGES         = 40
DELICIOUS_URL = "http://delicious.com/tag/programming?page=%d"

reg         = re.compile('(\d+)', re.DOTALL |  re.IGNORECASE)
users_set = []

for i in xrange(1, PAGES+1):
   print "Reading the page %02d of %02d..." % (i, PAGES),
   site_handle = urllib.urlopen(DELICIOUS_URL % i)
   site_data   = site_handle.read()
   site_handle.close()
   map_to_int = map(int, reg.findall(site_data))
   print "%02d records!" % len(map_to_int)
   users_set.extend(map_to_int)
   time.sleep(5) # Be nice with servers !

print "Total records: %d" % len(users_set)

benford_law   = pybenford.benford_law()
digits_scale = pybenford.calc_firstdigit(users_set)
pybenford.plot_comparative(digits_scale, benford_law, "Delicious.com")
Python, Time Waste

Benford’s Law meets Python and Apple Stock Prices

UPDATE: See the post “Delicious.com, checking user numbers against Benford’s Law” if you want to see an one more example.

UPDATE 2: Brandon Gray has done a nice related work in Clojure, here is the link to the blog.

As Wikipedia says:

Benford’s law, also called the first-digit law, states that in lists of numbers from many real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 almost one third of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than one time in twenty. The basis for this “law” is that the values of real-world measurements are often distributed logarithmically, thus the logarithm of this set of measurements is generally distributed uniformly.

Which means that in a dataset (not all, of course) from a real-life source of data, like for example, the Death Rates, the first digit of every number in this dataset have “1” almost one third of time, “2” in 17.6% of times, and so on in a logarithmic scale. The Benford’s law distribution formulae is:

p(n) = \log_{10}\left( 1 +\frac{1}{n} \right)

Where the “n” is the leading digit.

This formulae makes the follow distribution plot (from Wikipedia image):

So I’ve made a Python module, called “pybenford”, which helps me in the creation and analysis of datasets, like the Stock Historical Prices for Apple Inc.

I think that the code is simple enough to understand and reuse:

import pybenford
import csv

def convert_value(value):
   return float(value.replace(",","."))

stock_file      = open("apple_stock.csv", "r")
csv_apple_stock = csv.reader(stock_file, delimiter=";")
yahoo_format    = csv_apple_stock.next()
stock_prices    = [ convert_value(row[yahoo_format.index("Volume")]) for row in csv_apple_stock ]

benford_law   = pybenford.benford_law()
benford_apple = pybenford.calc_firstdigit(stock_prices)

pybenford.plot_comparative(benford_apple, benford_law, "Apple Stock Volume")

This code will iterate over the Apple Inc. historical data downloaded from Yahoo! Finance and will verify the leading digit for the field “Volume” of the dataset, the dataset is from between 1984 and today (200). Then the pybenford will plot (using Matplotlib) a comparative graph of the dataset with the Benford’s Law distribution. In the graph, there is a Pearson’s Correlation value on the title; the Pearson’s Correlation ranges from +1 to -1. A correlation of +1 means that there is a perfect positive linear relationship between variables.

Follow the plot of comparative (click on the image to enlarge):

stock_volume

As you can surprisely see, we have a strong correlation between the Volume data and the Benford’s Law, the Pearson’s Correlation was 0.98, a higher coefficient, this is like black magic for me =)

Follow another graph of the opening stock prices:

stock_open

The correlation this time was low, but it continues with a significant Pearson’s coefficient of 0.80.

I hope you enjoyed =)

The source-code for the “pybenford” can be downloaded here. This module is a simple collection of some very very simple functions.

Genetic Algorithms, Pyevolve, pys60, Python

TSP on Nokia N73 using PyS60 and Pyevolve

As I promised on the post about Pyevolve on N73, I’ve ported the TSP problem to run on the cellphone. The script will draw the best individual of every generation on the screen using the PyS60 Canvas API. The quality of the video is very low, I’m using another cellphone with VGA only recording.

When I have time, I’ll release the source-code here in the same post. The TSP code is the same used in PSP, but ported to use the Canvas API of PyS60.

I hope you enjoy.

UPDATE 24/04: the source-code can be downloaded here.

News, pys60, Python

Good news for Python community

Good news for Python community ! PyS60 1.9.3 was released and now we have touch support ! Follow the list of changes from the project site:

Python core is upgraded to 2.5.4

– Touch event support is added to appuifw Canvas. An API is added to appuifw
module, touch_enabled() for checking if the device supports touch input.

– scribble application developed using PyS60 touch feature is available in the
installer.

– This release includes a new extension module, sciptext. This is an enabler for
using S60 Platform Service APIs that were introduced in the S60 5th Edition
and back ported on S60 3rd edition FP2, from Python. It supports services like
Application Manager, Calendar, Contacts, Landmarks, Location, Logging,
Messaging, Media Management, Sensors and Sys Info. Refer scriptext module
documentation for the usage and the convention for accessing the platform
Service API interfaces is subjected to change.

– Easier runtime deployment: Python runtime and its dependent components can be
installed by just running the scriptshell application that comes with 1.9.3
release. This feature is available only from S60 3rd edition FP2 devices
onwards and also these devices should have been updated with the latest
firmware. The easier runtime deployment support will be available with all
ensymble packaged applications in future releases.

– SSL support for socket is enabled

See more on the release announce.

PyS60 is a great piece of code between the mobile open-source projects ! Congratulations for the team !

I’ve done the port of TSP problem for Nokia N73 as promised, it’s working very smoothly and the Canvas API of PyS60 is amazing and fast, I’ll post the video and source-code soon.

PyS60 related links

Download of PyS60 installation

Nokia project home

Nokia forum (get started)