Monday, December 10, 2012

Moving to python 3k for numerical computation

In the last few years of working with python, I've always suffered from being kept back to the 2.x version of python by the need of the scientific libraries. The good news is that in the last year most of them made the great step and shipped a 3.x ready version (or, to be honest, a 3.x convertible 2.x version).

So right now I'm having fun trying to install everything on my laptop, an Ubuntu 12.10.

The first step is to install python 3.2 and the appropriate version of the pip packaging system:

sudo apt-get install python3 python3-pip

then we can just plug the normal installation process using the pip-3.2

sudo pip-3.2 install -U numpy
sudo pip-3.2 install -U scipy
sudo pip-3.2 install -U matplotlib
sudo pip-3.2 install -U sympy
sudo pip-3.2 install -U pandas
sudo pip-3.2 install -U ipython
sudo pip-3.2 install -U nose
sudo pip-3.2 install -U networkx
sudo pip-3.2 install -U statsmodels
sudo pip-3.2 install -U cython

Sadly mayavi, scikit-learn, numexpr, biopython and tables are still working on the transition, so they're not yet available. This leave the numerical side of python quite crippled, but I hope that they will soon reach the others and allow us to use py3k as the rest of the world out there.

Sunday, December 2, 2012

write controlled class attribute with descriptors

Few days ago I had the occasion to play around with the descriptor syntax of python. The normal question of "What is a descriptor" is always replied with a huge wall of text, but in reality are quite a simple concept: they are a generalization of the properties.

For those not familiar with the concept of properties, they are a trick to call function with the same syntax of an attribute. if prop is a property, you can write this assignment as a normal attribute:

A.prop = value

But the property will allow you to perfom check and similar on the value before the real assignment.
The basic syntax start from a normal get/set syntax (never use them unless you plan to work with a property!), but the you add a new element to the class that put togheter this two function under the name x:

class A(object):
    def get_x(self):
        return self.hidden_x
    def set_x(self, value):
        print "set the value of x to",value
        self.hidden_x = value
    x = property(get_x,set_x)

You can now use it as a normal class attribute, but when you assign a value to it, it will react with the setter function.

a = A()
a.x = 5

#set the value of x to 5
print a.x


This open a new world of possible interaction between the class and the user, with a very simple syntax. The only limit is that any extra information ha to be stored in the class itself, while sometimes can be useful to keep it separated. It can also become very verbose, which is something that is frown upon when programming in python (Python is not Java, remember).

If we have to create several attribute which behave in a similar way, repeting the same code for the property can be quite an hassle. that's where the descriptor start to became precious (yes, they can do a lot more, but I don't have great requirements).

The descriptor is a class which implements the method __get__ and, optionally, the method __set__ and __delete__. These are the methods that will be called when you try to use the attribute created with these properties.

Let's see a basic implementation of a constant attribute, i.e. and attribute that is fixed in the class and cannot be modified. To to this we need to implement the __get__ method to return the value, and the __set__ method to raise an error if one try to modify it. To avoid possible modification, the actual value is stored inside the Descriptor itself (via the self reference). To interact with the object that possess the Descriptor we can use the instance reference

class ConstantAttr(object):
    def __init__(self,value,name=""):
        self.value=value
        self.name=name
    def __get__(self, instance, type):
        return self.value
    def __set__(self,instance,value):
        raise AttributeError('the attribute {} cannot be written'.format(self.name))

We can now create a class that use this descriptor. We pass the name of the attribute to the __init__ otherwise the Descriptor would have no information on which name the class has registered it under.

class A(object):
    c = ConstantAttr(10,'c')

Using an instance of the class we can see that the value if printed correctly at 10, but if we try to modify it, we obtain an exception.

a = A()
print a.c
#10
a.c = 5
#raise AttributeError: the attribute c cannot be written

Now we can create as many constant attributes as we need with almost no code duplication at all! That's a good start.

The reason I started playing around with the descriptor was a little more complicated. I needed a set of attributes to have a validity test of the inserted value, raising error if the test wasn't correct. You can performa this with properties, but you can't use raise statement in a lambda, forcing you to write a lot of different setters, polluting the class source code and __dict__ with a lot of function. To remove the pollution from the dict you can always delete the function you used to create the property

class A(object):
    def get_x(self):
        return self.hidden_x
    def set_x(self, value):

        #do the check
        self.hidden_x = value
    x = property(get_x,set_x)
    del get_x,set_x


This could work, but you still have 7 or more lines to define something that is no more than a lambda with a message error attached.

So, here come the Descriptor. To keep the pollution to the minimum, I store all the protected values in a intern dictionary called props.
What this code does is to take a test function for testing if the given value is acceptable, then set it if it's correct or raise the given error if it's not.

class CheckAttr(object):
    """create a class attribute which only check and transform an attribute on setting its value"""
    def __init__(self, name, default, test=lambda i,v: True, error='test failed', converter = lambda v:v):
        self.name = name
        self.test = test
        self.error = error
        self.default = default
        self.conv = converter
       
    def checkprops(self,instance):
        try:
            instance.props
        except AttributeError:
            instance.props={}
           
    def __get__(self, instance, type):
        self.checkprops(instance)
        return instance.props.setdefault(self.name,self.default)
       
    def __set__(self, instance, value):
        val =  self.conv(value)
        self.checkprops(instance)
        if not self.test(instance, val):
            raise ValueError(self.error)
        instance.props[self.name]=val

Now it's time to test it on a simple real case. We want to describe a rectangle, so we have two dimensions, height and width, and we need an attribute to return the area of the rectangle itself.

class Rect(object):
    h = CheckAttr('h', 0.0, lambda i,v: v>= 0.0, 'height must be greater than 0', lambda v:float(v))
    w = CheckAttr('w', 0.0, lambda i,v: v>= 0.0, 'width must be greater than 0', lambda v:float(v))
    area = property(lambda self: self.h*self.w)
    def __init__(self,h=0.0,w=0.0):
        self.w = w
        self.h = h
      
Annnnnd...That's it. With this Descriptor code we imposed the condition that bot the width and height should be greater than zero and obtained an attribute area which return the value without giving the possibility of setting it, in only 7 lines of code. Talk about synthesis!

To end with something more difficult let's try to describe the Triangle, which condition use also the values of the other side. This is not a 100% safe version and not performance fine-tuned, but I guess is simple enough to be used:

infty = float('inf')
from math import sqrt
class Triangle(object):
    l1 = CheckAttr('l1', infty, lambda i,v: i.l2+i.l3 > v >= 0.0, 'side 1 must be greater than 0 and smaller than the sum of l2 and l3', lambda v:float(v))
    l2 = CheckAttr('l2', infty, lambda i,v: i.l1+i.l3 > v >= 0.0, 'side 2 must be greater than 0 and smaller than the sum of l1 and l3', lambda v:float(v))
    l3 = CheckAttr('l3', infty, lambda i,v: i.l2+i.l1 > v >= 0.0, 'side 3 must be greater than 0 and smaller than the sum of l2 and l1', lambda v:float(v))
    p = property(lambda self: self.l1+self.l2+self.l3)
    a = property(lambda self: sqrt( self.p/2 * (self.p/2-self.l1) * (self.p/2-self.l2) * (self.p/2-self.l3) ) )
    def __init__(self,l1=0.0,l2=0.0,l3=0.0):
        self.l1 = l1
        self.l2 = l2
        self.l3 = l3
    def __str__(self):
        return "Triangle({t.l1}, {t.l2}, {t.l3})".format(t=self)
    __repr__ = __str__


t = Triangle(1,2,1.5)
print t

# Triangle(1.0, 2.0, 1.5)
print t.p
# 4.5
print t.a

# 0.726184377414
t.l3=5
# ValueError: side 3 must be greater than 0 and smaller than the sum of l2 and l1

EDIT:
I forgot two interesting details for the implementation of the descriptors. The first one address the issue of accessing the descriptor from the class rather than from an instance. I would expect to obtain a reference to the Descriptor instance, but I got the default value. What I should have done was to check if the instance was None (meaning access from the class) and return the descriptor itself:

def __get__(self, instance, type):
    #this allow me to access the descriptor instance
    if not instance:
        return self
    return instance.value

The second bit is about the documentation. If I write the documentation of the Descriptor, I lose the opportunity to obtain a documentation for each instance, that is one of the cool feature of the property object. This can be done in a simple way...using a property ;)


class DocDescriptor(object):
    def __init__(self,doc):
        self.doc=doc
    __doc__ = property(lambda self: self.doc) 
    #other methods to follow

This allow to write code like this one:

class A(object):
    x = DocDescriptor("this is the documentation of x")
    y = DocDescriptor("and this is for y")  

help(A.x)
# this is the documentation of x

Friday, November 23, 2012

Creating a colormap in matplotlib

Matplotlib, as I said before, is quite an amazing graphics library, and can do some power heavy-lifting in data visualization, as long as you lose some time to understand how it works. Usually it's quite intuitive, but one field where it is capable of giving huge headhace is the generation of personalized colormaps.

This page (http://matplotlib.org/examples/api/colorbar_only.html) of the matplotlib manual give some direction, but it's not really useful. What we usually want is to create a new, smooth colormap with our colors of choice.
To do that the only solution is the matplotlib.colors.LinearSegmentedColormap class...which is quite a pain to use. Actually there is a very useful function that avoid this pain, but I will tell the secret after we see the basic behavior.

The main idea of the LinearSegmentedColormap is that for each color (red, green and blue) we divide the colormap in intervals and explain to the colormap two colors to interpolate in between. This is the code to create the simplest colormap, a grayscale:

mycm = mpl.colors.LinearSegmentedColormap('mycm',
{'red':((0., 0., 0.), (1., 1., 1.)),
 'green':((0., 0., 0.), (1., 1., 1.)),
 'blue':((0., 0., 0.), (1., 1., 1.)),
},256)


First of all there is the name of the colormap, the last is the number of point of the interpolation and the middle section is the painful one.
The colormap is described for each color by a sequence of three numbers: the first one is the position in the colormap, and can go from 0 to 1, monotolically. The second and the third numbers represents the value of the color before and after the selected position.
This basic example is composed of two point for each color, 0 and 1, and it say that at those position the color is absent (0) or present (1)

To understand better, we can use a colormap that go from red 0 to 0.25 in the first half, then just after the half switch to 0.75 and go to 1 as the colormap go to 1

import matplotlib as mpl
lscm = mpl.colors.LinearSegmentedColormap
mycm = lscm('mygray',
{'red':((0., 0., 0.), (0.5, 0.25, 0.75), (1., 1., 1.)),
 'green':((0., 0., 0.), (1., 0., 0.)),
 'blue':((0., 0., 0.), (1., 0., 0.)),
},256)


Ok, this is really powerful, but is clearly an overshot in most cases! The matplotlib developers realized this, but for some reason didnt create a whole new class clearly in the module, deciding to create a method of the LinearSegmentedColormap instead, called from_list.
This is the magic cure that we need: to make a simple colormap that goes from red to black to blue, we just need this.

mycm = lscm.from_list('mycm',['r','k','b'])

of course you can mix named colors with tuple of rgb, at your hearth content!

mycm = lscm.from_list('mycm',['pink','k',(0.5,0.5,0.95)])

Ok, now we have our wonderful colormap...but if we have some nan value in our data, everything is going bad, and value is represented in white, out of our control. Don't worry, as what we need is just to set the color to use for the nan values (actually, for the masked ones) with the function set_bad. in this case we put it to green:

#the colormap
mycm = mpl.colors.LinearSegmentedColormap.from_list('mycm',['r','k','b'])
mycm.set_bad('g')

#the corrupted data
a = rand(10,10)
a[5,5] = np.nan

#the image with a nice green spot
matshow(a,cmap = mycm)

Note: use matshow when you think that nan values can be present, as pcolor doesn't get along well with them and imshow keep the white color.

Wednesday, November 14, 2012

Lost week

I'm sorry to have lost this weekend of posting, but I'm currently blocked by some health issues and cannot use the pc in these days. I will keep up to speed next weekend.
I will start discussing some more fun argument, like playing around with matplotlib.

see you soon!

Monday, November 5, 2012

natural sorting (with a hint of regular expressions)

When we talk about sorting of strings in informatics we usually mean the lexicographic ordering, i.e. the same ordering that we have in dictionary (a paper one, not the python one). This is formally correct, but have a notorious drawback when we have to present those string to a human.

if we have the following list:

>>> strings = [ 'a1', 'a2', 'a10' ]

and we sort it, we encounter an unexpected problem:
 
>>> sorted(string)
['a1', 'a10', 'a2']

What is happening is that the string 'a10' is lexicographically before the string 'a2'.
This is very counterintuitive for our users, and in the long run can sometimes give a little headache even to us.

So, what if we want to sort our objects in a lexicographic order? The basic idea is that we want to order the string dividing the proper string part from the numeric part.
if we know how our strings are composed, as in the preceding example, we can simply tamper with the sorted key parameter. This parameter allow us to use a derivated object to order our list instead of the original one. in our case what we need is a tuple with a string part and a numeric part:

>>> splitter = lambda s: ( s[0],int(s[1:]))
>>> sorted(strings, key=splitter)

['a1', 'a2', 'a10']

Ok, this works, but is far from general. the basic idea is good, but we need a way to split a string into his numerical parts, no matter where and how many of them there are!
One method is to use the itertools module (yes, my favourite standard library module), the groupby function, to be exact. 
This function run over an iterable and group it's elements based on a lambda given by the user. In our case we need the isdigit function of the string to identify which pieces are numbers and which aren't. The solution is a simple one-liner

>>> from itertools import  groupby
>>> string = 'aaa111aaa111aaa111aaa111'
>>> [ (a,''.join(b)) for a,b in groupby(string, lambda s: s.isdigit())] 
[(False, 'aaa'), 
(True, '111'), 
 (False, 'aaa'), 
(True, '111'), 
 (False, 'aaa'), 
 (True, '111'), 
 (False, 'aaa'), 
 (True, '111')]

Where the first value of each tuple is the results of the splitting and the second is the matched text. This is already a solution to our problem, but is rough around the edges. To cite one, it read wrongly the dot inside a floating point number, and it's not easy to insert any knowledge of the structure of our string.

To solve the first problem we can fuse together  the triplets number-dot-number, while the other is quite hard to implement.

>>> string = 'aaa111aaa1.11aaa111aaa111'
>>> res = [ (a,''.join(b)) for a,b in groupby(string, lambda s: s.isdigit())]
>>> res2 = []
>>> idx = 0>>> while idx<len(res):>>>     if idx<len(res)-2:>>>         i,j,k = res[idx],res[idx+1],res[idx+2]>>>     else: >>>         i=None>>>     if i and i[0] and not j[0] and k[0] and j[1]=='.':>>>         res2.append((True,"".join([i[1],j[1],k[1]])))>>>         idx+=3>>>     else:>>>         res2.append(res[idx])>>>         idx+=1>>> res2
[(False, 'aaa'),
 (True, '111'),
 (False, 'aaa'),
 (True, '1.11'),
 (False, 'aaa'),
 (True, '111'),
 (False, 'aaa'),
 (True, '111')]

Ok, this works, but is ugly as hell. We need to find a better way. To do this, we need to borrow the power of the regular expressions. The regular expressions (or regex, for short) are a standard way to analyze a string to obtain pieces of it, using a road tested state machine.

To use the regex we need to import the re module, using the findall function to search a string for the given pattern. The pattern is described with another string with a special syntax, but we will come to that later.

Let's see some basic usage of the re module. We need to feed the findall function with a pattern string, in this case the word dog, to search into the given string. The r before the pattern is to indicate that it is a regex string, and will simplify how to write the patters

>>> import re
>>> string = "i have two dogs, the first one is called fido, while the second dog is rex"
>>> re.findall(r'dog', string) 
['dog', 'dog']

So, the re module reply to us that it has found two occurences of the word dog. Note that the resulting list only contains the exact match: so even if the first word was plural (dogs), the matched string is just the 'dog' component.  

If one of the words starts with a capital letter, the search will find only one of them. If we want to find both the cases we can use the square brackets to indicate that the strings inside are equivalent. So our new code look like this

>>> string = "i have two Dogs, the first one is called fido, while the second dog is rex"
>>> re.findall(r'[Dd]og', string)  

['Dog', 'dog']


 Ok, next step, we want to include the s of the plural if found. To obtain this, we have to say that the last s is optional: if is present, include it, but don't worry if it's missing. This is done with a question mark following the subject of interest, the letter s.

>>> re.findall(r'[Dd]ogs?', string)  
['Dogs', 'dog']

Ok, for now I will stop, you can find a huge amount of material online that explain how to use them. Prepare to suffer a little bit, understanding the regex has quite a learning curve.
The pattern to separate the any number of string block from number is the following:

r'[0-9]+|[^0-9]+'

It say that you can alternatively (the | operator) match one or more (the + operator) groups of digits ([0-9]) or something that is not a digit ([^0-9]).

Let's put it to the test:

>>> test = [ 'aaa123bbb.tex', '123aaa345.txt' ]
>>> for string in test:
>>>     res = re.findall(r'[0-9]+|[^0-9]+', string)
>>>     print string,res

aaa123bbb.tex ['aaa', '123', 'bbb.tex'] 
123aaa345.txt ['123', 'aaa', '345', '.txt']

It's not perfect around the edges, but with a little work it can be perfect. What we can do is to specify that a dot that interrupt a number is part of that number, while one that is not between numbers should be on it's own

>>> test = [ 'aaa123bbb.tex', '123aaa345.txt', "aaa3.14bbb.jpg" ]
>>> for string in test:
>>>     res = re.findall(r'[0-9]+\.?[0-9]+]?|[^.0-9]+|.', string)
>>>     print string,res

aaa123bbb.tex ['aaa', '123', 'bbb', '.', 'tex'] 
123aaa345.txt ['123', 'aaa', '345', '.', 'txt'] 
aaa3.14bbb.jpg ['aaa', '3.14', 'bbb', '.', 'jpg']

Dig deeper in the regex module..a lot of power is in it!