Tuesday, April 16, 2013

Status reports

After a long absence due to...well,  to a lot of things, here is a short synthesys of the last few months.

First of all, I'm finally moving to python 3.3! as I pointed out in my previous post, Moving to python 3k for numerical computation, the situation is still from perfect, but in the last months it got better: still no good response from mayavi and tables (and that's a real shame), but both scikits.learn and biopython (even if only installing from sources) got the golden status of py3k ready. Given that I rarely use tables or mayavi, almost 100% of my forkflow is ready for the transition. The reason for the transition is quite simple: I like new shiny toys :) aside from that, in the last year I've been bitten frequently by the unicode management in python 2.7, and several times I've desired strongly to move to a better behaved language like python 3. The python team made a great job, and the result is a language that has a cleaner and more logical structure. Why should I stick to the less than optimal python 2? I don't particularly love giving myself troubles for free, and I still have very few strings attached so that my transition can be carefree. The only thing that kept me back was the status of the support of my everyday packages. To keep track of the evolution of the support of your preferred package, two good point of reference are python 3 wall of shame superpowers and the official page of the top support in python 3 http://py3ksupport.appspot.com/. And do me a favor, start writing python 3 compatible script with a proper use of the __future__ statement and the six library!

Aside from that, I spent a lot of time working with the guys of the statsmodels project. I got a pull request accepted to implement the mosaic plot and two more are waiting a response: one is an poor-man implementation of the facet plot and one is targeted to microarray and pathway analysis. I have to admit that this has been a TREMENDOUS experience. I learned a lot of things from them, first of all the huge gap that exist between writing code that do something and code that let other do the same thing. Code readibility, good docstring, package organization and a lot of new, fun things to do. It's has been a good excuse to get more confident with the git workflow, that I have always wanted to learn but always postponed.

I also tried to post a package on pipy, the central python packages repository. The package is named keggrest, and it's a basic implementation of the rest API of the KEGG biological database. It's a shame, as it has close to no documentation and no support to python 3. Talk about throwing the first stone :). In the next few days I will give it some love to make it a package like the one that I expect to use.

See you soon with more updates and material!



Wednesday, January 23, 2013

Blogging with the ipython netbook and the github/nbviewer combo

As you can see I've just changed the whole aspect of the blog.
This is a response to a couple of design need. The lesser one is that all the plots that I will create will be with a white background, and against a dark background it result in a pain in the eye. The greater one is that Blogger suck for posting code and images. I spend half of my time to check the font and color of the code, and every image require saving it to disk. I can't even show any more complicated code as the results should be reformatted by hand. And, to be honest, I always dreamed of simply blogging my ipython notebook scripts.

So, thank to Brian E. Granger, who wrote a simple method to post an ipython notebook as a frame in the post, I can have the cake and eat it too!
The method is simple as adding an iframe tag (with the correct dimension put by hand, but that's a minor flaw) in the HTML code of the post, and voilĂ !

<iframe src="http://nbviewer.ipython.org/3835181/" width="800" height="1500"></iframe>

Leveraging the magic of the nbviewer server (that is an amazing service, by the way) I can now write a wonderfully formatted notebook, with code and formulas and plots and everything, and just hand it to you. What I'm going to do is set a git repository, create my notebook in there and link them with nbviewer in here. By the way, the notebook that it's linked it's the wonderful XKCD plot style created by Jake Vanderplas, a great blogger and python developer

I will try to explain how I'm going to set up and manage the repository.
First of all I create a new repository called blogger_notebook
Having set up a github account (that is really easy), the next step is to create the directory that will host my material.

mkdir blogger_code
cd blogger_code/

GitHub give some useful information on how to create a new repository. First of all, we create it by writing

git init

this tell to git that in this directory we have a git repository and that it should keep the version control backup of the data. Now I tell him the online repository location:

git remote add origin https://github.com/EnricoGiampieri/blogger_notebook.git 

Ok, we are close to the goal. I copy the notebook that will be my next post, basemap.ipynb. Now I need to tell git to follow it

git add basemap.ipynb

now everytime I make a modification to this file that I want to remember I can save it with the commit command. I also need to add a description of the modification done

git commit basemap.ipynb -m "creation of an example of basemap usage"

lastly, to keep up the repository online, I should put it into the GitHub repository. This will ask for my username and password and will upload all the modification to the online repository.

git push -u origin master

you can see the results here:

https://github.com/EnricoGiampieri/blogger_notebook

Now, the last step is to create a nbviewer link to the notebook. You should take the link to the raw file (you can obtain it going into the file and search for the RAW button) and give it to the nbviewer main page. it will give you a nice link to the notebook with all it's content:

http://nbviewer.ipython.org/urls/raw.github.com/EnricoGiampieri/blogger_notebook/master/basemap.ipynb

Obviously this is barely scratching the surface of the (super)power of git, but there are tons of manuals online that explain it better than what I could ever do. This was just a step by step guide to how to setup this "delayed blogging" method.

Sunday, January 13, 2013

Sympy for statistics

One of the python module to which I have the most controversial feelings is without any doubt sympy.

Sympy is a great piece of software that can deal with a huge amount of problem in a quite elegant way, and I would really like to use it more in my work. The main drawback was a very poor support for statistic, and making all those integral by hand felt a little odd.

It was with a lot of happiness that I read about the development of a new module for statistics in sympy, called sympy.stats, that promised to address all (or at least most) of the needs that someone can have working out statistical problems.

The foundation for this module has been put into place by Matthew Rocklin in the summer 2012. He made a good job, and the module has been indeed extended to support a great amount of probability distribution, both continuous and finite. There is yet no support for infinite discreet space like the natural numbers, and this means that few very important distribution like the Poisson or the Negative Binomial are still left out, but the overall feeling is very good.

The library is based on the idea of Random variable, defining a probability measure over a certain domain. For example, a normal variable is defined over the whole real axis and implements the gaussian probability density.

A selected amount of operation can be done over these random variables, notably obtaining the density estimation, the probability of an event or the expectation value.

But let the code speak.
Let's import sympy and sympy.stats, and create a Normal variable with a fixed variance and mean represented by a sympy real variable. Remember that any time we specify a new sympy symbol we have to declare a name for that symbol. in this case our normal distribution will be called simply X.

import sympy
import sympy.stats as stats

mu = sympy.Symbol('mu')
X = stats.Normal('X',mu,1)

We can now ask the expected mean and standard deviation of out random variable:

print sympy.simplify(stats.E(X))
print sympy.simplify(stats.variance(X))

that return, as expected, mu and 1.
We can also create new random expression based on the original one.
We know for example that a chi squared variable is the sum of N normal, so we can obtain the mean and variance of a 2-degree of freedom Chi distribution simply by summing up the squares of two normal distribution:

X = stats.Normal('X',0,1)
Y = stats.Normal('Y',0,1)
Chi = X**2 + Y**2


stats.E(Chi)
# 2
stats.variance(Chi)
# 4

that are exactly the values we were expecting (see Chi Squared distribution)
We can sample our expression with sample or sample_iter, and we can look at the resulting distribution:

samples = list(stats.sample_iter(Chi, numsamples=1e4))

we can plot the histogram with pylab as simple as:

pylab.hist(samples, bins=100)
pylab.show()

We can also evaluate the conditioned probability of events, but on continuous function this lead to some heavy integrals, so I will demonstrate it using the more simpler Die class, that represents the launch of a fair n-sided die.

X = stats.Die('X',6)
Y = stats.Die('Y',6)
W = stats.Die('W',6)
Z = X + Y + W


We can ask what is the probability that a realization of X is grater than 4:

stats.P(X>4)
# 1/3

or that it equals a certain value, say 3 (the ugly syntax cannot be avoided due to how the equality test is evaluated):

stats.P(sympy.Eq(X,3))
# 1/6 

We can also ask what is the probability that the three dice Z will roll more than 10 given that the first die rolled a 4:

stats.P(Z>10, sympy.Eq(X,4))

So, summing up, the stats module of sympy is really promising and I hope that a lot of work will be done on it to make it even better. If I will understand the sympy development process and the module class hierarchy, I will surely try to make a contribution.
Given these praises, for my needs it still lacks several fundamental features:
  • support for non-limited discreet spaces
  • better support for mixtures of distribution (right now I still get only error complaining about the invertibility of the CDF)
  • better fall-back to numerical evaluation, as a lot of distribution are described by integrals and special functions and, even if the integration routine of sympy is pretty solid, not everything can be solved analytically
My best whishes to the sympy team, thank you for your great job!




Monday, December 10, 2012

Moving to python 3k for numerical computation

In the last few years of working with python, I've always suffered from being kept back to the 2.x version of python by the need of the scientific libraries. The good news is that in the last year most of them made the great step and shipped a 3.x ready version (or, to be honest, a 3.x convertible 2.x version).

So right now I'm having fun trying to install everything on my laptop, an Ubuntu 12.10.

The first step is to install python 3.2 and the appropriate version of the pip packaging system:

sudo apt-get install python3 python3-pip

then we can just plug the normal installation process using the pip-3.2

sudo pip-3.2 install -U numpy
sudo pip-3.2 install -U scipy
sudo pip-3.2 install -U matplotlib
sudo pip-3.2 install -U sympy
sudo pip-3.2 install -U pandas
sudo pip-3.2 install -U ipython
sudo pip-3.2 install -U nose
sudo pip-3.2 install -U networkx
sudo pip-3.2 install -U statsmodels
sudo pip-3.2 install -U cython

Sadly mayavi, scikit-learn, numexpr, biopython and tables are still working on the transition, so they're not yet available. This leave the numerical side of python quite crippled, but I hope that they will soon reach the others and allow us to use py3k as the rest of the world out there.

Sunday, December 2, 2012

write controlled class attribute with descriptors

Few days ago I had the occasion to play around with the descriptor syntax of python. The normal question of "What is a descriptor" is always replied with a huge wall of text, but in reality are quite a simple concept: they are a generalization of the properties.

For those not familiar with the concept of properties, they are a trick to call function with the same syntax of an attribute. if prop is a property, you can write this assignment as a normal attribute:

A.prop = value

But the property will allow you to perfom check and similar on the value before the real assignment.
The basic syntax start from a normal get/set syntax (never use them unless you plan to work with a property!), but the you add a new element to the class that put togheter this two function under the name x:

class A(object):
    def get_x(self):
        return self.hidden_x
    def set_x(self, value):
        print "set the value of x to",value
        self.hidden_x = value
    x = property(get_x,set_x)

You can now use it as a normal class attribute, but when you assign a value to it, it will react with the setter function.

a = A()
a.x = 5

#set the value of x to 5
print a.x


This open a new world of possible interaction between the class and the user, with a very simple syntax. The only limit is that any extra information ha to be stored in the class itself, while sometimes can be useful to keep it separated. It can also become very verbose, which is something that is frown upon when programming in python (Python is not Java, remember).

If we have to create several attribute which behave in a similar way, repeting the same code for the property can be quite an hassle. that's where the descriptor start to became precious (yes, they can do a lot more, but I don't have great requirements).

The descriptor is a class which implements the method __get__ and, optionally, the method __set__ and __delete__. These are the methods that will be called when you try to use the attribute created with these properties.

Let's see a basic implementation of a constant attribute, i.e. and attribute that is fixed in the class and cannot be modified. To to this we need to implement the __get__ method to return the value, and the __set__ method to raise an error if one try to modify it. To avoid possible modification, the actual value is stored inside the Descriptor itself (via the self reference). To interact with the object that possess the Descriptor we can use the instance reference

class ConstantAttr(object):
    def __init__(self,value,name=""):
        self.value=value
        self.name=name
    def __get__(self, instance, type):
        return self.value
    def __set__(self,instance,value):
        raise AttributeError('the attribute {} cannot be written'.format(self.name))

We can now create a class that use this descriptor. We pass the name of the attribute to the __init__ otherwise the Descriptor would have no information on which name the class has registered it under.

class A(object):
    c = ConstantAttr(10,'c')

Using an instance of the class we can see that the value if printed correctly at 10, but if we try to modify it, we obtain an exception.

a = A()
print a.c
#10
a.c = 5
#raise AttributeError: the attribute c cannot be written

Now we can create as many constant attributes as we need with almost no code duplication at all! That's a good start.

The reason I started playing around with the descriptor was a little more complicated. I needed a set of attributes to have a validity test of the inserted value, raising error if the test wasn't correct. You can performa this with properties, but you can't use raise statement in a lambda, forcing you to write a lot of different setters, polluting the class source code and __dict__ with a lot of function. To remove the pollution from the dict you can always delete the function you used to create the property

class A(object):
    def get_x(self):
        return self.hidden_x
    def set_x(self, value):

        #do the check
        self.hidden_x = value
    x = property(get_x,set_x)
    del get_x,set_x


This could work, but you still have 7 or more lines to define something that is no more than a lambda with a message error attached.

So, here come the Descriptor. To keep the pollution to the minimum, I store all the protected values in a intern dictionary called props.
What this code does is to take a test function for testing if the given value is acceptable, then set it if it's correct or raise the given error if it's not.

class CheckAttr(object):
    """create a class attribute which only check and transform an attribute on setting its value"""
    def __init__(self, name, default, test=lambda i,v: True, error='test failed', converter = lambda v:v):
        self.name = name
        self.test = test
        self.error = error
        self.default = default
        self.conv = converter
       
    def checkprops(self,instance):
        try:
            instance.props
        except AttributeError:
            instance.props={}
           
    def __get__(self, instance, type):
        self.checkprops(instance)
        return instance.props.setdefault(self.name,self.default)
       
    def __set__(self, instance, value):
        val =  self.conv(value)
        self.checkprops(instance)
        if not self.test(instance, val):
            raise ValueError(self.error)
        instance.props[self.name]=val

Now it's time to test it on a simple real case. We want to describe a rectangle, so we have two dimensions, height and width, and we need an attribute to return the area of the rectangle itself.

class Rect(object):
    h = CheckAttr('h', 0.0, lambda i,v: v>= 0.0, 'height must be greater than 0', lambda v:float(v))
    w = CheckAttr('w', 0.0, lambda i,v: v>= 0.0, 'width must be greater than 0', lambda v:float(v))
    area = property(lambda self: self.h*self.w)
    def __init__(self,h=0.0,w=0.0):
        self.w = w
        self.h = h
      
Annnnnd...That's it. With this Descriptor code we imposed the condition that bot the width and height should be greater than zero and obtained an attribute area which return the value without giving the possibility of setting it, in only 7 lines of code. Talk about synthesis!

To end with something more difficult let's try to describe the Triangle, which condition use also the values of the other side. This is not a 100% safe version and not performance fine-tuned, but I guess is simple enough to be used:

infty = float('inf')
from math import sqrt
class Triangle(object):
    l1 = CheckAttr('l1', infty, lambda i,v: i.l2+i.l3 > v >= 0.0, 'side 1 must be greater than 0 and smaller than the sum of l2 and l3', lambda v:float(v))
    l2 = CheckAttr('l2', infty, lambda i,v: i.l1+i.l3 > v >= 0.0, 'side 2 must be greater than 0 and smaller than the sum of l1 and l3', lambda v:float(v))
    l3 = CheckAttr('l3', infty, lambda i,v: i.l2+i.l1 > v >= 0.0, 'side 3 must be greater than 0 and smaller than the sum of l2 and l1', lambda v:float(v))
    p = property(lambda self: self.l1+self.l2+self.l3)
    a = property(lambda self: sqrt( self.p/2 * (self.p/2-self.l1) * (self.p/2-self.l2) * (self.p/2-self.l3) ) )
    def __init__(self,l1=0.0,l2=0.0,l3=0.0):
        self.l1 = l1
        self.l2 = l2
        self.l3 = l3
    def __str__(self):
        return "Triangle({t.l1}, {t.l2}, {t.l3})".format(t=self)
    __repr__ = __str__


t = Triangle(1,2,1.5)
print t

# Triangle(1.0, 2.0, 1.5)
print t.p
# 4.5
print t.a

# 0.726184377414
t.l3=5
# ValueError: side 3 must be greater than 0 and smaller than the sum of l2 and l1

EDIT:
I forgot two interesting details for the implementation of the descriptors. The first one address the issue of accessing the descriptor from the class rather than from an instance. I would expect to obtain a reference to the Descriptor instance, but I got the default value. What I should have done was to check if the instance was None (meaning access from the class) and return the descriptor itself:

def __get__(self, instance, type):
    #this allow me to access the descriptor instance
    if not instance:
        return self
    return instance.value

The second bit is about the documentation. If I write the documentation of the Descriptor, I lose the opportunity to obtain a documentation for each instance, that is one of the cool feature of the property object. This can be done in a simple way...using a property ;)


class DocDescriptor(object):
    def __init__(self,doc):
        self.doc=doc
    __doc__ = property(lambda self: self.doc) 
    #other methods to follow

This allow to write code like this one:

class A(object):
    x = DocDescriptor("this is the documentation of x")
    y = DocDescriptor("and this is for y")  

help(A.x)
# this is the documentation of x

Friday, November 23, 2012

Creating a colormap in matplotlib

Matplotlib, as I said before, is quite an amazing graphics library, and can do some power heavy-lifting in data visualization, as long as you lose some time to understand how it works. Usually it's quite intuitive, but one field where it is capable of giving huge headhace is the generation of personalized colormaps.

This page (http://matplotlib.org/examples/api/colorbar_only.html) of the matplotlib manual give some direction, but it's not really useful. What we usually want is to create a new, smooth colormap with our colors of choice.
To do that the only solution is the matplotlib.colors.LinearSegmentedColormap class...which is quite a pain to use. Actually there is a very useful function that avoid this pain, but I will tell the secret after we see the basic behavior.

The main idea of the LinearSegmentedColormap is that for each color (red, green and blue) we divide the colormap in intervals and explain to the colormap two colors to interpolate in between. This is the code to create the simplest colormap, a grayscale:

mycm = mpl.colors.LinearSegmentedColormap('mycm',
{'red':((0., 0., 0.), (1., 1., 1.)),
 'green':((0., 0., 0.), (1., 1., 1.)),
 'blue':((0., 0., 0.), (1., 1., 1.)),
},256)


First of all there is the name of the colormap, the last is the number of point of the interpolation and the middle section is the painful one.
The colormap is described for each color by a sequence of three numbers: the first one is the position in the colormap, and can go from 0 to 1, monotolically. The second and the third numbers represents the value of the color before and after the selected position.
This basic example is composed of two point for each color, 0 and 1, and it say that at those position the color is absent (0) or present (1)

To understand better, we can use a colormap that go from red 0 to 0.25 in the first half, then just after the half switch to 0.75 and go to 1 as the colormap go to 1

import matplotlib as mpl
lscm = mpl.colors.LinearSegmentedColormap
mycm = lscm('mygray',
{'red':((0., 0., 0.), (0.5, 0.25, 0.75), (1., 1., 1.)),
 'green':((0., 0., 0.), (1., 0., 0.)),
 'blue':((0., 0., 0.), (1., 0., 0.)),
},256)


Ok, this is really powerful, but is clearly an overshot in most cases! The matplotlib developers realized this, but for some reason didnt create a whole new class clearly in the module, deciding to create a method of the LinearSegmentedColormap instead, called from_list.
This is the magic cure that we need: to make a simple colormap that goes from red to black to blue, we just need this.

mycm = lscm.from_list('mycm',['r','k','b'])

of course you can mix named colors with tuple of rgb, at your hearth content!

mycm = lscm.from_list('mycm',['pink','k',(0.5,0.5,0.95)])

Ok, now we have our wonderful colormap...but if we have some nan value in our data, everything is going bad, and value is represented in white, out of our control. Don't worry, as what we need is just to set the color to use for the nan values (actually, for the masked ones) with the function set_bad. in this case we put it to green:

#the colormap
mycm = mpl.colors.LinearSegmentedColormap.from_list('mycm',['r','k','b'])
mycm.set_bad('g')

#the corrupted data
a = rand(10,10)
a[5,5] = np.nan

#the image with a nice green spot
matshow(a,cmap = mycm)

Note: use matshow when you think that nan values can be present, as pcolor doesn't get along well with them and imshow keep the white color.

Wednesday, November 14, 2012

Lost week

I'm sorry to have lost this weekend of posting, but I'm currently blocked by some health issues and cannot use the pc in these days. I will keep up to speed next weekend.
I will start discussing some more fun argument, like playing around with matplotlib.

see you soon!