Friday, November 23, 2012

Creating a colormap in matplotlib

Matplotlib, as I said before, is quite an amazing graphics library and can do some powerful heavy lifting in data visualization, as long as you spend some time understanding how it works. Usually it's quite intuitive, but one area where it can give you a huge headache is the creation of custom colormaps.

This page (http://matplotlib.org/examples/api/colorbar_only.html) of the matplotlib manual gives some direction, but it's not really useful. What we usually want is to create a new, smooth colormap with our colors of choice.
To do that, the tool for the job is the matplotlib.colors.LinearSegmentedColormap class... which is quite a pain to use. Actually there is a very useful function that avoids this pain, but I will reveal the secret after we see the basic behavior.

The main idea of LinearSegmentedColormap is that for each channel (red, green and blue) we divide the colormap into intervals and tell the colormap which two values to interpolate between in each one. This is the code to create the simplest colormap, a grayscale:

import matplotlib as mpl

mycm = mpl.colors.LinearSegmentedColormap('mycm',
{'red':((0., 0., 0.), (1., 1., 1.)),
 'green':((0., 0., 0.), (1., 1., 1.)),
 'blue':((0., 0., 0.), (1., 1., 1.)),
},256)


First comes the name of the colormap, last comes the number of points used for the interpolation, and the middle section is the painful one.
Each channel is described by a sequence of triplets: the first number is the position in the colormap, which goes from 0 to 1 and must increase monotonically. The second and third numbers are the values of the channel just before and just after that position.
This basic example has two points for each channel, 0 and 1, and it says that the channel is absent (0) at the first position and fully present (1) at the second.

To understand this better, we can build a colormap whose red channel goes from 0 to 0.25 over the first half, jumps to 0.75 just after the midpoint, and then rises to 1 at the end of the colormap:

import matplotlib as mpl
lscm = mpl.colors.LinearSegmentedColormap
mycm = lscm('mygray',
{'red':((0., 0., 0.), (0.5, 0.25, 0.75), (1., 1., 1.)),
 'green':((0., 0., 0.), (1., 0., 0.)),
 'blue':((0., 0., 0.), (1., 0., 0.)),
},256)
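
To check how the jump at the midpoint behaves, one can sample the colormap directly, since a colormap object is callable on values between 0 and 1. This is only a small sketch: the name `jumpcm` and the sample positions are my own choices, and the exact numbers depend on the 256-entry lookup table.

```python
import matplotlib.colors as mcolors

# Same segment data as above: red jumps from 0.25 to 0.75 at position 0.5.
segments = {
    'red':   ((0., 0., 0.), (0.5, 0.25, 0.75), (1., 1., 1.)),
    'green': ((0., 0., 0.), (1., 0., 0.)),
    'blue':  ((0., 0., 0.), (1., 0., 0.)),
}
jumpcm = mcolors.LinearSegmentedColormap('jumpcm', segments, 256)

# Calling the colormap with a float in [0, 1] returns an (r, g, b, a) tuple.
below = jumpcm(0.49)   # red is just under 0.25 here
above = jumpcm(0.51)   # red is just over 0.75 here
print(below[0], above[0])
```

Green and blue stay at zero everywhere, so the whole map runs from black to pure red with a visible step in the middle.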


OK, this is really powerful, but it is clearly overkill in most cases! The matplotlib developers realized this, but for some reason they didn't provide a separate class for the simple case, and instead added a method to LinearSegmentedColormap, called on the class itself, named from_list.
This is the magic cure that we need: to make a simple colormap that goes from red to black to blue, we just need this:

mycm = lscm.from_list('mycm',['r','k','b'])

Of course you can mix named colors with RGB tuples, to your heart's content!

mycm = lscm.from_list('mycm',['pink','k',(0.5,0.5,0.95)])
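
As a quick sanity check, the colormap returned by from_list can be sampled at its ends and in the middle. The sketch below assumes the default of 256 lookup entries; `N` is the optional third argument of from_list, and the name `coarse` is mine.

```python
import matplotlib.colors as mcolors

lscm = mcolors.LinearSegmentedColormap

# Colors are spread evenly over [0, 1]: red at 0, black at 0.5, blue at 1.
mycm = lscm.from_list('mycm', ['r', 'k', 'b'])
first, middle, last = mycm(0.0), mycm(0.5), mycm(1.0)
print(first, middle, last)

# N controls how many discrete entries the lookup table has.
coarse = lscm.from_list('coarse', ['r', 'k', 'b'], N=5)
print(coarse.N)
```

A small N gives a stepped colormap, which can be handy when the data only has a few distinct levels.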

OK, now we have our wonderful colormap... but if there are NaN values in our data, things go wrong: those values are drawn in white, outside of our control. Don't worry, all we need is to set the color used for NaN values (actually, for the masked ones) with the set_bad method. In this case we set it to green:

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# the colormap
mycm = mpl.colors.LinearSegmentedColormap.from_list('mycm',['r','k','b'])
mycm.set_bad('g')

# the corrupted data
a = np.random.rand(10,10)
a[5,5] = np.nan

# the image with a nice green spot
plt.matshow(a,cmap = mycm)
plt.show()

Note: use matshow when you think that NaN values may be present, as pcolor doesn't get along well with them and imshow keeps the white color.
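
Since set_bad really acts on masked values, the effect can be checked without drawing anything by masking the NaN explicitly and passing the array through the colormap. A minimal sketch, assuming numpy's `np.ma.masked_invalid` to build the mask; the seed and the variable names are arbitrary.

```python
import numpy as np
import matplotlib.colors as mcolors

mycm = mcolors.LinearSegmentedColormap.from_list('mycm', ['r', 'k', 'b'])
mycm.set_bad('g')

# Random data in [0, 1) with one corrupted entry.
rng = np.random.RandomState(0)
a = rng.rand(10, 10)
a[5, 5] = np.nan

# masked_invalid masks NaN and inf entries.
masked = np.ma.masked_invalid(a)

# Applying the colormap gives a (10, 10, 4) RGBA array;
# the masked cell takes the "bad" color.
rgba = mycm(masked)
print(rgba[5, 5])
```

Passing the masked array to a plotting function instead of the raw one is also a way to make the intent explicit.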

Wednesday, November 14, 2012

Lost week

I'm sorry to have missed this weekend's post, but I'm currently held back by some health issues and cannot use the PC these days. I will get back up to speed next weekend.
I will start discussing some more fun topics, like playing around with matplotlib.

See you soon!

Monday, November 5, 2012

natural sorting (with a hint of regular expressions)

When we talk about sorting strings in computer science we usually mean lexicographic ordering, i.e. the same ordering that we find in a dictionary (a paper one, not the Python one). This is formally correct, but it has a notorious drawback when we have to present those strings to a human.

If we have the following list:

>>> strings = [ 'a1', 'a2', 'a10' ]

and we sort it, we encounter an unexpected problem:
 
>>> sorted(strings)
['a1', 'a10', 'a2']

What is happening is that the string 'a10' is lexicographically before the string 'a2'.
This is very counterintuitive for our users, and in the long run can sometimes give a little headache even to us.

So, what if we want to sort our objects in a natural order? The basic idea is to order the strings by separating the proper string part from the numeric part.
If we know how our strings are composed, as in the preceding example, we can simply play with the key parameter of sorted. This parameter allows us to order our list by an object derived from each element instead of the element itself. In our case what we need is a tuple with a string part and a numeric part:

>>> splitter = lambda s: ( s[0],int(s[1:]))
>>> sorted(strings, key=splitter)

['a1', 'a2', 'a10']

OK, this works, but it is far from general. The basic idea is good, but we need a way to split a string into its numeric parts, no matter where they are and how many of them there are!
One method is to use the itertools module (yes, my favourite standard library module), the groupby function, to be exact.
This function runs over an iterable and groups its elements based on a key function given by the user. In our case we need the isdigit method of strings to identify which pieces are numbers and which aren't. The solution is a simple one-liner:

>>> from itertools import groupby
>>> string = 'aaa111aaa111aaa111aaa111'
>>> [ (a,''.join(b)) for a,b in groupby(string, lambda s: s.isdigit())]
[(False, 'aaa'),
 (True, '111'),
 (False, 'aaa'),
 (True, '111'),
 (False, 'aaa'),
 (True, '111'),
 (False, 'aaa'),
 (True, '111')]

Here the first value of each tuple is the result of isdigit for that group and the second is the matched text. This is already a solution to our problem, but it is rough around the edges. For one, it misreads the dot inside a floating point number, and it's not easy to feed in any knowledge of the structure of our strings.
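
The (flag, text) pairs above can be turned directly into a sort key. A minimal sketch, with `natural_key` as a name of my own choosing, and assuming the numbers are plain ASCII integers: digit runs become ints, and the boolean flag is kept in front so that Python 3 never has to compare an int with a string.

```python
from itertools import groupby

def natural_key(text):
    """Split text into digit and non-digit runs, converting digit runs to int."""
    key = []
    for is_number, chars in groupby(text, str.isdigit):
        chunk = ''.join(chars)
        # Keep the flag first: tuples with different flags are ordered
        # by the flag alone, so int and str are never compared directly.
        key.append((is_number, int(chunk) if is_number else chunk))
    return key

strings = ['a10', 'a2', 'a1']
result = sorted(strings, key=natural_key)
print(result)  # ['a1', 'a2', 'a10']
```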

To solve the first problem we can fuse together the number-dot-number triplets, while the second is quite hard to address.

>>> string = 'aaa111aaa1.11aaa111aaa111'
>>> res = [ (a,''.join(b)) for a,b in groupby(string, lambda s: s.isdigit())]
>>> res2 = []
>>> idx = 0
>>> while idx < len(res):
...     if idx < len(res) - 2:
...         i, j, k = res[idx], res[idx+1], res[idx+2]
...     else:
...         i = None
...     if i and i[0] and not j[0] and k[0] and j[1] == '.':
...         res2.append((True, "".join([i[1], j[1], k[1]])))
...         idx += 3
...     else:
...         res2.append(res[idx])
...         idx += 1
...
>>> res2
[(False, 'aaa'),
 (True, '111'),
 (False, 'aaa'),
 (True, '1.11'),
 (False, 'aaa'),
 (True, '111'),
 (False, 'aaa'),
 (True, '111')]

OK, this works, but it is ugly as hell. We need to find a better way, and for that we need to borrow the power of regular expressions. Regular expressions (or regex, for short) are a standard way to analyze a string and extract pieces of it, using a road-tested state machine.

To use regex we need to import the re module and use its findall function to search a string for a given pattern. The pattern is described by another string with a special syntax, but we will come to that later.

Let's see some basic usage of the re module. We need to feed the findall function a pattern string, in this case the word dog, and the string to search in. The r before the pattern marks it as a raw string, in which Python leaves backslashes alone; this makes patterns simpler to write.

>>> import re
>>> string = "i have two dogs, the first one is called fido, while the second dog is rex"
>>> re.findall(r'dog', string) 
['dog', 'dog']
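
The r prefix is a feature of Python string literals rather than of the re module: it only stops Python from interpreting backslashes before the pattern reaches the regex engine. A small sketch of the difference (the sample text is made up, and `\d` is the regex shorthand for a digit):

```python
import re

# In a raw string the backslash is kept as-is, so these two are the same pattern.
raw = r'\d+'
escaped = '\\d+'
print(raw == escaped)         # True

# Without the r prefix, '\n' is one newline character; with it, two characters.
print(len('\n'), len(r'\n'))  # 1 2

# \d+ matches runs of digits.
numbers = re.findall(raw, 'room 12, floor 3')
print(numbers)                # ['12', '3']
```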

So, the re module tells us that it has found two occurrences of the word dog. Note that the resulting list only contains the exact match: even though the first word was plural (dogs), the matched string is just the 'dog' component.

If one of the words starts with a capital letter, the search will find only the other one. If we want to find both cases we can use square brackets to list the characters that are accepted at that position. So our new code looks like this:

>>> string = "i have two Dogs, the first one is called fido, while the second dog is rex"
>>> re.findall(r'[Dd]og', string)  

['Dog', 'dog']


OK, next step: we want to include the s of the plural when it is there. To obtain this, we have to say that the final s is optional: if it is present, include it, but don't worry if it's missing. This is done with a question mark following the item of interest, the letter s.

>>> re.findall(r'[Dd]ogs?', string)  
['Dogs', 'dog']

OK, I will stop here for now; you can find a huge amount of material online that explains how to use them. Prepare to suffer a little bit, as regex has quite a learning curve.
The pattern to separate any number of text blocks from number blocks is the following:

r'[0-9]+|[^0-9]+'

It says that you can alternatively (the | operator) match one or more (the + operator) digits ([0-9]) or one or more characters that are not digits ([^0-9]).

Let's put it to the test:

>>> test = [ 'aaa123bbb.tex', '123aaa345.txt' ]
>>> for string in test:
...     res = re.findall(r'[0-9]+|[^0-9]+', string)
...     print(string, res)
...
aaa123bbb.tex ['aaa', '123', 'bbb.tex']
123aaa345.txt ['123', 'aaa', '345', '.txt']

It's still rough around the edges, but with a little work it can be perfect. What we can do is specify that a dot that interrupts a number is part of that number, while one that is not between digits should stand on its own:

>>> test = [ 'aaa123bbb.tex', '123aaa345.txt', "aaa3.14bbb.jpg" ]
>>> for string in test:
...     res = re.findall(r'[0-9]+(?:\.[0-9]+)?|[^.0-9]+|\.', string)
...     print(string, res)
...
aaa123bbb.tex ['aaa', '123', 'bbb', '.', 'tex']
123aaa345.txt ['123', 'aaa', '345', '.', 'txt']
aaa3.14bbb.jpg ['aaa', '3.14', 'bbb', '.', 'jpg']

Dig deeper into the re module... a lot of power is in it!
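
To close the loop with the sorting problem, the chunks returned by findall can be used as a sort key. This is only a sketch under my own assumptions: the helper name `natural_key` is hypothetical, the pattern is a variant of the one above written with a non-capturing group for the optional decimal part, and numeric chunks are converted with float, which would not suit version strings like 1.2.3.

```python
import re

# A number with an optional decimal part, a run of other characters, or a lone dot.
CHUNKS = re.compile(r'[0-9]+(?:\.[0-9]+)?|[^.0-9]+|\.')

def natural_key(text):
    """Build a sort key where numeric chunks compare by value."""
    key = []
    for chunk in CHUNKS.findall(text):
        if chunk[0].isdigit():
            # Flag 1 marks numbers; they are compared as floats.
            key.append((1, float(chunk)))
        else:
            # Flag 0 marks text; it is compared as a plain string.
            key.append((0, chunk))
    return key

names = ['img12.png', 'img2.png', 'img1.5.png', 'img1.png']
ordered = sorted(names, key=natural_key)
print(ordered)  # ['img1.png', 'img1.5.png', 'img2.png', 'img12.png']
```

The numeric flag in front of each chunk plays the same role as the tuple in the first splitter example: it keeps numbers and text from ever being compared with each other.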