Un pitone a San Luca: October 2012

Friday, October 26, 2012

Using an IPython Notebook as a module

Ok, simple post today, while I work on some bigger, juicier post.

First of all, if you never worked with the ipython notebook, put this post on pause, go to the ipython home, install everything you find, play with it and fall in love.
I'll be waiting, don't worry.

Ok, so you love the ipython notebook, work everyday with it, and cry everytime you have to go back to the usual shell. The only problem is writing a module in it is very uncomfortable: each time you made a modification, you have to save it as a python script, and if you work over a network, you have to send it to the same directory, fiddle with the permissions and so on.

The secret is that you can start the ipython notebook server with the option --script. This option tell the server to save a copy of the notebook in the script version (the .py file, to be clear) any time you save the notebook.My tipical line to execute the ipython notebook became:

ipython notebook ./my_notebook_folder --pylab=inline --no-browser --script

So, every time you edit a notebook, you will have the corresponding module to import from other notebook, closing the circle. Using a notebook to write your own module has a lot of advantage, in my opinion, first of all the possibility to explain with a lot of well formatted text how your library works, accompaining it with link to pages ( use <page to be linked> ) or multimedia objects ( use ![text](link) ) and even latex formulas, really a killer feature when you have to explain scientific code.

The only limitation is that, as far as I know, the script must be traslate into pure python, so no ipython magic or cell magic. It's not that bad, but would have been a real game changer (anybody thinking "seamless cython integration"?).

So you can declare your classes and functions as usual, and put the test code in a block for execution in main:

if __name__ == '__main__':
test code in here

as you would do with a normal script. It will be executed normally when you run the notebook (as it has the __name__ set to __main__ by default), but will be skipped in the import phase.
Remember to use the __all__ parameter to avoid useless name import on

from mylibrary import *

...no wait, you should never do that. Forget the __all__.

Last thing, remember to insert documentation for your code! in a matter of few days you will not remember what each parameter does, so write it down.
If you put a string on the first cell of the notebook, it will be seen as the standard documentation of the whole module.

Last of all, one of the way I prefer to implement documentation and testing in one shot is the use of the DocTest module, which scan all the documentation present in the module, search for documentation lines that looks like shell code, execute it and confront it with the result that you have put in the documentation.
If they are not equal it will complain that a test is failed.

This is obviously not the way to document production code, you should use something like unittest, but is very practical and, honestly, i found that it compels you to write both documentation with examples and useful testing at the same time, both very tedious activity and any help to do them is welcome.

The problem is that doctesting a notebook will fail in a spectacular way, due to its dynamic nature. To help with that, I wrote a function that analyze an object (a class, an instance, a function, a module...whatever) and test each docstring it find in it, reporting a dictionary of the results for each method.

Here it is:

def test(obj,verbose=False, globs = globals()):
    """
    test the docstring of an object, a function or a module
    if verbose is set to True, it will report the result
    of each test, otherwise it will report only
    the failed ones
    """
    test = doctest.DocTestFinder().find(obj, globs=globs)
    runner = doctest.DocTestRunner(verbose=verbose)
    results = {}
    name = ''
    def out(s):
        results[name].append(s)
    for t in test:
        name = t.name
        results[name]=[]
        runner.run(t,out=out)
    if not verbose:
        rimuovi = [ k for k,v in results.iteritems() if not len(v) ]
        for k in rimuovi:
            del results[k]
    return results

How does it work? it's actually pretty simple, once you understand how the doctest module works. First of all you take the object and parse it with the class DocTestFinder:

test = doctest.DocTestFinder().find(obj, globs=globs)

This class will return a list of Test object to be run. To run these tests you use the class DocTestRunner, then create a dictionary to store the results.

runner = doctest.DocTestRunner(verbose=verbose)
results = {}

Now the runner will take each Test in our list and execute it, than confront it with the expected result. It would normally print it to terminal, but we can override this behavior giving it a parameter out, which is a function that can manipulate the result. In our case it put the result of the test in the dictionary under the name of the tested object.

def out(s):
    results[name].append(s)
for t in test:
    name = t.name
    results[name]=[]
    runner.run(t,out=out)

In the end, if you select a non verbose result (the default argument) it will scan the resulting dictionary and remove all the test that didn't return error.
Than the resulting dictionary is returned, and if your documentation is up to date, you will obtain a void dictionary.

To print the results in a friendly version, I use this ancillary function, that simply scan and print keys and values of the dictionary (doing nothing if it is empty):

def verify(obj):
    res = test(obj)
    if res!={}:
        for k,v in res.iteritems():
            print "--------------"
            print k
            for v1 in v:
                print v1

Just another short tip: if you need to test a random function, don't worry. just set the seed to a fixed value at the beginning of the test, and the program will execute the same steps with the same random number each time. In numpy this is done with:

np.random.seed(0)

That's all folks! Have fun, and see you soon!

Sunday, October 21, 2012

Splitting a sequence

I would like to start with one common exercise, that a lot of people wrote about, each one proposing its own version.

Let say that we have a sequence of objects, and we want to split this sequence in equally sized chunks of given size. The idea is the following:

>>> seq = range(12)
>>> split(seq, 3)
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

It looks simple enough, doesn't it? There are probably one hundred different solution to this problem. Anyone has its own preferred version. I will try to show the most common, curious or instructive ones.

Coming from languages like C or Java would be spontaneous to write code like this:

L = len(seq)
size = 3
solution = list()for i in range(L):
     if i%size:
         solution[-1].append(seq[i])
   else:
       solution.append([seq[i]])

Ok, this gives me the correct answer, but this isn't python code. This is C written in python.

a better solution is the following:

[ seq[size*i:size*i+size] for i in range(len(seq)//size+1) ]

It does the same work, but is only one line of code, and, after you are used to the list comprehension magic, it looks a lot more readable. But it still suck, if I should be honest.
It calculate the number of pieces the list will be divided into (len(seq)//size), add one to keep the last fragment, the iterate on the obtained indices. Not so elegant.

A more pythonic code can be written using the list comprehension, and is simply one line of code:

[ seq[i:i+size] for i in range(0,len(seq),size) ]

What this piece of code does is to iterate over the indices of the list, starting from 0 and increasing of "size" step at the time. for each step it takes the element of the list from the given index to the following "size" elements

This will do the trick in the exact same way as the initial code, but is way more simple to write and to read.

But this is not the end of the story. I would like to show you a slightly more esoteric way to slit the sequence:

zip(*([iter(seq)] * 3))

This is a trick that use several effect at the same time. First an iterator is generated from the list. than this iterator is placed inside a list, and the list is triplicated. Each element of the list is now a reference to the same iterator, as you can see writing

>>> print [iter(seq)]*3
[<listiterator object at 0x3900610>, <listiterator object at 0x3900610>, <listiterator object at 0x3900610>]

Where the address of the iterator will change each time you lanch the program. This triplicated access to the iterator means that every time the zip function call the next function on one o them to obtain an element, each one will be increased. The result is a splitted sequence. Due to the behavior of the zip function, it will trim the sequence to the shorter one, so if the sequence is not a multiple of the given size, it will be trimmed.

>>> seq = range(11)
>>> zip(*([iter(seq)] * 3))
[(0, 1, 2), (3, 4, 5), (6, 7, 8)]

This can be avoided using the function izip_longest of the package itertools, which extend the shortest sequence with a serie of None. It does return a generator that can be easily converted into a list:

>>> from itertools import izip_longest as lzip
>>> list(lzip(*([iter(seq)] * 3)))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, None)]

You can specify a filling value for the izip_longest, if the None could lead to problems:

>>> from itertools import izip_longest as lzip
>>> list(lzip(*([iter(seq)] * 3), fillvalue = -1))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, -1)]

A similar effect can be obtained with the function map, which by default extend the sequence to the longest element.

>>> map( lambda *s:s, [iter(seq)]*3 )
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, None)]

with python 2.7 you can also use None instead of the identity lambda *s:s, but it is always a good habit to think of the compatibility with python 3, where it's possible.
To obtain the initial result of having a shorter last sequence, one can explicitly trim the None:

>>> map( lambda *s: [ r for r in s if not r is None ], [iter(seq)]*3 )
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10)]

Going back to the itertools module, one can think also to the groupby function. This function split an iterator into chuncks that respect the same condition, and when they change it start another chunk. So if we use an integer division, we can split the array into equally sized parts:

>>> from itertools import groupby
>>> for key_value, split_generator in groupby(seq, lambda s: s//3):
.... print key_value, list(split_generator)
0 [0, 1, 2]
1 [3, 4, 5]
2 [6, 7, 8]
3 [9, 10]

This version has just one problem: it works only for the sequence of number given by range. to adapt it to a more general case, we need to use the enumerate function to obtain the golden sequence, filter it and then keep only the interesting data:

from itertools import groupby
seq = 'abcdefghilm'
for key, split_gen in groupby(enumerate(seq), lambda s: s[0]//3):
    print key, list(i[1] for i in split_gen)
0 ['a', 'b', 'c']
1 ['d', 'e', 'f']
2 ['g', 'h', 'i']
3 ['l', 'm']

We will meet again this function when we will talk about natural ordering.

Last, but not least, using the numpy module, one obtain the function array_split, that allow to divide the sequence in n given partition of variable size. In this case, to obtain the same splitting, you should use 4 division. It also return not a list of lists, but a list of numpy arrays.

>>> from numpy import array_split as split
>>> split(range(11), 4)
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]), array([ 9, 10])]

This function is quite powerful, as it can also split around specific points or in a given axis for multidimensional arrays. Also matplotlib as a similar function, called pieces, under the submodule matplotlib.cbook, which contains several small recipes from the matplotlib cookbook (http://matplotlib.org/api/cbook_api.html). If you use matplotlib give it a look, it is not very well documented, but contains a lot of useful objects.

In conclusion, I would like to gave you a puzzle that i found on StackOverFlow. It is a piece of BAD PYTHON, very bad indeed, but is curious how it get the job done.

>>> f = lambda x, n, acc=[]: f(x[n:], n, acc+[(x[:n])]) if x else acc
>>> f(range(11), 3)
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]

I have to admit that my first reaction was "wait...what happened?!?". I'm actually still confused on how someone could have thought of something like that. sure as hell no one would be able to debug it if something goes in the bad direction. This is a recursive function, that take a list, extract a first chunk and add it to the accumulator value acc, then pass to itself the shorter list and the accumulator, until the list is empty and it return the accumulator. We can visualize what is happening by rewriting the function in a more explicit way and by printing the intermediate results:

>>> def g(n, x, acc=[]):
...     print (n,x,acc)
... #recursive until exausted
...     if x:
...         # launch the same function on a shorter
... # version of the list with the
... # accumulated list of lists
...         return g(n, x[n:], acc+[(x[:n])])
... #when exausted return the result
...     else:
...         return acc
>>> s = g(3, range(11))
>>> print '\n',s
(3, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [])
(3, [3, 4, 5, 6, 7, 8, 9, 10], [[0, 1, 2]])
(3, [6, 7, 8, 9, 10], [[0, 1, 2], [3, 4, 5]])
(3, [9, 10], [[0, 1, 2], [3, 4, 5], [6, 7, 8]])
(3, [], [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]])

[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]

Yes, it is close to black magic, but it works. But remember kids: recursion is bad, unless it is the only sensible solution to your problem (and 99% of the time, it is not).

Of course these methods are not the only that one can think, but should cover a wide range of necessity, teaching us something in between.

See you next time!

Saturday, October 20, 2012

Hi Internet!

How do you do? I'm not sure where I'm headed, but i would like to try and play with you a little bit.

What i would like to share is the feeling of beauty and power that i feel every time i open my python shell.

Who am I? I'm just a physicist who work on biophysics and biological data analysis, and few years ago fell in love with this programming language.

I've always programmed, in a way or another, but for the first time I actually had fun programming. I'm not an expert, but these are my two cents for the python community, which gave me this amazing instrument.

Given my field of research, my best friends are:

IPython (http://ipython.org/). and especially its notebook feature..simply stunning
Numpy (http://numpy.scipy.org/), for dealing with matrices and array
Matplotlib (http://matplotlib.sourceforge.net/) for plotting the data
Scipy (http://www.scipy.org/) for all the numerical algorithms
Sympy (http://sympy.org/) for the symbolic mathematics

But sometimes I hang out with these buddies:

Networkx (http://networkx.lanl.gov/) for dealing with network analysis
Pandas (http://pandas.pydata.org/) for managing tabular data
PyTables (http://www.pytables.org/moin) for managing HUGE tabular data
Cython (http://cython.org/) when performance is a necessity

Too often I stumbled upon great snippet of code to do all sort of amazing stuffs, without any eplanation on how they do that, forcing me to lose an entire day to get the idea if I ever wanted to try and modify them. What I will try to do (will I succeed? we'll see) is to share with you some code snippet, utility classes, insights and similar, focusing on "what i want to do" and "what is the easier code to do that", explaining the idea behind the code.

Have fun, and say hello if you like the work!