When you come from programming languages like C and C++, where you are used to controlling very finely what the machine does, learning Python is quite disconcerting, especially when it comes to performance.
Without being exactly an optimization freak, I do like to use good algorithms, favoring them over naïve algorithms and implementations, so you can guess my surprise when, after replacing the initial implementation with a supposedly better one, I did not observe the expected speed-up.
The specifics of the algorithm don’t really matter, but the important part of my supposedly better implementation boils down to
    i = 0
    s = 0
    while a:
        s += a.pop(0)
        t[i] = s
        i += 1
Here pop returns a value from a list and removes it from the list: a_list.pop() returns and removes the last item, a_list.pop(0) returns and removes the zeroth item, and in general you can pass any index to return and remove that item. But this version did not show any significant speed-up over the initial version using sum and slices.
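To make the semantics concrete, here is a tiny illustration (the variable names are mine, not from the code above):

    # Illustrative demo of list.pop semantics.
    a = [10, 20, 30, 40]
    last = a.pop()        # returns and removes the last item  -> 40
    first = a.pop(0)      # returns and removes the zeroth item -> 10
    print(last, first, a)  # 40 10 [20, 30]
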
So the next thing I tried was to use pop without an argument, that is, popping from the end (which reverses the values stored in t, but let’s ignore that for now):
    i = 0
    s = 0
    while a:
        s += a.pop()
        t[i] = s
        i += 1
Big surprise: the above is incredibly faster than the preceding version. Not 50% faster. 500× faster.
OK, so, what’s the deal?
Well, after discussing the issue with the people on Freenode’s #python channel, what I suspected was confirmed: Python’s lists are not lists, but array lists.
That is, they’re not implemented in the CPython engine as linked lists, but as arrays. This means that removing an item from the front of the list results in an O(n) copy of all the remaining items, which isn’t really fast. Appending or removing an item at the far end is (amortized) O(1). Removing is easy to understand: it suffices to update the length of the array to make it appear shorter. Adding can be amortized O(1) if the underlying container pre-allocates large chunks of memory when the current storage is exhausted, a strategy used by the STL vector container (something you can inspect by looking at capacity() rather than size()).
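You can actually observe this over-allocation from Python itself. The sketch below watches the memory footprint of a list as it grows by repeated appends; the exact sizes are a CPython implementation detail and vary across versions, but the pattern of flat runs followed by jumps is the point:

    import sys

    # Watch CPython over-allocate as a list grows by appends.
    a = []
    sizes = []
    for i in range(32):
        a.append(i)
        sizes.append(sys.getsizeof(a))

    # The reported size stays flat for several appends, then jumps:
    # spare capacity was pre-allocated, so most appends do no copying.
    print(sizes)

This is the Python-level analogue of watching capacity() outpace size() on an STL vector.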
The table above summarizes the performance differences between the two versions, for an array of one million elements, alongside sum (the built-in that computes the sum over an iterable object such as a list/array). The speed of sum, compared to the other two, explains why the pop(0) version wasn’t significantly slower than the first version.
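If you want to reproduce the comparison yourself, a rough benchmark sketch follows (the element count, function names, and exact ratios are mine; timings will vary by machine and Python version):

    import timeit

    def prefix_front(n):
        # Running sums, popping from the front: each pop shifts the array.
        a = list(range(n)); t = [0] * n
        i = 0; s = 0
        while a:
            s += a.pop(0)
            t[i] = s; i += 1
        return t

    def prefix_back(n):
        # Running sums, popping from the back: amortized O(1) per pop.
        a = list(range(n)); t = [0] * n
        i = 0; s = 0
        while a:
            s += a.pop()
            t[i] = s; i += 1
        return t

    n = 10_000
    print("pop(0):", timeit.timeit(lambda: prefix_front(n), number=1))
    print("pop(): ", timeit.timeit(lambda: prefix_back(n), number=1))
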
So while Pythonistas usually advise you not to worry too much about performance, it turns out that knowing small details, such as the underlying implementation of a given container, may lead to very different strategies when performance matters.