rdrand

04/11/2019

Last week (at the time of writing, anyway), Ars Technica reported a serious bug in AMD’s implementation of rdrand, an instruction that helps you generate random numbers. Apparently, on (some) Ryzen 3000, 0xfff…ff is “random”.

I recently got an AMD Ryzen 9 3900x, and I wondered if mine has the bug as well.
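As a rough idea of what such a check might look like (a minimal sketch for this page, not necessarily the code from the post), one can call rdrand a few times through the _rdrand64_step intrinsic and see whether anything other than all-ones ever comes back; with GCC or Clang, this builds with -mrdrnd:

    // Hedged sketch: poll rdrand via the _rdrand64_step intrinsic and
    // flag the case where it only ever returns 0xffffffffffffffff.
    // Compile with -mrdrnd (GCC/Clang).
    #include <cstdio>
    #include <immintrin.h>

    int main()
    {
        unsigned long long r;
        bool all_ones = true;

        for (int i = 0; i < 10; i++)
        {
            if (!_rdrand64_step(&r)) // 0 means no random value was available
            {
                std::printf("rdrand did not return a value\n");
                return 1;
            }
            std::printf("%016llx\n", r);
            if (r != ~0ull)
                all_ones = false;
        }

        std::printf(all_ones ? "suspiciously constant\n" : "looks sane\n");
        return 0;
    }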

Read the rest of this entry »


Yet More __builtins

21/01/2014

So last week we saw how to use some of GCC’s built-ins; this week, let’s have a look at how we can create our own, should the need arise, say because you need access to some instruction for which GCC does not offer a corresponding built-in.


To do so, we’ll use a bit of the C preprocessor and GCC’s inline assembly extension.
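As a taste of what that looks like (a minimal sketch, using a static inline function rather than the preprocessor machinery developed in the post), here is how one might wrap an x86 instruction, say bswap, with GCC’s extended asm, pretty much as if it were a built-in:

    // Hedged sketch: exposing an instruction as if it were a built-in,
    // using a static inline function and GCC extended inline assembly.
    // bswap is just an example; GCC already offers __builtin_bswap32.
    #include <cstdint>

    static inline std::uint32_t my_bswap32(std::uint32_t x)
    {
        __asm__("bswap %0"   // byte-swap the 32-bit register in place
                : "+r"(x));  // x is both input and output, kept in a register
        return x;
    }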

Read the rest of this entry »


GCC Built-ins

07/01/2014

In the discussion of The Speed of GCD, Daniel Lemire remarked that one could use the compiler-specific intrinsic function __builtin_ctz to get a good speed-up on binary GCD. That remark made me look into all the other intrinsics and built-ins offered by GCC.
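To give the flavor of that remark (my own sketch, not code taken from either post), __builtin_ctz counts the trailing zero bits of its argument, which lets binary GCD strip whole runs of factors of two in a single step instead of one bit at a time:

    // Hedged sketch: binary GCD where __builtin_ctz (count trailing
    // zeros) removes all factors of 2 in one shot.
    #include <cstdint>
    #include <algorithm>

    std::uint32_t binary_gcd(std::uint32_t a, std::uint32_t b)
    {
        if (a == 0) return b;
        if (b == 0) return a;

        int shift = __builtin_ctz(a | b); // factors of 2 common to a and b
        a >>= __builtin_ctz(a);           // a is now odd

        do
        {
            b >>= __builtin_ctz(b);       // b is now odd
            if (a > b) std::swap(a, b);   // keep a <= b
            b -= a;                       // b becomes even (possibly zero)
        } while (b != 0);

        return a << shift;
    }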


Let’s have a look at a few of them!

Read the rest of this entry »


Faster Collatz

01/05/2012

Quite a while ago, I presented the Collatz conjecture; at the time, I was mostly interested in the graphical representation of the problem, without really going anywhere with it.

In this entry, let us have a look at the implementation of the Collatz function.
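For reference (a bare-bones sketch, not the faster variants the post works toward), the function is just the 3n+1 rule iterated until we reach 1:

    // Hedged sketch: count Collatz steps for n with the plain rule
    // n -> n/2 when n is even, n -> 3n+1 when n is odd.
    #include <cstdint>

    unsigned collatz_steps(std::uint64_t n)
    {
        unsigned steps = 0;
        while (n != 1)
        {
            n = (n & 1) ? 3 * n + 1 : n >> 1;
            steps++;
        }
        return steps;
    }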

Read the rest of this entry »


Trigonometric Tables Reconsidered

28/02/2012

A couple of months ago (already!) 0xjfdube produced an excellent piece on table-based trigonometric function computations. The basic idea was that you can look up the value of a trigonometric function rather than actually computing it, on the premise that computing such functions directly is inherently slow. Turns out that’s actually the case, even on fancy CPUs. He discusses the precision of the estimate as a function of the size of the table and shows that you don’t need a terribly large number of entries to get decent precision. He muses over the possibility of using interpolation to augment precision, speculating that it might be slow anyway.
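The basic scheme fits in a few lines (my own sketch, with an arbitrary table size, not 0xjfdube’s code): precompute one period of sin into a table, then turn the angle into an index:

    // Hedged sketch of a table-based sine: N is arbitrary; a larger
    // table gives better precision at the cost of memory.
    #include <cmath>

    static const int N = 1024;
    static const double two_pi = 6.283185307179586;
    static double sin_table[N];

    void init_sin_table()
    {
        for (int i = 0; i < N; i++)
            sin_table[i] = std::sin(two_pi * i / N);
    }

    double sin_lookup(double x)
    {
        double t = std::fmod(x, two_pi);       // reduce to one period
        if (t < 0) t += two_pi;
        int i = static_cast<int>(t * N / two_pi);
        return sin_table[i % N];               // nearest (truncated) entry
    }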

I started thinking about how to interpolate efficiently between table entries, but then I realized that it’s not fundamentally more complicated to compute a polynomial approximation of the functions than to search the table and then use polynomial interpolation between entries.
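To illustrate the alternative (again a sketch of mine, with plain truncated Taylor coefficients rather than whatever the post settles on), a short odd polynomial evaluated with Horner’s scheme already does quite well on a reduced range:

    // Hedged sketch: sin(x) on a reduced range such as [-pi/4, pi/4],
    // approximated by truncated Taylor terms in Horner form.
    double sin_poly(double x)
    {
        const double c3 = -1.0 / 6.0;    // -1/3!
        const double c5 =  1.0 / 120.0;  //  1/5!
        const double c7 = -1.0 / 5040.0; // -1/7!
        double x2 = x * x;
        return x * (1.0 + x2 * (c3 + x2 * (c5 + x2 * c7)));
    }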

Read the rest of this entry »


Being Shifty

27/07/2010

Hacks with integer arithmetic are always fun, especially when they’re not too hard to understand (because some are really strange and make you wonder what the author was thinking when he wrote that). One such simple hack is to replace divisions or multiplies by series of shifts and additions.

However, these hacks make a lot of assumptions that don’t necessarily hold on the target platform. The assumptions are that complex instructions such as mul and div are very slow compared to adds, right shifts, or left shifts; that the execution time of shifts depends only loosely on the number of bits shifted; and that the compiler is not smart enough to generate the appropriate code for you.
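A typical instance of the hack (my example, not one from the post) is multiplying by a small constant with shifts and adds; whether it actually beats a plain mul depends entirely on the assumptions above:

    // Hedged example: 10*x rewritten as 8*x + 2*x.  Modern compilers
    // usually do this kind of rewriting (or better) on their own.
    unsigned times10(unsigned x)
    {
        return (x << 3) + (x << 1);
    }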

Read the rest of this entry »


Is Python Slow? (Part II)

08/06/2010

In a previous post I expressed my worries about Python being excruciatingly slow, and I used a toy problem to compare the speed of Python to programs in several other languages, including C.

Of course, all kinds of people complained that I couldn’t compare a dynamic, interpreted language with static, compiled languages. Let me tell you that I sure can. The goal was to measure speed, not the effects of the language’s type system (although the two are logically correlated) nor the programming paradigm: the amount of CPU time used to solve a given problem was the primary (if not the only) point of interest.

But to be fair to Python, I extended the tests to other interpreted, dynamic languages, such as Lua, Perl, PHP and JavaScript. I also added Pascal and Haskell to the compiled languages group.

Read the rest of this entry »


Powers of Ten (so to speak)

29/06/2009

I am not sure if you are old enough to remember the 1977 IBM movie Powers of Ten (trippy version, without narration) [also at the IMDB and Wikipedia], but that’s a movie that sure put things in perspective. Thinking in terms of powers of ten helps me sort things out when I am considering a design problem. Thinking of the scale of a problem in terms of physical scale is a good way to assess its true importance for a project. Sometimes the problem is the one to solve; sometimes it is not. The fact that a problem is fun, enticing, or challenging does not mean it has to be solved optimally right away because, in the correct context, considering its true scale, it may not be as important as first thought.


Maybe comparing problems’ scales to powers of ten in the physical realm helps you understand where to put your efforts. So here are the different scales and what I think they should contain:

Read the rest of this entry »


More Blinking Lights (and a digression)

28/04/2009

In Blinking Lights I told you how I feel that the modern computer’s exterior, except for its screen, is boring. When I look at my Antec case, I see a large, silent black box, which, by its very definition, is uninteresting at best. Something like a rock that slowly dissipates heat.

However, Bill Buzbee built a computer that has an interesting exterior, and a much more interesting interior: the Magic-1. The Magic-1 is a computer running at 4.something MHz, in the same computational power range as the original 8088-based 4.77 MHz IBM PC, except with a more advanced instruction set.

The Magic-1 Computer

Read the rest of this entry »


Compact Tree Storage

07/04/2009

Implementing data structures in a way that uses memory efficiently should always be on your mind. I do not mean going overboard and micro-optimizing memory allocation right down to the bit. I mean organizing data structures in memory so that you can avoid pointers, use contiguous memory segments, and so on. Normally, minimizing storage by avoiding extra pointers when possible will benefit your program in at least two ways.

First, the reduced memory requirement will make your data structure fit in cache more easily. Remember that pointers that are 4 bytes long in a 32-bit program are 8 bytes long in a 64-bit environment. This yields better run-time performance because you maximize your chances of having the data you need in cache.

Second, contiguous memory layouts also allow for efficient scans of data structures. For example, if you have a classical binary tree, implemented using nodes each holding two pointers, you have to use a tree traversal algorithm, possibly recursive, to enumerate the tree’s contents, which is quite cumbersome if you don’t really care about the order in which the nodes are visited.

It turns out that for a special class of trees, complete trees, there is a contiguous, and quite simple, layout.
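To make the idea concrete (a minimal sketch of the usual implicit layout, assuming the root sits at index 0), the nodes of a complete binary tree can live in a plain array, with children found by index arithmetic rather than through pointers:

    // Hedged sketch: heap-style implicit layout of a complete binary
    // tree in a contiguous array, root at index 0.
    #include <cstddef>

    inline std::size_t left_child(std::size_t i)  { return 2 * i + 1; }
    inline std::size_t right_child(std::size_t i) { return 2 * i + 2; }
    inline std::size_t parent(std::size_t i)      { return (i - 1) / 2; }

    // Enumerating the tree's content, when the visiting order does not
    // matter, is then just a linear scan over the underlying array.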

Read the rest of this entry »