In this quarantine week, let’s answer a (not that) simple question: how many bits do you need to encode sound and images with a satisfying dynamic range?
Let’s see what hypotheses are useful, and how we can use them to get a good idea of the number of bits needed.
I have shown how dB and bits are related, here and also here. Basically, adding one bit to a code adds about 6 dB to the resulting signal. Now, by definition, the threshold of hearing is set at 0 dB. This corresponds to the weakest sound you can distinguish from true silence. The threshold of pain (the point where you kind of expect your ears to start bleeding) is somewhere above 120 dB. Much louder sounds lead to actual hearing damage: explosions, rocket launches, etc. If we assume that we stay in the 0 to 120 dB range, the useful range for safe sound reproduction, at about 6 dB per bit we need 120 / 6 = 20 bits.
So about 20 bits would be enough. If you consider 0 dB as the threshold of hearing, you might want to add 1 or 2 more bits to account for people with much finer hearing (as the loudness contour chart would suggest). Rounding to the next byte, you get 24 bits, which is what pros suggest you use.
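As a quick sanity check, the 6 dB figure itself comes from the fact that each added bit doubles the representable amplitude, and a doubling of amplitude is 20·log10(2) dB; a small sketch:

```python
import math

# One bit doubles the amplitude range, which adds 20*log10(2) ~ 6.02 dB.
db_per_bit = 20 * math.log10(2)

# Bits needed to span the 0-120 dB range, from threshold of hearing
# to threshold of pain.
bits = 120 / db_per_bit

print(f"{db_per_bit:.2f} dB per bit")  # ~6.02
print(f"{bits:.1f} bits for 120 dB")   # ~19.9, so about 20 bits
```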
That one required a bit more research to find good references. Some report that the total visual dynamic range is about 10 orders of magnitude (in appropriate luminosity units); others, like Fein and Szuts, report 16. Depending on the range, that’d yield about 33 to 53 bits, at roughly 3.3 bits per order of magnitude.
However, while the human eye can see luminosity on that range, it can’t do it simultaneously. The figure from Gonzalez &amp; Woods shows that around a given adaptation level (the average scene luminosity), only a certain range can be perceived; anything below the lower end of that range appears uniformly black. That range seems to be only 4 or 5 orders of magnitude, so only about 14 to 17 bits.
So if we consider the simultaneously perceivable range around some standard average-but-bright-enough luminosity, we might get away with 16 bits per color component (maybe less?).
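The orders-of-magnitude-to-bits conversion used above is just a change of logarithm base (one order of magnitude is log2(10) ≈ 3.32 bits); a quick sketch:

```python
import math

def bits_for_orders(n):
    """Bits needed to cover n orders of magnitude of luminosity,
    assuming one code value per just-distinguishable linear step."""
    return n * math.log2(10)

# The ranges discussed above: simultaneous (4-5 orders) and total (10-16).
for n in (4, 5, 10, 16):
    print(f"{n} orders of magnitude -> {bits_for_orders(n):.1f} bits")
```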
The numbers we get are pretty much in line with what we find in audio and video. 24 bits is considered “professional” for audio (but not necessarily useful, depending on the amount of noise in the original source). HDMI supports up to 48 bits per pixel (16 bits per component), while digital cameras often sport 10, 12 or 14 bits per component.
Alan Fein, Ete Zoltan Szuts — Photoreceptors: Their Role in Vision — Cambridge University Press (1982)
Rafael C. Gonzalez, Richard E. Woods — Digital Image Processing — 2nd ed., Prentice Hall (2002)
Something you don’t touch on is the translation of scene brightness values into digital values, which is a key part of image storage requirements. While scene linear brightness values can be used directly (usually just called linear), this requires at least 16 bits (as you say). Often a gamma (power function) or logarithmic encoding is used to compress values into a more limited number of code values, taking advantage of the fact our visual system is best at perceiving changes in brightness around middle gray and below, and allocating more code values there. Except in VFX, images are almost always stored with gamma or log encoding.
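To illustrate that allocation of code values, here is a minimal sketch assuming a pure 1/2.2 power function (real transfer functions like sRGB or the log curves discussed here differ in the details, e.g. a linear toe near black):

```python
# Sketch of gamma encoding to 8 bits. The exponent 1/2.2 and the
# 18% middle-gray figure are conventional assumptions, not a standard.
def encode(linear, gamma=2.2, bits=8):
    """Map a scene-linear value in [0, 1] to an integer code value."""
    return round((linear ** (1.0 / gamma)) * (2 ** bits - 1))

# Middle gray (~18% linear) lands near the middle of the code range,
# so roughly half of the 256 codes describe the darker part of the
# image, where the eye is most sensitive.
print(encode(0.18))  # 117 of 255
print(encode(0.5))   # 186: linear 50% already uses ~73% of the codes
```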
More relevant to storage requirements than what the eye can perceive is what modern cameras can perceive, since computer images are not created by the eye. Professional cameras can generally store at least 12 stops of dynamic range (12 doublings of brightness). Arri cameras, the industry standard for film and television, provide about 15 stops, and use log encoding to compress the range. So for professional film and TV recording, examples might be:
10 bit integer, log encoding (e.g. a film scan)
12 bit integer, log encoding (e.g. ARRIRAW digital recording)
14 bit integer, linear encoding (I think stills cameras use this, but they’re not very public about their formats)
For professional postproduction purposes, a few standards have emerged for images which will be processed (not simply displayed).
10 bit integer, log encoded (e.g. film scan)
16 bit integer, linear encoded (now uncommon, except in Photoshop)
16 bit floating point, linear encoded (most common for storage, esp of computer generated images)
32 bit floating point, linear encoded (often used internally for image processing programs)
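The reason 16-bit float works so well for the last two cases is that floating point keeps the relative step size roughly constant across its range, much like a log encoding. A quick illustration using Python’s built-in half-precision packing (no image library assumed):

```python
import struct

def roundtrip_half(x):
    """Store x as a 16-bit float (IEEE 754 half) and read it back."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# The relative error stays roughly constant over many stops of
# brightness, which is why float behaves like a log encoding for images.
for x in (0.001, 0.1, 1.0, 100.0):
    y = roundtrip_half(x)
    print(f"{x:>8} -> {y:.6g}  (rel. err {abs(y - x) / x:.2e})")
```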
For display, images are generally gamma encoded with a roughly 0.45 power curve:
8 bit integer, gamma encoded (tv, streaming, DVD)
10 bit integer, gamma encoded (some Blu-ray, professional display)
I know, but while standards define those non-linear transfer functions, I’m not as familiar with them as I should be. ITU-R BT.2020-2 (“Parameter values for ultra-high definition television systems…”) defines a gamma-type correction/compression exponent of 0.45. It also states that images are 10 or 12 bits.
I’m also pretty sure that gamma-type curves are a simplification of the actual response curve, which I believe is more sigmoid-like. Also, the range isn’t nearly as extended as what humans can actually see (at the same adaptation level). It may also mean that the step is coarser than what we could (in principle) perceive.
So instead of having n = lg r bits (with r the dynamic range in linear steps and lg the base-2 logarithm), the formula should probably be n = log_s r, accounting for some step ratio s &gt; 1. If we want 10 bits over 5 orders of magnitude, then s ≈ 1.011 (exactly s = 10^(5/2^10) to map onto 10 bits).
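Putting numbers on that coarser step: if we spread, say, 5 orders of magnitude logarithmically over 1024 code values (both figures assumed from the discussion above), each code step is a constant brightness ratio:

```python
import math

orders = 5       # assumed simultaneously perceivable range, ~10^5
codes = 2 ** 10  # 1024 code values for a 10-bit encoding

# Per-step brightness ratio if the codes are spread logarithmically:
# each code value is s times brighter than the previous one.
s = 10 ** (orders / codes)
print(f"step ratio s = {s:.4f}")    # ~1.0113, i.e. ~1.1% per code
print(f"s**1024 = {s ** codes:.3g}")  # recovers the full 1e5 range
```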
It depends on the standard or even the manufacturer how sigmoid-like the curve is. Rec709 (the current TV/HDTV standard – Rec2020 does not have wide adoption yet) is almost a pure power function with very limited dynamic range, whereas the ACES or Arri “Rec709” display functions are aggressively tone-mapped (s-curved) and can display much more range.
There’s also the consideration of whether we’re discussing scene-referred or display-referred images (i.e. whether we’re discussing capturing actual light levels in a scene, or describing light levels we’d like a monitor to display), which is a whole kettle of fish on its own. Scene-referred images are most of what we deal with in computer graphics, though, so are most relevant here.
The issue with linear (gamma 1.0) integer encoding of images is that the eye is not sensitive to brightness changes in bright areas of the image, but it is sensitive to brightness changes in dark and midtone areas. Much like hearing, we perceive images logarithmically, hence the use of “stops” in imaging circles (each stop is a doubling or halving of light intensity).
With a naive linear integer encoding, the darkest stop (to which we are fairly sensitive) is represented by only a single code value, whereas the brightest stop (which we’re not as sensitive to) is represented by fully half of the code values. We waste half of our storage budget on finely discriminating between brightness changes the viewer can’t perceive.
That’s why the overwhelming majority of images stored in integer formats (like DPX or JPEG) are stored in gamma-corrected linear or logarithmic encodings – it expands the number of code values for the darker areas and limits the number of code values wasted in bright areas. With gamma-corrected linear values, 8 bits is sufficient to represent a pleasing image (usually these are display-referred, e.g. a jpeg on the web, a DVD) intended for display, while with log encoding, 10 bits is sufficient to represent a high-dynamic range film scan or digital image intended for post-processing.
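The waste is easy to count: with an 8-bit linear encoding, each successive (darker) stop gets half as many code values as the one above it. A sketch:

```python
# Count how many 8-bit linear code values fall in each stop,
# walking down from full scale (each stop halves the brightness).
codes_per_stop = []
hi = 255
for stop in range(8):
    lo = hi // 2
    codes_per_stop.append(hi - lo)  # codes in the interval (lo, hi]
    hi = lo

# Brightest stop first: half the budget goes to one stop,
# while the darkest stop here gets a single code value.
print(codes_per_stop)  # [128, 64, 32, 16, 8, 4, 2, 1]
```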
10-bit integer log is still fairly common in postproduction, but floating point is much, much more flexible and is gaining ground. All VFX work uses 16-bit or 32-bit float representations these days.
Anyway, I find this all terribly interesting (I work in VFX), so thanks for listening to me go on.
I’m intrigued that your back-of-the-napkin calculation turned up the same result careful experimentation did – 16 integer bits are required to usefully represent a high-dynamic-range scene-referred image.
Well, sometimes napkins are all we need, especially since the value of an equation doesn’t depend on its size :)
The dynamic (adaptation) range is about 4 to 5 orders of magnitude wide. Those numbers come from various sources, and they all pretty much agree, once you convert from one luminosity unit to another (candelas vs lux vs lumens vs…) and compensate for less-than-optimal tables and graphics. So I get about 16 bits.
I was thinking of “world referred” (probably what you call “scene referred”). Of course, if you consider the limited gamut (brightness and color primaries) of actual screens, 10 or 12 bits is probably quite enough.
Now I’m curious: what experiments did you conduct?
Unfortunately, I didn’t conduct any experiments! It was the folks at Kodak who developed the Cineon format and standardized film scanning on 10-bit log-encoded integer files.
What’s interesting about your back-of-the-napkin math coming up with a similar answer to practical experience is that you’re using the human eye as a starting point, when in fact what the eye can perceive isn’t what’s relevant, only what our synthetic imaging sensors can perceive (densitometers in the case of Cineon files and film scans, CCDs or CMOS chips in the case of digital images).
Some more info on the Cineon format, if you’re interested. It’s a bit of reverse-engineering, but an interesting outline of the format that standardized digital postproduction for quite a while (until 16-bit float EXR became the standard):
Click to access Cineon.pdf
Thanks for the document. I’ll have a look.