Ok, it’s past 2am and I can’t sleep, because I keep thinking about nerd topics. As promised, the first section is for the majority of you…

0. I couldn’t care less – give me the 411

The #secretalbum project is, to a certain degree, to be considered an “audiophile production”. The masters are released as stereo/24bit/48kHz, and make use of the format. So, to get the intended listening experience:
When you download this, get a full-range lossless format. I recommend FLAC.
There are different versions – one with one file per track, and one with the whole release in a single file. The whole-release file is the one you should use for audiophile listening; the per-track files are best for typical MP3 listening.
Also, when listening to it, listen on a high-quality system in quiet surroundings, and take time listening to it – after all, the tracks are short enough.
Now, on to the nerd talk…

1. #secretalbum Audio Format

Throughout this project, the sample rate is 48kHz. Audio files are either mono or stereo. The initial recordings are 24bit; processing and intermediate files (mixes etc.) are 32bit IEEE (float) wherever possible.

1.1. On the Sample Rate

For decades, sample rates were 44.1kHz in audio productions, simply because it’s the sample rate of the CD. However, motion picture applications have always used 48kHz, and interestingly, DAT recorders (the first digital tape systems available to the consumer) usually used that as well.
Nowadays, of course, 96kHz is the new 44.1kHz due to the DVD – so why 48kHz? Why not 44.1kHz, or 96kHz to do it properly?
In this section, I won’t go on at length about the theoretical background of signal theory, the Shannon theorem and the Nyquist frequency, nor will I defend it. All you need to know (if you don’t know it already) is that the usable bandwidth of digital audio is roughly half the sampling frequency (i.e. for 48kHz, it’s up to 24kHz).
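To make that bandwidth limit concrete, here’s a little sketch (in Python with numpy – the tooling choice is mine, nothing in this project requires it): a tone above the Nyquist frequency doesn’t just disappear, it folds back into the audible band as an alias.

```python
import numpy as np

fs = 48_000                          # project sample rate
t = np.arange(fs) / fs               # one second of sample instants
x = np.sin(2 * np.pi * 30_000 * t)   # a 30 kHz tone - above Nyquist (24 kHz)

# Find the strongest frequency actually present in the sampled signal:
spectrum = np.abs(np.fft.rfft(x))
peak = float(np.fft.rfftfreq(fs, 1 / fs)[np.argmax(spectrum)])
print(peak)  # 18000.0 - the tone folded back to 48 - 30 = 18 kHz
```

This is exactly why a resampler needs a good filter before throwing samples away: whatever lives above the new Nyquist frequency ends up somewhere you can hear it.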
When subjected to listening tests with sine waves (i.e. “pure” frequencies), human hearing usually extends up to around 20kHz – at least around birth. For various reasons (abuse of the ears, as well as ageing), this threshold decreases over time, at least for many people. Interestingly though, some RCTs using recordings of music (i.e. not test tones) have shown that a relevant number of people can make out a difference between a 44.1kHz and a 48kHz recording and perceive the latter as “better”.
The choice here was simple: 48kHz is a small effort compared to 44.1kHz (in terms of computing power, storage demands etc.). So I decided to use it.
But why not 96kHz?
In essence, while there hasn’t been proof that 96kHz really sounds better, some experiments suggest that it actually sounds worse, and there’s theory to support that. First of all, the added 24kHz of bandwidth is truly in the “can’t hear a thing” range. However, the audio signal chain will also generate noise in that frequency region – which we can’t hear, but which contains energy and as such takes up headroom (“the bits”). Furthermore, due to nonlinear effects somewhere in the chain (e.g. in your listening system), these inaudible (and unwanted) components can generate frequencies which you can hear.
Add to that at-best-inconclusive evidence the fact that 96kHz doubles the demands on computing power and storage, and the decision against it was made quickly.

1.2. On the Bit Depth

Again, CD (and DAT) was 16bit, as were most early professional digital audio devices. However, we’ve seen a constant shift to 24bit (or even higher) during the last decade, and for good reason:
First of all, for simplicity of calculation, I’ll go with the +6dB = twice as much approximation, which immediately means +1bit = +6dB. Using that approximation, we find the usual 96dB dynamic range for 16bit (CD) and a stunning 144dB for 24bit (e.g. DVD).
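The approximation is easy to check against the exact formula, 20·log10(2^n) – a quick sketch (Python, my choice of illustration):

```python
import math

def dynamic_range_db(bits):
    """Exact dynamic range of an n-bit quantizer: 20*log10(2**bits).
    One bit buys about 6.02 dB, hence the +6dB = +1bit rule of thumb."""
    return 20 * math.log10(2 ** bits)

print(round(dynamic_range_db(16), 1))  # 96.3  - the CD's "96 dB"
print(round(dynamic_range_db(24), 1))  # 144.5 - the "stunning" 24bit figure
```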
The dynamic range of the human ear is about 130dB. Having said that, you’d best not subject your ears to 130dB SPL (sound pressure level – 0dB SPL is defined at the hearing threshold, although some people can hear down to about -3dB).
Now human hearing has some kind of automatic gain control – which means that while standing next to a quartet of percussion drills or sledgehammers, you won’t be able to hear that feather drop to the ground, which you would in completely quiet surroundings.
But there’s another point to be considered, and I’d like to call it the “effective bit depth”. Let’s say you’re using 16bit audio with some very dynamic music (say, a Gustav Mahler symphony). Assume that during a very hot part it has a peak-to-RMS ratio of 12dB (meaning the short spikes, e.g. of a drum hit, are four times as high as the average level during that very loud passage), and that the RMS (average) during a very quiet passage is 24dB softer than in that loud passage – i.e. the quiet RMS is 12+24dB=36dB below maximum. Our maximum is 96dB, which leaves us with 60dB (or 10 bits) of sampling depth during that very quiet passage.
Now, of course, this Mahler symphony isn’t a pop song – even in that passage, there may be a pause where we only hear the trailing decay of a soft timpani hit, a cymbal or a double bass body resonating. Our hearing adjusts to maximum sensitivity – and this is when we’re able to hear the noise, and – more importantly – the nonlinear distortion of the sampling process.
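The back-of-the-envelope numbers above, written out as a small sketch (Python, using the example figures from the text and the 6dB-per-bit approximation):

```python
full_scale_db = 96        # 16 bit at ~6 dB per bit
peak_to_rms_db = 12       # drum spikes four times the loud passage's average
quiet_vs_loud_db = 24     # quiet passage RMS below the loud passage RMS

rms_below_peak_db = peak_to_rms_db + quiet_vs_loud_db    # 36 dB below maximum
effective_bits = (full_scale_db - rms_below_peak_db) // 6
print(effective_bits)     # 10 - bits actually describing the quiet passage
```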
The latter, the distortion, will also be the subject of the section on dither and noiseshaping below, which is essentially a “what’s worse – noise or distortion?” question.
Fortunately, with 24bit, the problem is already solved, simply because 144dB is bigger than 130dB, which means that even considering the ear’s automatic gain control, we have more dynamic bandwidth than the ear.
All in all, 24bit gives us benefits that you (yes, you! Not only the guy with some bat’s ears) can hear, and it also works well with the available technology. And, leaving the added demand of data storage aside, it has no negative side effects.
So what about the 32bit float format?
As with the frequency thingie, there’s always more: 48bit (sometimes called “double precision”), 32bit integer, 64bit integer, 32bit float (“IEEE”), 64bit float (“IEEE double”) and even 128bit (“IEEE quad”).
For the master (i.e. the final product going to you), anything more than 24bit doesn’t make that much sense, mostly because how would you play back a 64bit float file, other than via 24bit DA converters? The same line of argument holds for the recorded files (the stems), and for another reason too: there are only so many devices with an SNR and/or dynamic range in excess of 144dB, and they usually sit in perfectly shielded laboratories with liquid helium cooling.
Now this should also be true for the entire production (i.e. mainly mixing/editing/mastering) process, shouldn’t it?
The first reason why this isn’t the case is the topic of gain staging (or lack thereof). In the world of old, both gear designers and musicians working with electric/electronic instruments alike were versed in this art: if you set the volumes wrong in your guitar stompbox signal chain, then the end result would be noisy or undesirable in another way.
The same is true in the digital world. Let’s say I take my 24bit audio file (which already has 12dB=2bits of headroom) and send it through a compressor with a rather low threshold (e.g. for a parallel compression application) which reduces gain by 24dB, then mix that half and half with the direct signal (meaning -6dB, to be conservative), and later in the mix pull that up by 12dB [all figures picked so they’re 6dB multiples, so it’s easy to calculate].
What happens is this: we start out with effectively 22bits, then after the compressor it’s 18bits, then after the mix with the direct signal it’s down to 17bits, then we pull it back up 2bits – but the resolution is still only 17bits. Now isn’t there a way of working around this, like taking the whole 24 (or 22) bits and just adding a “make 12dB softer” info?
In fact there is, and it’s the 32bit float format. Instead of describing your level with a fixed-point number, say “1234”, you describe it as a mantissa plus an exponent (in that example, and in base ten, “1.234E3”). Now in the fixed-point example, if you divide by 100 and then multiply by 100, you get (because it’s fixed-point) 1234/100=0012 (rounded), and 0012*100=1200. In the floating-point example, it’s 1.234E3*1E-2=1.234E1 and 1.234E1*1E2=1.234E3. There are no rounding errors.
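Here’s that exact example as a tiny Python sketch (the decimal framing is from the text; strictly speaking a binary float can still round when the scale factor isn’t a power of two, but the error then sits at the last mantissa bit instead of eating whole digits):

```python
# Fixed point: four decimal digits, no fraction - rounding loses information.
x = 1234
down = round(x / 100)        # 12   - the ".34" is rounded away
back = down * 100            # 1200 - that information is gone for good
print(back)

# Floating point: mantissa plus exponent, so scaling mostly just
# changes the exponent and the mantissa digits survive.
y = 1234.0
print((y / 100) * 100 == y)  # True - we get 1234.0 back exactly
```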
The second reason has mainly to do with equalizers, and those were also the first (software) applications to feature double precision:  either 48bit (fixed) or 64bit (float). Two examples from this field are the Waves LinEQ (48bit) and the PSP MasterQ (64bit).
Now the calculation here starts with the question of which setting of an EQ we can “hear”. Of course, this depends on a lot of things (such as the bandwidth of the band, the source material and whatnot), but let’s consider this: assume we can hear a difference of 2dB, assume furthermore that we apply a lift with a high shelf, and that it just so happens that less than 5% of the energy is in that shelved band. Calculating to and from logarithms etc., we finally find that the multiplier “beyond one” (i.e. the 1.xxx the signal effectively gets multiplied with for this to happen) sits some 13 to 14 bits “after the comma”. In other words: take our 24bit source (of which 2bits are headroom), first bring it down 12dB, calculate this in 24bit, and the EQ only affects the signal back at the 8-bit level, i.e. Eighties’ chiptune beauty.
Now this very specific realm hasn’t seen proper scientific research with RCTs yet (at least not to my knowledge), and it’s interesting to see that while PSP offers 64bit precision (over Waves’ 48bit), they do not offer a dither functionality at the output. All in all, what should we do? Unfortunately, this is something where the float format only helps to a small degree (namely by ensuring that the -12dB we applied does not “worsen” the signal). On the other hand, several professional DAWs are still perfectly happy with 32bit float (and what’s more, so are many plugins), so the move to “beyond 32” did not happen – at least not for me, at least not now.

1.3. On the Number of Channels (i.e. “Stereo”)

I never worked with surround. But I enjoy stereo. Simple thing.
As a matter of fact, I enjoy stereo so much, that #secretalbum sees the widespread advance of the Secret Moinsound Panning Technique, premiered for A tätowierte Katz‘s drum mix, which I’ll elaborate more on later, but here’s the relevant part:
The stereo sound on #secretalbum is different from most pop/rock recordings – in fact, it’s more like a classical music recording. With that comes the fact that mono compatibility is not perfect. It will work really fine on 2.1 setups, and it will work well with mono playback – only not as well as in stereo.

2. From the Production to the Audient (that’s you)

We already described (and explained) that (and why) the audio format at the end of the production, right until the second-to-last step of the mastering stage, is 32/48/2, and also that the “32” won’t be of that much use for you.
A typical “hi-end” listening situation for you might thus be 24/48, which will typically be available if you play back what you get from me on your computer or a contemporary media station. There are however other situations where this is not possible. One is the standard “burn to a CD” use case. CD audio is 16/44.1, so you need to convert the audio to that format (which many CD burning applications will do automatically). Another (perhaps more relevant) use case is MP3 – which typically also uses “CD quality”, i.e. 16/44.1. Yes, later revisions of the MP3 standard support higher sample rates and bit depths, but a lot of players (and thus, also encoders) only use 16/44.1 – I know that bandcamp does.
So that means that at some point, someone (either you or bandcamp) might take my supplied 24/48 masters and convert them to 16/44.1. How bandcamp does that I don’t know (nor do they go out of their way to explain), but the rules of business may let us assume they do it in a way which is less than the possible optimum.
So essentially what someone must do at one point is three things (in that order):
a) convert sample rate (48kHz to 44.1kHz),
b) convert bit depth (24bit to 16bit),
c) encode to MP3 (optional).
Let’s start with c) to get it out of the way: the process of encoding MP3 files from a 16/44.1/2 PCM source is well understood, high-quality implementations are widely available both commercially and for free, and using them is a no-brainer. So we really shouldn’t care whether it’s me, bandcamp or you who does that (unless you’re using a really old and crappy piece of software).
This part must be the last of the three, because it takes a 16/44.1 source as input.
Part a) is slightly more interesting: the sample rate conversion. Without going into too many details, I’d like to state that while this is also a no-brainer, and very good solutions are available (by “very good” I mean they don’t affect the audio other than taking the top frequencies away), they are not the norm. What’s more, a top-quality implementation is 1) not the norm in more basic software (like the software you might use to burn a CD) and 2) very computation-intensive (which again, by the rules of business, may steer bandcamp towards using “nearly top quality”).
So why is it computation-intensive? The trickiest part here is a filter. The most computationally efficient approach for this use case (fixed, rational sample rates) is to first oversample to a common multiple of both frequencies (which, due to the numbers being ugly, ends up at something like 7.056MHz), and then keep only every 160th sample so we arrive at 44.1kHz.
Wait, we forgot something! Between the oversampling and the throw-away step, a filter is required, and this takes power, because it should not be audible. So we best make it rather steep, with constant group delay (colloquially “phase linear”), and as we can get a lot of steepness with enough filter taps (at the expense of more computing power), we optimize it to have minimal ripple in the passband (a maximally flat approximation).
All in all, not that much trickery, but it requires computing power…
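Just to show where those odd numbers come from, a small sketch (Python; real resamplers such as scipy’s `resample_poly` do this with polyphase filters instead of literally computing a 7MHz signal):

```python
from math import gcd

src, dst = 48_000, 44_100
g = gcd(src, dst)              # 300, the largest common divisor
up = dst // g                  # 147: oversample the 48 kHz stream by this...
down = src // g                # 160: ...then keep only every 160th sample
intermediate = src * up        # the "ugly" common multiple of both rates
print(up, down, intermediate)  # 147 160 7056000  (the 7.056 MHz from above)
```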
Now we’re only left with b), the bit depth conversion. A simple approach to reducing bit depth would be to just throw away the excess eight bits. But this is considered “not good”. Why? Enter the arena of Dither and Noiseshaping…

2.1. Dither and Noiseshape

Above, I already briefly mentioned this topic, which will take us deep into nerd territory. Generally, this is something you can do when converting to a lower bit depth (which obviously includes analog-to-digital conversion). To not make things too complicated, I’d like to start with two introductory statements.
1. When converting to a 16bit format, this should be the last thing in your chain.
2. The necessity for it when converting to 24bit is debatable.
So what is this dither thing?
Without dither, when converting to a lower bit depth, the signal is simply either rounded or just truncated. That means everything in between those counts gets lost, and we get the staircase kind of signal shape (the signal does not leave the DA converter at the end in that staircase form, but this way of thinking helps understanding). To give an example, I’d like to use the analogy of fixed-point decimal numbers. Assume we have a constant series of 0.7 values, say ten in a row. By rounding them to whole numbers, we get a series of ten 1s. In a second example with a series of ten times 0.9, we get the same, namely ten times 1.
Now assume we add some small noise in the range -0.5…+0.4 before we round. Statistically speaking, we’d add -0.5 one time out of ten, -0.4 one time out of ten, and so on up to +0.4 one time out of ten. This means that in the first example, the result would be a total of three times 0 and seven times 1. In the second example, it would be one time 0 and nine times 1. In other words, we got that seemingly lost information back – those two different signals, which without our trick resulted in the same digitized representation, now differ (and allow us to extract the difference in the input signals).
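That thought experiment can be replayed exactly as a sketch (Python; I’m working in integer tenths to keep the arithmetic exact, and rounding halves upward as in the example above):

```python
def quantize(tenths):
    """Round a value given in tenths to the nearest whole number (half up)."""
    return (tenths + 5) // 10

dither = list(range(-5, 5))    # the noise values -0.5 ... +0.4, in tenths

for signal in (7, 9):          # the 0.7 and 0.9 examples, in tenths
    plain = sum(quantize(signal) for _ in dither)        # ten 1s either way
    dithered = sum(quantize(signal + d) for d in dither)
    print(signal / 10, plain, dithered)
# 0.7 -> plain sum 10, dithered sum 7: the average recovers the 0.7
# 0.9 -> plain sum 10, dithered sum 9
```

Without dither both inputs quantize to the identical all-ones sequence; with dither, the averages differ again.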
Yes, a constant input signal is hardly an interesting one for music applications, but the general goal here (and one working not only in theory and in special cases, but very generally) is that by applying that noise before rounding, you can reduce nonlinearities (read: distortion!) and seemingly make things visible which were lost in rounding before.
Of course, there’s no free beer in signal theory: we bought that reduced distortion with added noise. And that is one of the main statements in the use of dither:
Using dither is a tradeoff between distortion and noise.
The choice in the tradeoff here has been made for less noise (and thus, slightly more distortion). This, however, doesn’t matter that much if you’re using the 24bit files.  

2.2. On Track Spacing

A considerable amount of time is invested in the mastering process into track spacing – the pause (or lack thereof) between the individual tracks on a release. However, depending on your playback situation, there’s a problem:

Life was easy in the days of vinyl and CD: the mastering engineer prepared the master, it was put onto the carrier, and it was ensured that the track gaps were reproduced exactly like the mastering guy intended them to be. Today, however, that’s different: if you have a typical digital download release, this is a collection of individual files; one per track. The player you use will put a pause of varying length between those. In a few tests, I found them to be between about 20ms (which is in the “hardly noticeable” range) up to more than 2.2 seconds.

So, what to do if the actual track spacing is considered important by the artist?

There are actually two possibilities: one is a monolithic file with all the tracks in it, perfectly sequenced – with the disadvantage that the listener can’t easily jump to one track at the push of a button. The other is individual files as usual (perhaps with a slightly shortened gap to compensate for what the player does), but that puts the listening experience at the player’s mercy.

Of course, for the audiophile nerds who still want the comfort of being able to jump to an individual track: by providing a replication sheet, they get the timing info to enable their player to do just that. Unfortunately, there’s no agreed-on automated format for that which is compatible with bandcamp’s infrastructure.

2.3. Two Different Versions

So, we have two important steps for getting from the actual sonic output at Moinsound to you. And I’d like you to have the end result in a way that best suits your requirements. For that reason, the #secretalbum releases contain two different versions, both of which are included in the download:

Version 1: “The Normal Version”

This is the version which appears in the bandcamp player page. Its properties:

  • audio has been prepared for 16bit, 44.1kHz
  • all tracks are individual files
  • track gaps have been included in the files, but shortened by 30ms each.

This is the version you’ll most probably like to use for the MP3 player use case.

Version 2: “The Audiophile Version”

This is included as a “hidden track” in the download. Its properties:

  • audio has been prepared for 24bit, 48kHz
  • one monolithic file for the entire release
  • a replication sheet is included if needed

This is the version for audiophile listening. Go use it!
