2. Training DeepClean
The process DeepClean is tasked with modelling is fundamentally a sequence-to-sequence problem: given a timeseries of witness channel observations, what is the timeseries of noise I expect to observe in the gravitational wave strain channel? Unfortunately for us, we never get to observe that noise timeseries directly: all we have are the witness channels and the strain itself, signal and noise mixed together. For if we knew which parts of the observed strain were caused by noise and which parts were caused by honest-to-goodness gravitational wave events, we wouldn't need to be here in the first place.
The DeepClean training procedure gets around this by taking advantage of a simple physical fact that most noise regression algorithms rely on: the witness channels are only correlated, or coupled, with the noisy component of the strain channel. By the time they reach Earth, gravitational waves are so small that their effect on the environmental conditions that our witness sensors monitor is more or less negligible. Therefore, if we try to regress from the witness channels to the strain channel, our model will only be able to encode information about the noise component of the strain channel, leaving the astrophysical signal untouched (if it's trained well anyway).
Of course, I don't need the entire (for our purposes infinite) timeseries of witnesses to make predictions right now: presumably there is some relatively small window over which witnessed samples are informative about the observed noise. So DeepClean takes as input fixed length kernels of data, amounting to small snapshots of the timeseries at different moments in time, and attempts to predict the noise observed at the same snapshot in the strain channel. How long of a kernel you need, and whether that kernel needs to be the same size during both training and inference, is an open question that we'll address in more detail in a bit. But you might imagine that it depends pretty heavily on what kind of information is contained in your witness channels (though there are other considerations that will become clear shortly).
So we need a network architecture that can efficiently model short-term correlations in a longer-term timeseries. The architecture-of-choice for the moment for achieving this is a 1-dimensional convolutional autoencoder, which uses 1D convolutions to map the kernels of witness timeseries to a latent vector, then transposed convolutions to map that latent vector back to a single kernel of the same length as the inputs. This architecture has the benefit of being invariant with respect to the length of the input kernel: it encodes purely local information between the output and the inputs, and so (down to a limit) we can feed in kernels of any size and not lose any performance. This is good for us because, for reasons that I promise I'll get to, it's often useful to use kernel lengths at training time that are much longer than those we might need during inference. That said, there are good reasons to believe DeepClean might be too local in the information it encodes (see the [open questions section](no link yet) for more). However, it seems to perform well enough on the problem at hand that we're willing to live with it for now.
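To make that concrete, here's a minimal sketch (not the production DeepClean model; the layer sizes, activations, and witness count are made-up examples) of what such a 1D convolutional autoencoder looks like in PyTorch, including a quick check that the same weights happily accept kernels of different lengths:

```python
import torch
from torch import nn


class ConvAutoencoder(nn.Module):
    """Toy 1D convolutional autoencoder in the spirit described above."""

    def __init__(self, num_witnesses: int, hidden: int = 32):
        super().__init__()
        # strided convolutions map the multichannel witness kernel down
        # to a more compact representation...
        self.encoder = nn.Sequential(
            nn.Conv1d(num_witnesses, hidden, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm1d(hidden),
            nn.Tanh(),
            nn.Conv1d(hidden, 2 * hidden, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm1d(2 * hidden),
            nn.Tanh(),
        )
        # ...and transposed convolutions map it back up to a single-channel
        # noise prediction with the same length as the input kernel
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(2 * hidden, hidden, kernel_size=7, stride=2,
                               padding=3, output_padding=1),
            nn.BatchNorm1d(hidden),
            nn.Tanh(),
            nn.ConvTranspose1d(hidden, 1, kernel_size=7, stride=2,
                               padding=3, output_padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_witnesses, kernel_length) -> (batch, kernel_length)
        return self.decoder(self.encoder(x)).squeeze(1)


# the same model handles training-length and inference-length kernels,
# as long as the length is divisible by the total stride (4 here)
model = ConvAutoencoder(num_witnesses=21)
long_kernel = torch.randn(8, 21, 8 * 4096)   # e.g. 8 s kernels at 4096 Hz
short_kernel = torch.randn(8, 21, 4096)      # e.g. 1 s kernels at 4096 Hz
assert model(long_kernel).shape == (8, 8 * 4096)
assert model(short_kernel).shape == (8, 4096)
```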
While we've already discussed what the timeseries DeepClean operates on represent, there still remains the practical question of which part of this timeseries is used to train it, and how DeepClean ingests it. DeepClean can be trained on any stretch of time with more than a few hundred seconds of valid data to learn from. The longer the stretch, the more data DeepClean has to learn from, but in return the longer training takes and the more memory is required. We typically find that for our application ~2000s of data (training and validation) is sufficient to get DeepClean to fit well in a reasonable amount of time, while allowing us to keep all of the data in GPU memory to make data loading faster. Typically in practice, data from the requested stretch is read from NDS2 archives and cached locally for later use, with some fraction of the end of the segment reserved for validation.
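As a rough sketch of what that looks like (the channel names, GPS times, and split fraction below are illustrative placeholders, not the project's actual data-loading code):

```python
from gwpy.timeseries import TimeSeriesDict

# placeholder channel names: one strain channel plus a witness or two
channels = [
    "H1:GDS-CALIB_STRAIN",
    "H1:PEM-CS_MAINSMON_EBAY_1_DQ",
]
start, duration = 1250916945, 2000  # placeholder GPS start time, ~2000 s of data
valid_frac = 0.25                   # fraction of the end reserved for validation

# pull the stretch from the NDS2 archives, then cache it locally for reuse
data = TimeSeriesDict.fetch(channels, start, start + duration)
data.write("train-data.h5", overwrite=True)

# everything before the split trains the network, everything after validates it
split = start + int((1 - valid_frac) * duration)
train = {name: ts.crop(start, split) for name, ts in data.items()}
valid = {name: ts.crop(split, start + duration) for name, ts in data.items()}
```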
Once the data is loaded into memory, all channels are resampled to the same rate so that we can form the nice, even, rectangular tensors that deep learning libraries love so much. Each channel is normalized to have 0 mean and unit variance (see the discussion in issue #20), and filtering is applied to the strain channel to optimize the network on just those frequency components believed to be witnessed by the... well, witnesses (if I were trying to sound prestigious I would describe this filtering step as "using physical priors to regularize optimization").
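Sketching those preprocessing steps with scipy stand-ins (the 55-65 Hz band here is just an illustrative choice, e.g. for cleaning around the 60 Hz mains line; the real pipeline has its own implementation and configurable band edges):

```python
import numpy as np
from scipy.signal import butter, resample, sosfiltfilt

sample_rate = 4096  # common rate every channel gets resampled to (example value)


def preprocess_witness(x: np.ndarray, fs: float) -> np.ndarray:
    """Resample to the common rate and standardize to zero mean, unit variance."""
    x = resample(x, int(len(x) * sample_rate / fs))
    # in practice you'd hold on to the mean/std so the same transform can be
    # applied (and undone) consistently at cleaning time
    return (x - x.mean()) / x.std()


def preprocess_strain(h: np.ndarray, fs: float, band=(55.0, 65.0)) -> np.ndarray:
    """Resample and standardize, then bandpass to the witnessed frequency band."""
    h = preprocess_witness(h, fs)
    sos = butter(8, band, btype="bandpass", output="sos", fs=sample_rate)
    return sosfiltfilt(sos, h)
```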
At data loading time, kernels are sampled from the in-memory timeseries at some fixed stride (i.e. not every possible snapshot will end up as an input kernel, unless the stride is 1). How large of a stride to use is another tradeoff between training time and quality of fit. Smaller strides create more "samples" for the network to learn from, but at small strides those samples are highly redundant, so past some point we're just increasing processing time without adding new information. Understanding this tradeoff more thoroughly is one of the important open questions we'd like to address.
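A minimal sketch of this kind of strided sampling, using torch's `unfold` (the kernel length and stride values are just examples):

```python
import torch


def sample_kernels(X: torch.Tensor, y: torch.Tensor, size: int, stride: int):
    """Slice witnesses (num_witnesses, T) and strain (T,) into aligned kernels.

    Returns views of shape (num_kernels, num_witnesses, size) and
    (num_kernels, size); with stride=1 every possible snapshot shows up.
    """
    witness_kernels = X.unfold(1, size, stride).transpose(0, 1)
    strain_kernels = y.unfold(0, size, stride)
    return witness_kernels, strain_kernels


# e.g. 128 s of data at 4096 Hz, sliced into 8 s kernels every 0.25 s
X = torch.randn(21, 128 * 4096)
y = torch.randn(128 * 4096)
kernels, targets = sample_kernels(X, y, size=8 * 4096, stride=1024)
assert kernels.shape[0] == targets.shape[0]  # one strain target per witness kernel
```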
Another important open question around this sampling scheme is that at present, this sampling happens sequentially. As noted in the linked issue, this is strange enough that it warrants serious study on its own.
So we have a network architecture, some data to optimize it on, a rough idea of what we'd like to optimize it to do ("produce good estimates of witnessed noise in the strain channel"), and a rough idea of how we'd like it to do that ("reconstruct the strain channel from the witness channels").
The obvious first choice for any regression task is to optimize the mean squared error (MSE) between the raw strain at each timestep h[t] and DeepClean's noise estimate at each timestep h_hat[t]:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{T}\sum_{t=1}^{T}\left(h_i[t] - \hat{h}_i[t]\right)^2$$

where i indexes over the samples we're measuring on (training batch, validation set, whatever) and t runs over the timesteps in each kernel.
You might hear this difference h[t] - h_hat[t] referred to as the residual, and denoted as r[t].
In this notation then, the loss function is

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{T}\sum_{t=1}^{T} r_i[t]^2$$
But is this how we're going to measure DeepClean's success in production? Will we report DeepClean's validation MSE to researchers performing astrophysical searches and expect them to be able to map that to some meaningful increase in the sensitivity of their searches? If not, and if there does exist a scalar quantity of physical significance that we can estimate via differentiable functions, why don't we use that as our loss function and optimize it directly via gradient descent?
Since gravitational wave events tend to reveal themselves via their frequency signatures, IGWN scientists tend to think of signals in terms of their spectrum. This suggests that one good candidate for a scalar quantity to optimize is the average ASD ratio (ASDR), the ratio between the amplitude spectral density of the cleaned strain (the raw strain minus DeepClean's noise estimate) and that of the raw strain in frequency space.
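In the notation from above, where the cleaned strain is just the residual r = h - h_hat, this looks something like (the average runs over the frequency bins in whatever band we actually care about cleaning; treat the exact form as a sketch rather than a transcription of the training code):

$$\mathcal{L}_{\mathrm{ASDR}} = \frac{1}{N_f}\sum_{f}\frac{\mathrm{ASD}[r](f)}{\mathrm{ASD}[h](f)}$$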
This communicates the factor by which DeepClean removes noisy content from the spectrum, allowing signal content to poke through more strongly. Considered from a purely machine learning perspective, you might think of this loss function as a mean absolute percent error in frequency space. Considered from another physical perspective (and the one that initially motivated its use), you can think of this loss function as downweighting each frequency component's contribution to the MSE by its power in the raw spectrum. Because different noise sources can contribute to the witnessed noise by different orders of magnitude, this has the effect of optimizing more evenly over all the relevant frequency components, not just those with particularly strong signatures in the raw strain.
In practice, this loss function is calculated by using a Torch implementation of the Welch transform to estimate the power spectral density (whether you optimize the PSD or ASD is an input parameter to the training function). This, finally, is why it's useful to use longer kernels during training than might be necessary or practical at inference time: longer kernels mean we can take longer FFTs and more FFTs to estimate this PSD, giving us better frequency resolution and more stable PSD estimates respectively. I don't think we want to be taking FFTs much shorter than we're doing now (we're using 2 seconds, which amounts to a frequency resolution of 1Hz), but how many of these we need in order to get good enough PSD estimates to learn from is a good open question.
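To make the tradeoff concrete, here's a minimal Welch PSD estimate written directly in torch (a stand-in for the implementation actually used in training, not a copy of it): the kernel length caps both how long each FFT segment can be (frequency resolution) and how many segments get averaged (stability of the estimate).

```python
import torch


def welch_psd(x: torch.Tensor, fs: float, fftlength: float = 2.0, overlap: float = 1.0):
    """Welch PSD estimate for a batch of kernels x with shape (batch, time)."""
    nperseg = int(fftlength * fs)
    stride = int((fftlength - overlap) * fs)
    window = torch.hann_window(nperseg, dtype=x.dtype)

    # split each kernel into overlapping, windowed segments: (batch, nsegs, nperseg)
    segments = x.unfold(-1, nperseg, stride) * window

    # periodogram of each segment, with the standard density scaling
    scale = 1.0 / (fs * (window ** 2).sum())
    psd = (torch.fft.rfft(segments, dim=-1).abs() ** 2) * scale

    # fold in negative frequencies (double everything but DC and Nyquist),
    # then average the periodograms; bin spacing is 1 / fftlength Hz
    psd = torch.cat([psd[..., :1], 2 * psd[..., 1:-1], psd[..., -1:]], dim=-1)
    return psd.mean(dim=-2)


# 8 s kernels at 4096 Hz with 2 s FFTs -> 7 half-overlapping segments per kernel;
# a longer kernel buys either longer FFTs or more averages (or both)
x = torch.randn(32, 8 * 4096)
psd = welch_psd(x, fs=4096)
```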
All of this said, what we would like the network to learn and what it is easy for the network to learn are not always one and the same. So even though we might measure and report ASDR to validate and publicize DeepClean, it may turn out that including an MSE term in the cost function makes good ASDR solutions easier for the network to learn. For this reason, we parameterize the loss function by a parameter α which regulates how much optimization is dominated by the MSE or ASDR loss.
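One natural way to write that down (whether α sits in front of the ASDR term or the MSE term is a convention worth double-checking against the training code; take this as a sketch) is as a convex combination of the two:

$$\mathcal{L} = \alpha\,\mathcal{L}_{\mathrm{ASDR}} + (1 - \alpha)\,\mathcal{L}_{\mathrm{MSE}}$$

so that α = 1 optimizes ASDR alone and α = 0 falls back to plain MSE.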
There are of course other metrics of interest that we might use to benchmark the production efficacy of DeepClean, some of which we could never take gradients with respect to, and some of which we might. The equation SenseMonitor uses to estimate the average distance at which a binary inspiral with certain properties can be detected (equation (1) in the linked document) suggests a different loss function that might be able to optimize this distance directly (another open question).