Statistical Challenges in Modern Astronomy VIII

The conference Statistical Challenges in Modern Astronomy VIII (SCMA) at Penn State University, 12-16 June 2023, was a great experience! As a statistician I learnt a lot about Astrophysics, but I was also interested to see what statistical methods are being used by scientists in practice. In this blog post, I try to outline some of the challenges in modern astronomy and give an overview of some of the methods used to solve them.

What are the statistical challenges in modern astronomy?

Lots of Data

Astronomers are preparing for a number of astronomical surveys that are launching in the next few years. One of these is the Legacy Survey of Space and Time, conducted by the Rubin Observatory. This survey will produce about 20 terabytes of data every night for ten years! Whilst it’s great to have such large quantities of data, it poses computational problems.

You may want to repeat the same analysis on each data-set. Then the more data-sets you have, the greater the cost of doing all the computation. James Buchanan presented work on fitting galaxy models using MCMC for inference. He was able to use cluster computing to fit roughly 100,000 models using over 1000 CPU cores. Not all researchers have access to this much computing power, but James Buchanan works at the Lawrence Livermore National Laboratory, which has world-class computing facilities. Wang et al. (2023) also look at the problem of fitting a particular model to each galaxy that will be observed by the Rubin Observatory. They calculate that so many galaxies will be observed, that for their model, fitting it to each galaxy using traditional methods, such as MCMC (Goodman and Weare, 2010) or Nested Sampling (Skilling, 2004), will require 100 billion CPU hours. They propose using Simulation-Based Inference instead, to achieve massive speed-ups. The Rubin Observatory.
In a Bayesian hierarchical model, the number of model parameters scales with the size of the data, making inference increasingly difficult. For example, Karchev et al. (2022) aim to use data on \(10^5\) supernovae to infer two cosmological parameters. However, the hierarchical model that they use has three latent variables per supernovae, so the model has over \(3 \times 10^5\) parameters in total. This makes Bayesian inference very difficult. Their proposed solution is based on Simulation-Based Inference. SN 1994D: a Type Ia Supernovae (shown in the lower left of the image), captured by the Hubble Space Telescope in 1994.

Complex Models

Accounting for instrument effects. Usually, one must not only model the mechanism by which some quantity (such as light) was produced, but also how it was observed. This is certainly the case within my own research, where we must allow for all the ways that instrument effects have affected the observed data. Kyle Cranmer, who worked at the Large Hadron Collider and contributed to the discovery of the Higgs boson in 2012, made this point in his talk on Simulation-Based Inference. A similar problem arises in particle physics; the detectors for particle physics processes are incredibly complex and can be modelled with stochastic simulations with billions of latent variables (Cranmer et al., 2020).
Complicated, High-Dimensional Data. Nguyen et al., (2022) provides an interesting example of an astronomical model of complicated, high-dimensional data. Nguyen et al., (2022) aims to learn about dark matter by studying Dwarf galaxies (small, galaxies dominated by dark matter). The data are the kinematics of stars in dwarf galaxies. This poses a statistical challenge, because the data may consist of tens of thousands of galaxies, with galaxies consisting of different numbers of stars (100 on average) and each star being represented by 6 co-ordinates (position and velocity). Nguyen et al., (2022) use Simulation-Based Inference. An interesting component of their method is that they use a Graph Neural Network to automatically extract 128 features to represent each galaxy. Thus each galaxy is compressed to be represented by a summary statistic of fixed dimension. Dwarf irregular galaxy IC1613.
Transdimensional Models. Jeffrey Regier presented a good example of a model for which inference is complex. The motivation for the work is that there will be a greater density of light sources in upcoming astronomical surveys. This is because these surveys will look deeper into space, so more light sources will overlap. These “blends” of overlapping light sources create ambiguity in the interpretation of the data. Hansen et al. (2022) present a model (and inference method) for inferring the number of stars and galaxies in an image, their locations, fluxes and shapes. The mathematical formulation of the model is rather beautifully simple (it may be expressed in 5 equations). However the model is complex from the inference point of view, because the inference problem is transdimensional. That is, the total number of latent variables depends on the number of sources in the image, which is itself a latent variable. Variational Inference is used to to infer the model’s parameters. Jeffrey Regier found that in comparison to inference via MCMC, Variational Inference was more accurate and 10,000 times faster.

What are some solutions?

Of the many technical developments discussed at the conference, I will mention 5 key ideas:

Simulation-Based Inference (SBI)
Variational Inference (VI)
Normalizing Flows (NFs)
Neural Networks (NNs)
Parallel computing

These ideas are interconnected. SBI and VI are methods for fast inference. NFs may be used for density estimation, which makes them applicable to SBI and VI. A NF is made of a sequence of transformations; the parameters of each transformation may be defined by a NN. Furthermore, data compression may be needed for SBI; NNs may be used for data compression. Finally, parallel computing is vital to training and evaluating NNs and hence is applicable to NFs, VI and SBI.

Simulation-Based Inference

SBI refers to a class of methods for Bayesian inference when the likelihood is intractable but simulating data from the model is possible. See Lueckmann et al. (2021) for a concise overview of four categories of SBI methods:

Monte Carlo ABC
Likelihood Estimation
Posterior Estimation
Ratio Estimation

SBI was prominent at SCMA; I’ve already mentioned talks by Wang, Karchev, Cranmer and Nguyen on SBI, but there were also talks on the subject by Justin Alsing and Ann Lee. The work presented by Ann Lee (Masserano et al., 2023) was in part motivated by Hermans et al., (2022), which gives empirical evidence that SBI methods can produce overconfident posterior approximations. Masserano et al., (2023) provide a method that may be used to correct overconfident posterior regions.

Variational Inference

VI (Jordan et al., 1999) approximates the posterior distribution by using optimisation to find a member of a family of probability distributions that is close (in terms of Kullback-Leibler divergence) to the posterior. See Blei et al., 2017 for a review. Early work on VI uses a mean-field variational family, which means that the distribution used to approximate the posterior distribution is a product of mutually independent distributions. This is of limited use, since it is then impossible to learn anything about the correlation structure of the target posterior distribution. However, more recent research uses as the variational family the rich class of distributions defined by NFs (Rezende and Mohamed, 2015, Kingma et al., 2016). Using this class of distributions as the variational family allows for far greater flexibility.

Normalizing Flows

Quoting Rezende and Mohamed (2015): “A normalizing flow describes the transformation of a probability density through a sequence of invertible mappings. By repeatedly applying the rule for change of variables, the initial density ‘flows’ through the sequence of invertible mappings. At the end of this sequence we obtain a valid probability distribution and hence this type of flow is referred to as a normalizing flow”. NFs may be used for both SBI (Papamakarios et al., 2019) and VI (Rezende and Mohamed, 2015, Kingma et al., 2016). In addition, Kaze Wong presented work on using NFs to speed-up MCMC sampling (Wong et al., 2023). NNs may be used to define the parameters of each transformation making up the flow. Thus, training and evaluating a NF may be sped up by using a GPU.

Here is an example of a NN being used to define a NF, which is used for SBI. A Masked Autoencoder for Distribution Estimation (MADE) (Germain et al., 2015) is a type of feedforward NN. In their proposed SBI method, Papamakarios et al., 2019 use a conditional Masked Autoregressive Flow (Papamakarios et al., 2017), which consists of a stack of MADEs.

Neural Networks

As mentioned, NNs may be used to define the parameters of each transformation making up a NF. This helps in making the NF flexible.

NNs may also be used for data compression, which is often needed for SBI. For example, the SBI method Neural Likelihood (see for instance Papamakarios et al., 2019) uses density estimation to estimate the joint distribution of the data and model parameters. Generally, density estimation is harder in higher dimensional spaces. Therefore, compressing the data to have a smaller dimension will make this density estimation step easier. An application of data compression using NNs may be found in Nguyen et al. (2023). The authors have data that is of variable dimension, so they use a graph NN to compress each data point down to a summary statistic with 128 features.

Parallel Computing

Parallel computing may refer to using a computing cluster (made of hundreds or thousands of CPU cores) or using a GPU (which may have thousands of smaller cores). It’s also possible to have clusters of GPUs. Parallel computing will only benefits algorithms that may be parallelized! A weakness of MCMC is that it is fundamentally a sequential algorithm: states of the Markov chain must be generated one after the other. A key idea I took away from the conference is that parallelizable algorithms have a massive advantage. Furthermore, advances in computer software (such as JAX) make using parallel computing easier.

Conclusion

Vast quantities of data and complex models pose computational challenges for astronomers. New statistical methods are being developed and used to handle these challenges.

References

Cranmer, K., Brehmer, J., and Louppe, G. (2020). The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062.

Germain, M., Gregor, K., Murray, I., and Larochelle, H. (2015). MADE: masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR: W&CP, pages 881–889.

Goodman, J. and Weare, J. (2010). Ensemble samplers with affine invariance. Communications in applied mathematics and computational science, 5(1):65–80.

Hansen, D., Mendoza, I., Liu, R., Pang, Z., Zhao, Z., Avestruz, C., and Regier, J. (2022). Scalable Bayesian Inference for Detection and Deblending in Astronomical Images.

Hermans, J., Delaunoy, A., Rozet, F., Wehenkel, A., Begy, V., and Louppe, G. (2022). A Crisis In Simulation-Based Inference? Beware, Your Posterior Approximations Can Be Unfaithful. Transactions on Machine Learning Research.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An Introduction to Variational Methods for Graphical Models. Machine learning, 37:183–233.

Karchev, K., Trotta, R., and Weniger, C. (2022). SICRET: Supernova Ia Cosmology with truncated marginal neural Ratio EsTimation. Monthly Notices of the Royal Astronomical Society, 520(1):1056–1072.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved Variational Inference with Inverse Autoregressive Flow. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Lueckmann, J.-M., Boelts, J., Greenberg, D., Goncalves, P., and Macke, J. (2021). Benchmarking Simulation-Based Inference. In Banerjee, A. and Fukumizu, K., editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 343–351. PMLR

Masserano, L., Dorigo, T., Izbicki, R., Kuusela, M., and Lee, A. B. (2023). Simulation-Based Inference with Waldo: Confidence Regions by Leveraging Prediction Algorithms or Posterior Estimators for Inverse Problems.

Nguyen, T., Mishra-Sharma, S., Williams, R., and Necib, L. (2023). Uncovering dark matter density profiles in dwarf galaxies with graph neural networks. Physical Review D, 107(4).

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. (2021). Normalizing Flows for Probabilistic Modeling and Inference. Journal of Machine Learning Research, 22(57):1–64

Papamakarios, G., Pavlakou, T., and Murray, I. (2017). Masked Autoregressive Flow for Density Estimation. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Papamakarios, G., Sterratt, D., and Murray, I. (2019). Sequential Neural Likelihood: Fast Likelihood-free Inference with Autoregressive Flows. In Chaudhuri, K. and Sugiyama, M., editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 837–848. PMLR.

Rezende, D. and Mohamed, S. (2015). Variational Inference with Normalizing Flows. In Bach, F. and Blei, D., editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France. PMLR.

Skilling, J. (2004). Nested Sampling. AIP Conference Proceedings, 735(1):395–405.

Wang, B., Leja, J., Villar, V. A., and Speagle, J. S. (2023). SBI++: Flexible, Ultra-fast Likelihood-free Inference Customized for Astronomical Applications. The Astrophysical Journal Letters, 952(1):L10

Wong, K., Gabrié, M., and Foreman-Mackey, D. (2023). flowMC: Normalizing flow enhanced sampling package for probabilistic inference in JAX. Journal of Open Source Software, 8(83):5021.