Machine Learning Techniques for Data Reduction/Compression

This work was partially supported by DOE RAPIDS2 DE-SC0021320 and DOE DE-SC0022265

Scientific applications in high energy physics, nuclear physics, radio astronomy, and light sources generate large volumes of data at high velocity, increasingly outpacing growth in computing power and in network and storage bandwidth and capacity. This trend will only intensify at next-generation experimental and observational facilities, making data reduction or compression an essential stage of future computing systems. Because reduction approaches must integrate seamlessly into the data management workflows and storage systems at leading computing facilities, they should support variable levels of reduction across a diverse range of hardware with different memory and storage constraints. Scientists are principally interested in downstream quantities, called Quantities of Interest (QoI), that are derived from the raw data. It is therefore important that reduction methods quantify their impact not only on the primary data (PD) but also on the QoI. The ability to bound these errors with realistic numerical guarantees is essential if scientists are to apply data reduction with confidence.

Guaranteed Autoencoders: We have developed a new Guaranteed Autoencoder (GAE) that provides error guarantees on all instances [3,11]. Our approach is based on a key observation that neural networks with piecewise linear activation units (PLUs) can be represented as instance-specific linear operators. Experimental results on scientific and image datasets show that the GAE can satisfy user-provided error bounds at an instance level while achieving high compression. We have developed novel post-processing algorithms for both linear [6] and nonlinear constraint satisfaction [2,4] that can be applied to the outputs of neural networks and other data compression approaches.
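As a minimal illustration of the idea of instance-level guarantees (a generic residual-quantization sketch, not the published GAE algorithm), one can post-process any lossy reconstruction so that a user-specified pointwise error bound tau holds on every instance: quantize the residual with bin width 2*tau, so each stored correction is off by at most tau, and add it back.

```python
import numpy as np

def guarantee_error_bound(x, x_hat, tau):
    """Post-process a lossy reconstruction x_hat so the corrected output
    is within tau of x at every point. Residuals are uniformly quantized
    with bin width 2*tau, so the quantization error is at most tau."""
    residual = x - x_hat
    # Integer codes; mostly small values when x_hat is accurate,
    # so they entropy-code compactly.
    q = np.round(residual / (2 * tau)).astype(np.int64)
    return x_hat + q * (2 * tau), q

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                      # original data
x_hat = x + rng.normal(scale=0.05, size=1000)  # stand-in for a decoder output
x_corr, codes = guarantee_error_bound(x, x_hat, tau=0.01)
```

Note that the pointwise bound holds regardless of how poor the learned reconstruction is; the quality of the autoencoder only affects how compressible the residual codes are.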

Hybrid Approaches: We have developed novel hybrid machine-learning-based algorithmic and software pipelines that guarantee error bounds on the PD and QoI and scale to thousands of processors. For XGC fusion energy particle-in-cell simulations, we have shown that our approach can compress the data by two orders of magnitude [5,9,11]. Our approach is the first of its kind to provide such guarantees while ensuring that the resources required for compression are less than one percent of those required to generate the data, making it highly practical for real-time compression.

Differential Compression: For climate applications, the QoI are binary in nature, representing the presence or absence of a physical phenomenon. We have developed a pipelined compression approach that first uses MGARD [7,10,13] or neural-network-based techniques [1] to identify regions where QoI are highly likely to be present, and then employs a Guaranteed Autoencoder (GAE) to compress the data with differential error bounds: the GAE uses the QoI information to apply low-error compression only to these regions. Results for E3SM show that high overall compression ratios can be achieved while still meeting the downstream goal of guaranteed preservation of Tropical Cyclones and Atmospheric Rivers.
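The differential-bound idea can be sketched generically (this is an assumption-laden toy using uniform quantization, not the GAE itself): given a mask marking QoI-critical regions, apply a tight error bound inside the mask and a looser one elsewhere.

```python
import numpy as np

def quantize_with_bound(x, tau):
    """Uniform scalar quantization with bin width 2*tau, so the
    pointwise reconstruction error is at most tau."""
    return np.round(x / (2 * tau)) * (2 * tau)

def differential_compress(x, mask, tau_tight, tau_loose):
    """Tight error bound inside QoI-critical regions (mask=True),
    looser bound elsewhere."""
    out = np.empty_like(x)
    out[mask] = quantize_with_bound(x[mask], tau_tight)
    out[~mask] = quantize_with_bound(x[~mask], tau_loose)
    return out

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
mask = rng.random(1000) < 0.1   # stand-in for detected cyclone/AR regions
x_rec = differential_compress(x, mask, tau_tight=1e-4, tau_loose=1e-1)
```

Since only a small fraction of the domain needs the tight bound, the overall compression ratio is dominated by the loose-bound regions.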

A detailed description of the MGARD software is provided in [8]. The machine learning software is available on request.



1. X. Li, Q. Gong, J. Lee, S. Klasky, A. Rangarajan, and S. Ranka, Hybrid Approaches for Data Reduction of Spatiotemporal Scientific Applications. Proceedings of Data Compression Conference (DCC) 2024, to appear.
2. J. Lee, A. Rangarajan, and S. Ranka, Guaranteeing Error Bounds with Preservation of Derived Quantities in Compressive Autoencoders. Proceedings of Data Compression Conference (DCC) 2024, to appear.
3. J. Lee, A. Rangarajan, and S. Ranka, Nonlinear-by-Linear: Guaranteeing Error Bounds in Compressive Autoencoders. Proceedings of 2023 International Conference on Contemporary Computing (IC3), 2023, pp. 552-561.
4. J. Lee, A. Rangarajan, and S. Ranka, Nonlinear Constraint Satisfaction for Compressive Autoencoders using Instance-Specific Linear Operators. Proceedings of 2023 International Conference on Contemporary Computing (IC3), 2023, pp. 562-571.
5. T. Banerjee, J. Lee, J. Choi, Q. Gong, J. Chen, C.S. Chang, S. Klasky, A. Rangarajan, and S. Ranka, Online and Scalable Data Compression Pipeline with Guarantees on Quantities of Interest. Proceedings of IEEE eScience 2023, pp. 1-10.
6. J. Lee, A. Rangarajan, and S. Ranka, Constrained Autoencoders: Incorporating Equality Constraints in Learned Scientific Data Compression. Proceedings of Data Compression Conference (DCC) 2023, p. 347.
7. Q. Gong, C. Zhang, X. Liang, V. Reshniak, J. Chen, A. Rangarajan, S. Ranka, N. Vidal, L. Wan, P. Ullrich, N. Podhorszki, R. Jacob, S. Klasky, Spatiotemporally Adaptive Compression for Scientific Dataset with Feature Preservation – A Case Study on Simulation Data with Extreme Climate Events Analysis. Proceedings of IEEE eScience 2023, pp. 1-10.
8. Q. Gong, J. Chen, B. Whitney, X. Liang, V. Reshniak, T. Banerjee, J. Lee, A. Rangarajan, L. Wan, N. Vidal, Q. Liu, A. Gainaru, N. Podhorszki, R. Archibald, S. Ranka and S. Klasky. MGARD: A multigrid framework for high-performance, error-controlled data compression and refactoring. SoftwareX 24: 101590 (2023).
9. T. Banerjee, J. Choi, J. Lee, Q. Gong, J. Chen, C. Chang, S. Klasky, A. Rangarajan, S. Ranka, Online and Scalable Data Compression Pipeline With Guarantees on Quantities of Interest, Proceedings of IEEE eScience 2023, pp.1-10.
10. Q. Gong, C. Zhang, X. Liang, V. Reshniak, J. Chen, A. Rangarajan, S. Ranka, N. Vidal, P. Ullrich, N. Podhorszki, and S. Klasky, Spatiotemporally Adaptive Compression for Scientific Dataset with Feature Preservation – A Case Study on Simulation Data with Extreme Climate Events Analysis. Proceedings of IEEE eScience 2023, pp. 1-10.
11. T. Banerjee, J. Choi, J. Lee, Q. Gong, R. Wang, S. Klasky, A. Rangarajan, S. Ranka, An Algorithmic and Software Pipeline for Very Large-Scale Scientific Data Compression with Error Guarantees, Proceedings of International Conference on High Performance Computing, Data, and Analytics (HiPC), 2022, pp. 226-235.
12. J. Lee, Q. Gong, J. Choi, T. Banerjee, S. Klasky, S. Ranka, A. Rangarajan, Error-Bounded Learned Scientific Data Compression with Preservation of Derived Quantities. Applied Sciences. 2022; 12(13):6718.
13. Q. Gong, B. Whitney, C. Zhang, X. Liang, A. Rangarajan, J. Chen, L. Wan, P. Ullrich, Q. Liu, R. Jacob, S. Ranka, S. Klasky, Region-Adaptive, Error-Controlled Scientific Data Compression using Multilevel Decomposition. Proceedings of SSDBM 2022, pp. 5:1-5:12.