Finding the Goldilocks zone: Compression-error trade-off for large gridded datasets
Jeremy D. Silver¹ and Charles S. Zender²
¹School of Earth Sciences, University of Melbourne, Australia
²Departments of Earth System Science and of Computer Science, University of California, Irvine, USA
Abstract. The netCDF-4 format is widely used for large gridded scientific datasets, and supports several compression methods: lossy linear scaling and the lossless deflate and shuffle algorithms. Many multidimensional datasets exhibit considerable variation over one or several spatial dimensions (e.g. vertically) with less variation in the remaining dimensions (e.g. horizontally). On such datasets, linear scaling with a single pair of scale and offset parameters often entails considerable loss of precision. We propose a method (termed "layer packing") that simultaneously exploits lossy linear scaling and lossless compression. Rather than a single scalar pair, layer packing stores arrays of scale and offset parameters.
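The contrast between scalar packing and layer packing can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `pack` and `unpack`, the choice of unsigned 16-bit integers, and the synthetic field (strong vertical variation, weak horizontal variation) are all assumptions made for illustration only.

```python
import numpy as np

def pack(data, axes=None):
    """Linearly pack floats into unsigned 16-bit integers (hypothetical helper).

    axes: axes collapsed when computing the min/max. None gives one
    scalar (scale, offset) pair; (1, 2) gives one pair per layer
    along axis 0.
    """
    levels = 2**16 - 1
    lo = data.min(axis=axes, keepdims=True)
    hi = data.max(axis=axes, keepdims=True)
    scale = (hi - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant layers
    packed = np.round((data - lo) / scale).astype(np.uint16)
    return packed, scale, lo

def unpack(packed, scale, offset):
    """Invert pack(): broadcasting applies per-layer parameters."""
    return packed * scale + offset

# Synthetic field of shape (level, y, x): values fall off exponentially
# with level and vary by about 1 % horizontally (illustrative only).
rng = np.random.default_rng(0)
nlev, ny, nx = 10, 20, 30
profile = 1000.0 * np.exp(-np.arange(nlev))
noise = 1.0 + 0.01 * rng.standard_normal((nlev, ny, nx))
field = profile[:, None, None] * noise

p_s, s_s, o_s = pack(field)               # scalar packing
p_l, s_l, o_l = pack(field, axes=(1, 2))  # layer packing

err_scalar = np.abs(unpack(p_s, s_s, o_s) - field)
err_layer = np.abs(unpack(p_l, s_l, o_l) - field)

# Compare worst-case absolute error in the layer with the smallest values.
print("scalar:", err_scalar[-1].max(), "layer:", err_layer[-1].max())
```

In this sketch the per-layer scale factors adapt to the narrow range of each layer, so the quantization step in low-magnitude layers is far finer than with a single scale factor spanning the whole array. The packed integers would still need to pass through a lossless compressor (e.g. deflate) to realize the combined saving described above.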
An implementation of this method is compared with existing compression techniques in terms of compression ratio, accuracy, and speed. Layer packing produces typical errors of 0.01–0.02 % of the standard deviation within the packed layer, and yields files roughly 33 % smaller than those produced by the lossless deflate algorithm. This is similar to storing between 3 and 4 significant figures per datum. In the six test datasets considered, layer packing demonstrated a better compression/error trade-off than storing 3–4 significant digits in half of the cases and a worse trade-off in the remaining cases, highlighting the need to compare lossy compression methods in individual applications. Layer packing preserves substantially more precision than scalar linear packing, whereas scalar linear packing achieves greater compression ratios. Layer-packed data files must be "unpacked" to be readily usable. These characteristics make layer packing a competitive archive format for many geophysical datasets.
Silver, J. D. and Zender, C. S.: Finding the Goldilocks zone: Compression-error trade-off for large gridded datasets, Geosci. Model Dev. Discuss., doi:10.5194/gmd-2016-177, in review, 2016.