Geoscientific Model Development An interactive open-access journal of the European Geosciences Union
doi:10.5194/gmd-2016-177
© Author(s) 2016. This work is distributed
under the Creative Commons Attribution 3.0 License.
Development and technical paper
29 Jul 2016
Review status
A revision of this discussion paper was accepted for the journal Geoscientific Model Development (GMD).
Finding the Goldilocks zone: Compression-error trade-off for large gridded datasets
Jeremy D. Silver¹ and Charles S. Zender²
¹School of Earth Sciences, University of Melbourne, Australia
²Departments of Earth System Science and of Computer Science, University of California, Irvine, USA
Abstract. The netCDF-4 format is widely used for large gridded scientific datasets and includes several compression methods: lossy linear scaling and the lossless deflate and shuffle algorithms. Many multidimensional datasets exhibit considerable variation over one or several spatial dimensions (e.g. vertically) with less variation in the remaining dimensions (e.g. horizontally). On such datasets, linear scaling with a single pair of scale and offset parameters often entails considerable loss of precision. We propose a method (termed "layer packing") that simultaneously exploits lossy linear scaling and lossless compression. Instead of a single scalar pair, layer packing stores arrays of scale and offset parameters.
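
The contrast between scalar packing and layer packing can be illustrated with a minimal sketch. It is not the authors' implementation: the 16-bit unsigned integer target, the function names, and the synthetic field (magnitude decaying with level, small noise within each level) are all assumptions made for illustration, and plain Python lists stand in for netCDF variables.

```python
import math
import random

NLEVELS = 2**16 - 1  # assumption: pack into 16-bit unsigned integers


def pack(values):
    """Linearly scale floats to integers in [0, NLEVELS] using one (scale, offset) pair."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / NLEVELS if hi > lo else 1.0  # avoid dividing by zero for constant data
    return [round((v - lo) / scale) for v in values], scale, lo


def unpack(packed, scale, offset):
    """Invert the linear scaling: value ~= packed * scale + offset."""
    return [p * scale + offset for p in packed]


def pack_scalar(layers):
    """Scalar packing: a single (scale, offset) pair shared by every layer."""
    flat = [v for layer in layers for v in layer]
    packed, scale, offset = pack(flat)
    n = len(layers[0])  # assumes all layers have the same length
    packed_layers = [packed[i * n:(i + 1) * n] for i in range(len(layers))]
    return packed_layers, [scale] * len(layers), [offset] * len(layers)


def pack_layers(layers):
    """Layer packing: arrays of scale and offset values, one entry per layer."""
    packed, scales, offsets = zip(*(pack(layer) for layer in layers))
    return list(packed), list(scales), list(offsets)


def max_abs_error(layers, packed, scales, offsets):
    """Largest absolute round-trip error within each layer."""
    return [
        max(abs(v - u) for v, u in zip(layer, unpack(p, s, o)))
        for layer, p, s, o in zip(layers, packed, scales, offsets)
    ]


# Synthetic field: magnitude decays by a factor of e per level, with ~1 % noise within a level.
random.seed(0)
layers = [
    [1000.0 * math.exp(-k) * (1.0 + 0.01 * random.gauss(0.0, 1.0)) for _ in range(400)]
    for k in range(10)
]

err_scalar = max_abs_error(layers, *pack_scalar(layers))
packed_l, scales_l, offsets_l = pack_layers(layers)
err_layer = max_abs_error(layers, packed_l, scales_l, offsets_l)

for k, (es, el) in enumerate(zip(err_scalar, err_layer)):
    print(f"level {k}: scalar max error {es:.3e}, layer max error {el:.3e}")
```

In this toy setting the per-layer scale tracks the range of each level, so the quantisation step shrinks with the data, whereas the scalar pair must span the full range of the array. The cost is the extra storage for the scale and offset arrays and the need to unpack before use.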

An implementation of this method is compared with existing compression techniques in terms of compression ratio, accuracy, and speed. Layer packing produces typical errors of 0.01–0.02 % of the standard deviation within the packed layer, and yields files roughly 33 % smaller than the lossless deflate algorithm. This is comparable to storing between 3 and 4 significant figures per datum. In the six test datasets considered, layer packing demonstrated a better compression/error trade-off than storing 3–4 significant digits in half of the cases and a worse one in the remaining cases, highlighting the need to compare lossy compression methods in individual applications. Layer packing preserves substantially more precision than scalar linear packing, whereas scalar linear packing achieves greater compression ratios. Layer-packed data files must be "unpacked" to be readily usable. These characteristics make layer packing a competitive archive format for many geophysical datasets.
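
For comparison, the alternative of keeping a fixed number of significant figures can be sketched as below. This abstract does not specify how the significant-digit method was implemented, so the decimal rounding used here is only a stand-in, and `round_sig`, `max_rel_error`, and the random test values are hypothetical. The sketch shows that this kind of rounding bounds the error relative to each value's own magnitude, in contrast to layer packing, whose error scales with the range of the layer.

```python
import random


def round_sig(x, ndigits):
    """Round x to `ndigits` significant decimal digits (illustrative stand-in)."""
    return float(f"{x:.{ndigits - 1}e}")


def max_rel_error(values, ndigits):
    """Largest relative round-off error over a list of non-zero values."""
    return max(abs(v - round_sig(v, ndigits)) / abs(v) for v in values)


# Hypothetical test values spanning six orders of magnitude.
random.seed(1)
values = [random.uniform(1e-3, 1e3) for _ in range(1000)]

rel_err_3 = max_rel_error(values, 3)  # bounded by 0.5 * 10**-2
rel_err_4 = max_rel_error(values, 4)  # bounded by 0.5 * 10**-3

print(f"3 significant digits: max relative error {rel_err_3:.2e}")
print(f"4 significant digits: max relative error {rel_err_4:.2e}")
```

Rounded values compress well under a subsequent lossless pass only if the discarded digits are stored in a regular way; that step is outside this sketch.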


Citation: Silver, J. D. and Zender, C. S.: Finding the Goldilocks zone: Compression-error trade-off for large gridded datasets, Geosci. Model Dev. Discuss., doi:10.5194/gmd-2016-177, in review, 2016.

Data sets

Sample CAM SE model output
C. S. Zender
doi:10.4225/49/576ca64db2a14
Sample output of the mineral Dust Entrainment And Deposition (DEAD) model
C. S. Zender
doi:10.4225/49/576c95f254b67
Sample MERRA analysis
C. S. Zender
doi:10.4225/49/576c934f73be7
Sample output from the Weather, Research and Forecasting model
J. D. Silver
doi:10.4225/49/576c900f7d289
Sample MOZART model output
J. D. Silver
doi:10.4225/49/576c8ba706ac9

Latest update: 15 Jan 2017
Short summary
Many modern scientific research projects generate large amounts of data. Storage space is valuable and may be limited, hence compression is vital. We tested different compression methods for large gridded datasets, assessing the space savings and the amount of precision lost. We found a general trade-off between precision and compression, and that the method that optimises this trade-off depends on the dataset. A method introduced here proved to be a competitive archive format for gridded data.