The Weather Research and Forecasting (WRF) model is a numerical
weather prediction system designed to serve both atmospheric
research and operational forecasting needs. WRF is developed in a
collaboration that spans the globe, and its users include academic
atmospheric scientists and weather forecasters at operational
centers. WRF contains several physics components, of which the
microphysics is the most time consuming. One microphysics scheme is
the Goddard cloud microphysics scheme, a sophisticated scheme in the
WRF model. The Goddard microphysics scheme is very
suitable for massively parallel computation as there are no
interactions among horizontal grid points. Compared to the earlier
microphysics schemes, the Goddard scheme incorporates a large number
of improvements. Thus, we have optimized the Goddard scheme code. In
this paper, we present our results of optimizing the Goddard
microphysics scheme on Intel Many Integrated Core Architecture (MIC)
hardware. The Intel Xeon Phi coprocessor is the first product based
on Intel MIC architecture, and it consists of up to 61 cores
connected by a high performance on-die bidirectional interconnect.
The Intel MIC is capable of executing a full operating system and
entire programs rather than just kernels as the GPU does. The MIC
coprocessor supports all important Intel development tools. Thus,
the development environment is one familiar to a vast number of CPU
developers. However, obtaining maximum performance from MICs
requires some novel optimization techniques. Those optimization
techniques are discussed in this paper. The results show that the
optimizations improved the performance of the Goddard microphysics
scheme on the Xeon Phi 7120P by a factor of 4.7.
Scientific computing is moving towards the exascale era. To prepare legacy software to take full advantage of modern many-core supercomputing environments, the legacy code has to be modernized, which means taking a careful look at code modernization opportunities. Modern CPU cores contain Single Instruction, Multiple Data (SIMD) execution units for performing the same instruction on multiple data points simultaneously. Intel's Many Integrated Core Architecture (MIC) accelerator cards up the ante with up to 61 CPU cores, each with a 512-bit SIMD unit that operates on 16 single-precision values at once. Thus, both threading and vectorization techniques are explored in this paper. Furthermore, the code has to be optimized for frequent reuse of cached data to increase computational intensity; this issue is also explored in the paper. In addition, the Intel MIC coprocessor uses the same programming models and tools as Intel processors, so the optimization effort also benefits the code running on CPUs.
The Weather Research and Forecasting (WRF) model is an open-source Numerical Weather Prediction (NWP) model. It is suitable for simulating weather phenomena ranging from meters to thousands of kilometers. WRF is the most widely used community weather forecast and research model in the world: both operational forecasters and atmospheric researchers use it in 153 countries. The WRF model contains two dynamics cores: the Non-hydrostatic Mesoscale Model core (Janjic, 2003) and the Advanced Research WRF core (Wang et al., 2008). A dynamics core contains a set of dynamical solvers that operate on a particular grid projection, grid staggering, and vertical coordinate system. The WRF model also contains several physics components, many of which can be used with both dynamics cores. WRF has an extensible design, so it is possible to add physics, chemistry, and hydrology models and other features to it. In real-world scenarios, WRF is initialized with boundary conditions and topography using observations. Thus, WRF is used in a variety of areas such as tropical storm prediction, wildfire simulation, air-quality modeling, regional climate prediction, and storm-scale research.
The WRF software architecture takes advantage of distributed and shared memory systems. However, the vector capabilities of Xeon Phi coprocessors and the latest CPUs are used only by the WSM5 microphysics scheme; the other modules have not been vectorized at all. In this paper, we present our results of vectorizing and further optimizing another microphysics scheme, the Goddard scheme, on the current Xeon Phi. Furthermore, with the advent of unified AVX-512 vector extension instructions on Intel's future Knights Landing MIC and Skylake CPUs, utilization of vector processing is becoming essential for peak performance in data-parallel programs such as WRF. Meanwhile, OpenMP 4.0 offers important solutions for vectorization in a standard way. These developments make our optimization work essential for optimal performance of WRF on any future Intel hardware.
Although the Intel Xeon Phi is a new product, there already exist many publications characterizing its performance. A case study of WRF on Xeon Phi analyzed the scalability trade-offs of running one or more ranks of an MPI program on an Intel MIC chip (Meadows, 2012). Currently, the latest community release (WRFV3.6) runs natively on Xeon Phi. However, only the WSM5 microphysics scheme has been fully optimized to take full advantage of the Intel MIC (Michalakes, 2013). WSM5 contains 1800 lines of code out of approximately 700 000 in WRF, and it is just one of over ten microphysics schemes in WRF. There are also six other physics components in WRF, for a total of seven. In addition, the dynamics core in WRF has not been optimized for Xeon Phi yet. This paper describes our efforts to optimize the Goddard microphysics scheme to take advantage of the Intel MIC. This represents our first step towards fully accelerating WRF on the Intel MIC.
Other examples of Intel MIC acceleration are leukocyte tracking using medical imaging techniques (Fang et al., 2014), a real-world geophysical application (Weinberg et al., 2014) and an astrophysics package (GADGET), which is used for cosmological N-body/SPH simulations to solve a wide range of astrophysical tasks (Borovska and Ivanova, 2014). Other successful examples of using the Intel MIC for accelerated computing are multiple sequence alignment software (Borovska et al., 2014), an atomistic simulation package (Reid and Bethune, 2014) and a general-purpose subroutine system for the numerical solution of the incompressible Navier–Stokes equations (Venetis et al., 2014). In addition, the use of the Intel MIC for the computation of the partial correlation coefficient with information theory (PCIT) method for detecting interactions between networks has also been explored (Stanzione, 2014). Furthermore, the UCD-SPH code, which utilizes the Smoothed Particle Hydrodynamics (SPH) method for modelling wave interaction with an Oscillating Wave Surge Converter (OWSC) device, has been optimized for the Intel MIC (Lalanne, 2014). Moreover, the performance of the Intel MIC vs. the CPU has been studied using a parallel simulation of high-dimensional American option pricing (Hu et al., 2014). A fast short-read aligner has also been implemented on the Intel MIC (Chan et al., 2014).
WRF modules have a relatively low arithmetic intensity and limited operand reuse. Furthermore, the large working-set sizes of WRF modules overflow the cache memories of current coprocessors. Therefore, the WRF optimization process starts by modifying the code to expose vectorization and by reducing the memory footprint for better cache utilization. Techniques such as less-than-full-vector loop vectorization have to be used for peak performance (Tian et al., 2013).
A major advantage of using Xeon Phi for WRF acceleration is that an improvement in the whole WRF can be observed after each optimization step, because WRF runs in Xeon Phi's native mode. This is one huge advantage of Xeon Phi over the GPUs, which we used for WRF optimization earlier. Just like Xeon Phi's offload mode, an individual GPU kernel needs to transfer data between the CPU and GPU. The data transfer overhead for a WRF module on a GPU is usually enough to negate any potential speedup for the whole WRF; only when the whole code base has been translated to run on the GPU can WRF get any speedup on GPUs. For Xeon Phis, a speedup can be observed from day one.
In Sect. 2, we review the main characteristics of the Goddard microphysics scheme. Section 3 introduces MIC hardware characteristics. Results of the code optimization of the Goddard scheme are given in Sect. 4. Conclusion and plans for future work are described in Sect. 5.
The WRF physics components are microphysics, cumulus parameterization, planetary boundary layer (PBL), land-surface/surface-layer model and short-/longwave radiation. WRF physics components and their interactions are shown in Fig. 1. Dynamics core and its model integration procedure are shown in Fig. 2.
The Goddard microphysics scheme was introduced in Tao and Simpson (1993). The microphysical scheme has been modified to reduce the over-estimated and unrealistic amount of graupel in the stratiform region (Tao et al., 2003; Lang et al., 2007). In addition, saturation issues have been addressed better (Tao et al., 2003) and more realistic ice water contents for longer-term simulations have been obtained (Zeng, 2008, 2009). The microphysical processes simulated in the model are shown in Fig. 3 and explained in Table 1. The scheme is a single-moment bulk cloud microphysics scheme, which updates 7 variables (the deviation of potential temperature, water vapor, and the mixing ratios of 5 hydrometeors: cloud water, cloud ice, rain water, snow, and graupel/hail) in every time step using prognostic equations, a saturation process, and microphysical interaction processes.
The Intel MIC can be programmed using either the offload or the native programming model. In the offload model, compiler directives are added to the code to mark the regions that should be offloaded to the Intel Xeon Phi coprocessor. This approach is familiar to users of Graphics Processing Units (GPUs) as computational accelerators. Instead of using this cumbersome and error-prone approach, we used the native programming model of the Intel MIC for our work. In the native mode, the code is compiled on the host using Intel's compiler with the compiler switch -mmic to generate code for the MIC architecture. The resulting binary is then copied to the MIC coprocessor and executed there. This makes the code porting process straightforward, but the code optimization still takes skill and effort. Furthermore, the latest community release of WRF (WRF v3.6) already runs natively on Xeon Phi. Therefore, we took the current Intel MIC WRF code as a baseline and performed optimizations on individual modules in the WRF code. In this paper, code optimization results for the Goddard microphysics scheme are presented.
The Intel Xeon Phi coprocessor 7120P, which we used for
the benchmarks in this paper, contains 61 independent cores
running at a low frequency of
Furthermore, the Intel Xeon Phi has 16 GB of high-bandwidth Graphics Double Data Rate (GDDR5) memory. A unified L2 cache in each core is used to cache accesses to the GDDR5 memory. In addition, each core has separate L1 caches for instructions and data, with a typical access time of 1 cycle. The caches are fully coherent and implement the x86 memory ordering model. The memory bandwidth provided by the L1 and L2 caches is 15 and 7 times higher, respectively, than the GDDR5 bandwidth. Therefore, effective use of the caches is a key optimization for achieving peak performance on the Intel Xeon Phi (Chrysos, 2014).
A summary of the specifications of the Intel Xeon Phi Coprocessor 7120P is shown in Table 2. More details of Intel MIC programming can be found in Jeffers and Reinders (2013). A dual-socket octa-core Intel Xeon E5-2670 CPU was also used to benchmark the Goddard code optimizations. The specifications of the CPU are shown in Table 3.
To test the Goddard scheme on the MIC, we used a CONtinental United
States (CONUS) benchmark data set for 12
As can be seen from Fig. 5, the microphysics is the most time-consuming physics component in WRF. The Goddard microphysics scheme has 2355 lines of code, compared to over 60 000 lines of code in the ARW dynamics code; in total, WRF consists of over 500 000 lines of code. Therefore, it makes sense to optimize the Goddard microphysics scheme first, as this relatively small module can still give an overall speedup to the whole WRF. A summary of the various optimization steps is shown in Fig. 6. The processing times are an average of 10 execution runs, measured by running a stand-alone driver for the Goddard scheme. A default domain decomposition of 4 horizontal and 61 vertical tiles was used by the first version of the Goddard code (v1).
The first optimization step (v2) was modifying the dimensions of some
intermediate variables so that the
Vectorizing the main subroutine, saticel_s, was done in preparation for the next major optimization step. Vectorization refers to the process in which a scalar implementation, which performs an operation on one pair of operands at a time, is converted to a vector implementation, in which a single instruction refers to a vector. This adds a form of parallelism to software in which one instruction or operation is applied to multiple pieces of data. The benefit is more efficient data processing and improved code performance.
In this optimization step, an
In the next optimization step (v3), the input data are copied to vector-sized arrays of 16 elements, so each thread processes only 16 data elements at a time. Thus, better data locality is achieved, as more of the data can be accessed from the cache memories. In Fig. 9, the multithreaded Goddard code for this technique is shown. The drawback of this method is the overhead of copying the input data to temporary arrays; similarly, the output data have to be copied from the vector-sized arrays back into the original large arrays. Dynamic scheduling was used for the OpenMP work-sharing construct in the do-loop. The actual computation is performed in the main Goddard subroutine, saticel_s, 16 grid points at a time.
In code version (v4), vector alignment directives were added before
each
In v6, we modified the dimensions of some intermediate variables so
that the
Code validation was performed by checking that all code versions produced the same output values on the CPU. During code validation, precise-math compiler options were used. These tell the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations, disabling optimizations that can change the result of floating-point calculations, which is required for strict ANSI conformance. These semantics ensure the reproducibility of floating-point computations, including code vectorized or auto-parallelized by the compiler. However, this compiler option slows down performance; therefore, WRF does not use it as a default option.
This is in stark contrast to the situation when the default WRF compiler options are used, which tell the compiler to use more aggressive optimizations when implementing floating-point calculations. These optimizations increase speed but may affect the accuracy or reproducibility of floating-point computations.
Figure 10 shows the effect of the compiler optimizations on CPU
performance. The overall speedup from the code optimization on an
Intel Xeon E5-2670 CPU was a factor of 2.8.
We used the Intel Software Development Emulator to count the number of instructions executed. Figure 11 shows the total number of instructions executed for both the original and the optimized Goddard source code. Instruction counts are shown for the AVX, AVX2 and AVX512 instruction sets. The instruction count reduction for the AVX512 instruction set is quite small for the original code due to the limited vectorization of that code. However, the optimized code has a lower instruction count because of the use of vector instructions instead of scalar instructions. Furthermore, AVX512 offers significantly improved code vectorization capabilities, as is evident from its low instruction count compared to the AVX and AVX2 instruction counts on the optimized code.
Figure 12 summarizes the total elapsed times for the execution of the Goddard calculation on the Xeon CPU with 1-socket and 2-socket configurations and on the Intel MIC. The results show that the scaling from a single CPU socket to a dual-socket configuration is less than optimal due to the memory-bound nature of the code. Before the code optimization was performed, a dual-socket CPU was faster than the MIC. However, after the optimization process, the performance of the MIC is even higher than that of a dual-socket CPU configuration. Furthermore, the relative performance increase from the code optimizations was higher for the Intel MIC than for the CPUs. Thus, the performed optimizations are even more important on the MIC due to its larger speedup benefit from those optimizations.
In Fig. 13, the effects of the orthogonal optimization techniques of multi-threading and vectorization on a 1-socket Xeon CPU configuration and on the Intel MIC are shown. The unoptimized original code was used for the scalar performance in the figure; this required adding the -no-vec compiler option for Intel's Fortran compiler to disable code vectorization. Utilizing both multi-threading and vectorization was the key to good performance on both the CPU and MIC platforms. In the end, the Intel Xeon Phi coprocessor was able to exceed the performance of the Intel Xeon processor, but it required both multi-threading and vectorization to achieve that.
The optimized Goddard microphysics code was incorporated back into WRF. Intel's VTune profiler measurements in Fig. 14 show that, after the code optimizations, the Goddard scheme takes only 7.6 % of the total processing time, and the dynamics is even more dominant, with a 71.5 % share of the total processing time. Thus, the dynamics is a natural candidate for further optimization of WRF. In addition, radiative transfer and the planetary boundary layer are two physics categories that warrant further optimization efforts, as they each take over 7 % of the total processing time.
We have shown in this paper that the Intel MIC can be faster than a dual-socket CPU system. However, to achieve that, both multi-threading and vectorization have to be utilized. Furthermore, the code has to be optimized for frequent reuse of cached data to increase computational intensity. Following these guidelines, the optimization of the WRF Goddard microphysics scheme was described in this paper.
The results show that the optimizations improved the performance of
the Goddard microphysics scheme on the Xeon Phi 7120P by a factor of 4.7.
The generic nature of the performed optimizations means that the optimized code will run efficiently on future Intel MICs and CPUs. The next-generation Intel MIC, Intel Knights Landing (KNL), is not fully compatible with the previous generation, since the AVX-512 SIMD instructions on KNL are encoded differently from the Larrabee New Instructions (LRBni). However, we expect that the optimizations we have performed so far will also serve the code well on KNL. Future Intel CPUs, such as Skylake-EX, will also use the AVX-512 instruction encoding for vector instructions.
The Goddard microphysics scheme optimization work represents an initial step towards fully vectorizing WRF. As can be seen from the speedup figures, optimization of the code is essential for good performance on the MIC and modern CPUs. Thus, the work will continue with the optimization of the other WRF modules.
This work is supported by Intel Corporation under Grant No. MSN171479 to establish the Intel Parallel Computing Center at the University of Wisconsin-Madison. The authors would like to thank Bob Burroughs and Michael Greenfield for generous support.
Key to Fig. 3.
Specifications of the Intel Xeon Phi Coprocessor 7120P.
Specifications of the Intel Xeon E5-2670.
WRF physics components are microphysics, cumulus parametrization, planetary boundary layer (PBL), land-surface/surface-layer model, shortwave (SW) and longwave (LW) radiation.
WRF model integration procedure.
Cloud physics process simulated in the model with the snow field included. See Table 1 for an explanation of the symbols.
CONUS 12
MIC processing times for the original WRF code as measured by Intel's profiling tool VTune.
MIC processing time for the Goddard microphysics scheme.
The code with
The original non-vectorized code is shown on the left. Vectorized code is shown on the right.
Fortran code for multithreading using OpenMP. Input data is copied to vector sized arrays of CHUNK (16) elements.
CPU processing time for the Goddard microphysics scheme.
The total number of instructions executed.
Total elapsed time in milliseconds for execution of Goddard calculation on Xeon CPU with 1 socket and 2 socket configurations and on Intel MIC.
Effects of orthogonal optimization techniques of multi-threading and vectorization on 1 socket Xeon CPU configurations and on Intel MIC.
MIC processing times for the WRF after code optimizations as measured by Intel's profiling tool VTune.