NetCDF4 files trigger HDF5 error messages #11

Open
MHBalsmeier opened this issue Feb 28, 2025 · 2 comments

MHBalsmeier commented Feb 28, 2025

I have coupled RTE-RRTMGP to a GCM but disabled OMP for RRTMGP and parallelized the calls from the outside (calling RRTMGP in chunks), which reduces memory usage by a lot.

With RRTMGP 1.9 I started getting long HDF5 warnings in the console:

HDF5-DIAG: Error detected in HDF5 (1.10.10) thread 1:
  #000: ../../../src/H5A.c line 484 in H5Aopen_by_name(): can't open attribute
    major: Attribute
    minor: Can't open object
  #001: ../../../src/H5Aint.c line 542 in H5A__open_by_name(): unable to load attribute info from object header
    major: Attribute
    minor: Unable to initialize object
  #002: ../../../src/H5Oattribute.c line 496 in H5O__attr_open_by_name(): can't locate attribute: '_QuantizeBitGroomNumberOfSignificantDigits'
    major: Attribute
    minor: Object not found

This message then repeats over and over. The code still works, but runs much more slowly.

It occurs only when OMP is enabled, and only in threads other than thread 0, which is why I assume it is a very tricky low-level problem. I was able to track it down to the conversion to netCDF4 made in #4, using the following reproducible example:


program control
  
  use mo_gas_optics_util_string,  only: lower_case
  use mo_gas_optics_rrtmgp,       only: ty_gas_optics_rrtmgp
  use mo_load_coefficients,       only: load_and_init
  use mo_gas_concentrations,      only: ty_gas_concs
  use mo_optical_props,           only: ty_optical_props_1scl, &
                                        ty_optical_props_2str
  use mo_cloud_optics_rrtmgp,     only: ty_cloud_optics_rrtmgp
  use mo_load_cloud_coefficients, only: load_cld_lutcoeff
  use omp_lib,                    only: omp_get_thread_num
  
  character(len=3), dimension(8) :: active_gases = (/ & 
   "N2 ","O2 ","CH4","O3 ","CO2","H2O","N2O","CO " &
   /)
  character(len=32), dimension(size(active_gases)) :: gases_lowercase
  
  call read_nc_files()
  
  contains
  
  subroutine read_nc_files()
    
    integer :: jc
    type(ty_gas_concs) :: gas_concentrations_sw
    type(ty_gas_concs) :: gas_concentrations_lw
    type(ty_gas_optics_rrtmgp) :: k_dist_sw,k_dist_lw
    type(ty_cloud_optics_rrtmgp) :: cloud_optics_sw
    type(ty_cloud_optics_rrtmgp) :: cloud_optics_lw
    
    do jc=1,size(active_gases)
      gases_lowercase(jc) = trim(lower_case(active_gases(jc)))
    end do
    
    write(*,*) gas_concentrations_sw%init(gases_lowercase)
    write(*,*) gas_concentrations_lw%init(gases_lowercase)
    
    !$omp parallel do private(jc)
    do jc=1,2
      
      write(*,*) omp_get_thread_num()
      !$omp critical
      ! call load_and_init(k_dist_sw,trim("rrtmgp-gas-sw-g112_old.nc"),gas_concentrations_sw)
      call load_and_init(k_dist_sw,trim("rrtmgp-gas-sw-g112_new.nc"),gas_concentrations_sw)
      ! call load_and_init(k_dist_sw,trim("rrtmgp-gas-sw-g112.nc"),gas_concentrations_sw)
      ! call load_and_init(k_dist_lw,trim("rrtmgp-gas-lw-g128.nc"),gas_concentrations_lw)
      ! call load_cld_lutcoeff(cloud_optics_sw,trim("rrtmgp-clouds-sw-bnd.nc"))
      ! call load_cld_lutcoeff(cloud_optics_lw,trim("rrtmgp-clouds-lw-bnd.nc"))
      !$omp end critical
      write(*,*) omp_get_thread_num()
      
    enddo
    !$omp end parallel do
    
  end subroutine read_nc_files
  
end program control

I assume it is the same error as this one here: https://code.mpimet.mpg.de/boards/1/topics/14326
It should be resolved in netCDF 4.9.3 (I am using netCDF 4.9.2).

It does not occur when converting the files back to netCDF 3 with nccopy -3 infile.nc outfile.nc.

If this is an option, I can make a pull request for it; otherwise, this is just to inform others of the problem.

CMakeLists.txt

@RobertPincus
Member

@MHBalsmeier Thanks, this is quite interesting. Thanks for letting us know.

In multi-threaded environments I understand it to be more common to read the data with a single thread and broadcast it, in part so there aren't lots of processes trying to access the same file at the same time. That's why the underlying RRTMGP initialization routines accept data rather than file names. Would this eliminate the error messages?
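
For readers following along, here is a minimal, self-contained sketch of that read-once pattern with OpenMP. It does not use RRTMGP or netCDF: load_coefficients and process_slice are hypothetical stand-ins for load_and_init and the per-slice radiation call, and the array sizes are arbitrary. The only point is that the load happens in the serial region, and the loaded data is then shared read-only across threads.

```fortran
program read_once_then_share

  implicit none

  integer, parameter :: n_coeff = 4, n_slices = 8
  real    :: coeff(n_coeff)   ! stand-in for the coefficient data (assumption)
  real    :: flux(n_slices)   ! one result per radiation slice
  integer :: js

  ! Serial region: only the initial thread touches the "file".
  call load_coefficients(coeff)

  ! Parallel region: coeff is shared and only read, so no critical
  ! section is needed; each iteration writes its own element of flux.
  !$omp parallel do shared(coeff, flux) private(js)
  do js = 1, n_slices
    flux(js) = process_slice(js, coeff)
  end do
  !$omp end parallel do

  write(*,*) sum(flux)

contains

  ! Hypothetical stand-in for load_and_init; a real version would
  ! read the netCDF file here.
  subroutine load_coefficients(c)
    real, intent(out) :: c(:)
    integer :: i
    do i = 1, size(c)
      c(i) = real(i)
    end do
  end subroutine load_coefficients

  ! Hypothetical stand-in for the per-slice radiation computation.
  pure function process_slice(slice_index, c) result(val)
    integer, intent(in) :: slice_index
    real,    intent(in) :: c(:)
    real :: val
    val = real(slice_index) * sum(c)
  end function process_slice

end program read_once_then_share
```

Built with OpenMP enabled (for example gfortran -fopenmp), the loop runs in parallel; without it, the directives are treated as comments and the program runs serially with the same result. Whether the real RRTMGP objects can be shared across threads in this way is an assumption that would need checking against the library.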

@MHBalsmeier
Author

@RobertPincus I started on a version in which the data is read only once and then broadcast to the other threads, and the error did not occur. However, I did not pursue this approach because in my actual code I call the RRTMGP coupler (which is a subroutine) inside the omp do loop for each radiation slice individually (the slices also have different numbers of columns). This would require some refactoring, and I'm currently happy with using the netCDF 3 files.

I also think that, because of the omp critical section, the file should never be accessed by more than one thread at the same time.

If the problem persists with netCDF 4.9.3 I will have to do that though. Will report here if it is solved with 4.9.3.
Feel free to close this for now.
