cgroup awareness #1155

Closed · kcgthb opened this issue Apr 13, 2017 · 17 comments
@kcgthb commented Apr 13, 2017

Hi!

I understand that OpenBLAS tries to automatically detect the number of CPU cores on a machine at runtime, to determine the number of threads to start when {OPENBLAS,GOTO,OMP}_NUM_THREADS is not set.

It works fine in most cases, but when the process runs in a cgroup context, for instance one where the cpuset subsystem is in use, it may result in less-than-optimal behavior.

For instance, on a 16-core machine, if a process runs inside a cgroup where 4 CPUs have been allocated via cpuset, OpenBLAS will start 16 threads, which will be pinned to just 4 CPU-cores and will compete with each other. In the end, the performance will be about a quarter of what it would have been with just 4 threads.

So I'm wondering if any thought has already been given to this, and how OpenBLAS could try to detect whether it's running in a constrained context, in order to properly allocate the resources it can actually use.

Thanks!

@brada4 (Contributor) commented Apr 13, 2017

cgroups do not signal the affected process at all. Do you want to check for duplicate processors before calling setaffinity?

@brada4 (Contributor) commented Apr 13, 2017

You can check the lxc and docker reports and mailing lists, where they came to the conclusion that this is simply how the kernel works. Unless you have stable code to detect the restricted CPU count, please close this issue. You are the one configuring the resource fences; it should not be too hard to slip an environment variable in between the lines.

@kcgthb (Author) commented Apr 14, 2017

With the rise of containers of all kinds, I think better detection of the resources that are actually usable would be a sensible goal. It would also avoid erroneous bug reports about performance regressions.

Getting CPU affinity could be done with sched_getaffinity(), and it would also cover the cases where a parent process is called with numactl or taskset.

@brada4 (Contributor) commented Apr 14, 2017

You mentioned that containers are broken. What workaround do you offer?
sched_getaffinity() does not see cpusets.

@martin-frbg (Collaborator)

From my very limited understanding (gained by reading the cpuset, sched_setaffinity and CPU_SET manpages), a possible (and probably Linux-only) solution could be to use the number of cpus originally obtained by get_num_procs (in driver/others/init.c) as input to CPU_ALLOC() and CPU_ALLOC_SIZE() to set up a sufficiently large cpuset struct, then query the CPU_COUNT_S() of that cpuset and continue working with this value instead of the global cpu count.
Though looking through that file, I see it already works with a cpu_set struct and NUMA mappings, so it could be that the required code is already there but either no longer works correctly on modern systems (some comments in the code mention the counts being limited to sizeof(long) etc.) or its result gets overridden by a thread count determined elsewhere.

@martin-frbg (Collaborator)

Could also be a subtle difference between _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN being used to determine the processor count - a change made recently (#981) with the intention of accommodating ARM platforms that bring additional cpus online on demand.

@brada4 (Contributor) commented Apr 14, 2017

Another upper constraint could be read from sched_getaffinity by counting the affinity bits.
But it will not consider the cgroup cpusets that all the fancy containers use. (I mean, likening taskset to all kinds of cgroups is wrong; they are different APIs.)

@kcgthb (Author) commented Apr 14, 2017

@brada4 I never mentioned "containers are broken", merely that the detection of the number of CPU-cores could be improved to take cpuset-like restrictions into account.
I believe the assumption that one can always use all the CPU-cores physically present in a machine is no longer a valid one.

Detecting affinity should be sufficient in most cases. If I understand things correctly, there are two mechanisms that could be at play here:

  • cpusets, which "constrain the CPU and Memory placement of tasks to only the resources within a task's current cpuset"
  • affinity masks (with sched_{set,get}affinity()), which are soft constraints and can only apply to CPUs within the cpuset. From the same document:

    Requests by a task, using the sched_setaffinity(2) system call to include CPUs in its CPU affinity mask, and using the mbind(2) and set_mempolicy(2) system calls to include Memory Nodes in its memory policy, are both filtered through that task's cpuset, filtering out any CPUs or Memory Nodes not in that cpuset.

So since affinity masks are always a subset of cpusets, I guess it would be sufficient to detect the process' affinity mask (with sched_getaffinity()), count the number of CPUs in that mask, and only start as many threads.
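
For illustration, a minimal standalone sketch of that idea (assuming glibc; this is a test program, not OpenBLAS code) could be:

    /* count the CPUs the process may actually run on;
       build with: cc -o usable_cpus usable_cpus.c */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t mask;
        /* the kernel filters the returned mask through the task's cpuset */
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_getaffinity");
            return 1;
        }
        printf("usable cpus: %d\n", CPU_COUNT(&mask));
        return 0;
    }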

@brada4 (Contributor) commented Apr 14, 2017

The sched_getaffinity response is not filtered by cgroups. But it can be used to see taskset restrictions (which are very uncommon; taskset is usually used to start a realtime task on isolated CPUs).

I hope you know that setting affinity is not the default. And there is no info call that returns a CPU count that takes cgroups into account.

@kcgthb (Author) commented Apr 14, 2017

sched_getaffinity response is not filtered by cgroups.

Well, affinity cannot be set to CPUs that are not part of the cpuset, so yes, it kind of is:

$ cat /cgroup/cpuset/$(cat /proc/$$/cpuset)/cpuset.cpus
8,10,12,14
$ taskset -pc 0-15 $$
pid 32386's current affinity list: 8,10,12,14
pid 32386's new affinity list: 8,10,12,14
$ taskset -c -p $$
pid 32386's current affinity list: 8,10,12,14

Also, you can get the list of allowed CPUs within the cpuset from /proc/<PID>/status:

$ grep Cpus_allowed /proc/$$/status
Cpus_allowed:   0000,00005500
Cpus_allowed_list:      8,10,12,14

Setting affinity or defining cpusets may not be the default on a laptop with a single user, but it's very common in the world of containers and HPC centers, which are probably the places where people care most about performance.

But anyway, I'm just trying to point out an area where OpenBLAS could be enhanced and give more performance for users out of the box. If that's not of interest to you, so be it.

@martin-frbg (Collaborator)

Note that cpu affinity handling is disabled by default (NO_AFFINITY=1 set in Makefile.rule), so you may not benefit from any cpuset-aware code that may already be there if you are using a default build. (I think the reasoning for this was that dynamically loading an affinity-aware OpenBLAS from python or similar could lock the entire interpreter onto just the cpu(s) used by OpenBLAS.)
If building with NO_AFFINITY=0 does not change anything, you could undo the changes from #981 to see if this affects the cpu count seen (or just whip up a test code that calls sysconf() to get the two variables, to see if the "ONLN" value is affected by cpuset while the "CONF" value used since that PR is not).
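
For reference, such a test could be as small as this sketch (standalone, not OpenBLAS code; run it inside and outside the cpuset and compare):

    /* print the configured vs. currently-online processor counts */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        printf("_SC_NPROCESSORS_CONF: %ld\n", sysconf(_SC_NPROCESSORS_CONF));
        printf("_SC_NPROCESSORS_ONLN: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
        return 0;
    }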

@brada4 (Contributor) commented Apr 14, 2017

Imo the whole awareness amounts to getting the population count from the affinity mask, in the NO_AFFINITY case too?

@martin-frbg (Collaborator)

Some observations:

  1. querying _SC_NPROCESSORS_CONF vs _SC_NPROCESSORS_ONLN (i.e. the change from #981, which uses NPROCESSORS_CONF instead of NPROCESSORS_ONLN) has no bearing on this
  2. building without NO_AFFINITY=1 does not appear to enable correct handling of cpusets (although the code in init.c already uses sched_getaffinity), but it did reduce the number of cpus reported to the full-featured cores
  3. cpuset handling for the NO_AFFINITY case would need to be added to the get_num_procs() function in driver/others/memory.c, which would have to look something like this:
    #if defined(OS_LINUX) || defined(OS_SUNOS) || defined(OS_NETBSD)
    #ifndef NO_AFFINITY
    int get_num_procs(void);
    #else
    int get_num_procs(void) {
      static int nums = 0;
      cpu_set_t *cpusetp;                               /* new */
      size_t size;                                      /* new */
      int ret;                                          /* new */

      if (!nums) nums = sysconf(_SC_NPROCESSORS_CONF);
      /* new: clamp the count to the process' affinity mask, which the
         kernel has already filtered through the task's cpuset */
      cpusetp = CPU_ALLOC(nums);
      if (cpusetp == NULL) return nums;
      size = CPU_ALLOC_SIZE(nums);
      ret = sched_getaffinity(0, size, cpusetp);
      if (ret != 0) {
        CPU_FREE(cpusetp);  /* don't leak the mask on error */
        return nums;
      }
      nums = CPU_COUNT_S(size, cpusetp);
      CPU_FREE(cpusetp);
      return nums;
    }
    #endif
    #endif

with the potential drawback that the CPU_ macros are only provided since about glibc 2.6 (probably even later for the full CPU_ALLOC, CPU_COUNT_S, CPU_FREE set, which would be necessary for >1023 processors only). The minimal version would just declare a cpu_set_t cpuset and call CPU_COUNT(&cpuset), though the latter may need to be emulated with a loop calling CPU_ISSET as well. Does anybody happen to know how the same is handled on SunOS and NetBSD?
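
For what it's worth, that minimal variant might look like the following sketch (the fixed-size cpu_set_t caps the mask at CPU_SETSIZE, i.e. 1024 CPUs; `nums` is the sysconf count from the snippet above):

    /* fallback for glibc without CPU_COUNT: count the set bits in a
       fixed-size affinity mask with a CPU_ISSET loop */
    cpu_set_t cpuset;
    int i, n = 0;
    if (sched_getaffinity(0, sizeof(cpuset), &cpuset) == 0) {
      for (i = 0; i < CPU_SETSIZE; i++)
        if (CPU_ISSET(i, &cpuset)) n++;
      if (n > 0) nums = n;
    }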

@brada4 (Contributor) commented Apr 15, 2017

NetBSD: pthread_getaffinity_np(), which has some macro requirements; the OS does not have any sort of containers.
Solaris: Zones don't see the full host machine; xVM and VirtualBox are full-machine virtualisation.
I.e. checking for taskset etc. is not necessary on the other platforms, and the Linux code will not be valid there either.

@martin-frbg (Collaborator)

Should be fixed now in 0.2.20

@martin-frbg (Collaborator)

Reverting for now as my usage of the __GLIBC_PREREQ macro breaks the build for Android and other non-glibc systems. cf #630
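
(For the record, the usual guard for this is to test that the macro exists before invoking it, since an undefined function-like macro inside #if is a preprocessor error on bionic/musl. A sketch, with HAVE_CPU_COUNT as a made-up name:)

    /* __GLIBC_PREREQ does not exist on Android (bionic) or musl, so
       test for it in a separate #if before calling it */
    #if defined(__GLIBC_PREREQ)
    #if __GLIBC_PREREQ(2, 6)
    #define HAVE_CPU_COUNT 1   /* hypothetical feature flag */
    #endif
    #endif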

@martin-frbg (Collaborator)

PR #1520 fixed another regression caused by this, so should be good to go.
