cgroup awareness #1155

Closed · kcgthb opened this issue Apr 13, 2017 · 17 comments
@kcgthb commented Apr 13, 2017

Hi!

I understand that OpenBLAS tries to automatically detect the number of CPU cores on a machine at runtime, to determine the number of threads to start when {OPENBLAS,GOTO,OMP}_NUM_THREADS is not set.

It works fine in most cases, but when the process runs in a cgroup context, for instance one where the cpuset subsystem is in use, it may result in less-than-optimal behavior.

For instance, on a 16-core machine, if a process runs inside a cgroup where 4 CPUs have been allocated via cpuset, OpenBLAS will start 16 threads, which will be pinned to just 4 CPU-cores and will compete with each other. In the end, the performance will be about a quarter of what it would have been with just 4 threads.

So I'm wondering if any thought has already been given to this, and how OpenBLAS could try to detect whether it's running in a constrained context, in order to properly allocate the resources it can actually use.

Thanks!

@brada4 (Contributor) commented Apr 13, 2017

cgroups do not signal the affected process at all. Do you want to check for duplicate processors before calling setaffinity?

@brada4 (Contributor) commented Apr 13, 2017

You can check the lxc and docker reports and mailing lists, where they came to the conclusion that this is simply how the kernel works. Unless you have stable code to detect the restricted CPU count, please close this issue. You are the one configuring the resource fences; it should not be too hard to slip an environment variable in between the lines.

@kcgthb (Author) commented Apr 14, 2017

With the rise of containers of all kinds, I think better detection of the resources that are actually usable would be a sensible goal. It would also avoid erroneous bug reports about performance regressions.

Getting CPU affinity could be done with sched_getaffinity(), and it would also cover the cases where a parent process is called with numactl or taskset.

@brada4 (Contributor) commented Apr 14, 2017

You mentioned that containers are broken. What workaround do you offer?
sched_getaffinity() does not see cpusets.

@martin-frbg (Collaborator)

From my very limited understanding (gained by reading the cpuset, sched_setaffinity and CPU_SET manpages), a possible (and probably Linux-only) solution could be to use the number of cpus originally obtained by get_num_procs (in driver/others/init.c) as input to CPU_ALLOC() and CPU_ALLOC_SIZE() to set up a sufficiently large cpuset struct, then query the CPU_COUNT_S() of that cpuset and continue working with this value instead of the global cpu count.
Though looking through that file, I see it already works with a cpu_set struct and NUMA mappings, so it could be that the required code is already there but either no longer works correctly on modern systems (some comments in the code mention the counts being limited to sizeof(long) etc.) or its result gets overridden by a thread count determined elsewhere.

@martin-frbg (Collaborator)

Could also be a subtle difference between _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN being used to determine the processor count - a change made recently (#981) with the intention of accommodating ARM platforms that bring additional cpus online on demand.

@brada4 (Contributor) commented Apr 14, 2017

Another upper constraint could be read from sched_getaffinity by counting the affinity bits.
But it will not consider the cgroup cpusets that all the fancy containers use. (I mean, likening taskset to all kinds of cgroups is wrong; they are different APIs.)

@kcgthb (Author) commented Apr 14, 2017

@brada4 I never mentioned "containers are broken", merely that the detection of the number of CPU-cores could be improved to take cpuset-like restrictions into account.
I believe the assumption that one can always use all the CPU-cores physically present in a machine is no longer a valid one.

Detecting affinity should be sufficient in most cases. If I understand things correctly, there are two mechanisms that could be at play here:

  • cpusets, which "constrain the CPU and Memory placement of tasks to only the resources within a task's current cpuset"
  • affinity masks (with sched_{set,get}affinity()), which are soft constraints and can only apply to CPUs within the cpuset. From the same document:

    Requests by a task, using the sched_setaffinity(2) system call to include CPUs in its CPU affinity mask, and using the mbind(2) and set_mempolicy(2) system calls to include Memory Nodes in its memory policy, are both filtered through that task's cpuset, filtering out any CPUs or Memory Nodes not in that cpuset.

So since affinity masks are always a subset of cpusets, I guess it would be sufficient to detect the process' affinity mask (with sched_getaffinity()), count the number of CPUs in that mask, and only start as many threads.
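
For illustration, a minimal standalone sketch of that idea (assuming glibc; this is a test program, not OpenBLAS code) could be:

    /* count the CPUs the process may actually run on;
       build with: cc -o usable_cpus usable_cpus.c */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t mask;
        /* the kernel filters the returned mask through the task's cpuset */
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_getaffinity");
            return 1;
        }
        printf("usable cpus: %d\n", CPU_COUNT(&mask));
        return 0;
    }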

@brada4 (Contributor) commented Apr 14, 2017

The sched_getaffinity response is not filtered by cgroups. But it can be used to see taskset restrictions (which are very uncommon; taskset is usually used to start a realtime task on isolated CPUs).

I hope you know that setting affinity is not the default. And there is no info call that returns a CPU count that takes cgroups into account.

@kcgthb (Author) commented Apr 14, 2017

sched_getaffinity response is not filtered by cgroups.

Well, affinity cannot be set to CPUs that are not part of the cpuset, so yes, it kind of is:

$ cat /cgroup/cpuset/$(cat /proc/$$/cpuset)/cpuset.cpus
8,10,12,14
$ taskset -pc 0-15 $$
pid 32386's current affinity list: 8,10,12,14
pid 32386's new affinity list: 8,10,12,14
$ taskset -c -p $$
pid 32386's current affinity list: 8,10,12,14

Also, you can get the list of allowed CPUs within the cpuset from /proc/<PID>/status:

$ grep Cpus_allowed /proc/$$/status
Cpus_allowed:   0000,00005500
Cpus_allowed_list:      8,10,12,14

Setting affinity or defining cpusets may not be the default on a laptop with a single user, but it's very common in the world of containers and HPC centers, which are probably the places where people care most about performance.

But anyway, I'm just trying to point out an area where OpenBLAS could be enhanced and give more performance for users out of the box. If that's not of interest to you, so be it.

@martin-frbg (Collaborator)

Note that cpu affinity handling is disabled by default (NO_AFFINITY=1 set in Makefile.rule), so you may not benefit from any cpuset-aware code that may already be there if you are using a default build. (I think the reasoning for this was that dynamically loading an affinity-aware OpenBLAS from python or similar could lock the entire interpreter onto just the cpu(s) used by OpenBLAS.)
If building with NO_AFFINITY=0 does not change anything, you could undo the changes from #981 to see if this affects the cpu count seen (or just whip up a test code that calls sysconf() to get the two variables, to see if the "ONLN" value is affected by cpuset while the "CONF" value used since that PR is not).
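
For reference, such a test could be as small as this sketch (standalone, not OpenBLAS code; run it inside and outside the cpuset and compare):

    /* print the configured vs. currently-online processor counts */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        printf("_SC_NPROCESSORS_CONF: %ld\n", sysconf(_SC_NPROCESSORS_CONF));
        printf("_SC_NPROCESSORS_ONLN: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
        return 0;
    }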

@brada4 (Contributor) commented Apr 14, 2017

Imo the whole awareness amounts to getting the population count from the affinity mask, in the NO_AFFINITY case too?

@martin-frbg (Collaborator)

Some observations:

  1. querying _SC_NPROCESSORS_CONF vs _SC_NPROCESSORS_ONLN (i.e. the change from #981, which uses NPROCESSORS_CONF instead of NPROCESSORS_ONLN) has no bearing on this
  2. building without NO_AFFINITY=1 does not appear to enable correct handling of cpusets (although the code in init.c already uses sched_getaffinity), but it did reduce the number of cpus reported to the full-featured cores
  3. cpuset handling for the NO_AFFINITY case would need to be added to the get_num_procs() function in driver/others/memory.c, which would have to look something like this:
    #if defined(OS_LINUX) || defined(OS_SUNOS) || defined(OS_NETBSD)
    #ifndef NO_AFFINITY
    int get_num_procs(void);
    #else
    int get_num_procs(void) {
      static int nums = 0;
      cpu_set_t *cpusetp;                               /* new */
      size_t size;                                      /* new */
      int ret;                                          /* new */

      if (!nums) nums = sysconf(_SC_NPROCESSORS_CONF);
      /* new: clamp the count to the process' affinity mask, which the
         kernel has already filtered through the task's cpuset */
      cpusetp = CPU_ALLOC(nums);
      if (cpusetp == NULL) return nums;
      size = CPU_ALLOC_SIZE(nums);
      ret = sched_getaffinity(0, size, cpusetp);
      if (ret != 0) {
        CPU_FREE(cpusetp);  /* don't leak the mask on error */
        return nums;
      }
      nums = CPU_COUNT_S(size, cpusetp);
      CPU_FREE(cpusetp);
      return nums;
    }
    #endif
    #endif

with the potential drawback that the CPU_ macros are only provided since about glibc 2.6 (probably even later for the full CPU_ALLOC, CPU_COUNT_S, CPU_FREE set, which would be necessary for >1023 processors only). The minimal version would just declare a cpu_set_t cpuset and call CPU_COUNT(&cpuset), though the latter may need to be emulated with a loop calling CPU_ISSET as well. Does anybody happen to know how the same is handled on SunOS and NetBSD?
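
For what it's worth, that minimal variant might look like the following sketch (the fixed-size cpu_set_t caps the mask at CPU_SETSIZE, i.e. 1024 CPUs; `nums` is the sysconf count from the snippet above):

    /* fallback for glibc without CPU_COUNT: count the set bits in a
       fixed-size affinity mask with a CPU_ISSET loop */
    cpu_set_t cpuset;
    int i, n = 0;
    if (sched_getaffinity(0, sizeof(cpuset), &cpuset) == 0) {
      for (i = 0; i < CPU_SETSIZE; i++)
        if (CPU_ISSET(i, &cpuset)) n++;
      if (n > 0) nums = n;
    }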

@brada4 (Contributor) commented Apr 15, 2017

NetBSD: pthread_getaffinity_np(), which has some macro requirements; the OS does not have any sort of containers.
Solaris: Zones don't see the full host machine; xVM and VirtualBox are full-machine virtualisation.
I.e. checking for taskset etc. is not necessary on the other platforms, and the Linux code will not be valid there either.

@martin-frbg (Collaborator)

Should be fixed now in 0.2.20

@martin-frbg (Collaborator)

Reverting for now as my usage of the __GLIBC_PREREQ macro breaks the build for Android and other non-glibc systems. cf #630
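
(For the record, the usual guard for this is to test that the macro exists before invoking it, since an undefined function-like macro inside #if is a preprocessor error on bionic/musl. A sketch, with HAVE_CPU_COUNT as a made-up name:)

    /* __GLIBC_PREREQ does not exist on Android (bionic) or musl, so
       test for it in a separate #if before calling it */
    #if defined(__GLIBC_PREREQ)
    #if __GLIBC_PREREQ(2, 6)
    #define HAVE_CPU_COUNT 1   /* hypothetical feature flag */
    #endif
    #endif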

@martin-frbg (Collaborator)

PR #1520 fixed another regression caused by this, so should be good to go.
