cgroup awareness #1155
Comments
cgroups do not signal the affected process at all. Do you want to check for duplicate processors before calling setaffinity?
You can check the lxc and docker reports and lists, where they came to the conclusion that this is how the kernel works. Unless you have stable code to detect a crippled CPU count, please close this issue. You are the one configuring the resource fences, so it should not be too hard to slip an environment variable in between the lines.
With the rise of containers of all kinds, I think better detection of which resources are actually usable would be a sensible thing to tend to. It would also avoid producing erroneous bug reports about performance regressions. Getting CPU affinity could be done with
You mentioned that containers are broken. What workaround do you offer?
From my very limited understanding (gained by reading the cpuset, sched_setaffinity and CPU_SET manpages), a possible (and probably Linux-only) solution could be to use the number of CPUs originally obtained by get_num_procs (in driver/others/init.c) as input to CPU_ALLOC() and CPU_ALLOC_SIZE() to set up a sufficiently large cpuset struct, then query the CPU_COUNT_S() of that cpuset and continue working with this value instead of the global CPU count.
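The idea in this comment could be sketched roughly as follows. This is only an illustration, assuming Linux and a glibc recent enough to provide the CPU_ALLOC family of macros. The helper name `usable_cpu_count` is hypothetical, `sysconf(_SC_NPROCESSORS_CONF)` stands in for the value that get_num_procs would supply, and the retry with a doubled mask size is an addition of this sketch (the kernel rejects a mask smaller than its own with EINVAL).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper, not OpenBLAS code: returns the number of CPUs the
 * calling process is allowed to run on. Falls back to the system-wide
 * count if the affinity mask cannot be queried. */
int usable_cpu_count(void)
{
    long total = sysconf(_SC_NPROCESSORS_CONF);
    if (total < 1)
        total = 1;

    /* Start with a mask sized for the global count; if the kernel reports
     * EINVAL (mask too small), retry with a mask twice as large. */
    for (size_t ncpus = (size_t)total; ncpus <= 65536; ncpus *= 2) {
        cpu_set_t *set = CPU_ALLOC(ncpus);
        if (set == NULL)
            break;

        size_t size = CPU_ALLOC_SIZE(ncpus);
        CPU_ZERO_S(size, set);

        int rc = sched_getaffinity(0, size, set); /* 0 = calling thread */
        int err = errno;
        int count = (rc == 0) ? CPU_COUNT_S(size, set) : 0;
        CPU_FREE(set);

        if (rc == 0 && count > 0)
            return count;
        if (rc != 0 && err != EINVAL)
            break; /* unexpected failure: use the fallback */
    }
    return (int)total;
}
```

A caller would invoke `usable_cpu_count()` once during initialization and use the result in place of the global CPU count.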
It could also be a subtle difference between _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN being used to determine the processor count - a change made recently (#981) with the intention of accommodating ARM platforms that bring additional CPUs online on demand.
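To see whether the two sysconf values actually differ on a given machine, something like the following could be used. It is a small sketch only: the struct and function names are hypothetical, and both constants are extensions rather than part of POSIX, so their availability is an assumption.

```c
#include <unistd.h>

/* Hypothetical container for the two counts discussed above. */
struct cpu_counts {
    long configured; /* _SC_NPROCESSORS_CONF: processors configured */
    long online;     /* _SC_NPROCESSORS_ONLN: processors currently online */
};

/* Queries both values so they can be compared or logged side by side. */
struct cpu_counts query_cpu_counts(void)
{
    struct cpu_counts c;
    c.configured = sysconf(_SC_NPROCESSORS_CONF);
    c.online = sysconf(_SC_NPROCESSORS_ONLN);
    return c;
}
```

Printing both fields inside and outside the restricted environment would show whether this difference plays any role in the reported behavior.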
Another upper bound could be obtained by calling sched_getaffinity and counting the affinity bits.
@brada4 I never mentioned "containers are broken", merely that detection of number of CPU-cores could be improved to take cpuset-like restrictions into account. Detecting affinity should be sufficient in most cases. If I get things correctly, there are two mechanisms that could be at play here:
So since affinity masks are always a subgroup of cpusets, I guess it would be sufficient to detect the process' affinity mask (with |
The sched_getaffinity response is not filtered by cgroups. But it can be used to see taskset (which is very uncommon, usually used to start a realtime task on isolated CPUs). I hope you know that setting affinity is not the default. And there is no info call that returns the CPU count taking cgroups into account.
Well, affinity cannot be set to CPUs that are not part of the cpuset, so yes, it kind of is:
Also, you can get the list of allowed CPUs within the cpuset from
Setting affinity or defining cpusets may not be default on a laptop with a single user, but it's very common in the world of containers and HPC centers, which is probably the place where people care the most about performance. But anyway, I'm just trying to point out an area where OpenBLAS could be enhanced and give more performance for users out of the box. If that's not of interest to you, so be it.
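Wherever a list of allowed CPUs is read from (on Linux, the `Cpus_allowed_list` field of `/proc/self/status` is one place that exposes such a list; whether that is the source meant above is not known), it normally uses the cpuset list format, for example `0-3,8,10-11`. Counting the entries could look roughly like this. The function name is hypothetical, and the parser is a sketch that does not guard against absurdly large ranges.

```c
#include <stdlib.h>

/* Hypothetical helper: counts the CPUs named in a cpuset-style list such
 * as "0-3,8,10-11". A trailing newline is tolerated. Returns -1 if the
 * input is malformed. */
int count_cpus_in_list(const char *list)
{
    int count = 0;
    const char *p = list;

    while (*p != '\0' && *p != '\n') {
        char *end;
        long first = strtol(p, &end, 10);
        if (end == p || first < 0)
            return -1; /* no digits, or a negative CPU number */
        long last = first;
        p = end;

        if (*p == '-') { /* a range "first-last" */
            p++;
            last = strtol(p, &end, 10);
            if (end == p || last < first)
                return -1;
            p = end;
        }
        count += (int)(last - first + 1);

        if (*p == ',')
            p++; /* move on to the next entry */
        else if (*p != '\0' && *p != '\n')
            return -1; /* unexpected character */
    }
    return count;
}
```

For the 16-core example from the original report, a list of `0-3` would yield 4, which is the thread count that would avoid oversubscription.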
Note that CPU affinity handling is disabled by default (NO_AFFINITY=1 set in Makefile.rule), so you may not benefit from any code that possibly is cpuset-aware already if you are using a default build. (I think the reasoning for this was that dynamically loading an affinity-aware OpenBLAS from python or similar could lock the entire interpreter onto just the CPU(s) used by OpenBLAS.)
IMO the whole awareness comes down to getting the population count from the affinity mask, in the no-affinity case too?
Some observations:
with the potential drawback that the CPU_ macros are only provided since about glibc 2.6 (probably even later for the full CPU_ALLOC,CPU_COUNT_S,CPU_FREE that would be necessary for >1023 processors only, the minimal version would just declare |
NetBSD: pthread_getaffinity_mp() and has some macro requirements, nor OS has any sort of containers. |
Should be fixed now in 0.2.20
Reverting for now as my usage of the __GLIBC_PREREQ macro breaks the build for Android and other non-glibc systems. cf. #630
PR #1520 fixed another regression caused by this, so should be good to go.
Hi!
I understand that OpenBLAS tries to automatically detect the number of CPU cores on a machine at runtime, to determine the number of threads to start when {OPENBLAS,GOTO,OMP}_NUM_THREADS is not set.
It works fine in most cases, but when the process runs in a cgroup context, for instance one where the cpuset subsystem is in use, it may result in less-than-optimal behavior.
For instance, on a 16-core machine, if a process runs inside a cgroup where 4 CPUs have been allocated via cpuset, OpenBLAS will start 16 threads, which will be pinned on just 4 CPU cores and will compete with each other. In the end, the performance will be about 1/4th of what it would have been by just starting 4 threads.
So I'm wondering if any thought has been given to this already, and how OpenBLAS could try to detect whether it's running in a constrained context, in order to properly allocate the resources it can use.
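The behavior asked for here could be illustrated with a small selection rule: honor an explicit thread-count setting if there is one, otherwise use the smaller of the detected and the usable CPU counts. This is a sketch of the requested behavior, not of how OpenBLAS actually resolves its environment variables; the function name and parameters are hypothetical.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: picks a thread count.
 *   env_value     - contents of a *_NUM_THREADS variable, or NULL if unset
 *   detected_cpus - CPUs found on the machine (e.g. 16)
 *   usable_cpus   - CPUs the process may actually run on (e.g. 4),
 *                   or 0 if unknown */
int pick_thread_count(const char *env_value, int detected_cpus, int usable_cpus)
{
    if (env_value != NULL) {
        char *end;
        long requested = strtol(env_value, &end, 10);
        /* Accept only a complete, positive integer that fits in an int. */
        if (end != env_value && *end == '\0' &&
            requested > 0 && requested <= INT_MAX)
            return (int)requested;
    }

    int n = detected_cpus;
    if (usable_cpus > 0 && usable_cpus < n)
        n = usable_cpus; /* never start more threads than usable CPUs */
    return n > 0 ? n : 1;
}
```

With the numbers from the example, `pick_thread_count(NULL, 16, 4)` gives 4 threads instead of 16, while an explicit setting such as `"8"` is still respected.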
Thanks!