|
| 1 | +# Concepts |
| 2 | + |
| 3 | +Hierarchy from top to bottom: |
| 4 | + |
| 5 | +- Host: the entire system |
| 6 | +- Device group: multiple devices, e.g. one GPU and one CPU. |
| 7 | +- Compute device |
| 8 | +- Compute unit |
| 9 | +- Processing element |
| 10 | +- Work group |
| 11 | +- Work item |
| 12 | + |
| 13 | +## TODO |
| 14 | + |
| 15 | +GPU vs CPU hardware level. |
| 16 | + |
| 17 | +<https://youtu.be/e-2bTxKuS2U?list=PLTfYiv7-a3l7mYEdjk35wfY-KQj5yVXO2&t=319> claims that the GPU has no cache.
| 18 | + |
| 19 | +## Platform |
| 20 | + |
| 21 | +TODO what is a platform? |
| 22 | + |
| 23 | +<http://stackoverflow.com/questions/3444664/does-any-opencl-host-have-more-than-one-platform> |
| 24 | + |
| 25 | +## Compute device |
| 26 | + |
| 27 | +One CPU, one GPU, etc. |
| 28 | + |
| 29 | +## Compute unit |
| 30 | + |
| 31 | +TODO vs core? |
| 32 | + |
| 33 | +The number of compute units of a device can be obtained with `clGetDeviceInfo`, passing `CL_DEVICE_MAX_COMPUTE_UNITS` as the parameter name.
| 34 | + |
| 35 | +## Processing element |
| 36 | + |
| 37 | +TODO |
| 38 | + |
| 39 | +## Work group |
| 40 | + |
| 41 | +Contains many work items. |
| 42 | + |
| 43 | +Work items inside the same work group can share local memory, and can synchronize. |
| 44 | + |
| 45 | +Work groups have a maximum size (otherwise the concept wouldn't even exist). |
| 46 | + |
| 47 | +Ideally we would like to have a single work group for all items, as that would allow us to worry less about the location of memory on the Global / Constant / Local / Private hierarchy. |
| 48 | + |
| 49 | +But memory localization on GPUs is important enough that OpenCL exposes this extra level. |
| 50 | + |
| 51 | +Synchronization only works inside a single work group: <http://stackoverflow.com/questions/5895001/opencl-synchronization-between-work-groups>
| 52 | + |
| 53 | +### Local size |
| 54 | + |
| 55 | +Size of the work group. |
| 56 | + |
| 57 | +On CPU: always 1. TODO why? |
| 58 | + |
| 59 | +On GPU: must divide the global size.
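
When the problem size is not a multiple of the local size, a common workaround is to pad the global size up to the next multiple. The following is a minimal sketch of that arithmetic in plain C. The helper names `round_up_global_size` and `num_work_groups` are hypothetical and are not part of the OpenCL API.

```c
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of local_size.
 * Assumes local_size > 0. */
size_t round_up_global_size(size_t n, size_t local_size) {
    return ((n + local_size - 1) / local_size) * local_size;
}

/* Hypothetical helper: number of work groups the padded global size yields. */
size_t num_work_groups(size_t n, size_t local_size) {
    return round_up_global_size(n, local_size) / local_size;
}
```

The padded value would then be passed as the global work size to `clEnqueueNDRangeKernel`, and the kernel would typically skip the extra work items with a guard such as `if (get_global_id(0) < n)`, where `n` is an assumed kernel argument holding the real problem size.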
| 60 | + |
| 61 | +### Uniform work group |
| 62 | + |
| 63 | +### Non-uniform work group |
| 64 | + |
| 65 | +Work groups with different sizes. |
| 66 | + |
| 67 | +Application: take care of edge cases of the data, e.g. image edges: <https://software.intel.com/en-us/articles/opencl-20-non-uniform-work-groups> |
| 68 | + |
| 69 | +## Work item |
| 70 | + |
| 71 | +Each work item runs your kernel code in parallel with the others.
| 72 | + |
| 73 | +A work item can be seen as a thread.
| 74 | + |
| 75 | +Contains private memory, which no other work item can see. |
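
A work item can be located either by its global ID or by the pair (work group ID, local ID). The sketch below models that relationship in plain C for the one-dimensional case, assuming uniform work groups and no global offset. Under those assumptions it mirrors `get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)`. The type `work_item_pos` and both functions are hypothetical names used for illustration only.

```c
#include <stddef.h>

/* Hypothetical type: position of a work item inside the index space. */
typedef struct {
    size_t group_id; /* which work group the item belongs to */
    size_t local_id; /* index of the item inside that work group */
} work_item_pos;

/* Split a 1D global ID into (group ID, local ID).
 * Assumes local_size > 0, uniform work groups, and no global offset. */
work_item_pos split_global_id(size_t global_id, size_t local_size) {
    work_item_pos p;
    p.group_id = global_id / local_size;
    p.local_id = global_id % local_size;
    return p;
}

/* Inverse mapping: rebuild the global ID from (group ID, local ID). */
size_t join_global_id(work_item_pos p, size_t local_size) {
    return p.group_id * local_size + p.local_id;
}
```

For example, with a local size of 64, global ID 130 falls in work group 2 at local ID 2.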
| 76 | + |
| 77 | +## Local and Private memory |
| 78 | + |
| 79 | +TODO: why use those at all instead of global memory? |
| 80 | + |
| 81 | +- <http://stackoverflow.com/questions/21872810/whats-the-advantage-of-the-local-memory-in-opencl> |
| 82 | +- <http://stackoverflow.com/questions/9885880/effect-of-private-memory-in-opencl> |
| 83 | + |
| 84 | +Might be faster, and global memory is limited. |
| 85 | + |
| 86 | +HandsOnOpenCL Example 8 shows how matrix multiplication becomes 10x faster with some local memory usage, which suggests that memory access was the bottleneck.
| 87 | + |
| 88 | +It also shows how we must make an explicit copy to use private memory. |
| 89 | + |
| 90 | +### Local memory |
| 91 | + |
| 92 | +- <http://stackoverflow.com/questions/8888718/how-to-declare-local-memory-in-opencl> |
| 93 | +- <http://stackoverflow.com/questions/2541929/how-do-i-use-local-memory-in-opencl> |
| 94 | +- <http://stackoverflow.com/questions/17574570/create-local-array-dynamic-inside-opencl-kernel> |