Commit e52e447

Start opencl for real
1 parent 5475036 commit e52e447

30 files changed

+1010
-163
lines changed

bullet.md

+17
@@ -0,0 +1,17 @@
+# Bullet
+
+## Build
+
+Tested on Bullet 2.83.5 in Ubuntu 15.10.
+
+    git clone https://github.com/bulletphysics/bullet3
+    cd bullet3/build3
+    ./premake4_linux gmake
+    cd gmake
+    make
+
+Outputs are `.a` files and executables under `bin/`. The most interesting is:
+
+    ./bin/App_ExampleBrowser_gmake_x64_release
+
+which lets you view the examples under `examples/` with OpenGL and interact with them by dragging the mouse.

c/README.md

+1
@@ -64,6 +64,7 @@
 1. [Parameter without name](parameter_without_name.c)
 1. [Static array argument](static_array_argument.c)
 1. [_Noreturn](noreturn.c)
+1. [Identifier list](identifier_list.c)
 1. [Operator](operator.c)
 1. [sizeof()](sizeof.c)
 1. [Sequence point](sequence_point.c)

c/function.c

+5-6
@@ -1,5 +1,4 @@
-/*
-# function
+/*# function
 
 A function is basically a branch, but in which you have to:
 
@@ -462,13 +461,13 @@ int main() {
 
 http://stackoverflow.com/questions/5481579/whats-the-difference-between-function-prototype-and-declaration
 
-- Prototype is a declaration that specifies the arguments.
+- Prototype is a declaration that specifies the arguments.
 Only a single prototype can exist.
 
-- a declaration can not be a prototype if it does not have any arguments.
+- a declaration can not be a prototype if it does not have any arguments.
 The arguments are left unspecified.
 
-- to specify a prototype that takes no arguments, use `f(void)`
+- to specify a prototype that takes no arguments, use `f(void)`
 
 In C++ the insanity is reduced, and every declaration is a prototype,
 so `f()` is the same as `f(void)`.
@@ -543,7 +542,7 @@ int main() {
 /*
 # K&R function declaration
 
-This form of funciton declaration, while standard,
+This form of function declaration, while standard,
 is almost completely obsolete and forgotten today.
 
 It is however still ANSI C.

c/identifier_list.c

+26
@@ -0,0 +1,26 @@
+/*
+# Identifier list function declarator
+
+Old style thing that should never be done today.
+*/
+
+#include "common.h"
+
+/* TODO without definition. Should never be done. Conforming or not? */
+/*int f(x, y);*/
+
+int f(x, y)
+    int x;
+    int y;
+{ return x + y; }
+
+/* Also an identifier list, here empty: the identifier list is the only optional one. */
+void g() {}
+
+/* Parameter type list. This one is not optional. */
+void h(void) {}
+
+int main() {
+    assert(f(1, 2) == 3);
+    return EXIT_SUCCESS;
+}
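For comparison, here is a minimal sketch of the same three functions written with prototype-style declarators, the form that modern C code uses. The names `f_proto`, `g_proto` and `h_proto` are made up for this illustration and are not part of the commit.

```c
#include <assert.h>

/* Prototype-style definition: the parameter types are written inside the
 * parentheses, so the compiler can check the arguments at every call site. */
int f_proto(int x, int y) { return x + y; }

/* `(void)` is the prototype way of saying "takes no arguments". */
void g_proto(void) {}
void h_proto(void) {}

/* Hypothetical helper that exercises the functions above.
 * Returns 1 if the sum is the expected one. */
int check_prototypes(void) {
    g_proto();
    h_proto();
    return f_proto(1, 2) == 3;
}
```

With these declarations a call such as `f_proto(1)` is rejected at compile time, which is the check that the identifier-list form cannot provide.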

c/interactive/README.md

+1
@@ -10,3 +10,4 @@ Programs in this directory should be run manually one by one because they do thi
 1. [Command line arguments](command_line_arguments.c)
 1. [abort](abort.c)
 1. [clock](clock.c)
+1. [Ugly grammar](ugly_grammar.c)

c/interactive/ugly_grammar.c

+36
@@ -0,0 +1,36 @@
+/*
+# Ugly grammar
+
+C allows for grammar obscenities to be compatible with a distant past.
+
+Here are some perfectly legal jewels. Don't compile with `-pedantic-errors`, only `-std=c99`.
+*/
+
+#include "common.h"
+
+/* After you've removed -pedantic-errors. */
+#define ON 1
+
+#if ON
+/* Empty declaration. */
+int;
+
+/* Declaration without type. */
+a;
+f(void);
+int dec_arg_no_type(x, y)
+{ return 1; }
+
+/* Declaration-list function arguments in declaration. TODO should be illegal? */
+int g(x, y);
+
+/* Declaration-list function arguments in definition. */
+int g(x, y)
+    int x;
+    int y;
+{ return 1; }
+#endif
+
+int main(void) {
+    return EXIT_SUCCESS;
+}

opencl/Makefile.bak

-9
This file was deleted.

opencl/README.md

+11-3
@@ -2,12 +2,20 @@
 
 1. [Getting started](getting-started.md)
 1. Examples
-1. [min](min.c)
-1. [hello_world](hello_world.c)
+1. [increment](inc.c)
+1. [Build error](build_error.c)
+1. [Pass by value](pass_by_value.c)
+1. [Work item built-ins](work_item_builtin.c)
+1. [Increment vector](inc_vector.c)
+1. [Vector built-in](vector_builtin.c)
 1. Tools
 1. [clinfo](clinfo.md)
 1. Theory
 1. [Introduction](introduction.md)
-1. [Concepts](concepts.md)
+1. [Implementations](implementations.md)
+1. [Alternatives](alternatives.md)
+1. [Architecture](architecture.md)
+1. [C](c.md)
+1. [Host API](host-api.md)
 1. [Bibliography](bibliography.md)
 1. [TODO](TODO.md)

opencl/TODO.md

+6-1
@@ -1,3 +1,8 @@
 # TODO
 
-1. Compare speeds of `CL_DEVICE_TYPE_GPU` and `CL_DEVICE_TYPE_CPU`.
+1. synchronization, work_group_barrier https://www.khronos.org/registry/cl/sdk/2.1/docs/man/xhtml/work_group_barrier.html || http://stackoverflow.com/questions/7673990/in-opencl-what-does-mem-fence-do-as-opposed-to-barrier https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/mem_fence.html `mem_fence` (TODO not in OpenCL 2?)
+1. images
+1. local and private memory to optimize things. Done in HandsOnOpenCL exercise 8, chapter 21 of the 2011 OpenCL Programming Guide.
+1. create a bunch of educational and actually useful examples where the GPU beats the CPU, and time them
+1. understand why kernels / work items / groups are SIMD, even if they seem completely independent. How does it work? Can they only run in parallel if the same instruction is applied to all work items at once? What breaks its efficiency? Branching clearly does: we could do a `switch (get_global_id(0))` and have completely different code running on each work item. Looks like that is correct: https://news.ycombinator.com/item?id=1969631 | http://stackoverflow.com/questions/5897454/conditionals-in-gpu-programming
+1. how much parallelism do GPUs actually have? http://stackoverflow.com/questions/6490572/cuda-how-many-concurrent-threads-in-total | http://gamedev.stackexchange.com/questions/17243/how-many-parallel-units-does-a-gpu-have Depends on what that means, data parallelism? Don't forget that CPUs have 4-wide SIMD nowadays.
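The branching item above can be made concrete with a toy cost model in plain C. This is only a sketch under a stated assumption: lanes that execute in lockstep have to step through every branch that at least one lane takes. The function `lockstep_cost`, the constant `NUM_BRANCHES` and the cost units are hypothetical and describe no specific GPU.

```c
#include <stddef.h>

#define NUM_BRANCHES 4

/*
 * Toy cost model (an assumption, not a measurement): the cost of a group of
 * lockstep lanes is the sum of the costs of the distinct branches taken.
 *
 * branch_of_lane[i]: branch index chosen by lane i (reduced modulo NUM_BRANCHES).
 * branch_cost[b]:    cost, in arbitrary units, of running branch b once.
 */
unsigned lockstep_cost(const unsigned *branch_of_lane, size_t lanes,
                       const unsigned *branch_cost) {
    int taken[NUM_BRANCHES] = {0};
    unsigned total = 0;
    size_t i;

    /* Mark every branch that at least one lane wants to run. */
    for (i = 0; i < lanes; i++)
        taken[branch_of_lane[i] % NUM_BRANCHES] = 1;

    /* Each taken branch is paid for once by the whole group. */
    for (i = 0; i < NUM_BRANCHES; i++)
        if (taken[i])
            total += branch_cost[i];
    return total;
}
```

Under this model, four lanes that all take the same branch pay for one branch, while four lanes that each take a different branch pay for all four, which is the efficiency loss the item asks about.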

opencl/alternatives.md

+22-11
@@ -6,30 +6,41 @@ NVIDIA's. More closed since controlled by NVIDIA. Also more popular for the same
 
 <https://www.reddit.com/r/programming/comments/49uw97/cuda_reverse_engineered_to_run_on_nonnvidia/>
 
+<https://en.wikipedia.org/wiki/CUDA> NVIDIA's, only runs on NVIDIA hardware. TODO could AMD implement it legally without paying royalties to NVIDIA?
+
+## OpenMP
+
+<http://stackoverflow.com/questions/7263193/opencl-performance-vs-openmp>
+
 ## RenderScript
 
 Google's choice for Android: <http://stackoverflow.com/questions/14385843/why-did-google-choose-renderscript-instead-of-opencl>
 
 Google somewhat opposes OpenCL, maybe because it was created by Apple?
 
-## Vulkan
+## Metal
 
-<https://en.wikipedia.org/wiki/Vulkan_%28API%29>
+<https://en.wikipedia.org/wiki/Metal_%28API%29>
 
-Also by Khronos.
+Apple's response to Google's RenderScript.
 
-TODO why another?
+## DirectX
 
-- <http://gamedev.stackexchange.com/questions/96014/what-is-vulkan-and-how-does-it-differ-from-opengl>
+Microsoft, Windows, Xbox.
 
-Derived from <https://en.wikipedia.org/wiki/Mantle_%28API%29> by AMD, now abandoned in favor of Vulkan, and will somewhat be the new OpenGL.
+## Cilk
 
-## Metal
+<https://en.wikipedia.org/wiki/Cilk>
 
-<https://en.wikipedia.org/wiki/Metal_%28API%29>
+Intel's.
 
-Apple's response to Google's RenderScript.
+## DirectCompute
 
-## DirectX
+<https://en.wikipedia.org/wiki/DirectCompute>
 
-Microsoft, Windows, Xbox.
+Microsoft's.
+
+## Unified parallel C
+
+- <https://en.wikipedia.org/wiki/Unified_Parallel_C>
+- OpenGL compute shaders <http://stackoverflow.com/questions/15868498/what-is-the-difference-between-opencl-and-opengls-compute-shader>

opencl/applications.md

+22
@@ -1,8 +1,30 @@
 # Applications
 
+For an application to experience speedup compared to the CPU, it must:
+
+- be highly parallelizable
+- do a lot of work per input byte, because IO is very expensive
+
+## Actual applications
+
 - Monte Carlo
 
 - PDEs
 
 - <https://en.wikipedia.org/wiki/Black%E2%80%93Scholes_model>
 - Reverse Time Migration: RTM <http://www.slb.com/services/seismic/geophysical_processing_characterization/dp/technologies/depth/prestackdepth/rtm.aspx>
+
+Matrix multiplication:
+
+- <http://hpclab.blogspot.fr/2011/09/is-gpu-good-for-large-vector-addition.html>
+- <https://developer.nvidia.com/cublas>
+
+Not surprising, since rendering is just a bunch of matrix multiplications, with fixed matrices and varying vectors.
+
+Sparse: <http://stackoverflow.com/questions/3438826/sparse-matrix-multiplication-on-gpu-or-cpu>
+
+Bolt: GPU-powered C++ template library by AMD, modeled on the STL: <http://developer.amd.com/tools-and-sdks/opencl-zone/bolt-c-template-library/>
+
+## Non-applications
+
+Vector addition. Too little work per input byte (1 CPU cycle). <https://forums.khronos.org/showthread.php/7741-CPU-faster-in-vector-addition-than-GPU>, <http://stackoverflow.com/questions/15194798/vector-step-addition-slower-on-cuda>, <http://hpclab.blogspot.fr/2011/09/is-gpu-good-for-large-vector-addition.html>
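The "work per input byte" criterion can be estimated with a back-of-the-envelope calculation. The sketch below counts floating point operations per byte transferred, assuming 4-byte floats, the naive algorithms, `n > 0`, and that every input and output element is transferred exactly once. The function names are hypothetical.

```c
#include <stddef.h>

/* Assumed size of one float in bytes. */
#define FLOAT_BYTES 4.0

/* c[i] = a[i] + b[i] on n floats: n additions,
 * 3 * n floats moved (read a, read b, write c). */
double vector_add_intensity(size_t n) {
    double flops = (double)n;
    double bytes = 3.0 * (double)n * FLOAT_BYTES;
    return flops / bytes;
}

/* Naive n x n matrix multiplication: n multiply-adds (2 * n operations) for
 * each of the n * n outputs, 3 * n * n floats moved (read A, read B, write C). */
double matmul_intensity(size_t n) {
    double flops = 2.0 * (double)n * (double)n * (double)n;
    double bytes = 3.0 * (double)n * (double)n * FLOAT_BYTES;
    return flops / bytes;
}
```

Under these assumptions vector addition stays at 1/12 of an operation per byte whatever the size, while matrix multiplication grows as `n / 6`, which is consistent with one being listed as an application and the other not.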

opencl/architecture.md

+94
@@ -0,0 +1,94 @@
+# Concepts
+
+Hierarchy from top to bottom:
+
+- Host: the entire system
+- Device group: multiple devices, e.g. one GPU and one CPU.
+- Compute device
+- Compute unit
+- Processing element
+- Work group
+- Work item
+
+## TODO
+
+GPU vs CPU hardware level.
+
+<https://youtu.be/e-2bTxKuS2U?list=PLTfYiv7-a3l7mYEdjk35wfY-KQj5yVXO2&t=319> mentions GPU has no cache.
+
+## Platform
+
+TODO what is a platform?
+
+<http://stackoverflow.com/questions/3444664/does-any-opencl-host-have-more-than-one-platform>
+
+## Compute device
+
+One CPU, one GPU, etc.
+
+## Compute unit
+
+TODO vs core?
+
+The number of compute units can be obtained by calling `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS`.
+
+## Processing element
+
+TODO
+
+## Work group
+
+Contains many work items.
+
+Work items inside the same work group can share local memory, and can synchronize.
+
+Work groups have a maximum size (otherwise the concept wouldn't even exist).
+
+Ideally we would like to have a single work group for all items, as that would allow us to worry less about the location of memory on the Global / Constant / Local / Private hierarchy.
+
+But memory localization on GPUs is important enough that OpenCL exposes this extra level.
+
+Synchronization only works inside a single work group: <http://stackoverflow.com/questions/5895001/opencl-synchronization-between-work-groups>
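One consequence of synchronization being limited to a work group is that reductions are usually split in two stages. The following is a host-side C model of that structure, not OpenCL kernel code. The function names and the choice of a sum are assumptions made for the illustration.

```c
#include <stddef.h>

/*
 * Stage 1: every "work group" of local_size consecutive items sums its own
 * slice of the input into partial[group]. In a real kernel this is the part
 * where work items of one group could cooperate through local memory.
 *
 * Returns the number of groups. partial must have room for that many floats.
 * Assumes local_size > 0.
 */
size_t group_partial_sums(const float *in, size_t n, size_t local_size,
                          float *partial) {
    size_t groups = (n + local_size - 1) / local_size;
    size_t g, i;

    for (g = 0; g < groups; g++) {
        size_t start = g * local_size;
        size_t end = start + local_size;
        float acc = 0.0f;

        if (end > n)
            end = n; /* The last group may be only partially filled. */
        for (i = start; i < end; i++)
            acc += in[i];
        partial[g] = acc;
    }
    return groups;
}

/*
 * Stage 2: the partial results are combined in a separate pass, since the
 * groups cannot wait for each other inside stage 1.
 */
float combine_partial_sums(const float *partial, size_t groups) {
    float total = 0.0f;
    size_t g;

    for (g = 0; g < groups; g++)
        total += partial[g];
    return total;
}
```

In this model the second stage stands in for a second kernel launch or a host-side loop, either of which only starts once all groups have finished.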
+
+### Local size
+
+Size of the work group.
+
+On CPU: always 1. TODO why?
+
+On GPU: must divide the global size.
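When the amount of data is not a multiple of the local size, a common workaround is to round the global size up to the next multiple and let the extra work items do nothing. The helper below is a small sketch of that rounding in host-side C. The name `round_up_global_size` is hypothetical.

```c
#include <stddef.h>

/*
 * Smallest multiple of local_size that is >= work_items.
 * Assumes local_size > 0 and that the sum below does not overflow.
 */
size_t round_up_global_size(size_t work_items, size_t local_size) {
    return ((work_items + local_size - 1) / local_size) * local_size;
}
```

With 1000 items and a local size of 64 this gives 1024, so the kernel would have to compare its global ID against the real item count and return early for the 24 padding work items.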
+
+### Uniform work group
+
+### Non-uniform work group
+
+Work groups with different sizes.
+
+Application: take care of edge cases of the data, e.g. image edges: <https://software.intel.com/en-us/articles/opencl-20-non-uniform-work-groups>
+
+## Work item
+
+Each work item runs your kernel code in parallel to the other ones.
+
+A work item can be seen as a thread.
+
+Contains private memory, which no other work item can see.
+
+## Local and Private memory
+
+TODO: why use those at all instead of global memory?
+
+- <http://stackoverflow.com/questions/21872810/whats-the-advantage-of-the-local-memory-in-opencl>
+- <http://stackoverflow.com/questions/9885880/effect-of-private-memory-in-opencl>
+
+Might be faster, and global memory is limited.
+
+HandsOnOpenCL Example 8 shows how matrix multiplication becomes 10x faster with some local memory usage. Looks like memory access was the bottleneck.
+
+It also shows how we must make an explicit copy to use private memory.
+
+### Local memory
+
+- <http://stackoverflow.com/questions/8888718/how-to-declare-local-memory-in-opencl>
+- <http://stackoverflow.com/questions/2541929/how-do-i-use-local-memory-in-opencl>
+- <http://stackoverflow.com/questions/17574570/create-local-array-dynamic-inside-opencl-kernel>
