cirosantilli
diff --git a/‎bullet.md
+17 b/‎bullet.md
+17
diff --git a/‎c/README.md
+1 b/‎c/README.md
+1
diff --git a/‎c/function.c
+5-6 b/‎c/function.c
+5-6
diff --git a/‎c/identifier_list.c
+26 b/‎c/identifier_list.c
+26
diff --git a/‎c/interactive/README.md
+1 b/‎c/interactive/README.md
+1
diff --git a/‎c/interactive/ugly_grammar.c
+36 b/‎c/interactive/ugly_grammar.c
+36
diff --git a/‎opencl/Makefile.bak
-9 b/‎opencl/Makefile.bak
-9
diff --git a/‎opencl/README.md
+11-3 b/‎opencl/README.md
+11-3
diff --git a/‎opencl/TODO.md
+6-1 b/‎opencl/TODO.md
+6-1
diff --git a/‎opencl/alternatives.md
+22-11 b/‎opencl/alternatives.md
+22-11
diff --git a/‎opencl/applications.md
+22 b/‎opencl/applications.md
+22
diff --git a/‎opencl/architecture.md
+94 b/‎opencl/architecture.md
+94
@@ -0,0 +1,17 @@
+# Bullet
+
+## Build
+
+Tested on Bullet 2.83.5 in Ubuntu 15.10.
+
+    git clone https://github.com/bulletphysics/bullet3
+    cd bullet3/build3
+    ./premake4_linux gmake
+    cd gmake
+    make
+
+Outputs are `.a` and executables under `bin/`. The most interesting is:
+
+    ./bin/App_ExampleBrowser_gmake_x64_release
+
+which lets you view the examples under `examples/` with OpenGL and interact with them by dragging the mouse.
@@ -64,6 +64,7 @@
             1. [Parameter without name](parameter_without_name.c)
             1. [Static array argument](static_array_argument.c)
             1. [_Noreturn](noreturn.c)
+            1. [Identifier list](identifier_list.c)
     1.  [Operator](operator.c)
         1. [sizeof()](sizeof.c)
     1.  [Sequence point](sequence_point.c)
 
@@ -1,5 +1,4 @@
-/*
-# function
+/*# function
 
     A function is basically a branch, but in which you have to:
 
@@ -462,13 +461,13 @@ int main() {
 
                 http://stackoverflow.com/questions/5481579/whats-the-difference-between-function-prototype-and-declaration
 
-                - Prototype is a declaration that specifies the arguments.
+                -   Prototype is a declaration that specifies the arguments.
                     Only a single prototype can exist.
 
-                - a declaration can not be a prototype if it does not have any arguments.
+                -   a declaration can not be a prototype if it does not have any arguments.
                     The arguments are left unspecified.
 
-                - to specify a prototype that takes no arguments, use `f(void)`
+                -   to specify a prototype that takes no arguments, use `f(void)`
 
                 In C++ the insanity is reduced, and every declaration is a prototype,
                 so `f()` is the same as `f(void)`.
@@ -543,7 +542,7 @@ int main() {
     /*
     # K&R function declaration
 
-        This form of funciton declaration, while standard,
+        This form of function declaration, while standard,
         is almost completely obsolete and forgotten today.
 
         It is however still ANSI C.
 
@@ -0,0 +1,26 @@
+/*
+# Identifier list function declarator
+
+    Old style thing that should never be done today.
+*/
+
+#include "common.h"
+
+/* TODO without definition. Should never be done. Conforming or not? */
+/*int f(x, y);*/
+
+int f(x, y)
+    int x;
+    int y;
+{ return x + y; }
+
+/* Also an identifier list: the identifier list is the only one that may be left empty. */
+void g() {}
+
+/* Parameter type list. This one is not optional. */
+void h(void) {}
+
+int main() {
+    assert(f(1, 2) == 3);
+    return EXIT_SUCCESS;
+}
@@ -10,3 +10,4 @@ Programs in this directory should be run manually one by one because they do thi
 1. [Command line arguments](command_line_arguments.c)
 1. [abort](abort.c)
 1. [clock](clock.c)
+1. [Ugly grammar](ugly_grammar.c)
@@ -0,0 +1,36 @@
+/*
+# Ugly grammar
+
+    C allows for grammar obscenities to be compatible with a distant past.
+
+    Here are some perfectly legal jewels. Don't compile with `-pedantic-errors`, only `-std=c99`.
+*/
+
+#include "common.h"
+
+/* After you've removed -pedantic-errors. */
+#define ON 1
+
+#if ON
+/* Empty declaration. */
+int;
+
+/* Declaration without type. */
+a;
+f(void);
+int dec_arg_no_type(x, y)
+{ return 1; }
+
+/* Identifier-list function arguments in a declaration that is not a definition. TODO should be illegal? */
+int g(x, y);
+
+/* Declaration-list function arguments in definition. */
+int g(x, y)
+    int x;
+    int y;
+{ return 1; }
+#endif
+
+int main(void) {
+    return EXIT_SUCCESS;
+}
@@ -2,12 +2,20 @@
 
 1.  [Getting started](getting-started.md)
 1.  Examples
-    1.  [min](min.c)
-    1.  [hello_world](hello_world.c)
+    1.  [increment](inc.c)
+    1.  [Build error](build_error.c)
+    1.  [Pass by value](pass_by_value.c)
+    1.  [Work item built-ins](work_item_builtin.c)
+    1.  [Increment vector](inc_vector.c)
+    1.  [Vector built-in](vector_builtin.c)
 1.  Tools
     1.  [clinfo](clinfo.md)
 1.  Theory
     1.  [Introduction](introduction.md)
-    1.  [Concepts](concepts.md)
+    1.  [Implementations](implementations.md)
+    1.  [Alternatives](alternatives.md)
+    1.  [Architecture](architecture.md)
+    1.  [C](c.md)
+    1.  [Host API](host-api.md)
     1.  [Bibliography](bibliography.md)
     1.  [TODO](TODO.md)
@@ -1,3 +1,8 @@
 # TODO
 
-1. Compare speeds of `CL_DEVICE_TYPE_GPU` and `CL_DEVICE_TYPE_CPU`.
+1. synchronization: `work_group_barrier` <https://www.khronos.org/registry/cl/sdk/2.1/docs/man/xhtml/work_group_barrier.html> | <http://stackoverflow.com/questions/7673990/in-opencl-what-does-mem-fence-do-as-opposed-to-barrier>, `mem_fence` <https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/mem_fence.html> (TODO not in OpenCL 2?)
+1. images
+1. local and private memory to optimize things. Done in HandsOnOpenCL exercise 8 chapter 21 of 2011 OpenCL programming guide.
+1. create a bunch of educational and actually useful examples where the GPU beats the CPU, and time them
+1. understand why kernels / work items / groups are SIMD, even if they seem completely independent. How does it work? Can they only run in parallel if the same instruction is used on all kernels at once? What breaks its efficiency? Branching clearly does: we could do a `switch (get_global_id(0))` and have completely different code running on each kernel. Looks like that is correct: https://news.ycombinator.com/item?id=1969631 | http://stackoverflow.com/questions/5897454/conditionals-in-gpu-programming
+1. how much parallelism do GPUs actually have? http://stackoverflow.com/questions/6490572/cuda-how-many-concurrent-threads-in-total | http://gamedev.stackexchange.com/questions/17243/how-many-parallel-units-does-a-gpu-have Depends on what that means, data parallelism? Don't forget that CPUs have 4-wide SIMD nowadays.
@@ -6,30 +6,41 @@ NVIDIA's. More closed since controlled by NVIDIA. Also more popular for the same
 
 <https://www.reddit.com/r/programming/comments/49uw97/cuda_reverse_engineered_to_run_on_nonnvidia/>
 
+<https://en.wikipedia.org/wiki/CUDA> NVIDIA's, only runs on NVIDIA hardware. TODO could AMD implement it legally without paying royalties to NVIDIA?
+
+## OpenMP
+
+<http://stackoverflow.com/questions/7263193/opencl-performance-vs-openmp>
+
 ## RenderScript
 
 Google's choice for Android: <http://stackoverflow.com/questions/14385843/why-did-google-choose-renderscript-instead-of-opencl>
 
 Google somewhat opposes OpenCL, maybe because it was created by Apple?
 
-## Vulkan
+## Metal
 
-<https://en.wikipedia.org/wiki/Vulkan_%28API%29>
+<https://en.wikipedia.org/wiki/Metal_%28API%29>
 
-Also by Khronos.
+Apple's response to Google's RenderScript.
 
-TODO why another?
+## DirectX
 
-- <http://gamedev.stackexchange.com/questions/96014/what-is-vulkan-and-how-does-it-differ-from-opengl>
+Microsoft, Windows, Xbox.
 
-Derived from <https://en.wikipedia.org/wiki/Mantle_%28API%29> by AMD, now abandoned in favor of Vulkan, and will somewhat be the new OpenGL.
+## Cilk
 
-## Metal
+<https://en.wikipedia.org/wiki/Cilk>
 
-<https://en.wikipedia.org/wiki/Metal_%28API%29>
+Intel's
 
-Apple's response to Google's RenderScript.
+## DirectCompute
 
-## DirectX
+<https://en.wikipedia.org/wiki/DirectCompute>
 
-Microsoft, Windows, Xbox.
+Microsoft's
+
+## Unified Parallel C
+
+- <https://en.wikipedia.org/wiki/Unified_Parallel_C>
+- OpenGL compute shaders <http://stackoverflow.com/questions/15868498/what-is-the-difference-between-opencl-and-opengls-compute-shader>
@@ -1,8 +1,30 @@
 # Applications
 
+For an application to experience speedup compared to the CPU, it must:
+
+- be highly parallelizable
+- do a lot of work per input byte, because IO is very expensive
+
+## Actual applications
+
 -   Monte Carlo
 
 -   PDEs
 
     - <https://en.wikipedia.org/wiki/Black%E2%80%93Scholes_model>
     - Reverse Time Migration: RTM <http://www.slb.com/services/seismic/geophysical_processing_characterization/dp/technologies/depth/prestackdepth/rtm.aspx>
+
+Matrix multiplication:
+
+- <http://hpclab.blogspot.fr/2011/09/is-gpu-good-for-large-vector-addition.html>
+- <https://developer.nvidia.com/cublas>
+
+Not surprising, since rendering is just a bunch of matrix multiplications, with fixed matrices and varying vectors.
+
+Sparse: <http://stackoverflow.com/questions/3438826/sparse-matrix-multiplication-on-gpu-or-cpu>
+
+Bolt: C++ STL GPU powered implementation by AMD: <http://developer.amd.com/tools-and-sdks/opencl-zone/bolt-c-template-library/>
+
+## Non-applications
+
+Vector addition. Too little work per input byte (1 CPU cycle). <https://forums.khronos.org/showthread.php/7741-CPU-faster-in-vector-addition-than-GPU>, <http://stackoverflow.com/questions/15194798/vector-step-addition-slower-on-cuda> <http://hpclab.blogspot.fr/2011/09/is-gpu-good-for-large-vector-addition.html>
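The "work per input byte" argument can be made concrete with a rough count. The sketch below is not from the original notes: the function names are hypothetical, and it assumes `float` data, one addition per element for vector addition, `2 * n^3` operations for a naive `n x n` matrix multiplication, and that every input and output element crosses the host / device boundary exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Operations per byte transferred for adding two n-element float vectors:
 * n additions, and 3 * n floats moved (two inputs, one output).
 * Hypothetical helper, for illustration only. */
double vector_add_flops_per_byte(size_t n) {
    assert(n > 0);
    double dn = (double)n;
    return dn / (3.0 * dn * (double)sizeof(float));
}

/* Operations per byte transferred for a naive n x n float matrix multiplication:
 * 2 * n^3 operations (one multiply and one add per inner step),
 * and 3 * n^2 floats moved (two inputs, one output).
 * Hypothetical helper, for illustration only. */
double matrix_mul_flops_per_byte(size_t n) {
    assert(n > 0);
    double dn = (double)n;
    return (2.0 * dn * dn * dn) / (3.0 * dn * dn * (double)sizeof(float));
}
```

Under these assumptions, and with 4-byte floats, vector addition stays at 1/12 of an operation per byte whatever `n` is, while matrix multiplication grows as `n / 6`, which is consistent with one being listed as an application and the other as a non-application.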
@@ -0,0 +1,94 @@
+# Concepts
+
+Hierarchy from top to bottom:
+
+- Host: the entire system
+- Device group: multiple devices, e.g. one GPU and one CPU.
+- Compute device
+- Compute unit
+- Processing element
+- Work group
+- Work item
+
+## TODO
+
+GPU vs CPU hardware level.
+
+<https://youtu.be/e-2bTxKuS2U?list=PLTfYiv7-a3l7mYEdjk35wfY-KQj5yVXO2&t=319> mentions that GPUs have no cache.
+
+## Platform
+
+TODO what is a platform?
+
+<http://stackoverflow.com/questions/3444664/does-any-opencl-host-have-more-than-one-platform>
+
+## Compute device
+
+One CPU, one GPU, etc.
+
+## Compute unit
+
+TODO vs core?
+
+Can be obtained by calling `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS`.
+
+## Processing element
+
+TODO
+
+## Work group
+
+Contains many work items.
+
+Work items inside the same work group can share local memory, and can synchronize.
+
+Work groups have a maximum size (otherwise the concept wouldn't even exist).
+
+Ideally we would like to have a single work group for all items, as that would allow us to worry less about the location of memory on the Global / Constant / Local / Private hierarchy.
+
+But memory localization on GPUs is important enough that OpenCL exposes this extra level.
+
+Synchronization only works inside a single work group: <http://stackoverflow.com/questions/5895001/opencl-synchronization-between-work-groups>
+
+### Local size
+
+Size of the work group.
+
+On CPU: always 1. TODO why?
+
+On GPU: must divide the global size.
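When the data size is not a multiple of the chosen local size, a common workaround (an assumption here, not something these notes prescribe) is to round the global size up on the host and have the kernel ignore the extra work items. A minimal host-side sketch in plain C, with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if local_size evenly divides global_size, 0 otherwise. */
int local_size_divides(size_t global_size, size_t local_size) {
    assert(local_size > 0);
    return global_size % local_size == 0;
}

/* Returns the smallest multiple of local_size that is >= n.
 * The result can be used as a global size that the local size divides. */
size_t round_up_global_size(size_t n, size_t local_size) {
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}
```

For example, `round_up_global_size(1000, 64)` gives 1024. The kernel would then have to skip work items whose global id is 1000 or more, which is the kind of edge handling that non-uniform work groups are meant to avoid.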
+
+### Uniform work group
+
+### Non-uniform work group
+
+Work groups with different sizes.
+
+Application: take care of edge cases of the data, e.g. image edges: <https://software.intel.com/en-us/articles/opencl-20-non-uniform-work-groups>
+
+## Work item
+
+Each work item runs your kernel code in parallel to the other ones.
+
+A work item can be seen as a thread.
+
+Contains private memory, which no other work item can see.
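To see how work groups and work items relate, the following plain C sketch emulates a one-dimensional launch sequentially. It is only a mental model: a real device runs work items in parallel, the names `inc_kernel` and `emulate_launch` are hypothetical, and a zero global offset is assumed so that the global id is `group_id * local_size + local_id`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel body: each work item increments one element. */
static void inc_kernel(size_t global_id, int *data) {
    data[global_id]++;
}

/* Sequentially emulates num_groups work groups of local_size work items each.
 * data must hold at least num_groups * local_size elements.
 * Returns the number of work items that ran. */
size_t emulate_launch(size_t num_groups, size_t local_size, int *data) {
    size_t count = 0;
    assert(local_size > 0);
    for (size_t group_id = 0; group_id < num_groups; group_id++) {
        for (size_t local_id = 0; local_id < local_size; local_id++) {
            /* With a zero global offset, this is the id the work item would see. */
            size_t global_id = group_id * local_size + local_id;
            inc_kernel(global_id, data);
            count++;
        }
    }
    return count;
}
```

Calling `emulate_launch(2, 4, data)` on an 8-element array touches every element exactly once, matching a global size of 8 with a local size of 4.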
+
+## Local and Private memory
+
+TODO: why use those at all instead of global memory?
+
+- <http://stackoverflow.com/questions/21872810/whats-the-advantage-of-the-local-memory-in-opencl>
+- <http://stackoverflow.com/questions/9885880/effect-of-private-memory-in-opencl>
+
+Might be faster, and global memory is limited.
+
+HandsOnOpenCL exercise 8 shows how matrix multiplication becomes 10x faster with some local memory usage. Looks like memory access was the bottleneck.
+
+It also shows how we must make an explicit copy to use private memory.
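The explicit copy can be pictured with a plain C emulation of a single work item that computes one row of `C = A * B`. This is a sketch of the general pattern, not the HandsOnOpenCL code: `N`, `mat_mul_row` and `a_private` are hypothetical names, and the stack array merely stands in for private memory, so no speedup should be expected from running it on a CPU.

```c
#include <assert.h>
#include <stddef.h>

#define N 4 /* Hypothetical matrix order, kept small for the sketch. */

/* Emulates one work item computing row `row` of C = A * B for N x N
 * row-major float matrices. The row of A is first copied into a_private,
 * which stands in for the work item's private memory, and is then reused
 * for every column of B instead of being re-read from a. */
void mat_mul_row(size_t row, const float *a, const float *b, float *c) {
    float a_private[N];
    assert(row < N);
    /* The explicit copy: global data is brought into the private array once. */
    for (size_t k = 0; k < N; k++) {
        a_private[k] = a[row * N + k];
    }
    for (size_t col = 0; col < N; col++) {
        float sum = 0.0f;
        for (size_t k = 0; k < N; k++) {
            sum += a_private[k] * b[k * N + col];
        }
        c[row * N + col] = sum;
    }
}
```

In an actual kernel the array would be declared inside the kernel function, where variables without an address space qualifier are private to the work item.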
+
+### Local memory
+
+- <http://stackoverflow.com/questions/8888718/how-to-declare-local-memory-in-opencl>
+- <http://stackoverflow.com/questions/2541929/how-do-i-use-local-memory-in-opencl>
+- <http://stackoverflow.com/questions/17574570/create-local-array-dynamic-inside-opencl-kernel>