Originally, I created this project to store some of my notes from learning CUDA and CPU concurrency programming. If you are a college student busy with a graduation project, and you are not allowed to use, or are tired of configuring, the mainstream BLAS libraries, my library may be worth a try. I hope it can help you.
WARNINGS
Currently the source code only works on Windows.
Updates
1. I rewrote the CUDA kernel used in matrix multiplication; it is now dramatically faster.
2. I rewrote the CUDA kernel used in 2D convolution; it is now dramatically faster.
3. A new method called "im2col" is implemented to improve 2D convolution performance; it is well suited to convolutional neural networks.
4. Three memory pools were added, which optimize memory management.
5. I also created a thread pool based on the new features in C++11. It has some drawbacks: for example, CPU occupancy only reaches about 80% when the clock frequency is relatively low (2.2 GHz).
6. New GEMM operators on the CPU were added.
7. I rewrote the CUDA kernels used in fft1D and fft2D by vectorizing memory access and using shared memory where possible.
8. I added Tensor and TensorArray classes. At the same time, I created corresponding classes that operate on the GPU, to avoid unnecessary host allocations and memory copies between host and device.
9. I rewrote the NLM (non-local means) kernels, which now read uchar data directly and make full use of shared memory.
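To illustrate the "im2col" idea from item 3: each KxK patch of the input image is unrolled into one column of a matrix, so 2D convolution with many filters becomes a single GEMM. This is only a minimal CPU sketch (single channel, stride 1, "valid" padding); the names and memory layout are illustrative, not this library's actual API.

```cpp
#include <vector>
#include <cstddef>

// im2col sketch: row index = position inside the kernel window,
// column index = output pixel. Convolving then reduces to
// (filters as rows) x (this matrix) -- i.e. a GEMM.
std::vector<float> im2col(const std::vector<float>& src,
                          int H, int W, int K) {
    int outH = H - K + 1, outW = W - K + 1;  // "valid" convolution, stride 1
    std::vector<float> col(static_cast<std::size_t>(K) * K * outH * outW);
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
            for (int y = 0; y < outH; ++y)
                for (int x = 0; x < outW; ++x)
                    col[((ky * K + kx) * outH + y) * outW + x] =
                        src[(y + ky) * W + (x + kx)];
    return col;
}
```

The payoff is that a highly tuned GEMM (on CPU or GPU) does the heavy lifting, at the cost of duplicating overlapping pixels in memory.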
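Regarding the memory pools in item 4: the usual motivation is to replace many small malloc/free (or cudaMalloc/cudaFree) calls with recycling from a preallocated slab. The class below is a toy fixed-size pool sketching that idea; it is not the library's actual implementation, and the name `FixedPool` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Toy fixed-size memory pool: blocks are carved out of one
// preallocated slab and recycled through a free list, so
// allocate/deallocate are O(1) with no system allocator calls.
class FixedPool {
public:
    FixedPool(std::size_t blockSize, std::size_t count)
        : slab_(blockSize * count) {
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(slab_.data() + i * blockSize);
    }
    void* allocate() {
        if (free_.empty()) return nullptr;  // pool exhausted
        void* p = free_.back();
        free_.pop_back();
        return p;
    }
    void deallocate(void* p) { free_.push_back(static_cast<char*>(p)); }
private:
    std::vector<char> slab_;    // backing storage, freed once at destruction
    std::vector<char*> free_;   // currently unused blocks
};
```

A real pool would add alignment guarantees and several block-size classes (which may be why three pools are mentioned), but the recycling structure is the same.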
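The C++11 thread pool in item 5 can be built entirely from std::thread, std::mutex, std::condition_variable, and std::packaged_task. The sketch below shows that standard construction (workers block on a condition variable and pop tasks from a shared queue); the real pool in this repository may differ, and the `ThreadPool`/`submit` names are assumptions.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal C++11 thread pool: N workers drain a shared task queue.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lk(m_);
                        // Sleep until there is work or the pool is shutting down.
                        cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
                        if (stop_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task();  // run outside the lock
                }
            });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    template <class F>
    std::future<void> submit(F f) {
        auto task = std::make_shared<std::packaged_task<void()>>(std::move(f));
        auto fut = task->get_future();
        { std::lock_guard<std::mutex> lk(m_); tasks_.emplace([task] { (*task)(); }); }
        cv_.notify_one();
        return fut;
    }
private:
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
};
```

The sub-100% occupancy noted above is typical of a single-queue design: every push and pop contends on one mutex, so short tasks leave workers idle waiting for the lock; per-worker queues with work stealing are the common remedy.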
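On the CPU GEMM operators in item 6: the first optimization over the naive triple loop is usually cache blocking, i.e. tiling the loops so a small block of each matrix stays hot in cache. This is a generic sketch of that technique under the assumption of square row-major matrices, not the operators actually shipped in this library.

```cpp
#include <algorithm>
#include <vector>

// Cache-blocked C += A * B for square n x n row-major matrices.
// Tiling i/k/j keeps blk x blk tiles of A, B, C resident in cache;
// hoisting A[i*n+k] lets the inner j loop stream through B and C.
void gemm_blocked(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int n, int blk = 32) {
    for (int i0 = 0; i0 < n; i0 += blk)
        for (int k0 = 0; k0 < n; k0 += blk)
            for (int j0 = 0; j0 < n; j0 += blk)
                for (int i = i0; i < std::min(i0 + blk, n); ++i)
                    for (int k = k0; k < std::min(k0 + blk, n); ++k) {
                        float a = A[i * n + k];
                        for (int j = j0; j < std::min(j0 + blk, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Production GEMMs go further (packed panels, SIMD microkernels, multithreading), but blocking alone already changes the memory-traffic behavior from O(n^3) cache misses toward the compute-bound regime.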
Do not hesitate to contact me via [email protected] if you have any comments.