Fast transfer of small tensors #6
This makes sense. I initially didn't know why pinned CuPy tensors were getting faster performance, but a PyTorch engineer pointed it out, and I updated the 'How it works?' section: the pinned CuPy tensors aren't copying faster, they're using a different indexing kernel, which works better for CPUs with a lower number of cores. In either case, for smaller tensors there's probably not that much indexing going on, so it would make sense that PyTorch's pinned CPU tensors are faster. More details here. In your case, I would imagine the number of CPU cores wouldn't make too much of a difference, but out of curiosity, how many CPU cores are you working with?
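For illustration, here is a minimal timing sketch (hypothetical shapes, assumes a CUDA device is available; this is not the benchmark discussed below) comparing a small CPU->GPU copy from a pageable versus a pinned PyTorch CPU tensor:

```python
import time
import torch

gpu = torch.zeros(18, 3, device='cuda')
cpu_pageable = torch.zeros(18, 3)
cpu_pinned = torch.zeros(18, 3).pin_memory()

def time_copy(src, iters=1000):
    # average the host->device copy time over many iterations
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        gpu.copy_(src, non_blocking=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print('pageable:', time_copy(cpu_pageable))
print('pinned:  ', time_copy(cpu_pinned))
```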
@Santosh-Gupta, thanks for your reply. The Jetson Nano platform has a quad-core ARM A57 64-bit CPU.
So it seems the best use case for SpeedTorch is CPU<->GPU transfer of slices of big tensors? After I transfer the tensor to the GPU, I am creating one variable per row, like:
In this comment though, talking about a DataGadget (cpu-pinned??) object they mention:
So I am left wondering if somehow I could still use SpeedTorch to speed up this particular characteristic. I am not sure if you were able to look at my code, but is the following script a sensible way of getting the best performance?

```python
import numpy as np
import cupy as cp
import torch
import SpeedTorch

def __init__(self):
    # initialize CPU and GPU tensors
    # this will receive data in GPU
    self.data_gpu = torch.zeros(18, 3).to('cuda')
    self.data = np.zeros((18, 3))
    # this will receive all CPU data and transfer to GPU
    self.data_cpu = SpeedTorch.DataGadget(self.data, CPUPinn=True)  # is this using CPU memory?
    self.data_cpu.gadgetInit()
    # this will receive GPU data and transfer to CPU
    data2 = np.zeros((6, 3))
    self.return_cpu = SpeedTorch.DataGadget(data2, CPUPinn=True)  # is this using CPU memory?
    self.return_cpu.gadgetInit()

# inside a callback @ 100Hz
def control_law(self, new_data):  # new_data is an 18x3 np.array
    # update data in pinned CPU memory, would this still be using pinned memory?
    self.data_cpu.CUPYcorpus = cp.asarray(new_data)
    # transfer from CPU to GPU
    self.data_gpu[:] = self.data_cpu.getData(indexes=(slice(0, 18), slice(0, 3)))
    # slice all GPU data
    a = self.data_gpu[0]
    b = self.data_gpu[1]
    c = self.data_gpu[2:5]
    # etc etc etc
    h = self.data_gpu[17]
    # do various linear algebra operations in GPU with torch cuda tensors
    res = self.do_linear_algebra(a, b, ..., h)
    # transfer back to pinned CPU memory
    self.return_cpu.insertData(dataObject=res, indexes=(slice(0, 6), slice(0, 3)))
    # continue processing
    return cp.asnumpy(self.return_cpu.CUPYcorpus)
```

Previous to this, I modified SpeedTorch's CUPYLive.py to accept a numpy array as input when initializing:

```python
def gadgetInit(self):
    if self.CPUPinn == True:
        cupy.cuda.set_allocator(my_pinned_allocator)
    if type(self.fileName) == np.ndarray:
        self.CUPYcorpus = cupy.asarray(self.fileName)
    else:
        self.CUPYcorpus = cupy.load(self.fileName)
    if self.CPUPinn == True:
        cupy.cuda.set_allocator(None)
```

Thank you very much for your comments! Getting to know SpeedTorch has allowed me to better understand the interaction between CPU and GPU :)
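For comparison, here is a plain-PyTorch sketch of the same pipeline using only pinned CPU tensors, along the lines of the pinned-tensor comparison discussed in this thread. This is not SpeedTorch; the `Controller` class name and the stand-in for `do_linear_algebra` are hypothetical:

```python
import numpy as np
import torch

class Controller:
    def __init__(self):
        # staging buffers: pinned CPU tensors plus a GPU tensor
        self.in_cpu = torch.zeros(18, 3).pin_memory()
        self.in_gpu = torch.zeros(18, 3, device='cuda')
        self.out_cpu = torch.zeros(6, 3).pin_memory()

    def control_law(self, new_data):  # new_data is an 18x3 np.array
        # copy into the pinned buffer without reallocating it
        self.in_cpu.copy_(torch.from_numpy(new_data))
        # CPU -> GPU from pinned memory
        self.in_gpu.copy_(self.in_cpu, non_blocking=True)
        a, b = self.in_gpu[0], self.in_gpu[1]
        # ... various linear algebra on the GPU ...
        res = torch.zeros(6, 3, device='cuda')  # stand-in for do_linear_algebra(a, b, ...)
        # GPU -> pinned CPU
        self.out_cpu.copy_(res)
        return self.out_cpu.numpy()
```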
So this would have 4 cores?
Yes, but I am also wondering whether, even if you copy the whole tensor, it would still need to go through the indexing operations, and thus see a speedup. This is something I'll need to test.
Yeah, they're talking about cpu-pinned DataGadget
With the new modification, I imagine this would still be using pinned CPU memory. I would be surprised if it wasn't. But I haven't explicitly tested this, so I can't be 100% sure.
I haven't tested this, but I don't believe this would be pinned memory. Hmmm, since the data dimensions are 18x3 and 6x3, I imagine that the indexing will not be too heavy, and there may not be that much of a speedup, particularly when the slices are only a few rows. But perhaps the Facebook engineer in that one link can give better insight. Either way, it's worth a test. If you do, I would love to hear the results.
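If the intent is to keep reusing the original pinned allocation, one untested option (a sketch, not anything the library documents) would be to copy into the existing `CUPYcorpus` buffer instead of rebinding the attribute:

```python
# rebinding the attribute (CUPYcorpus = cp.asarray(...)) allocates a fresh array
# with the default, non-pinned allocator; copying in place reuses the old buffer
self.data_cpu.CUPYcorpus[...] = cp.asarray(new_data)
# or copy straight from the NumPy array on the host:
self.data_cpu.CUPYcorpus.set(new_data)
```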
@Santosh-Gupta Thanks for your reply. Yes, that would be 4 cores in the CPU. I will try to confirm if the
Another approach you might want to consider is using the PyCUDA and Numba indexing kernels, with a similar trick of disguising CPU-pinned tensors as GPU tensors. I didn't have a chance to try this approach.
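A rough, untested sketch of that idea with Numba (the array and kernel names here are hypothetical, not from SpeedTorch): `cuda.mapped_array` allocates pinned host memory that the GPU can address directly, and a small indexing kernel copies the elements you need into a device array.

```python
import numpy as np
from numba import cuda

# pinned, GPU-addressable host buffer (zero-copy), usable like a NumPy array on the host
host_buf = cuda.mapped_array((18, 3), dtype=np.float32)
dev_buf = cuda.device_array((18, 3), dtype=np.float32)

@cuda.jit
def gather(src, dst):
    # each thread copies one element from the mapped host buffer
    i, j = cuda.grid(2)
    if i < dst.shape[0] and j < dst.shape[1]:
        dst[i, j] = src[i, j]

host_buf[:] = np.random.rand(18, 3).astype(np.float32)
gather[(1, 1), (18, 3)](host_buf, dev_buf)  # 1 block of 18x3 threads
cuda.synchronize()
```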
Hello again,
After reviewing your example benchmark script, I was doing some measurements of CPU->GPU->CPU transfer times, comparing a PyTorch CPU pinned tensor versus SpeedTorch's gadgetGPU. My tensors are actually very small compared to your test cases, at most 20x3, but I need to transfer them fast enough to allow me to make other computations at >100 Hz.
So far, if I understood how to correctly use SpeedTorch, it seems that PyTorch's CPU pinned tensors have faster transfer times compared to SpeedTorch's DataGadget CPU-pinned object. See the graph below, where both histograms correspond to CPU->GPU->CPU transfer of a 6x3 matrix. The pink histogram corresponds to PyTorch's CPU pinned tensor and the turquoise one to SpeedTorch's DataGadget operations. (Units: milliseconds)
For my use case, it seems PyTorch's pinned CPU tensor has better performance. I would like to ask if this makes sense in your experience, and what recommendations you could provide for using SpeedTorch to achieve better performance. My use case involves receiving data on the CPU, transferring it to the GPU, performing various linear algebra operations, and finally getting the result back to the CPU. All of this must be performed at 100 Hz minimum. So far I have only achieved 70 Hz and would like to speed up every operation as much as possible.
You can find the code I used to get this graph here; it was run on a Jetson Nano (ARMv8 CPU, NVIDIA Tegra X1 GPU).
Thank you very much!
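For reference, a minimal sketch of timing one such CPU->GPU->CPU round trip with a pinned PyTorch tensor (this is not the linked benchmark; the shapes are illustrative):

```python
import time
import torch

src_cpu = torch.zeros(6, 3).pin_memory()
dst_cpu = torch.zeros(6, 3).pin_memory()
gpu = torch.zeros(6, 3, device='cuda')

times = []
for _ in range(1000):
    t0 = time.perf_counter()
    gpu.copy_(src_cpu, non_blocking=True)  # CPU -> GPU
    dst_cpu.copy_(gpu)                     # GPU -> CPU
    torch.cuda.synchronize()
    times.append((time.perf_counter() - t0) * 1e3)  # milliseconds

print('median round trip: %.3f ms' % sorted(times)[len(times) // 2])
```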