How much memory can I allocate? BruteForce

laterafter · 2011-06-02 (last updated 2012-12-21)
  • laterafter
    2011-06-02

    Hello

    I'm experiencing some problems with my OpenCL application. For two days I've been trying to work out what is wrong, without any success :/

    What I'm trying to achieve is to brute force the password of a WEP packet whose cipher body is 1512 bytes long.
    Every work item needs to run RC4 on this cipher and then CRC32. So for RC4 each work item needs space for the S state table (256 bytes), the keystream (as long as the cipher, 1512 bytes) and the plaintext output (also 1512 bytes).

    For example, say I want to perform the computation 4 times. With the work item count set to 2 I need arrays like these (each work item gets its own slice):

    S          (2 * 256 bytes)
    keystream  (2 * 1512 bytes)
    plaintext  (2 * 1512 bytes)

    Every work item uses only part of these arrays. The work item where get_local_id(0) returns 0 uses the first part of each array; for the second work item get_local_id(0) returns 1 and the second part is used.
    So this way I run 2 computations per pass. Running another pass with offset=2 runs another 2 computations, and of course I can reuse my arrays S, keystream and plaintext.
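
    Roughly, my host code sizes the buffers like this (simplified sketch; the real code has more around it):

    const int SLength = 256;        // RC4 state table per work item
    const int cipherLength = 1512;  // cipher body per work item
    const int workItems = 2;        // work items per pass in this example

    ComputeBuffer<byte> sBuffer = new ComputeBuffer<byte>(
        _context, ComputeMemoryFlags.AllocateHostPointer, SLength * workItems);
    ComputeBuffer<byte> keystreamBuffer = new ComputeBuffer<byte>(
        _context, ComputeMemoryFlags.AllocateHostPointer, cipherLength * workItems);
    ComputeBuffer<byte> plaintextBuffer = new ComputeBuffer<byte>(
        _context, ComputeMemoryFlags.AllocateHostPointer, cipherLength * workItems);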

    The problem is that I get strange values when computing the hash of the ciphered packet data (for the proper password). When I modify the kernel code so that only the proper password is checked (without brute forcing), the values are right.
    I think maybe there is a problem with memory allocation: I allocate too much, my code just doesn't work, and I can't find the problematic line that is the source of the issue.

    Another problem: I have memory leaks in my application. After some minutes of computation (yes, I also think it's too slow) the task manager says my running exe consumes more than 1 GB of memory! What I noticed: a smaller workSize value gives a bigger memory leak.

    How should I set the workSize value? Is this code OK for allocating memory?
    ComputeBuffer<byte> keystreamBuffer = new ComputeBuffer<byte>(_context, ComputeMemoryFlags.AllocateHostPointer, cipherBody.Length * workGroupSize);
    Why do I have memory leaks? Is there any way I can debug this? Basically I need to perform a lot of computations; it's brute force, so even for short passwords that's something like 10 million combinations to check. Even though brute force is not a good "cracking" idea, technically it should work very efficiently on a GPU, I thought.

    Congratulations, you're already here! :)
    I'm waiting for any ideas.

     
  • nythrix
    2011-06-02

    The problem might be with AllocateHostPointer. The OpenCL spec says this about CL_MEM_ALLOC_HOST_PTR (which is the original name):

    > This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.

    I'm not completely sure what this means, i.e. does OpenCL allocate RAM? What happens if you remove that (use ReadWrite instead)?
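
    I.e. keep everything else and just swap the flag, something like (untested):

    ComputeBuffer<byte> keystreamBuffer = new ComputeBuffer<byte>(_context, ComputeMemoryFlags.ReadWrite, cipherBody.Length * workGroupSize);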
    workSize can be set with

    ComputeCommandQueue.Execute(ComputeKernel kernel, long[] globalWorkOffset, long[] globalWorkSize, long[] localWorkSize, ICollection<ComputeEventBase> events)
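
    For example (untested; queue and kernel standing for your ComputeCommandQueue and ComputeKernel): 512 work items starting at offset 0, in groups of 128, no event collection:

    queue.Execute(kernel, new long[] { 0 }, new long[] { 512 }, new long[] { 128 }, null);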
    
     
  • laterafter
    2011-06-02

    1. When asking how I should set the workSize value, I meant how to select the "optimal" value. I know how to set it technically.

    2. I tried ReadWrite (the default flag for clCreateBuffer) but it changed neither the speed problems nor the memory leak.

    3. No matter which flags the compute buffers are created with, does that have any influence on each ComputeCommandQueue.Execute() command? I still don't know how to locate my memory leaks. If my kernel defines a value like "int size = 1", can this be the source of a memory leak? The arrays (one of size 25 and one of size 256) I marked with the __constant qualifier.

    4. The strange thing is that the more computations are done, the bigger the per-second RAM consumption increment gets. At the very beginning the task manager shows my application eating a few KB more each second, but after a few minutes of computation it eats about 1 MB more every second!

    Should I try to rewrite my application in C++ without using Cloo to exclude possible Cloo issues?

     
  • nythrix
    2011-06-02

    Theoretically speaking, the optimal value should be set by the drivers if you don't specify it yourself through queue.Execute. My guess is this value equals the number of compute units in the device. However, don't quote me on this.
    It's easier to use an empty kernel and/or comment out different Cloo calls and observe the changes in RAM consumption than to write a C++ equivalent of your app. That should tell you a lot.
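
    Something along these lines (untested sketch; queue and _context stand for your own objects):

    // run an empty kernel in the same loop your real kernel runs in and
    // watch the task manager: if RAM still grows, the leak is host-side
    string source = "__kernel void noop(void) { }";
    ComputeProgram program = new ComputeProgram(_context, source);
    program.Build(null, null, null, IntPtr.Zero);
    ComputeKernel kernel = program.CreateKernel("noop");

    for (int pass = 0; pass < 10000; pass++)
    {
        // note the null events argument: if you pass a list here instead,
        // it gains one event per call and grows forever unless you clear it
        queue.Execute(kernel, null, new long[] { 512 }, null, null);
    }
    queue.Finish();

    Also make sure any ComputeBuffer you create per pass gets disposed, otherwise the unmanaged memory behind it piles up.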

     
  • laterafter
    2011-06-02

    Yes, but I have really run out of ideas for how to improve my Cloo application. The kernel only runs RC4 and CRC on some data. In the C# code I only prepare the kernel, buffers and kernel arguments, in the standard way I think. So what is going on with my app that it is a) so slow, b) consuming RAM, c) not computing correctly, unlike its CPU equivalent written in C++…

     
  • nythrix
    2011-06-02

    Have you tried running the kernel on the CPU?
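
    In Cloo that's just a different device type when creating the context. Untested, and note the NVIDIA platform exposes no CPU device, so you'd need e.g. the AMD or Intel OpenCL runtime installed:

    // pick the platform that actually exposes a CPU device
    ComputeContext cpuContext = new ComputeContext(
        ComputeDeviceTypes.Cpu,
        new ComputeContextPropertyList(ComputePlatform.Platforms[0]),
        null, IntPtr.Zero);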

     
  • laterafter
    2011-06-02

    No, I haven't. This is actually a pretty simple but good idea. Should I then set workSize to 1?

    Could you also tell me which properties of Cloo devices are the most important? I mean, there are many properties on my platform object (Nvidia Quadro). Surely the information in these objects could be used to set better values like workSize.

    Should I believe what Cloo says about the maximum work size for my device? It says 256, but I'm able to run kernels with a bigger workSize value (though whether those kernels produce the right results is a different matter).

     
  • nythrix
    2011-06-02

    You can set it to 1 although it's not required.

    I haven't really used those properties to modify a running program. I prefer to leave the guess-work to the drivers, so it's hard for me to point out the most important ones. I'd probably go with memory sizes, work group sizes, extension lists and device limits (texture sizes, max allocation limits etc.). All of these should be quite reliable. I don't want to go into much detail, but a bug in the property querying mechanisms would manifest itself all over the place and thus become obvious quickly. You can check the ComputeObject.cs file for more details on these (to be documented) mechanisms:
    http://cloo.git.sourceforge.net/git/gitweb.cgi?p=cloo/cloo;a=blob;f=Cloo/Source/ComputeObject.cs;h=6a149f7dcc04a004f2b87da79f5e119bde1c521b;hb=HEAD
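
    For a quick look you can just dump them (untested; property names from memory, check ComputeDevice.cs if something doesn't compile):

    ComputeDevice device = ComputePlatform.Platforms[0].Devices[0];
    Console.WriteLine("Compute units:       " + device.MaxComputeUnits);
    Console.WriteLine("Max work-group size: " + device.MaxWorkGroupSize);
    Console.WriteLine("Global memory:       " + device.GlobalMemorySize);
    Console.WriteLine("Local memory:        " + device.LocalMemorySize);
    Console.WriteLine("Max allocation:      " + device.MaxMemoryAllocationSize);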

    Which workSize do you mean? Global or local?

     
  • laterafter
    2011-06-02

    Basically I need to perform nearly 1 000 000 computations. For now I just set workSize (the local workSize) to 512, so I need to perform 1 000 000 / 512 passes of Queue.Execute(). With every pass I increase the offset parameter (offset += workSize). Doing this, I slowly move my computations forward.
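
    In code the loop is more or less this (simplified):

    long total = 1000000;   // password candidates to check
    long workSize = 512;

    for (long offset = 0; offset < total; offset += workSize)
    {
        // each pass covers candidates [offset, offset + workSize)
        queue.Execute(kernel, new long[] { offset }, new long[] { workSize }, null, null);
        queue.Finish();
        // ... read back results and check for a hit ...
    }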

    OK, one last thing I should explain: why do I need so many computations? I want to brute force a 5-letter password over the letters a-z. After getting the get_global_id(0) value I compute the current password candidate and then pass it to the RC4 method, which computes the plaintext from the cipher.
    Later I run the CRC method to check whether the guessed password was correct (see the WEP packet structure…).

    My RC4 method uses the arrays I mentioned at the beginning. Every work item uses its own part of these arrays. How? Every time I need to use the S, keystream or plaintext array I add an offset computed as: get_local_id(0)*arraySize.

    So, is it at all possible not to specify workSize and offset for the Queue.Execute() method in this case?

     
  • nythrix
    2011-06-02

    It's a bit hard for me to visualize this since I'm no expert in the field, but I'll try to give a hand anyway. Keep in mind that I might be completely wrong.

    > For now I just set workSize (local workSize) to 512. So I need to perform 1 000 000 / 512 passes of Queue.Execute().

    Umm. That's about 1950 executions. But that's the globalWorkSize, not the local one! Could you confirm this?

    > And with every pass I increase offset parameter (offset+=workSize). Doing this I just moving slowly with my computations forward.

    This makes sense.
    I'm not completely sure how to tackle the rest. I did a quick Wikipedia read but security algorithms have never been my cup of tea.

    The only required *Work* argument is globalWorkSize. Both localWorkSize and globalWorkOffset can be null, in which case the drivers will substitute appropriate values (mostly localWorkSize = ComputeUnits and globalWorkOffset = 0). In your case only localWorkSize would be null.

     
  • nythrix
    2011-06-02

    Sorry. I meant that globalWorkSize = 512.

     
  • laterafter
    2011-06-03

    But how can I ensure that the code in the kernel, or in functions called by the kernel, has independent space to perform its computations?

    My current approach looks like this.
    Example table "S", with SLength cells per work item:

    S[0]           - will be used by the workItem where get_local_id(0)=0
    S[1]           - will be used by the workItem where get_local_id(0)=0
    …
    S[SLength-1]   - will be used by the workItem where get_local_id(0)=0
    S[SLength]     - will be used by the workItem where get_local_id(0)=1
    S[SLength+1]   - will be used by the workItem where get_local_id(0)=1
    …
    S[2*SLength-1] - will be used by the workItem where get_local_id(0)=1

    So there is no "overlapping": each work item has its own space.
    I hope you understand my idea.

     
  • laterafter
    2011-06-04

    I rewrote the code in C++. The problem generally persists (though I think there is no memory leak this time: my program's RAM consumption doesn't grow with every run of clEnqueueNDRangeKernel()).

    I think I'm getting closer to finding the source of the problem. When I run my kernel with workSize=1 everything is fine: after every run of clEnqueueNDRangeKernel() followed by clEnqueueReadBuffer(), the values in the arrays (S, keystream, CRC hash) are right. But when I increase workSize to anything bigger (even 2 is enough), the values in the arrays change after every clEnqueueNDRangeKernel() and clEnqueueReadBuffer() run. WHY? Since the password is the same, the values in the arrays should stay the same!

    What's more, even the values in array "S" change (that's why the final computations are wrong). The values in S depend only on the value of rootPassword (which for testing I set to a fixed value that does not change between runs of clEnqueueNDRangeKernel()). So how can I explain that S doesn't stay the same across runs of clEnqueueNDRangeKernel()?

    This is how I create the buffer and pass the argument to the kernel (N is the per-work-item length of S):

    cl_mem SBuffer = clCreateBuffer(GPUContext, CL_MEM_READ_WRITE,
                                    sizeof(uchar) * workSize * N, NULL, &errcode);
    assert(errcode == CL_SUCCESS);
    clSetKernelArg(OpenCLVectorAdd, 6, sizeof(cl_mem), (void*)&SBuffer);

    uchar *s = new uchar[N * SIZE];
    clEnqueueReadBuffer(GPUCommandQueue, SBuffer, CL_TRUE, 0,
                        SIZE * N * sizeof(uchar), s, 0, NULL, NULL);
    

    In the OpenCL kernel this argument has the type "__global uchar* S" and I compute the offset for every work item using the equation offset = get_local_id(0)*SLength. Since I want to run (for example) workSize=2 work items at the "same time", this offset can be 0*SLength or 1*SLength. One work item works with cells S[0 .. SLength-1] and the second with cells S[SLength .. 2*SLength-1].

    I also checked the code by substituting get_local_id(0) with some arbitrary value, "123", when running the kernel with workSize=1. The values were computed correctly.

    Any ideas what is still wrong with the code?

     
  • laterafter
    2011-06-05

    Problem found. I didn't take into account how the number of work-groups affects the offset calculation for the arrays inside the kernel. When executing my kernel I also didn't explicitly specify the number of work items per work-group, which means the driver/OpenCL implementation takes care of it. Running a kernel with a global_work_size parameter bigger than what the graphics card can hold in a single work-group (see CL_DEVICE_MAX_WORK_GROUP_SIZE) results in more than one work-group.

    In my case there was more than one group, and I hadn't calculated the offsets for the arrays in the kernel accordingly.
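
    So the fix boils down to making the per-item offset unique across all work-groups, not only within one group. Simplified (the real kernel has more arguments):

    __kernel void Crack(__global uchar* S, const int SLength)
    {
        // get_local_id(0) restarts at 0 in every work-group, so with more
        // than one group two work items collided on the same slice of S.
        // This index is unique within one pass, even across groups:
        int idx = get_group_id(0) * get_local_size(0) + get_local_id(0);
        int offset = idx * SLength;

        // ... RC4 works on S[offset] .. S[offset + SLength - 1] ...
    }

    (The buffer then has to be sized global_work_size * SLength per pass. I don't use get_global_id(0) for the array offset because it includes the offset passed to clEnqueueNDRangeKernel(), which I already use for enumerating passwords.)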