In one of the first posts on this blog I described the encryption used in different versions of Microsoft Office. Naturally, I was curious whether general-purpose computing on video cards (CUDA) could speed up key recovery for MS Office files. The main bottleneck is the RC4 algorithm, which underlies the encryption in almost all Office versions. Reports I had seen on the Internet said that an RC4 implementation on CUDA was, if anything, slower than on a CPU, which suggested that RC4 is a poor fit for an optimized GPU implementation.
Indeed, the key feature of RC4 is its 256-byte permutation table, which is shuffled heavily both during key scheduling and during encryption itself. Since accesses to CUDA’s global and local memory take hundreds of times longer than register accesses, a ‘naive’ GPU implementation really does turn out slower than the CPU one (the pseudo-code of RC4 key scheduling is shown below):
for i from 0 to 255
    S[i] := i
j := 0
for i from 0 to 255
    j := (j + S[i] + key[i mod keylength]) mod 256
    swap values of S[i] and S[j]
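For readers who want to try the key scheduling on a CPU, here is a minimal C sketch of the pseudo-code above, including the swap of S[i] and S[j] that completes each iteration. The names rc4_state, rc4_key_schedule and rc4_next_byte are made up for this illustration and are not taken from any of the programs mentioned in this post; the output step is the standard RC4 one and is included only so the result can be checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only.
   RC4 state: the 256-byte permutation table plus the two stream indices. */
typedef struct {
    uint8_t S[256];
    uint8_t i, j;
} rc4_state;

/* Key scheduling: the two loops from the pseudo-code, including the swap. */
void rc4_key_schedule(rc4_state *st, const uint8_t *key, size_t keylen)
{
    for (int i = 0; i < 256; i++)
        st->S[i] = (uint8_t)i;

    uint8_t j = 0;
    for (int i = 0; i < 256; i++) {
        /* "mod 256" comes for free from the uint8_t wrap-around */
        j = (uint8_t)(j + st->S[i] + key[(size_t)i % keylen]);
        uint8_t tmp = st->S[i];
        st->S[i] = st->S[j];
        st->S[j] = tmp;
    }
    st->i = 0;
    st->j = 0;
}

/* Standard RC4 output step: returns the next keystream byte. */
uint8_t rc4_next_byte(rc4_state *st)
{
    st->i = (uint8_t)(st->i + 1);
    st->j = (uint8_t)(st->j + st->S[st->i]);
    uint8_t tmp = st->S[st->i];
    st->S[st->i] = st->S[st->j];
    st->S[st->j] = tmp;
    return st->S[(uint8_t)(st->S[st->i] + st->S[st->j])];
}
```

With the 3-byte key "Key", the first keystream bytes should be EB 9F 77 81, which matches the widely published RC4 test vector. Note that every step reads and writes the table S, which is exactly why the placement of that table in memory matters so much on a GPU.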
But the way to optimize RC4 is obvious and has occurred to many researchers. As Adrian Boeing writes in his article, ‘The largest performance gain came from placing the RC4 permutation array into the GPU’s shared memory’. Indeed, shared memory is very fast, and using it is a standard recommendation for optimizing CUDA programs. Mr. Boeing obtained a speed-up of about 8 times, and I got similar results:
|CPU (Core 2 Duo, 1.86 GHz, one core)||GPU (NVIDIA GTX 260, 1.35 GHz, 216 SP)|
Table. Key scheduling speed of the RC4 algorithm on CPU and GPU, keys/sec
Of course, the CPU code was also heavily optimized (roughly 900 clock ticks per key). If one instead took a “naive” RC4 implementation (about 2,500 ticks per key) as the baseline, one could claim a speed-up of about 25 times rather than 8, but that would not be a fair comparison.
Thus, with a modern GPU it is possible to find the 40-bit key of an MS Office file (i.e. to crack any password, regardless of its length and complexity) in less than 1 day (this is implemented in the new versions of the GuaWord and GuaExcel programs, which search for the key of MS Word and MS Excel files respectively). For comparison, the first versions of these programs, running on a Pentium II/333, needed about 70 days for a 40-bit key, while the latest version on a Core 2 Duo needs about 8 days on one core and, accordingly, about 4 and 2 days on 2- and 4-core machines. The GPU gives us the speed of an 8-core processor!
So, the RC4 algorithm can be sped up almost 8 times by using the power of graphics cards with CUDA technology, although it lags behind the leaders in speed-up factor, the MD-like hashes.
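The arithmetic behind such estimates can be sketched in a few lines of C. This is only a back-of-the-envelope model: the function names are invented for the illustration, and it assumes a fixed cost per key and counts key scheduling only.

```c
#include <stdint.h>

/* Hypothetical helper: keys tried per second on one core,
   given the clock rate in Hz and the cost of one key in clock ticks. */
double keys_per_second(double clock_hz, double ticks_per_key)
{
    return clock_hz / ticks_per_key;
}

/* Hypothetical helper: days needed to try every key of the given
   bit length at a fixed rate (valid for key_bits < 64). */
double days_for_full_search(unsigned key_bits, double keys_per_sec)
{
    double keyspace = (double)(UINT64_C(1) << key_bits);
    return keyspace / keys_per_sec / 86400.0;  /* 86400 seconds per day */
}
```

Plugging in the figures from the text (1.86 GHz, about 900 ticks per key) gives roughly 6 days for a 40-bit key space, and about 17 days at 2,500 ticks per key. The ratio 2,500 / 900 is about 2.8, which is how an 8-times speed-up turns into one of 20-odd times against the naive baseline. The 8 days quoted above for one core is somewhat larger, presumably because a real key search does more work per key than key scheduling alone; that explanation is an assumption, not a measured breakdown.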
UPD. Another utility that uses GPU acceleration for 40-bit RC4 key search is the new version of the GuaPDF software, which is designed to decrypt PDF files.