|
Message-ID: <20071113023520.GA28999@openwall.com> Date: Tue, 13 Nov 2007 05:35:20 +0300 From: Solar Designer <solar@...nwall.com> To: john-users@...ts.openwall.com Subject: Re: bitslice MD5 On Tue, Nov 13, 2007 at 01:44:51AM +0000, Larry Bonner wrote: > thank you for sharing, i'm sure many, including myself will > study/experiment, and attempt to optimize it further - test concept > with other algorithms such as md4/sha-1 OK. Let me emphasize that this is almost certainly less efficient than straightforward vectorized implementations on CPUs that support 32-bit vector elements (and a reasonable set of operations on such elements, including 32-bit addition). BTW, I just found this presentation: http://www.hyperelliptic.org/SPEED/slides/Osvik_cell-speed.pdf On page 23, it has a very brief mention that bitslice MD5 "is slow" on Cell SPUs: "128 adds require 94 instructions." To me, 94 instructions is not a lot, but perhaps a straightforward 4-way SIMD implementation is more efficient. > you say that it would be possible to omit the bit rotations, Yes, and I've almost omitted them in the second revision of the code that I've posted - now it's up to the compiler to omit them in the resulting machine code (but it'd have to unroll and inline everything, which might result in code not fitting in L1 I-cache). > would it also be possible to pre-compute constant additions and store > these in memory? rather than process at runtime. You must be confused. There are no constant additions. Rather, there are additions of 32-bit "scalar" constants to "vector" variables. Clearly, you can't fully pre-compute them, but you (or the compiler) can omit the branches and dead code. Specifically, in add32c() we have: if (c & 1) { ... } else { ... } c >>= 1; Once this is fully inlined and unrolled, you'd leave one of the two code paths for every iteration, dropping the other (not taken) code path. You'd also drop the right shift (in fact, you'd drop "c" entirely). For a simpler optimization (that won't achieve a lot), you can notice that there's never a carry on the very first iteration of add32*(), so you can implement that iteration separately (with slightly simpler code). Maybe the compiler already does that. Similarly, you can split the loop in add32c() in two, exiting from the first loop when "c" is zero (instead of when "i" is zero) and not bothering with "c" in the second loop (for the last few remaining bits of other operands). I've already suggested combining all three add32*() into one function. However, a simpler optimization to start with could be combining the four MD5 functions with add32c(). You'd get four instances of add32c() (can use cpp macros to keep the source file small), each of which will run slightly faster than f() followed by add32c() does. I'm sure there are many other possible optimizations, including architecture-specific ones (e.g., use bitwise operations other than those directly available from C code). Finally, you would not need to implement the entire MD5 compression function. For example, the code in JtR's MD5_std.c: MD5_body() assumes that the last 32-bit word in the input block is always zero. When cracking a single instance of MD5 rather than MD5-based crypt(3), even more assumptions can be made (for reasonably short passwords), and not all rounds of the MD5 compression function need to be computed (some can be pre-computed and some reversed). But these optimizations are not in any way specific to bitslice implementations. Good luck! -- Alexander Peslyak <solar at openwall.com> GPG key ID: 5B341F15 fp: B3FB 63F4 D7A3 BCCC 6F6E FC55 A2FC 027C 5B34 1F15 http://www.openwall.com - bringing security into open computing environments -- To unsubscribe, e-mail john-users-unsubscribe@...ts.openwall.com and reply to the automated confirmation request that will be sent to you.
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.