Message-ID: <20110204213651.GA19428@openwall.com>
Date: Sat, 5 Feb 2011 00:36:51 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: 1.7.6-jumbo-11 adds MSCash2 with OpenMP

Hi,

I've just released JtR 1.7.6-jumbo-11:

http://www.openwall.com/john/#contrib

The changes since -jumbo-9 are:

The x86-64-specific NTLM cmp_all() bug has been fixed. The bug was discovered and the patch was proposed by bartavelle (thanks!) The bug could result in some NTLM hashes not getting cracked where they should have been. More info here:

http://www.openwall.com/lists/john-users/2010/12/17/7
http://www.openwall.com/lists/john-users/2010/12/17/9

I enhanced the self-tests such that the NTLM bug above would be detected by them now. This ended up detecting another bug: in -jumbo-11, "md5-gen" fails self-test for the 5th test hash/password when built with -mmx or -sse2 targets (but not when using SSE2 on x86-64). Moreover, after this failed test, the very next "format" being tested results in a segfault. I left this issue without a fix in -jumbo-11, hoping that JimF (the author of the "generic MD5" code in the jumbo patch) will take a look. ;-)

The patch adding support for MSCash2 (Domain Cached Credentials of modern Windows systems) contributed by S3nf (thanks!) has been merged:

http://www.openwall.com/lists/john-users/2010/12/26/1

I made minor changes to the MSCash2 code. First, in -jumbo-10 (which only existed for 5 hours before being moved to historical/) I changed MS_NUM_KEYS in mscash2_fmt.c from 64 to 1. I think the setting of 64 was blindly inherited from mscash_fmt.c, but it made no sense for the slow hash that MSCash2 is, and it resulted in slow benchmarking (could be tens of seconds for MSCash2 alone).

Then, I actually made use of the code's support for handling of multiple passwords at once to introduce optional OpenMP parallelization.
:-) In -jumbo-11, MS_NUM_KEYS is set to 1 in default builds, but it is set to 24 when building with OpenMP support enabled (in the Makefile, as documented in doc/README for the official 1.7.6). Indeed, I also added "#pragma omp parallel" directives and made the necessary adjustments to variable declarations and the code.

Trying to actually run the code with multiple threads uncovered what looked like a minor bug in mscash2_fmt.c: PBKDF2_DCC2(). The line:

	out[16] ^= temp[16];

was probably included in error (processing 17 bytes instead of 16). Removing this line made OpenMP-enabled builds work.

I also improved the code to only process up to the supplied "count" of candidate passwords, not always MS_NUM_KEYS (which was wasteful during self-tests, with some uses by "single crack" mode, and when processing the very last bunch of candidate passwords in any mode).

Finally, I set MIN_KEYS_PER_CRYPT to 1 in all cases, although it'd be better to adjust it from init() to match the actual number of threads, like BF_fmt.c does. This is something to improve in a later revision (patches for this are welcome; please test those with "single crack" mode).

Here are a couple of benchmark results from a Core i7 920 2.67 GHz server under some load. Without OpenMP:

Benchmarking: M$ Cache Hash 2 [Generic 1x]... DONE
Raw:	94.0 c/s real, 94.0 c/s virtual

With OpenMP (8 threads; the CPU is quad-core with SMT):

Benchmarking: M$ Cache Hash 2 [Generic 1x]... DONE
Raw:	362 c/s real, 47.5 c/s virtual

Both builds were made with gcc 4.5.0.

Indeed, this is very far from optimal. Further speedup is possible by re-arranging the data layout such that each thread's temporary and result array elements are further apart from other threads' (not on the same memory page). This can probably be achieved by replacing the many arrays with an array of structs.
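
To illustrate the array-of-structs idea together with the "count"-limited loop and the 16-byte XOR, here is a minimal sketch. It is not the mscash2_fmt.c code: struct slot, union padded_slot, toy_hash(), SLOT_SIZE and ROUNDS are hypothetical names, the 64-byte spacing is an assumed cache line size rather than anything measured, and the toy hash merely stands in for the real PBKDF2 loop.

```c
#include <stddef.h>
#include <string.h>

#define MAX_KEYS   24   /* the OpenMP-build MS_NUM_KEYS value quoted above */
#define DIGEST_LEN 16   /* bytes per result, per the 16-vs-17 remark above */
#define SLOT_SIZE  64   /* assumed cache line size; hypothetical, tune per CPU */
#define ROUNDS     1000 /* arbitrary toy iteration count */

/* Hypothetical per-candidate slot: one candidate's scratch and result sit
 * together instead of living in separate temp[] and out[] arrays. */
struct slot {
	unsigned char temp[DIGEST_LEN];
	unsigned char out[DIGEST_LEN];
};

/* Pad each slot to SLOT_SIZE so adjacent candidates do not share a line. */
union padded_slot {
	struct slot s;
	unsigned char pad[SLOT_SIZE];
};

static _Alignas(SLOT_SIZE) union padded_slot slots[MAX_KEYS];

/* Toy stand-in for the slow hash; NOT PBKDF2, only the loop shape matters. */
static void toy_hash(struct slot *s, int index)
{
	int round, j;

	memset(s->out, 0, DIGEST_LEN);
	for (round = 0; round < ROUNDS; round++) {
		for (j = 0; j < DIGEST_LEN; j++)
			s->temp[j] = (unsigned char)(index * 31 + j * 7 + round);
		/* XOR exactly DIGEST_LEN bytes: indices 0..15, never [16] */
		for (j = 0; j < DIGEST_LEN; j++)
			s->out[j] ^= s->temp[j];
	}
}

/* Process only the first "count" candidates, not always MAX_KEYS. */
static void crypt_all(int count)
{
	int i;

	if (count > MAX_KEYS)
		count = MAX_KEYS;
#ifdef _OPENMP
#pragma omp parallel for
#endif
	for (i = 0; i < count; i++)
		toy_hash(&slots[i].s, i);
}
```

Built without OpenMP, the pragma is compiled out and the loop simply runs sequentially, so the same source serves both kinds of builds. Whether 64 bytes of spacing is enough on a given CPU is something to measure rather than assume.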
Also, the code is currently not optimized and makes no use of SSE2, so the single process performance can probably be improved a lot (and then per-thread performance will improve as well).

While at it, I similarly parallelized the original mscash_fmt.c, because this was just as easy to do. Its MS_NUM_KEYS is now set to 96 (was 64), and MIN_KEYS_PER_CRYPT is set to 1 (the same further enhancement that I described above is desirable here as well). Without OpenMP, I get:

Benchmarking: M$ Cache Hash [Generic 1x]... DONE
Many salts:	15112K c/s real, 15112K c/s virtual
Only one salt:	6043K c/s real, 6104K c/s virtual

With an OpenMP-enabled build and "OMP_NUM_THREADS=3 GOMP_SPINCOUNT=10000", this improves to:

Benchmarking: M$ Cache Hash [Generic 1x]... DONE
Many salts:	30755K c/s real, 10389K c/s virtual
Only one salt:	11575K c/s real, 3858K c/s virtual

Going beyond 3 threads provides almost no further improvement. I blame the thread-unfriendly data layout for this (see above), as well as frequent switches between sequential and parallel execution resulting from this hash type being so fast (even for 96 instances of the hash).

Maybe someone else in here will play with this mscash* stuff further, improving the data layout and making MIN_KEYS_PER_CRYPT (or rather the corresponding struct field) depend on the number of threads.

As usual, feedback is welcome.

Alexander
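
For anyone picking up the MIN_KEYS_PER_CRYPT suggestion, a minimal sketch of deriving that value from the thread count at init() time might look like the following. This is an assumption-laden illustration, not the BF_fmt.c code: struct fmt_params, its two fields and keys_for_threads() are hypothetical stand-ins for the real format struct, and only omp_get_max_threads() is an actual OpenMP call.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define MS_NUM_KEYS 96 /* compile-time buffer size quoted above for mscash */

/* Hypothetical stand-in for the relevant fields of the format's struct. */
struct fmt_params {
	int min_keys_per_crypt;
	int max_keys_per_crypt;
};

static struct fmt_params params = { 1, MS_NUM_KEYS };

/* Clamp the thread count to [1, limit] so the minimum never exceeds
 * the number of candidate slots that were compiled in. */
static int keys_for_threads(int threads, int limit)
{
	if (threads < 1)
		threads = 1;
	if (threads > limit)
		threads = limit;
	return threads;
}

/* Called once at startup: ask for at least one candidate per thread. */
static void init(void)
{
	int threads = 1;

#ifdef _OPENMP
	threads = omp_get_max_threads();
#endif
	params.min_keys_per_crypt = keys_for_threads(threads, MS_NUM_KEYS);
}
```

With OMP_NUM_THREADS=3 this sketch would set the minimum to 3, and non-OpenMP builds keep it at 1. As noted above, any real patch along these lines should be tested with "single crack" mode.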