Message-ID: <20111123155924.GA4554@openwall.com>
Date: Wed, 23 Nov 2011 19:59:24 +0400
From: Solar Designer <solar@...nwall.com>
To: announce@...ts.openwall.com, john-users@...ts.openwall.com
Subject: John the Ripper 1.7.9

Hi,

I've released John the Ripper 1.7.9 today.  Please download it from the
usual location:

http://www.openwall.com/john/

(A -jumbo based on 1.7.9 will be available a bit later.)

This release completes the DES speedup work sponsored by Rapid7:

http://www.openwall.com/lists/announce/2011/06/22/1

Most importantly, the functionality of the -omp-des* patches has been
reimplemented in the main source code tree, improving upon the best
properties of the -omp-des-4 and -omp-des-7 patches at once.  Thus,
there are no longer any -omp-des* patches for 1.7.9.

I'd like to thank Nicholas J. Kain for his help in figuring out the
cause of a performance regression with the -omp-des-7 patch, which
helped me avoid this issue in the reimplementation.  I would also like
to thank magnum and Anatoly Pugachev for their help in testing 1.7.8.x
development versions.

The new speeds on Core i7-2600K 3.4 GHz (actually 3.5 GHz due to Turbo
Boost) are:

1 thread:

Benchmarking: Traditional DES [128/128 BS AVX-16]... DONE
Many salts:     5802K c/s real, 5861K c/s virtual
Only one salt:  5491K c/s real, 5546K c/s virtual

8 threads (on 4 physical cores):

Benchmarking: Traditional DES [128/128 BS AVX-16]... DONE
Many salts:     22773K c/s real, 2843K c/s virtual
Only one salt:  18284K c/s real, 2291K c/s virtual

1 thread:

Benchmarking: LM DES [128/128 BS AVX-16]... DONE
Raw:    71238K c/s real, 71238K c/s virtual

4 threads:

Benchmarking: LM DES [128/128 BS AVX-16]... DONE
Raw:    108199K c/s real, 27117K c/s virtual

DES-based crypt(3) scales pretty well, whereas LM is too fast for that -
but we get decent speeds anyway.  I'll include more benchmarks below
(including for a 64-way machine).

There are many other enhancements in 1.7.9 as well.
Here's a summary from doc/CHANGES, with additional comments for this
announcement (I put those in parentheses):

* Added optional parallelization of the MD5-based crypt(3) code with
OpenMP.  (Yes, this is similar to the change introduced in 1.7.8-jumbo-8
by magnum, but it's also different, and both changes will co-exist in a
-jumbo rebased on 1.7.9.  1.7.8-jumbo-8's MD5-crypt code provides better
speed on typical x86/SSE2 machines, whereas 1.7.9's is more portable.)

* Added optional parallelization of the bitslice DES code with OpenMP.
(This is what I started this message with.)

* Replaced the bitslice DES key setup algorithm with a faster one, which
significantly improves performance at LM hashes, as well as at DES-based
crypt(3) hashes when there's just one salt (or very few salts).  (This
is the 1.7.8-fast-des-key-setup-3 patch reimplemented in a portable
fashion, as well as optimized for specific architectures, including with
assembly code for x86-64/SSE2, x86/SSE2, and x86/MMX.  The patch is no
longer needed.)

* Optimized the DES S-box x86-64 (16-register SSE2) assembly code.
(This achieves about a 3% speedup at bitslice DES on Core 2'ish CPUs.)

* Added support for 10-character DES-based tripcodes (not optimized
yet).  (This was originally a proof-of-concept patch I posted in
response to a message on john-users, then it made its way into -jumbo,
and now into the main tree.  Optimizing it is a next step.)

* Added support for the "$2y$" prefix of bcrypt hashes.  (These are
treated the same as "$2a$" in JtR.)

* Added two more hash table sizes (16M and 128M entries) for faster
processing of very large numbers of hashes per salt (over 1M).  (John
may now appear to waste memory when you load over a million hashes - it
may even trade an extra gigabyte for a very slight speedup.  The
rationale is that computers do have gigabytes of RAM these days, and
we'd better put that RAM to use.  If you need to load millions of
hashes, yet don't want to let John trade RAM for speed like this, use
"--save-memory=2".)

* Added two pre-defined external mode variables: "abort" and "status",
which let an external mode request the current cracking session to be
aborted or the status line to be displayed, respectively.  (There are
usage examples in the default john.conf included in 1.7.9.)

* Made some minor optimizations to external mode function calls and the
virtual machine implementation.  (Just slightly faster external mode
processing.)

* The "--make-charset" option now uses floating-point rather than
64-bit integer operations, which allows for larger CHARSET_* settings
in params.h.  (This addresses the common request where people want
incremental mode to use a larger character set and/or password length.
This is now easier to do: the CHARSET_* settings may be adjusted almost
arbitrarily - e.g., the full 8-bit character set and lengths up to 16
(and even more) may be enabled at once.  A rebuild of John and
regeneration of .chr files are still needed after such changes, though.)

* Added runtime detection of Intel AVX and AMD XOP instruction set
extensions, with optional fallback to an alternate program binary.
(Previously, when an -avx or -xop build was run on a CPU not supporting
these instruction set extensions, or under an operating system not
saving/restoring the relevant registers on context switches, the
program would crash.  Now it prints a nice "Sorry ..." message, or it
can even transparently invoke a fallback binary.  The latter
functionality is made use of in john.spec for the RPM package of John
in Owl: http://cvsweb.openwall.com/cgi/cvsweb.cgi/Owl/packages/john/ )

* In OpenMP-enabled builds, added support for fallback to a non-OpenMP
build when the requested thread count is 1.  (OpenMP-enabled builds are
often suboptimal when running just one thread, which they sometimes
have to do, e.g. because the system actually has only one logical CPU.
Now a binary package of John, such as Owl's, is able to make such
builds transparently invoke a non-OpenMP build for slightly better
performance.  This is in fact currently made use of in john.spec on
Owl, available at the URL above.)

* Added relbench, a Perl script to compare two "john --test" benchmark
runs, such as for different machines, "make" targets, C compilers,
optimization options, and/or versions of John the Ripper.  (This was
introduced in 1.7.8-jumbo-8 and announced in detail previously:
http://www.openwall.com/lists/announce/2011/11/09/1
1.7.9 includes a slightly newer revision of the script, correcting an
issue reported by JimF, and the script is now documented in
doc/OPTIONS, as well as in a lengthy comment in the script itself.)

* Additional public lists of "top N passwords" have been merged into
the bundled common passwords list, and some insufficiently common
passwords were removed from the list.  (Most importantly, the RockYou
top 1000 and Gawker top 250 lists were used to make John's password.lst
hopefully more suitable for website passwords, while also keeping it
suitable for operating system passwords.  Common passwords that rank
high on multiple lists are listed closer to the beginning of
password.lst, whereas some passwords that were seen on operating system
accounts but turned out to be extremely uncommon on websites have been
moved to the end of the list.)

* Many minor enhancements and a few bug fixes were made.  (Obviously,
I can't and shouldn't document every individual change here.)

Now some more benchmarks.  Core i7-2600K, stock clock rate and Turbo
Boost settings (so should be 3.5 GHz here), 8 threads:

Benchmarking: Traditional DES [128/128 BS AVX-16]... DONE
Many salts:     22773K c/s real, 2843K c/s virtual
Only one salt:  18284K c/s real, 2291K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS AVX-16]...
DONE
Many salts:     741376 c/s real, 93020 c/s virtual
Only one salt:  626566 c/s real, 79104 c/s virtual

Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    66914 c/s real, 8343 c/s virtual

Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    4800 c/s real, 606 c/s virtual

Benchmarking: LM DES [128/128 BS AVX-16]... DONE
Raw:    88834K c/s real, 11146K c/s virtual

(4 threads was faster for LM here.)

Dual Xeon E5420 (8 cores total, running 8 threads), 2.5 GHz:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     20334K c/s real, 2546K c/s virtual
Only one salt:  15499K c/s real, 1936K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     654869 c/s real, 82001 c/s virtual
Only one salt:  558284 c/s real, 69785 c/s virtual

Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    85844 c/s real, 10727 c/s virtual

Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    5135 c/s real, 642 c/s virtual

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    54027K c/s real, 6753K c/s virtual

For comparison, a non-OpenMP build on the same machine (using one core):

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     2787K c/s real, 2787K c/s virtual
Only one salt:  2676K c/s real, 2676K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     89472 c/s real, 88586 c/s virtual
Only one salt:  87168 c/s real, 86304 c/s virtual

Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    10768 c/s real, 10768 c/s virtual

Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    658 c/s real, 658 c/s virtual

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    38236K c/s real, 38619K c/s virtual

Comparing these with relbench:

Number of benchmarks:           7
Minimum:                        1.41299 real, 0.17486 virtual
Maximum:                        7.97214 real, 0.99619 virtual
Median:                         7.29602 real, 0.91353 virtual
Median absolute deviation:      0.67612 real, 0.08267 virtual
Geometric mean:                 5.60660 real, 0.70207 virtual
Geometric standard deviation:   1.77219 real, 1.78048 virtual

Excluding LM and single-salt benchmarks:

Number of benchmarks:           4
Minimum:                        7.29602 real, 0.91353 virtual
Maximum:                        7.97214 real, 0.99619 virtual
Median:                         7.55772 real, 0.95035 virtual
Median absolute deviation:      0.25385 real, 0.03054 virtual
Geometric mean:                 7.59208 real, 0.95215 virtual
Geometric standard deviation:   1.03971 real, 1.03654 virtual

So for password security auditing (thus running on many salts at once)
against these four hash types (the crypt(3) varieties), we get a median
and mean speedup of over 7.5x when using John's OpenMP parallelization
on an 8-core machine without SMT.  (It is important to note that the
machine was under no other load, though.  Unfortunately, OpenMP tends
to be very sensitive to other load.)

SPARC Enterprise M8000 server, 8 SPARC64-VII CPUs at 2880 MHz, Sun
Studio 12.2 compiler.  These are quad-core CPUs with 2 threads per
core, so 32 cores and 64 threads total (actually running 64 threads
here):

Benchmarking: Traditional DES [64/64 BS]... DONE
Many salts:     25664K c/s real, 756852 c/s virtual
Only one salt:  11066K c/s real, 728273 c/s virtual

Benchmarking: BSDI DES (x725) [64/64 BS]... DONE
Many salts:     1118K c/s real, 24811 c/s virtual
Only one salt:  694930 c/s real, 24535 c/s virtual

Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    156659 c/s real, 3075 c/s virtual

Benchmarking: OpenBSD Blowfish (x32) [32/64]... DONE
Raw:    9657 c/s real, 242 c/s virtual

Benchmarking: LM DES [64/64 BS]...
DONE
Raw:    16246K c/s real, 5860K c/s virtual

Some of these speeds are impressive, yet they're comparable to those of
a much smaller x86 machine (between 4 and 16 cores, depending on the
test).  Clearly, the lack of wider-than-64-bit vectors on SPARC hurts
bitslice DES speeds, big-endianness is unfriendly to MD5 (and vice
versa), and LM does not scale at all (maybe a result of higher latency
interconnect between the CPUs here, although I am just guessing).  The
Blowfish speed is pretty good, though - a dual Xeon X5650 machine (12
cores, 24 threads total) achieves a similar speed (with 1.7.8 here, but
1.7.9 should be the same at this):

Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    10800 c/s real, 449 c/s virtual

So 32 SPARC cores are similar to 12 x86 cores on this one.  The MD5
speed on the SPARC is roughly twice that of the 8-core x86 machine
benchmarked above, so it could correspond to a 16-core machine - but
recalling that -jumbo has faster MD5 code, which would not run on SPARC
(there's no equivalent to 128-bit SSE2 vectors there), it's not so
great.  To be fair, a lesser speedup could be obtained there with
64-bit VIS vectors, if anyone bothers implementing that.

And, of course, applying OpenMP only to individual relatively low-level
loops is not a very efficient way to parallelize John; I opted for it
for now in part because it allowed me to preserve the usual end-user
behavior of John - almost like when it's running on a single CPU.
Unlike the x86 systems benchmarked above, this same SPARC server would
likely achieve much better combined performance with many individual
non-OpenMP John processes.  The real-to-virtual time ratios are still
significantly below 64, which indicates that there's idle CPU time
left.  Yet I think it is good that the main tree's code is portable,
allowing such beasts to be put to use if they would otherwise be
idle. ;-)  Also, this serves well to get John's code changes tested and
to improve their overall quality for the more common systems.
Some bugs were in fact discovered and fixed prior to the 1.7.9 release
thanks to such testing.  It also makes John reusable as an OpenMP
benchmark.

As usual, any feedback is very welcome.

Alexander