Message-ID: <20100630213807.GA17386@openwall.com>
Date: Thu, 1 Jul 2010 01:38:07 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: OpenMP vs lots of cores

On Wed, Jun 30, 2010 at 09:55:42AM +0200, Magnum, P.I. wrote:
> I haven't had the time to try the OpenMP stuff but I sure will.

Please do.

> How far will it scale in the current versions? I mean, what if you had
> 96 cores on a single host? It won't use all of them, right?

It should. I am sometimes hard-coding the number 96 (or 192) because it
is a multiple of 32 and 3 - so it'll work reasonably well for 2, 3, 4,
6, 8, 12, 16, 24, or 32 logical CPUs. This was in fact tested for 32
logical CPUs and 32 threads here:

http://www.openwall.com/lists/john-users/2010/05/16/2

It scaled perfectly (considering that the CPU was quad-core, but
capable of running 8 threads per core simultaneously). It should
similarly scale to 32 distinct CPU cores. Maybe I should be using 192
to accommodate 8-core versions of UltraSPARC T2 (64 logical CPUs per
chip).

Of the hashes currently parallelized with OpenMP (including with recent
patches), DES-based crypt(3) does not have its key setup parallelized.
This means that it won't scale well for the single-salt case - in fact,
for that special case there's only a significant benefit when going to
2 or 3 threads. But attacking a single salt is not the case I primarily
care about. As you have seen, for the multi-salt case 8 threads work
well, and more should too.

In general, only some parts get parallelized, whereas others remain
sequential. When the hash type, input file, and/or invocation options
are such that much time is spent in the sequential parts, the partial
parallelization won't provide its "full" benefit. Luckily, for slow
hash types - such as bcrypt - only a negligible portion of time is
spent on generating new candidate passwords to try, etc. - and even
going to 96 threads for the parallel parts does not change this
picture. So you may expect to get almost a 96x speedup on a 96-core
system vs. a single-thread run of the same build. That's under no
unrelated load. The speedup vs. a non-OpenMP build will usually be
less, because thread safety of the code usually has its cost - on the
order of 10%.

For not-so-slow hash types, but with multiple salts, the same effect is
achieved due to the salts - e.g., in my test runs (that I posted the
output of in here) there were around 1450 different salts, which means
that only one candidate password needed to be generated (and "set up"
for use by the parallelized hashing code) per 1450 hashes computed. So
the candidate password generation and key setup cost was negligible,
even if not parallelized and even with a lot of threads handling the
parallelized hashing part.
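To make the structure concrete, here is a simplified sketch of the
multi-salt loop - an illustration only, not the actual JtR cracker
loop, and set_key()/crypt_one() below are stand-in stubs rather than
the real format methods:

    #include <omp.h>

    #define NKEYS  96     /* candidates per batch - the "thread slots" */
    #define NSALTS 1450   /* roughly the salt count from my test runs */

    static void set_key(int i) { (void)i; /* per-candidate key setup */ }
    static void crypt_one(int i, int s) { (void)i; (void)s; /* one hash */ }

    void crack_batch(void)
    {
        int i, s;

        for (i = 0; i < NKEYS; i++)
            set_key(i);              /* sequential: NKEYS calls per batch */

        for (s = 0; s < NSALTS; s++) {
    #pragma omp parallel for
            for (i = 0; i < NKEYS; i++)
                crypt_one(i, s);     /* parallel: NKEYS * NSALTS calls */
        }
        /* comparisons against the loaded hashes omitted */
    }

With a single salt, the same per-candidate setup is instead paid once
per hash computed, which is why DES-based crypt(3) currently tops out
at 2 or 3 threads in that case.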
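To put rough numbers on the "almost 96x" expectation, this is just
Amdahl's law: speedup = 1 / (s + (1 - s) / n) for serial fraction s and
n threads. A toy calculation (the serial fractions here are made-up
illustrative values, not measurements):

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / (s + (1 - s) / n) */
    static double speedup(double s, int n)
    {
        return 1.0 / (s + (1.0 - s) / n);
    }

    int main(void)
    {
        printf("%.1fx\n", speedup(0.0001, 96)); /* 0.01% serial: 95.1x */
        printf("%.1fx\n", speedup(0.01, 96));   /* 1% serial:    49.2x */
        return 0;
    }

Even 1% of time spent in the sequential parts cuts the ideal 96x
roughly in half, which is why keeping the non-parallelized portions
negligible matters so much.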
For fast and saltless hashes, things are a lot worse. Key setup (if
applicable) will need to be parallelized as well (such as for LM
hashes), and even then scalability will be very limited as long as
candidate passwords are only generated by a single thread. My
expectation is that with some effort this could be reasonable on 4- or
8-core, but not far beyond that. To get further, multiple candidate
password streams will have to be used (similar to what the MPI patch
does). This is one of the reasons why I am focusing on slow and/or
salted hashes now, for which good scalability with the OpenMP approach
may be achieved.

> Or would it, given enough what? Hashes? Salts?

It would use all the cores it has "thread slots" for (96 or 192 in the
current code). It needs enough candidate passwords for that, but
you'll almost always have those (or you won't care about speeding
things up). It doesn't strictly need many salts - it will use all
cores anyway - but the "overhead" of non-parallelized parts is lower
with more salts.

> I suppose even if it can use all of them,
> the overhead will at some point make it ineffective,

In some cases, yes. I've provided some examples above. There are also
cases (already supported) that should scale to 96+ cores well.

> perhaps even just slow it down.

It is possible to come up with such examples. For example, the current
LM hash code actually often becomes slower with more threads running -
but then it was not yet parallelized properly (and I am unsure whether
I will do that, especially with OpenMP in particular). For LM hashes,
key setup is the costly operation, whereas the current code only
parallelizes the actual DES encryptions, which are relatively cheap.
So that's more of an implementation shortcoming (I just did not care
about those hashes yet - I parallelized the bitslice DES code with
crypt(3) hashes in mind, not LM). I imagine that examples not relying
on shortcomings of a specific implementation could be identified as
well.

> Obviously though, MPI and MP could be used together. Say we have 96
> cores and fire off 12 MPI nodes using 8 MP threads each. Naturally some
> testing would be needed to find the best mix.

I'd expect it to make sense to use different approaches at different
levels - e.g., use the MPI patch (or whatever will hopefully replace
it) to distribute the workload across multiple hosts, but JtR's OpenMP
support (or whatever will replace it) within individual hosts. The
rationale for using a mix like this would be to manage fewer nodes
from the console. It'd be painful to explicitly treat a single
UltraSPARC T2 CPU as 32 or 64 nodes.

Alexander
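P.S. For the 96 cores example, the hybrid layering would look roughly
like this - a minimal sketch of the standard MPI-plus-OpenMP nesting
only, not of how the MPI patch actually splits the work:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nodes;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nodes);

        /* one MPI process per host takes its share of the candidates... */
    #pragma omp parallel
        {
            /* ...and OpenMP threads share the hashing within that host */
            printf("node %d of %d, thread %d of %d\n", rank, nodes,
                omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

Built with mpicc -fopenmp and started with something like
"OMP_NUM_THREADS=8 mpirun -np 12 ...", that gives the 12 nodes times
8 threads mix you describe.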