Message-ID: <20060511054248.GA27977@openwall.com>
Date: Thu, 11 May 2006 09:42:48 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Performance tuning

A couple of weeks ago, I wrote:

> There's SSE-as-available-in-64-bit-mode.  I have not benchmarked it
> yet, but I really expect no performance difference as long as I use
> the same 8 registers.

I've benchmarked it now.  Confirmed - it behaves exactly the same in
32-bit and 64-bit modes.

> For the other 8, I expect the performance to be either the same or
> even worse.

I've also benchmarked this.  There's no significant performance
difference between the first 8 and the other 8 SSE registers, but the
code size in x86-64 mode is indeed different (the number of micro-ops
to cache might be the same).

> As it relates to the slowdown with SSE on an AMD64 processor:
>
> Benchmarking: Traditional DES [64/64 BS MMX]... DONE
> Many salts:     785664 c/s real, 785664 c/s virtual
> Only one salt:  721472 c/s real, 721472 c/s virtual
>
> Benchmarking: Traditional DES [128/128 BS SSE]... DONE
> Many salts:     573516 c/s real, 573516 c/s virtual
> Only one salt:  537164 c/s real, 537164 c/s virtual

This mystery is now solved, sort of.  There's no such slowdown with
SSE2 instructions.  While SSE and SSE2 bitwise ops have exactly the
same performance on Intel P4s (many different ones I've tried), SSE is
a lot slower than SSE2 on AMD.  My _guess_ is that this has to do with
AMD processors maintaining some floating-point state for the "single
precision floats" that form the vectors with SSE.  I don't know
whether Intel P4s don't do that (after all, it doesn't make sense to
do bitwise ops on actual floats) or whether they manage to do it with
no slowdown - or my guess might be entirely wrong, after all.

The SSE2 benchmark for the same system is:

Benchmarking: Traditional DES [128/128 BS SSE2]... DONE
Many salts:     951193 c/s real, 951193 c/s virtual
Only one salt:  827776 c/s real, 827776 c/s virtual

> If I somehow allocate a substantial amount of my time to further work
> on John, which is not the case currently, these architecture-specific
> optimizations would not be a priority.

Well, I managed to find some hours over the last 3 days - and I
intended to spend those on pushing the SSE code that I already had
into a version I could release.  However, I ended up doing more than
that...

I also experimented with 16-register SSE2 code, both as generated by a
Perl script I wrote and as generated by gcc 4.1.0 (out of a specially
modified C source file).  This does look promising, but so far the
performance is slightly worse than that of the MMX-derived 8-register
SSE2 code currently in 1.7.1.  However, my Perl script does absolutely
no instruction scheduling (rather, I concentrated on optimal register
allocation, on reducing the instruction count, and on avoiding operand
combinations that are not supported).  With a proper instruction
scheduler, this should slightly outperform the current 8-register SSE2
code.  The number of instructions generated per S-box is about 10%
smaller with 16 registers.

As for the SSE2 code generated by gcc 4.1.0, it is surprisingly good -
gcc has improved a lot in this area.  But my special-purpose Perl
script is better. ;-)

-- 
Alexander Peslyak <solar at openwall.com>
GPG key ID: B35D3598  fp: 6429 0D7E F130 C13E C929 6447 73C3 A290 B35D 3598
http://www.openwall.com - bringing security into open computing environments