Message-ID: <20100520022514.GA18335@openwall.com>
Date: Thu, 20 May 2010 06:25:14 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: C compiler generated SSE2 code

On Wed, May 19, 2010 at 10:40:21AM +0200, bartavelle@...quise.net wrote:
> I ran benchs with your code using gcc 4.3.2, 4.5.0, clang2 and icc, with
> various settings (DES_BS_VECTOR 2,4, VECTOR34 on or off).
>
> Code is slightly faster with gcc-4.5.0 than with vanilla,

You mean, it is slightly faster than x86-64.S from JtR 1.7.5?  Yes, this
matches my experience (on a Core i7).

Before I tried this, I expected that such speedup could be possible due
to S-box instructions getting mixed with "outside" ones, which would
avoid some data dependency stalls in the S-boxes.  (This was too
cumbersome to address with mere cpp macros in the .S file, so I
knowingly omitted this optimization.)

However, from a brief look at the gcc-generated code, I think the
speedup is actually due to instruction scheduling more suitable for the
Core 2 family.  The code in x86-64.S was "brute-scheduled" for early
2006's AMD CPUs and Intel's P4 Xeon Nocona.  Core 2 was just about to
appear.

Maybe I could re-schedule the code now, but my old "selfopt" program
fails to do it on a Core 2'ish CPU (I've tried just one so far) because
the TSC granularity turned out to be as bad as 10 cycles - not enough
precision for optimizing individual S-boxes.  So a different approach
would need to be used now (maybe multiple instances of the code such
that the number of clock cycles is much greater).

Anyway, since the S-box expressions are about to be replaced and since I
am considering the use of intrinsics (especially along with OpenMP), I
am not spending time on that yet.

Of the weird vector sizes (beyond 128 bits), I was getting slight
speedup with 192-bit vectors as SSE2+native (but not SSE2+MMX) on P4
Xeon Nocona.  So far this is the only exception - others became slower.
But I did not do enough testing on AMD CPUs.  If you find that certain
"mixed" vectors perform better than plain SSE2 ones on certain CPUs,
please let me know.

On Wed, May 19, 2010 at 03:03:04PM +0200, bartavelle@...quise.net wrote:
> As for the ICC compiled code, it seems that for index 0-15 everything is
> computed right, and it fails between 16 and 127.
>
> A working fix is to replace the line :
>
> typedef __m128i vtype;
>
> by
>
> typedef unsigned long long vtype __attribute__ ((vector_size (16)));
>
> (int doesn't work for some reason)

I think a better fix would be to adjust the initializer for "ones" to
match your compiler's representation of __m128i.  From the above, my
guess is that __m128i is 16 x 8-bit with icc, so the initializer only
sets two bytes to all 1's and the rest to all 0's.

Alternatively, the initializer may be dropped and a memset() introduced
into the code - to set "ones" to all 1's in a portable way.

Thanks,

Alexander
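
[Editor's illustration] The memset() alternative suggested above could
look roughly like the sketch below.  It is not code from JtR.  The vtype
typedef is the one from the quoted workaround (a gcc/clang vector
extension; with SSE2 intrinsics it would be __m128i), "ones" stands in
for the constant discussed in the message, and init_ones() and
all_bits_set() are hypothetical helper names introduced only for this
example.

```c
#include <stddef.h>
#include <string.h>

/* 128-bit vector type, taken from the quoted workaround (gcc/clang
 * vector extension).  With SSE2 intrinsics this would be __m128i. */
typedef unsigned long long vtype __attribute__ ((vector_size (16)));

/* Stand-in for the "ones" constant; deliberately has no initializer. */
vtype ones;

/* Hypothetical helper: set every bit of "ones" without depending on how
 * the compiler lays out vector elements in a brace initializer. */
void init_ones(void)
{
	memset(&ones, 0xff, sizeof(ones));
}

/* Hypothetical helper: return 1 if all bytes of *v are 0xff, else 0. */
int all_bits_set(const vtype *v)
{
	const unsigned char *p = (const unsigned char *)v;
	size_t i;

	for (i = 0; i < sizeof(*v); i++)
		if (p[i] != 0xff)
			return 0;
	return 1;
}
```

Calling init_ones() once before the first use of "ones" would be enough;
all_bits_set() is only there to check the result byte by byte, whatever
element width the compiler assumes for the vector type.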