john-users - Re: C compiler generated SSE2 code


Message-ID: <20100520022514.GA18335@openwall.com>
Date: Thu, 20 May 2010 06:25:14 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: C compiler generated SSE2 code

On Wed, May 19, 2010 at 10:40:21AM +0200, bartavelle@...quise.net wrote:
> I ran benchs with your code using gcc 4.3.2, 4.5.0, clang2 and icc, with 
> various settings (DES_BS_VECTOR 2,4, VECTOR34 on or off).
> 
> Code is slightly faster with gcc-4.5.0 than with vanilla,

You mean, it is slightly faster than x86-64.S from JtR 1.7.5?  Yes, this
matches my experience (on a Core i7).  Before I tried this, I expected
that such a speedup could be possible due to S-box instructions getting
mixed with "outside" ones, which would avoid some data dependency stalls
in the S-boxes.  (This was too cumbersome to address with mere cpp
macros in the .S file, so I knowingly omitted this optimization.)
However, from a brief look at the gcc-generated code, I think the
speedup is actually due to instruction scheduling more suitable for the
Core 2 family.  The code in x86-64.S was "brute-scheduled" for early
2006's AMD CPUs and Intel's P4 Xeon Nocona.  Core 2 was just about to
appear.  Maybe I could re-schedule the code now, but my old "selfopt"
program fails to do it on a Core 2'ish CPU (I've tried just one so far)
because the TSC granularity turned out to be as bad as 10 cycles - not
enough precision for optimizing individual S-boxes.  So a different
approach would need to be used now (maybe multiple instances of the
code such that the number of clock cycles is much greater).  Anyway,
since the S-box expressions are about to be replaced and since I am
considering the use of intrinsics (especially along with OpenMP), I am
not spending time on that yet.
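
To illustrate why running more code between two timer reads helps when the
timer is coarse, here is a minimal sketch. It uses a loop rather than multiple
unrolled instances, and everything in it is hypothetical: the names
ticks_per_call, sim_coarse_timer and sim_sbox are made up for this example,
the timer is a simulation that only advances in steps of 10, and the "S-box"
is a stand-in that pretends to cost 3 cycles. This is not code from "selfopt".

```c
#include <stdint.h>

typedef uint64_t (*timer_fn)(void);
typedef void (*work_fn)(void);

/*
 * Average timer ticks per call of work(), measured across `reps`
 * back-to-back calls.  The timer's rounding error is paid once per
 * measurement, so its share per call shrinks as reps grows.
 * reps must be greater than 0.
 */
static double ticks_per_call(timer_fn now, work_fn work, unsigned reps)
{
    uint64_t start = now();
    for (unsigned i = 0; i < reps; i++)
        work();
    uint64_t end = now();
    return (double)(end - start) / reps;
}

/* --- Simulated environment (assumption, for illustration only) --- */

static uint64_t sim_cycles; /* the "true" cycle count */

/* A timer that only ever reports multiples of 10 cycles. */
static uint64_t sim_coarse_timer(void)
{
    return sim_cycles - sim_cycles % 10;
}

/* Stand-in for one S-box; pretends to take exactly 3 cycles. */
static void sim_sbox(void)
{
    sim_cycles += 3;
}
```

With sim_cycles reset to 0, a single call is invisible to the coarse timer
(ticks_per_call(sim_coarse_timer, sim_sbox, 1) gives 0), whereas 1000 calls
recover the per-call cost (3000 / 1000 = 3).  On real hardware the timer
argument would be a wrapper around an actual TSC read, and loop overhead and
out-of-order execution would also have to be accounted for.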

Of the weird vector sizes (beyond 128 bits), I was getting a slight
speedup with 192-bit vectors as SSE2+native (but not SSE2+MMX) on
P4 Xeon Nocona.  So far this is the only exception - others became
slower.  But I did not do enough testing on AMD CPUs.  If you find that
certain "mixed" vectors perform better than plain SSE2 ones on certain
CPUs, please let me know.

On Wed, May 19, 2010 at 03:03:04PM +0200, bartavelle@...quise.net wrote:
> As for the ICC compiled code, it seems that for index 0-15 everything is
> computed right, and it fails between 16 and 127.
>
> A working fix is to replace the line :
>
> typedef __m128i vtype;
>
> by
>
> typedef unsigned long long vtype __attribute__ ((vector_size (16)));
>
> (int doesn't work for some reason)

I think a better fix would be to adjust the initializer for "ones" to
match your compiler's representation of __m128i.  From the above, my
guess is that __m128i is 16 x 8-bit with icc, so the initializer only
sets two bytes to all 1's and the rest to all 0's.

Alternatively, the initializer may be dropped and a memset() introduced
into the code - to set "ones" to all 1's in a portable way.
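
A minimal sketch of the memset() approach, assuming "ones" is a file-scope
variable of type vtype.  The helper names init_ones and all_bits_set are
hypothetical, and the non-SSE2 fallback typedef is there only so that the
example builds on other targets; it is not a description of the actual JtR
source.

```c
#include <assert.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i vtype;
#else
/* Assumption: a plain 16-byte stand-in when SSE2 is unavailable. */
typedef struct { unsigned char b[16]; } vtype;
#endif

/* No initializer: nothing here depends on how the compiler lays out vtype. */
static vtype ones;

/* Set every bit of "ones", regardless of the element width of vtype. */
static void init_ones(void)
{
    assert(sizeof(vtype) == 16);
    memset(&ones, 0xFF, sizeof(ones));
}

/* Return 1 if every byte of *v is 0xFF, else 0. */
static int all_bits_set(const vtype *v)
{
    unsigned char bytes[sizeof(*v)];
    size_t i;

    memcpy(bytes, v, sizeof(bytes));
    for (i = 0; i < sizeof(bytes); i++)
        if (bytes[i] != 0xFF)
            return 0;
    return 1;
}
```

init_ones() would have to be called once before the first use of "ones";
all_bits_set(&ones) can then serve as a sanity check that does not care
whether the compiler treats __m128i as 2 x 64-bit or 16 x 8-bit.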

Thanks,

Alexander
