john-users - Re: Anyone want to benchmark AVX2 code for bcrypt

Message-ID: <20130627031135.GB15136@openwall.com>
Date: Thu, 27 Jun 2013 07:11:35 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com,
	"Sc00bz64@...oo.com" <sc00bz64@...oo.com>
Subject: Re: Anyone want to benchmark AVX2 code for bcrypt

On Wed, Jun 26, 2013 at 09:09:27AM -0700, Sc00bz64@...oo.com wrote:
> So not using AVX2 is faster.

Ouch.

One thing to check, though: is the s[] array in bcryptAVX2() 256-bit
aligned?  It is possible that the stack is only 128-bit aligned.  I'd
try aligning s[] in a more reliable manner, although my guess is that
for gather loads this won't matter.
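
As a rough sketch of one way to check and force the alignment (not the
actual bcryptAVX2() code): the struct name, the array size, and the helper
names below are assumptions made up for illustration.  It uses C11
alignas, which GCC and clang honor for automatic variables by realigning
the stack.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical stand-in for the S-box state; the real layout and size of
 * s[] in bcryptAVX2() may differ.  alignas(32) requests 256-bit alignment. */
typedef struct {
    alignas(32) uint32_t s[8 * 4 * 256];
} bcrypt_state_sketch;

/* The member's alignment requirement propagates to the whole struct. */
static_assert(alignof(bcrypt_state_sketch) == 32, "expected 32-byte alignment");

/* Returns 1 if p sits on a 32-byte (256-bit) boundary, 0 otherwise. */
static int is_256bit_aligned(const void *p)
{
    return ((uintptr_t)p & 31) == 0;
}

/* Places the state on the stack, as a local s[] would be, and reports
 * whether the compiler actually gave it 256-bit alignment. */
static int check_local_alignment(void)
{
    bcrypt_state_sketch st;
    st.s[0] = 0;
    return is_256bit_aligned(st.s);
}
```

Calling is_256bit_aligned() on the existing s[] first would show whether
the current build has the problem at all, before changing any declaration.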

> One reason might be that it runs out of L1 cache (needs more than 32.5 KiB but there's only 32 KiB of L1) and has to hit L2.

You could try interleaving two instances where each would use 128-bit
vectors with _mm_i32gather_epi32(), etc.  This should help hide the
latencies on these loads, including those resulting from occasional L1
cache misses (when one of the two instances is stalled waiting for an L2
cache read, the other can typically proceed further).
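
A minimal sketch of the interleaving idea, under loud assumptions: this is
not Blowfish, only a chain of dependent table lookups with a similar access
pattern, and the table layout (one 256-word table per lane) and all function
names are hypothetical.  It also assumes GCC or clang on x86, since it
relies on the target attribute and __builtin_cpu_supports().

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define LANES 4
#define SBOX_WORDS 256

/* Scalar reference: each lane follows its own dependent lookup chain. */
static void chain_scalar(const uint32_t *s, uint32_t x[LANES], int rounds)
{
    assert(rounds >= 0);
    for (int r = 0; r < rounds; r++)
        for (int l = 0; l < LANES; l++)
            x[l] += s[l * SBOX_WORDS + (x[l] & 0xff)];
}

/* Two independent 4-lane instances (a and b) advanced in the same loop.
 * The gathers for a and b do not depend on each other, so the CPU can
 * overlap their latencies. */
__attribute__((target("avx2")))
static void chain_interleaved(const uint32_t *sa, const uint32_t *sb,
                              uint32_t xa[LANES], uint32_t xb[LANES],
                              int rounds)
{
    const __m128i base = _mm_setr_epi32(0, 256, 512, 768); /* per-lane table */
    const __m128i mask = _mm_set1_epi32(0xff);
    __m128i a = _mm_loadu_si128((const __m128i *)xa);
    __m128i b = _mm_loadu_si128((const __m128i *)xb);

    for (int r = 0; r < rounds; r++) {
        __m128i ia = _mm_add_epi32(_mm_and_si128(a, mask), base);
        __m128i ib = _mm_add_epi32(_mm_and_si128(b, mask), base);
        __m128i ga = _mm_i32gather_epi32((const int *)sa, ia, 4);
        __m128i gb = _mm_i32gather_epi32((const int *)sb, ib, 4);
        a = _mm_add_epi32(a, ga);
        b = _mm_add_epi32(b, gb);
    }
    _mm_storeu_si128((__m128i *)xa, a);
    _mm_storeu_si128((__m128i *)xb, b);
}

/* Returns 1 if the interleaved version matches the scalar reference,
 * 0 on mismatch, and -1 if the CPU lacks AVX2 (nothing is run). */
static int interleave_selftest(void)
{
    static uint32_t sa[LANES * SBOX_WORDS], sb[LANES * SBOX_WORDS];
    uint32_t xa[LANES] = {1, 2, 3, 4}, xb[LANES] = {5, 6, 7, 8};
    uint32_t ra[LANES] = {1, 2, 3, 4}, rb[LANES] = {5, 6, 7, 8};

    if (!__builtin_cpu_supports("avx2"))
        return -1;
    for (uint32_t i = 0; i < LANES * SBOX_WORDS; i++) {
        sa[i] = i * 2654435761u;   /* arbitrary table contents */
        sb[i] = i * 40503u + 1;
    }
    chain_scalar(sa, ra, 16);
    chain_scalar(sb, rb, 16);
    chain_interleaved(sa, sb, xa, xb, 16);
    return memcmp(xa, ra, sizeof xa) == 0 && memcmp(xb, rb, sizeof xb) == 0;
}
```

Whether this actually beats a single 256-bit instance would have to be
measured; the sketch only shows the structure of the two independent chains.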

Alexander
