Message-ID: <20180912114951.GA5022@openwall.com>
Date: Wed, 12 Sep 2018 13:49:51 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: good program for sorting large wordlists

On Wed, Sep 12, 2018 at 01:14:22PM +0200, JohnyKrekan wrote:
> Thanks for the info. After I raised the memory size and the temp space, 
> the sort went well. I was sorting it to find out how many duplicates (when 
> ignoring character case) are in the superwpa wordlist. The original 
> file size was approx. 10.7 GB; after sorting it was 7.05 GB, so about 4 GB 
> was taken by the same words with modified character case.

This is a case where you don't need to sort.  To count exact duplicates,
you could use:

./unique -v output.lst < input.lst

or, to ignore character case, e.g.:

tr 'A-Z' 'a-z' < input.lst | ./unique -v output.lst

Testing this on JtR's bundled password.lst:

$ tr 'A-Z' 'a-z' < password.lst | ./unique output.lst
Total lines read 3559 Unique lines written 3422
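
If "unique" isn't at hand, the same count-without-sorting idea can be sketched with tr and awk.  This is only an illustration, not how "unique" is implemented: the function name count_case_dups is made up here, and awk keeps every distinct line in memory, so it suits small or medium wordlists rather than a 10 GB one.

```shell
# Hypothetical helper (name is an assumption, not part of JtR):
# reads a wordlist on stdin, lowercases ASCII letters, and prints
# "total unique duplicates" without sorting.
count_case_dups() {
    tr 'A-Z' 'a-z' | awk '
        { total++ }                 # every line read
        !seen[$0]++ { uniq++ }      # first time this line is seen
        END { printf "%d %d %d\n", total, uniq, total - uniq }
    '
}

# Usage: four input lines, two distinct words once case is ignored.
printf 'Password\npassword\nPASSWORD\nletmein\n' | count_case_dups
```

The three numbers printed are lines read, unique lines, and duplicates dropped.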

If you're interested in sizes in bytes as well, use "ls -l" or "wc -c"
on the two files.
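
As a rough sketch of that size comparison, something like the following would print the number of bytes removed.  The function name bytes_saved and the temporary file names are assumptions for illustration only.

```shell
# Hypothetical helper: bytes removed by deduplication.
# $1 = original wordlist, $2 = deduplicated wordlist.
bytes_saved() {
    before=$(wc -c < "$1")
    after=$(wc -c < "$2")
    echo "$((before - after))"
}

# Tiny self-contained demo using temporary files.
tmpdir=$(mktemp -d)
printf 'password\nPassword\nletmein\n' > "$tmpdir/input.lst"
printf 'password\nletmein\n' > "$tmpdir/output.lst"
bytes_saved "$tmpdir/input.lst" "$tmpdir/output.lst"
rm -r "$tmpdir"
```

With real files you would call it as "bytes_saved input.lst output.lst".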

For tiny wordlists like password.lst, "sort -u" is more convenient in
that it can output to a pipe, so you can do:

$ tr 'A-Z' 'a-z' < password.lst | sort -u | wc -l 
3422

But for large wordlists "sort" may be slower, even with the "-S" and
"--parallel" options.
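
For reference, a tuned "sort" invocation might look like the sketch below.  It assumes GNU sort (the "-S" and "--parallel" options are GNU extensions); the 64M buffer, the 2 threads, and the function name sort_unique_lower are illustrative values, not recommendations.  LC_ALL=C makes sort compare raw bytes, which usually helps speed.

```shell
# Hypothetical wrapper; assumes GNU sort. Buffer size and thread
# count are placeholders to be raised for large wordlists.
sort_unique_lower() {
    tr 'A-Z' 'a-z' | LC_ALL=C sort -u -S 64M --parallel=2
}

# Usage: count case-insensitive unique lines.
printf 'b\nA\na\nB\n' | sort_unique_lower | wc -l
```

GNU sort also accepts "-T" to point its temporary files at a directory with enough free space.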

Alexander
