Message-ID: <0bcd1dde-4de6-59a9-b6b6-a0e1c8a509e2@jeffunit.com>
Date: Sun, 16 Sep 2018 12:51:54 -0700
From: jeff <jeff@...funit.com>
To: john-users@...ts.openwall.com
Subject: Re: good program for sorting large wordlists

On 9/16/2018 12:28, Albert Veli wrote:
> Hi!
>
> On Tue, Sep 11, 2018 at 5:19 PM JohnyKrekan <krekan@...nykrekan.com> wrote:
>
>> Hello, I would like to ask whether someone has experience with a good tool
>> to sort large text files, with capabilities like those of gnu sort. I am
>> using it to sort wordlists, but when I tried to sort an 11 gb wordlist, it
>> crashed while writing the final output file, after writing around 7 gb of
>> data, and did not delete some temp files.
>
> If you don't succeed with other methods, one thing that has worked for me
> is splitting the wordlist into smaller parts and sorting each one
> individually. Then you can merge the sorted lists together using, for
> instance, mli2 from hashcat-utils. That will put the big list in sorted
> order. But the parts must be sorted first, before merging.
>
> This is only necessary for very large wordlists, like in your case.
>
> PS I think there is a tool in hashcat-utils for splitting too. Don't
> remember the name. Maybe gate.
>
For whatever reason, I have a collection of large, sorted wordlists.
I have nine that are over 10gb, and the biggest one is 123gb.

As mentioned above, I use split to break the files into manageable pieces.
I typically use something like 'split -l 100000000'.
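
For example, run against a hypothetical wordlist.txt, that writes pieces of
100 million lines each, named xaa, xab, and so on:

    split -l 100000000 wordlist.txt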

I then use gnu sort on each piece.
gnu sort will sort files that are quite large, using temp files if needed.
It is still a good idea to have a reasonable amount of physical
memory; my machine has 32 gb.
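
A minimal sketch of that step, assuming the pieces from split are named xaa,
xab, and so on, and that /path/to/scratch is a filesystem with plenty of free
space (adjust the -S buffer size and -T temp directory to your machine):

    for f in x??; do
        LC_ALL=C sort -S 8G -T /path/to/scratch -o "$f.sorted" "$f"
    done

LC_ALL=C makes sort compare raw bytes instead of doing locale-aware collation,
which is much faster; just be sure to use the same locale for the merge step
below, or the sorted pieces will not agree on ordering.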

Then I take the sorted files and merge them using a program I wrote called
multi-merge, which merges one or more sorted files into a single sorted output.
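
If you don't have a tool like that, gnu sort can also do this step by itself
with -m, which merges already-sorted inputs without re-sorting them
(merged.txt is just a placeholder name):

    LC_ALL=C sort -m -o merged.txt x??.sorted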

Then I use uniq on the sorted file, to remove duplicates.
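
For example:

    uniq merged.txt > final.txt

With gnu sort you can instead fold this into the merge by adding -u, as in
'sort -m -u', which keeps only the first of each run of equal lines.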

This process can take a while, but you will end up with a sorted, unique
wordlist.

I also wrote a bunch of other programs to manipulate wordlists. In my
experience, large wordlists often contain quite a bit of junk, such as
really long lines, sometimes 10k to over 100k bytes.
I have programs to truncate long lines, sample lines from big files,
remove non-ascii lines, etc.
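
Standard tools can do rough versions of some of these; for example (the
64-byte cutoff and file names here are just examples), truncating every line
to 64 bytes, or keeping only lines made of printable ascii:

    cut -c -64 in.txt > truncated.txt
    LC_ALL=C grep -v '[^ -~]' in.txt > ascii_only.txt
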
I also use emacs to look at the contents of files. It can edit 
multi-gigabyte files, though it is slow.
Sometimes long lines are many passwords separated by ',' or ';' or some
other separator.
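
When that's the case, tr can break such lines back into one candidate per
line (',' and ';' here are just the separators mentioned above):

    tr ',;' '\n\n' < long_lines.txt > split_lines.txt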

jeff
