Message-ID: <0bcd1dde-4de6-59a9-b6b6-a0e1c8a509e2@jeffunit.com>
Date: Sun, 16 Sep 2018 12:51:54 -0700
From: jeff <jeff@...funit.com>
To: john-users@...ts.openwall.com
Subject: Re: good program for sorting large wordlists

On 9/16/2018 12:28, Albert Veli wrote:
> Hi!
>
> On Tue, Sep 11, 2018 at 5:19 PM JohnyKrekan <krekan@...nykrekan.com> wrote:
>
>> Hello, I would like to ask whether someone has experience with good tool
>> to sort large text files with possibilities such as gnu sort. I am using it
>> to sort wordlists but when I tried to sort 11 gb wordlist, it crashed while
>> writing final output file after writing around 7 gb of data and did not
>> delete some temp files.
>
> If you don't succeed with other methods, one thing that has worked for me
> is splitting the wordlist into smaller parts and sorting each one
> individually. Then you can merge the sorted lists together using for
> instance mli2 from hashcat-utils. That will put the big list in sorted
> order. But the parts must be sorted first, before merging.
>
> This is only necessary for very large wordlists. Like in your case.
>
> PS I think there is a tool in hashcat-utils for splitting too. Don't
> remember the name. Maybe gate.
>

For whatever reason, I have a collection of large, sorted wordlists.
I have 9 over 10 GB, and the biggest one is 123 GB.

As mentioned above, I use split to break the files into manageable
pieces. I typically say something like 'split -l 100000000' or so.
I then use GNU sort on each piece. GNU sort will sort files that are
quite large, using temp files if needed. It is still a good idea to
have a reasonable amount of physical memory; my machine has 32 GB.

Then I take the sorted pieces and merge them using a program I wrote
called multi-merge, which merges one or more sorted files.
Then I use uniq on the merged file to remove duplicates.

This process can take a while, but you will end up with a sorted,
unique wordlist.
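
For anyone who wants to see the split / sort / merge / uniq steps in one
place, here is a minimal sketch in Python. It is an illustration under
stated assumptions, not the tools described above: multi-merge is not
shown in this thread, so the standard library's heapq.merge stands in
for it. The function names are made up for the example. Lines are
compared as raw bytes, and the default chunk size is kept far smaller
than 100000000 lines because each chunk is held in memory while it is
sorted.

```python
import heapq
import itertools
import os
import tempfile
from contextlib import ExitStack


def sort_chunks(src_path, tmp_dir, lines_per_chunk):
    """Split src_path into chunks of lines_per_chunk lines and sort each one."""
    chunk_paths = []
    with open(src_path, "rb") as src:
        while True:
            chunk = list(itertools.islice(src, lines_per_chunk))
            if not chunk:
                break
            # Terminate a final unterminated line so merged lines never run together.
            if not chunk[-1].endswith(b"\n"):
                chunk[-1] += b"\n"
            chunk.sort()  # plain byte-wise comparison of whole lines
            path = os.path.join(tmp_dir, "chunk_%05d" % len(chunk_paths))
            with open(path, "wb") as out:
                out.writelines(chunk)
            chunk_paths.append(path)
    return chunk_paths


def merge_unique(chunk_paths, dst_path):
    """k-way merge of sorted chunk files, skipping adjacent duplicates (like uniq)."""
    with ExitStack() as stack:
        files = [stack.enter_context(open(p, "rb")) for p in chunk_paths]
        out = stack.enter_context(open(dst_path, "wb"))
        previous = None
        for line in heapq.merge(*files):
            if line != previous:
                out.write(line)
                previous = line


def sort_unique_wordlist(src_path, dst_path, lines_per_chunk=1_000_000):
    """Hypothetical driver: chunk and sort, then merge and de-duplicate."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        chunks = sort_chunks(src_path, tmp_dir, lines_per_chunk)
        merge_unique(chunks, dst_path)
```

Usage would be something like sort_unique_wordlist("big.txt",
"big.sorted.txt"), where both file names are placeholders. All chunk
files are open at once during the merge, so a very large input with
small chunks can hit the per-process open-file limit, and the temp
directory needs roughly as much free space as the input.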

I also wrote a bunch of other programs to manipulate wordlists.
In my experience, large wordlists often contain quite a bit of junk,
such as really long lines, sometimes 10k to over 100k bytes.
I have programs to truncate long lines, sample lines of big files,
remove non-ASCII lines, etc.
I also use emacs to look at the contents of files. It can edit
multi-gigabyte files, though it is slow.
Sometimes long lines are many passwords separated by ',' or ';' or
some other separator.

jeff
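
The cleanup described above (dropping non-ASCII lines, truncating long
lines, and dealing with lines that are really many passwords joined by
a separator) can be sketched roughly as follows in Python. This is not
any of the programs mentioned in the message. The function name, the
64-byte limit, and the choice to split over-long lines on ',' or ';'
are all assumptions made for illustration.

```python
def clean_lines(lines, max_len=64, separators=(b",", b";")):
    """Yield cleaned candidate words from an iterable of raw byte lines."""
    for raw in lines:
        line = raw.rstrip(b"\r\n")
        # Drop any line containing a non-ASCII byte.
        if not line.isascii():
            continue
        if len(line) > max_len:
            # Assumption: an over-long line may be several passwords joined
            # by a separator, so split on the first separator found in it.
            for sep in separators:
                if sep in line:
                    parts = line.split(sep)
                    break
            else:
                parts = [line]
        else:
            parts = [line]
        for part in parts:
            if part:
                # Truncate whatever is still longer than max_len.
                yield part[:max_len]
```

It takes any iterable of byte strings, so a file opened in binary mode
can be passed in directly and the results written out line by line
before sorting.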