Public password dumps in ELK

Original text by Marc Smeets

Passwords, passwords, passwords: end users and defenders hate them, attackers love them. Despite the recent focus by defenders on stronger forms of authentication, passwords are still the predominant way to get access to systems. And due to end users’ habit of reusing passwords, and the multitude of public leaks in the last few years, they serve as an important attack vector in the red teamer’s arsenal. Find accounts of target X in the many publicly available dumps, try these passwords or logical iterations of them (Summer2014! might very well be Winter2018! at a later moment) on a webmail or other externally accessible portal, and you may have gained initial access to your target’s systems. Can’t find any accounts of your target in the dump? No worries, your intel and recon may give you private email addresses that may very well share a password with their counterparts at the target.

Major public password dumps

Recently, two major password dumps got leaked publicly: Exploit.In and Leakbase (the latter also goes by the name BreachCompilation). This resulted in many millions of username-password combinations being leaked. The leaks come in the form of multiple text files, neatly indexed in alphabetical order for ‘quick’ lookup. But the lookup remains extremely slow, especially if the index is done on the username instead of the domain name part of the email address. So, I wanted to re-index them, store them in a way that allows for quick lookup, and have the lookup interface easily accessible. I went with ELK as it ticks all the boxes. I’ll explain in this blog how you can do the same. All code can be found on our GitHub.

A few downsides of this approach

Before continuing I want to address a few shortcomings of this approach:

  • Old data: one can argue that many of the accounts in the dumps are many years old and therefore not directly useful. Why go through all the trouble? Well, I’d rather have the knowledge of old passwords and then decide if I want to use them, than not know them at all.
  • Parsing errors: the input files of the dump are not nicely formatted. They contain lots of errors in the form of different encodings, control characters, inconsistent structure, etc. I want to ‘sanitize’ the input to a certain degree to filter out the most commonly used errors in the dumps. But, that introduces the risk that I filter out too much. It’s about finding a balance. But overall, I’m ok with losing some input data.
  • ELK performance: Elasticsearch may not be the best solution for this. A regular SQL DB may actually be better suited to store the data as we generally know how the data is formatted, and could also be faster with lookups. I went with ELK for this project as I wanted some more mileage under my belt with the ELK stack. Also, the performance still is good enough for me.

Overall process flow

Our goal is to search for passwords in our new Kibana web interface. To get there we need to do the following:

  1. Download the public password dumps, or find another way to have the input files on your computer.
  2. Create a new system/virtual machine that will serve as the ELK host.
  3. Setup and configure ELK.
  4. Create scripts to sanitize the input files and import the data.
  5. Do some post import tuning of Elasticsearch and Kibana.

Let’s walk through each step in more detail.

Getting the password dumps

Pointing you to the downloads is not going to work as the links quickly become obsolete, while new links appear. Search and you will find. As said earlier, both the Exploit.In and Leakbase dumps became public recently. But you may also be interested in dumps from ‘Anti Public’, LinkedIn (if you do the cracking yourself) and smaller leaks less broadly discussed in the news. Do note that there is overlap in the data from different dumps: not all dumps have unique data.

Whatever you download, you want to end up with (a collection of) files that have their data stored as username:password. Most of the dumps have it that way by default.

Creating the new system

More is better when talking about hardware. Both in CPU, memory and disk. I went with a virtual machine with 8 cores, 24GB ram and about 1TB of disk. The cores and memory are really important during the importing of the data. The disk space required depends on the size of the dumps you want to store. To give you an idea: storing Leakbase using my scripts requires about 308GB for Elasticsearch, Exploit.In about 160GB.

Operating system wise I went with a rather default Linux server. Small note of convenience here: set up the disk using LVM so you can easily grow the disk if you require more space later on.

Setup and configure ELK

There are many manuals for installation of ELK. I can recommend @Cyb3rWard0g’s HELK project on GitHub. It’s a great way to have the basics up and running in a matter of minutes.

git clone

There are a few things we want to tune that will greatly improve the performance:

  • Disable swap, as that can really kill Elasticsearch’s performance:
    sudo swapoff -a
    and remove any swap mounts in /etc/fstab.
  • Increase the JVM’s memory usage to about 50% of available system memory:
    Check the JVM options files at /etc/elasticsearch/jvm.options and /etc/logstash/jvm.options, and change the values of -Xms and -Xmx to half of your system memory. In my case for Elasticsearch: -Xmx12g -Xms12g. Do note that the Elasticsearch and Logstash JVMs work independently, and therefore their combined values should not surpass your system memory.

We also need to instruct Logstash about how to interpret the data. I’m using the following configuration file:

root@office-elk:~# cat /etc/logstash/conf.d/passworddump2elk.conf
input {
    tcp {
        port  => 3515
        codec => line
    }
}
filter {
    dissect {
        mapping      => { "message" => "%{DumpName} %{Email} %{Password} %{Domain}" }
        remove_field => [ "host", "port" ]
    }
}
output {
    if "_dissectfailure" in [tags] {
        file {
            path  => "/var/log/logstash/logstash-import-failure-%{+YYYY-MM-dd}.log"
            codec => rubydebug
        }
    } else {
        elasticsearch {
            hosts => [ "" ]
            index => "passworddump-%{+YYYY.MM.dd}"
        }
    }
}
As you can see in the dissect filter we expect the data to appear in the following format:

DumpName EmailAddress Password Domain

There is a very good reason why we also want the Domain part of the email address as a separate indexable field: as red teamers you tend to search for accounts tied to specific domains/companies. Searching for <anything>@domainname is an extremely CPU-expensive search to do. So, we spend some more CPU power during the import to have quicker lookups in the end. We also store the DumpName as this might come in handy in some cases. And for troubleshooting we store all lines that didn’t parse correctly, using the ‘if "_dissectfailure" in [tags]’ conditional.

Note: using a dissect vs a grok filter gives us a small performance increase. But if you want you can accomplish the same using grok.

Don’t forget to restart the services for the config to become active.

Sanitize and import the data

The raw data that we get from the dumps is, generally speaking, organized. But it does contain errors, e.g. parts of HTML code as passwords, weird long lines, multitudes of spaces, passwords in languages that we are not interested in, and non-printable control characters.

I’ve chosen to do the following cleaning actions in my script. It’s done with simple cut, grep, awk commands, so you can easily tune to your preference:

  • Remove spaces from the entire line
    This has the risk that you lose passwords that have a space in them. In my testing I’ve concluded that the vast majority of spaces in the input files are from initial parsing or saving errors, and only a tiny fraction could perhaps be from a (smart) user that had a space in the password.
  • Convert to all ASCII
    You may find the most interesting character set usage in the dumps. I’m solely interested in full ASCII. This is a rather bold way to sanitize that works for me. You may want to do differently.
  • Remove non-printable characters
    Again, you may find the most interesting characters in the dumps. During my testing I kept encountering control characters that I had never even heard about. So I decided to remove all non-printable characters altogether. But whatever you decide to change, you really want to get rid of all the control characters.
  • Remove lines without proper email address format
    This turns out to be a rather efficient way of cleaning. Do note that this means that the occasional valid non-email username will also be purged.
  • Remove lines without a colon, empty username or empty password
    In my testing this turned out to be rather effective.
  • Remove really long lines (60+ char)
    You will purge the occasional extremely long email address or password, but in my testing this appeared to be near 0. Especially for corporate email addresses, where most of the time a strict naming policy is in place.
  • Remove Russian email addresses (.ru)
    The amount of .ru email addresses in the dumps, and the rather unlikely case that they may be interesting, made me purge them altogether. This saves import time, lookup time and a considerable amount of disk space.
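My scripts implement these cleaning actions with simple cut, grep and awk commands; purely as an illustration, the same filtering could be sketched in Python. The regex and helper below are my own approximations of the rules above, not the actual script:

```python
import re

# Require 'something@domain.tld:password'; no spaces, colons or @ in odd places.
EMAIL_LINE = re.compile(r'^[^:@\s]+@[^:@\s]+\.[^:@\s]+:.+$')

def sanitize(line):
    """Return a cleaned 'email:password' line, or None if it should be purged."""
    line = line.replace(' ', '')                        # remove all spaces
    line = line.encode('ascii', 'ignore').decode()      # keep ASCII only
    line = ''.join(c for c in line if c.isprintable())  # strip control characters
    if len(line) > 60:                                  # drop really long lines
        return None
    if not EMAIL_LINE.match(line):                      # require email:password
        return None
    email, password = line.split(':', 1)
    if not email or not password:                       # no empty fields
        return None
    if email.lower().endswith('.ru'):                   # purge .ru addresses
        return None
    return line
```

Each check maps one-to-one onto a bullet above, so tuning the balance between data loss and cleanliness means relaxing or tightening individual lines.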

After the purging of irrelevant data, we still need to reformat the data in a way that Logstash can understand. I use an AWK one-liner for this, which is explained in the comments of the script. Finally, we send it to the Logstash daemon, which will do the formatting and send it along to Elasticsearch.

It’s important to note that even though we rather heavily scrutinized the input data earlier on, the AWK reformatting and the Logstash importing can still filter out lines that contain errors.
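The reformatting step itself is small. A hypothetical Python equivalent of the AWK one-liner (the real one lives in the script on GitHub) would turn each sanitized line into the four-field format the dissect filter expects, and stream it to the Logstash TCP input on port 3515:

```python
import socket

def to_logstash_record(line, dump_name):
    """Turn 'email:password' into 'DumpName Email Password Domain',
    mirroring the AWK one-liner described above. Returns None on parse failure."""
    try:
        email, password = line.strip().split(':', 1)
        local, domain = email.rsplit('@', 1)
    except ValueError:
        return None  # malformed lines are still dropped at this stage
    if not local or not domain or not password:
        return None
    return f'{dump_name} {email} {password} {domain}'

def send_records(records, host='localhost', port=3515):
    # Stream records to the Logstash tcp input defined earlier in this post.
    with socket.create_connection((host, port)) as s:
        for r in records:
            if r is not None:
                s.sendall((r + '\n').encode())
```

For example, to_logstash_record('bob@example.com:Winter2018!', 'LeakBase') yields 'LeakBase bob@example.com Winter2018! example.com'.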

Overall, we lose some data that might actually be usable with different sanitizing. When using my scripts on the Leakbase dump, you end up with 1.1 billion records in Elasticsearch while the import data contains roughly 1.4 billion records. For the Exploit.In dump it’s about 600 million out of 800 million. It’s about finding a balance that works for you. I’m looking forward to your pull requests for better cleaning of the import data.

To kick off the whole importing you can run a command like:

for i in $(find ./BreachCompilation/data/*.txt -type f); do ./ $i LeakBase; done

The 2nd parameter (LeakBase) is the name I give to this dump.

Don’t be surprised if this command takes the better part of a day to complete, so run it in a screen session.

Post import tuning

Now that we are importing data to Elasticsearch, there is a small performance tuning that we can do: remove the indexing of the ‘message’ field. The message field contains the entire message that was received via Logstash. An index requires CPU power and uses (some) disk space. As we already index the sub-fields this is rather useless. You can also choose to not store it at all. But the general advice is to keep it around. You may never know when it becomes useful. With the following command we keep it, but we remove the indexation.

curl -XPUT http://localhost:9200/_template/passworddump_template -d '
{
 "template" : "passworddump-*",
 "mappings" : {
  "logs" : {
   "properties" : {
    "message" : {
     "type" : "text", "store" : "yes", "index" : "false",
     "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } }
    }
   }
  }
 }
}'
Now the only thing left to do is to instruct Elasticsearch to create the actual index patterns. We do this via the Kibana web interface:

  • click ‘Management’ and fill in the Index pattern: in our case passworddump-*.
  • Select the option ‘I don’t want to use the Time Filter’ as we don’t have any use for searching on specific time periods of the import.

Important note: if you create the index before you have altered the indexation options (previous step), your indexation preferences are not stored; they are only set at index creation time. You can verify this by checking if the ‘message’ field is searchable; it should not be. If it is, remove this index, store the template again and recreate the index.

There are a few more things that can make your life easier when querying:

  1. Change advanced setting discover:sampleSize to a higher value to have more results presented.
  2. Create a view with all the relevant data in 1 shot:
    By default Kibana shows the raw message on screen. This isn’t very helpful. Go to ‘Discover’, expand one of the results and hit the ‘Toggle column in table’ button next to the fields you want to be displayed, e.g. DumpName, Email, Password and Domain.
  3. Make this viewing a repeatable view
    Now hit ‘Save’ in the top bar, give it a name, and hit save. This search is now saved and you can always go back to this easy viewing.
  4. Large data search script
    For the domains that have more than a few hits, or for the cases where you want to redirect it to a file for easy importing to another tool, Kibana is not the easiest interface. I’ve created a little script that can help you. It requires 1 parameter: the exact Domain search string you would give to Elasticsearch when querying it directly. It returns a list of username:password for your query.
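My script is on our GitHub; to illustrate the underlying idea, a direct query against the Elasticsearch HTTP API could be sketched as below. The Domain.keyword field name assumes Elasticsearch’s default dynamic mapping for the fields our Logstash config creates, and for results beyond 10,000 hits you would page with the scroll API instead:

```python
import json
import urllib.request

def build_query(domain, size=1000):
    # Exact match on the Domain keyword field produced by the Logstash config.
    return {'size': size, 'query': {'term': {'Domain.keyword': domain}}}

def search_domain(domain, es='http://localhost:9200'):
    """Query Elasticsearch directly and return username:password lines."""
    req = urllib.request.Request(
        f'{es}/passworddump-*/_search',
        data=json.dumps(build_query(domain)).encode(),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req) as resp:
        hits = json.load(resp)['hits']['hits']
    return [f"{h['_source']['Email']}:{h['_source']['Password']}" for h in hits]
```

Running search_domain('example.com') would give you the same list the script produces, ready to redirect to a file for import into another tool.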

Good luck searching!

Publicly accessible .ENV files

( Original text by BinaryEdge )

Deployment is something a lot of companies still struggle with. We talked about the issue of Kubernetes being deployed insecurely in a blogpost a few weeks ago, and how Kubernetes pods are being hijacked to mine for cryptocurrency.

This week we look at something different but still related to deployments and exposing to the public things that should not be.

One tweet from @svblxyz (whom we would also like to thank for all the help reviewing this post and the tips on things to add) showed us an interesting Google dork, which made us wonder: what does this look like for IP addresses, versus the domain/service focus that Google search has?

“Don’t put your .env files in the web-server directory …” (@svblxyz, Sep 26, 2018)

So we launched a scan using our distributed platform, as simple as:

> curl -d '{
      "description": "HTTP Worldscan .env",
      "type": "scan",
      "options": [{
        "targets": ["XXXX"],
        "ports": [{
            "modules": ["http"],
            "port": "80",
            "config": { "http_path": "/.env" }
        }]
      }]
    }' -H 'X-Token:XXXXXX'

After this we started getting the results and of course multiple issues can be identified on these scans:

  • Bad Deployments — The .ENV files being accessible is something that shouldn’t happen — there are companies exposing this type of file fully readable with no authentication.
  • Weak credentials — Lots of services with a username/password combo using weak passwords.
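To illustrate what such a check amounts to for a single host (our platform does this at scale), a minimal probe might look like the sketch below. The helper names and the short list of sensitive keys are just examples, not part of our scanner:

```python
import urllib.request

# A few of the key names we saw exposed; extend to taste.
SENSITIVE = ('DB_PASSWORD', 'REDIS_PASSWORD', 'APP_KEY', 'AWS_SECRET_ACCESS_KEY')

def parse_env(text):
    """Parse KEY=VALUE lines from a .env body into a dict."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith('#') and '=' in line:
            key, _, value = line.partition('=')
            env[key.strip()] = value.strip().strip('"')
    return env

def check_host(host):
    """Fetch http://host/.env and report any sensitive keys it exposes."""
    try:
        with urllib.request.urlopen(f'http://{host}/.env', timeout=5) as resp:
            env = parse_env(resp.read().decode(errors='replace'))
    except OSError:
        return []  # connection refused, timeout, 404, etc.
    return [k for k in env if k in SENSITIVE]
```

A readable, well-formed .env body coming back from that URL is exactly the bad deployment described above.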

Credentials and Tokens

Many different types of service tokens were found:

  • AWS — 38 tokens
  • Mangopay — 9 tokens
  • Stripe — 89 tokens
  • Pusher — 1600 Tokens

Other tokens found include:

  • PlugandPlay
  • Paypal
  • Mailchimp
  • Facebook
  • PhantomJS
  • Mailgun
  • Twitter
  • JWT
  • Google
  • WeChat
  • Shopify
  • Nexmo
  • Bitly
  • Braintree
  • Twilio
  • Recaptcha
  • Ucloud
  • Firebase
  • Mandrill
  • Slack
  • Shopzcoin

Many of these systems involve financial records/payments.

But we also found access configurations to Databases, which potentially contain customer data, such as:

  • DB_PASSWORD keys: 1161
  • REDIS_PASSWORD keys: 801
  • MySQL credentials: 946 (username/password combos).

Looking at the passwords being used, the top 3 are all weak:

1 — secret — 93
2 — root — 33
3 — adminadmin — 24

Other weak passwords found are:

  • password
  • test123
  • foobar

When exposed tokens go super bad…


Something that is also very dangerous is a situation like CVE-2018-15133, where a leaked APP_KEY for a Laravel app allows an attacker to execute commands on the machine where the Laravel instance is running.

And our scan found: 300 APP_KEY Tokens related to Laravel.

One important note to take into account: we looked only at port 80, internet-wide, for our scan. The exposure can easily be much higher, as other web apps will surely be exposing more .env files!

Acoustic Audio Patterns Could Be Giving Away Your Passwords, Learned by Neural Nets

( Original text by nugget )

In an age where Facebook, Google, Amazon, and many others are amassing an immense amount of data, what could be more concerning than drastic advancements in artificial intelligence? Thanks to neural networks and deep learning, tons of decision problems can be solved by simply having enough labelled data. The implications of this big data coexisting with the artificial intelligence to harvest information from it are endless. Not only are there good implications, including self-driving cars, voice recognition like Amazon Alexa, and intelligent applications like Google Maps, but there are also many bad implications like mass user-profiling, potential government intrusion, and, you guessed it, breaches into modern cryptographic security. Thanks to deep learning and neural networks, your password for most applications might just be worthless.

Since the cryptographic functions used in password hashing are currently secure, many attacks attempt to acquire the user’s password before it even reaches the database. For more information, see this article on password applications and security. For this reason, attacks using keyloggers, dictionary attacks, and inference attacks based on common password patterns are common, and these attacks actually work quite frequently. Now, however, deep learning has paved the way for a new kind of inference attack, based on the sound of the keys being typed.


Investigations into audio distinguishing between keystrokes are not new. Scientists have been exploring this attack vector for many years. In 2014, scientists used digraphs to model an alphabet based on keystroke sounds. In addition, statistical techniques are used in conjunction with the likely letters that are being typed (determined by the sound patterns) to create words that have a statistical likelihood of being typed based on the overall typing sound. This “shallow learning” approach is a good example of a specific set of techniques developed for a specific task in data science research.

Approaches like this one were used for years in fields like image feature recognition. The results were never groundbreaking, because it is very difficult for humans to create a perfect model for a task that has a massive amount of considerable variables. However, deep learning is now in the picture, and has been for some time. Image recognition with deep learning is so good it is almost magical, and it certainly is scary. This means that this task of matching keystroke sounds with the keystrokes themselves might just be possible.

Implications of Neural Networks

Nowadays, training models to recognize keystrokes from audio is an established task. Keystroke sound has even been used successfully as a biometric authentication factor. While this fact is cool, the deeper implications are quite scary. With the ability to train a massive neural network on the plethora of labelled keystroke sound data available on the web, a high-accuracy model that predicts keystrokes from audio can be created with little effort.

Combined with other inference approaches, you could be vulnerable any time anyone is able to record you typing your password. In fact, according to this article, with some small amount of extra information, such as keyboard type and typist style, attackers have a 91.7% accuracy in determining keystrokes. Without this information, they still have an impressive 41.89% accuracy. Even with this low keystroke accuracy, attackers may still be able to determine your password, as the partial recovery could still clue them into your password style, e.g. using children’s or pets’ names in your passwords. Once attackers have an idea of your password style, they can massively reduce the password space, as stated in this article. With a reduced password space, brute force and dictionary attacks become extremely viable. Essentially, with advancements in deep learning, the audio of you typing your password is definitely a vulnerable attack vector.

What you can do to protect yourself

The main vulnerability of this attack lies in the VOIP software widely used by companies and individuals alike to communicate. When you use software like Skype, your audio is obviously transmitted to your call partners. This audio, clearly, includes audio of you typing. This typing could be deciphered using machine learning and inference attacks, and any attacker on the call could decipher some or all of what is being typed. Of course, some of this typed text may include passwords or other sensitive information that an attacker may want. Other vulnerabilities include any situation where someone may be able to covertly record your keystrokes. For example, someone may record you typing in person by using their phone without you knowing.

So, to protect yourself, be sure that you have a second factor authenticating you in important security applications. Most login interfaces such as Gmail offer 2-factor authentication. Be sure that this is enabled, so your password will not be the only factor in your login. This reduces the risk of attackers obtaining your password. Additionally, of course, using good password practices will make it harder for inference attacks to supplement deep learning in acquiring your password. Finally, you could certainly reduce the risk of audio-based attacks by not typing in passwords while on VOIP calls.


Certainly, there isn’t much you can do to mitigate the risk of your typing audio being eavesdropped. The implications that deep learning has for audio-based password attacks are definitely scary. Neural networks might well mean that your password is worthless, and they’re only getting stronger. The future of artificial intelligence will change not only modern authentication systems, but society in ways we can’t even imagine. The only thing we can do in response is be aware and adapt.

If you have any questions or comments about this post, feel free to leave a comment or contact me!

Password Hashes — How They Work, How They’re Hacked, and How to Maximize Security

( Original text by Cassandra Corrales )

According to Dashlane, the average user has at least 90 online accounts. We trust these accounts to protect highly sensitive information about our social lives, browsing habits, shopping history, finances and more. The only thing between your information and a malicious attacker is your password. That’s a lot of responsibility for a few characters of (sometimes) arbitrarily chosen text. So what exactly goes into making passwords secure?

How Password Hashes Work

Most passwords are hashed using a one-way hashing function. Hashing functions take the user’s password and use an algorithm to turn it into a fixed-length string of data. The result is like a unique fingerprint, called the digest, that cannot be reversed to find the original input. So, even if someone gets access to the database storing your hashed password, there is no key to decrypt it back to its original form.

In general, here’s how hashing systems work when you log in to an account:

  1. You enter your password
  2. A hashing function converts your password into a hash
  3. The generated hash is compared to the hash stored in the database
  4. If the generated hash and the stored hash match, you’re granted access to the account. If the generated hash doesn’t match, you get a login error.
How hash functions work. The digest will be stored in the database.
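The four steps above can be sketched in a few lines. Note that a plain, unsalted SHA-256 is used here purely to illustrate the flow; a real system should use the salted, slow functions discussed later in this post:

```python
import hashlib

def hash_password(password):
    # Step 2: a one-way function turns the password into a fixed-length digest.
    return hashlib.sha256(password.encode()).hexdigest()

def login(stored_digest, attempt):
    # Steps 3-4: hash the attempt and compare it to the digest in the database.
    return hash_password(attempt) == stored_digest

stored = hash_password('hunter2')  # what the database keeps
login(stored, 'hunter2')           # True  -> access granted
login(stored, 'hunter3')           # False -> login error
```

The database never sees 'hunter2' itself, only the 64-character digest.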

Hacking Hashes

Although hashes aren’t meant to be decrypted, they are by no means breach-proof. Here’s a list of some popular companies that have had password breaches in recent years:

Popular companies that have experienced password breaches in recent years.

What techniques do hackers use to hack the allegedly un-hackable? Here are some of the most common ways that password hashes are cracked:

  • Dictionary Attacks
  • Brute Force Attacks
  • Lookup Tables
  • Reverse Lookup Tables

*Note the difference between lookup tables and reverse lookup tables. Lookup tables begin with the precomputed password guess hashes, while reverse lookup tables begin with the table of password hashes from the user accounts database.

  • Rainbow Tables

Rainbow tables are very similar to reverse lookup tables, except rainbow tables use reduction functions to make significantly smaller lookup tables. The result is a trade-off, where rainbow tables are slower, but require less storage space.
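The table-based attacks above all trade precomputation for lookup speed. A toy lookup table against unsalted MD5 digests makes the idea concrete; the candidate list here is of course a stand-in for a real cracking dictionary:

```python
import hashlib

def md5(p):
    return hashlib.md5(p.encode()).hexdigest()

# Precompute digests for candidate passwords once (the "lookup table")...
candidates = ['password', '123456', 'letmein', 'hunter2']
table = {md5(p): p for p in candidates}

# ...then crack any dump of unsalted hashes by direct dictionary lookup,
# with no further hashing work per stolen digest.
stolen = [md5('123456'), md5('correct horse')]
cracked = {h: table[h] for h in stolen if h in table}
```

Only '123456' is recovered here; 'correct horse' is not in the table, which is exactly why longer, random passwords resist these attacks.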

How to Maximize Password Security — As a User:

  1. Start with a strong password
  • The longer the password, the better. A lengthy password is less vulnerable to brute force attacks. Sentences are good.
  • Use random words. Less association between the words in your password makes it less vulnerable to dictionary attacks
  • Mix in different characters and numbers. Again, this makes you slightly less vulnerable to dictionary attacks.

2. Change up your password from time to time and from app to app

  • If a password breach happens with one account, that password hash has been cracked and needs to be changed for every account it’s used on.

How to Maximize Password Security — As a Developer:

  1. Stay away from SHA-1 or MD5 hashing functions

SHA-1 and MD5 are outdated and have already been targeted by numerous table attacks. They are fast cryptographic functions and are therefore easier to hack.

Better hashing function options are computationally expensive and therefore more difficult to hack. These are some better hashing algorithms that will minimize password security risks in your application:

  • Argon2 — Winner of the password hashing competition. Uses a lot of memory, so it’s difficult to attack.
  • PBKDF2 — Has no known vulnerabilities after 15 years of extensive use, although it is lower on memory use.
  • scrypt — Very safe, but may have some limitations because it was not designed for password storage.
  • bcrypt — An adaptive hashing function, can be configured to remain slow and therefore resistant to attacks.
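Of the options above, PBKDF2 is the one that ships in Python’s standard library as hashlib.pbkdf2_hmac. As a sketch of how a slow, salted hash is used in practice (the 600,000 iteration default follows current OWASP guidance for PBKDF2-HMAC-SHA256; tune it for your hardware):

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, iterations=600_000):
    """Derive a PBKDF2-HMAC-SHA256 digest of the password.
    The iteration count is the tunable work factor that keeps the
    function computationally expensive for attackers."""
    salt = salt or os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, iterations)
    return salt, digest

def verify(password, salt, digest, iterations=600_000):
    candidate = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison
```

The salt and iteration count are stored alongside the digest, so verification can re-derive the exact same value.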

2. Always add Salt

A salt is a random string that is added to the password before hashing. This transforms the password into a completely different string and will thus generate a different hash for each salt, even when two users choose the same password.

Resulting outputs when you hash the password “hello” with different salts.
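The effect shown in that caption is easy to reproduce: hashing the same password under two different salts yields two unrelated digests. SHA-256 is used here only to demonstrate the salt’s effect, not as a recommended password hash:

```python
import hashlib
import os

def salted_digest(password, salt):
    # Prepend the salt to the password before hashing.
    return hashlib.sha256(salt + password.encode()).hexdigest()

salt_a, salt_b = os.urandom(16), os.urandom(16)
# Same password, two different salts -> two completely different digests,
# so a precomputed table of unsalted hashes is useless against salted storage.
salted_digest('hello', salt_a) != salted_digest('hello', salt_b)  # True
```

This is why salting defeats lookup, reverse lookup and rainbow tables: the attacker would need a separate precomputed table per salt.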

The “better” hashing algorithms listed above all add salts, but if you need to use another hashing function, don’t forget the salt.