Don't be caught out by data duplication

May 13, 2021

When we founded Darkbeam back in the summer of 2017, our first data collector was designed to crawl the internet and look for entities of interest. By far the most popular of those entities was the email and hash pair (or cleartext password in some cases), which many of our clients would use to assess whether any of their accounts were vulnerable to unauthorised access. As we expanded our data ecosystem, we added several other collectors to consume APIs and various other sources, and began to look for databases of leaked credentials.

Our initial API products allowed our clients to query our database for emails belonging to their domains, and we would return a Uniform Resource Identifier (URI) indicating where a specific entity had been seen. As time progressed and we added more data sources, however, we began to notice a number of changes taking place in the data. To begin with, it became obvious that the commercial ecosystem around this data was a lot more complex than many of our rivals’ sales and marketing pitches suggested. The data wasn’t just being “sold on the dark web”; it was being shared everywhere, and there was also a significant reseller market, which served to give the data a limited shelf life.

We also began to notice a lot of duplicate data being discovered by our crawlers, with leaked data being bundled into ever larger collections. As this became more common, the URI we provided to clients began to lose its value. We were also having to store duplicate email and hash pairs, purely to record the numerous locations where these entities were being exposed.
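To make the storage problem concrete, here is a minimal sketch of one way to keep a single entry per email and hash pair while still recording every location it was seen. It is an illustration only, not a description of our actual schema: the function name, the sample records, the example URIs and the choice to lower-case emails before comparing them are all assumptions.

```python
from collections import defaultdict


def index_sightings(sightings):
    """Group (email, hash, uri) sightings under one key per (email, hash) pair.

    Hypothetical helper for illustration; not Darkbeam's implementation.
    """
    index = defaultdict(set)
    for email, pw_hash, uri in sightings:
        # Assumption: emails are compared case-insensitively, so normalise them.
        key = (email.strip().lower(), pw_hash)
        index[key].add(uri)
    return index


# Made-up sample data: the same pair appears in two different locations.
sightings = [
    ("alice@example.com", "5f4dcc3b", "https://paste.example/1"),
    ("Alice@example.com", "5f4dcc3b", "https://forum.example/dump-a"),
    ("bob@example.com", "e99a18c4", "https://paste.example/1"),
]

index = index_sightings(sightings)
print(len(sightings), "sightings,", len(index), "unique pairs")
```

Stored this way, each repackaged collection adds at most one URI to an existing entry rather than another copy of the pair itself.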

As an experiment, we recently took over 500GB of compressed data and indexed just the email and hash pairs, ignoring the URI. The first 50GB gave us around 3 billion records, with the remaining 450GB adding only approximately 400 million more. This illustrates the amount of duplicate data that exists within many of the database collections currently circulating around the internet.
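The scale of the drop-off is easier to see as a yield per gigabyte. The short calculation below uses only the rounded figures quoted above, so the results are rough; the variable names are ours, and it assumes both counts refer to records added to the index.

```python
def records_per_gb(records, gigabytes):
    """Average number of records added per GB of compressed input."""
    return records / gigabytes


# Rounded figures from the experiment described above.
first_records, first_gb = 3_000_000_000, 50
rest_records, rest_gb = 400_000_000, 450

first_yield = records_per_gb(first_records, first_gb)
rest_yield = records_per_gb(rest_records, rest_gb)

# Share of all indexed records that came from the final 450GB.
rest_share = rest_records / (first_records + rest_records)

print(f"First 50GB:      {first_yield:,.0f} records per GB")
print(f"Remaining 450GB: {rest_yield:,.0f} records per GB")
print(f"Yield ratio:     {first_yield / rest_yield:.1f}x")
print(f"Share from the remaining 450GB: {rest_share:.1%}")
```

On these numbers, the last 90% of the input by size contributed roughly 12% of the indexed records.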

Whilst we initially envisioned ourselves identifying sites offering leaked data, and potentially serving takedown notices to assist clients, we quickly began to expand our data collection and analysis suite to cover as many external data points as we could find. As we moved into 2021, we deployed the latest version of our backend API, and we are now adding both breadth and depth to our collectors. Our aim is to enable our clients to see their digital infrastructure as it appears to a potential attacker, and also to assess their suppliers and partners. We are also working on a new app, which will be deployed later this year and will contain updates and improvements in response to feedback from our current clients and associates.

For those who are curious, the average length of an email address in our database is approximately 20 characters, and the average password length is just over 8 characters.




Steve Tyrens
