The best-laid plans of mice and men often go awry – or become hopelessly corrupted. That's a maxim that could apply to the “dark web” – that underground dungeon of information that has been adopted as an online home by hackers, con artists, scammers and other “undesirables.”
The information in that repository – conversations, notes, databases – could go a long way toward making organisations more secure, as it could contain important tactics and techniques used by hackers to attack IT systems, or provide clues on the next big target. Unfortunately, getting access to that repository is extremely difficult, even for well-seasoned IT personnel. It's the kind of work that is perhaps best off-loaded to outside organisations that specialise in dealing with the dark web.
What's become known colloquially as the dark web didn't necessarily start out that way. Getting a handle on it – connecting to it, searching it and indexing it – is difficult, because it does not use the “regular” world wide web to connect with users and other sites, but rather the Tor network, a worldwide overlay network managed by volunteers that uses more than 7,000 relays to conceal a user's location and online activities. Sites that use Tor – originally an acronym for “The Onion Router” – are much like “regular” sites, containing content, data, music, videos, and anything else one would put on a website. The difference is that accessing that content requires the Tor browser.
Tor has been a real boon for residents of countries where open Internet use is discouraged, or outright banned. Just this year, Iran began cracking down on Internet use in the wake of the mass anti-government protests, blocking YouTube, Facebook, Twitter, Instagram, and other services. As a result, experts say that use of Tor has skyrocketed in that country, as Iranians eager to maintain Internet contact with the rest of the world – and to tell the story of the protests – use Tor to evade detection by the government. The same holds true in other countries where the authorities have attempted to repress Internet usage, such as China and Russia. Tor, by the way, was actually developed by the US government and released to the public in 2002, and the Tor Project received much of its funding from the government for many years.
What works for dissidents, however, works just as well for criminals, and it would be very useful for IT teams to get an inside look into the sites that these criminals are using on the Tor network, if only to get an insight into the way hackers work – what their plans are, how they decide whom to attack, what their preferred targets are, etc. The problem is tracking down these sites – finding the ones that contain the information that cyber-security personnel need in order to find the answers to those questions.
When we want to gather information on the surface web, we deploy crawlers that gather information from sites (subject, of course, to the permissions set out in their robots.txt files). As dark web sites use the same HTML as surface websites, there's no reason we shouldn't be able to use crawlers to gather data from those sites, as well.
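On the surface web, that robots.txt permission check is routine crawler plumbing. A minimal sketch using Python's standard-library parser – the rule set and URLs here are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (normally fetched from the target site;
# this sample rule set is made up for the example).
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A well-behaved crawler consults the rules before fetching each URL.
print(rules.can_fetch("*", "https://example.com/public/page"))   # allowed
print(rules.can_fetch("*", "https://example.com/private/page"))  # disallowed
```

In a real crawler, `rules.read()` would fetch the live robots.txt from each site before any pages are requested.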
Except for the fact that finding dark web sites – programming the crawlers to search specific sites – is extremely complicated and difficult. Unlike surface web addresses, dark web sites don't use hostnames resolved by the public DNS, but instead use onion addresses (“onion” because of the many layers that need to be peeled back in order to find the site). For example, scihub22266oqcxt.onion (SciHub, a scientific research site with over 50 million papers, hosted for free). As the Internet Corporation for Assigned Names and Numbers describes it, Tor software creates a 16-character hostname by first computing a hash of the public key of that key pair and then converting the first 80 bits of this hash from a binary value to ASCII to make the resulting 16 characters conform to the “letter digit hyphen” requirement of the Domain Name System (DNS) protocol.
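That derivation – hash the public key, keep the first 80 bits, encode as 16 ASCII characters – can be sketched in a few lines of Python for the older version-2 onion format. The key bytes below are a stand-in, not real Tor key material:

```python
import base64
import hashlib

def v2_onion_address(der_public_key: bytes) -> str:
    """Derive a v2 .onion hostname: SHA-1 the DER-encoded public key,
    keep the first 80 bits (10 bytes), and base32-encode them,
    yielding a 16-character label."""
    digest = hashlib.sha1(der_public_key).digest()
    label = base64.b32encode(digest[:10]).decode("ascii").lower()
    return label + ".onion"

# Stand-in key material, for illustration only:
addr = v2_onion_address(b"not a real RSA public key")
print(addr)  # a 16-character base32 label ending in .onion
```

Because the label is derived from a hash rather than registered with any authority, there is no central directory to consult – which is exactly what makes these sites hard to enumerate.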
In addition, finding new sources of data on the dark web is difficult, because sites there generally don't connect to each other via a large number of links, as on the surface web. That means that crawling is going to be much more limited, and thus difficult, unless you deploy crawlers that can deal with these unique sites. Plus, the data is often behind password protection, so a proper system is needed to log in and collect it automatically.
The entire connection process is foreign to those who are not experienced in it. And the many layers of encryption afforded by the relays used to direct Tor traffic – combined with the lack of a logical DNS-style address system that a crawler can understand, and the slow performance – mean that the vast majority of IT teams are not equipped with the skills to deal with searching the dark web. It's a difficult, even Sisyphean, task for most IT departments.
Contributed by Ran Geva, CEO of Webhose
*Note: The views expressed in this blog are those of the author and do not necessarily reflect the views of SC Media UK or Haymarket Media.