06 November, 2013

Surfing the Deep Web

The Deep Web (also called the Deepnet, the Invisible Web, the Undernet or the Hidden Web) is World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines like Google. It should not be confused with the dark Internet, the computers that can no longer be reached via the Internet, or with a Darknet distributed filesharing network, which could be classified as a smaller part of the Deep Web. [1]

The Deep Web is a complex concept that essentially covers two categories of data. The first is any information that is not easy to obtain through standard searching: Twitter or Facebook posts, links buried many layers down in a dynamic page, or results that sit so far down the standard search results that typical users will never find them. The second category is the larger of the two and represents a vast repository of information that is not accessible to standard search engines. It comprises content found in websites, databases, and other sources. Often it is only accessible through a custom query directed at an individual website, which cannot be accomplished by a simple “surface web” search. [2] Some of the more comprehensive search engines have written algorithms to search the deeper portions of the World Wide Web by attempting to find files such as .pdf, .docx, .xls, .ppt, .ps and others. These files are predominantly used by businesses to communicate within their organization or to disseminate topical information and work product to customers and potential clients.

[Fig] the Deep Web size compared to the Surface Web size, taken from :

Traditional search engines create their indices by spidering, or crawling, surface Web pages. To be discovered, a page must be static and linked to other pages. Traditional search engines cannot "see" or retrieve content in the Deep Web; those pages do not exist until they are created dynamically as the result of a specific search. Because traditional search engine crawlers cannot probe beneath the surface, the Deep Web has so far remained hidden. The Deep Web is qualitatively different from the Surface Web: Deep Web sources store their content in searchable databases that only produce results dynamically in response to a direct request. A direct query, however, is a laborious, "one at a time" way to search.

Surface Web : Parts of the Internet that can be found via link crawling techniques, meaning the data is linked and can be reached by following links from the homepage of a domain. Google can find this data.
Deep Web : Portions of the Internet that cannot be accessed by a link crawling search engine like Google. The only way a user can reach this portion of the Internet is by making a directed query through a web search form, which returns content from a database that is not linked data. In layman’s terms, it is a search within a particular website. [7]
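The distinction between link crawling and a directed query can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the toy site, its page names, the in-memory "database" and the function names are assumptions made for illustration, not a description of how any real search engine works. It only shows why a crawler that follows links never reaches records that exist solely as answers to a search form.

```python
from collections import deque

# Hypothetical toy site: each static page and the links it contains.
STATIC_PAGES = {
    "/": ["/about", "/search"],
    "/about": ["/"],
    "/search": [],  # a search form: it links to no individual record
}

# Hypothetical database behind the search form; records have no static URL.
DATABASE = {
    "tor": "Record: notes on onion routing",
    "bergman": "Record: The Deep Web white paper",
}


def crawl(start):
    """Breadth-first link crawl: return every page reachable by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in STATIC_PAGES.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def search_form(query):
    """Directed query: the result is produced only when someone asks for it."""
    return DATABASE.get(query.lower())


# The crawler finds the three static pages, including the form itself...
print(sorted(crawl("/")))
# ...but the records appear only in response to a specific query.
print(search_form("Bergman"))
```

In this sketch the crawler discovers the search page but none of the records behind it, which is the sense in which that content sits "beneath the surface".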

To put it in context, the Deep Web isn’t found in a single location. It consists of both structured and unstructured content, a huge amount of which is found in databases. This content has often been compiled by experts, researchers and analysts, and through automated processing systems, at an array of institutions throughout the world. All of the content is housed in different systems, with different structures, at physical locations that can be as far apart as New York and Hong Kong. It’s almost impossible to measure the size of the Deep Web. While some early estimates put the size of the Deep Web at 4,000-5,000 times larger than the Surface Web, the changing dynamic of how information is accessed and presented means that the Deep Web is growing exponentially and at a rate that defies quantification. [2]

“The Deep Web has existed for more than a decade but came under the spotlight last month after police shutdown the Silk Road website - the online marketplace dubbed the 'eBay of drugs' - and arrested its creator. But experts warn this has done next to nothing to stem the rising tide of such illicit online exchanges, which are already jostling to fill the gap now left in this unregulated virtual world. Meanwhile, even as the Silk Road was trundling to a halt, already hundreds of other websites were springing up in its place, peddling anything from drugs to stolen identities, illegal weapons to sickening child pornography and even explosives. In June it emerged one such site, called Atlantis, was even offering its wares in an advert posted on YouTube. Hiring a hitman has never been easier. Nor has purchasing cocaine or heroin, nor even viewing horrific child pornography. Such purchases are now so easy, in fact, that they can all be done from the comfort of one's home at the click of a button... and there's almost nothing the police can do about it.” [3]

However, is the above hype really justified? There seems to be a grave misunderstanding of what the Deep Web actually is and of what information is accessible on the Internet through encryption procedures. All the recent cases that have “shocked” reporters and users alike refer to Tor sites, i.e. sites that standard search engines cannot reach because they are only accessible through the encrypted Tor network.

Tor (originally TOR, an acronym for The Onion Router, a usage now abandoned) is free software for enabling online anonymity. Tor directs Internet traffic through a free, worldwide, volunteer network consisting of more than four thousand relays to conceal a user's location or usage from anyone conducting network surveillance or traffic analysis. Using Tor makes it more difficult to trace Internet activity, including "visits to Web sites, online posts, instant messages, and other communication forms", back to the user, and is intended to protect the personal privacy of users, as well as their freedom and ability to conduct confidential business, by keeping their Internet activities from being monitored. [4] Read the “history” section of the Wikipedia article and you will realise that Tor is neither secure nor as “independent” in nature as most believe. Nor is Tor “on the Deep Web”, as most people suggest.
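The idea of routing traffic through a chain of relays, each of which learns only the next hop, can be sketched in a few lines of Python. This is a toy illustration of the "onion" layering the name refers to, not the actual Tor protocol: the XOR keystream cipher below is deliberately simplistic and NOT secure, and the relay names, keys and function names are all hypothetical.

```python
import hashlib
import json


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher (NOT secure): XOR data with a SHA-256-derived keystream.

    Applying it twice with the same key returns the original data.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def wrap(message: str, destination: str, path: list) -> str:
    """Build the onion from the inside out; the last relay's layer is innermost."""
    next_hop, inner = destination, message
    for name, key in reversed(path):
        layer = json.dumps({"next": next_hop, "data": inner}).encode()
        inner = keystream_xor(key, layer).hex()
        next_hop = name
    return inner  # hex string handed to the first relay in the path


def peel(key: bytes, onion_hex: str):
    """A relay removes its own layer and learns only where to forward the rest."""
    layer = json.loads(keystream_xor(key, bytes.fromhex(onion_hex)).decode())
    return layer["next"], layer["data"]


# Hypothetical three-relay path with made-up keys.
path = [("entry", b"key-1"), ("middle", b"key-2"), ("exit", b"key-3")]
onion = wrap("hello", "example.org", path)

for name, key in path:
    next_hop, onion = peel(key, onion)
    print(f"{name} forwards to {next_hop}")

print(onion)  # only after the last layer is removed is the message readable
```

In this sketch the entry relay knows who sent the packet but not its final destination, while the exit relay sees the destination but not the sender, which is the property that makes tracing activity back to the user more difficult.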

While Tor is used by everyone from law enforcement to Syrian dissidents to protect valuable information, it is a double-edged sword. Many experts warn that groups ranging from the Russian mafia to international drug cartels are looking closely at the lessons learned from the Silk Road. It took the FBI more than two years of investigative work to find Ulbricht. They don’t have the resources to compete with Silicon Valley in hiring, or the tools: a long-hoped-for modernization of the law governing online wiretapping is on ice in Congress thanks to Edward Snowden. [5]



“The so called Darknet is a part of the internet encrypted and partially hidden from indexing, but it still runs on the physical network infrastructure and uses the TCP and IP protocols for transmission and identification. Far more interesting is the new physical networks being created, using preexisting power, cable and telecommunication lines, with new data transfer and node id protocols. Totally separate networks using a multitude of unique languages and rules. Three primitive forms run in LA, New York and possibly London. Inter-network communication is provided by translator servers, with local protocol knowledge. These will surely be the real darknet of the not so distant future.” [3, comments page]

In your quest to surf the Deep Web, you may try the following search companions; be advised, however, that they mostly deal with Tor sites. The source article also offers a comparison of the performance of these search engines. [6]

Evil Wiki : Without a doubt, this is the single best entry point into the world of Tor. The well-maintained website provides an organized list of links to hidden services with explanations and even reviews. It’s not meant to be used as a search engine, but it often is.
TorSearch : A new search engine that has garnered some buzz in publications like VentureBeat. It operates in much the same way as Google, with a link-crawling spider that will forever build its arsenal.
Google : With proxy tools like Onion.to, Google actually crawls much of the Deep Web in a roundabout way. And because it’s so popular, it’s the first tool that almost anyone who hears about the Deep Web uses.
DuckDuckGo : Similar to Google but with one significant difference, DuckDuckGo offers anonymous search, a feature in keeping with Tor’s powers of anonymity. It’s no surprise that it’s popular among the Tor crowd.
Torch : An older Deep Web search engine, Torch has existed for a long time but with little fanfare.
Hidden Wiki : The Hidden Wiki is a website that uses hidden services available through the Tor network. The site has a collection of links to other .onion sites, and encyclopedia articles in a wiki format.


Further reading :
Bergman, Michael K., White Paper, “The Deep Web: Surfacing Hidden Value”, http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104
