Deep Web, The Dark Side Of The Internet
The deep web, also called the invisible web, is the part of the Web that is not indexed by the major search engines.
Understanding the Web and the Internet
A quick definition of these two concepts is necessary before getting to the heart of the matter. The Internet is a network of computer networks, made up of millions of both public and private networks.
Information travels over it using various transfer protocols, which enable a variety of services: SMTP for e-mail, peer-to-peer protocols, or HTTP and HTTPS for the World Wide Web, more commonly known as the Web.
In other words, the Web is one application among many that use the Internet and its millions of networks as a physical medium and means of transport, just like e-mail.
It is an information network made up of billions of documents scattered on millions of servers around the world and linked to each other according to the principle of hypertext.
The web is often compared to a spider’s web because the hypertext links between documents can be likened to the threads of a web, and the documents to the nodes where those threads intersect.
And the web is itself composed of two parts: the visible web and the invisible web, more commonly called the Deep Web.
But to understand what the Deep Web really is, we should first talk about the visible web, indexing robots, the opaque web, and deep resources.
The visible web, also called the surface web, is the content that can be found through classic search engines such as Google, Yahoo, or Bing, whatever browser you use to reach them (Mozilla Firefox, Internet Explorer, Google Chrome, etc.). It therefore includes all the sites and pages indexed and referenced by these engines.
For example, when you type “dailytechmonde” on Google, you will find a direct link to a website.
In other words, a page indexed on a referenced website. In order to offer you this page, the search engine in question has searched a database it has previously created by indexing all possible web pages.
It has thus, long before, tried to understand the content of all these pages in order to be able to propose them to the user when he carries out a keyword search. I talk about keywords because that’s what we use most of the time with the different search engines.
To discover new pages and keep their databases up to date, search engines use special programs, the famous indexing robots, which move from page to page by following hyperlinks.
They are also called “crawlers”, “spiders”, or simply “bots”, a contraction of the word “robots”. Once a website has been indexed by these robots, its content can be found on demand.
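The link-following behavior described above can be sketched in a few lines. This is a toy illustration, not a real crawler: it only extracts the hyperlinks from one page, which a crawler would then queue up and visit in turn.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag -- the way a crawler discovers new pages."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A toy page standing in for a real document the crawler just fetched.
page = '<p>See <a href="/about">about</a> and <a href="https://example.com/blog">the blog</a>.</p>'

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # these URLs would be queued for the next crawl step
```

A real crawler repeats this loop at enormous scale: fetch a page, extract its links, add the new ones to the queue, and index the content along the way.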
But despite significant material resources, crawlers are not able to follow all the theoretically visible links that the Web contains.
In order to study the behavior of crawlers when faced with sites containing a large number of pages, a team of German researchers has, for example, created a website with more than 2 billion pages.
The site had a binary tree structure and was very deep: it took at least 31 clicks to reach some pages. The researchers left it online for a year without any changes.
And the results showed that the number of indexed pages for this site, in the best case, did not exceed 0.0049%. This part of the web, theoretically indexable, but not indexed in fact by the engines is nicknamed the “opaque web”, which is located right between the visible web and the deep web.
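To put that percentage in perspective, a quick back-of-the-envelope calculation (using the figures from the study above) shows how few of the 2 billion pages that best case represents:

```python
# Sanity check on the German study's figure: 0.0049% of a 2-billion-page site.
total_pages = 2_000_000_000
indexed = total_pages * 49 // 1_000_000  # 0.0049% = 49 pages per million
print(indexed, "pages indexed, at best")
```

In other words, even in the best case the crawlers surfaced fewer than 100,000 of the 2 billion pages.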
So, the visible web can be indexed and it is the case. The opaque web can be indexed, but it is not.
The Deep web cannot be indexed
In order for a website to be indexed by crawlers, then placed in the database by indexing robots and thus be referenced by search engines, it must comply with certain standards.
These standards concern the format, the content, or the accessibility of the site to the robots. Note that a website can contain both pages that comply with these standards and pages that do not, in which case only the compliant ones will be referenced.
All websites that are directly accessible via search engines, therefore, respect a minimum of these standards. The referenced pages of all these sites form what is called the visible web: the part of the web that respects these standards. But it would represent only 4% of the total web.
The remaining 96% is the so-called deep resources: pages that do exist on the web but are not referenced by search engines for many reasons.
Starting with non-compliance with the established standards, but not only that. These deep resources, which would therefore represent 96% of the entire web, form what is called the “Deep Web”, also known as the invisible web or the hidden web.
I say “would” because this ratio varies according to the studies that have been carried out. For example, according to some specialists in 2008, the Deep Web represented only 70% of the Web, or about a trillion non-indexed pages at the time.
A July 2001 study conducted by the BrightPlanet company estimated that the Deep Web could contain 500 times more resources than the visible web.
According to Chris Sherman and Gary Price in their book “The Invisible Web”, the visible web represents 3 to 10% of the Web, leaving 90 to 97% for the Deep Web. According to a Canadian researcher at the end of 2013, it would be more on the order of 10% for the visible web and 90% for the Deep Web.
And according to a study published in the journal Network, any search on Google would surface only 0.03% of the information that exists online, or roughly 1 page out of every 3,000.
The percentage that stands out most often is still 4% for the visible web and 96% for the Deep Web. Just keep in mind that the visible web is actually only a tiny part of the entire web.
And that’s why the iceberg metaphor is often used as a representation. The emerged part represents the visible web, and the submerged part, the famous deep resources that make up the Deep web.
Moreover, these resources, in addition to being numerous, are often of very good quality, because the files are less heavily compressed. But back to indexing.
There is a multitude of sites, pages, and documents, which the classical search engines cannot reference. Either because they simply don’t have access to these pages, or because they can’t understand them.
There are a multitude of reasons, but the main ones are:
* Unlinked content.
* Script-based content.
* Non-indexable formats.
* Oversized content.
* Private content.
* Limited-access content.
* The Internet of Things.
* Dynamic content.
* Content under a non-standard domain name.
It goes without saying that some websites combine several of these factors. As far as unlinked content is concerned, some pages of a site are simply not connected to any other page by hyperlinks, and therefore cannot be discovered by indexing robots, which only follow hyperlinks. These are called pages without backlinks.
As for the non-indexable format, the Deep Web is also made up of resources using data formats that are incomprehensible to search engines.
This has been the case in the past, for example, with the PDF format, or those of Microsoft Office, such as Excel, Word or PowerPoint. The only format initially recognized by robots was the native language of the web, namely HTML.
But search engines are gradually improving and index more and more formats. Today they can recognize, in addition to HTML, the PDF and Microsoft Office formats, and, since 2008, Flash content.
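One common way for software to decide what format a file is in is to look at its first bytes, the so-called “magic numbers”. Here is a minimal sketch of the idea; the signatures are real, but the labels and the `sniff` helper are simplified for illustration:

```python
# Format sniffing via "magic numbers" -- the first bytes of a file.
SIGNATURES = {
    b"%PDF": "PDF document",
    b"PK\x03\x04": "ZIP container (modern Office files: .docx, .xlsx, .pptx)",
    b"<!DOCTYPE html": "HTML document",
}

def sniff(data: bytes) -> str:
    """Return a label for the first matching signature, or 'unknown format'."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown format"

print(sniff(b"%PDF-1.7 ..."))    # a PDF file always starts with %PDF
print(sniff(b"PK\x03\x04 ..."))  # modern Office files are really ZIP archives
```

Recognizing the container is only the first step, of course: the engine then still has to be able to parse the format and extract text from it before it can index anything.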
As far as oversized content is concerned, traditional search engines only index between 5 and 60% of the content of sites accumulating large databases.
This is the case, for example, of the National Climatic Data Center with its 370,000 GB of data, or the NASA site with its 220,000 GB.
The engines therefore only partially index these voluminous pages. Google and Yahoo, for example, stop indexing a page beyond 500 KB.
As for private content, some pages are inaccessible to robots, due to the will of the website administrator.
A “robots.txt” file placed at the root of a site lets its administrator allow the indexing of only certain pages or documents, and thus protect copyrighted material.
For example, you may not want some of the images or photos on your site to appear on Google Images, or you may want to limit crawler visits to keep the site from being accessed too often.
But it is not uncommon for a robots.txt at the root of a website to completely block the indexing and SEO of the entire site. Indeed, some people deliberately choose not to have their site referenced, in order to keep its information private.
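Python’s standard library can read these rules directly, which makes the mechanism easy to demonstrate. The robots.txt below is hypothetical; it hides one directory from every crawler:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt hiding the /private/ directory from all crawlers.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved crawler asks before fetching each URL:
print(rp.can_fetch("Googlebot", "https://example.com/index.html"))  # allowed
print(rp.can_fetch("Googlebot", "https://example.com/private/x"))   # blocked
```

Note that robots.txt is a convention, not an enforcement mechanism: well-behaved crawlers honor it, but nothing technically stops a rogue bot from fetching the “disallowed” pages anyway.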
The only way to access such a page is then to know its full URL. The site developer can choose to distribute the address to a few people in a specific community, for example on a forum like Reddit or 4chan, and those people can then circulate it by word of mouth. Invitation links to Discord servers work exactly the same way.
This is what is more commonly known as the private web, which is a category related to the Deep Web, and which is quite similar to the Dark Net.
As far as limited access content is concerned, some websites require authentication with a login and a password to access the content. This is more commonly known as the proprietary web.
This is the case, for example, of some sub-forums, or some sites with paid archives, such as online newspapers, which sometimes require a subscription. Some sites also require you to fill out a captcha, or Turing test, to prove that you are human and thus access the content.
Still, other sites sometimes require you to fill in a search criteria form to be able to access a specific page. This is the case, for example, of sites that use databases.
As for the Internet of Things, also known as the IoT, it is the network of all connected physical objects that have their own digital identity and are capable of communicating with each other.
From a technical point of view, the IoT relies on the direct digital identification of each of these objects through a wireless communication system, such as Wi-Fi or Bluetooth.
However, some of these objects have a URL and even speak HTTP, yet they are not indexed by traditional search engines: on the one hand it would be pointless, and on the other hand it could lead to abuse.
But some specialized search engines, such as Shodan, ignore these concerns and let you run much deeper searches, particularly across the Internet of Things.
You can then stumble upon specialized pages for connecting to connected objects. For example, with real-time vehicle tracking, or even unprotected video devices. It can just as easily be surveillance cameras, such as private webcams that do not require a password for access.
So you understand the problems that can arise. I take this opportunity to advise you to always unplug your webcam when you are not using it.
And if it is built into your laptop, at least cover the lens. Keep in mind, though, that the webcam’s microphone will still be operational in that case. That is why it is always better to unplug the device when you can, rather than just hide the lens.
As far as dynamic content is concerned, websites contain more and more dynamic pages. However, in this case, the navigation hyperlinks are generated on demand and differ from one visit to another.
Basically, the content of the pages fluctuates according to several parameters and the links change according to each user, thus preventing indexing.
For example, let’s say you want to buy a ticket from Paris to Marseille. You type SNCF into Google, go to the site, then to the search page, and fill in a form with your information: the names of the cities, your travel class, your age group, dates, times, and so on.
Once confirmed, you then arrive on a well-defined SNCF page, generated thanks to filters in its database, following the information you have provided.
This page, which shows very specific train timetables and available fares, cannot be found directly through a Google keyword search, as you will agree.
It is therefore a page that is not indexed by any search engine. I imagine all of you have done this kind of SNCF search at least once. Well, congratulations: you were on the Deep Web at that moment.
Finally, as far as content under a non-standard domain name is concerned, these are websites whose domain name does not resolve through standard DNS, for example because the top-level domain is not registered with ICANN.
ICANN is the Internet Corporation for Assigned Names and Numbers: the body that assigns domain names and numbers on the Internet.
The top-level domains recognized by ICANN are .COM, .FR, .CO, .GOV, and many others, depending on the country. But there are non-standard domain names that are only accessible via specific DNS servers.
The Domain Name System (DNS) is the service that translates a domain name into the various pieces of information associated with it, in particular the IP address of the machine bearing that name.
The best-known and most interesting example is the .onion suffix, which can only be resolved on the Tor network, typically through the Tor Browser. This is the famous Dark Net, which provides access to much of the less accessible side of the Deep Web: the Dark Web.
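The distinction can be illustrated with a toy check on a hostname’s suffix. The TLD list below is a tiny sample, not exhaustive, and the .onion address is a made-up placeholder:

```python
# A toy distinction between ICANN-recognized TLDs and special-use suffixes
# such as .onion, which ordinary DNS resolvers cannot translate into an IP.
STANDARD_TLDS = {"com", "org", "net", "fr", "co", "gov"}  # tiny sample only

def needs_special_resolver(hostname: str) -> bool:
    """True if the hostname's suffix is outside the ordinary DNS hierarchy."""
    tld = hostname.rsplit(".", 1)[-1].lower()
    return tld not in STANDARD_TLDS

print(needs_special_resolver("example.com"))           # plain DNS handles this
print(needs_special_resolver("somehiddensite.onion"))  # only Tor can resolve this
```

In practice the split is not made by a lookup table like this one: an ordinary resolver simply has no way to answer a .onion query, because .onion names are resolved inside the Tor network itself rather than by the public DNS root.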
In any case, you just have to understand that there are many, many cases where traditional search engines are unable to list a site or at least some of its pages.
All these inaccessible pages, at least in a direct way via search engines are therefore called the Deep Web resources and form what is called the Deep Web.
The average user, therefore, navigates every day on a minor part of the Web, the visible Web. From time to time, he or she may surf the Deep Web without realizing it, as with the example of the SNCF reservation.
I used this example, but there are plenty of other cases where you are surfing the Deep Web.
For example, when you check your emails on your Gmail, you are on the Deep Web.
When you consult your customer area on your telephone operator’s website, you are on the Deep Web.
When you’re viewing a shared document on Google Drive, you’re on the Deep Web.
If you’re in a company that has an internal network, often called the intranet, and you go there, you’re on the Deep Web.
When you talk to your friends on a Discord server, you are on the Deep Web.
When you check your bank accounts online, you are on the Deep Web.
The Deep Web is your mailbox, your administration spaces, your company’s internal network, dynamic web pages and a lot of other things.
And the Deep Web is likely to become a much larger part of the web in the years to come, as the Cloud becomes more and more important.
All the articles and reports that say that you only surf the visible web every day are therefore wrong. Of course, the visible web is surely the one you use the most. But I imagine, for example, that you check your email every day, so you go to the Deep Web every day.
The Deep Web is neither good nor bad, whatever some might think. It is just a technical reality. There is no dark side of the Net, only areas ignored by some engines.
The problem, as you will have understood, is that a lot of articles and reports confuse the Deep Web and the Dark Web. They talk about the Dark Web by calling it the Deep Web, but it’s not the same thing.
As a result, the Deep Web is wrongly demonized by the media and the general public gets a completely biased image of it.
The difference between the Deep Web and the Dark Web
When I listed the main reasons why some web pages are not indexed, I mentioned those with a non-standard domain name. In other words, URLs that do not end in .COM, .FR, .CO, .GOV, and so on, depending on the country.
Sites that are not referenced by classic search engines because their domain name is not registered with ICANN. The majority of them were created deliberately to avoid any referencing. And their URLs can only be resolved, so to speak, via specific DNS servers.
The best-known example is the .onion suffix, which can only be resolved via the Tor dark net, giving access to much of the least accessible side of the Deep Web: the Dark Web.
Thus, the so-called Dark Web is a sub-part of the Deep Web: the set of pages that can only be reached with a direct .onion link on the Tor dark net.
Again, there’s nothing good or bad about this. It’s just a technical specificity. And why do I also want to differentiate the Dark Net from the Dark Web? Because the Dark Web is about content and the Dark Net is about infrastructure.
That is, the technical means by which this content is created and made available. And there is not just one Dark Net, but several.
So let me summarize. The Internet is a network of computer networks, made up of millions of networks, both public and private, which circulate all kinds of data.
The World Wide Web, or the Web if you prefer, is one application among many that use the Internet as a physical medium and a means of transport to find this data.
The Web has two distinct parts: the visible web and the invisible web, more commonly known as the Deep Web.
The Deep Web exists for a number of reasons that we have seen. And one of them concerns special domain names.
The networks grouping together these sites with these special domain names are called the Dark Nets. And the content that we find on these Dark Nets is called the Dark Web.