It’s been only a few days since I opened this renewed website. Like everyone else, I’m anxious to know who is visiting it. The featured image of this blog post is the logo of a paid service called WhoisVISITING. Almost all web hosting providers give you, through their web-based cPanel (control panel), a very graphical view of the visitors to your website. Does it really reflect the information you would like to know? Unfortunately, the answer is “No”. Once your website becomes a target of outside hackers trying to break in, they can disguise themselves with automated attack programs that change, on every visit, the country they appear to come from (including their hosting provider) and the web browser they claim to be using. It’s not a pleasant situation, but you have to live with it.
The only true reflection of your visitors’ profile is the server access log, which your web hosting provider makes available. Usually you will see two types of activity in the server log: 1. GET requests, which mean a visitor is trying to retrieve a page of your website; 2. POST requests, which mean a visitor is trying to send some data or command to your website.
A GET request is usually harmless because it only reads information from your website, whereas POST requests are the ones you have to pay attention to. A POST request means someone is trying to access your website by sending information from their end (in the case of WordPress, for example, trying to log in through the WordPress login screen).
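If you would like to look at your own access log this way, here is a minimal Python sketch. It assumes the log is in the common "combined" format that many hosting providers use; your provider's format may differ. The sample lines, IP addresses and the function names `parse_line` and `summarize` are made up for illustration. It counts requests per method and lists which addresses are sending data to the WordPress login page.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# ip ident user [time] "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def summarize(lines):
    """Count requests per HTTP method and POSTs to the WordPress login page per IP."""
    methods = Counter()
    login_attempts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines in an unexpected format
        methods[entry["method"]] += 1
        if entry["method"] == "POST" and entry["path"].startswith("/wp-login.php"):
            login_attempts[entry["ip"]] += 1
    return methods, login_attempts


# Made-up sample lines using documentation-only IP addresses.
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2016:13:55:36 +0000] "GET /index.php HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Oct/2016:13:56:01 +0000] "POST /wp-login.php HTTP/1.1" 200 1532 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Oct/2016:13:56:02 +0000] "POST /wp-login.php HTTP/1.1" 200 1532 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    methods, login_attempts = summarize(SAMPLE)
    print(dict(methods))
    print(dict(login_attempts))
```

An address that shows up many times in the second count is worth a closer look, although a few entries may simply be you logging in.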
You will also notice GET requests from *.bot or *.spider visitors. Such a bot or spider is classified as a “web crawler”, a program that browses the World Wide Web (WWW), typically for the purpose of web indexing. In the end, it determines how your website is listed in its search engine.
In the case of my new website, I have already noticed some bots visiting my site for indexing purposes. They include yandex.com/bots, baidu.com/search/spider.html (the Chinese Baidu search engine), Googlebot and so on.
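To get a rough count of which crawlers have visited, you can look at the user-agent text of each request. The following Python sketch is only an illustration: the token list in `KNOWN_CRAWLERS` is my own short assumption, not a complete catalogue, and a user-agent can be faked, so the result is a hint rather than proof.

```python
from collections import Counter

# Substrings these crawlers include in their User-Agent header.
# Illustrative assumption only; real logs will contain many more.
KNOWN_CRAWLERS = {
    "googlebot": "Googlebot",
    "yandexbot": "Yandex",
    "baiduspider": "Baidu",
}


def crawler_name(user_agent):
    """Return a crawler label for a User-Agent string, or None for ordinary browsers."""
    ua = user_agent.lower()
    for token, name in KNOWN_CRAWLERS.items():
        if token in ua:
            return name
    # Fall back to generic words that self-declared crawlers tend to use.
    if "bot" in ua or "spider" in ua or "crawl" in ua:
        return "other crawler"
    return None


def tally_crawlers(user_agents):
    """Count how many requests each crawler made."""
    counts = Counter()
    for ua in user_agents:
        name = crawler_name(ua)
        if name is not None:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    agents = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
        "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        "Mozilla/5.0 (Windows NT 10.0) Firefox/50.0",
    ]
    print(dict(tally_crawlers(agents)))
```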
You usually welcome visits from Googlebot, which ultimately determine where your site is listed in Google search results.
It’s a bit long, but I would like to show you how Googlebot works, in Google’s own words:
For webmasters: Googlebot and your site
How Googlebot accesses your site
For most sites, Googlebot shouldn’t access your site more than once every few seconds on average. However, due to network delays, it’s possible that the rate will appear to be slightly higher over short periods.
Googlebot was designed to be distributed on several machines to improve performance and scale as the web grows. Also, to cut down on bandwidth usage, we run many crawlers on machines located near the sites they’re indexing in the network. Therefore, your logs may show visits from several machines at google.com, all with the user-agent Googlebot. Our goal is to crawl as many pages from your site as we can on each visit without overwhelming your server’s bandwidth. Request a change in the crawl rate.
Blocking Googlebot from content on your site
It’s almost impossible to keep a web server secret by not publishing links to it. As soon as someone follows a link from your “secret” server to another web server, your “secret” URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. Similarly, the web has many outdated and broken links. Whenever someone publishes an incorrect link to your site or fails to update links to reflect changes in your server, Googlebot will try to download an incorrect link from your site.
If you want to prevent Googlebot from crawling content on your site, you have a number of options, including using robots.txt to block access to files and directories on your server.
Once you’ve created your robots.txt file, there may be a small delay before Googlebot discovers your changes. If Googlebot is still crawling content you’ve blocked in robots.txt, check that the robots.txt is in the correct location. It must be in the top directory of the server (for example, www.example.com/robots.txt); placing the file in a subdirectory won’t have any effect.
If you just want to prevent the “file not found” error messages in your web server log, you can create an empty file named robots.txt. If you want to prevent Googlebot from following any links on a page of your site, you can use the nofollow meta tag. To prevent Googlebot from following an individual link, add the rel="nofollow" attribute to the link itself.
Here are some additional tips:
Test that your robots.txt is working as expected. The Test robots.txt tool on the Blocked URLs page lets you see exactly how Googlebot will interpret the contents of your robots.txt file. The Google user-agent is (appropriately enough) Googlebot.
The Fetch as Google tool in Search Console helps you understand exactly how your site appears to Googlebot. This can be very useful when troubleshooting problems with your site’s content or discoverability in search results.
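Besides Google's own tools, you can get a rough local check of a robots.txt file with Python's standard `urllib.robotparser` module. This is only a sketch: the rules and the www.example.com URLs are made-up examples, and Python's parser may not interpret every rule exactly as Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (an assumption for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in (
        "http://www.example.com/index.html",
        "http://www.example.com/private/notes.html",
    ):
        allowed = parser.can_fetch("Googlebot", url)
        print(url, "allowed" if allowed else "blocked")
```

To check a live site instead of a text string, `RobotFileParser` also offers `set_url()` and `read()`, which download the file from the top directory of the server.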
Making sure your site is crawlable
Googlebot discovers sites by following links from page to page. The Crawl errors page in Search Console lists any problems Googlebot found when crawling your site. We recommend reviewing these crawl errors regularly to identify any problems with your site.
If your robots.txt file is working as expected, but your site isn’t getting traffic, here are some possible reasons why your content is not performing well in search.
Problems with spammers and other user-agents
The IP addresses used by Googlebot change from time to time. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot). You can verify that a bot accessing your server really is Googlebot by using a reverse DNS lookup.
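The reverse DNS check mentioned above can be sketched in Python as follows. The sketch makes two assumptions that are not stated in the text quoted here: that a genuine Googlebot host name ends in googlebot.com or google.com, and that the host name should also resolve back to the same IP address (a forward lookup). The function name `is_verified_googlebot` and its lookup parameters are hypothetical; they exist so the check can be tried without a network connection.

```python
import socket

# Assumed host name endings for genuine Googlebot machines.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` reverse-resolves to a Google host that resolves back to `ip`.

    The two lookup functions can be replaced for testing; by default they
    use the system resolver (IPv4 only for the forward lookup).
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        hostname = reverse_lookup(ip).lower()
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # Forward confirmation: the name must point back to the same address.
        return ip in forward_lookup(hostname)
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False


if __name__ == "__main__":
    # Made-up example using a documentation-only IP address and fake lookups.
    fake_reverse = lambda addr: "crawl-192-0-2-10.googlebot.com"
    fake_forward = lambda host: ["192.0.2.10"]
    print(is_verified_googlebot("192.0.2.10", fake_reverse, fake_forward))
```

Calling `is_verified_googlebot(ip)` with only an address from your own log uses real DNS lookups, which can be slow, so it is better suited to checking a handful of suspicious addresses than every line of the log.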
Googlebot and all respectable search engine bots will respect the directives in robots.txt, but some nogoodniks and spammers do not. Report spam to Google.
Google has several other user-agents, including Feedfetcher (user-agent Feedfetcher-Google). Since Feedfetcher requests come from explicit action by human users who have added the feeds to their Google home page and not from automated crawlers, Feedfetcher does not follow robots.txt guidelines. You can prevent Feedfetcher from crawling your site by configuring your server to serve a 404, 410, or other error status message to user-agent Feedfetcher-Google. More information about Feedfetcher.
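As an illustration of serving an error status to one particular user-agent, here is a small Python sketch written as WSGI middleware. It is an assumption that your site runs as a Python WSGI application; on most shared hosting (and on WordPress) the same effect would be configured in the web server instead. The names `block_user_agent` and `hello_app` are made up for the example.

```python
def block_user_agent(app, blocked_token="Feedfetcher-Google"):
    """Wrap a WSGI app so requests whose User-Agent contains `blocked_token` get a 404."""
    token = blocked_token.lower()

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if token in user_agent.lower():
            body = b"Not Found"
            start_response(
                "404 Not Found",
                [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
            )
            return [body]
        # Everyone else is passed through to the real application.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """A stand-in application used only for this example."""
    body = b"Hello, visitor"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]


application = block_user_agent(hello_app)
```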
You may be amazed to read the above statement from Google: “It’s almost impossible to keep a web server secret by not publishing links to it. As soon as someone follows a link from your “secret” server to another web server, your “secret” URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log.” But it’s true. As such, you have to be very careful about what content you store on your hosting server.
There are many ways to protect your website, welcoming the visitors you would like to have while rejecting connections from unwanted ones.
It’s a bit technical, but if you are interested, the following link is a good reference for controlling web traffic to your website.
If you start noticing unusual behaviour on your own beloved website (such as slowness), please consult your friendly computer specialist to identify the issue.