Wednesday, 28 January 2015

NOFOLLOW DOES NOT MEAN DO NOT CRAWL!


By Strictly-Software

I have heard it said by "SEO Experts" and other people that to prevent excess crawling of a site you can add rel="nofollow" to your links and this will stop GoogleBOT from crawling those links.

Whilst on the surface this does seem to make logical sense, as the attribute value does say "nofollow" not "follow if you want", it isn't true. BOTS will ignore the nofollow and still crawl the links if they want to.

The nofollow attribute value is not meant for blocking access to pages or preventing your content from being indexed or viewed by search engines. Instead, the nofollow attribute is used to stop SERP BOTS like GoogleBOT from passing any "link juice" from the main page to the pages it links to.

As you should know, Google still uses PageRank, even though it carries far less weight than in years gone by. In the old days it was their prime way of calculating where a page was displayed in their index and how one page related to another in terms of site authority.

The original algorithm for Page Rank and how it is calculated is below.

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))


An explanation of it can be found here: Page Rank Algorithm Explained.
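To make that concrete, here is a small worked example in Python (the PR values and link counts are made up purely for illustration) showing how the formula is applied with the usual damping factor d = 0.85:

# Hypothetical worked example of the PageRank formula above.
# d is the damping factor, each tuple is (PR of a linking page, number of links on that page).
d = 0.85

inbound_links = [
    (4.0, 10),  # T1: PR 4 spread across 10 outbound links, so it passes 4/10 = 0.4
    (6.0, 20),  # T2: PR 6 spread across 20 outbound links, so it passes 6/20 = 0.3
]

pr_a = (1 - d) + d * sum(pr / links for pr, links in inbound_links)

print(round(pr_a, 3))  # 0.745

Note the C(Tn) divisor: the more links a page has on it, the less PR each individual link passes on, which is the "leaking" of link juice described below.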

The perfect but totally unrealistic scenario is to have another site with a very high Page Rank value, e.g. 10 (the range goes from 0 to 10), and to have that site's high PR page (e.g. their homepage) contain a single link that goes to your site - without a nofollow value in the rel attribute of the link.

This tells the SERP BOT, e.g. GoogleBOT, that this high ranking site THINKS your site is more important than itself in the great scheme of the World Wide Web.

Think of a pyramid with your site/page ideally at the top with lots of high PR pages and sites all pointing to it, passing their link juice upwards to your site. If your page then doesn't have any links on it at all then no link juice you have obtained from inbound links will be "leaked out".

The more links there are on a page the less PR value is given to each link and the less "worthy" your site becomes in theory.

So it should be noted that the nofollow attribute value isn't meant for blocking access to content or preventing content from being indexed by GoogleBOT and other search engines.



Instead, the nofollow attribute is used by sites to stop SERP BOTS like GoogleBOT from passing "authority" and PR value to the pages they link to.

Therefore GoogleBOT and others could still crawl any link with rel="nofollow" on it.

It just means no Page Rank value is passed to the page being linked to.

Sunday, 25 January 2015

Returning BAD BOTS to where they came from


By Strictly-Software

Recently in some articles I mentioned some .htaccess rules for redirecting "BAD BOTS", e.g. crawlers you don't like, or agents such as IE 6 that no-one should be using anymore.

Now the rule I was using was suggested by a commenter on a previous article, and it was to use the REMOTE_ADDR server variable to do this.

For example in a previous article (which I have now changed) about banning IE 5, 5.5 and IE 6, I originally suggested using this rule to ban all of those user-agents.

# Match any IE 5.0, 5.5 or 6.0 user-agent (case insensitive)
RewriteCond %{HTTP_USER_AGENT} (MSIE\s6\.0|MSIE\s5\.0|MSIE\s5\.5) [NC]
# Send the request back to the visitor's own IP address
RewriteRule .* http://%{REMOTE_ADDR} [L,R=301]

Now this rewrite rule uses the server variable %{REMOTE_ADDR}, which holds the originating IP address from the HTTP request, to send anyone with IE 6 or below back to it.

It is the IP address you would normally see in your server's access logs when someone visits.

Problems with this rule

Now when I changed the rules on one of my own sites to this and then started testing it at work, using a user-agent switcher add-on for Chrome, I ran into the problem that every time I went to my own site I was sent back to my company's gateway router page.

I had turned the switcher off, but for some reason either a bug in the plugin, a cookie or a session variable must have caused my own site to believe I was still on IE 6 and not the latest Chrome version. So every time I went to my site with this rule I was kicked back to my company's gateway router page.

Therefore after a clean up, a think, and a talk with my server techie guy, he told me I should be using localhost instead of the REMOTE_ADDR IP address. The reason was that a lot of this traffic - hackers, HACKBOTS, Spammers and so on - would end up hitting the gateway page of their ISP, looking like potential hacking.

These ISPs might get a bit pissed off with your website sending swathes of traffic to their gateway router pages, traffic that could potentially harm them.

Therefore, to prevent getting letters in the post saying that you are sending swathes of hackers to home or phone ISP gateways (a lot of phones and tablets use proxies for their browsers anyway), the better option is to send them back to their own localhost, 127.0.0.1.

Also, instead of using a 301 permanent redirect you should use a 302 temporary redirect, as that is the more appropriate status code to use.

Use this rule instead

Therefore the rule I now recommend for anyone wanting to ban all IE 5, 5.5 and 6 traffic is below.

# Match any IE 5.0, 5.5 or 6.0 user-agent (case insensitive)
RewriteCond %{HTTP_USER_AGENT} (MSIE\s6\.0|MSIE\s5\.0|MSIE\s5\.5) [NC]
# Send the request back to the visitor's own machine with a temporary redirect
RewriteRule .* http://127.0.0.1 [L,R=302]

This rewrite rule bans IE 5, 5.5 and IE 6.0 and sends the crawler back to the localhost on the user's machine with a 302 redirect. You can obviously add other rules in for BOTS and SQL/XSS injection hacks as well.

This is the more valid rule, as a 301 is meant for permanent moves such as when a page has changed its name. Here the user is only being redirected because of an invalid value in the HTTP request (the user-agent), so a temporary 302 redirect is the more appropriate response.

If the user changed its user-agent or parameters then it would get to the site and not be redirected with a 301 or a 302 status code, but would instead get a 200 OK status code.

So remember: whilst an idea might seem good at first, until you fully test it and ensure it doesn't cause problems it might not be all that it seems.

Wednesday, 14 January 2015

ETSY SHOP OPEN FOR BUSINESS!


By Strictly-Software

My Etsy shop is OPEN again - if you run a WordPress site and want some tools to automate your system then check out: https://www.etsy.com/uk/shop/StrictlySoftware

I didn't know the items in my shop EXPIRED after so long, so the shop was empty to viewers for the last month and a bit, but now you can buy your tools from Etsy or from the plugins page on my own site: http://www.strictly-software.com/plugins (please click on some adverts and help me raise some cash).

Also my facebook page: https://www.facebook.com/strictlysoftware has information about these tools that you should read if you have purchased any of them.

It has help articles, guides on support, possible issues and fixes and much more - feel free to comment and like the page!

The basic idea behind these plugins is:

Run a site all year, 24/7 without having to do anything apart from some regular maintenance like cleaning tags that are not used very much and OPTIMIZING your database table.

So an RSS / XML feed contains your content (e.g. news about something) and this goes into WordPress at scheduled times (Cron or WebCron jobs) using WordPress plugins like RSSFeeder or WP-O-Matic. Then, as the articles are saved, Strictly AutoTags adds the most relevant tags to each one by using simple pattern matching, such as finding the most frequently used "important" words in the article, e.g. words in the Title, Headers, Strong tags or just Capitalised Words such as names like John Smith.

This means if John Smith became famous overnight you wouldn't have to add a manual tag for him or wait for a 3rd party plugin to add the word to their own database before it could be used.
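As a rough illustration of the general idea (this is just a minimal sketch, not the plugin's actual code), picking out capitalised words and names from an article and ranking them by frequency could look something like this in Python:

import re
from collections import Counter

def suggest_tags(article, max_tags=5):
    # Find single capitalised words or runs of them, e.g. "London" or "John Smith",
    # which are often the names and places worth tagging.
    candidates = re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", article)
    # Rank the candidates by how often they appear in the article.
    return [word for word, count in Counter(candidates).most_common(max_tags)]

print(suggest_tags("John Smith spoke in London today. John Smith said the talks went well."))
# ['John Smith', 'London']

As described above, the plugin also gives extra weight to words found in the Title, Headers and Strong tags, but the principle of frequency-based pattern matching is the same.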

Then once your article is tagged, you can choose to have the most popular tags converted into links to tag pages (pages containing other articles with the same tag), or just bold them for SEO, or do nothing.

You can set certain tags to be "TOP TAGS", which ranks them higher than all other tags. These should be tags related to your site, a bit like the old META Keywords.

You can also clean up old HTML, convert text tags to real clickable ones and set up a system where if a tag such as ISIS is found the tag Middle East is used instead. This is all explained on the Strictly AutoTags page on my site.

Then if you also purchase Strictly Tweet BOT PRO you can use those new post tags as #hashtags in your tweets, and you can set your system up to either tweet to multiple Twitter accounts with different formats and tags, or tweet to the same account with different wording depending on the wording in your article.



E.g. if your article was about the Middle East wars you could say only post the Tweet if the article contains the words "Middle East" OR "Syria", or you could say only post if it contains the words "ISIS" AND "War".
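Purely to illustrate that AND / OR matching logic (a hypothetical sketch, not the plugin's real code or settings), the check behind such a rule could be as simple as:

def should_tweet(article_text, all_of=(), any_of=()):
    # Returns True if the article contains every word in all_of and,
    # when any_of is supplied, at least one of the words in any_of.
    text = article_text.lower()
    has_all = all(word.lower() in text for word in all_of)
    has_any = not any_of or any(word.lower() in text for word in any_of)
    return has_all and has_any

# Only tweet if the article mentions "Middle East" OR "Syria"
print(should_tweet("Fighting continues in Syria today", any_of=["Middle East", "Syria"]))  # True

# Only tweet if the article mentions "ISIS" AND "War"
print(should_tweet("ISIS war coverage continues", all_of=["ISIS", "War"]))  # True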

The TweetBOT then lets you ensure the post is cached (if you are using a WordPress Caching System) by making it live first and making an HTTP request to it so it gets cached. Then it waits a custom defined number of seconds before any Tweets are sent out.

You can then specify a number of seconds between each Tweet that is sent out to prevent Twitter Rushes, e.g. where 50 BOTS all hit your site at the same time.
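The idea of spacing Tweets out is simple enough to sketch. In rough terms (send_tweet here is a made-up placeholder, not part of the plugin or the Twitter API) it amounts to:

import time

TWEET_GAP_SECONDS = 60  # pause between Tweets to avoid a Twitter Rush on your site

def send_tweets_spaced(tweets, send_tweet):
    # send_tweet is whatever function actually posts a single Tweet.
    for i, tweet in enumerate(tweets):
        send_tweet(tweet)
        if i < len(tweets) - 1:
            time.sleep(TWEET_GAP_SECONDS)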

You can ensure no Tweets are sent out if they contain certain words, and you can add tracking codes, e.g. Google Analytics parameters, to the link before it is shortened by Bit.ly.

A simple PIN number process lets you connect your Twitter Account to your TweetBOT Account.

A dashboard keeps you informed of recent Tweets sent out, any errors from Twitter like "duplicate tweet", or if your Bit.ly account isn't working.

Plus a test button lets you test the system without sending a Tweet by taking the last post, running your settings through it such as shortening the link and post and checking all Twitter accounts are working and connected properly.

If you then link your Twitter account up to your Facebook page like I have with my Horse Racing site http://www.ukhorseracingtipster.com/ and my Twitter account @ukhorseracetips with my Facebook page facebook.com/Ukhorseracingtipster you get social media and SEO impact for free!





Check out the new live shop on Etsy for plugins, and for coupons if you need me to set the plugin up for your site: https://www.etsy.com/uk/shop/StrictlySoftware

You may need help due to your site's special settings or requirements, so a coupon will let me help you set it up correctly for you.

Saturday, 10 January 2015

2 Quick Ways To Reduce Traffic On Your System


By Strictly-Software

Slowing The 3 Major SERP BOTs Down To Reduce Traffic

If you run a site with a lot of pages, good rankings, or a site that tweets out a lot, e.g. whenever a post comes online, then you will probably get most of your traffic from the big 3 crawlers listed further down.

I know that whenever I check my access_log on my server to find out the top visiting IP addresses with a command like

grep "Jan/2015" access_log | sed 's/ - -.*//' | sort | uniq -c | sort -nr | less

I always find the top IPs are the main 3 Search Engines' own BOTS (SERP = Search Engine Results Page), so I call their BOTS SERP BOTS.

GoogleBot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Bing: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Yahoo: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Unless you ask these BOTS to come NOW, by doing things like refreshing your sitemap, pinging the SERPS or tweeting out links, they will crawl your site at a time of their own choosing, and with nothing to tell them to slow down they will crawl at their own speed. This could be once every second, and if so it could cause your site performance issues.

However think about it logically: if you post a news article or a job advert then in reality it only needs to be crawled once by each SERP BOT for it to be indexed.

You don't really want it to be crawled every day and on every visit by these BOTS as the content HASN'T changed so there is really no need for the visit.

Now I don't know of a way to tell a BOT to only crawl a page if it is new content or it has changed in some way. Even if you had a sitemap system that only listed pages that were new or edited, the BOTS would still just visit your site and crawl it.

If you cannot add rel="nofollow" to internal links that point to duplicate content (which doesn't actually 100% mean the BOT won't crawl them anyway), then there are some things you can try if you find that your site is having performance problems or is under pressure from heavy loads.


Crawl-Delay

Now this used to be supported only by BingBOT and then by some smaller newer search engines like Blekko.

However in recent months, after some testing, I noticed that most major SERP BOTS apart from GoogleBOT now obey the command. To get Google to reduce its crawl rate you can use Webmaster Tools and set the crawl rate from the control panel.

For instance on one of my big news sites I have a Crawl-Delay: 25 setting and when I check my access log for those user-agents there is a 25 second (roughly) delay between each request.

Therefore increasing this value will reduce the traffic load from the major visitors to your site, and it is easily done by adding it to your robots.txt file e.g.

Crawl-delay: 25
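Well behaved crawlers read this value straight out of robots.txt. If you want to sanity check what your own file is advertising, Python's standard library (3.6 or later) can parse it for you - swap the example URL below for your own domain:

from urllib.robotparser import RobotFileParser

# Replace with your own site's robots.txt URL
rp = RobotFileParser("http://www.example.com/robots.txt")
rp.read()

# crawl_delay() returns the Crawl-delay that applies to the given
# user-agent, or None if no delay is set for it.
print(rp.crawl_delay("bingbot"))
print(rp.crawl_delay("*"))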

Banning IE 6

Now there is no logical reason in the world for any REAL person to be using this user-agent.

This browser was probably the worst browser in history due to the quirks within it that made web developers' jobs so hard. Even just between IE 5.5 and IE 7 there are so many differences with IE 6, and it is the reason IE 8 and 9 had all the settings for compatibility modes and browser modes.

It is also the reason Microsoft is going to scrap support for IE 7-9, because of all this hocus pocus they introduced just to handle the massive differences between IE 6 and their newer standards compliant browsers.

Anyone with a Windows computer nowadays should be on at least IE 10. Only if you're still on XP and haven't done any Windows Updates for about 5 years would you be a real IE 6 user.

Yesterday at work I ran a report on the most used Browsers that day. 

IE 6.0 came 4th!

It was below the 3 SERP BOTS I mentioned earlier and above the latest Chrome version.

On more detailed inspection with my custom logger/defence system, which analyses the behaviour of visitors rather than just assuming that because your agent is IE 6 you are actually human, I could see these visitors were all BOTS.

I check for things like whether they can run JavaScript, by using JavaScript to log that they can, in the same way as I do for Flash. These users had no JavaScript or Flash support, and the rate they went through pages was way too fast for a human visitor.
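The rate check part of that is easy to sketch. This is just a toy illustration of the idea (a rolling window of request times per IP, flagging anything faster than a human would realistically click), not my actual defence system:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # how far back we look
MAX_REQUESTS = 15     # more hits than this inside the window looks automated

recent_hits = defaultdict(deque)  # IP address -> timestamps of recent requests

def looks_like_a_bot(ip, now=None):
    now = now or time.time()
    hits = recent_hits[ip]
    hits.append(now)
    # Drop timestamps that have fallen outside the rolling window
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_REQUESTS

Combined with the JavaScript and Flash logging check (a real browser will have run the logging script, a dumb crawler won't), this sort of test catches most of these fake IE 6 "users" very quickly.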

The only reason I can think people are using this user-agent is that they are script kiddies who have downloaded an old crawling script whose default user-agent is IE 6 and have never changed it.

Either they don't have the skill or they are just lazy. However by banning all IE 6 visitors with a simple .htaccess rule like this you can reduce your traffic hugely.

# Match any IE 5.0, 5.5 or 6.0 user-agent (case insensitive)
RewriteCond %{HTTP_USER_AGENT} (MSIE\s6\.0|MSIE\s5\.0|MSIE\s5\.5) [NC]
# Send the request back to the visitor's own machine with a temporary redirect
RewriteRule .* http://127.0.0.1 [L,R=302]


This rewrite rule bans IE 5, 5.5 and IE 6.0 and sends the crawler back to the localhost on the user's machine with a 302 redirect.

No normal person would be using these agents. There may be some Intranets from the 90's still using VBScript as a client side scripting language, but no modern site is designed with IE 6 in the designer's mind. Most sites you find will not handle IE 6 very well; like Netscape Navigator it is an old browser, so don't worry about supporting it. By banning just IE 6 and below you will find your traffic going down a lot.

So two simple ideas to reduce your traffic load. Try them and see how much your site improves.