Monday, 30 December 2013
Static HTML file in WordPress the cause of major issues!
Right, I've solved this major pain in the ass!
It involved turning caching OFF, clearing out all my .htaccess rules and much more, as you will see if you read the full post on WordPress.
Why not just take the user's permalink settings and generate .htaccess rules for each major branch?
Problem with WAMP Server on localhost with .htaccess file
If you have read my article Troubleshooting WAMP server on Windows 7 installations you will know that I run both IIS and WAMP side by side on the same Windows computer. I let IIS run on the normal 127.0.0.1 (localhost IP loopback address) and I change the Apache config to run on 127.0.0.1:8888 (the same IP but a different port number).
This enables me to run and test PHP files on the same PC without having to toggle IIS on/off before each run of a PHP file.
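In case it helps anyone setting up the same split, the port change is only a couple of lines of Apache config; a rough sketch (your WAMP paths and the exact file may differ):

# httpd.conf - bind Apache to port 8888 so it doesn't clash with IIS on port 80
Listen 127.0.0.1:8888
ServerName localhost:8888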
I also use my WAMP folder, e.g C:/wamp/www/test.php, as a place to quickly test PHP files, or to download files from my WordPress or PHP sites so I can debug and test various parts quickly on my local PC with full debugging without affecting the live site.
However, one of the things you may have run into is WAMP running but throwing an error page when you go to localhost:8888, like this:
Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, admin@localhost, and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log.
So you have no idea what the problem is. IIS is off, you haven't changed any config and nothing seems wrong.
Debugging
First thing to do is check your local Apache error log file, e.g at C:/wamp/logs/apache_error.log, to see if anything stands out.
It might show nothing, or it might show something related to the last time you ran the page, like this:
[Mon Dec 30 15:41:24 2013] [alert] [client 127.0.0.1] C:/wamp/www/.htaccess: Invalid command 'ExpiresActive', perhaps misspelled or defined by a module not included in the server configuration
Solution
Now the problem was basic stupidity. I had been testing and modifying a site's .htaccess file by downloading it over FTP to the wamp/www folder so I could analyse it and play about with it.
However I had forgotten to rename the file from .htaccess to something like $.htaccess, or delete it altogether. As the www folder is WAMP's home directory, Apache automatically loads any .htaccess file and rules it finds there.
If the file is full of irrelevant, invalid or old commands (like the ExpiresActive directive above, which needs mod_expires enabled) then Apache on your Windows PC cannot parse them all, and you get the Internal Server Error message.
Just delete your .htaccess file or rename it to something like $.htaccess or BACKUP_htaccess. Windows Explorer won't let you rename a file to a name starting with a period, e.g .htaccess_backup; it gives the "You must type a file name" error because it treats everything before the extension as the file name and believes no name exists.
After renaming or removing the file the problem should be resolved.
It's happened a few times to me now so I thought I would write it down in case I forget next time, DOH!
To read more about setting up WAMP Server alongside IIS on a Windows machine read my article on it at: Troubleshooting WAMP Server on Windows 7 Machines.
Wednesday, 20 November 2013
Twitter Changing Their API Again and Parsing Twitter Tweet Responses
If you haven't noticed by now, you will the next time you try to use Twitter on the twitter.com website. Not only are they re-working their API, they are also cracking down on the sending of links in DM messages.
I personally have had an account suspended within the last week, then re-activated 3 days later, and I still have had no explanation from their support team as to why this happened.
Also, this happened directly after I was in contact with them about the problems they were having falsely identifying double encoded links as phishing or spam and directing users of my account to a "Twitter thinks this page might be dangerous" page!
Even though their development team claimed the links I was using were okay and safe, if you clicked them you would be redirected to this page by mistake.
I think the problem was due to the fact that to get a link into 140 chars you have to first encode it with a link shortener like bit.ly, and then Twitter always encodes ALL links with their own t.co shortener so they can track the clicks.
I reckon they were having issues identifying the final destination URL and were therefore flagging double encoded links as spam/phishing/hacks etc, despite their claims they weren't.
However, now it seems Twitter has changed their DM API. Not only have I found that it constantly refreshes as I am typing, causing me to lose my message, but if I receive a message as I type it also refreshes, causing the same lost wording.
It seems more like a badly written interactive Messenger service now, with an AJAX timer reloading new content, than the good old DM service it used to be. They may still be working on it, so hopefully these bugs will get fixed soon.
A few people have complained to me that they cannot now send links in DMs (Direct Messages) and I have found the same problem. It seems Twitter are cracking down on companies that log who follows and unfollows you, as well as treating the majority of links in DMs as spam, whether they are or not.
Anyway bitching time is now over.
What I wanted to write about was that I had to release new versions of my WordPress plugins Strictly AutoTags and Strictly TweetBot the other day.
I was finding on sites with a lot of posts that the Tweets from the TweetBot (which were hooked into the on-publish action/event) were going out before tagging had completed, so only the default #HashTags were being used.
Therefore if you wanted to use categories or post tags as your #HashTags in the Tweet they were not available at the point of Tweeting.
So I changed the Strictly AutoTags plugin to fire an event once tagging has been completed in the SaveAutoTags method, which lets Strictly TweetBot run then instead of on publish. I also pass in the Post ID so the event listener knows which post to send Tweets for, e.g:
do_action('finished_doing_tagging', $object->ID);
The code in the TweetBot automatically checks for the existence of the AutoTag plugin by looking for a function that I use within the plugin that will only exist if the plugin is active.
If this is found the TweetBot hooks the listener to fire off Tweets into this event instead of the standard on publish event.
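As a rough illustration of that wiring (the action name is the one above, but the function names here are just placeholders, not the plugin's actual code), the TweetBot side looks something like this:

// only hook into the AutoTag event if the AutoTag plugin is active - I test for one of
// its functions; strictly_autotag_installed() is a stand-in name for that check
if (function_exists('strictly_autotag_installed')) {
	// AutoTags fires this action when it has finished tagging, passing the post ID
	add_action('finished_doing_tagging', 'strictly_tweetbot_send_tweets', 10, 1);
} else {
	// fall back to the standard publish hook if AutoTags isn't installed
	add_action('publish_post', 'strictly_tweetbot_send_tweets', 10, 1);
}

function strictly_tweetbot_send_tweets($post_id) {
	// look up the post's tags/categories (now populated) and send the Tweets
}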
As I was explaining to a WordPress developer the other day, I prefer the use of the terms events and listeners due to my event driven coding in JS or C# etc whereas hooks and actions are more WordPress driven.
However, as I was re-writing this code I also noticed, in the admin panel of Strictly TweetBot (which shows a history of the last Tweets sent and any error messages returned from Twitter), that I was not getting back my usual "Duplicate Tweet" error when I tried sending the same tweet multiple times.
By the looks of things the JSON response from Twitter when using OAuth to send a Tweet has changed slightly (when, I don't know) and now I have to parse the response object and the array of errors it contains, rather than just checking for a string as I used to be able to do.
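For a duplicate Tweet the decoded response now looks something along these lines (an illustrative example of the error structure, not a captured response):

{"errors":[{"code":187,"message":"Status is a duplicate"}]}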
If anyone is interested, or works with OAuth to send Twitter messages, the code to parse the JSON response and collect any error messages is below.
Notice the ShowTweetBotDebug function, which outputs arrays, objects and strings to the page if needed.
/*
 * My Show Debug Function
 *
 * @param $msg string, object or array
 */
function ShowTweetBotDebug($msg)
{
	if(!empty($msg))
	{
		if(is_array($msg)){
			print_r($msg);
			echo "<br />";
		}else if(is_object($msg)){
			var_dump($msg);
			echo "<br />";
		}else if(is_string($msg)){
			echo htmlspecialchars($msg) . "<br />";
		}
	}
}
$twitterpost = "Some tweet about something";

ShowTweetBotDebug("post this exact tweet = $twitterpost");

// Post our tweet - you should have set up your oauth object beforehand using the standard classes
$res = $oauth->post(
	'statuses/update',
	array(
		'status' => $twitterpost,
		'source' => 'Strictly Tweetbot'
	)
);

ShowTweetBotDebug("response from Twitter is below");

// output any errors using var_dump to show the whole object
ShowTweetBotDebug($res);

// parse error messages from Twitter
if($res){
	// do we have any errors?
	if(isset($res->errors)){
		ShowTweetBotDebug("we have errors");
		// if we do then they will be inside an array
		if(is_array($res->errors)){
			ShowTweetBotDebug("we have an array of errors to parse");
			foreach($res->errors as $err){
				ShowTweetBotDebug("Error = " . $err->message);
			}
		}
	}else{
		ShowTweetBotDebug("tweet sent ok");
	}
}else{
	ShowTweetBotDebug("Could not obtain a response from Twitter");
}
This code might come in handy to other people who need to parse Twitter responses. Just remember, Twitter is ALWAYS changing their API, which is a real pain in the ass for any developer building on it or trying to make a business model from it.
It seems a lot of companies have had to shut down due to the recent changes to Direct Messages, so who knows what they will do in the future?
Wednesday, 23 October 2013
4 simple rules robots won't follow
Job Rapists and Content Scrapers - how to spot and stop them!
I work with many sites, from small blogs to large sites that receive millions of page loads a day. I have to spend a lot of my time checking my traffic log and logger database to investigate hack attempts, heavy hitting bots and content scrapers that take content without asking (on my recruitment sites and jobboards I call this Job Raping, and the BOT that does it a Job Rapist).
I banned a large number of these "job rapists" the other week and then had to deal with a number of customers ringing up to ask why we had blocked them. The way I see it (and that really is the only way, as it's my responsibility to keep the system free of viruses and hacks), if you are a bot and want to crawl my site you have to follow the steps below.
These steps are not uncommon and many sites implement them to reduce bandwidth wasted on bad BOTS as well as to protect their sites from spammers and hackers.
4 Rules For BOTS to follow
1. Look at the Robots.txt file and follow the rules
If you don't even bother looking at this file (and I know because I log those that do) then you have broken the most basic rule that all BOTS should follow.
If you can't follow even the most basic rule then you will be given a ban or 403 ASAP.
To see how easy it is to make a BOT that can read and parse a Robots.txt file please read this article (some very basic code I knocked up in an hour or so):
How to write code to parse a Robots.txt file (including the sitemap directive).
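To give a flavour of it, here is a very cut down sketch of that kind of check. This is not the code from that article; it ignores Allow rules and longest-match precedence, and example.com is obviously just a placeholder.

// Return true if $path is allowed for $botName according to a site's robots.txt contents
function IsPathAllowed($robotsTxt, $botName, $path)
{
	$applies = false; // are we inside a User-agent block that matches our bot (or *)?
	$allowed = true;  // allowed unless a matching Disallow rule says otherwise

	foreach(preg_split('/\r\n|\r|\n/', $robotsTxt) as $line){
		$line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
		if($line === '') continue;

		if(preg_match('/^User-agent:\s*(.+)$/i', $line, $m)){
			$agent   = strtolower(trim($m[1]));
			$applies = ($agent == '*' || strpos(strtolower($botName), $agent) !== false);
		}elseif($applies && preg_match('/^Disallow:\s*(.*)$/i', $line, $m)){
			$rule = trim($m[1]);
			if($rule !== '' && strpos($path, $rule) === 0){
				$allowed = false;
			}
		}
	}
	return $allowed;
}

// e.g check whether our bot may crawl a page before requesting it
$robots = @file_get_contents("http://www.example.com/robots.txt");
if($robots !== false){
	$ok = IsPathAllowed($robots, "MyBot/1.0 (+http://www.example.com/bot.html)", "/members/");
}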
2. Identify yourself correctly
Whilst it may not be set in stone, there is a "standard" for BOTS to identify themselves correctly in their user-agents and all proper SERPS and Crawlers will supply a correct user-agent.
If you look at some common ones such as Google or BING or a Twitter BOT we can see a common theme.
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible;bingbot/2.0;+http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; TweetedTimes Bot/1.0; +http://tweetedtimes.com)
They all:
- Provide information on the browser compatibility e.g Mozilla/5.0.
- Provide their name e.g Googlebot, bingbot, TweetedTimes.
- Provide their version e.g 2.1, 2.0, 1.0.
- Provide a URL where we can find out information about the BOT and what it does e.g http://www.google.com/bot.html, http://www.bing.com/bingbot.htm and http://tweetedtimes.com
On the systems I control, and on many others that use common intrusion detection systems at firewall and system level (even WordPress plugins), having a blank user-agent or a short one that doesn't contain a link or email address is enough to get a 403 or ban.
At the very least a BOT should provide some way to let the site owner find out who owns the BOT and what the BOT does.
Having a user-agent of "C4BOT" or "Oodlebot" is just not good enough.
If you are a new crawler identify yourself so that I can search for your URL and see whether I should ban you or not. If you don't identify yourself I will ban you!
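As a rough sketch of the sort of check that gets a bad agent a 403 (a simplified illustration, not my actual intrusion detection rules):

// A rough sketch of a basic user-agent check
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? trim($_SERVER['HTTP_USER_AGENT']) : '';

// does it look like a crawler, and has it given us any way to find out who owns it?
$looksLikeBot = (stripos($ua, 'bot') !== false || stripos($ua, 'crawl') !== false || stripos($ua, 'spider') !== false);
$hasContact   = (stripos($ua, 'http') !== false || strpos($ua, '@') !== false);

// blank agents, very short agents, or "bots" with no URL/email get a 403
if($ua === '' || strlen($ua) < 10 || ($looksLikeBot && !$hasContact)){
	header('HTTP/1.1 403 Forbidden');
	exit('Forbidden');
}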
3. Set up a Reverse DNS Entry
I am now using the "standard" way of validating crawlers against the IP address they crawl from.
This involves doing a reverse DNS lookup with the IP used by the bot.
If you haven't got this set up then tough luck. If you have, then I will do a forward DNS lookup to make sure the IP is registered with that host name.
I think most big crawlers are starting to come on board with this way of doing things now. Plus it is a great way to correctly identify that GoogleBot really is GoogleBot, especially when the use of user-agent switcher tools is so common nowadays.
I also have a lookup table of IPs/user-agents for the big crawlers I allow. However, if GoogleBot or BING start using new IP addresses that I don't know about, the only way I can correctly identify them (especially after experiencing GoogleBOT hacking my site) is by doing this 2 step DNS verification routine.
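The two step check itself is only a few lines of PHP; a sketch of the idea (the trusted domain list here is illustrative, and real code should cache the lookups as they are slow):

// Verify a crawler IP with a reverse then forward DNS lookup
function VerifyCrawlerIP($ip, $validDomains = array(".googlebot.com", ".google.com", ".search.msn.com"))
{
	// Step 1: reverse DNS - what host name does this IP resolve to?
	$host = gethostbyaddr($ip);
	if($host === false || $host == $ip){
		return false; // no reverse DNS entry at all
	}

	// the host name must end with one of the domains the real crawlers use
	$matched = false;
	foreach($validDomains as $domain){
		if(substr($host, -strlen($domain)) == $domain){
			$matched = true;
			break;
		}
	}
	if(!$matched){
		return false;
	}

	// Step 2: forward DNS - does that host name resolve back to the same IP?
	return (gethostbyname($host) == $ip);
}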
4. Job Raping / Scraping is not allowed under any circumstances.
If you are crawling my system then you must have permission from each site owner as well as me to do this.
I have seen bots hit tiny weeny itsy bitsy jobboards with only 30 jobs and generate up to 400,000 page loads a day between the scrapers, email harvesters and bad bots.
This is bandwidth you should not be taking up, and if you are a proper job aggregator like Indeed, JobsUK or GoogleBase then you should accept XML feeds of the jobs from the sites who want their jobs to appear on your site.
Having permission from the clients (recruiters/employers) on the site is not good enough, as they do not own the content; the site owner does. From what I have seen, the only job aggregators who crawl rather than accept feeds are those who can't, for whatever reason, get the jobs the correct way.
I have put automated traffic analysis reports into my systems that let me know at regular intervals which bots are visiting me, which visitors are heavy hitting and which are spoofing, hacking, raping and other forms of content pillaging.
It really is like a cold war arms race, and I am banning bots every day of the week for breaking these 4 simple to follow rules.
If you are a legitimate bot then it's not too hard to come up with a user-agent that identifies you correctly, set up a reverse DNS entry, follow the robots.txt rules and not visit my site every day crawling every single page!
Labels: 403, Ban, Blank User-Agents, Content Scraper, crawler, Forbidden, Forward IP, Googlebot, IP address, Job Rapist, Lookup, Reverse DNS, robots, robots.txt, user-agent
Location: Southern Europe
Thursday, 12 September 2013
SEO - Search Engine Optimization
My two cents worth about Search Engine Optimisation - SEO
Originally Posted - 2009
UPDATED - 12th Sep 2013
SEO is big bucks at the moment, and it seems to be one of those areas of the web with lots of snake oil salesmen and "SEO experts" who will promise No. 1 positioning on Google, Bing and Yahoo for $$$ per month.
It is one of those areas I didn't really pay much attention to when I started web development, mainly because I was not the person paying for the site and relying on leads coming from the web. However, as I have worked on more and more sites over the years, it's become blatantly apparent to me that SEO comes in two forms from a development or sales point of view.
There are the forms of SEO which are basically good web development practise and will come about naturally from having a good site structure, making the site usable and readable as well as helping in terms of accessibility.
Then there are the forms which people try to bolt onto a site afterwards, either as an afterthought or because an SEO expert has charged the site lots of money, promised the impossible, and wants to use some dubious link-sharing schemes that are believed to work.
Cover the SEO Basics when developing the site
It's a lot harder to just "add some Search Engine Optimization" once a site has been developed, especially if you are developing generic systems that have to work for numerous clients.
I am not an SEO expert and I don't claim to be, otherwise I would be charging you lots of money for this advice and making promises that are impossible to keep. However, following these basic tips will only help your site's SEO.
Make sure all links have title attributes on them and contain worthwhile content rather than words like "click here". The content within the anchor tags matters when those bots come a-crawling in the dead of night.
You should also make sure all images have ALT attributes as well as titles, and make sure the content of the two differs. As far as I know Googlebot will rate ALT content higher than title content, but it cannot hurt to have both.
Make sure you make use of header tags to differentiate out important sections of your site and try to use descriptive wording rather than "Section 1" etc.
Also as I'm sure you have noticed if you have read my blogs before I wrap keywords and keyword rich sentences in strong tags.
I know that Google will also rank emphasised content or content marked as strong over normal content. So as well as helping those readers who skim read to view just the important parts, it also tells Google which words are important on my article.
Write decent content and don't just fill up your pages with visible or non-visible spammy keywords.
In the old days keyword density mattered when ranking content. This was calculated by removing all the noise words and other guff (CSS, JavaScript etc) and then calculating what percentage of the overall page content was made up of relevant keywords.
Nowadays the bots are a lot cleverer and will penalise content that does this as it looks like spam.
Also, it's good for your users to have readable content, and you shouldn't remove words between keywords as it makes the text less readable and you will lose out on the longer 3, 4 or 5 word indexable search terms (called long-tail in the SEO world).
Saying this though, it's always good to remove filler from your pages, for example by putting your CSS and JavaScript code into external files where possible and removing large commented-out sections of HTML.
You should also aim to put your most important content at the top of the page so it's the first thing crawled.
Try moving main menus and other content that can be positioned by CSS to the bottom of the file. This is so that social media sites and other BOTS that take the "first image" on an article and use it in their own social snippets don't accidentally use an advertisers banner instead of your logo or main article picture.
The same thing goes for links. If you have important links but they are in the footer such as links to site-indexes then try getting them higher up the HTML source.
I have seen Google recommend that 100 links per page is the maximum to have. Therefore a homepage that has your most important links at the bottom of the HTML source, but 200+ links above them (e.g links to searches, even if not all of them are visible), can be harmful.
If you are using a tabbed interface to switch between tabs of links then the links will still be in the source code, but if they are loaded in by JavaScript on demand then that's no good at all, as a lot of crawlers don't run JavaScript.
Items such as ISAPI URL rewriting are very good for SEO, plus they make nicer URLs for sites to display.
For example, using a site I have just worked on, http://jobs.professionalpassport.com/companies/perfect-placement-uk-ltd is a much nicer URL to view a particular company profile than the underlying real URL, which could also be accessed as http://jobs.professionalpassport.com/jobboard/cands/compview.asp?c=6101
If you can access that page by both links and you don't want to be penalised for duplicate content then you should specify which link you want indexed by specifying your canonical link. You should also use your Robots.txt file to specify that the non re-written URLs are not to be indexed e.g.
Disallow: /jobboard/cands/compview.asp
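And on the canonical side, a sketch of what the tag might look like in the head of the re-written page (using the example URL above):

<link rel="canonical" href="http://jobs.professionalpassport.com/companies/perfect-placement-uk-ltd" />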
META tags such as the keywords tag are not considered as important as they once were and having good keyword rich content in the main section of the page is the way to go rather than filling up that META with hundreds of keywords.
The META Description will still be used to help describe your page on search results pages and the META Title tag is very important to describe your page's content to the user and BOT.
However some people are still living in the 90's and seem to think that stuffing their META Keywords with spam is the ultimate SEO trick when in reality that tag is probably ignored by most crawlers nowadays.
Set up a Sitemap straight away containing your site's pages ranked by their importance, how often they change, last modified date etc. The sooner you do this the quicker your site will get indexed and gain site authority. It doesn't matter if it's not 100% ready yet; the sooner it's in the indexes the better.
Whilst you can do this through Google's Webmaster Tools or Microsoft's Bing tools, you don't actually need to use them; as long as you use a sitemap directive in your robots.txt file BOTS will find it e.g
Sitemap: http://www.strictly-software.com/sitemap_110908.xml
You can also use tools such as the wonderful SEOBook Toolbar which is an add-on for Firefox which has combined numerous other free online SEO tools into one helpful toolbar. It lets you see your Page Ranking and compare your site to competitors on various keywords across the major search engines.
Also, using a text browser such as Lynx to see how your site would look to a crawler such as Yahoo or Google is a good trick, as it will skip all the styling and JavaScript and show you roughly what the BOTS see.
There are many other good practices which are basic "musts" in this day and age, and the major SERPs are moving more and more towards social media when it comes to indexing sites and seeing how popular they are.
You should set up a Twitter account and make sure each article is published to it as well as engaging with your followers.
A Facebook fan page is also a good method of getting people to view snippets of your content and then find your site through the world's most popular social media website.
Making your website friendly for people viewing it on tablets or smart phones is also good advice as more and more people are using these devices to view Internet content.
The Other form of SEO, Black Magic Optimization
The other form of Search engine optimization is what I would call "black magic SEO" and it comes in the form of SEO specialists that will charge you lots of money and make impossible claims about getting you to the number one spot in Google for your major keywords and so on.
The problem with SEO is that no-one knows exactly how Google and the others calculate their rankings so no-one can promise anything regarding search engine positioning.
There is Google's PageRank, which is used alongside other forms of analysis. Basically, if a site with a high PR links to your site and your site does not link back to it, then it tells Google that your site has higher site authority than the linking site.
If your site only links out to other sites but doesn't have any links coming in from high-ranked relevant sites then you are unlikely to get a high PageRank yourself. This is just one of the ways Google determines how high to place you in the rankings when a search is carried out.
Having lots of links coming in from sites that have nothing whatsoever to do with your site may help drive traffic but will probably not help your PR. Therefore engaging in all these link exchange schemes is probably worth jack nipple, as unless the content that links to your site is relevant or related in some way it's just seen as a link for a link's sake, i.e. spam.
Some "SEO specialists" promote special schemes which have automated 3 way linking between sites enrolled on the scheme.
They know that just having two unrelated sites link to each other basically negates the PageRank benefit, so they try to hide this by having your site A link to site B, which in turn links to site C, which then links back to you.
The problem is obviously getting relevant sites linking to you rather than every Tom, Dick and Harry.
Also, advertising on other sites purely to get indexed links from that site to yours to increase PR may not work, because most of the large advert management systems output banner adverts using JavaScript. Although the advert will appear on the site and drive traffic when people click it, you will not get the benefit of an indexed link, because when the crawlers come to index the page containing the advert the banner image and any link to your site won't be there.
Anyone who claims that they can get you to the top spot in Google is someone to avoid!
The fact is that Google and the others are constantly changing the way they rank and what they penalise for, so something dubious that works currently could actually harm you down the line.
For example, in the old days people would put hidden links on white backgrounds or position them out of sight, so that the crawlers would hit them but the users wouldn't see them. This worked for a while until Google and the others cracked down and penalised for it.
Putting any form of content up specifically for a crawler is seen as dubious and you will be penalised for doing it.
Google and BING want to crawl the content that a normal user would see and they have actually been known to mask their own identity ( IP and User-Agent ) when crawling your site so that they can check whether this is the case or not.
My advice would be to stick to the basics, don't pay anybody who makes any kind of promise about result ranking and avoid like the plague any scheme that is "unbeatable" and promises unrivalled PR within only a month or two.
Tuesday, 10 September 2013
New Version of Strictly AutoTags - Version 2.8.6
Strictly Auto Tags 2.8.6 Has Been Released!
Due to the severe lack of donations, plus too many broken promises of "I'll pay if you just fix or add this", I am stopping support for the Strictly AutoTags plugin.
The last free version, 2.8.5 is up on the WordPress repository: wordpress.org/plugins/strictly-autotags
It fixes a number of bugs and adds some new features such as:
- Updated the storage array to store content inside important tags (bold, strong, headers, links etc) so that they don't get tagged again inside, e.g putting bolded words inside an existing h4.
- Changed the storage array to run "RETURN" twice to handle nested code, because of the previous change.
- Fixed a bug that wasn't showing the correct value in admin for the minimum number of tags a post must have before deeplinking to their tag page.
- Fixed a bug in admin to allow noise words to have dots in them, e.g for links like youtube.com.
- Added more default noise words to the list.
- Cleaned up code that wasn't needed any more due to changes in the way I handle href/src/title/alt attributes to prevent nested tagging.
- Removed regular expressions which are no longer needed.
Version 2.8.6 is going to be a "donate £40+ and get a copy" version.
I am going to be sexing this plugin up into more of an SEO, text spinning, content cleaning, auto-blogging tool full of sex and violence in the future, and I am running it on my own sites at the moment to see how well it does.
New features in 2.8.6 include:
Set a minimum length of characters for a tag to be used.
Set equivalent words to be used as tags. I have devised a "mark up" code for doing this which allows you to add as many tag equivalents as you want. For example, this is a cut down version from one of my sites, using Edward Snowden (very topical at the moment!):
[NSA,Snowden,Prism,GCHQ]=[Police State]|[Snowden,Prism]=[Edward Snowden]|[Prism,XKeyscore,NSA Spying,NSA Internet surveillance]=[Internet Surveillance]|[TRAPWIRE,GCHQ,NSA Spying,Internet surveillance,XKeyscore,PRISM]=[Privacy]|[Snowden,Julian Assange,Bradley Manning,Sibel Edmonds,Thomas Drake]=[Whistleblower]
As you can see from that example, you can use the same word multiple times and give it several equivalent tags. So if the word Snowden appears a lot I will also tag the post with "Police State", "Edward Snowden" and "Whistleblower" as well as Snowden.
This feature is designed so that you can use related words as tags that may be more relevant to people's searches.
I have also added a feature to convert textual links that may appear from importing or scraping into real links, for example www.strictly-software.com will become a real clickable link to that domain, e.g http://www.strictly-software.com.
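The idea is roughly along these lines (a simplified sketch, not the plugin's actual regex, which handles far more cases):

// Turn bare www. links in post content into real anchors
function MakeTextualLinksClickable($content)
{
	return preg_replace_callback(
		'@(?<!["\'>/])\bwww\.[a-z0-9.\-]+\.[a-z]{2,}(/[^\s<"]*)?@i',
		function($m){
			$url = $m[0];
			return '<a href="http://' . $url . '">' . $url . '</a>';
		},
		$content
	);
}

// e.g "visit www.strictly-software.com for more" becomes a clickable link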
I have also added the new data- attributes and their derivatives, e.g data-description or data-image-description (basically anything with data- at the front), to my list of attributes to store and then replace after auto-tagging, to prevent nested tags being added inside them.
I will be extending this plugin a lot in the future, but only people prepared to pay for it will get the goodies. I am so fed up with open-source coding that there is little point in me carrying on working my ass off for free for other people's benefit any more.
If you want a copy email me and then I will respond.
You can then donate me the money and I will send you a unique copy.
Any re-distribution of the code will mean hacks, DDOS, viruses from hell and Trojans coming out your ass for years!
Testing Server Load Before Running Plugin Code On Wordpress
UPDATED - 10th Sep 2013
I have updated this function to handle issues on Windows machines in which the COM object might not be created due to security issues.
If you have an underpowered Linux server and run the bag of shite that is the WordPress CMS on it then you will have spent ages trying to squeeze every bit of power and performance out of your machine.
You've probably already installed caching plugins at every level, from WordPress to the server and maybe even beyond, into the cloud; all stuff normal websites shouldn't have to do, but which WordPress / Apache / PHP programmers seem to love doing.
A fast optimised database, queries that return data in sets (not record by record) and some static pages for content that doesn't change constantly should be all you need but it seems that this is not the case in the world of WordPress!
Therefore, if you have your own server or virtual server and the right permissions, you might want to consider adding some code to important plugins that stops the job you intend to run from causing more performance problems if the server is already overloaded.
You can do this by testing the current server load, setting a threshold limit, and then only running the code you want if the server load is below that limit.
Of course security is key, so lock down permissions to your apps and only let admin or the system itself run the code; never a user, and never via a querystring that could be hacked!
The code is pretty simple.
It does a split for Windows and non Windows machines and then it checks for a way to test the server load in each branch.
For Windows it has two methods: one for old PHP and one for PHP 5+.
In the Linux branch it tests for access to the /proc/loadavg file, which contains the current load average on LINUX machines.
If it's not there it tries to use the shell_exec function (which may or may not be locked down by permissions; up to you whether you allow access or not) and, if it can run shell commands, it calls the "uptime" command to get the current server load from that.
You can then call this function in whatever plugin or function you want and make sure your server isn't already overloaded before running a big job.
I already use it in all my own plugins, the Strictly Google Sitemap and my own version of the WP-O-Matic plugin.
/**
 * Checks the current server load
 *
 * @return string|false the load average (Linux) or CPU load percentage (Windows), or false on failure
 *
 */
function GetServerLoad(){

	$os = strtolower(PHP_OS);

	// handle non windows machines
	if(substr($os, 0, 3) !== 'win'){
		// the standard Linux load average file
		if(file_exists("/proc/loadavg")) {
			$load = file_get_contents("/proc/loadavg");
			$load = explode(' ', $load);
			return $load[0];
		// otherwise try parsing the output of the uptime command
		}elseif(function_exists("shell_exec")) {
			$load = @shell_exec("uptime");
			$load = explode(' ', $load);
			return $load[count($load)-3];
		}else {
			return false;
		}
	// handle windows servers
	}else{
		if(class_exists("COM")) {
			$wmi = new COM("WinMgmts:\\\\.");
			if(is_object($wmi)){
				$cpus = $wmi->InstancesOf("Win32_Processor");
				$cpuload = 0;
				$i = 0;
				// Old PHP
				if(version_compare('4.50.0', PHP_VERSION) == 1) {
					// PHP 4
					while ($cpu = $cpus->Next()) {
						$cpuload += $cpu->LoadPercentage;
						$i++;
					}
				} else {
					// PHP 5
					foreach($cpus as $cpu) {
						$cpuload += $cpu->LoadPercentage;
						$i++;
					}
				}
				// average the load across all CPUs
				$cpuload = round($cpuload / $i, 2);
				return "$cpuload%";
			}
		}
		return false;
	}
}
A simple server load testing function that should work across Windows and Linux machines for load testing.
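And a quick sketch of how you might use it to guard a heavy job. The 2.0 threshold and RunHeavyJob() are just placeholders; remember the Linux branch returns a load average while the Windows branch returns a percentage, so pick a threshold to suit.

$load = GetServerLoad();

// only run the big job if we got a reading and the server isn't already struggling
if($load !== false && floatval($load) < 2.0){
	RunHeavyJob(); // e.g rebuild a sitemap, import a feed, run auto-tagging
}else{
	error_log("Server load is $load - skipping the job until the next scheduled run");
}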
Saturday, 31 August 2013
Displaying Apache Server Status HTML
Display Your Apache Server Status As HTML Webpage
Diagnosing and fixing Apache issues can be a fucking nightmare (excuse the French). However, one thing you might want to do is allow yourself access to view key Apache stats on a webpage rather than having to SSH in and use the console.
If you read some articles on the web, setting this up is as simple as adding the following lines to your main apache2.conf or httpd.conf file.
Depending on your server this means going into /etc/apache2/apache2.conf or /etc/httpd/conf/httpd.conf and adding the following lines.
# Allow server status reports, with the URL of http://dv-example.com/server-status
# Change the ".dv-example.com" to match your domain to enable.
ExtendedStatus On
<Location /server-status>
    SetHandler server-status
    Order Deny,Allow
    Deny from all
    Allow from mywebsite.com
</Location>
Or if you wanted to only allow your own IP address to access the page you could do something like this.
# Allow server status reports only from the IP address you want to allow access
ExtendedStatus On
<Location /server-status>
    SetHandler server-status
    Order Deny,Allow
    Deny from all
    Allow from 86.42.219.12
</Location>
However when I tried either of these approaches I was just met with an error message:
Forbidden
You don't have permission to access /server-status on this server.
The Fix
At the bottom of my main apache2.conf file I noticed the lines
# Include generic snippets of statements
Include /etc/apache2/conf.d/
Therefore I went into that folder and found 3 other files:
apache2-doc
charset
security
The top of the security file had these lines.
# Disable access to the entire file system except for the directories that
# are explicitly allowed later.
#
# This currently breaks the configurations that come with some web application
# Debian packages. It will be made the default for the release after lenny.
#
<Directory />
    AllowOverride None
    Order Deny,Allow
    Deny from all
</Directory>
So it seems the main file was loading in these other files, and the security file was blocking access to all other directories. When I added my rule to the top level conf file the security file just overruled it again, which caused the 403 error. Therefore I added the rules to the bottom of the security file and hey presto, it worked.
Now I can hit one of my domains to see my Apache info. However I found that, because most of the sites on this virtual server use WordPress, their rewrite rules prevent a simple mysite.com/server-status from working.
WordPress is obviously trying to resolve the URL to a page, post or category list.
Therefore if you have this problem you might need to use a domain not full of rules, or create a rule that bypasses the WordPress guff.
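One way of doing that, as a sketch (this assumes mod_rewrite and goes above the standard WordPress rules in the .htaccess file):

# Leave /server-status alone so the SetHandler gets a chance to serve it
RewriteEngine On
RewriteRule ^server-status$ - [L]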
Once you get it working you will get Apache info such as the following.
Current Time: Saturday, 31-Aug-2013 08:28:06 BST
Restart Time: Saturday, 31-Aug-2013 08:15:13 BST
Parent Server Generation: 0
Server uptime: 12 minutes 53 seconds
Total accesses: 622 - Total Traffic: 4.2 MB
CPU Usage: u26.65 s3.24 cu0 cs0 - 3.87% CPU load
.805 requests/sec - 5.5 kB/second - 6.9 kB/request
3 requests currently being processed, 4 idle workers
Plus lots of info about current HTTP requests, CPU usage and session info.
It is well worth doing if you need info about your server and you are away from your SSH console.
Thursday, 22 August 2013
Handle jQuery requests when you want to reference code that hasn't loaded yet
As you should be aware, it is best practice to load your JavaScript at the bottom of your HTML for performance and to stop blocking or slow load times. However, sometimes you may want to reference an object higher up the page that has not loaded yet.
If you are using a CMS or code that you cannot change then you may not be able to add your event handlers below the scripts they need, which can cause errors such as:
Uncaught ReferenceError: $ is not defined
Uncaught ReferenceError: jQuery is not defined
If you cannot move your code below where the script is loaded then you can make use of a little PageLoader object that you can pass any function to, and which will hold it until jQuery (or any other object) is loaded before running it.
A simple implementation involves a setTimeout call that keeps polling the check function until the script you are waiting for has loaded.
For example:
PageLoader = {

	// holds the callback function to run once jQuery has loaded, for when jQuery is loaded in the footer and your code is above it
	jQueryOnLoad : function(){},

	// call this function with your onload function as the parameter
	CheckJQuery : function(func){

		var f = false;

		// has jQuery loaded yet?
		if(window.jQuery){
			f = true;
		}

		// if not, store the function on the first call, then set a timeout and keep polling
		if(!f){
			// if we have been passed a function store it until jQuery has loaded
			if(typeof(func)=="function"){
				PageLoader.jQueryOnLoad = func;
			}
			// keep looping until jQuery is in the DOM
			setTimeout(PageLoader.CheckJQuery, 200);
		}else{
			// jQuery has loaded so call the function
			PageLoader.jQueryOnLoad.call();
		}
	}
}
As you can see the object just holds the function passed to it in memory until jQuery has loaded, which will be apparent because window.jQuery will be true.
If jQuery isn't available yet then it just uses a setTimeout call to keep polling the check function until it has loaded.
You could increase the length of time between the polls, or add a maximum limit so that it doesn't poll for ever and instead returns an error to the console after, say, 10 loops (see the sketch below). However this is just a simple example.
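A rough sketch of that capped version might look something like this; the function and variable names (checkJQueryWithLimit, maxTries) are my own and not part of the PageLoader object above:
// a standalone sketch of a capped poll - gives up after maxTries checks
function checkJQueryWithLimit(func, tries){
	var maxTries = 10; // stop polling after 10 checks
	tries = tries || 0;
	if(window.jQuery){
		// jQuery is available so run the callback
		func();
	}else if(tries < maxTries){
		// check again in 200ms, keeping count of the attempts
		setTimeout(function(){ checkJQueryWithLimit(func, tries + 1); }, 200);
	}else if(window.console){
		console.error("jQuery still not loaded after " + maxTries + " checks");
	}
}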
You would call the function by passing your jQuery referencing function to the CheckJQuery function like so.
<script>
PageLoader.CheckJQuery(function(){
$("#myelement").bind('click', function(e) {
// some code
alert("hello");
});
});
</script>
It's just a simple way to overcome a common problem where you cannot move your code about due to system limitations but require access to an object that will be loaded later on.
Thursday, 15 August 2013
MySQL Server Has Gone Away Fix For Wordpress 3.6
MySQL Server Has Gone Away Fix For Wordpress 3.6
I have been constantly having issues with the bag of XXXX that is Wordpress and I really, really hate relying on someone else's code, especially when there is so much I would do to improve it. Adding new indexes and writing my own plugins for heavy tasks can only do so much to tweak the performance of a 3rd party CMS system.
However one of the problems I have been having lately is the dreaded "MySQL Server Has Gone Away" error littering my error log.
The symptoms I have been getting include:
- Trying to load a page but it just spins away.
- Checking the console with a TOP command to see very low server loads (0.00 to 0.03)
- No Apache / MySQL processes running.
I first thought I had solved the problem some months back when I de-activated the Apache Caching plugins my server was using.
I did this because of the high number of Apache related errors in the error log files like this:
[Thu Feb 02 16:30:57 2012] [error] (103)Software caused connection abort: cache: error returned while trying to return mem cached data
These errors related to the times I was getting the slow page loads, high disk swapping and low server loads.
Apache was obviously causing me issues with its own caching and a restart always fixed it.
As I was using WP Super Cache there was no need for duplicate caching and there are far too many levels on a LAMP set-up where caching can be enabled.
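For reference, on a Debian-based server de-activating the Apache caching modules is usually just a case of something like the commands below; the exact module names depend on which ones you have enabled:
a2dismod mem_cache
a2dismod cache
/etc/init.d/apache2 restart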
For me removing the Apache caching seemed to fix the issue, for a while at least.
However I keep getting intermittent issues where the same symptoms are present except instead of Apache errors in the error log I am getting lots of MySQL Server Has Gone Away errors like this:
[Wed Aug 14 23:37:29 2013] [error] [client 173.203.107.206] WordPress database error MySQL server has gone away for query UPDATE wp_posts SET robotsmeta = '' WHERE ID = 42693 made by WPOMatic->runCron, WPOMatic->processAll, WPOMatic->processCampaign, WPOMatic->processFeed, WPOMatic->processItem, WPOMatic->insertPost, wp_insert_post, do_action('wp_insert_post'), call_user_func_array, RobotsMeta_Admin->robotsmeta_insert_post
Therefore I looked into the MySQL Server Has Gone Away error and came across the Robs Note Book workaround for WordPress.
The fix he uses is pretty simple and involves using a custom wp-db.php file for your WordPress installation.
As this is a core file that handles all database queries it is a bit annoying in that it may need constant updates when new versions come out.
As the highest version of the workaround on his site is for WordPress 2.8.1 and I was on WordPress version 3.6 I had to create my own workaround for the file.
However the fix is pretty simple to implement and just involves using a new function with a number of retries for the failed query with a server re-connect between each loop iteration.
Basically all the fix does is something I do many times in my own projects when I encounter lock timeouts or deadlocks, e.g retry the query X number of times before quitting with an error.
You can see one such example I use to handle LOCK TIMEOUTS in MS SQL here.
Therefore if you need to apply the same fix in a future version of WordPress you can just follow these steps yourself. Download the existing copy of wp-db.php from the wp-includes folder, back it up and then make a copy before applying these changes.
1. In the db_connect() function that connects to your database, ensure that the initialquery flag is turned on and then off around the first query run when the class initialises. This currently happens to be the $this->set_charset call, which sets the connection's correct character set.
The flag should be wrapped around this function call and turned off before the $this->select call, which selects the database to use.
$this->initialquery=1;
$this->set_charset( $this->dbh );
$this->ready = true;
$this->initialquery=0;
$this->select( $this->dbname, $this->dbh );
2. You need to create the following function and put it in the file somewhere. It attempts to run the query and then retries a number of times, with a re-connect to the database in between.
This is what handles the "MySQL Server Has Gone Away" error as it re-connects if the connection is no longer present.
You can change the number of retries to any value you want. I use 3 re-attempts as I don't see the point of any further retries; if you can't re-connect to your database after 3 goes you certainly have an issue.
function queryWithReconnect($query)
{
// set this to the number of re-attempts you want to try
$maxcount = 3;
$cnt = 1;
// loop until we reach our $maxcount value OR we get a good query result
while($cnt <= $maxcount)
{
// close our connection and then re-connect to our database - this is what fixes the MySQL Has Gone Away error
// as if it has gone away we are re-attempting the connection.
@mysql_close($this->dbh);
// re-connect to our database
$this->db_connect();
// re-run our query and store the result in a global variable
$this->result = @mysql_query($query, $this->dbh);
// if we don't have an error from the server quit now, otherwise carry on!
if (!mysql_error($this->dbh))
{
// return 0 so we know the query ran ok
return 0;
}
// if we are here we had an error so increment our $cnt loop counter
$cnt+=1;
}
// we have looped up to our $maxcount number and all attempts at running the query failed so quit with a failure code!
return 1;
}
3. In the main query() function which runs ALL queries in WordPress you need to ensure that the call to our new function that re-attempts the query 3 times is placed just after the initial attempt at running the query.
Find the code that checks for a MySQL error and place our call to the new function inside it and above the code that clears any insert_id that may have been stored.
// If there is an error then take note of it..
if ( $this->last_error = mysql_error( $this->dbh ) ) {
// If it's the initial query, e.g in the db_connect() function, OR we have a MySQL error then call our queryWithReconnect
// function until it either passes OR fails after X attempts.
if (($this->initialquery)||($this->queryWithReconnect($query)!=0)) {
// Clear insert_id on a subsequent failed insert.
if ( $this->insert_id && preg_match( '/^\s*(insert|replace)\s/i', $query ) )
$this->insert_id = 0;
$this->print_error();
return false;
}
}
Since putting this re-try code in my wp-db.php file I have had no MySQL Gone Away Errors but it is too early to tell if my main issue of low server loads, no processes running and no website activity is solved yet.
However even if it hasn't this is a good trick to use for your own WordPress code.
If you want to download the latest version of the wp-db.php workaround file which works with WordPress 3.6 then you can get it from the link below.
Just re-name it from wp-db.txt to wp-db.php and then copy it into the wp-includes folder of your site. Be sure to make a backup first in case it all goes tits up!
Download WordPress 3.6 Fix For MySQL Server Has Gone Away Error - wp-db.php
http://www.strictly-software.com/scripts/downloads/wp-db.txt
Wednesday, 14 August 2013
Help Fight Internet Censorship with the Pirate Bay's New PirateBrowser
Help Fight Internet Censorship with the Pirate Bay's New PirateBrowser
To celebrate its 10th birthday the world's most infamous censored site, The Pirate Bay, has introduced its own Internet browser to enable people to access its website even if it's being blocked by your ISP. Most big ISPs have blocked The Pirate Bay for their users, claiming it breaches copyright by allowing people to download torrents of films and music.
If you don't know what a torrent is, it's a file such as a movie split into thousands of small pieces. Each piece is stored on various computers so that each user is not, in theory, holding a full version of a film that may be breaching copyright. When you download the torrent the pieces are all downloaded from their various locations and put back together.
However most ISPs still see this as copyright violation, and even though The Pirate Bay is just like Google in that it is only a search engine and doesn't actually host the films or music, it has been attacked from all quarters.
Is Your ISP Blocking You?
You can quickly test whether you are being blocked by your ISP by clicking these links which all point to the Pirate Bay Website.
http://thepiratebay.org
http://thepiratebay.sx
https://piratereverse.info
Check More Pirate Bay Proxies
If you want a PHP script that will scan a number of known Pirate Bay Proxies then you can download this one I quickly knocked up from here: Pirate Bay Proxy Checker Script.
Just change the file extension to .php and either run it from your local computer or server. If you are running Windows I recommend downloading WAMP so that you can run an Apache server on your PC.
I also recommend changing the port number the server runs on to 8080 or 8888 so it can run side by side with IIS. This article will explain how to set WAMP up.
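If you want to see where that change goes, a minimal sketch would be to edit the Listen directive in Apache's httpd.conf and restart WAMP; the path below is the typical WAMP location and may differ on your install:
# e.g C:\wamp\bin\apache\apache2.x.x\conf\httpd.conf
Listen 8888
ServerName localhost:8888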
Use Pirate Bay Tor Browser
If you are being blocked from the Pirate Bay by your ISP then you have a number of choices.
1. Find a working mirror to the site - e.g search for Pirate Bay Proxies on DuckDuckgo.com or try this page: http://torrentproxies.com/ to find some not blocked by your ISP. Many mirrors will first open up in something called adf.ly with a JavaScript countdown. After a few seconds a link will appear saying "click here to continue". On clicking the page will change (usually to an advert with a blue bar at the top) and in the top right corner will be another countdown saying "Please Wait". Usually after 5 seconds it will say "SKIP AD" and on clicking it you will end up at the Pirate Bay mirror site.
2. Use a proxy to access the Pirate Bay. There are many free proxies out there and you can find a good list at http://nntime.com/proxy-list-01.htm. Use a tool like FoxyProxy to manage your proxy lists and toggle between them. Or you can go into your browser's network / proxy settings and manually enter the IP address and port number you want to use.
3. Download the new Pirate Bay Browser or the Mozilla TOR Browser, both of which access the TOR network to help disguise your Internet footprint by bouncing your HTTP requests through a series of servers. As the TOR website explains:
"Tor protects you by bouncing your communications around a distributed network of relays run by volunteers all around the world: it prevents somebody watching your Internet connection from learning what sites you visit, and it prevents the sites you visit from learning your physical location. Tor works with many of your existing applications, including web browsers, instant messaging clients, remote login, and other applications based on the TCP protocol."
With all the outrage surrounding the NSA spying on ALL American citizens, and the recent revelations that they are using this information not only to catch terrorists but also petty criminals, drug dealers and tax avoiders, and to spy on and blackmail politicians and judges, it is in everyone's best interests to tighten up on their Internet security.
As this article I wrote explains, there are a number of measures you can take to reduce your Internet footprint, and whilst you might not be totally invisible you can at least blend into the background.
So do yourself a favour and use the Pirate Bay browser or the TOR Mozilla Browser. Not only does making use of the TOR network help protect your privacy but the Pirate Bay Browser will let you access a number of Pirate Bay Mirror sites and bypass any censorship that your ISP or country may have introduced.
Download the TOR Browser
Once installed when you open the .EXE you will connect to the TOR network.
Once connected you can use the Firefox browser to surf the web. All your Pirate Bay Mirror sites and other Torrent sites are linked at the top in the icon bar. Pick one and then carry out your search for movies, applications or other content.
As you can see it's just a search engine like Google, which makes it very unfair that the Pirate Bay is being blocked by ISPs when Google also indexes illegal content such as porn and copyrighted material.
This is a search for the US TV show Dexter.
Like most search engines once you run a search you get your results.
As you can see from the results on the right there are two columns, seeders and leechers.
You want to choose an item with as many seeders (uploaders) and as few leechers (downloaders) as possible to get a quick download.
You can also tell from the names of the file what sort they are e.g if the file has the word CAM in it then it's a poor quality camera in a cinema job. If it's HDTV quality it will be a bigger better quality file and BRRip is a BluRay copy.
Clicking on the file you want will open up the result page. Hopefully you will get comments telling you the quality of the file. Click on "Get This Torrent" to download your torrent file.
When you want to download a torrent from the Pirate Bay you will first need to have a torrent client. There are many out there including:
www.utorrent.com/downloads/win
bitlord.soft32.com/
www.bittorrent.com/
Once you have installed your client of choice you can just click on the torrent in your search results and it will open up in the client and start downloading.
As you can see, whilst you download the file you will also be uploading at the same time. This way you are not just taking without giving back the files you have already leeched. You can control the bandwidth ratio between upload and download as well as set limits on the rate you upload.
However be warned: ISPs are also on the lookout for people downloading (leeching) and uploading (seeding) torrents, so you need to protect yourself.
Some of the ways include:
- Using someone else's WIFI or network. As with most security measures, a range of different measures is much better than just relying on one system.
- Using an anonymous proxy that doesn't leak identifying information, or even better a VPN. A paid-for tool like BTGuard is both a proxy and an encryption tool, which helps prevent your ISP throttling your traffic.
- Using a block list, which will mean traffic to and from known monitoring IP addresses is blocked so that they cannot scan your traffic.
- Setting your download port to a common port for other traffic such as 80 or 8080 (HTTP) so that it doesn't look obvious you are downloading P2P data. Most people use a "random port" but it is better to look like you are downloading HTTP content, especially if you are encrypting your traffic as well.
- Limiting your upload rate to a minimum. Although this violates the spirit of P2P (sharing), the people going after you for stealing copyrighted material are more interested in those spreading (uploading) the content than those downloading it.
- Ensuring you force outgoing traffic to be encrypted. This will help prevent your ISP seeing that you are using BitTorrent traffic and may stop them throttling your bandwidth.
- Setting a download cap on your traffic. Even if you are encrypting your traffic some ISPs may see the amount you are downloading and throttle it if they think you are up to no good.
- Using an application like PeerBlock, which will block traffic from known bad IP addresses such as P2P blocklists, spyware, FBI and copyright monitoring sites and so on. It is worth downloading and just sits in the background running as you do your downloading.
Hopefully this article will help you make use of The Pirate Bay if you have been blocked so far from accessing it.
Tuesday, 30 July 2013
Handling Blocking in SQL 2005 with LOCK_TIMEOUT and TRY CATCH statements
Handing Blocking and Deadlocks in SQL Stored Procedures
We recently had an issue at night during the period in which daily banner hit/view data is transferred from the daily table to the historical table.
During this time our large website was being hammered by BOTs and users, and we were getting lots of timeout errors reported because the tables we wanted to insert our hit records into were being DELETED from and UPDATED, causing locks.
The default LOCK_TIMEOUT is -1 (wait indefinitely) but we had set it to match our default command timeout of 30 seconds.
However if the DELETE or UPDATE in the data transfer job took over 30 seconds then the competing INSERT (to insert a banner hit or view) would time out and error with a database timeout due to the Blocking process not allowing our INSERT to do its job.
We tried a number of things including:
- Ensuring all tables were covered by indexes to speed up any record retrieval
- Reducing the DELETE into small batches of 1000 or 100 rows at a time in a WHILE loop to reduce the length of time the lock was held each time (see the sketch after this list).
- Ensuring any unimportant SELECT statements from these tables were using WITH (NOLOCK) to get round any locking issues.
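As an illustration, a batched DELETE of the daily data might look something like the sketch below. The table and column names here are made up for the example rather than taken from our actual system:
DECLARE @Rows int
SELECT @Rows = 1
-- delete in small batches so each statement holds its locks for as short a time as possible
WHILE @Rows > 0
BEGIN
	DELETE TOP (1000)
	FROM tbl_DAILY_BANNER_HITS
	WHERE Stamp < DATEADD(day, -1, GETDATE())
	-- @@ROWCOUNT drops to 0 once there is nothing left to delete
	SELECT @Rows = @@ROWCOUNT
END
-- and an example of a dirty read with NOLOCK for reporting SELECTs that don't need to be exact
SELECT COUNT(*)
FROM tbl_DAILY_BANNER_HITS WITH (NOLOCK)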
However none of these actually helped solve the problem so in the end we rewrote our stored procedure (SQL 2005 - 2008) so that it handled the LOCK TIMEOUT error and didn't return an error.
In SQL 2005 you can make use of TRY CATCH statements, which meant we could try to insert our data a certain number of times and, if it failed, just return quickly without an error; we also wrapped each attempt in a TRANSACTION so we could ROLLBACK or COMMIT it.
We also set the LOCK_TIMEOUT to 500 milliseconds (so 3 attempts = 1.5 seconds at most), as if the insert couldn't be done in that time frame there was no point logging it. We could have inserted the data into another table to be added to our statistics later on, but that is another point.
The code is below and shows you how to trap BLOCKING errors including DEADLOCKS and handle them.
Obviously this doesn't fix the underlying contention, it just "masks" the problem from the end user and reduces the number of errors caused by database timeouts from long-waiting blocked processes.
CREATE PROCEDURE [dbo].[usp_net_update_banner_hit]
@BannerIds varchar(200), -- CSV of banner IDs e.g 100,101,102
@HitType char(1) = 'V' -- V = banner viewed, H = banner hit
AS
SET NOCOUNT ON
SET LOCK_TIMEOUT 500 -- set to half a second
DECLARE @Tries tinyint
-- start at 1
SELECT @Tries = 1
-- loop for 3 attempts
WHILE @Tries <= 3
BEGIN
BEGIN TRANSACTION
BEGIN TRY
-- insert our banner hits we are only going to wait half a second
INSERT INTO tbl_BANNER_DATA
(BannerFK, HitType, Stamp)
SELECT [Value], @HitType, getdate()
FROM dbo.udf_SPLIT(@BannerIds,',') -- UDF that splits a CSV into a table variable
WHERE [Value] > 0
--if we are here its been successful ie no deadlock or blocking going on
COMMIT
-- therefore we can leave our loop
BREAK
END TRY
-- otherwise we have caught an error!
BEGIN CATCH
--always rollback
ROLLBACK
-- Now check for Blocking errors 1222 or Deadlocks 1205 and if its a deadlock wait for a while to see if that helps
IF ERROR_NUMBER() = 1205 OR ERROR_NUMBER() = 1222
BEGIN
-- if its a deadlock wait 2 seconds then try again
IF ERROR_NUMBER() = 1205
BEGIN
-- wait 2 seconds to see if that helps the deadlock
WAITFOR DELAY '00:00:02'
END
-- no need to wait for anything for BLOCKING ERRORS as our LOCK_TIMEOUT is going to wait for half a second anyway
-- and if it hasn't finished by then (500ms x 3 attempts = 1.5 seconds) there is no point waiting any longer
END
-- increment and try again for 3 goes
SELECT @Tries = @Tries + 1
-- we carry on until we reach our limit i.e 3 attempts
CONTINUE
END CATCH
END
Tuesday, 16 July 2013
MySQL Server won't restart
MySQL Server won't restart
Today I went to restart MySQL from my SSH console with the following command:
/etc/init.d/mysql restart
However even though the database server stopped it wouldn't restart.
I tried opening another console and running the status command.
/etc/init.d/mysql status
But this just told me it was stopped and a start command kept failing.
Even when I went into my VirtualMin website that manages my virtual server the service wouldn't restart.
I dug into the services and databases and tried accessing a database from VMIN and saw a message saying the system couldn't retrieve a list of databases. Further digging gave me this error:
can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock'
Now I knew I had changed some settings in the my.cnf configuration file the other day for performance tuning. So I searched the web and found that I had added a command that wasn't supported by my older version i.e MySQL 5.0.51.
The command in question was:
skip-external-locking
However Java had crashed, so I couldn't access the file to edit it easily.
I quickly ran a:
CHMOD 777 /etc/mysql/my.cnf
command to allow me to edit the file from FTP and then I swapped the new command with the older one which my version of MySQL supported:
skip-locking
I copied the file back and then hey presto a start command got the server back and running:
/etc/init.d/mysql start
I then made sure to CHMOD the file back so it couldn't be written by the website.
However on checking my website I only got to see the theme but NO articles.
It was a Wordpress site and the problem was probably WP-Super-Cache caching a page without data. I needed to run a REPAIR command on my wp_posts table to ensure all posts were visible again.
This is something I have seen many times before with hard reboots. The system comes back up but no articles appear. I always REPAIR and OPTIMIZE my wp_posts and wp_postmeta tables to rectify this.
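For reference, the repair and optimise steps are just standard MySQL statements that you can run from the MySQL console or phpMyAdmin. This assumes the default wp_ table prefix:
REPAIR TABLE wp_posts;
REPAIR TABLE wp_postmeta;
OPTIMIZE TABLE wp_posts;
OPTIMIZE TABLE wp_postmeta;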
This obviously locked the database up, as well as consuming 99% CPU whilst it ran - something which really annoys me, but afterwards the site was working.
So if you have been performance tuning your own MySQL database make sure you are not adding directives that your server version doesn't support. A failed restart after a config change is a good sign that an unsupported directive is to blame.
Wednesday, 10 July 2013
Apache Performance Tuning BASH Script
BASH Script to tune Apache Configuration Settings
There are so many configuration options, at so many different levels, that need tuning to get optimal performance that it is a nightmare to find the right information. There are also too many people offering various solutions for Wordpress / Linux / Apache / MySQL configuration.
Different people recommend different sizes for your config values and just trying to link up server load with page/URL/script requests to find out the cause of any performance issue is a nightmare in itself.
I would have thought there would be a basic tool out there that could log server load, memory and disk swapping over time and then link that up with the MySQL slow query log and the Apache error AND access logs, so that when you had issues you could easily tell what processes were running, which URLs were being hit and how much activity was going on, to identify culprits for tuning. I have even thought of learning PERL just to write one - not that I want to!
Even with all the MySQL tuning possible, caching plugins installed and memory limits on potentially intensive tasks it can be a nightmare to get the best out of a 1GB RAM, 40GB Virtual Server that is constantly hammered by BOTS, Crawlers and humans. I ban over 50% of my traffic and I still get performance issues at various times of the day - why? I have no FXXING idea!
Without throwing RAM at the problem you can try and set your APACHE values in the config file to appropriate values for your server and MPM fork type.
For older versions of Apache the prefork MPM (a non-threaded, pre-forking web server) is well suited as long as the configuration is correct. However it can consume lots of memory if not configured correctly.
For newer versions (2+) the worker MPM is considered better for high-traffic servers, as each thread handles one connection at a time and the memory footprint is smaller. However getting PHP working on this setting apparently needs a lot of configuration and you should read up about it before considering a change.
Read about Apache performance tuning here Apache Performance Tuning.
To find out your current apache version from the console run
apache2 -v OR httpd -v (depending on your server type, if you run top and see apache2 threads then use apache2 otherwise use httpd)
You will get something like this.
Server version: Apache/2.2.9 (Debian)
Server built: Feb 5 2012 21:40:20
To find out your current module configuration from the console run
apache2 -V OR httpd -V
Server version: Apache/2.2.9 (Debian)
Server built: Feb 5 2012 21:40:20
Server's Module Magic Number: 20051115:15
Server loaded: APR 1.2.12, APR-Util 1.2.12
Compiled using: APR 1.2.12, APR-Util 1.2.12
Architecture: 64-bit
Server MPM: Prefork
threaded: no
forked: yes (variable process count)
etc etc etc...
There are lots of people giving "suitable" configuration settings for the various Apache options, but one thing you need to do, if you run TOP and notice high memory usage and especially high virtual memory usage, is try to reduce disk swapping.
I have noticed that when Apache is consuming a lot of memory your virtual memory (disk based) usage will be high and you will often experience either high server loads and long waits for pages to load, OR very low server loads e.g 0.01-0.05, an unresponsive website and lots of MySQL Server Gone Away messages in your error log file.
You need to optimise your settings so that disk swapping is minimal, which means trying to optimise your MySQL settings using the various MySQL tuning tools I have written about, as well as working out the right size for your Apache configuration values.
One problem is that if you use up your memory by allowing MySQL to have enough room to cache everything it needs then you can find yourself with little left for Apache. Depending on how much memory each process consumes you can easily find that a sudden spike in concurrent hits uses up all available memory and starts disk swapping.
Therefore, apart from MySQL using the disk to run or cache large queries, you need to find the right number of clients to allow at any one time. If you allow too many and don't have enough memory to contain them all then the server load will go up, people will wait, and the amount of disk swapping will increase and increase until you enter a spiral of doom that only a restart fixes.
It is far better to allow fewer connections and serve them up quickly with a small queue and less waiting than open too many for your server to handle and create a massive queue with no hope of ending.
One of the things you should watch out for is Twitter Rushes caused by automatically tweeting your posts to Twitter accounts, as this can cause 30-50 BOTS to hit your site at once. If they consume all your memory then it can cause a problem that I have written about before.
Working out your MaxClients value
To work out the correct number of clients to allow you need to do some maths and to help you I have created a little bash script to do this.
What it does is find the size of the largest Apache process and then temporarily stop and restart Apache so that the correct amount of free memory can be measured.
It then divides the free memory by the Apache process size. The value you get should be roughly the right value for your MaxClients.
It will also show you how much swap (virtual memory) you are using, as well as the size of your MySQL process.
I noticed on my own server that when it was under-performing I was using twice as much swap as I had RAM. However when I re-configured my options and gave the system enough RAM to accommodate all the SQL / Apache processes it worked fine with little swapping.
Therefore if the virtual memory you are using is greater than your total RAM, e.g you are using 1.5GB of swap but only have 1GB of RAM, the script will show a warning message.
Also, as a number of Apache tuners claim that your MinSpareServers should be 10-25% of your MaxClients value and your MaxSpareServers 25-50% of it, I have included the calculations for these settings as well.
#!/bin/bash
echo "Calculate MaxClients by dividing biggest Apache thread by free memory"
if [ -e /etc/debian_version ]; then
APACHE="apache2"
elif [ -e /etc/redhat-release ]; then
APACHE="httpd"
fi
APACHEMEM=$(ps -aylC $APACHE |grep "$APACHE" |awk '{print $8}' |sort -n |tail -n 1)
APACHEMEM=$(expr $APACHEMEM / 1024)
SQLMEM=$(ps -aylC mysqld |grep "mysqld" |awk '{print $8}' |sort -n |tail -n 1)
SQLMEM=$(expr $SQLMEM / 1024)
echo "Stopping $APACHE to calculate the amount of free memory"
/etc/init.d/$APACHE stop &> /dev/null
TOTALFREEMEM=$(free -m |head -n 2 |tail -n 1 |awk '{free=($4); print free}')
TOTALMEM=$(free -m |head -n 2 |tail -n 1 |awk '{total=($2); print total}')
SWAP=$(free -m |head -n 4 |tail -n 1 |awk '{swap=($3); print swap}')
MAXCLIENTS=$(expr $TOTALFREEMEM / $APACHEMEM)
MINSPARESERVERS=$(expr $MAXCLIENTS / 4)
MAXSPARESERVERS=$(expr $MAXCLIENTS / 2)
echo "Starting $APACHE again"
/etc/init.d/$APACHE start &> /dev/null
echo "Total memory $TOTALMEM"
echo "Free memory $TOTALFREEMEM"
echo "Amount of virtual memory being used $SWAP"
echo "Largest Apache Thread size $APACHEMEM"
echo "Amount of memory taking up by MySQL $SQLMEM"
if [[ $SWAP -gt $TOTALMEM ]]; then
ERR="Virtual memory is too high"
else
ERR="Virtual memory is ok"
fi
echo "$ERR"
echo "Total Free Memory $TOTALFREEMEM"
echo "MaxClients should be around $MAXCLIENTS"
echo "MinSpareServers should be around $MINSPARESERVERS"
echo "MaxSpareServers should be around $MAXSPARESERVERS"
If you get 0 for either of the last two values then consider increasing your memory or working out what is causing your memory issues. Either that or set your MinSpareServers to 2 and MaxSpareServers to 4.
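Once you have the numbers they go into the prefork section of your Apache config (apache2.conf or httpd.conf depending on your distro). A rough sketch only, with example values rather than recommendations for your server:
<IfModule mpm_prefork_module>
    StartServers          4
    MinSpareServers       4
    MaxSpareServers       8
    MaxClients           16
    MaxRequestsPerChild 1000
</IfModule>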
There are many other settings you can find appropriate values for, but adding indexes to your database tables and ensuring your database table/query caches fit in memory rather than being swapped to disk is a good way to improve performance without resorting to more caching at all the various levels Wordpress/Apache/Linux users love adding.
If you do use a caching plugin for Wordpress then I would recommend tuning it so that it doesn't cause you problems.
At first I thought WP SuperCache was a solution and pre-caching all my files would speed things up due to static HTML being served quicker than PHP.
However I found that the pre-cache stalled often, caused lots of background queries to rebuild the files which consumed memory and also took up lots of disk space.
If you are going to pre-cache everything then hold the files as long as possible as if they don't change there seems little point in deleting and rebuilding them every hour or so and using up SQL/IO etc.
I have also turned off gzip compression in the plugin and enabled it at Apache level instead. It seems pointless doing it twice, and compressing in PHP will use more resources than letting the server do it.
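For example, a minimal sketch of Apache-level compression using mod_deflate (assuming the module is enabled, e.g with a2enmod deflate on Debian) would be something like this in your Apache config:
<IfModule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css application/javascript
</IfModule>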
The only settings I have enabled in WP-Super-Cache at the moment are:
- Don’t cache pages with GET parameters. (?x=y at the end of a url)
- Cache rebuild.
- Serve a supercache file to anonymous users while a new file is being generated.
- Extra homepage checks. (Very occasionally stops homepage caching)
- Only refresh current page when comments made.
- Cache Timeout is set to 100000 seconds (why rebuild constantly?)
- Pre-Load - disabled.
Also in the Rejected User Agents box I have left it blank as I see no reason NOT to let BOTS like googlebot create cached pages for other people to use. As bots will most likely be your biggest visitor it seems odd to not let these BOTS create cached files.
So far this has given me some extra performance.
Hopefully the tuning I have done tonight will help the issue I am getting of very low server loads, MySQL gone away errors and high disk swapping. I will have to wait and see!