Wednesday, 5 October 2016

Disk Full - Linux - Hacked or Full of Log Files?

By Strictly-Software

This morning I woke up to find the symptoms of a hack attempt on my LINUX VPS server.

I had the same symptoms a few years ago when I was hacked via ShockWave and some monkey overwrote a config file, so that when I rebooted, hoping to fix the server, the config was reloaded from a script hidden on a US car site.

The car site's owners probably had no idea that the script was on their server either, but it basically enabled various hacking methods, and a WGet command in the config file ensured that my standard config was overwritten every time the server was restarted.

Another symptom was that my whole 80GB of disk space had suddenly filled up.

It had been around 30GB the night before, and now, with 30-odd HD movies hidden in a secret folder buried deep in the drive, I could not FTP anything up to the site, receive or send emails, or manually append content to my .htaccess file to give only my IP full control.

My attempts to free space by clearing cached files were useless, and it was only by burrowing through the hard drive folder by folder all night, using the following command to show me the biggest files and folders (visible and hidden), that I found the offending folder and deleted it.

du -hs $(ls -A)

However good this command is for finding files and folders and showing their size in KB, MB or GB, it is a laborious task to manually go from your root directory running the command over and over again until you find the offending folder(s).

So today when I thought I had been hacked I used a different process to find out the issue.

The following BASH script can be run from anywhere on your system in a console window and you can either enter a path if you think you know where the problem lies or just enter / when prompted to scan the whole machine.

It will list first the 20 biggest directories in order of size and then the 20 largest files in order of size.

echo -n "Type Filesystem: ";
resize;clear;date;df -h $FS;
echo "Largest Directories:"; 
du -x $FS 2>/dev/null| sort -rnk1| head -n $NUMRESULTS| awk '{printf "%d MB %s\n", $1/1024,$2}';
echo "Largest Files:"; 
nice -n 20 find $FS -mount -type f -ls 2>/dev/null| sort -rnk7| head -n $NUMRESULTS|awk '{printf "%d MB\t%s\n", ($7/1024)/1024,$NF}'
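
To use it, save those lines into a file (I'll assume a hypothetical name of diskcheck.sh here), make it executable and run it, then enter the path to scan when prompted:

chmod +x diskcheck.sh
./diskcheck.sh
# at the "Type Filesystem:" prompt enter a path such as /var/www, or just / to scan the whole machine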

After running it I found that the problem was not actually a security breach but a plugin folder within one of my websites that was full of log files. Without me noticing, the number of archived log files had crept up so much that they had eaten 50GB of space.

As the folder contained both current and archived log files I didn't want to just truncate it or delete everything; instead I removed only the archived log files by using a wildcard search for the word ARCHIVED within the filename.
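
In my case the archived files all sat inside the one plugin folder, so a simple wildcard delete in that directory was enough. A minimal sketch, assuming a made-up log folder path and the ARCHIVED naming convention:

# move into the plugin's log folder (hypothetical path) and check what would match first
cd /var/www/mysite/wp-content/plugins/my-plugin/logs
ls -lh *ARCHIVED*

# then remove only the archived files, leaving the live logs alone
rm -f ./*ARCHIVED*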


If you wanted to run a recursive find and delete within a folder then you may want to use something a bit different such as:

find . -type f -name '*ARCHIVED*' -delete

This managed to remove a whole 50GB of files within 10 minutes and, like lightning, my sites, email and server started running again as they should.

So the moral of the story is that a full disk should first be treated as a possible symptom of a hacked server, especially if you were not expecting it, and the same methods can be used to diagnose and fix the problem whether you have been hacked or have simply let your server fill itself up with log files or other content.

Therefore keep an eye on your system so you are not caught out if this happens to you. If your usage does suddenly jump from 35GB to 80GB and you stop receiving emails or being able to FTP content up (or files are copied up as 0 bytes), then you should immediately put some security measures in place.

My WordPress survival guide on security has some good options to use if you have been hacked, but as standard you can do some things to protect yourself, such as:

  • Replacing the default BASH shell with the more basic, older and more secure DASH. You can still run BASH once logged into your console, but it should not be the default shell, allowing hackers to run complex commands on your server.
  • You should always use SFTP instead of FTP as it's more secure, and you should change the default SSH port from 22 to another number in the config file so that standard port scanners don't spot that your server is open and vulnerable to attack (see the sketch after this list).
  • If you are running VirtualMin on your server you should also change its default port from 10000 to another number. Otherwise attackers will just swap from SSH attacks via the console to web attacks where the front end is less protected. Also NEVER store the password in your browser, in case you forget to lock your PC one day or your browser's local SQLite database is hacked and the passwords compromised.
  • Ensuring your root password and every other user password is strong. Making passwords by joining up phrases or memorable sentences where you swap the capital and non-capital letters over is a good idea, and always add a number to the start or end (or both) as well as some special characters, e.g. 1967bESTsAIDfRED*_* would take a dictionary cracker a very long time to break.
  • Regularly change your root and other user passwords in case a keylogger has been installed on your PC and discovered them.
  • Also, by running DenyHosts and Fail2Ban on your server you can ensure anyone who gets the SSH password wrong 3 times in a row is blocked and unable to access your console or SFTP files up to your server. If you lock yourself out you can always use the VirtualMin web front end (if installed) to log in and remove yourself from the DenyHosts list.
  • If you are running WordPress there are a number of other security tools, such as the WordPress Firewall plugin, which will hide your wp-admin login page away behind another URL and redirect people trying to access it to another page. I like the URL myself. It can also ban people who fail to log in after a number of attempts for a set amount of time, as well as offering a number of other security features.
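
As an example of the SSH port change mentioned in the list above, here is a minimal sketch assuming a Debian/Ubuntu style box where the SSH daemon reads /etc/ssh/sshd_config; the port 2222 is only an illustration, so pick your own and remember to open it in your firewall before restarting:

# edit the SSH daemon config
sudo nano /etc/ssh/sshd_config

# change the line:
#   Port 22
# to something non-standard such as:
#   Port 2222

# restart the SSH daemon so the change takes effect (the service name can vary by distro)
sudo service ssh restart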

Most importantly of all, regularly check the amount of free space you have on your server and turn off any logging that is not required.

Getting up at 5.30AM to send an email only to believe your site has been hacked due to a full disk is not a fun way to spend your day!

By Strictly-Software

 © 2016 Strictly-Software

A Karmic guide for Scraping without being caught

Quick Regular Expression to clean off any tracking codes on URLS

I have to deal with scrapers all day long in my day job and I ban them in a multitude of ways, from firewalls and .htaccess rules to my own personal logger system that checks the duration between page loads, behaviour, and many other signals.

However, I also have to scrape HTML content myself sometimes for various reasons, such as finding a piece of content about somebody on another website that links to my own. So I know both sides: how to detect and stop scrapers, and how scraping is done.

This is a guide to the various methods scrapers use to avoid being caught and having their IP address added to a blacklist within minutes of starting. Knowing the methods people use to scrape sites will help you defend your own from scrapers, so it's good to know both attack and defense.

Sometimes it is just too easy to spot a script kiddy who has just discovered CURL and thinks it's a good idea to test it out on your site by crawling every single page and link available.

Usually this is because they have downloaded a script from the net, sometimes a very old one, and not bothered to change any of the parameters. Therefore, when you see a user-agent in your logfile that is hammering you and is simply "CURL", you can block it and know you will be blocking many other script kiddies as well.

I believe that when you are scraping HTML content from a site it is always wise to follow some golden rules based on Karma. It is not nice to have your own site hacked or taken down by a BOT gone wild, so you shouldn't wish this on other people either.

Behave when you are doing your own scraping and hopefully you won't find your own sites content appearing on a Chinese rip off under a different URL anytime soon.

1. Don't overload the server you are scraping.

This only lets the site admin know they are being scraped, as your IP / user-agent will appear in their log files so regularly that you might be mistaken for attempting a DOS attack. You could find yourself added to a block list ASAP if you hammer the site you are scraping.

The best way to get round this is to put a time gap in between each request you make. If possible, follow the site's Robots.txt file if they have one and use any Crawl-Delay parameter they may have specified. This will make you look much more legitimate as you are obeying their rules.

If they don't have a Crawl-Delay value then randomise a wait time in between HTTP requests, with at least a few seconds as the minimum. If you don't hammer their server and slow it down you won't draw attention to yourself.

Also, if possible, try to always obey the site's Robots.txt file; if you do, you will find yourself on the right side of the Karmic law. There are many tricks people use, such as dynamic Robots.txt files and fake URLs placed within them, to trap scrapers who break the rules by following DISALLOWED locations into honeypots, never-ending link mazes or just instant blocks.

An example of a simple C# Robots.txt parser I wrote many years ago that can easily be edited to obtain the Crawl-Delay parameter can be found here: Parsing the Robots.txt file with C-Sharp.
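
If you are scraping from a shell script rather than C#, the same idea can be sketched with curl, grep and sleep. This is a rough sketch, assuming a hypothetical target of example.com whose robots.txt uses the standard Crawl-Delay directive; when no value is found it falls back to a random wait of a few seconds:

#!/bin/bash
# look for a Crawl-Delay value in the site's robots.txt
DELAY=$(curl -s "http://example.com/robots.txt" | grep -i "crawl-delay" | grep -o "[0-9]\+" | head -n 1)

# if no Crawl-Delay was specified, wait a random 3 to 7 seconds between requests instead
if [ -z "$DELAY" ]; then
    DELAY=$(( (RANDOM % 5) + 3 ))
fi

for URL in "http://example.com/page1" "http://example.com/page2"; do
    curl -s "$URL" -o /dev/null
    sleep "$DELAY"
done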

2. Change your user-agent in-between calls. 

Many offices share the same IP across their network due to the outbound gateway server they use, and many ISPs re-use the same IP address for multiple home users, e.g. through DHCP. Therefore, until IPv6 is 100% rolled out, there is no easy way to guarantee that banning a user by IP address alone will get your target.

Changing your user-agent in-between calls and using a number of random and current user-agents will make this even harder to detect.

Personally, I block all access to my sites using a list of BOTS I know are bad, or where it is obvious the person has not edited the user-agent (CURL, Snoopy, WGet etc), plus IE 5, 5.5 and 6 (all the way up to 10 if you want).

I have found one of the most common user-agents used by scrapers is IE 6. Whether this is because the person using the script has downloaded an old tool with this as the default user-agent and not bothered to change it or whether it is due to the high number of Intranet sites that were built in IE6 (and use VBScript as their client side language) I don't know.

I just know that by banning IE 6 and below you can stop a LOT of traffic. Therefore never use old IE browser UAs and always change the default UA from CURL to something else, such as Chrome's latest user-agent.

Using random numbers, dashes, very short user-agents or defaults is a way to get yourself caught out very quickly.
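
As a rough sketch of rotating user-agents from a shell script with curl (the agent strings below are only examples and will date quickly, so swap in ones copied from current browsers):

#!/bin/bash
# a small pool of realistic browser user-agents (examples only - keep them up to date)
AGENTS=(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"
  "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0"
)

# pick a random agent for each request rather than leaving curl's default "curl/x.x" user-agent
UA=${AGENTS[$RANDOM % ${#AGENTS[@]}]}
curl -s -A "$UA" "http://example.com/some-page" -o page.html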

3. Use proxies if you can. 

There are basically two types of proxy.

The first is one where the owner of the computer knows it is being used as a proxy server, either generously, to allow people in countries such as China or Iran to access outside content, or maliciously, to capture the requests and details for hacking purposes.

Many legitimate online proxy services such as "Web Proxies" only allow GET requests, float adverts in front of you and prevent you from loading up certain material such as videos, JavaScript loaded content or other media.

A decent proxy is one where you obtain the IP address and port number and then set them up in your browser or BOT to route traffic through. You can find many free lists of proxies and their port numbers online, although as they are free you will often find speed is an issue because many people are trying to use them at the same time. A good site to use to obtain proxies by country is

Common proxy port numbers are 8000, 8008, 8888, 8080, 3128. When using P2P tools such as uTorrent to download movies it is always good to disguise your traffic as HTTP traffic rather than using the default setting of a random port on each request. It makes it harder but obviously not impossible for snoopers to see you are downloading bit torrents and other content. You can find a list of ports and their common uses here.

The other form of proxy is a BOTNET, or computers where ports have been left open and people have reverse-engineered access so that they can use the computer/server as a proxy without the owner's knowledge.

I have also found that many people who try hacking or spamming my own sites are themselves using insecure servers. A port scan often reveals that their own server can be used as a proxy. If they are going to hammer me then sod them, I say, as I watch US TV live through their server.
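
Once you have an IP and port from one of those lists, routing a request through it with curl is simple. A minimal sketch with a made-up proxy address and a couple of the common ports mentioned above:

# route the request through the proxy instead of your own IP (address and port are made up)
curl -s -x http://123.45.67.89:8080 "http://example.com/page" -o page.html

# free proxies are often slow or dead, so set timeouts and be prepared to fall back to another one
curl -s -x http://123.45.67.89:3128 --connect-timeout 10 --max-time 30 "http://example.com/page" -o page.html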

4. Use a rented VPS

If you only need to scrape for a day or two then you can hire a VPS and set it up so that you have a clean, non-blacklisted IP address to crawl from. With services like AmazonAWS and other rent-by-the-minute servers it is easy to move your BOT from server to server if you need to do some heavy-duty crawling.

However on the flipside I often find myself banning the AmazonAWS IP range (which you can obtain here) as I know it is so often used by scrapers and social media BOTS (bandwidth wasters).

5. Confuse the server by adding extra headers

There are many headers that can tell a server whether you are coming through a proxy, such as X-FORWARDED-FOR, and there is standard code developers use to work backwards through these headers (falling back to REMOTE_ADDR) to obtain the original client IP address, which can then be used to locate you through a Geo-IP lookup.

However not so long ago, and many sites still may use this code, it was very easy to trick sites in one country into believing you were from that country by modifying the X-FORWARDED-FOR header and supplying an IP from the country of your choice.

I remember it was very simple to watch Comedy Central and other US TV shows online just by using a Firefox Modify Headers plugin and entering a US IP address for the X-FORWARDED-FOR header.

Due to the code they were using, they obviously assumed that the presence of the header meant a proxy had been used, and that the country of origin was given by the spoofed IP address in this modified header rather than by the value in the REMOTE_ADDR header.

Whilst this code is not so common anymore it can still be a good idea to "confuse" servers by supplying multiple IP addresses in headers that can be modified to make it look like a more legitimate request.

As the actual REMOTE_ADDR value is set from the connection itself rather than from a header you send, you cannot easily change it. However, you can supply a comma-delimited list of IPs from various locations in headers such as X-FORWARDED-FOR, HTTP_X_FORWARDED, HTTP_VIA and the many others that proxies, gateways and different servers use when passing HTTP requests along the way.

Plus you never know, if you are trying to obtain content that is blocked from your country of origin then this old technique may still work. It all depends on the code they use to identify the country of an HTTP requests origin.
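
A hedged sketch of what that looks like with curl; the IP addresses here are invented for illustration, and whether the target site trusts any of these headers depends entirely on the code it runs:

# supply spoofed proxy-style headers alongside an otherwise normal request
curl -s "http://example.com/page" \
     -H "X-Forwarded-For: 23.45.67.89, 98.76.54.32" \
     -H "Via: 1.1 proxy.example.net" \
     -H "X-Forwarded-Host: example.com" \
     -o page.html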

6. Follow unconventional redirect methods.

Remember there are many levels at which a scrape can be blocked, so making it look like a real request is the ideal way of getting your content. Some sites will use intermediary pages with a META Refresh of "0" that then redirect to the real page, or use JavaScript to do the redirect, such as:

<body onload="window.location.href=''">

or a JavaScript function along these lines:

function redirect(){
    window.location.href = ''; // the real page's URL goes here
}
Therefore you want a good super scraper tool that can handle this kind of redirect so you don't just return adverts and blank pages. Practice those regular expressions!
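
As a rough example of handling the META Refresh style of intermediary page from a shell script (assuming the tag follows the common content="0;url=..." format), you can pull the target URL out of the first response and request it in a second call:

# fetch the intermediary page first
curl -s "http://example.com/landing" -o landing.html

# pull the target URL out of a meta refresh tag such as:
#   <meta http-equiv="refresh" content="0;url=http://example.com/real-page">
TARGET=$(grep -io 'http-equiv="refresh"[^>]*url=[^">]*' landing.html | sed 's/.*url=//I' | head -n 1)

# then request the real page the site wanted to send you to
[ -n "$TARGET" ] && curl -s "$TARGET" -o real-page.html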

7. Act human.

By making only one GET request to the main page, and none to the images, CSS or JavaScript files that the page loads in, you make yourself look like a BOT.

If you look through a log file it is easy to spot crawlers and BOTs because they don't fetch these extra files, and as a log file is mainly sequential you can easily spot the requests made by one IP or user-agent just by scanning down the file and noticing all the single GET requests from that IP to different URLs.

If you really want to mask yourself as human then use a regular expression or HTML parser to get all the related content as well.

Look for any URLs within SRC and HREF attributes, as well as URLs contained within JavaScript that are loaded up with AJAX. It may slow your own code down and use up more of your bandwidth (and that of the server you are scraping), but it will disguise you much better and make it harder for anyone looking at a log file to spot you as a BOT with a simple search.
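
A rough sketch of pulling the asset URLs out of a page and requesting them as well; it only catches simple absolute URLs in src and href attributes and will miss anything that is built up in JavaScript:

# fetch the page itself first
curl -s "http://example.com/page" -o page.html

# pull out absolute URLs from src and href attributes (images, CSS, JavaScript and so on)
grep -ioE '(src|href)="http[^"]+"' page.html | sed 's/^[^"]*"//;s/"$//' | sort -u > assets.txt

# request each asset with a short pause so the log file looks more like a real browser visit
while read -r ASSET; do
    curl -s "$ASSET" -o /dev/null
    sleep 1
done < assets.txt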

8. Remove tracking codes from your URLs.

This is so that when the SEO "guru" looks at their stats they don't confuse their tiny little minds by not being able to work out why it says 10 referrals from Twitter but only 8 had JavaScript enabled or had the tracking code they were using for a feed. This makes it look like a direct, natural request to the page rather than a redirect from an RSS or XML feed.

Here is an example of a regular expression that removes the query string from a URL, including the question mark.

The example uses PHP but the expression itself can be used in any language.

$url = "";

$url = preg_replace("@(^.+)(\?.+$)@","$1",$url);
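
If you are doing the same clean-up from a shell script rather than PHP, a simple sed one-liner does the job; the URL here is made up for illustration:

# strip the query string (and the question mark) from the end of a URL
echo "http://example.com/article?utm_source=twitter&utm_medium=feed" | sed 's/?.*$//'
# outputs: http://example.com/article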

There are many more rules to scraping without being caught but the main aspect to remember is Karma. 

What goes around comes around; if you scrape a site so heavily that it costs the owner more bandwidth and money than they can afford, do not be surprised if someone comes along and does the same to you at some point!

Tuesday, 23 August 2016

The Naming and Shaming of programming tightwads

Let the Shame List begin

Just like the News of the World when they published their list of paedophiles, nonces and kiddy fiddlers, I am now creating my own list of shame, which will publicly list the many people who have contacted me and done any of the following:

1. Asked for a new feature to be developed for one of my Wordpress plugins that only they required. Then once I have delivered the upgrade they don't even say "Thank You".

In fact 9 out of 10 times I don't even get the smallest of donations even when I have been promised them beforehand. I have lost count of the people who email me promising to donate me money if only I do this or that but when I do it they seem to forget how to click that big yellow DONATE button in the plugin admin page.

Do these people really think I live only to serve their useless coding skills by implementing features they themselves are too unskilled to develop or too tight to pay for? Is this really what people expect from Open Source code? I don't mind if you cannot code and add the feature or fix the bug yourself but if you can't then at least have the decency to donate some money for my time. Is that too much to ask for?

2. The other group of people (and there are many) are those who email me at stupid times of the morning, 4am sometimes, demanding that I fix their site immediately because my plugin is "not working".

In fact 99 out of 100 times it is usually the case that they have either been a numpty and not followed or understood the instructions, deleted some or all of the files, or haven't set the relevant permissions up correctly.

Not only do I try to make all my Wordpress plugins easy for the non-technical to use by outputting detailed error messages that explain what they must do to fix the problem, but most plugins have a "Test Configuration" button that will run all the necessary tests and then list any problems as well as fixes for them.

If these people cannot even read and understand error messages such as "Filepath does not exist at this location" because they have been silly enough to delete that file or folder then why should I offer free 24 hour support for them?

Here's an idea. If I email you back with steps to fix your incompetence - donate me some money.

Believe it or not I don't help people out for fun or offer free 24 hour support for FREE products.

You get what you pay for! 

If you are too tight to offer to pay me to develop your custom feature or too tight to even donate the smallest amount when demanding (as I have had on numerous occasions) that I do X Y or Z by next Tuesday then why should I bend over to help you?

3. Then there are those companies (even some that have been big multi-nationals) that email me asking for relevant licences to be added to my downloadable scripts so that they can use them in their own projects. Probably projects that they will be making lots of money from by re-selling my code. Yet they refuse to donate even the slightest amount to the cause.

4. Finally, and most important, are the SEO Scammers, which you can read about in more detail here. They are advertisers who offer you money to post articles on your site, yet when you do they tell you that you will be paid in 20+ days. Why so long for so little money I have no idea. On multiple occasions now I have been SEO scammed where they fail to pay me, yet despite this, and despite me taking the article down, they have still gained from the link juice passed along by links without rel="nofollow" on them and from the site/domain authority.

It is very surprising how long PR link juice and authority stays around after the fact. Experiments we did showed that when we set up a pyramid system, with one site at the top with zero links and 100 or so sites with PR 4-6 all linking to that site's homepage, the top site zoomed up Google's rankings. Even when we stopped the experiment, the referrals from these sites (despite there no longer being any links) stayed around in Google Webmaster Tools reports for months and months afterwards.

In future I am going to name and shame on my blog every person and company who carries out one of these actions. There are many other places you can do this on the web, the dark web and even Facebook, so it is worth checking these places for names, emails and company addresses before doing any work with them.

A basic tech review is also advised to see where the company is based with a WHOIS and DNS search.

It might help those other developers considering open source development to realise that it's a dead end that causes more hassle than it's worth. If you think you're going to get rich developing scripts that can easily be stolen, downloaded, re-used and modified then you are living in a fantasy world.

Let the shaming begin.

Just to let you know, since I started this list I have had quite a few people donate money and I have removed their names from the list. I am not heartless and I don't want people searching Google for their own name to find this page first.

Therefore you know what to do, pay me the money you owe me or make a donation.

Sebastian Long - This was an advertiser who offered a measly £60 for putting up an article (not exactly much) on my racing site for the Goodwood Sussex Stakes held in July. I posted the exact article he wanted and even added extra SEO to help him, but he didn't want any of that so I took it out. Once he was happy he said I would be paid within 20 days - it would have been nice for him to tell me this beforehand, but I am too trusting, although that is slowly diminishing. It has so far been over 20 days (and 20 working days) and I have not been paid. I have contacted him multiple times and have now taken the article down. However, his name and his company will remain on the numerous SEO / advertiser blacklists that I put him on due to his lack of respect in honouring a very simple contract.

Kevin Clark - who did not know how to set up a sitemap ("press the build button") and wanted help "fixing" issues that were not broken, which I duly gave. No donation received.

Raymond Peytors - who asked about the now unsupported pings to ASK or Yahoo - these have not been supported for sitemaps for a long time now. No donation received.

Mike Shanley - who did not seem to know how to read the "Read Me" text that comes with Wordpress plugins. On all my plugins I add a "Test Set-up" button which runs through the setup and displays problems and solutions for the user. The Read Me guide also explains how to run the Test when installing the plugin. Donation? Not on your nelly.

Juergen Mueller - For sending me an error message related to another plugin that he thought somehow was related to my plugin. This was due to the memory limit of his server/site being reached by said plugin.
Despite having all the details within the error message needed to fix it, he still decided to email me for help. Despite me explaining how to fix the problem, and the steps he should take in future to fix such problems himself, I did not get a donation.

Holder Heiss - who, even though he had read my disclaimer saying that I don't give away support for free, still asked me and received free help. He tried to motivate me into solving his problem with the following sentence:

 "I understand that you are cautious about giving free support for your free software. Anyway as I like using and would like to continue using the google sitemap plugin, maybe I can motivate you to have a look on this topic reported by several users:"

Even though he had not donated me any money I still checked the system, upgraded my software and looked for a problem that I could not find - probably related to another plugin. You get what you pay for, and he got at least an hour's worth of support for free!

Pedro Galvez Dextre - who complained about the error message "Sitemap Build Aborted. The Server Load was 4.76 which is equal to or above your specified threshold of 0.9" and asked what was wrong????

Cindy Livengood - who couldn't be bothered to read the Readme.txt file as it "bored her", even though it contained an example post which would show whether the plugin was working or not.

There have been many other people but I only have so much time to go through my inbox.

By the way if your name is on this list or appears on it in future and you would like it removed then you know what to do - a donate button is at the bottom of each plugin, on my website, on my blog and many other places.

Please remember people - I have had a serious illness and I am still in lots of pain. Therefore I have stopped supporting my Wordpress plugins for this reason PLUS the lack of donations I have received.

I work at a company where I am charged out at £700 a day therefore a donation of £10 is not going to make me work for a day or two on a plugin that is open-source and should be taken as such.

You get what you pay for and I wrote these plugins for myself not for anyone else.

I put them up on Wordpress to see if anyone else found them useful. If you do not like them then use another plugin.

If you want professional support then be prepared to pay for it. If not read on and follow these basic debugging steps and use Google. That's how I had to learn LINUX and WordPress!

As stated in my Readme.txt file of my Sitemap plugin:

I have an error - How to debug

If you have any error messages installing the plugin then please try the following to rule out conflicts with other plugins:
- Disable all other plugins and then try to re-activate the Strictly Google Sitemap plugin - some caching plugins can cause issues.
- If that worked, re-enable the plugins one by one to find the plugin causing the problem. Decide which plugin you want to use.
- If that didn't work, check you have the latest version of the plugin software (from WordPress) and the latest version of WordPress installed.
- Check you have Javascript and Cookies enabled.
- If you can code, turn on the DEBUG constant and debug the code to find the problem; otherwise contact me and offer me some money to fix the issue :)
- Please remember that you get what you pay for, so you cannot expect 24 hour support for a free product. Please bear that in mind if you decide to email me. A donation button is on my site and in the plugin admin page.
- If you must email me and haven't chosen to donate even the smallest amount of money please read this >> 
- If you don't want to pay for support then ask a question on the message board and hope someone else fixes the problem.

But I need this or that and your plugin doesn't do it

Sorry but tough luck.

I wrote this plugin for my own requirements, not anyone else's, and if you have conflicts with other plugins or require extra work then offer to pay me to do the development or do the work yourself.

This is what Open Source programming should be about.

I wrote this plugin as other Sitemap plugins didn't do what I wanted them to and you should follow the same rules.

If you don't like this plugin or require a new feature you must remember that you have already bought a good amount of my time for the princely sum of £0.

Hopefully the starting of the shaming will stop the influx of emails I constantly receive asking for help without donations.

Remember every one of my plugins has a donate button at the bottom of it and you get what you pay for!