Friday, 11 May 2018

Don't Fall For Trick Links - Use REL = NOOPENER and NOREFERRER in Browsers

For the XSS Hole In Browsers use NOOPENER and NOREFERRER

By Strictly-Software

As some of you might know, the rel attribute on anchor tags can be used for more than just nofollow (or, as some stupid SEO gurus think, "follow", which doesn't exist as a value).

It can also be used to protect users from trick links that open a new page which then runs code using the window.opener object to change the HTML of the page you have just come from before closing itself.

For example, Chrome is good at protecting against XSS attacks, even from the same origin, however even it can fall victim to trick links. Adding rel="noopener" to your links should stop the opened page from being able to modify the page that opened it through the window.opener object.

However in some browsers this does not work too well, and you should be aware of this, as some browsers like FireFox sometimes need the extra noreferrer value added to prevent the opened page from modifying your initial page.
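So when you link out to a page you don't control and open it in a new tab, it is worth adding both values to the rel attribute. A minimal example (the URL is just a placeholder):

<a href="http://www.example.com/some-page" target="_blank" rel="noopener noreferrer">Read this</a>

With noopener the new tab gets no usable window.opener reference, and noreferrer covers older browsers that only understand the noreferrer value.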

If you want to see an example of this in action then go to this link on my website www.strictly-software.com/test1.html.

Try this in FireFox first as that is the browser which doesn't seem to respect the noopener attribute.

Try it in your own browser of choice as well and play around with the code (copy and paste it to your local machine to tamper with and run). See whether removing noopener or noreferrer from the links makes any difference, and whether a plain link with no rel attribute passes the window.opener reference along at all.

There are two links on the page, the first when clicked should change the page you are looking at to the top part of a Facebook login screen.

Imagine getting a link in your messages, emails or on Facebook itself, clicking it, and finding that it seems you have been logged out.

So you log in again.

Only now the page is not Facebook but a page that my trick link, or a window.open('tamper.html','win') call, has made look like Facebook. The original page has had its HTML changed and the hacker is logging your email and password.

In this example I have just used an image so there is no danger of having your details stolen.

What the first page does is just offer the user a link to click. It might be from a friend or hacker but once clicked it will use target="_blank" to open a new window.

As soon as the window is open an onload event is fired that uses the window.opener object to gain access to its parent and change the HTML.

I have used a basic example here and an image of Facebook but you can see the code in action.

function RunScam(){
 // use the window.opener reference to rewrite the HTML of the page that opened this window
 window.opener.document.documentElement.innerHTML="<html><body><img style='width:1000px' src='FBTest.png' alt='Fake Facebook Page Example' /></body></html>";
 // close this window so the victim is left looking at the fake page
 window.close();
}


I use an onload="RunScam()" event handler to call the code above.

This function uses the window.opener object to reference the opener's document.documentElement and then its innerHTML property to reformat the page before closing the new window.

Remember I am just using an image here so there is no risk but the function could be extended to load in stylesheets, real inputs that record your passwords and look just as real as the site it is faking. It could be a bank, a social media site or any other kind of site people would want to get passwords for.

Once you have checked the fake link out try the next one.

It shouldn't do anything but open a blank window.

Remember the link is at www.strictly-software.com/test1.html as you may have to go back from the fake Facebook page.

So try this out in FireFox and you will see the importance of adding noopener, along with noreferrer, the older workaround from when FF didn't support noopener.

Monday, 15 January 2018

Quickly Grab Generated Source Code With One Click


By Strictly-Software

Now that my broken arm is getting better I will be writing more code. It still hurts like mad though; the arm bone didn't even look like it belonged anywhere near the shoulder where it was dislocated.

If you want WordPress plugins then go and check out the main site which I need to do some work on. I am also thinking of building an alternative search engine to get round Google's/CIA/NSA's de-ranking and demonetisation.

I used to have a Super Search Engine years ago that took the top 10 items from Google, BING and Yahoo, however they kept changing their source code until it all became AJAX loaded in on the fly and too hard to scrape.

I think that with alternative news being pushed down or deleted from the rankings, and pro-establishment news gaining viewers it would never have got a year ago due to Facebook's subservience to the USA and Israeli governments, more and more people will move to new decentralised social media platforms. Once that happens, Facebook and Google, who are already losing out to duckduckgo.com due to privacy concerns, will lose money in their share price as well as many members.

The problem is money of course, and too few people click on adverts or donate out of the kindness of their hearts.

I think that, like search.darkpolitriks.com, which has a starting page of core #altnews websites and podcasts, I could write my own and charge £10 for a relevant #altnews blog or channel to be added to the SERPs, just so that small alternative sites have the same chance of being found in results and sites like CNN and the BBC are weeded out.

It's the easiest way of creating a SERP: just ensure the site is relevant and not mainstream.

Anyway, I was fixing a bug today when I realised that it was a bookmark with an HTTP source on an HTTPS site that prevented the padlock from showing.

Sometimes I don't think people realise how dangerous loading third party scripts can be.

Just loading in a CSS stylesheet could cause nightmares.

For example, say your site loads in a stylesheet from www.SomeSiteIDontControl1.com, which loads in a background PNG image and in turn pulls in another remote third-party stylesheet from www.SomeSiteIDontControl2.com.

Then one day the person in control of that site changes that second image to a dangerous .js file or .exe that loads in an XSS attack.

You are so far removed from the actual cause of the problem that, with minification and compression, you might have no hope of finding the dangerous file.

So one day the 2nd CSS file that you are loading looks something like this:


background:url(http://www.somesiteIdontcontrol2.com/images/background.png) no-repeat 16px 0;


Then one day this site owner changes his background image to be a .js file, e.g.


background:url(http://www.somesiteIdontcontrol2.com/images/dodgyscript.js) no-repeat 16px 0;


And when the page loads, after your onDOMLoad event has pulled in these files, it hits your user with the other site's JavaScript code.

A recursive script might be handy to run every day to check diagnostics by referencing every URL it finds in any style-sheet or JavaScript on your site.

Follow it backwards and check every other URL it finds.
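As a starting point, something like this minimal PHP sketch (the stylesheet URL and the whitelist of domains you trust are just made-up examples) would pull every url(...) reference out of a stylesheet and flag anything served from a domain you don't control, or anything that doesn't look like an image or font at all:

// example values only - swap in your own stylesheet URLs and trusted domains
$stylesheet = "http://www.somesiteIdontcontrol1.com/css/site.css";
$trustedDomains = array("www.strictly-software.com");

$css = file_get_contents($stylesheet);

// find every url(...) reference inside the stylesheet
preg_match_all('@url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)@i', $css, $matches);

foreach ($matches[1] as $url) {
    $host = parse_url($url, PHP_URL_HOST);
    // flag anything hosted on a domain you don't control
    if ($host && !in_array($host, $trustedDomains)) {
        echo "CHECK: $url is loaded from an untrusted domain ($host)\n";
    }
    // flag anything that isn't the kind of file a stylesheet should be pulling in
    if (preg_match('@\.(js|exe)(\?|$)@i', $url)) {
        echo "WARNING: $url does not look like an image or font\n";
    }
}

A fuller version could follow any @import rules and the URLs inside your JavaScript files recursively in the same way.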

Another way, if you are perfectly happy with your code, is to create local versions of the files and images and keep it all on a server you control so no-one can malform the objects being loaded.

This is a bookmarklet script I wrote years ago that shows me the DOM as it is after everything has loaded, not before.

I wanted to see what scripts and files had been added compared with the View Source option, which shows the HTML and JS/CSS before any code has run on the page.

I use it all the time. I created a bookmark, added it to my bookmark bar so it's within easy reach, gave it a URL of www.google.com and saved it.

I then edit the bookmark and change its location from www.google.com to the JavaScript code I want, so that it runs. This might not be necessary any more, but in the old days I had to add a real URL first.

This code basically takes a snapshot of the DOM once all third-party objects have modified the code, loaded videos, changed images and done anything else sites like to do on onDOMLoad (not onWindowLoad, which only fires once every image and external object has been loaded).

As you are running the code with the press of a button there is plenty of time for the onDOM, onWindow and onFrame load events to fire, plus many others.


javascript:(function()%7b function htmlEscape(s)%7bs=s.replace(/&/g,'&amp;');s=s.replace(/>/g,'&gt;');s=s.replace(/</g,'&lt;');return s;%7d x=window.open(); x.document.write('<pre>' + htmlEscape('<html>\n' + document.documentElement.innerHTML + '\n</html>')); x.document.close(); %7d)();


As the HTML 5 spec still allows href="javascript: ..... " a link or button can run JavaScript inline, even though ideally code should be attached with document.addEventListener to each object that needs to fire when hit.

The code is URL-encoded and just creates a function called htmlEscape, which escapes angle brackets and ampersands, then opens a new window and writes the escaped copy of document.documentElement.innerHTML out into it inside a <pre> tag.

Not hard to do but very useful.


By Strictly-Software

© 2018 Strictly-Software

Monday, 18 September 2017

Automatic PHP 7 Support for Strictly Auto Tags


By Strictly-Software

Please check out the new free and premium plugins for PHP 7 users on the main site.

As PHP doesn't allow short-circuiting for some reason like decent languages do, I could not make a single file that would handle both PHP 5 and PHP 7 as I wanted to, with a simple check to see whether callbacks were supported or whether the removed modifier ( \e ) was still available.
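For illustration only, the kind of simple runtime check I had in mind looks something like this minimal sketch, branching on PHP_VERSION_ID (the messages are just placeholders for the two code paths):

// decide at runtime whether the installed PHP needs the callback route
// or can still run the legacy /e based code
if (PHP_VERSION_ID >= 70000) {
    echo "PHP 7+ detected - use the preg_replace_callback() version of the plugin";
} else {
    echo "PHP 5 detected - the legacy /e based version will still run";
}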

Therefore there are now both free and premium versions on my site, and you should check (and like) the Facebook page to keep informed of changes, as WordPress have taken down my plugins for some reason.

By Strictly-Software

Tuesday, 24 January 2017

PHP7 Support For Strictly AutoTags


By Strictly-Software


If you are using the popular Strictly AutoTags plugin then everything should be working fine; however, if you have upgraded to PHP 7 then that will have caused problems.

Not every developer has the time or knowledge to know that a new PHP version will remove features or cause issues with their plugins. However, in this case it's due to the /e modifier being dropped in PHP 7, which the plugin used in the following line:

$content = preg_replace("/(\.[”’\"]?\s*[A-Z][a-z]+\s[a-z])/e","strtolower('$1')",$content);


The only difference, apart from the callback, is that I am using @ @ as wrappers around my regular expression; this is just so I can see it more easily, with far less escaping required.

So replace the line above, which is at around line 1345 of the strictly-autotags/strictlyautotags.class.php file, with the following:

$content = preg_replace_callback("@(\.[”’\"]?\s*[A-Z][a-z]+\s[a-z])@",
    function ($matches) {
        return strtolower($matches[0]);
    },
    $content);

Other people have used this fix for the plugin in the WordPress forum so it should work. I don't use PHP 7 myself yet, so I have never had to deal with it.

However, if you are a developer, please help others out on the forum. I have had over 223,616 downloads of the free version. If every one of those people had donated £1 then I could spend my whole time working on it, but it seems everyone wants everything for free nowadays, which is why I have my premium plugin with more features > Strictly AutoTags Premium Plugin Version.

Remember you can also find up-to-date information on my Facebook page for my automation plugins, this one and the Strictly TweetBOT plugin, which go hand in hand.

It is worth checking that page for help, as I don't automatically get notified of new problems on the WordPress site for some reason.

You can find this page at https://www.facebook.com/strictlysoftware/

Remember if you have a bug with any of my plugins to do the following:


  1. Check the WordPress forum for similar bugs and fixes https://wordpress.org/support/plugin/strictly-autotags
  2. Check the ReadMe file or admin page for any help.
  3. Check your PHP and APACHE error logs to ensure it's this plugin causing the issues.
  4. Run through the standard debug practices laid out here: Giving useful debug information.
  5. Provide as much info to the developer as possible, e.g. PHP version, WP version, plugin version, any other installed plugins, when it started failing, whether anything else was installed around that time, and details of your tagging process.


By Strictly-Software


© 2017 Strictly-Software

Monday, 23 January 2017

Find Any WiFi Password on a Windows Computer


By Strictly-Software

The title is a little misleading as it doesn't take you back to the early 2000s and let you go driving around estates with a laptop, breaking into password-protected WiFi routers. Not that you used to need to, as in most estates your computer could pick up an unlocked router or three without a problem.

This is slightly different in that it allows you to find ANY password for a router your PC/laptop has been connected to in the past.

You may not like writing things down and have had a memory slip, or you haven't used the router for so long that the password escapes you.

First - Find out what you can access

This bit allows us to find all the WiFi routers, such as friends' routers and gadgets like a Chromecast, that you have forgotten the password to.

Open your command prompt in administrator mode, otherwise this won't work.

Once you have your command prompt up, let's find out what WiFi spots we have connected to in the past or have access to. If you were round a friend's one time and connected but forgot the password, then you may need to use this to regain it if his WiFi router is in the list.

Type the following into the prompt: netsh wlan show profiles

It should then list all the routers you have had connections to from the computer you are on.

C:\Windows\system32>netsh wlan show profiles 

User profiles
-------------
    All User Profile     : Chromecast1034
    All User Profile     : BTHub4-NX23
    All User Profile     : TALKTALK-3ERA24
    All User Profile     : virginmedia8817891
    All User Profile     : strictlywifi10x
    All User Profile     : strictly-ukhorse-air


Now we have a list of spots and we pick the one we need the password for. The command is pretty similar to the preceding one; it just needs the router's name added to it, along with the term key=clear. If you don't add this to the end then you won't get to view the password in clear text.

netsh wlan show profile BTHub4-NX23 key=clear

This will give you detailed info on the router: whether it connects automatically, the authentication mode e.g. WPA2, and even details of your current cost and whether you are over the data limit set by your provider.

Let's try to find the password for the connection BTHub4-NX23:

C:\Windows\system32>netsh wlan show profile BTHub4-NX23 key=clear

Profile BTHub4-NX23 on interface WiFi:
=======================================================================

Applied: All User Profile

Profile information
-------------------
    Version                : 1
    Type                   : Wireless LAN
    Name                   : BTHub4-NX23
    Control options        :
        Connection mode    : Connect automatically
        Network broadcast  : Connect only if this network is broadcasting
        AutoSwitch         : Do not switch to other networks

Connectivity settings
---------------------
    Number of SSIDs        : 1
    SSID name              : "BTHub4-NX23"
    Network type           : Infrastructure
    Radio type             : [ Any Radio Type ]
    Vendor extension          : Not present

Security settings
-----------------
    Authentication         : WPA2-Personal
    Cipher                 : CCMP
    Security key           : Present
    Key Content            : r85583569z

Cost settings
-------------
    Cost                   : Unrestricted
    Congested              : No
    Approaching Data Limit : No
    Over Data Limit        : No
    Roaming                : No
    Cost Source            : Default



As you can see from the Key Content section the password for this router is r85583569z.

Open the WiFi section on your desktop, add the key and it should connect. If not, you have a different problem.

So if you don't like writing passwords down, or just want to use your mate's WiFi without spending hours hunting down where he put his WiFi router's login details, then this trick can come in handy.

By Strictly-Software


© 2017 Strictly-Software

Wednesday, 5 October 2016

Disk Full - Linux - Hacked or Full of Log Files?

By Strictly-Software

This morning I woke up to find the symptoms of a hack attempt on my LINUX VPS server.

I had the same symptoms when I was ShockWave hacked a few years ago and some monkey overwrote a config file so that when I rebooted, hoping to fix the server, it would reload it in from a script hidden in a US car site.

They probably had no idea that the script was on their site either, but it was basically a script to enable various hacking methods and the WGet command in the config file ensured that my standard config was constantly overwritten when the server was re-started.

Another symptom was that my whole 80GB of disk space had suddenly filled up.

It was 30GB the night before, and now, with 30-odd HD movies hidden in a secret folder buried in my hard drive, I could not FTP anything up to the site, receive or send emails, or manually append content to my .htaccess file to give only my IP full control.

My attempts to clear space by clearing cached files were useless, and it was only by burrowing through the hard drive folder by folder all night, using the following command to show me the biggest files (visible and hidden), that I found the offending folder and deleted it.


du -hs $(ls -A)


However good this command is for finding files and folders and showing their size in KB, MB or GB, it is a laborious task to manually go from your root directory running the command over and over again until you find the offending folder(s).

So today when I thought I had been hacked I used a different process to find out the issue.

The following BASH script can be run from anywhere on your system in a console window and you can either enter a path if you think you know where the problem lies or just enter / when prompted to scan the whole machine.

It will list first the 20 biggest directories in order of size and then the 20 largest files in order of size.

echo -n "Type Filesystem: ";
read FS;NUMRESULTS=20;
resize;clear;date;df -h $FS;
echo "Largest Directories:"; 
du -x $FS 2>/dev/null| sort -rnk1| head -n $NUMRESULTS| awk '{printf "%d MB %s\n", $1/1024,$2}';
echo "Largest Files:"; 
nice -n 20 find $FS -mount -type f -ls 2>/dev/null| sort -rnk7| head -n $NUMRESULTS|awk '{printf "%d MB\t%s\n", ($7/1024)/1024,$NF}'

After running it I found that the problem was not actually a security breach but rather a plugin folder within a website containing log files. Somehow, without me noticing, the number of archived log files had crept up so much that it had eaten 50GB of space.


As the folder contained both current and archived log files I didn't want to just truncate it or delete everything; instead I removed all the archived log files using a wildcard search for the word ARCHIVED within the filename.


rm *ARCHIVED*


If you wanted to run a recursive find and delete within a folder then you may want to use something a bit different such as:


find . -type f -name '*ARCHIVED*' -delete


This managed to remove a whole 50GB of files within 10 minutes and, just like lightning, my sites, email and server started running again as they should have been.

So the moral of the story is that a full disk should be treated first as a symptom of a hacked server, especially if you were not expecting it, and the same methods used to diagnose and fix the problem can be used whether you have been hacked or allowed your server to fill itself up with log files or other content.

Therefore keep an eye on your system so you are not caught out if this does happen to you. If your disk usage does suddenly jump from 35GB to 80GB and you stop receiving emails or being able to FTP content up (or files are being copied up as 0 bytes), then you should immediately put some security measures into place.

My WordPress survival guide on security has some good options to use if you have been hacked, but as standard you can do some things to protect yourself such as:


  • Replace the default BASH shell with the more basic, older and more secure DASH. You can still run BASH once logged into your console, but it should not be the default, where it could let hackers run complex commands on your server.
  • Always use SFTP instead of FTP as it's more secure, and change the default SSH port from 22 to another number in the config file so that standard port scanners don't spot that your server is open and vulnerable to attack.
  • If you are running VirtualMin on your server you should also change the default port for accessing it from 10000 to another number. Otherwise attackers will just swap from SSH attacks by console to web attacks where the front end is less protected. Also NEVER store the password in your browser, in case you forget to lock your PC one day or your browser's local SQLite database is hacked and the passwords compromised.
  • Ensure your root password and every other user password is strong. Making passwords by joining up phrases or memorable sentences where you swap the capital and non-capital letters over is a good idea, and always add a number to the start or end, or both, as well as some special characters, e.g. 1967bESTsAIDfRED*_* would take a dictionary cracker a very long time to break.
  • Regularly change your root and other user passwords in case a keylogger has been installed on your PC and discovered them.
  • By running DenyHosts and Fail2Ban on your server you can ensure anyone who gets the SSH password wrong 3 times in a row is blocked and unable to access your console or SFTP files up to your server. If you lock yourself out you can always use the VirtualMin web front end (if installed) to log in and remove yourself from the DenyHosts list.
  • If you are running WordPress there are a number of other security tools, such as the WordPress Firewall plugin, that you can install which will hide your wp-admin login page away behind another URL and redirect people trying to access it to another page. I like the https://www.fbi.gov/wanted/cyber URL myself. It can also ban people who fail to log in after a number of attempts for a set amount of time, as well as a number of other security features.


Most importantly of all, regularly check the amount of free space you have on your server and turn off any logging that is not required.

Getting up at 5.30AM to send an email only to believe your site has been hacked due to a full disk is not a fun way to spend your day!


By Strictly-Software

 © 2016 Strictly-Software

A Karmic guide for Scraping without being caught

Quick Regular Expression to clean off any tracking codes on URLs

I have to deal with scrapers all day long in my day job and I ban them in a multitude of ways: firewalls, .htaccess rules, my own logger system that checks the duration between page loads and behaviour, and many other techniques.

However I also have to scrape HTML content sometimes for various reasons, such as to find a piece of content related to somebody on another website linked to my own. So I know the methods used both to scrape sites and to detect and stop scrapers.

This is a guide to the various methods scrapers use to avoid being caught and having their IP address added to a blacklist within minutes of starting. Knowing the methods people use to scrape sites will help you when you have to defend your own from scrapers, so it's good to know both attack and defence.

Sometimes it is just too easy to spot a script kiddy who has just discovered CURL and thinks it's a good idea to test it out on your site by crawling every single page and link available.

Usually this is because they have downloaded a script from the net, sometimes a very old one, and not bothered to change any of the parameters. Therefore when you see something hammering you in your log file with a user-agent of just "CURL" you can block it and know you will be blocking many other script kiddies as well.

I believe that when you are scraping HTML content from a site it is always wise to follow some golden rules based on Karma. It is not nice to have your own site hacked or taken down due to a BOT gone wild, so you shouldn't wish this on other people either.

Behave when you are doing your own scraping and hopefully you won't find your own site's content appearing on a Chinese rip-off under a different URL anytime soon.


1. Don't overload the server you are scraping.

This only lets the site admin know they are being scraped, as your IP / user-agent will appear in their log files so regularly that you might get mistaken for attempting a DOS attack. You could find yourself added to a block list ASAP if you hammer the site you are scraping.

The best way to get round this is to put a time gap in-between each request you make. If possible follow the site's robots.txt file if they have one and use any Crawl-Delay parameter they may have specified. This will make you look much more legitimate as you are obeying their rules.

If they don't have a Crawl-Delay value then randomise a wait time in-between HTTP requests, with at least a few seconds wait as the minimum. If you don't hammer their server and slow it down you won't draw attention to yourself.

Also, if possible, try to always obey the site's robots.txt file, as if you do you will find yourself on the right side of the karmic law. There are many tricks people use, such as dynamic robots.txt files and fake URLs placed within them, that are used to trick scrapers who break the rules by following DISALLOWED locations into honeypots, never-ending link mazes or just instant blocks.

An example of a simple C# Robots.txt parser I wrote many years ago that can easily be edited to obtain the Crawl-Delay parameter can be found here: Parsing the Robots.txt file with C-Sharp.
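As a rough illustration, here is a minimal PHP sketch that obeys a Crawl-Delay value (or falls back to a randomised wait) between cURL requests. The URLs and delay value are only examples:

// example values only
$urls = array("http://www.example.com/page1", "http://www.example.com/page2");
$crawlDelay = 5; // use the Crawl-Delay value from robots.txt if one exists

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // ... parse $html here ...

    // wait the crawl delay plus a few random seconds so requests don't arrive like clockwork
    sleep($crawlDelay + rand(1, 5));
}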


2. Change your user-agent in-between calls. 

Many offices share the same IP across their network due to the outbound gateway server they use, and many ISPs use the same IP address for multiple home users, e.g. through DHCP. Therefore, until IPv6 is 100% rolled out, there is no easy way to guarantee that by banning a user by their IP address alone you will get your target.

Changing your user-agent in-between calls and using a number of random and current user-agents will make this even harder to detect.

Personally I block all access to my sites from a list of BOTs I know are bad, or where it is obvious the person has not edited the default user-agent (CURL, Snoopy, WGet etc), plus IE 5, 5.5 and 6 (all the way up to 10 if you want).

I have found one of the most common user-agents used by scrapers is IE 6. Whether this is because the person using the script has downloaded an old tool with this as the default user-agent and not bothered to change it or whether it is due to the high number of Intranet sites that were built in IE6 (and use VBScript as their client side language) I don't know.

I just know that by banning IE 6 and below you can stop a LOT of traffic. Therefore never use old IE browser UAs, and always change the default UA from CURL to something else, such as Chrome's latest user-agent.

Using random numbers, dashes, very short user-agents or defaults is a way to get yourself caught out very quickly.
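A minimal PHP sketch of the idea; the user-agent strings here are just examples and should be swapped for whatever is current:

// a small pool of modern looking user-agents - keep these up to date
$userAgents = array(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
);

$ch = curl_init("http://www.example.com/somepage"); // example URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// pick a different user-agent for each request
curl_setopt($ch, CURLOPT_USERAGENT, $userAgents[array_rand($userAgents)]);
$html = curl_exec($ch);
curl_close($ch);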


3. Use proxies if you can. 

There are basically two types of proxy.

The first is a proxy where the owner of the computer knows it is being used as a proxy server, either generously, to allow people in foreign countries such as China or Iran to access outside content, or for malicious reasons, to capture the requests and details for hacking purposes.

Many legitimate online proxy services such as "Web Proxies" only allow GET requests, float adverts in front of you and prevent you from loading up certain material such as videos, JavaScript loaded content or other media.

A decent proxy is one where you obtain the IP address and port number and then set them up in your browser or BOT to route traffic through. You can find many free lists of proxies and their port numbers online, although as they are free you will often find speed is an issue, as many people are trying to use them at the same time. A good site to use to obtain proxies by country is http://nntime.com.

Common proxy port numbers are 8000, 8008, 8888, 8080, 3128. When using P2P tools such as uTorrent to download movies it is always good to disguise your traffic as HTTP traffic rather than using the default setting of a random port on each request. It makes it harder but obviously not impossible for snoopers to see you are downloading bit torrents and other content. You can find a list of ports and their common uses here.

The other form of proxy is BOTNETs or computers where ports have been left open and people have reverse-engineered them so that they can use the computer/server as a proxy without the person's knowledge.

I have also found that many people who try hacking or spamming my own sites are also using insecure servers. A port scan on these people often reveals that their own server can be used as a proxy itself. If they are going to hammer me - then sod them I say, as I watch US TV live on their server.
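Once you have an IP address and port number, routing a scripted request through the proxy looks something like this minimal PHP sketch (the proxy details are made up):

$ch = curl_init("http://www.example.com/somepage"); // example URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// route the request through the proxy - the IP and port here are placeholders
curl_setopt($ch, CURLOPT_PROXY, "203.0.113.50");
curl_setopt($ch, CURLOPT_PROXYPORT, 8080);
$html = curl_exec($ch);
curl_close($ch);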


4. Use a rented VPS

If you only need to scrape for a day or two then you can hire a VPS and set it up so that you have a safe, non-blacklisted IP address to crawl from. With services like Amazon AWS and other rent-by-the-minute servers it is easy to move your BOT from server to server if you need to do some heavy-duty crawling.

However on the flipside I often find myself banning the AmazonAWS IP range (which you can obtain here) as I know it is so often used by scrapers and social media BOTS (bandwidth wasters).


5. Confuse the server by adding extra headers

There are many headers that can tell a server whether you are coming through a proxy such as X-FORWARDED-FOR, and there is standard code used by developers to work backwards to obtain the correct original IP address (REMOTE_ADDR) which can allow them to locate you through a Geo-IP lookup.

However not so long ago, and many sites still may use this code, it was very easy to trick sites in one country into believing you were from that country by modifying the X-FORWARDED-FOR header and supplying an IP from the country of your choice.

I remember it was very simple to watch Comedy Central and other US TV shown online just by simply using a FireFox Modify Headers plugin and entering in a US IP address for the X-FORWARDED-FOR header.

Due to the code they were using, they obviously thought that the presence of the header indicated that a proxy had been used and that the original country of origin was the spoofed IP address in this modified header, rather than the value in the REMOTE_ADDR header.

Whilst this code is not so common anymore it can still be a good idea to "confuse" servers by supplying multiple IP addresses in headers that can be modified to make it look like a more legitimate request.

As the actual REMOTE_ADDR value is taken from the connection itself rather than a header you supply, you cannot easily change it. However you can supply a comma-delimited list of IPs from various locations in headers such as X-FORWARDED-FOR, HTTP_X_FORWARDED, HTTP_VIA and the many others that proxies, gateways, and different servers use when passing HTTP requests along the way.

Plus you never know, if you are trying to obtain content that is blocked from your country of origin then this old technique may still work. It all depends on the code they use to identify the country of an HTTP request's origin.
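For illustration, supplying those extra headers from a script might look like this minimal PHP sketch (the IP addresses are reserved documentation addresses, not real proxies):

$ch = curl_init("http://www.example.com/somepage"); // example URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// add forwarding headers with made-up IPs to muddy the trail
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    "X-Forwarded-For: 203.0.113.5, 198.51.100.7",
    "Via: 1.1 proxy.example.net"
));
$html = curl_exec($ch);
curl_close($ch);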


6. Follow unconventional redirect methods.

Remember there are many levels of being able to block a scrape so making it look like a real request is the ideal way of getting your content. Some sites will use intermediary pages that have a META Refresh of "0" that then redirect to the real page or use JavaScript to do the redirect such as:

<body onload="window.location.href='http://blah.com'">

or

<script>
function redirect(){
   document.location.href='http://blah.com';
}
setTimeout(redirect,50);
</script> 

Therefore you want a good super scraper tool that can handle this kind of redirect so you don't just return adverts and blank pages. Practice those regular expressions!
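For example, a couple of rough regular expressions along these lines (a sketch only; they won't cover every variation you will meet in the wild) will pull the target URL out of a META refresh tag or a simple JavaScript redirect:

$html = "<body onload=\"window.location.href='http://blah.com'\">"; // example page content
$target = "";

// first look for a META refresh e.g. <meta http-equiv="refresh" content="0;url=http://blah.com">
if (preg_match("@<meta[^>]+http-equiv=[\"']?refresh[\"']?[^>]+url=([^\"'>]+)@i", $html, $match)) {
    $target = $match[1];
}
// then look for simple JavaScript redirects e.g. window.location.href='http://blah.com'
if ($target == "" && preg_match("@(?:window|document)\.location\.href\s*=\s*['\"]([^'\"]+)['\"]@i", $html, $match)) {
    $target = $match[1];
}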


7. Act human.

By only making one GET request to the main page, and not to any of the images, CSS or JavaScript files that the page loads in, you make yourself look like a BOT.

If you look through a log file it is easy to spot crawlers and BOTs because they don't request these extra files, and as a log file is mainly sequential you can easily spot the requests made by one IP or user-agent just by scanning down the file and noticing all the single GET requests from that IP to different URLs.

If you really want to mask yourself as human then use a regular expression or HTML parser to get all the related content as well.

Look for any URLs within SRC and HREF attributes, as well as URLs contained within JavaScript that are loaded up with AJAX. It may slow your own code down and use up more of your own bandwidth, as well as that of the server you are scraping, but it will disguise you much better and make it harder for anyone looking at a log file to pick you out as a BOT with a simple search.
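A rough sketch of the idea in PHP (the markup is a made-up example and relative URLs are ignored for brevity): pull the src and href values out of the page you just fetched and request them as well:

$html = '<img src="http://www.example.com/logo.png" /><link href="http://www.example.com/site.css" rel="stylesheet" />'; // example markup
// grab every src or href value so the assets can be requested like a real browser would
preg_match_all('@(?:src|href)\s*=\s*["\']([^"\']+)["\']@i', $html, $matches);
foreach ($matches[1] as $assetUrl) {
    $ch = curl_init($assetUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch); // the response isn't needed, just the extra request in their log file
    curl_close($ch);
}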


8. Remove tracking codes from your URLs.

This is so that when the SEO "guru" looks at their stats they don't confuse their tiny little minds by not being able to work out why it says 10 referrals from Twitter but only 8 had JavaScript enabled or had the tracking code they were using for a feed. This makes it look like a direct, natural request to the page rather than a redirect from an RSS or XML feed.

Here is an example of a regular expression that removes the query string, including the question mark.

The example uses PHP but the expression itself can be used in any language.


$url = "http://www.somesite.com/myrewrittenpage?utm_source=rss&utm_medium=rss&utm_campaign=mycampaign";

$url = preg_replace("@(^.+)(\?.+$)@","$1",$url);


There are many more rules to scraping without being caught but the main aspect to remember is Karma. 

What goes around comes around, therefore if you scrape a site heavily and hammer it so badly that it costs the owner so much bandwidth and money that they cannot afford it, do not be surprised if someone comes and does the same to you at some point!
