Tuesday, 29 April 2014

Running WAMP on Windows 8 alongside IIS - 403 Forbidden Index

Problems with WAMP on Windows 8 - Forbidden 403 - Running IIS and Apache side by side

By Strictly-Software

I have just got a new laptop which has come with Windows 8.1.

Thankfully this version of Windows gives me more of a Windows 7 feel, as I really hate those big tablet-like buttons. I can't use them as a touch screen anyway, and I don't spend all my time on Facebook, so they give me no benefit.

I want to get where I am going without Windows telling me how things should look and feel, but Windows 8 is numptyfying the user interface even more, so that casual users can buy their wool with a few swipes while a coder has to hack about for ages to get anything working.

Anyway, as I develop in PHP as well as C# and classic ASP, I need to run both IIS and WAMP side by side on the same PC.

If you have read my earlier article on getting round issues with IIS blocking port 80, you will know that I like to change the port Apache uses to 8888 (a common HTTP alternative) so that I can run both web servers side by side without having to switch IIS off before switching WAMP on and vice versa.

The quick overview is to edit the httpd.conf file and change the listen port from 80 to 8888, e.g.:

Listen 8888

Also change the ServerName line to:

ServerName localhost:8888

You can read that article here: Troubleshooting WAMP Server on Windows 7.

However, with Windows 8 I found this didn't fix the problem, and when I tried to access localhost:8888 I would get a 403 Forbidden status code back when accessing index.php or phpmyadmin.php.

Apparently there are multiple solutions depending on what you want to do with your server.

As Windows 8 prefers IPv6 while WAMP binds to IPv4, localhost no longer resolves to the IPv4 loopback 127.0.0.1 but to the IPv6 loopback ::1.

A simple ping to localhost in your command prompt will prove this as you will get back ::1 and not 127.0.0.1.
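For example, from the command prompt (the machine name will obviously be different on your PC):

C:\>ping localhost

Pinging MyLaptop [::1] with 32 bytes of data:
Reply from ::1: time<1ms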

To fix this problem there are two solutions, depending on whether you want to run IIS alongside WAMP. If you don't mind toggling between IIS and WAMP then follow Method 1; if you want both running at the same time use Method 2.

Method 1

To get round the differences between WAMP being IPv4 and Windows 8 being IPv6 you need to edit some files.

Instead of changing the port in your httpd.conf file as my previous article does, you change the following line:

Listen 80

to

Listen 0.0.0.0:80

Then you need to edit your hosts file in c:\windows\system32\drivers\etc to comment out the ::1 localhost line, e.g.:

# localhost name resolution is handled within DNS itself.
127.0.0.1       localhost
# ::1                   localhost

However this doesn't solve the issue of running WAMP alongside IIS, as Apache is still listening on port 80.

Therefore, if you do want both to run side by side without toggling them on/off, ignore what I just said and follow Method 2 instead.

Method 2

Follow all the steps in the earlier article, Troubleshooting WAMP Server on Windows 7, and then also change the <Directory> block in your phpmyadmin.conf file (which will be in your c:\wamp\alias\ folder) to the following:



<Directory "c:/wamp/apps/phpmyadmin3.5.1/">
    Options Indexes FollowSymLinks MultiViews
    AllowOverride all
    Order Deny,Allow
    Deny from all
    Allow from all
</Directory>


And then in your C:\wamp\bin\apache\apache2.2.22\conf\httpd.conf file (or whichever version you are using) you need to have the following lines.

Remove the Allow from 127.0.0.1 and replace it with Allow from all, like we did in the previous file.


<Directory />
    Options FollowSymLinks
    AllowOverride None
    Order deny,allow
    Allow from all
</Directory>


Now you should be able to access phpmyadmin with either http://localhost:8888/phpmyadmin/ or http://127.0.0.1:8888/phpmyadmin/ and still get your .NET or ASP classic code running with IIS from a simple http://localhost or http://127.0.0.1.

These two methods should sort you out if you get stuck like I did, and I am sure that when the next version of Windows comes out we will all have some more problems to solve to get WAMP running alongside IIS!

Wednesday, 23 October 2013

4 simple rules robots won't follow

Job Rapists and Content Scrapers - how to spot and stop them!

I work with many sites, from small blogs to large sites that receive millions of page loads a day. I have to spend a lot of my time checking my traffic log and logger database to investigate hack attempts, heavy hitting bots and content scrapers that take content without asking (on my recruitment sites and jobboards I call this Job Raping, and the BOT that does it a Job Rapist).

I banned a large number of these "job rapists" the other week and then had to deal with a number of customers ringing up to ask why we had blocked them. The way I see it (and that really is the only way, as it's my responsibility to keep the system free of viruses and hacks), if you are a bot and want to crawl my site you have to follow these rules.

These steps are not uncommon and many sites implement them to reduce bandwidth wasted on bad BOTS as well as protect their sites from spammers and hackers.

4 Rules For BOTS to follow


1. Look at the Robots.txt file and follow the rules

If you don't even bother looking at this file (and I know because I log those that do) then you have broken the most basic rule that all BOTS should follow.

If you can't follow even the most basic rule then you will be given a ban or 403 ASAP.

To see how easy it is to make a BOT that can read and parse a Robots.txt file, please read this article (it is some very basic code I knocked up in an hour or so):

How to write code to parse a Robots.txt file (including the sitemap directive).
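As a rough illustration of the idea (this is just a minimal sketch, not the code from that article, and the BOT name, site URL and path are made-up placeholders), a polite crawler only needs a few lines of PHP to check a path against the Disallow rules before requesting it:

<?php
// Minimal sketch of a polite BOT checking robots.txt before crawling a page.
// "MyBot", www.example.com and the path are made-up placeholders.
$site  = "http://www.example.com";
$agent = "MyBot";
$path  = "/jobs/web-developer-123.html";

$rules   = array();
$applies = false;

// Collect the Disallow rules that apply to our user-agent (or to *)
foreach (file($site . "/robots.txt") as $line) {
    $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
    if (preg_match('/^User-agent:\s*(.+)$/i', $line, $m)) {
        $applies = ($m[1] == '*' || stripos($agent, $m[1]) !== false);
    } elseif ($applies && preg_match('/^Disallow:\s*(\S+)/i', $line, $m)) {
        $rules[] = $m[1];
    }
}

// Only fetch the page if no Disallow rule matches the start of the path
$blocked = false;
foreach ($rules as $rule) {
    if (strpos($path, $rule) === 0) { $blocked = true; break; }
}

echo $blocked ? "Blocked by robots.txt - do not crawl\n" : "OK to crawl\n";
?>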


2. Identify yourself correctly

Whilst it may not be set in stone, there is a "standard" way for BOTS to identify themselves correctly in their user-agents, and all proper search engines and crawlers will supply a correct user-agent.

If you look at some common ones such as Google or BING or a Twitter BOT we can see a common theme.

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; TweetedTimes Bot/1.0; +http://tweetedtimes.com)

They all:
- Provide information on browser compatibility, e.g. Mozilla/5.0.
- Provide their name, e.g. Googlebot, bingbot, TweetedTimes.
- Provide their version, e.g. 2.1, 2.0, 1.0.
- Provide a URL where we can find out information about the BOT and what it does, e.g. http://www.google.com/bot.html, http://www.bing.com/bingbot.htm and http://tweetedtimes.com

On the systems I control, and on many others that use common intrusion detection systems at firewall and system level (even WordPress plugins), having a blank user-agent or a short one that doesn't contain a link or email address is enough to get a 403 or a ban.

At the very least a BOT should provide some way to let the site owner find out who owns the BOT and what the BOT does.

Having a user-agent of "C4BOT" or "Oodlebot" is just not good enough.

If you are a new crawler identify yourself so that I can search for your URL and see whether I should ban you or not. If you don't identify yourself I will ban you!
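To give a feel for how simple that check is, here is a rough sketch only, assuming the request has already been identified as coming from a BOT rather than a normal browser (normal browser user-agents do not contain contact URLs or email addresses):

<?php
// Crude first-pass filter on a crawler's user-agent string. A sketch only -
// it assumes the request has already been identified as a BOT, because
// normal browser user-agents do not contain URLs or email addresses.
function isBadBotAgent($ua)
{
    $ua = trim($ua);

    // Blank or very short user-agents tell us nothing about the owner
    if ($ua == '' || strlen($ua) < 10) {
        return true;
    }

    // A proper crawler should include a URL or email address so that the
    // site owner can look up who runs the BOT and what it does
    return !preg_match('#(https?://\S+|[\w.\-]+@[\w.\-]+\.\w+)#i', $ua);
}

// Example usage
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (isBadBotAgent($ua)) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
?>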


3. Set up a Reverse DNS Entry

I am now using the "standard" way of validating crawlers against the IP address they crawl from.

This involves doing a reverse DNS lookup with the IP used by the bot.

If you haven't got this set up then tough luck. If you have, I will then do a forward DNS lookup to make sure the IP is registered against that host name.

I think most big crawlers are starting to come on board with this way of doing things now. Plus it is a great way to confirm that GoogleBot really is GoogleBot, especially when user-agent switcher tools are so common nowadays.

I also have a lookup table of IP/user-agents for the big crawlers I allow. However, if GoogleBot or BING start using new IP addresses that I don't know about, the only way I can correctly identify them (especially after experiencing GoogleBOT hacking my site) is by doing this two-step DNS verification routine.
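As a rough sketch of that two-step routine in PHP (GoogleBot is used as the example; which host names you trust for each crawler is up to you):

<?php
// Two-step DNS verification for a visitor claiming to be GoogleBot.
// Step 1: reverse DNS on the IP. Step 2: forward DNS on the returned host
// name and confirm it resolves back to the same IP. A sketch only.
function isRealGoogleBot($ip)
{
    // Step 1: genuine GoogleBot IPs reverse to *.googlebot.com or *.google.com
    $host = gethostbyaddr($ip);
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }

    // Step 2: the host name must resolve forward to a list containing the IP
    $ips = gethostbynamel($host);
    return $ips !== false && in_array($ip, $ips);
}

// Example usage: 403 anything pretending to be GoogleBot from the wrong IP
$ip = $_SERVER['REMOTE_ADDR'];
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'googlebot') !== false && !isRealGoogleBot($ip)) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
?>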


4. Job Raping / Scraping is not allowed under any circumstances. 

If you are crawling my system then you must have permission from each site owner as well as me to do this.

I have had tiny weeny itsy bitsy jobboards with only 30 jobs receive up to 400,000 page loads a day because of scrapers, email harvesters and bad bots.

This is bandwidth you should not be taking up, and if you are a proper job aggregator like Indeed, JobsUK or GoogleBase then you should accept XML feeds of the jobs from the sites that want their jobs to appear on your site.

Having permission from the clients (recruiters/employers) on the site is not good enough, as they do not own the content; the site owner does. From what I have seen, the only job aggregators who crawl rather than accept feeds are those who can't, for whatever reason, get the jobs the correct way.

I have put automated traffic analysis reports into my systems that let me know at regular intervals which bots are visiting me, which visitors are heavy hitting and which are spoofing, hacking, raping and other forms of content pillaging.
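A very simplified version of the heavy-hitter part of those reports might look something like this in PHP (the log path and threshold are made-up examples, not values from my real system):

<?php
// Simplified heavy-hitter report: tally requests per IP from an Apache access
// log and flag any IP over a threshold. The log path and threshold below are
// made-up examples for illustration only.
$logFile   = 'C:/wamp/logs/access.log';
$threshold = 10000; // requests in one log file before an IP gets flagged

$hits = array();
foreach (file($logFile) as $line) {
    // Common/combined log format starts with the client IP address
    $ip = strtok($line, ' ');
    if ($ip !== false) {
        $hits[$ip] = isset($hits[$ip]) ? $hits[$ip] + 1 : 1;
    }
}

arsort($hits); // heaviest hitters first
foreach ($hits as $ip => $count) {
    if ($count >= $threshold) {
        echo "$ip\t$count requests - candidate for a 403 or ban\n";
    }
}
?>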

It really is like a Cold War arms race, and I am banning bots every day of the week for breaking these 4 simple-to-follow rules.

If you are a legitimate bot then it's not too hard to come up with a user-agent that identifies you correctly, set up a reverse DNS entry, follow the robots.txt rules and not visit my site every day crawling every single page!