I have just been working on my robot, which is written in C#, and was trying to resolve some issues with redirects on certain pages that the bot wasn't following. The default setting for the HttpWebRequest object is to follow up to 50 redirects, but you can override the defaults with the following properties:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(_url);
// follow redirects automatically, but cap the redirect chain at 5
request.AllowAutoRedirect = true;
request.MaximumAutomaticRedirections = 5;
However I was having an issue with a link to an article that went to an advert page first and then redirected afterwards. I was trying to come up with a solution to bypass this page, but running a test console app from my laptop against the URL with my HttpRequest wrapper class only returned the following error:

The server committed a protocol violation. Section=ResponseStatusLine

However, when I accessed the same URL from a webpage the error wasn't raised and the HTTP response came back with a 200 status code.
A quick look on the web and I found that this problem can be caused by invalid headers in the response, such as extra carriage returns or incomplete or malformed headers. To get round the problem of invalid headers the following setting can be added to the app.config file.
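The commonly suggested fix is the useUnsafeHeaderParsing setting, which tells the framework to tolerate malformed response headers:

```xml
<configuration>
  <system.net>
    <settings>
      <!-- relax header parsing so malformed response headers don't throw -->
      <httpWebRequest useUnsafeHeaderParsing="true" />
    </settings>
  </system.net>
</configuration>
```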
However I tried this solution and it didn't resolve the issue. I then took a closer look at the other headers I was passing and noticed that on the website I was passing, as the User-Agent header, the same value as the current user's browser, e.g.:
// set the user-agent to the same agent as the current user's browser
// (ServerVariables already returns a string, so no ToString() is needed)
string useragent = Request.ServerVariables["HTTP_USER_AGENT"];
// use my HTTP request wrapper object to make requests
HTTPRequest webReq = new HTTPRequest(url, proxy, useragent);
However, when I was running my test console application I was using a default user-agent that I had made up myself, e.g.:
Mozilla/4.0 (compatible; RobsRobot 1.3; www.strictly-software.com;)
I changed this default agent to an IE 6.0 string, e.g.:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
and then I was able to retrieve a valid response from the remote server. However, on the second request I got the error again. Changing the agent to IE 7.0 once again allowed me to retrieve a response, but only for one attempt. So I am wondering whether the server in question, which is an nginx server, has some sort of IP/agent logging and was blocking multiple requests from the same agent within a certain time limit.
So I tried using a proxy server and found that no matter which user-agent I used I got a valid response back from the server every time. I could make multiple rapid requests and use my own user-agent string.
Therefore I am not quite sure what the problem is, as I haven't managed to narrow it down 100%, but I am pretty sure that the page in question is doing some sort of server-side agent sniffing and then delivering advertising content related to the request. This advert content seems to have an issue with its headers, which causes the protocol violation error; my object now handles that error by returning an empty string as the response.
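For illustration, here is a minimal sketch of how that handling might look inside a wrapper class. The Fetch method, class name, and overall structure below are my own illustrative assumptions, not the actual HTTPRequest class; the framework calls themselves (UserAgent, WebProxy, WebExceptionStatus.ServerProtocolViolation) are standard .NET:

```csharp
using System;
using System.IO;
using System.Net;

public class HttpFetcher
{
    // Fetch a URL with a given user-agent and optional proxy.
    // Returns the response body, or an empty string if the server
    // commits a protocol violation or the request otherwise fails.
    public static string Fetch(string url, string userAgent, string proxyAddress)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = userAgent;
            request.AllowAutoRedirect = true;
            request.MaximumAutomaticRedirections = 5;

            // route the request through a proxy if one was supplied
            if (!String.IsNullOrEmpty(proxyAddress))
            {
                request.Proxy = new WebProxy(proxyAddress);
            }

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
        catch (WebException ex)
        {
            // a malformed response status line surfaces here with
            // Status == WebExceptionStatus.ServerProtocolViolation;
            // swallow it (and other request failures) and return ""
            return String.Empty;
        }
    }
}
```

Swallowing the exception like this keeps the robot running past bad advert pages, at the cost of hiding the underlying cause, so logging ex.Status before returning would be a sensible addition.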