Strictly Software: URL


Thursday, 26 May 2022

Extending Try/Catch To Handle Custom Exceptions - Part 1

Extending Core C# Object Try/Catch To Display Custom Regular Expression Exceptions - Part 1

By Strictly-Software

Have you ever wanted to throw custom exceptions, whether for business logic (such as a customer not being found in your database) or because some more complex logic has failed that you don't want to handle but do want raised so that you know about it?

For me, I required this solution due to a BOT I use that collects information from various sites every day and uses regular expressions to find the pieces of data I require and extract them. The problem is that a site will often change its HTML source code, which breaks a regular expression I have specifically crafted to extract data from it.

For example, I could have a very simple regular expression that is getting the ordinal listing of a name using the following C# regular expression:

string regex = @"<span class=""ordinal_1"">(\d+?)</span>";

Then one day I will run the BOT and it won't return any ordinals for my values. When I look inside the log file I find my own custom message for a failed match, such as "Unable to extract value from HTML source", and when I go and check the source it's because they have changed the HTML to something like this:

<span class="ordinal_1__swrf" class="o1 klmn">"1"<sup>st</sup></span>

This is obviously gibberish many designers add into HTML to stop BOTS crawling their data hoping that the BOT developer has written a very specific regex that will break and return no data when met with such guff HTML.

Obviously, the first expression, whilst matching the original HTML fine, is too strict to handle extra guff within the source.

Therefore it required a change in the expression to something a bit more flexible in case they added even more guff into the HTML:

string regex = @"<span class=""ordinal_+?[\s\S]+?"">""?(\d+?)""?<sup";

As you can see, I have made the expression looser, with extra question marks ? in case quotes are or are not wrapped around values, and with non-greedy match-anything expressions like [\s\S]+? (any whitespace or non-whitespace character, so any character at all, including newlines) to handle the gibberish from the point it appears to where I know it has to end at the closing quote or bracket.
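To see the difference between the two expressions without firing up a C# project, here is a minimal sketch that runs both patterns through JavaScript's regex engine instead. The patterns are the ones above with the C# string escaping removed; the "old" HTML sample and the helper name extractOrdinal are my assumptions for illustration, inferred from the first expression.

```javascript
// The original, strict pattern (JavaScript literal syntax, so "" becomes ")
const tightPattern = /<span class="ordinal_1">(\d+?)<\/span>/;

// The looser pattern that tolerates extra guff in the class attribute
const loosePattern = /<span class="ordinal_+?[\s\S]+?">"?(\d+?)"?<sup/;

// Assumed original markup (inferred from the strict pattern, not taken from a real site)
const oldHtml = '<span class="ordinal_1">1</span>';

// The changed markup shown in the article
const newHtml = '<span class="ordinal_1__swrf" class="o1 klmn">"1"<sup>st</sup></span>';

// Hypothetical helper: returns the captured ordinal or null when nothing matches
function extractOrdinal(pattern, html) {
  const match = pattern.exec(html);
  return match ? match[1] : null;
}

console.log(extractOrdinal(tightPattern, oldHtml)); // strict pattern on the old markup
console.log(extractOrdinal(tightPattern, newHtml)); // strict pattern on the new markup
console.log(extractOrdinal(loosePattern, newHtml)); // loose pattern on the new markup
```

Note that the looser pattern relies on a <sup tag following the number, so it is tuned to the new markup and would not match the assumed old sample.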

So instead of just logging the fact I have missing data from crawls I wanted to raise the errors with TRY/CATCH and make the fact that a specific piece of HTML no longer matches my expression an exception that will get raised so I can see it as soon as it happens.

Well, with C# you can extend the base Exception object so that your own exceptions, based upon your own logic, can be thrown and caught with TRY/CATCH. In future articles, we will build up to a full C# Project with a couple of classes, a simple BOT and some regular expressions we can use to test what happens when trying to extract common values from the HTML source on various URLs to throw exceptions.

Creating The TRY CATCH CLASS

First off, the TRY / CATCH C# Class, where I am extending the base Exception object to raise the usual messages but using String.Format so that I can pass in specific details. I have put numbers at the start of each message so that when running the code later we can see which constructor was called.

I have just created a new solution in Visual Studio 2022 called TRYCATCH and then named my class RegExException, which extends the Exception object. You can call your solution what you want, but for a test I don't see why just following what I am doing is not okay.


using System;

[Serializable]
public class RegExException : Exception
{
// Default constructor, no message
public RegExException() { }

// 1: the failing regular expression plus a message
public RegExException(string regex, string msg)
   : base(String.Format("1 Regular Expression Exception: Regular Expression: {0}; {1}", regex, msg)) { }

// 2: a message only
public RegExException(string msg)
    : base(String.Format("2 {0}", msg)) { }

// 3: the failing regular expression plus the exception that was caught, which is kept as the InnerException
public RegExException(string regex, Exception ex)
    : base(String.Format("3 Regular Expression is no longer working: {0} {1}", regex, ex.Message), ex) { }

}

As you can see, besides the default constructor the Class has three overloaded constructors, which take either a single message, a regular expression string and a message, or a regular expression and an exception. These values are placed in specific places within the exception messages.

You would cause a Regular Expression Exception to be thrown by placing something like this in your code:

throw new RegExException(String.Format("No match could be found with Regex {0} on supplied value '{1}'", re, val));
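The throw-and-catch flow itself is not specific to C#. As a hedged illustration only, here is the same idea sketched in JavaScript: a custom error type is thrown when a pattern finds no match, and the caller catches that type specifically while letting anything else bubble up. RegExError, extractOrThrow and tryExtract are hypothetical names of mine, not part of the C# class above.

```javascript
// Hypothetical JavaScript counterpart of the C# RegExException class
class RegExError extends Error {
  constructor(regex, message) {
    super("Regular Expression Exception: Regular Expression: " + regex + "; " + message);
    this.name = "RegExError";
    this.regex = regex; // keep the failing pattern for logging
  }
}

// Returns the first capture group, or throws RegExError when nothing matches
function extractOrThrow(pattern, html) {
  const match = pattern.exec(html);
  if (match === null) {
    throw new RegExError(pattern.source, "No match could be found on supplied value '" + html + "'");
  }
  return match[1];
}

// Catches only our own error type; any other error is rethrown untouched
function tryExtract(pattern, html) {
  try {
    return { value: extractOrThrow(pattern, html), error: null };
  } catch (err) {
    if (err instanceof RegExError) {
      return { value: null, error: err.message };
    }
    throw err;
  }
}

console.log(tryExtract(/<b>(\d+)<\/b>/, "<b>7</b>"));
console.log(tryExtract(/<b>(\d+)<\/b>/, "<i>7</i>"));
```

Catching by type means a broken expression shows up as its own kind of failure instead of being lost among general runtime errors.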

Hopefully, you can see what I am getting at, and we will build on this in a future post.

Tuesday, 20 January 2009

ISAPI URL Rewriting - Hot Linking

Banning Image Hot Linking

I was updating my ISAPI rules the other day, implementing some new rules to identify XSS hacks and new SQL injection fingerprints, when I came across some articles on the web about banning image hot linking, which I thought was a cool idea. If you have spent time and effort designing images then there is nothing worse than someone just stealing them or, even worse, hot linking to the image so that your bandwidth is also taken up hosting their images!

I implemented the rules with a redirect to a logging page that logged the referer to my traffic DB so that I could see which sort of requests were being made for images that didn't originate from the site they belonged to. After a week or so of logging I had enough data to come to the conclusion that I wouldn't be able to implement a hot linking ISAPI rule in the majority of the commercial applications I work on. This is for the simple reason that most sites I develop send out one or more HTML emails to users, and those emails themselves hot link to images and logos, e.g. using full URLs back to the image on the webserver. Unless I could come up with a rule that could handle every type of webmail or email client, there would always be someone somewhere in the world opening up their latest site generated email and clicking the "load images" button only to get nothing in return.

There are so many different mail clients out there, developed in hundreds of countries, that no rule could keep up with all the possibilities. From viewing my logging I had at least 2 different Polish mail clients as well as a number of Russian, Chinese and god knows what. So if you are developing a commercial application that has to send out HTML marketing or other forms of email, I would advise against using a hot linking rule.

However for small or personal sites then there is nothing wrong with implementing a rule to ban hot linkers. One of the cool ideas I read about was to return a banner advert or some other imagery that would annoy the user. However if you are going to use these rules you should know that as soon as the linker finds out you have blocked them they will just download the image from your site and host it themselves if they really wanted it. The original reason for this type of rule from what I have read is to implement it periodically to log those sites who are linking maybe without even blocking the image and then to issue cease and desist orders to the offenders.

The ISAPI Rules for IIS

I have to work with IIS at my current company and we use both 32 bit and 64 bit servers which means we have different versions of the ISAPI DLL installed on each and therefore slightly different syntax. The component I use is ISAPI Rewrite.

You will notice that the rules are slightly different due to the regular expression engine differences between versions.

Both versions do a number of conditional checks to ensure that:
-The referer is not blank.
-The referer is not the current site, which is obviously okay.
-The referer is not an image search bot (looking at the most popular robots).
-The referer is not an email client.
-The user-agent is not a popular search engine bot.
-The image in question is a gif, jpeg, jpg, png or bmp.

If all those conditions are matched I redirect to a 403 forbidden page.
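Before reading the rewrite rules themselves, it may help to see the checks listed above as plain logic. The following is only a sketch in JavaScript, not how ISAPI Rewrite evaluates anything: shouldBlockHotLink and escapeRegex are hypothetical names, and the word lists are the ones used in the rules in this post.

```javascript
// Escape a string so it can be embedded literally inside a RegExp
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns true when the request looks like a hot link that the rules would send to the 403 page
function shouldBlockHotLink(host, referer, userAgent, path) {
  // Only image requests are of interest (assumption: an optional querystring may follow the extension)
  if (!/\.(?:gif|jpe?g|png|bmp)(?:$|\?)/i.test(path)) return false;

  // A blank referer is allowed through
  if (!referer) return false;

  // A referer on the current site is obviously okay
  const bareHost = host.replace(/^www\./i, "");
  const ownSite = new RegExp("^https?://(?:www\\.)?" + escapeRegex(bareHost) + "(?:[/:?#]|$)", "i");
  if (ownSite.test(referer)) return false;

  // Referers that look like image search engines or caches
  if (/^https?:\/\/(?:images\.|www\.|cc\.)?(?:cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa)/i.test(referer)) return false;

  // Referers that look like webmail or email clients
  if (/^https?:\/\/.*(?:webmail|e?mail|live|inbox|outbox|junk|sent)/i.test(referer)) return false;

  // User-agents of popular search engine bots
  if (/google|yahoo|msn|ask|picsearch|alexa|clush|botw/i.test(userAgent || "")) return false;

  return true;
}

console.log(shouldBlockHotLink("www.mysite.com", "http://evil.example/page", "Mozilla/5.0", "/img/logo.png"));
```

Each early return corresponds to one of the conditions, which makes it easy to see why a new mail client that matches none of the word lists ends up blocked.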

Version 2.7 ( 32 bit server - httpd.ini)

Notice that I capture the host in the first conditional and then use a back reference to that matched value in the third conditional which ensures only hosts that are not the current site get matched.

RewriteCond Host: (.+)
RewriteCond Referer: .+
RewriteCond Referer: (?!https?://\1.*).*
RewriteCond Referer: (?!https?://(?:images\.|www\.|cc\.)?(cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa)).*
RewriteCond Referer: (?!https?://.*(webmail|e?mail|live|inbox|outbox|junk|sent)).*
RewriteCond User-Agent: (?!.*(google|yahoo|msn|ask|picsearch|alexa|clush|botw)).*
RewriteRule .*\.(?:gif|jpe?g|png|bmp) /403.aspx [I,O,L]

Version 3 (64 bit server - .htaccess)

RewriteCond %{HTTP_REFERER} ^.+$
RewriteCond %{HTTP_REFERER} ^(?!https?://(?:www\.)?mysite\..*) [NC]
RewriteCond %{HTTP_REFERER} ^(?!https?://(?:images\.|www\.|cc\.)?(cache|mail|live|google|googlebot|yahoo|msn|ask|picsearch|alexa).*) [NC]
RewriteCond %{HTTP_REFERER} ^(?!https?://.*(webmail|e?mail|live|inbox|outbox|junk|sent).*) [NC]
RewriteCond %{HTTP_USER_AGENT} ^(?!.*(google|yahoo|msn|ask|picsearch|alexa|clush|botw).*) [NC]
RewriteRule .*\.(jpe?g|png|gif|bmp) /403.aspx [NC,L]

Quirks and Differences

As always nothing is ever simple especially when you want to implement the same rule across two different versions of an application. As well as the obvious syntax differences with the flags and HTTP header names I found the following:

-Using the IIS converter tool to convert the 2.7 rules to version 3 did not convert all the rules. It could not handle the back references and therefore I explicitly match the site domain in the v3 rules.
-The negative lookahead asserts differ between versions. I could not get them working in 2.7 without putting the trailing .* outside the grouping, e.g. (?!https?://\1.*).* whereas in v3 they are within the grouping, e.g. ^(?!https?://(?:www\.)?mysite\..*)
-The documentation recommends NOT using the ^ and $ when using rules with conditions because internally all conditions are combined together to create one rule and this can lead to unexpected behaviour.
-I could not get the Ignore Case flag [I] working in version 2.7 with the negative lookaheads. As soon as I added the flag the rules would not match. This does not seem to be a problem in version 3 and the equivalent flag [NC] works fine.

Apart from those quirks both sets of rules have been tested on live systems and work fine. However, if you are going to use rules such as these you are always going to run into problems with the new mail clients and user-agents that come along oh so very frequently. Unless you are going to constantly update your ini files, a considerable percentage of the requests you block will be false positives.

The full ISAPI Rewrite documentation can be found here: http://www.isapirewrite.com/docs/

Monday, 1 December 2008

Adding Remote URL content to Index

Indexing a remote URL for use in a knowledge base

I have just completed work on a small knowledge base that I built in ASP.NET, which has a few quite funky features, one of which is the ability to add an article into the system that is at a remote location. Most of the articles revolve around written content or files which are attached to the articles, but sometimes users may come across an article on the web that they think would be great to add to the system and want it to be indexed and searchable just like any other article. In my previous incarnation of this, which I hastily wrote one night back in the late 90's in classic ASP, you could add a URL but the only indexable content that could be used to find it in the knowledge base was the tag words I allowed the user to add alongside the URL. Obviously this isn't really good enough, so in the latest version, on saving the article, I do the following:

1. Check the URL looks valid using a regular expression.
2. Access the URL through a proxy server and return the HTML source.
3. Locate and store the META keywords, description and title if they exist.
4. Remove everything apart from content between the start and close BODY tags.
5. From the body I strip any SCRIPT tags and anything between them.
6. Remove all HTML tags.
7. Clean the remaining content by removing noise words, numbers and swear words.
8. I add the remaining content which consists of good descriptive wording to the META keywords, description and title which I stored earlier.
9. I save this content to the database which then updates the Full Text Index so that it becomes searchable by the site users.

Following this process means that I get all the benefits of having the remote article indexed and searchable without the downside of having to store the whole HTML source code. After cleaning I am left with only the core descriptive wording that is useful and do away with all the rubbish.

I will show you the two main methods, written in C#, that retrieve the URL content and clean the source.

1. Method to access remote URL through proxy server.


// Requires: using System; using System.IO; using System.Net; using System.Text;
public static string GetURLHTML(string remoteURL, string proxyServer)
    {
        string remoteURLContent = "";

        WebProxy proxy = new WebProxy(proxyServer, true); //pass the name of the proxy server, true = bypass the proxy for local addresses
        WebRequest webReq = WebRequest.Create(remoteURL);
        webReq.Proxy = proxy; //set request to use proxy

        // Set the HTTP-specific UserAgent property so those sites know who has come and ripped them up
        HttpWebRequest httpReq = webReq as HttpWebRequest;
        if (httpReq != null)
        {
            httpReq.UserAgent = ".NET Framework Strategies Knowledge Base Article Parser v1.0"; //Set up my useragent
        }

        try
        {
            // Get the response instance, the using block closes it on every exit path
            using (WebResponse webResp = webReq.GetResponse())
            {
                HttpWebResponse httpResp = webResp as HttpWebResponse;

                //we can only collect HTML from valid responses so ignore anything that is not a 200
                if (httpResp == null || httpResp.StatusCode != HttpStatusCode.OK)
                {
                    return remoteURLContent;
                }

                // Get the response stream and read it all. Note ASCII decoding turns any non-ASCII character into ?
                using (Stream respStream = webResp.GetResponseStream())
                using (StreamReader reader = new StreamReader(respStream, Encoding.ASCII))
                {
                    remoteURLContent = reader.ReadToEnd();
                }
            }
        }
        catch (Exception)
        {
            //the request failed (timeout, DNS error, 404, 500 etc) so return an empty string
            return "";
        }

        return remoteURLContent;
    }

The reason I use a proxy is down to the security policy set on our web servers.

2. Method to gather the main content.


//When article poster wants us to save a remote URL as the KB article content then we need to get the content and parse it
 protected string IndexURL(string remoteURL)
 {
     KeywordParser keywordParser;
     string METAKeywords = "", METADescription = "", METATitle = "";
     string cleanHTML = "";
     StringBuilder indexText = new StringBuilder();

     //As I have to access all remote URLs through a proxy server I access my application setting from the web.config file
     string proxyServer = ConfigurationManager.AppSettings["ProxyServer"].ToString();

     //now access the remote URL and return the HTML source code if we can
     string remoteURLHTML = UtilLibrary.GetURLHTML(remoteURL, proxyServer);

     //if we have some HTML content to parse and clean
     if (!String.IsNullOrEmpty(remoteURLHTML))
     {        
         remoteURLHTML = remoteURLHTML.ToLower(); //lower case it all as a)it doesn't matter and b)means no need for ignore options in regular expressions

         //Set up some regular expressions to help identify the META content we want to index in the source
         Regex HasKeywords = new Regex("<meta\\s+name=\"keywords\"");
         Regex HasDescription = new Regex("<meta\\s+name=\"description\"");
         Regex HasTitle = new Regex("<title>");

         //As I am using replaces to quickly return the content I require I do a test first for the relevant tag otherwise if the source doesn't
         //contain the META tag then we will be left with the whole HTML source which we obviously don't want!!
         if (HasKeywords.IsMatch(remoteURLHTML))
         {          
             //get the data we require by replacing anything either side of the tag
             METAKeywords = "KEYWORDS = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"keywords\"\\s+content=\")([^\"]*)(\"(?:.|\n)+)", "$2"); //[^\"]* stops at the closing quote of the content attribute
         }
         if (HasDescription.IsMatch(remoteURLHTML))
         {          
             METADescription = "DESCRIPTION = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<meta\\s+name=\"description\"\\s+content=\")([^\"]*)(\"(?:.|\n)+)", "$2"); //[^\"]* stops at the closing quote of the content attribute
         }
         if (HasTitle.IsMatch(remoteURLHTML))
         {
             METATitle = "TITLE = " + Regex.Replace(remoteURLHTML, "((?:.|\n)+?<title>)(.+)(<\\/title>(?:.|\n)+)", "$2");
         }

         cleanHTML = remoteURLHTML;

         //now get main content which is between open close body tags
         cleanHTML = Regex.Replace(cleanHTML, "((?:.|\n)+?<body.*?>)((?:.|\n)+?)(<\\/body>(?:.|\n)+)", "$2");

         //strip any client side script by removing anything between open and close script tags         
         cleanHTML = Regex.Replace(cleanHTML, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
      
         //put a gap before words that appear just before closing tags so that we keep gaps between values from listboxes
         cleanHTML = Regex.Replace(cleanHTML, "(\\w)(<\\/\\w)", "$1 $2");

         //strip HTML tags
         cleanHTML = Regex.Replace(cleanHTML, "<[^>]+?>", "");

         //Decode the HTML so that any encoded HTML entities are converted back to plain characters
         cleanHTML = HttpUtility.HtmlDecode(cleanHTML);
       
         //now add all the content we want to index back together
         if (!String.IsNullOrEmpty(METAKeywords))
         {          
             indexText.Append(METAKeywords + " ");
         }
         if (!String.IsNullOrEmpty(METADescription))
         {          
             indexText.Append(METADescription + " ");
         }
         if (!String.IsNullOrEmpty(METATitle))
         {          
             indexText.Append(METATitle + " ");
         }
         if (!String.IsNullOrEmpty(cleanHTML))
         {
             indexText.Append(cleanHTML);
         }

     }

     return indexText.ToString();
 }

I have left out the other function that strips noise words, numbers and swear words as it's nothing special, just a couple of loops that check some arrays containing the noise words that need removing.

The performance of this method varies slightly depending on the size of the content that is being parsed. Also, it's possible to leave any noise words and numbers in the content, as these will not get added to any Full Text Index anyway because SQL Server will automatically ignore most noise words and numbers. However, if data storage is an issue you may still want to strip them so that you only save core content to the database table.
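For anyone wondering what such a clean-up might look like, here is a minimal sketch of the idea in JavaScript. It is not the function I left out: stripNoise, its parameters and the sample word lists are all assumptions for illustration.

```javascript
// Hypothetical clean-up step: removes noise words, swear words and pure numbers
function stripNoise(text, noiseWords, swearWords) {
  // Build one lower-cased lookup of every word that should be dropped
  const banned = new Set(noiseWords.concat(swearWords).map(function (w) {
    return w.toLowerCase();
  }));

  return text
    .split(/\s+/)                                  // break the content into words
    .map(function (word) {
      return word.replace(/^\W+|\W+$/g, "");       // trim punctuation from both ends
    })
    .filter(function (word) {
      if (word.length === 0) return false;         // nothing left after trimming
      if (/^\d+$/.test(word)) return false;        // pure numbers
      return !banned.has(word.toLowerCase());      // noise or swear words
    })
    .join(" ");
}

console.log(stripNoise("The 3 quick brown foxes, and the 42 dogs.", ["the", "and"], []));
```

A Set lookup is used here instead of nested loops over arrays, which keeps the cost roughly proportional to the length of the content.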

Thursday, 2 October 2008

Using Javascript to Parse a Querystring

Javascript has no support for querystring parsing

There may be times when you need to parse a querystring using Javascript. I recently found myself needing to do this when rewriting a web application, as lots of places were using inline Javascript due to parameters being output into the HTML source through server side code. I wanted to move as much inline script as possible out of the page source into external files, and I found that in a lot of cases I could only do this by using Javascript to access the querystring instead of server side code.

As Javascript has no inbuilt object for this that you can reference, you will have to create your own, making use of the location.href and location.search properties, which hold the current URL and the querystring portion of the URL.

A lot of scripts will use string parsing to split the querystring up on & and =, which is fine if you remember about the # anchor portion of the URL. However my script uses regular expressions to split the querystring up into its constituent parts.
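For comparison, this is roughly what the string splitting approach looks like when it does remember the anchor. It is a sketch only, with splitQuery being a hypothetical name, and it simply keeps the last value when a key is repeated.

```javascript
// Hypothetical split-based parser: cuts off the #anchor first, then splits on & and =
function splitQuery(url) {
  // Everything after the first # is the anchor and must not end up in a value
  const hashAt = url.indexOf("#");
  const anchor = hashAt >= 0 ? url.slice(hashAt + 1) : "";
  const withoutAnchor = hashAt >= 0 ? url.slice(0, hashAt) : url;

  // Everything after the first ? is the querystring
  const queryAt = withoutAnchor.indexOf("?");
  const query = queryAt >= 0 ? withoutAnchor.slice(queryAt + 1) : "";

  const params = {};
  const pairs = query.split("&");
  for (let i = 0; i < pairs.length; i++) {
    if (!pairs[i]) continue; // skip empty pieces such as a trailing &
    const eqAt = pairs[i].indexOf("=");
    const key = decodeURIComponent(eqAt >= 0 ? pairs[i].slice(0, eqAt) : pairs[i]);
    const val = decodeURIComponent(eqAt >= 0 ? pairs[i].slice(eqAt + 1) : "");
    params[key] = val;
  }
  return { params: params, anchor: anchor };
}

console.log(splitQuery("http://example.com/page?id=1044&name=Rob%20Reid#56"));
```

Forgetting the first step is the usual mistake: without it the last parameter's value would come back as "Rob Reid#56".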

I have created a little object that can be accessed on any page that references it. The features of this object are:

Parse the current location's querystring or supply your own querystring. It may be that you have passed a URL encoded querystring as a value within the querystring and once you have accessed this parameter you need to parse it on its own.
Returns the number of parameters within the querystring.
Returns the parameters as key/value in an associative array.
Option to output the parameters as a formatted string.
Ability to access the anchor value separately if it exists.
Handles parameter values specified multiple times by comma separating them.

The code:



function PageQuery(qry){

this.ParamValues = {};
this.ParamNo = 0;

var CurrentQuery, AnchorValue = "";

//if no querystring passed to constructor default to current location
if(qry && qry.length>0){
 CurrentQuery = qry;
}else{
 if(location.search.length>0){
  CurrentQuery = location.href;
 }else{
  CurrentQuery = "";
 }
}

//may want to parse a query that is not the current window.location
this.ParseQuery = function(qry){ 
  var rex = /[?&]([^=&#]+)(?:=([^&#]*))?/g; //key stops at =, & or # so a key with no value e.g ?a&b=1 is handled
  var rexa = /(\#.*$)/;
  var qmatch, key, val, amatch, cnt=0; //val is declared here so it does not leak into the global scope

  //parse querystring storing key/values in the ParamValues associative array
  while(qmatch = rex.exec(qry)){
   key = denc(qmatch[1]);//get decoded key
   val = denc(qmatch[2] || "");//get decoded value, a key with no value gives an empty string

   if(this.ParamValues.hasOwnProperty(key)){ //if we already have this key then append the new value
    this.ParamValues[key] = this.ParamValues[key] + ","+val;
   }else{
    this.ParamValues[key] = val;
    cnt++;
   }
  }
  //as no length property with associative arrays
  this.ParamNo = cnt;

  //store anchor value if there is one
  amatch = rexa.exec( qry );
  if(amatch) AnchorValue = amatch[0].replace("#","");
 }

//run function to parse querystring and store array of key/values and any anchor tag
if(CurrentQuery.length){
 this.ParseQuery( CurrentQuery );
} 

this.GetValue = function(key){ if(!this.ParamValues[key]) return ""; return this.ParamValues[key]; }
this.GetAnchor = AnchorValue;

// Output a string for display purposes
this.OutputParams = function(){
  var Params = "";
  if(this.ParamValues && this.ParamNo>0){
   for(var key in this.ParamValues){
    Params+= key + ": " +  this.ParamValues[key] + "\n";
   }
  }
  if(AnchorValue!="") Params+= "Anchor: " + AnchorValue + "\n";
  return Params;
 }
}

//Functions for encoding/decoding URL used in object

//encode
function enc(val){
if (typeof(encodeURIComponent)=="function"){
 return encodeURIComponent(val);
}else{
 return escape(val);
}
}
//decode
function denc(val){
if (typeof(decodeURIComponent)=="function"){
 return decodeURIComponent(val);
}else{
 return unescape(val);
}
}

How to call the code

To make use of the object's functions you can instantiate the PageQuery object either by passing the string that you want to parse as the constructor's only parameter, or by passing nothing, in which case it will default to the current location's querystring if there is one. Once you have created the object you can then reference the properties you require to return the relevant information.


//create new instance of the object
var qry = new PageQuery(); //defaults to current location if nothing passed to the constructor
//var qry = new PageQuery("?id=1044&name=Rob+Reid#56"); //parse a specific string instead of using location.search
var id = qry.GetValue("id"); //get value for a parameter called id
var anc = qry.GetAnchor; //get the anchor # value if exists
var no = qry.ParamNo; //get the number of parameters
var s = qry.OutputParams();//return a formatted string for display purposes
var p = qry.ParamValues; //return the array of parameters

//loop through array of parameters
if(p){
for(var z in p){
alert("Query Parameter["+z+"] = " +p[z]);
}
}

So as you can see, it's a pretty simple but very flexible function that provides me with all the functionality I require when handling querystrings with Javascript.

Click here to download the ParseQuery object script.
