Wednesday, 3 August 2022

Extending Try/Catch To Handle Custom Exceptions - Part 2

Extending Core C# Object Try/Catch To Display Custom Regular Expression Exceptions - Part 2

By Strictly-Software

This is the second part of our custom Try/Catch extension, which is aimed at catching Regular Expression failures, such as !IsMatch (doesn't match), that would not normally throw an error at all.
 
So why would you want to do this when the C# .NET RegEx functions already return true or false, 0 matches, etc., which you could use for error detection?
 
Well, you can read the first article for more info. However, you may have a case where you have a BOT that runs automatically all day, every day, on certain sites where the HTML has been worked out and expressions have been made especially to match it.

Say the HTML source changes and the piece of data you are collecting suddenly stops being returned because your expression no longer matches anything. If you are just logging to a file, you only find out by scanning the logs for errors once you notice, maybe 2 months later, that your fully automated 24/7/365 system has stopped working.

Therefore, instead of just logging the results of IsMatch, you may want to throw an error and stop the program so that a new expression can be made to match the HTML ASAP without lots of data being missed.
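To make the difference concrete, here is a minimal sketch of the two approaches side by side. PatternFailedException is a hypothetical stand-in for the RegExException class from part 1 (which is not reproduced in this article); its constructor simply mirrors the (expression, inner exception) shape used in the BOT code below. The names FailFastDemo, LogOnly, RequireMatch, and Demo are also made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the RegExException class built in part 1.
public class PatternFailedException : Exception
{
    public string Pattern { get; }

    public PatternFailedException(string pattern, Exception inner)
        : base("Regex failed to match: " + pattern, inner)
    {
        Pattern = pattern;
    }
}

public static class FailFastDemo
{
    // Approach 1: log the failure and carry on. Nothing stops the BOT.
    public static bool LogOnly(string pattern, string source)
    {
        bool matched = Regex.IsMatch(source, pattern);
        if (!matched)
        {
            Console.WriteLine("No match for " + pattern);
        }
        return matched;
    }

    // Approach 2: treat a failed match as an error the caller must deal with.
    public static void RequireMatch(string pattern, string source)
    {
        if (!Regex.IsMatch(source, pattern))
        {
            throw new PatternFailedException(pattern, new Exception("No match could be found"));
        }
    }

    public static void Demo()
    {
        // Pretend the site changed its markup and no longer has a <title> tag.
        string changedHtml = "<html><body><h1>New layout</h1></body></html>";
        string pattern = "<title>";

        LogOnly(pattern, changedHtml); // prints: No match for <title>

        try
        {
            RequireMatch(pattern, changedHtml);
        }
        catch (PatternFailedException ex)
        {
            // prints: Stopping, expression needs updating: <title>
            Console.WriteLine("Stopping, expression needs updating: " + ex.Pattern);
        }
    }
}
```

Call FailFastDemo.Demo() from your own Main() to see both behaviours. With the second approach the failure cannot be overlooked, which is the whole point of throwing rather than logging.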
 
If you want more info on the Try/Catch class, read the first article. Here we build just a basic BOT that utilises the RegExException class we made last time. It is not a perfect BOT and you may get some warnings come up (I did it in VS 2021). It is also a specific BOT, as its aim is to get the META Title, the META Description, and the full source of the URL that is passed into it.

Any errors with the regular expressions not matching anymore are thrown using our RegExException TryCatch object so that they detail the expression, the text, and so on, depending on which of the 3 methods we made last time is called. If you want to use it as the basis of your own BOT you can, but it has been specifically designed to demonstrate this RegEx TryCatch example, so you may want to edit or remove a lot of content first before building back on top of it if you want to make your own BOT in C#.
 
So here is the BOT:

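using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// NOTE: RegExException is the custom exception class from part 1
public class HTML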
{
	public string URL  // property
	{ get; set; }

	public string HTMLSource  
	{ get; set; }

	public string HTMLTitle
	{ get; set; }

	public string HTMLDesc
	{ get; set; }

	public void GetHTML()
	{
	    // if no URL then all properties are blank
	    // (set the properties themselves, not local variables that would hide them)
	    if (String.IsNullOrWhiteSpace(this.URL))
	    {
			this.HTMLTitle = this.HTMLDesc = this.HTMLSource = "";
			return;
	    }

	    try
	    {
			HttpWebRequest request = (HttpWebRequest)WebRequest.Create(this.URL);
			HttpWebResponse response = (HttpWebResponse)request.GetResponse();

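			// NOTE: by default GetResponse() throws a WebException (ProtocolError) for
			// 4xx/5xx responses, so these status checks are only a safety net; such
			// failures normally end up in the WebException handler further down
			if(response.StatusCode == HttpStatusCode.Forbidden)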
			{
		    	throw new Exception(String.Format("We are not allowed to visit this URL {0} 403 - Forbidden", this.URL));
			}
			else if(response.StatusCode == HttpStatusCode.NotFound)
			{
		    	throw new Exception(String.Format("This URL cannot be found {0} 404 - Not Found", this.URL));
			}
            // 200 = OK, we have a response to analyse
            else if (response.StatusCode == HttpStatusCode.OK)
            {
                Stream receiveStream = response.GetResponseStream();
                StreamReader readStream = null;
                if (response.CharacterSet == null)
                {
                    readStream = new StreamReader(receiveStream);
                }
                else
                {
                    readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
                }

                string source = this.HTMLSource = readStream.ReadToEnd();

                if (!String.IsNullOrWhiteSpace(source))
                {
                    // extract title with a simple regex                        
                    string reTitle = @"^[\S\s]+?<title>([\S\s]+?)</title>[\S\s]+$";
                    // Get the match for META TITLE if we can using a wrapper method that takes a regex string, source string and group to return
                    string retval = this.GetMatch(reTitle, source, 1);
                    if (!String.IsNullOrWhiteSpace(retval))
                    {
                        this.HTMLTitle = retval;
                    }
                    else // failed to find the <title>Page Title</title> in the HTML source so throw a Regex Exception
                    {
                        throw new RegExException(reTitle, new Exception(String.Format("No match could be found for META Title with {0}", reTitle)));
                    }

                    // META DESCRIPTION
                    string reDesc = @"^[\s\S]+?<meta content=[""']([^>]+?)['""]\s*?name='description'\s*?\/>[\s\S]*$";
                    // Get the match for META DESC if we can using a wrapper method that takes a regex string, source string and group to return
                    retval = this.GetMatch(reDesc, source, 1);
                    if(!String.IsNullOrWhiteSpace(retval))
                    {
                        this.HTMLDesc = retval;
                    }
                    else // failed to find the META description tag in the HTML source so throw a Regex Exception
                    {
                      throw new RegExException(reDesc, new Exception(String.Format("No match could be found for META Description with {0}", reDesc)));
                    }
                }

                response.Close();
                readStream.Close();
            }
	    }
	    catch(WebException ex)
	    {
          	// handle 2 possible errors
          	if(ex.Status==WebExceptionStatus.ProtocolError)
          	{
              Console.WriteLine("Could not access {0} due to a protocol error", this.URL);
          	}
          	else if(ex.Status==WebExceptionStatus.NameResolutionFailure)
          	{
              Console.WriteLine("Could not access {0} due to a bad domain error", this.URL);
          	}
	    }
	    // catch anything we haven't handled (including our RegExException) and rethrow so the caller can handle it
	    catch (Exception)
	    {
			throw;
	    }
	}


	private string GetMatch(string re,string source,int group)
	{
	    string ret = "";

	    // sanity test
	    if (String.IsNullOrWhiteSpace(re) || String.IsNullOrWhiteSpace(source) || group == 0)
	    {
			return "";
	    }
	    // use common flags these could be passed in to method if need be
	    Regex regex = new Regex(re, RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);

	    if (regex.IsMatch(source))
	    {
			MatchCollection matches = regex.Matches(source);

			Console.WriteLine(String.Format("We have {0} matches", matches.Count.ToString()));

			// return the requested group from the first match where it succeeded
			foreach (Match r in matches)
			{
				if (r.Groups[group].Success)
				{
					ret = r.Groups[group].Value.ToString().Trim();
					break;
				}
			}
	    }
	    // return matched value
	    return ret;
	}
}
 
So as you can see, this HTML class is my basic BOT. It has 4 properties: URL (which just sets the URL of the page to get), then HTMLSource, HTMLTitle, and HTMLDesc, which are abbreviations of the 3 main bits of content we want from our BOT: the whole HTML source, the META Title, and the META Description. We set these in our HTML class's GetHTML() method, which does the main work but uses another method, GetMatch(), to try and get a match from the parameters passed in, e.g. the expression, the source code to search, and the number of the capture group to return.
 
For example, there may be bad XHTML/HTML all over the web where people have used IDs multiple times or put in elements that should only exist once multiple times. The last parameter picks which capture group of the expression you get back, usually 1 for the first, and GetMatch() returns that group from the first match where it succeeded, so any later duplicates are ignored.
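As a rough illustration of how that group parameter behaves, here is a cut-down sketch of the same idea run against a made-up page with a duplicated title tag. GetMatchDemo, FirstGroup, and Demo are hypothetical names for this example only, and the sample HTML is invented; the title expression is the one used in the BOT above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GetMatchDemo
{
    // Same idea as GetMatch(): return the requested capture group from the
    // first match in which that group succeeded, or "" if there is none.
    public static string FirstGroup(string re, string source, int group)
    {
        if (String.IsNullOrWhiteSpace(re) || String.IsNullOrWhiteSpace(source) || group <= 0)
        {
            return "";
        }

        Regex regex = new Regex(re, RegexOptions.IgnoreCase | RegexOptions.Multiline);

        foreach (Match m in regex.Matches(source))
        {
            if (m.Groups[group].Success)
            {
                return m.Groups[group].Value.Trim();
            }
        }
        return "";
    }

    public static void Demo()
    {
        // Invented sample page with a (badly) duplicated <title> element
        string html = "<html><head><title> First Title </title><title>Second</title></head><body>Hi</body></html>";
        string reTitle = @"^[\S\s]+?<title>([\S\s]+?)</title>[\S\s]+$";

        // Group 1 of the first match wins, so the duplicate is ignored
        Console.WriteLine(FirstGroup(reTitle, html, 1)); // prints: First Title

        // An expression that no longer fits the page gives back an empty string,
        // which is the case where the BOT throws its RegExException
        string missing = FirstGroup(@"<h1>([\S\s]+?)</h1>", html, 1);
        Console.WriteLine(missing.Length); // prints: 0
    }
}
```

Call GetMatchDemo.Demo() from your own Main() to try it. Note that an empty return value is the only signal of failure here, which is why GetHTML() checks it and decides whether to throw.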
 
Hopefully, you can see what the class is doing and how the results of the IsMatch method can fire off a custom RegExException or not and how it is formatted.
 
In the next article, we will show the code that uses this class to return data and put it all together.
 
 
© 2022 By Strictly-Software 
