Wednesday 6 March 2024

A New Test For The Brave Browser and New Security Focused Browsers

How Can We Test For Brave?

By Strictly-Software

I use the Brave browser as well as a few other Chromium-based security-focused browsers such as (Opera, CCleaner, and DuckDuckGo) most of the time due to their inbuilt security measures. My preferred browser is Brave due not only to its Shield which removes trackers, cookies, adverts, and has a measure of built-in fingerprint spoofing. Plus it has its own search engine, which can be accessed from the address bar, and it doesn't block certain sites like Google and other search engines do.

It also blocks Google's Accelerated Mobile Pages and takes you to the original publisher's site, can have strict upgrades to HTTPS URLs when linked to an unsecured domain, as well as its incognito pages are based on the TOR engine.

I never use Chrome anymore as it used to be a quick plain browser but has now got bogged down with too many options. Also, I don't like Google which relies on heavy use of adverts and selling user information for revenue, plus its other links to intelligence agencies, and its censorship. This has pushed mainstream media articles ahead of legitimately more relevant sites. I also don't like it's use of banned reading lists that try to prevent people from viewing sites the US Intel Establishment has deemed unsuitable like RT.com or Infowars.com

Plus I don't know what they now do server side to try and identify users due to all these new privacy-based browsers, and the number of privacy plugins/extensions, that could help protect you. However, it's good to know you can have a measure of protection without bogging the browser down with plugins such as fingerprint spoofers like Trace and AdBlockers like MalwareBytes or the DuckDuckGo plugin. Also, I like being able to earn cryptocurrency from Brave which rewards the user, not the site for viewing small adverts.

However, the issue with Brave is that it hides as Chrome and doesn't have its own user agent. It used to, and the hope is that in the future it will again but at the moment it just shows a Chrome user-agent. 

For example, my latest Brave user-agent is:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36

There have also been a lot of changes to how Brave identified itself in the Window and Navigator sub-object that used to make Brave identifiable through inspection of certain objects. These seem to have changed back and forth many times and you can see some of the old objects that used to be checked in my older article on looking for Brave here.

Why bother looking for Brave, or indeed any browser when feature detection should be used rather than user-agent sniffing or other means to find the name of the Browser?

Well, Brave prides itself on security, hiding adverts, removing tracking code and cookies, and now has a certain level of fingerprint sniffing protection to stop sites from using properties identified by JavaScript or Server Side code to change Response headers to make identifying unique users and traffic to sites much harder.

Therefore you may want to write Chromium extensions that add extra protection to Brave browsers or do the reverse, add code back in to replace removed adverts, or help track link clicks with removed Ping attributes or other devious plans your boss wants you to run on sites accessed by people using Brave. 

I don't know, I just like to play with the code and see what object detection features each browser reveals. They may all be based on Chromium but they can all have unique features.

An updated function from my previous article makes use of the fact that Brave now identifies itself in the window.navigator object, specifically the brave object in navigator and the isBrave property within that. 

First I rule out any Mozilla browsers e.g Firefox by looking for a well-known Mozilla property, and ensure the browser is Chromium based by checking for a Webkit property. It is similar to the two line old function but uses a new Webkit property as the old one has been removed and is wrapped in a function.

function IsBrave(){
	
	let w=window,n=w.navigator; // shorten key objects

	// as many tests we know browsers now support; prove its not Mozilla; prove its Chromium based and has Brave properties in window and navigator objects
	let isBrave = !("mozInnerScreenX" in w) && ("chrome" in w && "onwebkitanimationiteration" in w && "brave" in n && "isBrave" in n.brave) ? true : false;
	
	return isBrave;
}

And if you want to check for the new CCleaner browser there isn't any specific objects in the navigator object I can see but it does have a unique user-agent e.g:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 CCleaner/121.0.0.0

However, the Duckduckgo browser, which comes with an email protection service, allowing you to use a specific @duck.com email address that only forwards non-spam emails to your real address, does have a specific object if you want to check for it.

// detect duckduckgo
let duckduckgo = ("duckduckgo" in navigator) ? true : false;
However, we all should be detecting objects to allow for features rather than the old method of user-agent sniffing, but if you did want to identify certain browsers there are unique objects, usually on the navigator object that can be checked.

You should know that there are lots of ways you can protect yourself from tracking and unique identification nowadays but it does rely on the device you are using to browse as well as the model of that device.

For instance, even if you are now using IPv6 addresses on your router which makes unique identification a lot easier than the older DCHP method of picking a free IPv4 address as local to you as possible when accessing the web, your system can be set to use temporary IPv6 addresses.

This means that your real unique identifying IP address is never shown to outside sites. A look at ipconfig /all in your command prompt will show you whether you are using them or not and a search online will show you how to change the settings if you're not using them yet.

Also, if you are using Android or iOS phones or devices that use Google as an integral part of their system, you can delete the unique advertising ID that exists to allow sites to uniquely track you whether or not your browser or plugin removes Google tracking codes. 

On iOS the ad identifier is also called "IDFA" and on Android, "AAID". As these IDs can only be accessed using server-side code, Java or Python, and a Google library, modifying the DOM won't stop you from being tracked. 

You can easily remove these IDs in the Privacy / Advertising section of your phone's settings if you have Android 12 or above. iOS devices are more complicated but this article explains how to remove them.

If you want to check what properties exist for a browser object then these two lines of code can help you out rather than a loop.
const keys = Object.keys(window);
console.log(keys);

This just outputs all the keys such as events, other objects like document and navigator, and properties as well into the developer tools console area that all modern browsers have.

Remember as Brave, CCleaner, DuckDuckgo, Edge, Opera and Chrome are all based on the same Chromium browser they are the same standard-compliant browsers. However, as I stated in my last article, it is amazing how many modern sites still break when I use a user-agent switcher and change my string to IE6 for example. 

They really shouldn't if they were using feature detection by checking for the existence of objects before running certain code.

By Strictly-Software

Sunday 22 January 2023

TSQL Batch Updates SQL 2005 - 2008

Updating tables in Batches to prevent locking in SQL

There are times when you may need to carry out UPDATES on large tables that are in use and constantly being inserted, deleted, or updated.

If you carry out a large UPDATE that affects all of the rows in the table, and in my case millions of rows, then the table will be locked for the duration of the update and any other processes that may need to carry out DML statements will be BLOCKED from doing so.

For example, you may experience long delays caused by locking when trying to return data from this table on a website or API service. Or even worse deadlocks but you will most definitely experience performance issues and if any SELECT statements that access the data don't use a WITH (NOLOCK) statement they too will have to wait in line for the UPDATE to finish.

Obviously wrapping WITH (NOLOCK) onto every SELECT statement is not a good solution unless you know what you are doing as it will provide dirty reads and you may end up giving your users old or incorrect data

This might be fine for some scenarios but in critical applications where data integrity is key then you need another solution that provides data integrity and allows you to UPDATE the table without performance problems.

When I find myself requiring the need to UPDATE every record in a large table I use a BATCH UPDATE process which cuts the large UPDATE statement down into lots of small UPDATES that affect only a few rows at a time.

By doing this the UPDATE rattles through all the tows of even large tables very quickly as long as the batch size of records updated in each loop iteration is small enough not to cause locking that may affect front-end processes.

For example instead of the whole table getting locked for an hour with lots of blocked processes building up behind waiting for it to finish it would instead only be locked for lots of little time periods lasting seconds or less.

These smaller locking periods allow other processes in to do their work and if the batch size is small enough and you have appropriate indexes you might find that you won't experience a full table lock anyway.

There are various methods for carrying out this approach and you should tailor your BATCH SIZE to your own requirements. Before SQL 2005 you could use the: SET NOCOUNT 50 command to set the size of the batch but in SQL 2005 and beyond you can use a variable directly with an UPDATE TOP (@VAR) command.

This is an example of a BATCH UPDATE, that uses a column in the table called LastUpdated, which gets updated every time the row in the table is. You could do this either through stored procedures that update the table, or triggers on insert and update. However because on each loop I update this column it means on the next loop the same records won't get touched as the time is within the 20 minutes I have set as the BATCH to be updated, 

Obviously, this must be tailored to your own system, whether you create an "updated" flag column that is defaulted to 0, and then on the BATCH UPDATE set to 1, and the WHERE statement that selects which TOP X records are looked at ignore any that have been set to 1.

You definitely need something to change on UPDATE, otherwise, you will find this process going on forever as there is no way to order the records for that the UPDATE statement so that it could keep getting the same TOP(X) records on each batch, enabling the process to rattle on forever with you scratching your head wondering why. 

If you can wangle an ORDER BY statement with a convoluted statement then that might work, however having a simple date or flag that is updated within the batch, and is also checked on each loop so that the same records are not looked at over and over again is the easy answer to this issue. 

SET NOCOUNT ON
SET DATEFORMAT YMD

DECLARE @ROWS INT, @TOTALROWS INT, @BATCHSIZE INT

SELECT @ROWS = 1,
@TOTALROWS = 0,
@BATCHSIZE = 50

-- As we start @ROWS at 1 and we know there are thousands of records to update
-- then when it gets to the stage where the UPDATE returns @@rowcount of 0
-- we have finished the criteria of the loop and so exit it the sanity check ensures this 
DO WHILE @ROWS > 0 
BEGIN
     
     -- Show the time this batch started as it might take hours
     PRINT 'Job started at ' + UPPER(FORMAT(GETDATE(),'hh:mm:ss dd/MMM/yyyy'))

     -- We must have a way that we don't keep updating the same records over and over again,
     -- so I use the LastUpdated date which gets updated on each batch update then checked
     -- in the where clause to ensure the date is at least 20 minutes in the future

     -- Update data in the table in batches to prevent blocking if we have to do this
     -- whilst people are using the site and accessing the table at the same time
     UPDATE  TOP(@BATCHSIZE) MyTable
     SET     MyColumn = dbo.udf_SOME_FUNCTION(MyPK),
             Flag = 1,
             LastUpdated = GETDATE()
     WHERE   SomeDate > '2023-JAN-01'
             AND Flag = 0 -- put in for the update
             -- this could be cut out
             AND DATEDIFF(MINUTE,LastUpdated,GETDATE())>20

     SELECT @ROWS = @@ROWCOUNT, @TOTALROWS = @TOTALROWS + @ROWS

    
     PRINT 'Updated ' + CAST(@ROWS as varchar) + ' in batch'

     -- As this UPDATE job may take hours we want other processes
     -- to be able to access the tables in case of locks therefore
     -- we wait for 2 seconds between each BATCH to allow
     -- time for these processes to aquire locks and do their job
     WAITFOR DELAY '00:00:02'

     -- samity check
     IF @ROWS = 0
       BREAK
 
 END

PRINT 'Updated ' + CAST(@TOTALROWS as varchar) + ' total rows'

I am currently using this process now to update a table that is constantly being accessed by a very busy API system that has over a million rows in it and by using BATCH UPDATES it isn't causing any BLOCKING, LOCKING or performance issues at all.

If you really wanted to give the SELECT statements looking at this table as you UPDATE it in batches then you could add in a DELAY within each loop e.g a 2-second DELAY after the UPDATE statement and the SELECT @ROWS = ..... that collects stats for you to look at after the process has finished would just be something like this.
WAITFOR DELAY '00:00:02'
So hopefully this might help you out if you are having to UPDATE large tables that are also in use at the same time by websites, API's, services, or 3rd parties.

© 2023 - By Strictly-Software

Sunday 30 October 2022

What do you do when you have multiple blogs under one Google Blogger Account?

Changing The Names Of Replies And Posts

By Strictly-Software

If you are a blogger user and you have multiple blogs each with different names then you might run into the problem that when you post, or reply to a message using your Google account it will put your main Blogger account name in as the author, or comment author.

Now the way I have got round this is to use JavaScript in the footer of the blog. Add a JavaScript / HTML gadget to the footer of your blog and then you can fix this issue.

I will show you two ways you can fix this using old JavaScript and newer JavaScript .

To fix the name of post authors you can use this code. Say your blogger author name is Dave Williams but you are using pseudo names for your blog, say "Strictly-Software" for example then you don't want Dave Williams posted at the end of each post, you want Strictly-Software to appear.

Wrap these in SCRIPT blocks, whether a short <script>....</script> block or a full old style <script type="text/javascript"><!-- ..... //--></script> tags. Up to you.

This looks for all the spans with a class of "fn" which appears under the post where it says "Posted by ..... at 9:58pm" etc. So we loop through all the fn classes and change the innerHTML to the value we want.
 
(function(){
var x = document.getElementsByClassName("fn");
var i;
for (i = 0; i < x.length; i++) {
    x[i].innerHTML = "Strictly-Software";
}
})()

Now what about comments? 

If you are replying to someone and you don't want to use the Name/URL option to put your blogs name in, then you can use the Google option so it knows you own the comment, but then you can use some code to identify the links which contain the Google Authors name and if found update them to the name you want.

Same as before with the SCRIPT tags, although I put both together one after the other in one script block at the bottom of my layout.

Now, this code uses a more modern document.querySelectorAll method to find all elements that match the selector we pass in which will be an anchor tag. This is because all comments have the name of the person commenting wrapped in a comment tag IF you use a Google Account to post it. 

Obviously, Anonymous and Name/URL comments can't as there is nowhere to link to.

(function(){
	var x = document.querySelectorAll("a");
	var i;
	for (i = 0; i < x.length; i++) {		
		if(x[i].innerHTML=="Dave Williams"){
			x[i].innerHTML="Strictly-Software"
		}		
	}
})();

As you can see I am looping through all the elements that the querySelectorAll("a"); returns. 

If we had used just querySelector it would have only found the first A tag on the page and we need to find and replace all comments on the page which have Dave Williams as the current contents and if found, we replace it with Strictly-Software by using the innerHTML of the element.

I just thought I would post this as I had to do this tonight on one of my other blogs as I noticed the comment author was wrong. You could use the 2nd method for getting the author name as well if you wanted but I thought I would show you two ways of doing this.

Hope this helps if you have the same issue.

By Strictly-Software

Thursday 11 August 2022

Testing For Internet Connectivity

Setup Measures For A Big API System

By Strictly-Software

I have a BOT that runs nightly, and in the SetUp() methods to ensure everything is okay, it runs a number of tests before logging to an API table in my database. This is so I know if the API I am using is available, whether I needed to login to it, when the last job was started and stopped, and when the last time I did log into the API. 

It just helps in the backend to see if there are any issues with the API without debugging it. Therefore the SetUp job has to be quite comprehensive.

The things I need to find out, which are logged into the logfile.log I keep building until the job is finished with major Method Input Params, Return Params, Handled Errors, Changed System Behaviour, SQL Statements that return an error e.g timeout, an unexpected divide by zero error and the like are the following from my SetUp job.

1. Whether I can access the Internet, this is done with a test to a page that just returns an IPv4 or IPv6 address. The major HTTP Status code I look for there is a 200 status code. If I get nowhere but get a status code of 1 that is a WebExceptionStatus.NameResolutionFailure, which means the DNS could not find the location and it's probably due to either your WIFI button being turned off (Flight Mode), or your Network Adapter having issues.

I test for web access with an obvious simple HTTP test to an IP page that returns my address, with 2 simple regular expressions that can ID if it's IPv4 or IPv6.


public string TestIPAddress(bool ipv4=true)
{
    string IPAddress = "";
    // if we are using an IPv6 address the normal page will return it if we want to force
    // get an IPV4 address then we go to the IPv4 page which is the default behaviour due to
    // VPNS and Proxies using IPV4 usually
    string remotePageURL = ((ipv4) ? "https://ipv4.icanhazip.com/" : "https://icanhazip.com/");
   
    this.UserAgent = this.BrainiacUserAgent; // use our BOT UserAgent as no need to spoof
    this.Referer = "LOCALHOST"; // use our localhost as we are running from our PC
    this.StoreResponse = true; // store the reponse from the page
    this.Timeout = (30000); // ms 30 seconds            

    // a simple BOT that just does a GET request I am sure you can find a C# BOT example on my blog or tons on the web
    this.MakeHTTPRequest(remotePageURL);            

    // if status code is between 200 & 300 we should have data
    if (this.StatusCode >= 200 && this.StatusCode < 300)
    {
		IPAddress = this.ResponseContent; // get the IP Address

		// quick test for IPAddress Match, IPv4 have 3 dots
		if(IPAddress.CountChars('.')==3)
		{
		   // An IpV4 address can replace dots and if all numbers then it is IPv4
		   string ip2 = IPAddress.Replace(".", "");
		   if(Regex.IsMatch(ip2, @"^\d+$"))
		   {
				this.HelperLib.LogMsg("IP Address is IPv4: " + IPAddress + ";","MEDIUM"); // log to log file
		   }
		}
		// otherwise if it contains : semi colons (and nos and certain HEX letters)
		else if (IPAddress.Contains(':'))
		{
		   // could be an IpV6 address check only numbers and letters when colons are removed                    
		   if (Regex.IsMatch(IPAddress, @"^[:0-9A-F]+$"))
		   {
				this.HelperLib.LogMsg("IP Address is IPv6: " + IPAddress + ";","MEDIUM"); // log to log file
		   }
		}
    }
    else
    {
		// no response, flight mode button enabled?
		this.HelperLib.LogMsg("We could not access the Internet URL: " + remotePageURL + "; Maybe Flight Mode is on or Adapter is broken and needs a refresh", "LOW");

		IPAddress = "No IP Address returned; HTTP Errror: " + this.StatusCode.ToString() + "; " + this.StatusDesc + ";";

		// call my IsNetworkOn function to test we have a network
		if(this.IsNetworkOn())
		{
		   this.LastErrorMessage = "The Internet Network Is On but we cannot access the Internet";
		}
		else
		{
		   this.LastErrorMessage = "The Internet Network Is Off and we cannot access the Internet";
		}

		this.HelperLib.LogMsg(this.LastErrorMessage,"LOW"); // log error

		// throw the error will be caught in a lower method and stop the script
		throw new System.Exception(this.LastErrorMessage);
    }
}


If that fails then I run this method to see if the Network is available. You can see in the code above that I call it if there is no 200-300 HTTP status response or IP address returned. 

I test it by running the console script with and without the flight mode button on my laptop.

public bool IsNetworkOn()
{
bool IsNetworkOn = System.Net.NetworkInformation;

NetworkInterface.GetIsNetworkAvailable();

return IsNetworkOn;
}
I also have a series of Proxy addresses I can use to get round blocks on IP addresses although I do try and use Karmic Scraping e.g no hammering of the site, caching pages I need to prevent re-getting them, gaps between retries if there is a temporary error where I change the referer, user-agent and proxy (if required), before trying again after so many seconds.

Also before I turn on a global setting in my HelperLib class which is the main class all other objects refer to and pass from one to another, I test to see if the computer I am on is using a global proxy or not. I do an HTTP request to a page that returns ISP info, and I have a list of ISP's which I can check

If my ISP comes back with the right ISP for my home, I know I am not on a global proxy, also it tells me whether the IP address from earlier is IPv4 or IPv6, then I know I am not using a global proxy. If Sky or Virgin is returned I know I have my VPN enabled on my laptop and that I am using a global proxy.

If I am using a global proxy all the tests and checks for "getting a new random proxy:port" when doing an HTTP request are canceled as I don't need to go through a VPN and then a proxy. If the Global Proxy is turned off I might choose to get a random proxy before each HTTP call. 

If that setting is enabled in my Settings class. As you can't have config page in console scripts that hook into a DLL, my code in the DLL for my settings is just a class called Settings with a big Switch Statement where I get the setting I need or set it.

Once I have checked my Internet access I then do an HTTP call to the Betfair API Operational page which tells me if the API is working or not. If it isn't then I cannot log in into it. If it is working I do a Login test to ensure I can login this involves:
  • Using 1 of 2 JSON methods to login to the 2 endpoints that give me API access with a hashed API code, and my username & password.
  • If I can log in, I update the API table in the DB with the time I logged in and that I am logged in. There may be times I don't need to login to the API such as when I am just scraping all the Races and Runner info.
  • I then do a Session re-get, I hold my session in a file so I don't have to re-get a new one each time but on a run I like to get a new Session and extract the session code from the JSON return to pass back in JSON calls to get data etc.
  • I then do a database test using my default connection string I just SELECT TOP(1) FROM SYS.OBJECTS and ensure I get a response.

Once I know the API is operational, the DB connectivity is working and that I can scrape pages outside the Betfair API I can set all the properties such as whether to use a random proxy/referer/user-agent if one of my calls errors (with a 5 second wait), if I am using a global proxy I don't bother with the proxies.

Believe it or not that is a lot of code across 5 classes just to ensure everything is okay on running the job from the console or Windows Service

I throw my own errors if I cannot HTTP scrape, not just log them, so I can see what needs fixing quickly, and I create a little report with the Time, BOT version, Whether it is live, testing, placing real bets, running from a service or console and other important info I might need to read quickly.

So this is just a quick article on how I go about ensuring a project using a big DLL (with all the HTTP, HelperLib, API etc) code is working when I run it.

Who knows you might want to do similar checks on your large automated systems. 

Who knows. I am just making bank from my AutoBOT which TRADES on Betfair for me!

© 2022 Strictly-Software

Wednesday 3 August 2022

Extending Try/Catch To Handle Custom Exceptions - Part 2

Extending Core C# Object Try/Catch To Display Custom Regular Expression Exceptions - Part 2

By Strictly-Software

This is the second part of our custom Try/Catch overwrite program that is aimed to catch Regular Expression errors that would normally be thrown up by !IsMatch (doesn't match) and so on.
 
So why would you want to do this when the C# .NET RegEx functions already return a true or false or 0 matches etc which you could use for error detection?
 
Well, you can read the first article above this one for info however you may have a case where you have a BOT that is automatically running every day all day on certain sites where the HTML has been worked out and expression especially made to match them. 

If for example the HTML source changes and the piece of data you are collecting suddenly does not get returned anymore as your expression does not match anything instead of just logging to a file which means scanning and checking for errors once you notice your fully automated 24/7/365 system has stopped working 2 months later. 

Therefore instead of just logging the results of IsMatch you may want to throw up an error and stop the program so that a new expression can be made to match the HTML ASAP without lots of data being missed.
 
If you want more info on the Try/Catch class, read the first example article. Here we build just a basic BOT that is going to utilise the RegExException class we made last time. It is not a perfect BOT, you may get some warnings come up, I did it in VS 2021, and it's a specific BOT as its aims are to get the META Title, Description, and the full source of the URL that is passed into it. 

Any errors with the regular expressions not matching anymore are thrown up using our RegExException TryCatch cobject so that they detail the expression, text, or whichever of the 3 methods we made. If you want to use it as the basis of your own BOT you can but has been specifically designed to demonstrate this RegEx TryCatch example so you may want to edit or remove a lot of content first before building back on top of it if you want to make your own BOT in C#.
 
So here is the BOT

public class HTML
{
	public string URL  // property
	{ get; set; }

	public string HTMLSource  
	{ get; set; }

	public string HTMLTitle
	{ get; set; }

	public string HTMLDesc
	{ get; set; }

	public void GetHTML()
	{
	    string HTMLSource = "";
	    string HTMLTitle = "";
	    string HTMLDesc = "";

	    // if no URL then all properties are blank
	    if (String.IsNullOrWhiteSpace(this.URL))
	    {
			HTMLTitle=HTMLDesc = HTMLSource = "";
			return;
	    }

	    try
	    {
			HttpWebRequest request = (HttpWebRequest)WebRequest.Create(this.URL);
			HttpWebResponse response = (HttpWebResponse)request.GetResponse();

			if(response.StatusCode == HttpStatusCode.Forbidden)
			{
		    	throw new Exception(String.Format("We are not allowed to visit this URL {0} 403 - Forbidden", this.URL));
			}
			else if(response.StatusCode == HttpStatusCode.NotFound)
			{
		    	throw new Exception(String.Format("This URL cannot be found {0} 404 - Not Found", this.URL));
			}
            // 200 = OK, we have a response to analyse
            else if (response.StatusCode == HttpStatusCode.OK)
            {
                Stream receiveStream = response.GetResponseStream();
                StreamReader readStream = null;
                if (response.CharacterSet == null)
                {
                    readStream = new StreamReader(receiveStream);
                }
                else
                {
                    readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
                }

                string source = this.HTMLSource = readStream.ReadToEnd();

                if (!String.IsNullOrWhiteSpace(source))
                {
                    // extract title with a simple regex                        
                    string reTitle = @"^[\S\s]+?<title>([\S\s]+?)</title>[\S\s]+$";
                    // Get the match for META TITLE if we can using a wrapper method that takes a regex string, source string and group to return
                    string retval = this.GetMatch(reTitle, source, 1);
                    if (!String.IsNullOrWhiteSpace(retval))
                    {
                        this.HTMLTitle = retval;
                    }
                    else // failed to find the <title>Page Title</title> in the HTML source so throw a Regex Exception
                    {
                        throw new RegExException(reTitle, new Exception(String.Format("No match could be found for META Title with {0}", reTitle)));
                    }

                    // META DESCRIPTION
                    string reDesc = @"^[\s\S]+?<meta content=[""']([^>]+?)['""]\s*?name='description'\s*?\/>[\s\S]*$";
                    // Get the match for META DESC if we can using a wrapper method that takes a regex string, source string and group to return
                    retval = this.GetMatch(reDesc, source, 1);
                    if(!String.IsNullOrWhiteSpace(retval))
                    {
                        this.HTMLDesc = retval;
                    }
                    else // failed to find the <title>Page Title</title> in the HTML source so throw a Regex Exception
                    {
                      throw new RegExException(reDesc, new Exception(String.Format("No match could be found for META Description with {0}", reDesc)));
                    }
                }

                response.Close();
                readStream.Close();
            }
	    }
	    catch(WebException ex)
	    {
          	// handle 2 possible errors
          	if(ex.Status==WebExceptionStatus.ProtocolError)
          	{
              Console.WriteLine("Could not access {0} due to a protocol error", this.URL);
          	}
          	else if(ex.Status==WebExceptionStatus.NameResolutionFailure)
          	{
              Console.WriteLine("Could not access {0} due to a bad domain error", this.URL);
          	}
	    } // catch anything we haven't handled and throw so we can code in a way to handle it
	    catch (Exception ex)
	    {
			throw;
	    }
	}


	private string GetMatch(string re,string source,int group)
	{
	    string ret = "";

	    // sanity test
	    if (String.IsNullOrWhiteSpace(re) || String.IsNullOrWhiteSpace(source) || group == 0)
	    {
			return "";
	    }
	    // use common flags these could be passed in to method if need be
	    Regex regex = new Regex(re, RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);

	    if (regex.IsMatch(source))
	    {
          	MatchCollection matches = regex.Matches(source);

          	Console.WriteLine(String.Format("We have {0} matches", matches.Count.ToString()));

		foreach (Match r in matches)
		{
		    if (r.Groups[group].Success)
		    {
				ret = r.Groups[group].Value.ToString().Trim();
				break;
		    }
		}                
	    }
	    // return matched value
	    return ret;
	}
}
 
So as you can see this HTML Class is my basic BOT. It has 4 properties, URL (which is just to set the URL of the code to get) then: HTMLSource, HTMLTitle, and HTMLDesc which as you can see are all abbreviations of the 3 main bits of the content we want from our BOT, the META Title, META Description, and the whole HTML source. We set these using the code in our HTML classes GetHTML() which does the main work but uses another method GetMatch() to try and get the match of the parameters passed in e.g the expression, the source code to get any match from, and the no of the match to get.
 
For example, there may be bad XHTML/HTML all over the web where people have used ID"s multiple times or put elements that should only exist once in multiple times. With this last parameter, you can ensure you are getting the match you want, usually 1 for the first.
 
Hopefully, you can see what the class is doing and how the results of the IsMatch method can fire off a custom RegExException or not and how it is formatted.
 
In the next article, we will show the content that uses this class to return data and put it all together
 
 
© 2022 By Strictly-Software 

Thursday 26 May 2022

Extending Try/Catch To Handle Custom Exceptions - Part 1

Extending Core C# Object Try/Catch To Display Custom Regular Expression Exceptions - Part 1

By Strictly-Software

Have you ever wanted to throw custom exception errors whether for business logic such as if a customer isn't found in your database, or whether more complex logic has failed that you don't want to handle but raise so that you know about it.
 
For me, I required this solution due to a BOT I use that collects information from various sites every day and uses regular expressions to find the pieces of data I require and extract them. The problem is that the site often changes its HTML source code and it will break a regular expression I have specifically crafted to extract the data using it.
 
For example, I could have a very simple regular expression that is getting the ordinal listing of a name using the following C# regular expression:
 
string regex = @"<span class=""ordinal_1"">(\d+?)</span>";
Then one day I will run the BOT and it won't return any ordinals for my values and when I look inside the log file I find it will have my own custom message due to finding no match such as "Unable to extract value from HTML source" and when I go and check the source it's because they have changed the HTML to something like this:
<span class="ordinal_1__swrf" class="o1 klmn">"1"<sup>st</sup></span> 
 
This is obviously gibberish many designers add into HTML to stop BOTS crawling their data hoping that the BOT developer has written a very specific regex that will break and return no data when met with such guff HTML. 
 
Obviously, the first expression whilst meeting the original HTML alright is too tight to handle extra guff within the source. 
 
Therefore it required a change in the expression to something a bit more flexible in case they added even more guff into the HTML:
 
string regex = @"<span class=""ordinal_+?[\s\S]+?"">""?(\d+?)""?<sup";
 
As you can see I have made the expression looser, with extra question marks ? in case quotes are or are not wrapped around values, and using non-greedy match any character and non-character expressions like [\s\S]+? to handle the gibberish from the point it appears to where I know it has to end at the closing quote or bracket.
 
So instead of just logging the fact I have missing data from crawls I wanted to raise the errors with TRY/CATCH and make the fact that a specific piece of HTML no longer matches my expression an exception that will get raised so I can see it as soon as it happens. 
 
Well with C# you can extend the base object TRY/CATCH so that your own exceptions based upon your own logic can be used. In future articles, we will build up to a full C# Project with a couple of classes, a simple BOT and some regular expressions we can use to test what happens when trying to extract common values from the HTML source on various URL's to throw exceptions.
 

Creating The TRY CATCH CLASS

First off the TRY / CATCH C# Class where I am extending the base object to call, usual messages but using String,Format so that I can pass in specific messages. I have put numbers at the start of each method so that when running the code later we can see which exceptions get called.
 
I have just created a new solution in Visual Studio 2022 called TRYCATCH and then named my RegExException that extends the Exception object. You can call your solution what you want but for a test, I don't see why just following what I am doing is not okay.

[Serializable]
public class RegExException : Exception
{
public RegExException() { }

public RegExException(string regex, string msg)
   : base(String.Format("1 Regular Expression Exception: Regular Expression: {0}; {1}", regex, msg )) { }

public RegExException(string msg)
    : base(String.Format("2 {0}", msg)) { }

public RegExException(string regex, Exception ex)
    : base(String.Format("3 Regular Exception is no longer working: {0} {1}", regex,ex.Message.ToString())) { }

}
 
As you can see the Class has 3 overloaded methods which either just take a single message, a regular expression string and a message, and a regular expression and an exception these values are placed in specific places within the exception messages.
 
You would cause a Regular Expression Exception to be thrown by placing something like this in your code:
 
throw new RegExException(String.Format("No match could be found with Regex {0} on supplied value '{1}'", re, val));
 
Hopefully, you can see what I am getting at, and we will build on this in a future post.
 
© 2022 By Strictly-Software 
 
 

Wednesday 16 February 2022

Blogger EU Cookie Message Missing Problem & Solution

My EU Cookie Message Disappeared From My Site - How To Get It Back

By Strictly-Software

I had a bit of a weird experience recently, I found out, only from Google informing me, that for some reason one of my Blogger sites was not showing the EU Cookie Notice that should appear on all Blogger sites if in a European country where "Consent to use Cookies", is required by all website users.

It used to show, and my other blogger sites were still working and in fact, on my own PC, it was still showing the correct message e.g:

"This site uses cookies from Google to deliver its services and to analyse traffic. Your IP address and user agent are shared with Google, together with performance and security metrics, to ensure quality of service, generate usage statistics and to detect and address abuse."

However, when I viewed the site Google had told me about on another PC this was not appearing for some reason. 

I cleared all the cookies, path, domain, and session using the Web Extension called "Web Developer Toolbar", such as this Chrome version. After installing it, a grey cog appears in your toolbar if you fix it to. 

It is really helpful for turning password fields into text, if you want to see what you are typing, or need to see a Browser stored password in the field, or as needed in this instance, for deleting all kinds of Cookies. So after deleting all the Cookies, I refreshed the page and but the EU Cookie Message still didn't show.

Fixing Blogger Cookie Notice Not Showing

If you view the source of a blogger site that is showing the EU Cookie message, then you should find the following code in your source, not generated source, but the standard "View Source" options when you right-click on the page and view the context menu.

The following code should be just above the footer where all your widget scripts are loaded. Notice that I put some HTML comments above my version of the code so I could easily identify it from Bloggers version in the DOM when viewing the source. Why do this if Bloggers code is not in the source anyway? Wait and see.

<!-- this is my code as bloggers only appear in the source sometimes -->
<script defer='' src='/js/cookienotice.js'></script>
<script>
 document.addEventListener('DOMContentLoaded', function(event) {
      window.cookieChoices && cookieChoices.showCookieConsentBar && cookieChoices.showCookieConsentBar(
          (window.cookieOptions && cookieOptions.msg) || 'This site uses cookies from Google to deliver its services and to analyse traffic. Your IP address and user agent are shared with Google, together with performance and security metrics, to ensure quality of service, generate usage statistics and to detect and address abuse.',
          (window.cookieOptions && cookieOptions.close) || 'Ok',
          (window.cookieOptions && cookieOptions.learn) || 'Learn more',
          (window.cookieOptions && cookieOptions.link) || 'https://www.blogger.com/go/blogspot-cookies');
});
</script> 

So to fix the issue, copy the code out of the page and then go to your layout tool in Blogger. Add a widget at the bottom if no JavaScript/Text widget already exists and copy the code into it.

Now the odd thing is that as soon as I saved the widget and then my blogger site. I went to the website having issues and viewed the source code. When I did I saw that not only was my version of the code was in the HTML, but somehow this had put Bloggers own version back into the HTML as well!

Why this would do that I have no idea. However, it meant that I now had two lots of the same script being loaded and 2 lots of the EU Cookie code that shows the DIV and the options for wording appearing in my HTML.

The good thing though, is that this did not cause a problem for my site. I found by adding that code into a widget at the bottom of my page above Bloggers magically re-appearing code, that it did NOT cause the message to appear twice, but also that when I removed my own version of the code, the Blogger version remained.

Also even though I have put in a version of the Blogger code that uses English sentences into the HTML when I use a Proxy or VPN to visit a European country such as Germany, the wording appears in German.

I suspect that my code runs first as it's first in the DOM, then the Google code runs, overwriting my DIV with their DIV and of course the correct wording for the country we are in.

So as I thought everything was working, I removed my own code and saved the site. I then went to it, deleted path, domain, and session cookies, and then refreshed the page again and saw the blogger cookie code running okay. When viewing the source I could see that my code had gone but the blogger code was now still in the HTML, whereas it wasn't before.

However......

After a few hours when I came back to the computer which had not been showing the message and I re-checked by clearing all cookies (path, domain and session), and saw that Bloggers code had disappeared again, and the message was now not showing again!

Why this has happened I do not know as I had not re-saved the Blogger site in question during the time away so I have no idea what caused the Blogger EU Code to disappear again

There really should be an option in settings to force the Cookie compliance code to be inserted but as there isn't the answer seems to be to just leave your version of the cookie code in the HTML source in a widget at the bottom of the layout.

Why this works without causing issues I have no idea and it sounds like a bodge which it is, but as I cannot find any real answers to this problem online, or in Googles KB, I had to come up with a solution that worked to comply with Google's request and this seems to do it.

So the fact that when you have two lots of the same code in your HTML does NOT cause the message to appear twice is a good thing. This means that even if the original code re-appears then you are okay, and if it doesn't then your own code, which is a direct copy of the blogger code, runs instead. 

Also as your code runs first, if it is causing Bloggers code to also re-appear in the HTML then that will run afterwards ensuring the correct European language is shown in the message.

You can view the JavaScript which is loaded in by Blogger by just appending /js/cookienotice.js to any blogger site e.g this one, http://blog.strictly-software.com/js/cookienotice.js. You can then see the functions and HTML they use to show the DIV. You can also see at the top the ID's and Classes they put on the Cookie Message DIV.

So if you want to check which version of the EU Cookie code is running when both sets of JavaScript exist, you could add a bit of code underneath that checks for the Cookie DIV on display, and add some CSS to target cookieChoicesInfo, which is the ID of the DIV that is shown and you could change the background colour of the DIV to see if it is your DIV or Bloggers DIV that appears.

For example, you could put this under your JavaScript code to change the background colour of the DIV with the following code.

<style>
#cookieChoiceInfo{
	background-color: green !important;
}
</style>

Obviously green is a horrible colour for a background, but it easily stands out. When I did this I saw a Green DIV appear with the message in the correct language displayed, despite my EU options having English as the language for all the wording. 

This is because our code to load the script, and the cookie options into the page runs first, before any Blogger code that appears lower down in the HTML / DOM. When that Blogger code does run, it overwrites the DIV and the wording in the correct European language.

If you right-click on the DIV and choose "inspect" then the Developer Console will appear and you will be able to see that your style to change the background colour is being used on the Cookie message DIV. 

As it's a CSS style block with !important after the style, when the Blogger code overwrites the DIV and wording, the style for the background colour of the DIV is still being determined by our CSS Style block.

So the answer if your EU Cookie Compliance Message disappears is to add your own copy of their code into the site through a widget. 

This shouldn't cause any problems due to any duplicate DIV overwriting your DIV and if it disappears again then at least your version remains.

I just don't understand two things.

1. Why did the Cookie code disappear in the first place?

2. Why did the Blogger code re-appear when I added my own version of Bloggers own EU Cookie message code into the HTML and then why did it dissapear again a couple of hours? 

If anyone can answer these questions then please let me know. A search inside Googles Adsense site does not reveal any useful answers at all.

People just suggest adding query strings to your URL to force it to appear which is no good if your site is linked to from various Search Engines and other sites. Or to just delete all the cookies and refresh the page. 

These are two useless suggestions, and the only thing that seems to work for me is the solution I came up with above. So if you have the same problem try this solution.


By Strictly-Software