
Monday, 26 November 2018

Obtaining an external IP address in memory for use in a Firewall Rule

By Strictly-Software

As I am currently using an ISP that constantly changes my external IP address due to its use of DHCP, I have to regularly update external firewall rules on servers to allow my computer remote access.

This is obviously a right pain, and I have no idea why my ISP changes my IP address so often when my old ISP also used DHCP and kept the same address for months at a time.

Therefore I created this little noddy VBScript to sit on my desktop, obtain my IP address and hold it in the clipboard, ready for me to just open up my firewall and paste it in.

The code uses the MSXML2.ServerXmlHttp object to obtain the IP address from an external website that simply prints it on the page, and I store the ResponseText value in a variable.

I then use the WScript.Shell object to open a new Notepad window and write the IP address out into it.

Then I use SendKeys to select the IP address and copy it into the clipboard before shutting down Notepad.

This means I can just quickly double click my desktop shortcut icon to obtain the IP address ready to paste it straight into a Firewall rule.

Check this script out, and once you can see that hitting CTRL + V pastes the IP address elsewhere, you should remove the "TEST SECTION" and uncomment the part above it that closes down Notepad. This ensures you are not leaving Notepad windows and unused objects around in your computer's memory.

It's not exactly amazing code but it's very helpful at this time and saves a lot of time visiting a website manually to get my external IP address.

You might find this useful yourselves!


Option Explicit

Dim IPAddress, objShell
Dim objHTTP : Set objHTTP = WScript.CreateObject("MSXML2.ServerXmlHttp")

'* Obtain external IP address and store it in a variable
objHTTP.Open "GET", "http://icanhazip.com", False
objHTTP.Send
IPAddress = objHTTP.ResponseText

'* Open Notepad
Set objShell = WScript.CreateObject("WScript.Shell")
objShell.Run "notepad.exe", 9

WScript.Sleep 1000

'* Write out the IP address to a blank notepad file
objShell.SendKeys IPAddress

'* Select the IP address and copy it to the clipboard
objShell.SendKeys "^a"
objShell.SendKeys "^c"

'* uncomment the following section and remove the "TEST SECTION" when you are ready

'* Close notepad with the IP address in clipboard ready to be pasted
'* Open Save Dialog
'* objShell.SendKeys("%{F4}")
'* Navigate to Don't Save
'* objShell.SendKeys("{TAB}")
'* Exit Notepad
'* objShell.SendKeys("{ENTER}")

'* TEST SECTION - Proves that the IP address can be pasted elsewhere
WScript.Sleep 1000

'* Proof that the IP address is in the clipboard and can be pasted out
objShell.SendKeys "Test Paste Works"
objShell.SendKeys "{ENTER}"
objShell.SendKeys "^{V}"
objShell.SendKeys "{ENTER}"
objShell.SendKeys "Try a manual CTRL and V to check the IP address is still in memory"
objShell.SendKeys "{ENTER}"

'* Kill objects - DO NOT REMOVE!!
Set objShell = Nothing
Set objHTTP  = Nothing

By Strictly-Software

© 2018 Strictly-Software

Wednesday, 5 October 2016

A Karmic guide for Scraping without being caught

Quick Regular Expression to clean off any tracking codes on URLs

I have to deal with scrapers all day long in my day job and I ban them in a multitude of ways: firewalls, .htaccess rules, my own logger system that checks the duration between page loads and behaviour, and many other techniques.

However I also have to scrape HTML content myself sometimes for various reasons, such as to find a piece of content on another website related to somebody linked to my own. So I know both sides: the methods used to scrape sites and the methods used to detect and stop scrapers.

This is a guide to the various methods scrapers use to avoid being caught and having their IP address added to a blacklist within minutes of starting. Knowing the methods people use to scrape sites will help you when you have to defend your own, so it's good to know both attack and defence.

Sometimes it is just too easy to spot a script kiddy who has just discovered CURL and thinks it's a good idea to test it out on your site by crawling every single page and link available.

Usually this is because they have downloaded a script from the net, sometimes a very old one, and not bothered to change any of the parameters. Therefore when you see an agent hammering you in your log file with a user-agent of just "CURL", you can block it and know you will be blocking many other script kiddies as well.

I believe that when you are scraping HTML content from a site it is always wise to follow some golden rules based on Karma. It is not nice to have your own site hacked or taken down due to a BOT gone wild, therefore you shouldn't wish this on other people either.

Behave when you are doing your own scraping and hopefully you won't find your own sites content appearing on a Chinese rip off under a different URL anytime soon.


1. Don't overload the server you are scraping.

Hammering the server only lets the site admin know they are being scraped, as your IP / user-agent will appear in their log files so regularly that you might be mistaken for attempting a DoS attack. You could find yourself added to a block list ASAP if you hammer the site you are scraping.

The best way to get round this is to put a time gap in-between each request you make. If possible, follow the site's Robots.txt file if they have one and use any Crawl-Delay parameter they may have specified. This will make you look much more legitimate as you are obeying their rules.

If they don't have a Crawl-Delay value then randomise a wait time in-between HTTP requests, with at least a few seconds wait as the minimum. If you don't hammer their server and slow it down you won't draw attention to yourself.

Also, if possible, try to always obey the site's Robots.txt file, as doing so keeps you on the right side of Karmic law. There are many tricks people use, such as dynamic Robots.txt files and fake URLs placed within them, to trick scrapers who break the rules by following DISALLOWED locations into honeypots, never-ending link mazes or just instant blocks.

An example of a simple C# Robots.txt parser I wrote many years ago that can easily be edited to obtain the Crawl-Delay parameter can be found here: Parsing the Robots.txt file with C-Sharp.
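
To tie the randomised wait and the Crawl-Delay idea together, here is a minimal PHP sketch (separate from the C# parser linked above) that reads a site's robots.txt, pulls out any Crawl-Delay value and then sleeps for at least that long, plus a small random pause, between requests. The site URL, page list and fallback delay are placeholder assumptions.

<?php
// Minimal sketch: read robots.txt and respect any Crawl-Delay value (placeholder site and URLs).
function getCrawlDelay($site, $default = 5)
{
    $robots = @file_get_contents(rtrim($site, '/') . '/robots.txt');
    if ($robots !== false && preg_match('/^\s*Crawl-delay:\s*(\d+)/mi', $robots, $match)) {
        return (int)$match[1];
    }
    return $default; // no robots.txt or no Crawl-Delay, so fall back to a polite default
}

$site  = 'http://www.somesite.com';
$pages = array($site . '/page1', $site . '/page2', $site . '/page3');
$delay = getCrawlDelay($site);

foreach ($pages as $url) {
    $html = @file_get_contents($url);
    // ... parse $html here ...
    sleep($delay + rand(1, 5)); // randomised gap so the requests don't arrive like clockwork
}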


2. Change your user-agent in-between calls. 

Many offices share the same IP across their network due to the outbound gateway server they use, and many ISPs reuse the same IP address for multiple home users via DHCP. Therefore, until IPv6 is 100% rolled out, there is no easy way to guarantee that banning a user by their IP address alone will get your target.

Changing your user-agent in-between calls and using a number of random and current user-agents will make this even harder to detect.

Personally I block all access to my sites from a list of BOTS I know are bad, or where it is obvious the person has not edited the default user-agent (CURL, Snoopy, WGet etc), plus IE 5, 5.5 and 6 (all the way up to 10 if you want).

I have found one of the most common user-agents used by scrapers is IE 6. Whether this is because the person using the script has downloaded an old tool with this as the default user-agent and not bothered to change it or whether it is due to the high number of Intranet sites that were built in IE6 (and use VBScript as their client side language) I don't know.

I just know that by banning IE 6 and below you can stop a LOT of traffic. Therefore never use old IE browser UAs, and always change the default UA from CURL to something else such as Chrome's latest user-agent.

Using random numbers, dashes, very short user-agents or defaults is a way to get yourself caught out very quickly.
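
As a rough illustration of using realistic, rotating user-agents, the PHP sketch below picks a random UA from a small list for each cURL request. The user-agent strings and target URL are just examples, so swap in current browser UAs of your own.

<?php
// Minimal sketch: send each request with a randomly chosen, realistic user-agent (example values).
$userAgents = array(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.4',
    'Mozilla/5.0 (X11; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0'
);

function fetchPage($url, array $userAgents)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgents[array_rand($userAgents)]); // different UA each call
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$html = fetchPage('http://www.somesite.com/somepage', $userAgents);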


3. Use proxies if you can. 

There are basically two types of proxy.

The first is a proxy where the owner of the computer knows it is being used as a proxy server, either generously, to allow people in countries such as China or Iran to access outside content, or maliciously, to capture the requests and details passing through it for hacking purposes.

Many legitimate online proxy services such as "Web Proxies" only allow GET requests, float adverts in front of you and prevent you from loading up certain material such as videos, JavaScript loaded content or other media.

A decent proxy is one where you obtain the IP address and port number and then set them up in your browser or BOT to route traffic through. You can find many free lists of proxies and their port numbers online although as they are free you will often find speed is an issue as many people are trying to use them at the same time. A good site to use to obtain proxies by country is http://nntime.com.

Common proxy port numbers are 8000, 8008, 8888, 8080, 3128. When using P2P tools such as uTorrent to download movies it is always good to disguise your traffic as HTTP traffic rather than using the default setting of a random port on each request. It makes it harder but obviously not impossible for snoopers to see you are downloading bit torrents and other content. You can find a list of ports and their common uses here.
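
If you do route requests through a proxy from one of these free lists, a cURL based BOT only needs the IP address and port number. The sketch below is a minimal PHP example; the proxy IP and port are made-up placeholders you would replace with one taken from a list such as the site above.

<?php
// Minimal sketch: route a request through an HTTP proxy (placeholder proxy IP and port).
$ch = curl_init('http://www.somesite.com/somepage');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_PROXY, '123.45.67.89'); // proxy IP from your list
curl_setopt($ch, CURLOPT_PROXYPORT, 8080);       // one of the common proxy ports
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);    // free proxies are slow, so time out early
$html = curl_exec($ch);
if ($html === false) {
    echo 'Proxy failed: ' . curl_error($ch) . "\n"; // move on to the next proxy in your list
}
curl_close($ch);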

The other form of proxy is the BOTNET, or computers where ports have been left open and people have reverse engineered them so that they can use the computer/server as a proxy without the person's knowledge.

I have also found that many people who try hacking or spamming my own sites are using insecure servers themselves. A port scan on these people often reveals that their own server can be used as a proxy. If they are going to hammer me, then sod them I say, as I watch US TV live on their server.


4. Use a rented VPS

If you only need to scrape for a day or two then you can hire a VPS and set it up so that you have a safe, non-blacklisted IP address to crawl from. With services like Amazon AWS and other rent-by-the-minute servers it is easy to move your BOT from server to server if you need to do some heavy duty crawling.

However on the flipside I often find myself banning the AmazonAWS IP range (which you can obtain here) as I know it is so often used by scrapers and social media BOTS (bandwidth wasters).


5. Confuse the server by adding extra headers

There are many headers that can tell a server whether you are coming through a proxy, such as X-FORWARDED-FOR, and there is standard code developers use to work backwards from these headers to obtain what they consider the correct original IP address (rather than just REMOTE_ADDR), which can allow them to locate you through a Geo-IP lookup.

However not so long ago, and many sites still may use this code, it was very easy to trick sites in one country into believing you were from that country by modifying the X-FORWARDED-FOR header and supplying an IP from the country of your choice.

I remember it was very simple to watch Comedy Central and other US TV shows online just by using a Firefox Modify Headers plugin and entering a US IP address in the X-FORWARDED-FOR header.

Due to the code they were using, they obviously assumed that the presence of the header indicated that a proxy had been used and that the original country of origin was the spoofed IP address in this modified header, rather than the value in the REMOTE_ADDR header.

Whilst this code is not so common anymore it can still be a good idea to "confuse" servers by supplying multiple IP addresses in headers that can be modified to make it look like a more legitimate request.

As the actual REMOTE_ADDR value is set by the server from the connection itself, you cannot easily change it. However you can supply a comma-delimited list of IPs from various locations in headers such as X-FORWARDED-FOR, HTTP_X_FORWARDED, HTTP_VIA and the many others that proxies, gateways and different servers use when passing HTTP requests along the way.

Plus you never know, if you are trying to obtain content that is blocked from your country of origin then this old technique may still work. It all depends on the code they use to identify the country of an HTTP requests origin.
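
As a hedged example of what this looks like in practice, the PHP sketch below adds a few of the proxy related headers mentioned above to a cURL request, using made-up IP addresses. Whether it achieves anything depends entirely on how the code on the target site reads those headers.

<?php
// Minimal sketch: add spoofed proxy style headers to a request (example IP addresses only).
$ch = curl_init('http://www.somesite.com/somepage');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'X-Forwarded-For: 12.13.14.15, 66.77.88.99', // comma delimited chain of "proxy" IPs
    'Via: 1.1 proxy.example.com',                // looks like the request passed through a gateway
    'X-Forwarded: for=12.13.14.15'
));
$html = curl_exec($ch);
curl_close($ch);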


6. Follow unconventional redirect methods.

Remember there are many levels of being able to block a scrape so making it look like a real request is the ideal way of getting your content. Some sites will use intermediary pages that have a META Refresh of "0" that then redirect to the real page or use JavaScript to do the redirect such as:

<body onload="window.location.href='http://blah.com'">

or

<script>
function redirect(){
   document.location.href='http://blah.com';
}
setTimeout(redirect,50);
</script> 

Therefore you want a good super scraper tool that can handle this kind of redirect so you don't just return adverts and blank pages. Practice those regular expressions!
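
As a starting point for those regular expressions, here is a rough PHP sketch that looks for a META refresh or a JavaScript location redirect in a fetched page and follows it once. The patterns are deliberately simple, assume absolute URLs and will not catch every variation you find in the wild.

<?php
// Minimal sketch: detect a META refresh or JavaScript redirect in scraped HTML and follow it once.
function findRedirect($html)
{
    // e.g <meta http-equiv="refresh" content="0;url=http://blah.com">
    if (preg_match('/<meta[^>]+http-equiv=["\']?refresh["\']?[^>]+url=([^"\'>\s]+)/i', $html, $m)) {
        return $m[1];
    }
    // e.g window.location.href='http://blah.com' or document.location.href="http://blah.com"
    if (preg_match('/(?:window|document)\.location(?:\.href)?\s*=\s*["\']([^"\']+)["\']/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

$html = file_get_contents('http://www.somesite.com/landingpage');
$redirect = findRedirect($html);
if ($redirect !== null) {
    $html = file_get_contents($redirect); // follow the intermediary page through to the real content
}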


7. Act human.

By only making one GET request to the main page, and not to any of the images, CSS or JavaScript files that the page loads in, you make yourself look like a BOT.

If you look through a log file it is easy to spot crawlers and BOTs because they don't request these extra files, and as a log file is mostly sequential you can easily spot the requests made by one IP or user-agent just by scanning down the file and noticing all the single GET requests from that IP to different URLs.

If you really want to mask yourself as human then use a regular expression or HTML parser to get all the related content as well.

Look for any URLs within SRC and HREF attributes, as well as URLs contained within JavaScript that are loaded up with AJAX. It may slow your own code down and use up more of your own bandwidth, as well as the server you are scraping, but it will disguise you much better and make it harder for anyone looking at a log file to spot you as a BOT with a simple search.
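
A rough PHP sketch of that idea is shown below: after fetching the page it pulls out the SRC values of images and scripts plus the HREF values of stylesheets, and requests each one just as a browser would. The URL and patterns are illustrative and assume absolute links; relative paths would need resolving against the page URL first.

<?php
// Minimal sketch: fetch a page then request its assets so the log file looks like a real browser visit.
$pageUrl = 'http://www.somesite.com/somepage';
$html = file_get_contents($pageUrl);

// Grab src attributes (images, scripts) and stylesheet href attributes (absolute URLs assumed)
preg_match_all('/\ssrc=["\'](http[^"\']+)["\']/i', $html, $srcMatches);
preg_match_all('/<link[^>]+href=["\'](http[^"\']+\.css[^"\']*)["\']/i', $html, $cssMatches);

$assets = array_unique(array_merge($srcMatches[1], $cssMatches[1]));

foreach ($assets as $asset) {
    @file_get_contents($asset);   // we don't need the content, just the request appearing in their log
    usleep(rand(100000, 500000)); // small pauses like a browser loading resources
}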


8. Remove tracking codes from your URL's.

This is so that when the SEO "guru" looks at their stats they don't confuse their tiny little minds by not being able to work out why it says 10 referrals from Twitter but only 8 had JavaScript enabled or had the tracking code they were using for a feed. This makes it look like a direct, natural request to the page rather than a redirect from an RSS or XML feed.

Here is an example of a regular expression that removes the query-string, including the question mark, from a URL.

The example uses PHP but the expression itself can be used in any language.


$url = "http://www.somesite.com/myrewrittenpage?utm_source=rss&utm_medium=rss&utm_campaign=mycampaign";

$url = preg_replace("@(^.+)(\?.+$)@","$1",$url);
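
If you echo out the variable after running the replacement you are left with just the clean page address:

echo $url; // outputs http://www.somesite.com/myrewrittenpage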


There are many more rules to scraping without being caught but the main aspect to remember is Karma. 

What goes around comes around, therefore if you scrape a site heavily and hammer it so bad that it costs the user so much bandwidth and money that they cannot afford it, do not be surprised if someone comes and does the same to you at some point!

Friday, 27 March 2015


Changing your DNS Servers

Are You Having Problems Accessing The Internet Due To DNS Issues? Then Read This!

By Strictly-Software

Sometimes I get problems at home when my laptop cannot connect to the Internet due to DNS issues.

This happens quite regularly and my ISP is Virgin Media.

If you really don't know what your ISP is then go to this site whatismyipaddress.com.

Also you may get issues when your ISP blocks certain sites that you want to visit, e.g. The Pirate Bay, other torrent sites, information clearing houses or anti-establishment websites.

Therefore if you are getting DNS issues then you should have a look at some of the various solutions below to see if any are suitable for your needs.

1. Default Set-Up

Your system will usually be set up by your ISP to use DHCP (Dynamic Host Configuration Protocol), which, according to the article about DHCP, means your computer is automatically assigned a different IP address every time you access the Internet.

However my settings have been on DHCP for years and I have only noticed my IP address change a couple of times, which can be very annoying, especially if you have firewall exceptions and programs on other servers like Fail2Ban or DenyHosts configured to prevent you being banned (SSH, TCP/IP).

The reason your IP address might change is down to DHCP which is used to issue unique IP addresses to computers accessing the ISP as well as automatically configuring other network information on your computer.

As we haven't moved to IPv6 yet and we don't have enough IPv4 addresses for every single device accessing the internet an ISP will share IP addresses as well as using proxy servers or gateways to allow your device, whether it be a phone, tablet, TV, laptop, PC or car, to access the internet.

However be careful before changing these settings, as you could cause problems, especially if you are on a shared network.

In most homes and small businesses, the router acts as the DHCP server.

However on large networks, a single computer might act as the DHCP server and changing your DNS settings could cause problems especially if you use internal networks to access websites.

For example at my company I have test sites on internal IP addresses and changing the DNS addresses prevents me from accessing those sites.

You can read more about DHCP here.

However, whilst the article says that in most cases a new IP address is assigned to your device every time you connect to the network, I find my computer's IP address stays the same for quite a while and only changes every four or so months.

If you don't know your IP address then you can easily check it from your own PC without having to open a webpage, or you can simply run a search for "What is my IP".

If you would like a little script that you can run from the command line or at the click of a button to get a popup with your external IP address then you can read this article on obtaining your IP address from your computer without using a browser.

2. Checking your DNS address from the Command Prompt

You can check your current DNS settings from the command prompt with ipconfig /all which will show you all your network connections and DNS details.

I have shown you my own home computer settings and you can see the DNS settings I am using at the bottom are Google's.


C:\Users\rreid>ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : stard0026w7
   Primary Dns Suffix . . . . . . . : metal.strictly-software.com
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No
   DNS Suffix Search List. . . . . . : metal.strictly-software.com

Ethernet adapter Local Area Connection:

   Connection-specific DNS Suffix . : metal.strictly-software.com
   Description . . . . . . . . . . . : Broadcom NetLink (TM) Gigabit Ethernet
   Physical Address. . . . . . . . . : D4-BE-D9-95-40-DB
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::fc53:2287:bb47:b8a9%11(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.0.7.79(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Lease Obtained. . . . . . . . . . : 23 February 2015 17:06:21
   Lease Expires . . . . . . . . . . : 19 March 2015 15:49:01
   Default Gateway . . . . . . . . . : fe80::20c:29ff:fe77:3876%11
                                                   10.0.7.249
   DHCP Server . . . . . . . . . . . : 10.0.7.244
   DHCPv6 IAID . . . . . . . . . . . : 248823513
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-17-02-22-03-D4-BE-D9-95-40-DB

   DNS Servers . . . . . . . . . . . : 8.8.8.8
                                                  8.8.4.4
   Primary WINS Server . . . . . . . : 10.0.7.1
   NetBIOS over Tcpip. . . . . . . . : Enabled

3. Obtaining Different DNS Settings


So one of the solutions to this problem is to change your DNS server IP addresses to public DNS servers like Google's free addresses 8.8.8.8 and 8.8.4.4, or you can use many others including opendns.com.

OpenDNS provides security at the DNS level that protects your computer from malware and other threats, including probes and hacks, which could save you without installing any further software.

If you are a business you have to pay, but for personal use you don't, and you can just use the addresses 208.67.222.222 and 208.67.220.220.

Read more about opendns here.

You can find a big list of other DNS server IP's to use here.

Some of them include:

Provider              Primary DNS Server    Secondary DNS Server
Level3                209.244.0.3           209.244.0.4
DNS.WATCH             84.200.69.80          84.200.70.40
Comodo Secure DNS     8.26.56.26            8.20.247.20
OpenDNS Home          208.67.222.222        208.67.220.220
DNS Advantage         156.154.70.1          156.154.71.1
Norton ConnectSafe    199.85.126.10         199.85.127.10
GreenTeamDNS          81.218.119.11         209.88.198.133
SafeDNS               195.46.39.39          195.46.39.40
OpenNIC               107.150.40.234        50.116.23.211
SmartViper            208.76.50.50          208.76.51.51
Dyn                   216.146.35.35         216.146.36.36
FreeDNS               37.235.1.174          37.235.1.177


4. Changing DNS Settings


Every computer will be slightly different and you can search Google to find out how to do it on your computer, tablet, phone or even your smart TV.

This example is for Windows 7 but it is not so different for other Windows machines:
  • Go to Control Panel and select Network and Sharing Center.
  • Select Change Adaptor Settings.
  • Select the network connection you want to change, e.g. LAN or WIFI.
  • A panel will open showing you that connection's details.
  • Scroll down to Internet Protocol Version 4 (TCP/IPv4) (leave IPv6 for later!).
  • Click on that row and then Properties.
  • In the bottom half are the DNS settings. Your ISP will most likely have ticked "Obtain DNS Server address automatically", so de-select that.
  • Select "Use the following DNS server addresses" and then enter your preferred IP address, e.g. 8.8.8.8, and then in the alternate DNS server put 8.8.4.4.
  • Hit OK and the settings should be saved. Test it by trying to get to a web page; if the page loads then your new DNS settings are working.
  • If you are not sure the DNS settings are being used then go to point 2 and check from the command prompt that your DNS settings have been changed.


A good idea might be to set your DNS up so that it uses Google's primary IP 8.8.8.8 for the preferred server and then the OpenDNS IP 208.67.222.222 for the alternate DNS server.

E.g. mix and match, in case one of the servers goes down.

5. Other Quick Options To Try

Other options you can try to sort your WIFI out without getting "techie" include just turning off WIFI for 30 seconds then back on.

Or you could turn your machine's Airplane mode on and wait a minute before turning it back off.

You could also turn your router off by unplugging it or taking the power cable out of the back of it; wait a good 5 minutes then turn it back on.

Or use the computer's network diagnostic tool to test why your connection is down; if it reports DNS problems then this article might be for you.



6. Conclusion

Remember all modern devices like phones, tablets and even Smart TV's can connect to the internet and they all have options to change their DNS settings if you are having problems.

Also you must be careful if you are on a work computer, as changing your DNS might not be a good idea: you could have internal routing going on to locally hosted sites on the company's internal network.

Therefore changing them might prevent you from viewing some websites (especially dev/demo sites).

You can always test by checking an internal site to see if you get a workable screen or just a "cannot connect" message.


© 2015 Strictly-Software

Wednesday, 23 October 2013

4 simple rules robots won't follow

Job Rapists and Content Scrapers - how to spot and stop them!

I work with many sites, from small blogs to large sites that receive millions of page loads a day. I have to spend a lot of my time checking my traffic log and logger database to investigate hack attempts, heavy hitting bots and content scrapers that take content without asking (on my recruitment sites and jobboards I call this Job Raping and the BOT that does it a Job Rapist).

I banned a large number of these "job rapists" the other week and then had to deal with a number of customers ringing up to ask why we had blocked them. The way I see it (and that really is the only way, as it's my responsibility to keep the system free of viruses and hacks), if you are a bot and want to crawl my site you have to follow the steps below.

These steps are not uncommon and many sites implement them to reduce bandwidth wasted on bad BOTS as well as protect their sites from spammers and hackers.

4 Rules For BOTS to follow


1. Look at the Robots.txt file and follow the rules

If you don't even bother looking at this file (and I know because I log those that do) then you have broken the most basic rule that all BOTS should follow.

If you can't follow even the most basic rule then you will be given a ban or 403 ASAP.

To see how easy it is to make a BOT that can read and parse a Robots.txt file please read this article (and this is some very basic code I knocked up in an hour or so):

How to write code to parse a Robots.txt file (including the sitemap directive).


2. Identify yourself correctly

Whilst it may not be set in stone, there is a "standard" for BOTS to identify themselves correctly in their user-agents and all proper SERPS and Crawlers will supply a correct user-agent.

If you look at some common ones such as Google or BING or a Twitter BOT we can see a common theme.

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; TweetedTimes Bot/1.0; +http://tweetedtimes.com)

They all:
- Provide information on the browser compatibility e.g Mozilla/5.0.
- Provide their name e.g Googlebot, bingbot, TweetedTimes.
- Provide their version e.g 2.1, 2.0, 1.0.
- Provide a URL where we can find out information about the BOT and what it does e.g http://www.google.com/bot.html, http://www.bing.com/bingbot.htm and http://tweetedtimes.com

On the systems I control, and on many others that use common intrusion detection systems at the firewall and system level (even WordPress plugins), having a blank user-agent or a short one that doesn't contain a link or email address is enough to get a 403 or a ban.

At the very least a BOT should provide some way to let the site owner find out who owns the BOT and what the BOT does.

Having a user-agent of "C4BOT" or "Oodlebot" is just not good enough.

If you are a new crawler identify yourself so that I can search for your URL and see whether I should ban you or not. If you don't identify yourself I will ban you!


3. Set up a Reverse DNS Entry

I am now using the "standard" way of validating crawlers against the IP address they crawl from.

This involves doing a reverse DNS lookup with the IP used by the bot.

If you haven't got this setup then tough luck. If you have then I will do a forward DNS to make sure the IP is registered with the host name.

I think most big crawlers are starting to come on board with this way of doing things now. Plus it is a great way to correctly identify that GoogleBot really is GoogleBot, especially when the use of user-agent switcher tools is so common nowadays.

I also have a lookup table of IP/user-agents for the big crawlers I allow. However if GoogleBot or BING start using new IP addresses that I don't know about the only way I can correctly identify them (especially after experiencing GoogleBOT hacking my site) is by doing this 2 step DNS verification routine.
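
For anyone wanting to do the same check in PHP rather than C#, here is a minimal sketch of the 2 step verification using the built-in gethostbyaddr and gethostbyname functions. The example IP and the list of accepted domains are illustrative; check the host names each search engine publishes for its own crawlers.

<?php
// Minimal sketch: 2 step DNS check - reverse lookup the IP, then forward lookup the host name.
function isRealCrawler($ip, array $allowedDomains)
{
    $host = gethostbyaddr($ip); // reverse DNS e.g 66.249.66.1 -> crawl-66-249-66-1.googlebot.com
    if ($host === false || $host === $ip) {
        return false;           // no reverse DNS entry set up, so fail the check
    }

    $matchesDomain = false;
    foreach ($allowedDomains as $domain) {
        if (substr($host, -strlen($domain)) === $domain) {
            $matchesDomain = true; // host name ends with googlebot.com, search.msn.com etc
            break;
        }
    }

    // forward DNS on the host name must come back to the original IP
    return $matchesDomain && gethostbyname($host) === $ip;
}

var_dump(isRealCrawler('66.249.66.1', array('.googlebot.com', '.google.com', '.search.msn.com')));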


4. Job Raping / Scraping is not allowed under any circumstances. 

If you are crawling my system then you must have permission from each site owner as well as me to do this.

I have had tiny weeny itsy bitsy jobboards with only 30 jobs receive up to 400,000 page loads a day because of scrapers, email harvesters and bad bots.

This is bandwidth you should not be taking up and if you are a proper job aggregator like Indeed, JobsUK, GoogleBase then you should accept XML feeds of the jobs from the sites who want their jobs to appear on your site.

Having permission from the clients (recruiters/employers) on the site is not good enough, as they do not own the content; the site owner does. From what I have seen, the only job aggregators who crawl rather than accept feeds are those who can't, for whatever reason, get the jobs the correct way.

I have put automated traffic analysis reports into my systems that let me know at regular intervals which bots are visiting me, which visitors are heavy hitting and which are spoofing, hacking, raping and other forms of content pillaging.

It really is like an arms race from the cold war and I am banning bots every day of the week for breaking these 4 simple to follow rules.

If you are a legitimate bot then it's not too hard to come up with a user-agent that identifies you correctly, set up a reverse DNS entry, follow the robots.txt rules and not visit my site every day crawling every single page!

Sunday, 3 March 2013

Stop BOTS and Scrapers from bringing your site down

Blocking Traffic using WebMin on LINUX at the Firewall

If you have read my survival guides on Wordpress you will see that you have to do a lot of work just to get a stable and fast site due to all the code that is included.

The Wordpress Survival Guide

  1. Wordpress Basics - Tools of the trade, useful commands, handling emergencies, banning bad traffic.
  2. Wordpress Performance - Caching, plugins, bottlenecks, Indexing, turning off features
  3. Wordpress Security - plugins, htaccess rules, denyhosts.


For instance not only do you have to handle badly written plugins that could contain security holes and slow the performance of your site but the general WordPress codebase is in my opinion a very badly written piece of code.

However they are slowly learning, and I remember that once (and only a few versions back) there were over 200 queries being run on the home page, most of them returning single rows.

For example if you used a plugin like Debug Queries you would see lots of SELECT statements on your homepage that returned a single row for each article shown for every post as well as the META data, categories and tags associated with the post.

So instead of one query that returned the whole data set for the page (post data, category, tag and meta data), the page would be filled with lots of single queries like this.

SELECT wp_posts.* FROM wp_posts WHERE ID IN (36800)

However they have improved their code, and a recent check of one of my sites showed that although they are still using separate queries for post, category/tag and meta data, they are at least getting all of the records in one go e.g

SELECT wp_posts.* FROM wp_posts WHERE ID IN (36800,36799,36798,36797,36796)

So the total number of queries has dropped which aids performance. However in my opinion they could write one query for the whole page that returned all the data they needed and hopefully in a future edition they will.

However one of the things that will kill a site like Wordpress is the amount of BOTS that hit you all day long. These could be good BOTS like GoogleBOT and BingBOT which crawl your site to find out where it should appear in their own search engine or they could be social media BOTS that look for any link Twitter shows or scrapers trying to steal your data.

One thing you can try to stop legitimate BOTS like Google and BING from hammering your site is to set up a Webmaster Tools account in Google and then change the Crawl Rate to a much slower one.

You can also do the same with BING and their webmaster tools account. However BING also apparently respects the Robots.txt Crawl-delay directive e.g


Crawl-delay: 3


Which supposedly tells BOTS that respect the Robots.txt commands that they should wait 3 seconds between each request. However as far as I know only BING supports this at the moment and it would be nice if more SERP BOTS did in future.

If you want a basic C# Robots.txt parser that will tell you whether your agent can crawl a page on a site and extract any sitemap command, then check out http://www.strictly-software.com/robotstxt. If you wanted to extend it to add in the Crawl-Delay command it wouldn't be hard (line 175 in Robot.cs) to do, so that you could extract and respect it when crawling yourself.

Obviously you want all the SERP BOTS like GoogleBot and Bingbot to search you, but there are so many social media BOTS and spammers out there nowadays that they can literally hammer your site into the ground no matter how many caching plugins and .htaccess rules you put in to return 403 codes.

The best way to deal with traffic you don't want to hit your site is as high up the chain as possible. 

Just leaving Wordpress to deal with it means the overhead of PHP code running, include files being loaded, regular expressions to test for harmful parameters being run, and so on.

Moving it up to the .htaccess level is better, but it still means your webserver is having to process all the .htaccess rules in your file to decide whether or not to let the traffic through.

Therefore if you can move the worst offenders up to your Firewall then it will save any code below that level from running and the TCP traffic is stopped before any regular expressions have to be run elsewhere.

Therefore what I tend to do is follow this process:


  • Use the Wordpress plugin "Limit Login Attempts" to log people trying to log in (without permission) to my WordPress website. This will log all the IP addresses that have attempted and failed as well as those that have been blocked. This is a good starting list for your DENY HOSTS IP ban table.
  • Check the same IP's as well as using the command: tail -n 10000 access_log|cut -f 1 -d ' '|sort|uniq -c|sort -nr|more  to see which IP addresses are visiting my site the most each day.
  • I then check the log files either in WebMin or in an SSH tool like PUTTY to see how many times they have been trying to visit my site. If I see lots of HEAD or POST/GET requests within a few seconds from the same IP I will then investigate them further. I will do an nslookup and a whois and see how many times the IP address has been visiting the site.
  • If they look suspicious, e.g the same IP with multiple user-agents or lots of requests within a short time period, I will consider banning them. Anyone who is using IE 6 as a user-agent is a good suspect (who uses IE 6 anymore apart from scrapers and hackers!)
  • I will then add them to my .htaccess file and return a [F] (403 status code) to all their requests.
  • If they keep hammering my site I will then move them from the DENY list in my .htaccess file and add them to my firewall and Deny Hosts table.
  • The aim is to move the most troublesome IP's and BOTS up the chain so they cause the least damage to your site.
  • Using PHP to block access is not good as it consumes memory and CPU; the .htaccess file is better but still requires APACHE to run the regular expressions on every DENY or [F] command. Therefore the most troublesome users should be moved up to the firewall level to cause the least server usage on your system.
  • Regularly shut down your APACHE server and use the REPAIR and OPTIMIZE options to de-fragment your table indexes and ensure the tables are performing as well as possible. I have many articles on this site about other free tools which can help you increase your WordPress site's performance.

In More Details

You should regularly check the access log files for the IPs hitting your site the most, check them out with a reverse DNS tool to see where they come from, and if they are of no benefit to you (e.g not a SERP or social media agent you want hitting your site) then add them to your .htaccess file under the DENY commands e.g

order allow,deny
deny from 208.115.224.0/24
deny from 37.9.53.71

Then if I find they are still hammering my site after a week or month of getting 403 responses and ignoring them, I add them to the firewall in WebMin.


Blocking Traffic at the Firewall level

If you use LINUX and have WebMin installed it is pretty easy to do.

Just go to the WebMin panel and under the "Networking" menu is an item called "Linux Firewall". Select that and a panel will open up with all the current IP addresses, ports and packets that are allowed or denied access to your server.

Choose the "Add Rule" command or if you have an existing Deny command you have setup then it's quicker to just clone it and change the IP address. However if you don't have any setup yet then you just need to do the following.

In the window that opens up just follow these steps to block an IP address from accessing your server.

In the Chain and Action Details Panel at the top:


Add a Rule Comment such as "Block 71.32.122.222 Some Horrible BOT"
In the Action to take option select "Drop"
In the Reject with ICMP Type select "Default"

In Condition Details Panel:

In source address of network select "Equals" and then add the IP address you want to ban e.g 71.32.122.222
In network protocol select "Equals" and then "TCP"

Hit "Save"

The rule should now be saved and your firewall should now ban all TCP traffic from that IP address by dropping any packets it receives as soon as it gets them.

Watch as your performance improves and the number of 403 status codes in your access files drops - until the next horrible social media BOT comes on the scene and tries scraping all your data.

IMPORTANT NOTE

WebMin isn't very clear on this and I found out the hard way by noticing that IP addresses I had supposedly blocked were still appearing in my access log.

You need to make sure all your DENY RULES are above the default ALLOW rules in the table WebMin will show you.

Therefore your rules to block bad bots, and IP addresses that are hammering away at your server - which you can check in PUTTY with a command like this:
tail -n 10000 access_log|cut -f 1 -d ' '|sort|uniq -c|sort -nr|more 

Should be put above all your other commands e.g:


Drop If protocol is TCP and source is 91.207.8.110
Drop If protocol is TCP and source is 95.122.101.52
Accept If input interface is not eth0
Accept If protocol is TCP and TCP flags ACK (of ACK) are set
Accept If state of connection is ESTABLISHED
Accept If state of connection is RELATED
Accept If protocol is ICMP and ICMP type is echo-reply
Accept If protocol is ICMP and ICMP type is destination-unreachable
Accept If protocol is ICMP and ICMP type is source-quench
Accept If protocol is ICMP and ICMP type is parameter-problem
Accept If protocol is ICMP and ICMP type is time-exceeded
Accept If protocol is TCP and destination port is auth
Accept If protocol is TCP and destination port is 443
Accept If protocol is TCP and destination ports are 25,587
Accept If protocol is ICMP and ICMP type is echo-request
Accept If protocol is TCP and destination port is 80
Accept If protocol is TCP and destination port is 22
Accept If protocol is TCP and destination ports are 143,220,993,21,20
Accept If protocol is TCP and destination port is 10000


If you have added loads at the bottom then you might need to copy out the IPTables list to a text editor, change the order by putting all the DENY rules at the top then re-saving the whole IPTable list to your server before a re-start of APACHE.

Or you can use the arrows by the side of each rule to move the rule up or down in the table - which is a very laborious task if you have lots of rules.

So if you find yourself still being hammered by IP addresses you thought you had blocked then check the order of your commands in your firewall and make sure they are at the top, NOT the bottom, of your list of IP addresses.

Monday, 28 January 2013

Accessing your computer's external IP address from your computer without using a browser

Access your computer's external IP address from your desktop without using a browser

Following on from yesterday's blog post about what can happen if you get given a new IP address and don't realise it, you might want a quick way to check your external IP address from your computer without having to open an Internet browser.

There are many "What is my IP address" sites about that show you your IP address plus other request headers such as the user-agent but you might want a quick way of seeing your external IP without having to open a browser first.

If you are using a LINUX computer it's pretty easy to use CURL or WGET to write a small script to scrape an IP checker page and return the HTML contents.

For instance in a command prompt this will return you the IP address using CURL by scraping the contents of icanhazip.com.

This site is good because it outputs the IP address of the computer accessing the URL in plain text, so you don't have to do any reformatting at all.

curl icanhazip.com

However if you are on a Windows computer there is no simple way of getting your external IP address (the IP address your computer is seen as by the outside world) without either installing Windows versions of CURL or WGET first, or writing a script to do it for you using Microsoft objects.

Of course it would be nice if you could just use ipconfig from the command prompt to show your external address as well as your internal network details but unfortunately you can't do that.

As you're connected to the Internet through your router your PC isn't directly connected to the Internet.

Therefore there is no easy way you can get the IP address your ISP has assigned to your computer without seeing it from another computer on the Internet.

Therefore you can either use one of the many IP checker tools like whatismyip.com or icanhazip.com to get the details. Or you can even just click this link to search for "what is my IP address" and get Google to show you your IP address above the results.

However if you do want to do it without a browser you can write a simple VBS script to do it for you and then you can access your external IP from your desktop with a simple double click of the mouse.

How to make a VBS Script to get your computers external IP address.
  1. Open notepad.
  2. Copy and paste the following VBS code into a new notepad window. 
  3. Save the file as "whatismyip.vbs" on to your desktop.
  4. To view your IP address just double click the file icon and a Windows message box will open and show you the IP address.
The script is very simple and all it does is scrape the plain text contents of the webpage at icanhazip.com and output it in a pop-up - simples!

Option Explicit
Dim objHTTP : Set objHTTP = WScript.CreateObject("MSXML2.ServerXmlHttp")
objHTTP.Open "GET", "http://icanhazip.com", False
objHTTP.Send
Wscript.Echo objHTTP.ResponseText
Set objHTTP  = Nothing

If you really want to use this from the command line you can do it by following these steps.
  1. Open a command prompt.
  2. Type "cscript " leaving a space afterwards (and without the quotes!).
  3. Drag the whatismyip.vbs file to the command prompt so that you have a space between cscript and the path of the file e.g C:\Documents and Settings\myname>cscript "C:\Documents and Settings\myname\Desktop\whatismyip.vbs"
  4. Hit Enter.
  5. The IP address will appear after some guff about the Windows Script Host Version.


The output should look something like this:

C:\Documents and Settings\myname>cscript "C:\Documents and Settings\myname\Desktop\whatismyip.vbs"
Microsoft (R) Windows Script Host Version 5.7
Copyright (C) Microsoft Corporation. All rights reserved.
89.42.212.239

So there you go, a LINUX and WINDOWS way of accessing your external IP address from your desktop without having to open Chrome or FireFox.

Sunday, 27 January 2013

Problem with SFTP after new router installation

Problem with SFTP / SSH after new router installation

This may seem like an obvious one but it can catch you out if you are not aware of the implications of having a new IP address assigned to your house's broadband. You may have moved your laptop to a new house or be using a new wifi system to connect to the Internet.

More commonly you may have been given a new or upgraded router by your ISP provider.

Even if you are told the IP address has not changed by BT, Virgin, Sky, Verizon or whoever is giving you the new router you should do a check on any of the many IP checking pages out there on the web.

E.G this script shows you your current IP and ISP details.


Why is this important?

Well if you have your own server being hosted by a company e.g a cloud server somewhere and you have installed DENY HOSTS to block hacking attacks then you might find that you cannot SFTP (Secure FTP) into your server anymore or that using Putty and SSH to access your remote server suddenly stops working for no apparent reason. Obviously you want to access your server so the problem needs fixing.

Symptoms of an IP change causing problems include:
  • Your server reporting error messages such as "server unexpectedly closed the connection."
  • When you change the file transfer settings from SFTP to plain FTP you can access the server, but then experience timeout errors or nothing happens when the list directory command is run.
  • Not being able to use Putty or another SSH tool to connect to your server.
  • Not having changed any settings on your computer but not being able to connect to your server anymore.

Solution to IP change:
  • Check and write down your new IP address.
  • Log into your server through WebMin or a web based system or from another computer that hasn't had an IP change.
  • Check your DENY HOSTS list to see if your IP address is listed and if so delete the record.
  • Add your new IP address in the ALLOW HOSTS list.
  • Re-start your server. 

If you don't know how to do this read the 3rd part of my Wordpress Survival Guide about security.

Monday, 27 February 2012

How to delete a Virtual Server using Virtualmin

Deleting a Virtual Server from your server  - Replacing the default site domain on a server with an IP address


The other day I talked about how you can set up a website for testing on your LINUX server using host headers so that you can test it before purchasing a domain name.

Now I did this on one PC and gave the virtual server a particular name in the hosts file so I could access it, but because I was at work and behind a proxy it wouldn't allow me to access the website by the desired hostname. Therefore to get to the site I just put in my server's IP address e.g http://174.34.34.114.

However on my home laptop I changed my mind about the name I wanted to use for the new site and instead of editing the existing virtual server I created a new one with a new domain name (again one I hadn't purchased yet).

However, because I wasn't behind an office proxy, my local hosts file allowed me to access the website by the domain name I had given it in my file with an entry like so e.g:

174.34.34.114 www.robsnewblog.com

However now I am back in the office and want to access the new site, but due to the proxy issue, when I enter the IP address I am met with the old site as that is the default site for the server. Changing the hosts file has no effect due to the issue with the proxy. Therefore I cannot access my new domain www.robsnewblog.com either by the IP address OR the hosts file entry.

Therefore the solution is to delete the old Virtual Server from my host in Virtualmin and reset the new site as the default server so that I can access it by the hosts IP address alone.

To do this you need to do the following:

1. Login to your host e.g https://myhostname.blah.com:10000/
2. Choose the virtual server you want to delete from the drop down box on the left e.g robstestdomain.
3. Extend the "Disable and Delete" menu option on the left menu and choose "Delete Virtual Server"
4. The main section will list all the services and files that will be deleted. Confirm that you want to delete the server.
5. You will be given an output like the following:


Delete Server
In domain robstestdomain.com
Deleting mail aliases ..
.. done
Deleting AWstats configuration file and Cron job ..
.. done

Removing password protection for AWstats ..
.. done

Deleting MySQL database robstestdomain ..
.. done

Deleting MySQL login ..
.. done

Disabling log file rotation ..
.. done

Deleting scheduled Webalizer reporting ..
.. done

Deleting virtual website ..
.. done

Deleting Apache log files ..
.. done

Removing from email domains list ..
.. done

Deleting home directory ..
.. done

Deleting administration user ..
.. done

Deleting administration group ..
.. done

Deleting server details for robstestdomain.com ..
.. done

Applying web server configuration ..
.. done

6. The virtual server will now have been deleted. A quick test by going to the domain or IP will show you the site cannot be accessed anymore and if you try to FTP to the site you will get an error trying to connect.

7. As you haven't bought a domain for the new site and if like me you are behind a proxy that is preventing you from using your hosts file to access the new domain then you will want to access the site from the IP address alone.

If this is the case and your hosts file is not working (after editing, and re-starting your browser) then you need to ensure that the Virtual Server you want to use is set as the default by going into VirtualMin, editing the Server, choosing "Server Configuration" from the menu and then "Website Options" and then making sure the "Default website for IP address?" radio button is ticked if it isn't already. This will ensure all requests to the IP address are forwarded to the Virtual Server in question.

8. Restart Apache - From WebMin (which you should have installed when you set up the server if you prefer graphical interfaces to the SSH console) you can do this by going to System > Bootup and Shutdown. Tick the Apache2 box and then at the bottom hit the "Restart" button. Either that or use the command line and this command: /etc/init.d/apache2 restart.

9. If you have any problems go to System Settings and choose the "Re-Check Configuration" option to see if any errors appear, or run apache2ctl configtest from the command line to ensure all configuration files are correct.


10. Go to the IP address of your site and hopefully it will now load up. Obviously if you are using an IP address to access the site but the website is set up with a non-existent domain name, then paths to scripts and CSS files will not load correctly as they will be pointing to places that don't exist.


To fix this you will need to edit your database, either in VirtualMin or through a MySQL management tool like PHPMyAdmin or Navicat, and go through the wp_options table editing every row that currently points to your domain name so that it points to your IP address instead. The main ones will be:

  • siteurl
  • home
  • dashboard_widget_options



Obviously if you have custom plugins that have added their own data into the database you might need to change those as well. A quick SQL query will help you hunt any rows down:


SELECT  *
FROM    wp_options
WHERE  option_value = 'http://www.robstestdomain.com';


Where you would obviously replace the URL with the domain you want to replace with your IP address.
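
If you would rather script the change than edit each row by hand, a rough PHP sketch using mysqli is shown below. The database credentials, table prefix and URLs are placeholders. Be aware that options stored as serialized PHP data (such as dashboard_widget_options) can be corrupted by a plain string replace because the stored string lengths no longer match, so check those rows by hand afterwards.

<?php
// Minimal sketch: swap the old domain for the IP address in wp_options (placeholder credentials and URLs).
$db  = new mysqli('localhost', 'wp_user', 'wp_password', 'wordpress_db');
$old = 'http://www.robstestdomain.com';
$new = 'http://174.34.34.114';

$stmt = $db->prepare("UPDATE wp_options SET option_value = REPLACE(option_value, ?, ?)");
$stmt->bind_param('ss', $old, $new);
$stmt->execute();

echo $stmt->affected_rows . " option rows updated\n"; // siteurl, home etc should now point at the IP
$stmt->close();
$db->close();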


This should now let you see the website on your PC even if a proxy is blocking your hosts file entries.