Exim and its queue: quick HowTo

Today we are posting a quick HowTo for those admins running exim on their servers. It will be particularly useful for those whose servers use cPanel as their hosting platform. We will show how to display the queue, how to count the messages in it, how to remove single messages or empty the queue entirely, and how to identify nasties in the queue.

  1. What’s in the queue?
    exim -bp
  2. How many messages are in the queue?
    exim -bpc
  3. How do I get rid of a specific message?
    exim -Mrm {message-id}
  4. How do I empty the entire queue?
    exim -bp | awk '/^ *[0-9]+[mhd]/{print "exim -Mrm " $3}' | bash

    or, using exiqgrep (see the more selective examples after this list):

    exim -bp | exiqgrep -i | xargs exim -Mrm
  5. What is spamming from my server?
    grep cwd /var/log/exim_mainlog | grep -v /var/spool | awk -F"cwd=" '{print $2}' | awk '{print $1}' | sort | uniq -c | sort
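
Beyond listing and flushing the whole queue, exiqgrep can also select messages by sender, age, or frozen state. A few sketches we use ourselves – the sender address and the age threshold below are placeholders, adjust to taste:

# remove all messages from one particular sender
exiqgrep -i -f sender@example.com | xargs exim -Mrm
# remove messages that have been queued for more than a day (86400 seconds)
exiqgrep -i -o 86400 | xargs exim -Mrm
# remove all frozen messages
exiqgrep -i -z | xargs exim -Mrm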

We hope this will come in handy for you as much as it does for us at times!


This cheatsheet has been compiled from two sources:

http://www.cyberciti.biz/faq/exim-remove-all-messages-from-the-mail-queue/
http://www.inmotionhosting.com/support/email/exim/find-spam-script-location-with-exim


Latency and Throughput as key components of network performance

We have recently added another transit feed to our New York PoP, with the declared aim of bringing latency between London and New York down to sub-70ms. We are more than happy to report that current latencies between London Telehouse and New York are now around 67ms. An update to our latency overview has been posted here as well: https://worralorrasurfa.castlegem.co.uk/whmcs/knowledgebase.php?action=displayarticle&id=43.

With that, we want to explain the essentials of latency and throughput a bit.

Network latency describes how long it takes a packet to travel from its source to its destination. Network throughput, on the other hand, describes how much data you can send per unit of time. Latency and throughput are usually not directly related, unless a link becomes saturated (at which point throughput will decrease and latencies will most likely increase), and different applications and purposes require varying degrees of quality in terms of latency and throughput.

For example, if you want to manage a Linux server via ssh from home, you would like to see low latencies: you want to see what you type right away, not wait for ages for the characters to appear in your shell. Latency here is key, but throughput is not that important: ssh does not need enormous amounts of bandwidth. Video streaming is a different matter. If you want to watch youtube videos, you want them to come down your internet connection as smoothly as if you were watching TV at home. In this case you need decent throughput, i.e. a lot of data per unit of time, but latency is not much of an issue: it won't matter much if your video starts after 1 or 2 seconds, as long as it plays smoothly.

Currently, we see the emphasis shifting towards low latency. While latency has always been a big concern for us due to the nature of our clients (a good many of them are traders who require superb latencies to the exchanges), throughput used to be the decisive parameter for internet connections. Part of this shift, we believe, comes from the fact that most typical internet applications nowadays get along very well with the bandwidth commonly available.

How can we measure latency and throughput? For latencies, ping, traceroute, and mtr are excellent friends. We wrote about these in a previous post, but let’s go into some examples:

ping

ping, put simply, checks the connectivity between source and destination:

# ping HOSTNAME
PING HOSTNAME (IP) 56(84) bytes of data.
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=1 ttl=60 time=66.8 ms
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=2 ttl=60 time=66.8 ms
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=3 ttl=60 time=66.8 ms
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=4 ttl=60 time=66.8 ms
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=5 ttl=60 time=66.8 ms
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=6 ttl=60 time=66.8 ms

We can see that the latency between our host (a London Telehouse server) and the destination (one of our routers in New York) is pretty much 66.8ms. ping accepts various arguments, such as the packet size or the number of packets to send; the manpage (man ping) will give you the details.
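
A few common variations, using standard Linux (iputils) ping flags – HOSTNAME is a placeholder, and note that intervals below 0.2 seconds require root:

# send 10 packets with a 1400 byte payload instead of the default 56 bytes
ping -c 10 -s 1400 HOSTNAME
# send 100 packets at one probe every 0.2 seconds
ping -c 100 -i 0.2 HOSTNAME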

traceroute

traceroute will not only check the latency between the source and destination, but will also show latencies (and thus possible issues) on the way there:

# traceroute HOSTNAME
traceroute to HOSTNAME (IP), 30 hops max, 60 byte packets
 1  ... (...)  0.419 ms  0.463 ms  0.539 ms
 2  40ge1-3.core1.lon2.he.net (195.66.224.21)  10.705 ms  10.706 ms  10.422 ms
 3  100ge1-1.core1.nyc4.he.net (72.52.92.166)  67.176 ms  67.189 ms  67.174 ms
 4  10ge9-7.core1.sjc2.he.net (184.105.213.197)  141.010 ms  140.897 ms  140.928 ms
 5  10ge1-2.core1.fmt2.he.net (72.52.92.73)  136.597 ms  136.746 ms  136.885 ms
 6  ....castlegem.co.uk (IP)  136.855 ms  136.437 ms  136.635 ms

As we can see, we get rather stable latencies all the way from London to California. Large variations at intermediate hops are not necessarily an indication of issues, as long as the latency to the destination remains smooth and regular. Possible reasons for deviations along the path include routers rate limiting their replies or, in the worst case, routers or networks indeed being congested (we will get to measuring throughput shortly).
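
Since routers may treat probe types differently (and Linux traceroute defaults to UDP probes), it can be worth re-checking an odd-looking path with other probe types. A quick sketch, using standard Linux traceroute flags – some probe types may require root:

# use ICMP echo probes instead of the default UDP
traceroute -I HOSTNAME
# use TCP SYN probes towards port 80
traceroute -T -p 80 HOSTNAME
# skip reverse DNS lookups for quicker output
traceroute -n HOSTNAME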

mtr

mtr can in a way be considered a combination of ping and traceroute. It displays the network path packets travel, and it keeps doing so continuously, sending probe after probe.

HOSTNAME (0.0.0.0)                                                                   Fri Mar  7 09:51:28 2014
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                                                         Packets               Pings
 Host                                                                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. IP                                                                                  0.0%    28    0.3   1.9   0.3  41.4   7.7
 2. vl365-globalcrossing-peer.jump.net.uk                                               0.0%    28    0.3   5.8   0.3  64.3  16.6
 3. po7-20G.ar4.CHI2.gblx.net                                                           0.0%    28  259.2 114.3  89.3 259.2  57.0
 4. DESTINATION                                                                         0.0%    28   91.8  91.9  91.6  94.5   0.6

We can see that hop #3 has a large standard deviation, but latency to the destination is very consistent. In our case, this is from London to Chicago. Hop #3 simply seems to rate limit these probe packets, and/or is busy doing things other than talking to us, hence the larger latencies. It would not be uncommon to see packet loss at intermediate routers either – this is fine and likewise due to rate limiting mechanisms – as long as the destination shows consistent latency, i.e. no packet loss and no extreme deviations.
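
The display above is interactive and keeps updating; for a one-off summary that you can paste into an email or a ticket, mtr's report mode comes in handy:

# send 100 probes per hop, print the statistics, and exit
mtr --report --report-cycles 100 HOSTNAME
# the same with the short flags
mtr -r -c 100 HOSTNAME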

That is all good – but how do we check throughput? There are several makeshift means of measuring it, ranging from timing browser requests on the command line (such as time lynx -source http://www.google.com/ > /dev/null) to using ftp with hash marks on, or the more common wget http://HOST/testfile. These will all give you a cursory glimpse of how fast you can download data from a destination to your computer. There is, however, a very nice tool called iperf that does this job in a very professional manner.
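
For a quick one-off check, wget alone is usually enough, since it reports the average transfer rate once the download completes – the testfile URL being a placeholder, of course:

# discard the downloaded data; we only care about the reported rate
wget -O /dev/null http://HOST/testfile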

iperf

iperf can measure throughput between two network locations, and it can give you a good idea of bottlenecks when used in combination with traceroute or mtr. The drawback of iperf is that you need not only a client, but also a server to connect to. iperf is thus more of a professional tool, i.e. something set up between providers, or between commercial clients and their providers, to sort out potential issues, define SLAs, etc.

There is an excellent introductory article on iperf from 2007, which we are happy to link to here: http://www.enterprisenetworkingplanet.com/netos/article.php/3657236/Measure-Network-Performance-with-iperf.htm.

Example output, both from the server and client side, can be seen below:

# ./iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local IPx port 5001 connected with IPy port 59508
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.1 sec   566 MBytes   472 Mbits/sec
# ./iperf -c HOSTNAME -t 10
------------------------------------------------------------
Client connecting to HOSTNAME, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local IPy port 59508 connected with IPx port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   566 MBytes   474 Mbits/sec
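
The run above uses a single TCP stream with default settings. Two variations we find useful – assuming iperf 2.x on both ends, as in the output above:

# four parallel TCP streams for 30 seconds
./iperf -c HOSTNAME -P 4 -t 30
# UDP at a target rate of 100 Mbits/sec (the server must then run ./iperf -s -u);
# UDP mode additionally reports jitter and packet loss
./iperf -c HOSTNAME -u -b 100M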

Here we conclude our brief overview and hope that some of you will find it useful indeed!



NTP Amplification Attack – the gist of it

With recent DDOS attacks increasingly using NTP as an attack vector, and one of Cloudflare's clients recently having been hit with a DDOS attack just short of 400 Gbps, we believe it is necessary to summarise what has been going on, how such attacks are possible at all, and what the community and providers can do to prevent or mitigate them as well as possible.

A concise overview by means of a CERT alert can be found here: https://www.us-cert.gov/ncas/alerts/TA14-013A.

Essentially, an attacker sends a certain command to a vulnerable NTP server, using a spoofed source address. The command itself is very short and produces very little traffic. The response, however, is a lot larger – and it is sent back to the spoofed source address, i.e. the victim. The response is typically about 206 times larger than the initial request – hence the name amplification – a very effective means of quickly filling up even very powerful internet pipes.
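
The command in question is monlist, a debugging feature of older ntpd versions that returns a list of up to 600 recent clients. You can check whether one of your own servers answers it using ntpdc from the classic ntp suite – IP being a placeholder for the address of the server to test:

# a vulnerable server replies with a long list of recent clients;
# a patched or properly restricted one returns nothing or times out
ntpdc -n -c monlist IP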

Cloudflare published a very interesting article as well, giving a quick overview about the most recent attack and the technology behind it: http://blog.cloudflare.com/technical-details-behind-a-400gbps-ntp-amplification-ddos-attack.

The recommended course of action is to secure your NTP server (cf. https://isc.sans.edu/diary/NTP+reflection+attack/17300), as well as to ensure that spoofed packets do not leave your network. Sample procedures are explained at BCP38.info.
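
On the NTP side, the commonly recommended hardening for ntpd versions that still ship monlist (it was removed in 4.2.7p26) looks roughly like this in /etc/ntp.conf:

# disable the monlist facility altogether
disable monitor
# serve time, but refuse status queries and configuration changes
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery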


Cloud computing

(this post will also appear in our virtual private server blog)

Ok, this had to come sooner or later – cloud computing. In a nutshell: do we offer it? Yes – if the very specific advantages of the cloud warrant its use.

So, what is it, why are so many people crazy about it, and why is it so expensive compared to a virtual private server or a dedicated server? Essentially, it is simply a different concept of providing resources on demand that can seemingly scale ad infinitum – with contention disadvantages similar to those of a virtual private server, however. Why? Because eventually, cloud resources, too, must run on a physical machine. And typically, you won't be the only person using this machine for your cloud purposes: you share the resources with others, and therefore there will always be a level of contention subject to the ISP's discretion – even with very sophisticated virtualisation and isolation methods. Most ISPs sell you cloud computing as hype, when in fact it is very little else than a different version of a virtual private server.

Of course, cloud crusaders will tell you why you must have a cloud, and start explaining about the additional layer of flexibility, redundancy, security, scalability, etc. In return, one can ask questions such as: do you really want to host your confidential and secure data in a cloud without really being able to pinpoint its exact location? Your data is “somewhere in the cloud”. How secure does that make you feel? How redundant is it really? How can I assess redundancy if it is not even clear to me where exactly my data is stored? What level of redundancy does it offer compared to normal RAID systems and backup/restore solutions? My application's resource usage varies by only 25% – why should I not go for a dedicated setup, or even a virtual private server, instead?

We still consider one of the cloud's main advantages to be its flexibility for sites and applications whose resource use varies a lot over time, often in very irregular patterns. While clouds can scale a lot (depending on what you agree upon with your ISP), there are still resource limits to be observed, so even in a cloud you should take care to estimate your expected peak resource usage.

This is a very basic and by no means comprehensive statement – there are far more complex issues to be considered with clouds. Put the other way round: normally, even for very complex and resource-hungry applications, you will only need a cloud if you can state its technical advantages over a virtual private server / VPS or dedicated server. Otherwise, in 99 out of 100 cases you will be better off with the latter, outside the cloud.

A very good read is http://www.eucalyptus.com/resources/info/cloud-myths-dispelled – four important technical aspects of cloud computing.

Oversubscription or not? Contention – the good and the bad

(this post will also appear in the virtual private server blog with additional content)

The terms oversubscription and contention are often heard when it comes to providing services as an ISP, for example regarding CPU resources, memory, bandwidth, etc. Generally speaking, they mean selling, providing, or advertising services to more people than the effectively available resources could serve simultaneously. Let us look at bandwidth, an often-cited example, first:

Let us assume your VPS is on a Gig link to the host, and that there are 3 more guest systems on the same physical host, also with their own Gig links to the host, while the host itself has only a single GigE connection to its switch. In that case, we have a contention ratio of 4:1, since we have 4 VPS with Gig links but only one effective Gig uplink to the outside world. Here, oversubscription is rarely an issue – a typical VPS pushing a sustained Gbps isn't feasible and would be better sold as a dedicated machine anyway, so such rates are only seen during spikes, with the other VPS rarely competing for the available host bandwidth at the same moment.

The same reasoning continues from the host to the switch it is connected to: how many other physical interfaces compete for that switch's uplink? A typical contention ratio here might be 8:1, 16:1, 24:1, or similar – i.e. your typical GigE switch with a 1 Gbps fibre uplink has to serve 8, 16, or 24 machines that, in theory, could each use the entire 1 Gbps available. Here it is already trickier to judge whether contention might become an issue – a customer with rather evenly shaped traffic curves is more predictable than a customer with a lot of spikes during the day, spikes that can take valuable bandwidth away from other customers.

How do ISPs do it? Most ISPs accept contention ratios based on the traffic they see on their switches and routers, and throughout the entire backbone. This is a legitimate practice, and as long as it doesn't handicap one group of users against another, hardly anyone will even notice that it is being done at all. Some ISPs, however, sell bandwidth at dumping prices, fully aware that their traffic/cost/revenue model is anything from risky to shady – when you encounter a bandwidth deal that sounds too good to be true, it probably is, like anything in life. Traffic costs have evolved differently in various regions of the globe, and comparing several providers on a single continent or within a large geographic region (US, UK, continental Europe, etc.) can help you a lot in finding out the truth behind the myths of “unlimited traffic” – it also helps to ask questions such as “unlimited traffic – I have a website generating 100TB of traffic each month – still unlimited?”. Often the small print will tell you things like “irregular use is subject to caps”, “bandwidth in excess of…is subject to approval/surcharge”, etc. Make sure you get your ISP to state the facts straight; ask until you are sure that what you need and want will be provided without hidden costs.

How do we do it? In terms of bandwidth, we have a contention ratio of 1:1. Yes, that's correct. Bandwidth we sell is dedicated, and we make sure that you won't share it with any other customer, even if all of them request their share at the very same moment. This model has its disadvantage: our traffic is more expensive than oversubscribed traffic. The advantage: you get what you pay for, no matter what – no hidden costs, no caps.