Backups, backups, backups

Recent events have shown that awareness of the importance of one's own data still leaves room for improvement.

Everybody knows: backups are important. Then how come so many people don't have any and implicitly rely on their host to handle them? This assumption probably stems from everyday experience. If something breaks, it is usually under warranty, so it gets replaced. If we accidentally delete a file from our home laptop, we recover it from some external drive, and so on. So surely a company that deals with hosting your data will have backups, right? Or not?

The answer is: NO.

Most ISPs and hosting companies will not back up your data. And again: your service provider will NOT (really, they won't!) back up your data. They will not have those urgently needed copies of your dedicated server. They will not have point-in-time copies of your virtual private server. They simply won't.

I think the main point has now come across. But why not? They are a host, after all, no? Yes, BUT: backups are not just snapshots of your VM. A proper backup involves a lot of thought, because it includes parameters such as:

  • what to back up (which files, databases, …)
  • how to back it up (archives? all files? database dumps?)
  • when to back it up (midnight? early morning? when is it best for you?)
  • how often to back it up (once a day? a week? every hour?)
  • how many generations of backups to keep (1? 3? 10?)
  • how long to keep your backups (a month? half a year? 5 years?)
  • HOW TO RESTORE FILES FROM BACKUP (why is this in capitals: it's nice to know your files are backed up, but do you know how to retrieve them? Where to place them? Are there any dependencies between the files lost and those still there?)

When you signed up with your ISP, how many of the above questions were addressed? Or do their TOS perhaps say "we are not responsible for your data"? Taking proper backups takes a lot of human effort and considerable IT resources. There are hosts who will sometimes do you a favour and take snapshots of your VMs, but their TOS will still say they are not responsible, and they will not know what they are backing up either.
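To make the above less abstract, here is a rough sketch of what even a minimal, self-managed answer to those questions might look like. Everything in it (paths, schedule, the MySQL dump) is a hypothetical example, not a recommendation for your specific setup:

#!/bin/sh
# nightly-backup.sh - a minimal sketch, not a complete solution
# what: /etc and all MySQL databases; how: tar archive plus database dump
# generations: 7, one per weekday, each overwritten after a week
DAY=$(date +%u)
tar czf /backup/etc-$DAY.tar.gz /etc
mysqldump --all-databases | gzip > /backup/db-$DAY.sql.gz
# when and how often: run from cron, e.g. daily at 03:15:
# 15 3 * * * root /usr/local/sbin/nightly-backup.sh

Note that this sketch answers none of the restore questions and keeps everything on the same machine; you would still have to test restores regularly and move copies off-site.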

The only way out of this dilemma is a managed contract and/or a special SLA that lays down the answers to all of the above questions and defines what responsibility the host assumes if they do not live up to this agreement (i.e. if they DO lose your data).

Which brings us to the next point: YOU should always have backups as well. It is your data, after all. If you do not have the resources to back up your data at reasonable intervals, come to an agreement with your host for a managed backup service (and still try to keep at least occasional off-site backups: have your host send you a DVD, for example). An enterprise host will guarantee that your backups are securely stored on redundant and resilient hardware (a single drive attached to a USB port of a 10-year-old server that has trouble booting after every update is not a redundant and resilient backup service), and such a host will also regularly check that your backups are readable and can easily be restored at your convenience.

This is not something that can be included in a single-digit-per-month contract for a run-of-the-mill virtual private server or a dedicated server acquired during a blowout sale. Ultimately, if you lose data, it is your time and money that are at stake. If you have no arrangement with your host, you will take the brunt of any data disaster that occurs (and even if you have an SLA that shifts responsibility to your host: should they fail to live up to it, you are still going to sweat a lot, along with your host, who will be facing a claim for damages).

How important is your data to you? Do you care so little about it that you will not ensure it can be retrieved at any time, should force majeure or an accident delete your production setup? We are certain that this is not the case, so please:

Take responsibility for your data. Your data needs you, and vice versa. If you do not have the resources, find someone who does, and someone who can be held accountable if they do not live up to the agreement made. Your host is not going to do anything for you unless you have it in writing (and by that we do not mean the shiny ad on a host's homepage).

This is a rather fervent da capo of our post from 2011 (http://dedicatedservers.castlegem.co.uk/2011/10/backups/). Why this ardour? Because we care for your data. We want you to be able to lean back and enjoy the feeling of your data being reliably secure. Just keep in mind: it isn’t anything that comes for free, and not without asking and specifying.

This post is not directed at any particular ISP. It applies to ourselves as well: we have the very same kind of TOS that prevent us from being held responsible for any data loss on clients' servers unless we have a direct agreement stating the opposite. But we do offer enterprise solutions where we live up to every single letter of the agreement and regularly outperform it: just like many other hosts out there.


Exim and its queue: quick HowTo

Today we are posting a quick HowTo for those admins running exim on their server. Typically, this will also be useful for all those who run servers that use cPanel as their hosting platform. We will show the commands to display the queue as such, how to remove single or all messages from the queue, and how to identify nasties in the queue.

  1. What’s in the queue?
    exim -bp
  2. How many messages are in the queue?
    exim -bpc
  3. How do I get rid of a specific message?
    exim -Mrm {message-id}
  4. How do I empty the entire queue?
    exim -bp | awk '/^ *[0-9]+[mhd]/{print "exim -Mrm " $3}' | bash

    or

    exim -bp | exiqgrep -i | xargs exim -Mrm
  5. What is spamming from my server?
    grep cwd /var/log/exim_mainlog | grep -v /var/spool | awk -F"cwd=" '{print $2}' | awk '{print $1}' | sort | uniq -c | sort
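A bonus one-liner: exiqgrep (shipped alongside exim on most systems, including cPanel) can also select only frozen messages, so you can remove just those instead of flushing the whole queue:

exiqgrep -z -i | xargs exim -Mrm

Here -z matches frozen messages and -i prints only their message IDs, which are then fed to exim -Mrm.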

We hope this will come in handy for you as much as it does for us at times!


This cheatsheet has been compiled from two sources:

http://www.cyberciti.biz/faq/exim-remove-all-messages-from-the-mail-queue/
http://www.inmotionhosting.com/support/email/exim/find-spam-script-location-with-exim


Latency and Throughput as key components of network performance

We have recently added another transit feed to our New York PoP, with the declared aim of bringing latency between London and New York down below 70ms. We are more than happy to report that current latencies between London Telehouse and New York are now around 67ms. An update to our latency overview has been posted here as well: https://worralorrasurfa.castlegem.co.uk/whmcs/knowledgebase.php?action=displayarticle&id=43.

With that, we want to explain the essentials of latency and throughput a bit.

Network latency in general states how long it takes for a packet to travel the distance between its source and destination. Network throughput, by contrast, defines how much data you can send in one go (per time unit). Latency and throughput are usually not directly related, unless a link becomes saturated (upon which throughput will decrease and latencies will most likely increase), and different applications or purposes require varying degrees of quality in terms of latency and throughput.

For example, if you want to manage a Linux server via ssh from home, you would like to see small latencies: you want to see what you type right away, and not have to wait for ages for the characters to appear on the shell. Latency here is key, but throughput is not that important: ssh does not need enormous amounts of bandwidth. Video streaming is a different matter. If you want to watch YouTube videos, you want them to come down your internet connection as smoothly as if you were watching TV at home. In this case you need decent throughput, i.e. a lot of data per time unit, but latency is not much of an issue: it won't matter much if your video starts after 1 or 2 seconds, as long as playback is smooth.

Currently, we see the emphasis on low latencies increasing. While latency has always been a big concern for us due to the nature of our clients (a great many of them are traders who require superb latencies to the exchanges), throughput used to be the decisive parameter for internet connections. Part of this shift in emphasis, we believe, is caused by the fact that nowadays most typical internet applications live very well within the bandwidths available.

How can we measure latency and throughput? For latencies, ping, traceroute, and mtr are excellent friends. We wrote about these in a previous post, but let’s go into some examples:

ping

ping, put simply, checks the connectivity between source and destination:

# ping HOSTNAME
PING HOSTNAME (IP) 56(84) bytes of data.
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=1 ttl=60 time=66.8 ms
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=2 ttl=60 time=66.8 ms
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=3 ttl=60 time=66.8 ms
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=4 ttl=60 time=66.8 ms
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=5 ttl=60 time=66.8 ms
64 bytes from gw-castlegem.init7.net (IP): icmp_seq=6 ttl=60 time=66.8 ms

We can see that the latency between our host (a London Telehouse server) and the destination (one of our routers in New York) is pretty much a constant 66.8ms. ping takes various arguments, such as the size of the packets or the number of packets to be sent; the manpage (man ping) will give you the details.
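For example, to send exactly 10 packets with a 1400 byte payload each (larger packets can help reveal MTU-related trouble), one could run:

# ping -c 10 -s 1400 HOSTNAME

-c sets the packet count, -s the payload size.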

Traceroute

traceroute will not only check the latency between the source and destination, but will also show latencies (and thus possible issues) on the way there:

# traceroute HOSTNAME
traceroute to HOSTNAME(IP), 30 hops max, 60 byte packets
 1  ... (...)  0.419 ms  0.463 ms  0.539 ms
 2  40ge1-3.core1.lon2.he.net (195.66.224.21)  10.705 ms  10.706 ms  10.422 ms
 3  100ge1-1.core1.nyc4.he.net (72.52.92.166)  67.176 ms  67.189 ms  67.174 ms
 4  10ge9-7.core1.sjc2.he.net (184.105.213.197)  141.010 ms  140.897 ms  140.928 ms
 5  10ge1-2.core1.fmt2.he.net (72.52.92.73)  136.597 ms  136.746 ms  136.885 ms
 6  ....castlegem.co.uk (IP)  136.855 ms  136.437 ms  136.635 ms

As we can see, we get rather stable latencies all the way from London to California. Large variations in latency along the way are not necessarily an indication of issues, though, as long as the destination latencies are still smooth and regular. Possible reasons for deviations along the path are routers rate limiting their replies or, in the worst case, routers or networks indeed being congested (we will get to measuring throughput shortly).

MTR

mtr can in a way be considered the combination of ping and traceroute. It displays the network path packets travel, and it keeps doing that by sending packet after packet.

HOSTNAME (0.0.0.0)                                                                   Fri Mar  7 09:51:28 2014
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                                                         Packets               Pings
 Host                                                                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. IP                                                                                  0.0%    28    0.3   1.9   0.3  41.4   7.7
 2. vl365-globalcrossing-peer.jump.net.uk                                               0.0%    28    0.3   5.8   0.3  64.3  16.6
 3. po7-20G.ar4.CHI2.gblx.net                                                           0.0%    28  259.2 114.3  89.3 259.2  57.0
 4. DESTINATION                                                                         0.0%    28   91.8  91.9  91.6  94.5   0.6

We can see that hop #3 has a large standard deviation, but latency to the destination is very consistent. In our case, this is from London to Chicago. Hop #3 simply seems to rate limit these probing packets, hence the larger latency, and/or is busy doing other things than talking to us. It would not be uncommon to see packet loss on intermediate routers either; this is fine and also due to rate limiting mechanisms, just as long as the destination latency remains consistent, i.e. no packet loss and no extreme deviations.
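For sharing results, e.g. with your host's support team, mtr also has a non-interactive report mode. A typical invocation sends a fixed number of probes and then prints a summary:

# mtr -r -w -c 100 HOSTNAME

-r produces the report, -w keeps hostnames unabbreviated, and -c sets the number of probe cycles.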

That is all good, but how do we check throughput? There are several makeshift ways to measure it, ranging from timing browser requests on the command line (such as time lynx -source http://www.google.com/ > /dev/null) to using ftp with hash marks switched on, and the more common wget http://HOST/testfile. These will all give you a cursory glimpse of how fast you can download data from a destination to your computer. There is, however, a very nice tool called iperf that does this job in a very professional manner.

iperf

iperf can measure throughput between two network locations, and it can give you a good idea of bottlenecks when used in combination with traceroute or mtr. The drawback of iperf is that you not only need a client, but also a server to connect to. iperf is thus primarily a professional tool, i.e. something set up between providers, or between commercial clients and their providers, to sort out potential issues, define SLAs, etc.

There is an excellent introductory article on iperf from 2007, which we are happy to link to here: http://www.enterprisenetworkingplanet.com/netos/article.php/3657236/Measure-Network-Performance-with-iperf.htm.

Example output, both from the server and client side, can be seen below:

# ./iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local IPx port 5001 connected with IPy port 59508
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.1 sec   566 MBytes   472 Mbits/sec
# ./iperf -c HOSTNAME -t 10
------------------------------------------------------------
Client connecting to HOSTNAME, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local IPy port 59508 connected with IPx port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   566 MBytes   474 Mbits/sec
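Two variations we find useful (flags as in the classic iperf 2 client; check your version's manpage): several parallel streams, which often fill a long, fat pipe better than a single TCP connection, and UDP mode with a target bandwidth (the server must then be started with -u as well):

# ./iperf -c HOSTNAME -t 10 -P 4
# ./iperf -c HOSTNAME -u -b 100M

-P sets the number of parallel client streams, -u switches to UDP, and -b sets the target bandwidth for UDP tests.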

Here we conclude our brief overview and hope that some of you will find it useful indeed!


NTP Amplification Attack – the gist of it

With recent DDOS attacks increasingly using NTP as an attack vector, and one of Cloudflare's clients recently having been hit with a DDOS attack just short of 400Gbps, we believe it is necessary to summarise what has been going on, how such attacks are possible at all, and what the community and providers can do to prevent or mitigate them as best as possible.

A concise overview by means of a CERT alert can be found here: https://www.us-cert.gov/ncas/alerts/TA14-013A.

Essentially, an attacker sends a certain command (the monlist query) to a vulnerable NTP server, using a spoofed source address. The command itself is very short and produces very little traffic. The response, however, is a lot larger, and it is sent back to the spoofed source address, i.e. the victim. The response is typically about 206 times larger than the initial request, hence the name amplification: a very effective means to quickly fill up even very powerful internet pipes.

Cloudflare published a very interesting article as well, giving a quick overview about the most recent attack and the technology behind it: http://blog.cloudflare.com/technical-details-behind-a-400gbps-ntp-amplification-ddos-attack.

The recommended course of action is to secure your NTP server (cf. https://isc.sans.edu/diary/NTP+reflection+attack/17300), as well as to ensure that spoofed packets do not leave your network. Sample procedures are explained at BCP38.info.
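For a typical ntpd installation, the server-side hardening boils down to a few lines in /etc/ntp.conf. The following is a sketch along the lines of the CERT/SANS recommendations; adapt the restrict lines to your own clients and peers before deploying, and restart ntpd afterwards:

disable monitor
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict -6 ::1

disable monitor switches off the monlist feature entirely, and the restrict lines stop unknown hosts from querying status information or modifying the daemon.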


iftop – or where’s my server’s bandwidth going?!

During the past weeks we gave a small introduction to UNIX and Linux commands that are nice to have at hand when it comes to administering a server from the command shell, making some quick changes, or generally assisting a sysadmin with her everyday tasks.

Today we want to have a look at iftop – a small program that allows you to check what your dedicated or virtual private server is doing in terms of internet traffic: where packets go to, and where they come from.

This is useful when you want to investigate a process or virtual machine hogging bandwidth on a server, or when your monitoring systems show unusual traffic patterns.

The syntax as such is very simple, for a start it should be sufficient to run

# /usr/sbin/iftop -i eth1 -p -P

from the shell (you will typically need root privileges). The -i switch lets you specify which interface to listen on, -p runs iftop in promiscuous mode (necessary for some virtualisation architectures), and -P shows port numbers/services in addition to hosts.

On a standard CentOS install, iftop needs extra repositories to be installed (or to be compiled from source), and you will need (n)curses and libpcap packages installed as well.
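On CentOS that usually means enabling EPEL first; a quick sketch (package names may vary between releases):

# yum install epel-release
# yum install iftop

Once installed, iftop also accepts a pcap-style filter via -f, so you can narrow the display to just the traffic you care about, e.g. web traffic only:

# /usr/sbin/iftop -i eth1 -P -f "port 80"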


Additional and in-depth information can be found here:
http://www.ex-parrot.com/pdw/iftop/ (author, source code)
http://www.cyberciti.biz/faq/centos-fedora-redhat-install-iftop-bandwidth-monitoring-tool/ (overview, examples)
http://sickbits.net/iftop-finding-traffic-hogs/ (overview, examples)


Forgotten Unix commands: nice and renice

Today we are going to shed some light onto the way processes can be (re-)prioritised when it comes to scheduling using the nice and renice commands.

Typically (but not always), nice values range from -20 (run with top priority) to +19/20 (run only when nothing else wants to run). Let's have a look at an excerpt of a process list produced with "ps axl" on a CentOS server:

ps axl
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4     0     1     0  20   0   4116   876 poll_s Ss   ?          0:00 init
1     0     2     1  20   0      0     0 kthrea S    ?          0:00 [kthreadd/1049]
1     0     3     2  20   0      0     0 worker S    ?          0:00 [khelper/1049]
5     0   134     1  16  -4  10420   612 poll_s S<s  ?          0:00 /sbin/udevd -d
1     0   560     1  20   0  63596  1216 poll_s Ss   ?          0:00 /usr/sbin/sshd
1     0   740     1  20   0 281748 10376 poll_s Ss   ?          0:59 /usr/sbin/httpd
1   497   756   749  25   5  64836  1056 hrtime SN   ?          0:54 /usr/sbin/zabbix_agentd
1     0   760     1  20   0 116668  1212 hrtime Ss   ?          0:03 crond
5    48  7776   740  20   0 282272  7352 inet_c S    ?          0:00 /usr/sbin/httpd
5    48  8022   740  20   0 282128  6508 inet_c S    ?          0:00 /usr/sbin/httpd
...

Except for udevd (niced to -4) and zabbix_agentd (niced to +5), everything is running at the default priority, without any sort of nicing. Unprivileged users can only lower the priority of their processes (so as not to interfere with the stability of the underlying OS); the superuser can also increase the priority of a process, though. Let's have a look at how this is done for processes that have already started:

# renice -n -4 -p 7776; ps axl
7776: old priority -4, new priority -4
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
...
5    48  7776   740  16  -4 282656  7580 inet_c S<   ?          0:00 /usr/sbin/httpd

As we can see, the PRI and NI columns have changed for this single process. The arguments here are -n and -p: -n takes an integer by which to modify the priority, and -p takes the process ID.

The nice command works in a similar fashion, it takes an integer for its -n parameter value, followed by the command as such, e.g.:

nice -n 2 ps axl results in:

F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
...
4     0 14503 12698  22   2 105464   892 -      RN+  pts/0      0:00 ps axl
...

whereas nice -n -2 ps axl yields:

F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
...
4     0 14533 12698  18  -2 105468   892 -      R<+  pts/0      0:00 ps axl
...

So…that's all very NICE, yes, but what does it actually mean? Modifying the priority of processes can be useful when processes need more power than the CPU can actually deliver. In such cases, (re)nicing processes can help stabilise a system, avoid (too much) contention for resources, and keep vital functions up and running.

A typical example may be compressing a large file, where it is not important how long the process actually takes, but it should have no (or at least only very little) effect on everything else on the system. In such cases, one might run the compression with nice -n 19 gzip …, thus giving it a lower priority. It is also worth mentioning that Linux often features a program called ionice, which is specifically intended for scheduling I/O (as opposed to primarily scheduling CPU).
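A sketch combining the two (the file name is just a placeholder): run the compression at the lowest CPU priority and in ionice's idle class, so it only gets CPU and disk time nobody else wants:

# nice -n 19 ionice -c 3 gzip -9 largefile

ionice -c 3 selects the idle I/O scheduling class; note that this class only takes effect with a suitable I/O scheduler (CFQ on the kernels of this era).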

There is, however, no general rule as to “how much” more priority a process gets based on the differences in integer values.

nice and renice have lost importance over the years with CPU power ever increasing, but they might nevertheless come in handy at times, and we hope this quick intro will be useful to you when it comes to administering your virtual private server or your dedicated server.


Forgotten Unix commands: xargs

In today’s post we will cover a bit about xargs – also very useful to perform rather complex (and repetitive) operations on the Linux shell.

xargs is particularly useful when it comes to piping (“|“) and chaining commands and their arguments together.

xargs reads items from the standard input or from pipes, delimited by blanks or newlines, and then executes the command one or more times with any initial arguments followed by the items read from standard input. Blank lines on the standard input are ignored [1].

Let’s do some examples:

# echo Where is my tux? | xargs
Where is my tux?

That was easy, especially since echo is the default command anyway.

Now something more useful maybe:

# find /var/log -name 'secure-*' -type f -print | xargs /bin/ls -las
4 -rw------- 1 root root  593 Dec 28 20:02 /var/log/secure-20131229
8 -rw------- 1 root root 6734 Jan  4 14:38 /var/log/secure-20140105
4 -rw------- 1 root root 3793 Jan 10 15:33 /var/log/secure-20140112
4 -rw------- 1 root root 1182 Jan 16 08:40 /var/log/secure-20140119

You do not really need the -print here, and you can do it with find alone as well:

# find /var/log -name 'secure-*' -type f -ls
11903950    4 -rw-------   1 root     root          593 Dec 28 20:02 /var/log/secure-20131229
11903981    8 -rw-------   1 root     root         6734 Jan  4 14:38 /var/log/secure-20140105
11903996    4 -rw-------   1 root     root         3793 Jan 10 15:33 /var/log/secure-20140112
11904057    4 -rw-------   1 root     root         1182 Jan 16 08:40 /var/log/secure-20140119

or:

# find /var/log -name 'secure-*' -type f -exec ls -las {} \;
4 -rw------- 1 root root 593 Dec 28 20:02 /var/log/secure-20131229
8 -rw------- 1 root root 6734 Jan  4 14:38 /var/log/secure-20140105
4 -rw------- 1 root root 3793 Jan 10 15:33 /var/log/secure-20140112
4 -rw------- 1 root root 1182 Jan 16 08:40 /var/log/secure-20140119

But maybe you forgot that syntax? It never hurts to have several approaches at hand!

Some more useful examples: the next one is to clean things up a bit. Say you have a couple of files in several directories:

# ls -las dir1; ls -las dir2
total 8
4 drwxr-xr-x 2 root root 4096 Jan 24 09:45 .
4 drwxr-xr-x 5 root root 4096 Jan 24 09:42 ..
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 1
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 2
total 8
4 drwxr-xr-x 2 root root 4096 Jan 24 09:45 .
4 drwxr-xr-x 5 root root 4096 Jan 24 09:42 ..
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 3
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 4

For some reason, you want them all to be moved to one single, new directory, let’s call it dir3:

# ls -las dir3
total 8
4 drwxr-xr-x 2 root root 4096 Jan 24 09:45 .
4 drwxr-xr-x 5 root root 4096 Jan 24 09:42 ..

You can now embark on a mv orgy, or you can do it a bit faster:

# find ./ -type f -print0 | xargs -0 -I {} mv {} dir3
# ls -las dir*
dir1:
total 8
4 drwxr-xr-x 2 root root 4096 Jan 24 09:46 .
4 drwxr-xr-x 5 root root 4096 Jan 24 09:42 ..

dir2:
total 8
4 drwxr-xr-x 2 root root 4096 Jan 24 09:46 .
4 drwxr-xr-x 5 root root 4096 Jan 24 09:42 ..

dir3:
total 8
4 drwxr-xr-x 2 root root 4096 Jan 24 09:46 .
4 drwxr-xr-x 5 root root 4096 Jan 24 09:42 ..
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 1
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 2
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 3
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 4

Nice, isn't it? {} acts as a placeholder for each input item; the -0 and -I options are there to handle special characters in filenames (which is also why we used -print0 in the find command) and to replace the specified string occurrence with one read from the standard input.
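Two more switches worth knowing on GNU xargs: -n limits how many items go into each invocation of the command, and -P runs several invocations in parallel. A sketch (the path is illustrative): compress all .log files, four gzip processes at a time:

# find /var/log/myapp -name '*.log' -print0 | xargs -0 -n 1 -P 4 gzip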

Now, let’s cp all these files into a single directory (e.g. useful to make a quick “backup” of files to an external drive, etc.):

# find /tmp/xargs/ -type f -print0 | xargs -0 -r -I file cp -v -p file --target-directory=/tmp/xargs/dir3
`/tmp/xargs/dir2/3' -> `/tmp/xargs/dir3/3'
`/tmp/xargs/dir2/4' -> `/tmp/xargs/dir3/4'
`/tmp/xargs/dir1/1' -> `/tmp/xargs/dir3/1'
`/tmp/xargs/dir1/2' -> `/tmp/xargs/dir3/2'
# ls -las dir3
total 8
4 drwxr-xr-x 2 root root 4096 Jan 24 09:58 .
4 drwxr-xr-x 5 root root 4096 Jan 24 09:42 ..
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 1
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 2
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 3
0 -rw-r--r-- 1 root root    0 Jan 24 09:42 4

Looks good, all files copied.

Last but not least, and very nifty indeed: quickly creating an archive of files (and/or directories):

# find /tmp/xargs -type f | xargs tar rvf archive.tar
tar: Removing leading `/' from member names
/tmp/xargs/dir2/3
/tmp/xargs/dir2/4
/tmp/xargs/dir1/1
/tmp/xargs/dir1/2
# ls -las
total 32
 4 drwxr-xr-x  5 root root  4096 Jan 24 10:01 .
 4 drwxrwxrwt. 4 root root  4096 Jan 24 09:42 ..
12 -rw-r--r--  1 root root 10240 Jan 24 10:01 archive.tar
 4 drwxr-xr-x  2 root root  4096 Jan 24 09:52 dir1
 4 drwxr-xr-x  2 root root  4096 Jan 24 09:52 dir2
 4 drwxr-xr-x  2 root root  4096 Jan 24 10:01 dir3
# tar -tf archive.tar
tmp/xargs/dir2/3
tmp/xargs/dir2/4
tmp/xargs/dir1/1
tmp/xargs/dir1/2

As we can see, all files are in the archive we created!

xargs isn't something one can learn on the fly, especially the more complex operations it can handle, but that is precisely what makes it such a valuable tool for system administrators of both virtual private servers and dedicated servers.

Now it is time to wish you good luck and a lot of fun exploring the world of xargs!

[1] From the man page of GNU's version of xargs.


Forgotten Unix Commands: awk

In our weekly series of forgotten UNIX commands, today we will give a brief overview of awk. awk is extremely useful for manipulating structured files and for displaying and working with the information contained in them, so it can come in very handy for any dedicated server admin.

Let’s get started!

Assume we have some sort of logfile of a webserver, with entries like the following:

119.63.193.131 - - [17/Jan/2014:07:01:10 +0000] "GET / HTTP/1.1" 302 211 "-" "Mozilla/4.0 (...)"
211.129.81.174 - - [17/Jan/2014:07:01:12 +0000] "GET /robots.txt HTTP/1.1" 200 40 "-" "siclab (...)"

awk works as you'd expect from a shell prompt: it takes stdin as input and writes to stdout by default. Now, we want to have a look at the IP addresses accessing our webserver:

root:> awk '{print $1}' logfile
119.63.193.131
211.129.81.174

Ok, that was easy, right? awk assigns each field of a line, separated by whitespace by default, to variables starting with $1 and going up to the number of fields in the line. $0 is the entire line, and NF holds the "number of fields" count, so $NF is the last field and $(NF-n) counts backwards from it. To see how NF works, here is an example:

root:> awk '{print $(NF-12)}' logfile
119.63.193.131
211.129.81.174

Another useful variable is NR, the current row (record) number. So let's go a step further: display row numbers, IP addresses, and the status code, and let's also format the output a bit:

root:> awk '{print NR " : " $1 " : " $9}' logfile
1 : 119.63.193.131 : 302
2 : 211.129.81.174 : 200

You could also add up fields, for example the total number of bytes transferred:

root:> awk '{ total += $10; print $10 " bytes in this line -> current total: " total}' logfile
211 bytes in this line -> current total: 211
40 bytes in this line -> current total: 251

You could also just display the output after processing the last line:

root:> awk '{ total += $10; print $10 " bytes in this line." } END { print "final total: " total }' logfile
211 bytes in this line.
40 bytes in this line.
final total: 251
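Patterns in awk can also be conditions, so you can filter while you print or sum. For instance, to show the request paths of all lines with a 404 status (field numbers as in our logfile above; our two sample lines produce no output, since neither is a 404):

root:> awk '$9 == 404 {print $7}' logfile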

There is more to it, of course. On most Linux systems, ps aux will display a nice process list of the underlying system, including memory and CPU time used, etc. Column 6 contains the resident set size, a very useful indicator. Let's sum it up quickly:

root:> ps aux | awk '{ rss += $6 } END { print "total rss: " rss }'
total rss: 132256

Faster than using a calculator, right?

A final one before I let you embark on your awk explorations on your own: assume you have a runaway/zombie/whatever httpd that you need to get rid of as fast as possible, and you want to just kill all processes that have httpd in their command column:

root:> ps aux | grep httpd | awk '{ print "kill -9 " $2}'
kill -9 740
kill -9 4629
kill -9 9365
kill -9 9366
kill -9 9368
kill -9 10589
kill -9 19518
kill -9 19689
kill -9 20126
kill -9 21925
kill -9 23486
kill -9 24635

NB: this just prints the kill commands, but does not execute anything. To make it happen, you need to pipe the output through the shell, i.e. use the same command line as above, but add " | sh " at the end.
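One caveat: the grep httpd above will normally match the grep process itself as well, producing one kill command for a PID that is already gone by the time it runs. A common trick is to write the pattern so that it cannot match its own command line:

root:> ps aux | grep '[h]ttpd' | awk '{ print "kill -9 " $2}' | sh

Where available, pkill -9 httpd achieves the same search-and-kill in one step.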

Have fun exploring, and possibly use Wikipedia to get started: http://en.wikipedia.org/wiki/AWK


Forgotten Unix Commands: lsblk

UNIX and its various flavours have a lot of commands that every admin uses on a near permanent basis, such as ls, cp, cat, grep, mv, rm, gzip, tar, and so on.

Today, however, we are starting a series titled 'Forgotten Unix Commands'. These can come in very handy, and they often produce effects like "oh, I didn't know you could do that on Linux!". Such commands can brighten up the day of every dedicated server or virtual private server administrator. One of these commands is lsblk.

lsblk lists information about all or the specified block devices. The lsblk command reads the sysfs filesystem to gather information.

The command prints all block devices (except RAM disks) in a tree-like format by default.

This command can be very useful to check how the different partitions and/or disks are mounted in the system. The following is an example from a desktop computer:

$ lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda       8:0    0 465.8G  0 disk
├─sda1    8:1    0  46.6G  0 part   /
├─sda2    8:2    0     1K  0 part
├─sda5    8:5    0   3.7G  0 part   [SWAP]
└─sda6    8:6    0 415.5G  0 part
sdb       8:16   0 465.8G  0 disk
└─sdb1    8:17   0 465.8G  0 part
  └─md0   9:0    0 465.7G  0 raid10 /data
sdc       8:32   0 465.8G  0 disk
└─sdc1    8:33   0 465.8G  0 part
  └─md0   9:0    0 465.7G  0 raid10 /data
sdd       8:48   1  14.9G  0 disk
└─sdd1    8:49   1  14.9G  0 part   /media/mnt/KINGSTON
sr0      11:0    1  1024M  0 rom

From this output you can see that on this desktop we have the following disks:

SDA is the first "scsi" disk, in our case a SATA disk. On this disk the first partition is used for the / filesystem, the second partition is an extended partition, the next one (sda5) is a swap partition, and the last one (sda6) is used for /home; it carries a BTRFS filesystem, which the command does not recognise, though.

SDB and SDC are the two disks that are used in a RAID 10 setup, and the filesystem is mounted as /data.

SDD is a 16GB USB stick mounted under the directory /media/mnt/KINGSTON.

Another example, this time output from a server with LVM:

$ lsblk
NAME                       MAJ:MIN RM   SIZE RO MOUNTPOINT
sda                          8:0    0 298.1G  0
├─sda1                       8:1    0   500M  0 /boot
└─sda2                       8:2    0 297.6G  0
  ├─vg_main-lv_swap (dm-0) 253:0    0   5.8G  0 [SWAP]
  ├─vg_main-lv_root (dm-1) 253:1    0    50G  0 /
  └─vg_main-lv_home (dm-2) 253:2    0 241.8G  0
    └─home (dm-3)          253:3    0 241.8G  0 /home
sr0                         11:0    1  1024M  0

As you can see, the command is easy to use, and the output is rather nifty with its tree style: on a glance you can see the partition layout, logical volumes, and other, additional useful information about your disks.

As opposed to lsblk, the better-known fdisk -l gives similar data; however, it requires root privileges and does not recognise dm or LVM volumes.
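If you want the filesystem details as well, recent versions of lsblk can show them directly: lsblk -f adds FSTYPE, LABEL, and UUID columns, which would also reveal the BTRFS filesystem on sda6 that the plain output above leaves blank:

$ lsblk -f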


This article was originally published on linuxaria. Castlegem has permission to republish. Thank you, linuxaria!

Needle in a haystack, or grep revisited: tre-agrep

Probably everyone who uses a terminal knows the command grep, cf. this excerpt from its man page:

grep searches the named input FILEs (or standard input if no files are named, or if a single hyphen-minus (-) is given as file name) for lines containing a match to the given PATTERN. By default, grep prints the matching lines.

So this is the best tool to search a big file for a specific pattern, or to find a specific process in the complete list of running processes, but it has its limitations: it searches only for the exact string you specify, whereas sometimes an "approximate" or "fuzzy" search would be more useful.

For this purpose the program agrep was first developed; from Wikipedia we can gather some details about this software:

agrep (approximate grep) is a proprietary approximate string matching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.

It selects the best-suited algorithm for the current query from a variety of the known fastest (built-in) string searching algorithms, including Manber and Wu’s bitap algorithm based on Levenshtein distances.

agrep is also the search engine in the indexer program GLIMPSE. agrep is free for private and non-commercial use only, and belongs to the University of Arizona.

So it's closed source, but luckily there is an open source alternative: tre-agrep.

Tre Library

TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.

The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the regular expression. In other words, the time complexity of the algorithm is O(M^2N), where M is the length of the regular expression and N is the length of the text. The space used is also quadratic in the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only in pathological cases which are probably very rare in practice.

Approximate matching

Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds to the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.

Installation

tre-agrep is usually not installed by default on any distribution, but it is available in many repositories, so you can easily install it with the package manager of your distribution; e.g. for Debian/Ubuntu and Mint you can use the command:

apt-get install tre-agrep

Basic Usage

Its usage is best demonstrated with some simple examples of this powerful command. Given the file example.txt that contains:

Résumé
RÉSUMÉ
resume
Resümee
rèsümê
Resume
linuxaria

The following is the output of the command tre-agrep with different options:

 mint-desktop tmp # tre-agrep resume example.txt
resume

mint-desktop tmp # tre-agrep -i resume example.txt
resume
Resume

mint-desktop tmp # tre-agrep -1 -i resume example.txt
resume
Resümee
Resume

mint-desktop tmp # tre-agrep -2 -i resume example.txt
Résumé
RÉSUMÉ
resume
Resümee
Resume

As you can see, without any option it returns the same result as a normal grep; the -i option is used to ignore case. The interesting options are -1 and -2: these are the distances allowed in the search, so the larger the number, the more results you'll get, since you allow a greater "distance" from the original pattern.

To see the distance of each match you can use the option -s: it prints each match’s cost:

mint-desktop tmp # tre-agrep -5 -s -i resume example.txt
2:Résumé
2:RÉSUMÉ
0:resume
1:Resümee
3:rèsümê
0:Resume
5:linuxaria

So in this example the string Resume has a cost of 0, while linuxaria has a cost of 5.

Further interesting options are those that assign a cost for different operations:

-D NUM, --delete-cost=NUM: set the cost of missing characters to NUM.
-I NUM, --insert-cost=NUM: set the cost of extra characters to NUM.
-S NUM, --substitute-cost=NUM: set the cost of incorrect characters to NUM. Note that a deletion (a missing character) and an insertion (an extra character) together constitute a substituted character, but the cost will be that of a deletion and an insertion added together.
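As a sketch of how these interact (flags as documented above; we have not reproduced the output here): raising the deletion and insertion costs makes the search favour pure substitutions, which narrows the matches for the same overall distance:

mint-desktop tmp # tre-agrep -2 -s -D 2 -I 2 -i resume example.txt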

Conclusions

The command tre-agrep is yet another small tool that can save your day if you work a lot with terminals and bash scripts.


This article was originally published on linuxaria. Castlegem has permission to republish. Thank you, linuxaria!