dtrace | TheStaticVoid

# dmesg
...
Mar 30 12:09:00 ragno nfs: [ID 808668 kern.notice] NFS write error on host nasfb: Read-only file system.
Mar 30 12:09:00 ragno nfs: [ID 702911 kern.notice] (file handle: 1d010000 1000000 e10d1300 a5122c4b 1d010000 1000000 15000000 5c179c4a)
...

# lsof -N -z ragno /zones/ragno/root/home
lsof: WARNING: can't stat() 5 zone file systems; using dev= options
COMMAND PID ZONE USER FD TYPE DEVICE SIZE/OFF NODE NAME
httpd 3065 ragno webservd cwd VDIR 256,65718 0 0 /zones/ragno (ragno)
httpd 3065 ragno webservd rtd VDIR 256,65718 0 0 /zones/ragno (ragno)
httpd 3065 ragno webservd txt VREG 256,65718 0 0 /zones/ragno (ragno)
httpd 3065 ragno webservd txt VREG 256,65718 0 0 /zones/ragno (ragno)
...

# dtrace -n 'fbt:nfs:nfs*_write_error:entry /zonename == "ragno"/ {vp = (vnode_t*) arg0; printf("%s", stringof(vp->v_path));}'
dtrace: description 'fbt:nfs:nfs*_write_error:entry ' matched 2 probes
CPU ID FUNCTION:NAME
0 56891 nfs_write_error:entry /zones/ragno/root/home/rmserver/Logs/rmaccess.log
0 56891 nfs_write_error:entry /zones/ragno/root/home/rmserver/Logs/rmaccess.log
0 56891 nfs_write_error:entry /zones/ragno/root/home/rmserver/Logs/rmaccess.log
0 56891 nfs_write_error:entry /zones/ragno/root/home/rmserver/Logs/rmaccess.log
0 56891 nfs_write_error:entry /zones/ragno/root/home/rmserver/Logs/rmaccess.log

DTrace is one of those magical tools that I sort of know how to use but have never had the chance to use in anger. Usually, firing up truss is sufficient to figure out what’s happening in a process.

But recently, a team at GW was setting up a web application and having some problems. It would hang whenever it was accessed. The application is highly modular and accesses several other servers for resources, so I suspected there must have been some sort of network problem, probably a blocked firewall port somewhere. Unfortunately, the application produces no logs (!) so it was pretty much guesswork.

I started snoop to watch for traffic generated by the application. I knew from watching the production application that there should be communication between two servers over port 9300, but I wasn’t seeing anything. I thought of using truss to determine whether the application was at least trying to connect, but because it is a web application, the process was far too short-lived to attach to. Even if I had been able to attach to it, all I might have learned was that the application was calling the ‘connect’ system call and possibly what value it returned.

DTrace, on the other hand, doesn’t have to attach to a particular process. It just sits in the kernel and dynamically observes what your scripts tell it to. It also has the ability to look into data structures so you can actually see what values are passed to functions and system calls. I used this ability to watch every call to the connect system call.

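#!/usr/sbin/dtrace -qs

/* Print the destination of every connect() call, assuming an IPv4 sockaddr. */
syscall::connect:entry
{
	/* arg1 is the user-space sockaddr pointer, arg2 its length in bytes. */
	this->socks = (struct sockaddr *) copyin(arg1, arg2);

	/* sa_data is a char array, so mask each byte to avoid sign extension. */
	this->hport = (uint_t) (this->socks->sa_data[0] & 0xff);
	this->lport = (uint_t) (this->socks->sa_data[1] & 0xff);
	this->port = (this->hport << 8) + this->lport;

	printf("%s: %d.%d.%d.%d:%d\n", execname,
	    this->socks->sa_data[2] & 0xff, this->socks->sa_data[3] & 0xff,
	    this->socks->sa_data[4] & 0xff, this->socks->sa_data[5] & 0xff,
	    this->port);
}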

This script copies the sockaddr structure (arg2 bytes at the user-space address in arg1) into DTrace’s scratch memory in the kernel, then reads the port (stored in big-endian order) and the destination IPv4 address from its sa_data bytes.
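The byte arithmetic is easier to check outside the kernel. Here is a rough Python sketch of the same decoding, assuming an IPv4 sockaddr whose first two bytes hold the address family and whose remaining bytes are sa_data. The function name and the sample address 192.0.2.10 are made up for illustration; only port 9300 comes from the story above.

```python
import socket
import struct


def decode_sockaddr_in(raw):
    """Return (address, port) from the raw bytes of an IPv4 sockaddr."""
    # Assumption: the first two bytes are the address family; the rest is sa_data.
    sa_data = raw[2:]
    # sa_data[0] is the high byte of the port and sa_data[1] the low byte.
    port = (sa_data[0] << 8) + sa_data[1]
    # sa_data[2] through sa_data[5] are the four octets of the IPv4 address.
    addr = ".".join(str(octet) for octet in sa_data[2:6])
    return addr, port


# Hypothetical sockaddr: AF_INET, port 9300, address 192.0.2.10, 8 bytes of padding.
raw = (
    struct.pack("=H", socket.AF_INET)
    + struct.pack("!H", 9300)          # network byte order: 0x24 0x54
    + socket.inet_aton("192.0.2.10")
    + bytes(8)
)

print(decode_sockaddr_in(raw))  # ('192.0.2.10', 9300)
```

Port 9300 is 0x2454, so the high byte 0x24 shifted left by eight bits gives 9216, and adding the low byte 0x54 (84) gives 9300. Python bytes are always unsigned; in C and D, sa_data is a char array, so an octet of 128 or more may read as negative unless it is masked or cast to an unsigned type.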

When run, the script immediately revealed that the web application was trying to connect to itself rather than a database server, a simple configuration mistake made difficult to diagnose due to poor logging facilities in the application.

In the future, there will be a network provider for DTrace which will simplify the job of extracting data from network calls. It should then be possible to rewrite the script to simply:

#!/usr/sbin/dtrace -qs

tcp:::connect-request
{
printf("%s: %s:%d\n", execname, args[2]->ip_daddr, args[4]->tcp_dport);
}

Hopefully I’ll have more opportunities to use DTrace in the future.


DTrace to the Rescue Again!

Watching Network Connections with DTrace