Example of how DTrace can be used to fix problems by Derek Crudgington, dacrud@utdallas.edu Introduction: DTrace is a new feature in Solaris 10 that is used to better understand the system and programs run on the system. It can allow you to look and see what your system is doing and locate causes of poor behavior. This is done by using the "D programming language" which doesn't take much programming experience to understand. In the example, I don't even have to create any DTrace scripts as there is a package of them already made called the DTraceToolkit (by Brendan Gregg) which contains over 80 scripts. The Problem: Trying to get the mod_owa.so module to load with apache without flipping the machine out is the problem. This module is used through apache to connect to an oracle database and retrieve information. Whenever I was loading the module with apache the machine was showing high activity (for just apache running) and processes: *defunct processes are processes that have become corrupted where they can't talk with the parent or child process. {callacct:/u01/app/oracle} /usr/apache/bin/apachectl start /usr/apache/bin/apachectl start: httpd started {callacct:/u01/app/oracle} vmstat 1 kthr memory page disk faults cpu r b w swap free re mf pi po fr de sr m1 m1 m1 m2 in sy cs us sy id 0 0 0 17075280 447352 89 105 14 6 6 0 2 1 1 1 0 432 2566 671 1 1 98 0 0 0 16958384 157736 326 1242 23 31 31 0 0 40 43 43 0 2319 670 409 1 7 92 0 0 0 16945808 154256 93 1962 24 0 0 0 0 110 113 111 0 4168 1224 736 2 14 83 0 0 0 16945720 149888 1 139 8 0 0 0 0 247 256 261 0 5067 64 716 0 10 90 0 0 0 16948864 151448 3 601 55 879 879 0 0 186 186 174 0 4885 77 637 0 9 91 ^C The machine before starting apache was idling at 100. We know that something is wrong here if its dipping into the low 90's and 80's. ps -ef output: {callacct:/u01/app/oracle} ps -ef UID PID PPID C STIME TTY TIME CMD root 15038 14830 0 - ? 0:00 oracle 16343 16341 0 Sep 19 pts/3 0:00 -sh oracle 16341 16338 0 Sep 19 ? 0:00 /usr/lib/ssh/sshd root 15042 12675 0 Sep 19 ? 0:00 /usr/lib/ssh/sshd root 15041 14830 1 09:14:45 ? 0:00 /usr/apache/bin/httpd root 14830 1 0 09:11:06 ? 0:01 /usr/apache/bin/httpd root 15044 15511 0 09:14:48 pts/2 0:00 ps -ef root 15043 14830 1 09:14:45 ? 0:00 /usr/apache/bin/httpd root 15040 14830 0 - ? 0:00 You can see above the defunct processes and the load time has jumped from nothing to .22: {callacct:/u01/app/oracle} w 9:13am up 4 day(s), 21:49, 3 users, load average: 0.22, 0.12, 0.05 So it's clear that when apache starts with this module, the machine is flipping out and something is going wrong. The machine was also acting laggy and taking forever to respond to simple commands. The Solution: I started looking into getting the source for the mod_owa.so module and compiling it (the one I had before was a binary) and thinking that might be the problem, but I didn't have much luck with this and decided that might not even fix the problem anyway. So I decided to look at the system and see what was actually happening. Here I use execsnoop from the DTraceToolkit to see if anything obvious is showing up: execsnoop is a DTrace script which tells us file executions as they occur on the system. {callacct:/root/DTraceToolkit-0.83} ./execsnoop -a TIME STRTIME ZONE PROJ UID PID PPID ARGS [wait 10-15 seconds] ^C Nothing showed up at all. So now we try opensnoop: opensnoop is a DTrace script which tells us file opens as they occur on the system. {callacct:/root/DTraceToolkit-0.83} ./opensnoop -a TIME STRTIME UID PID FD ERR PATH ARGS 425004142906 2005 Sep 21 09:27:27 0 15268 3 0 /etc/default/nss /usr/apache/bin/httpd\0 425004143217 2005 Sep 21 09:27:27 0 15268 3 0 /etc/nsswitch.conf /usr/apache/bin/httpd\0 425004144087 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/nss_files.so.1 /usr/apache/bin/httpd\0 425004146092 2005 Sep 21 09:27:27 0 15268 3 0 /etc/group /usr/apache/bin/httpd\0 425004146403 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/nss_ldap.so.1 /usr/apache/bin/httpd\0 425004146681 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/libsldap.so.1 /usr/apache/bin/httpd\0 425004147142 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/libldap.so.5 /usr/apache/bin/httpd\0 425004147924 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/libsasl.so.1 /usr/apache/bin/httpd\0 425004148358 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/mps/libnspr4.so /usr/apache/bin/httpd\0 425004148667 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/mps/libplc4.so /usr/apache/bin/httpd\0 425004148876 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/mps/libnss3.so /usr/apache/bin/httpd\0 425004149175 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/mps/libssl3.so /usr/apache/bin/httpd\0 425004149507 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/libthread.so.1 /usr/apache/bin/httpd\0 425004149871 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/mps/libsoftokn3.so /usr/apache/bin/httpd\0 425004150274 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/mps/libplds4.so /usr/apache/bin/httpd\0 425004151866 2005 Sep 21 09:27:27 0 15268 3 0 /usr/lib/mps/cpu/sparcv8plus/libnspr_flt4.so /usr/apache/bin/httpd\0 425004158239 2005 Sep 21 09:27:27 0 15268 3 0 /var/run/ldap_cache_door /usr/apache/bin/httpd\0 425004161287 2005 Sep 21 09:27:27 0 15268 -1 2 /u01/app/oracle/ocommon/nls/admin/data/lx1boot.nlb /usr/apache/bin/httpd\0 425004214337 2005 Sep 21 09:27:27 0 15260 4 0 /usr/share/lib/zoneinfo/US/Central /usr/sbin/dtrace -n \n /*\n * Command line arguments\n */\n inline int OPT_dump \t\0 425005154629 2005 Sep 21 09:27:28 0 15270 3 0 /etc/default/nss /usr/apache/bin/httpd\0 425005154935 2005 Sep 21 09:27:28 0 15270 3 0 /etc/nsswitch.conf /usr/apache/bin/httpd\0 425005155827 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/nss_files.so.1 /usr/apache/bin/httpd\0 425005157754 2005 Sep 21 09:27:28 0 15270 3 0 /etc/group /usr/apache/bin/httpd\0 425005158066 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/nss_ldap.so.1 /usr/apache/bin/httpd\0 425005158365 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/libsldap.so.1 /usr/apache/bin/httpd\0 425005158899 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/libldap.so.5 /usr/apache/bin/httpd\0 425005159578 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/libsasl.so.1 /usr/apache/bin/httpd\0 425005160012 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/mps/libnspr4.so /usr/apache/bin/httpd\0 425005161877 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/mps/libplc4.so /usr/apache/bin/httpd\0 425005162117 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/mps/libnss3.so /usr/apache/bin/httpd\0 425005162434 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/mps/libssl3.so /usr/apache/bin/httpd\0 425005162795 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/libthread.so.1 /usr/apache/bin/httpd\0 425005163160 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/mps/libsoftokn3.so /usr/apache/bin/httpd\0 425005163621 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/mps/libplds4.so /usr/apache/bin/httpd\0 425005165246 2005 Sep 21 09:27:28 0 15270 3 0 /usr/lib/mps/cpu/sparcv8plus/libnspr_flt4.so /usr/apache/bin/httpd\0 425005173810 2005 Sep 21 09:27:28 0 15270 3 0 /var/run/ldap_cache_door /usr/apache/bin/httpd\0 425005176723 2005 Sep 21 09:27:28 0 15270 -1 2 /u01/app/oracle/ocommon/nls/admin/data/lx1boot.nlb /usr/apache/bin/httpd\0 425005152831 2005 Sep 21 09:27:28 0 15269 3 0 /etc/default/nss /usr/apache/bin/httpd\0 425005153135 2005 Sep 21 09:27:28 0 15269 3 0 /etc/nsswitch.conf /usr/apache/bin/httpd\0 425005154016 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/nss_files.so.1 /usr/apache/bin/httpd\0 425005155926 2005 Sep 21 09:27:28 0 15269 3 0 /etc/group /usr/apache/bin/httpd\0 425005156238 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/nss_ldap.so.1 /usr/apache/bin/httpd\0 425005156540 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/libsldap.so.1 /usr/apache/bin/httpd\0 425005157045 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/libldap.so.5 /usr/apache/bin/httpd\0 425005157713 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/libsasl.so.1 /usr/apache/bin/httpd\0 425005158164 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/mps/libnspr4.so /usr/apache/bin/httpd\0 425005158493 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/mps/libplc4.so /usr/apache/bin/httpd\0 425005158708 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/mps/libnss3.so /usr/apache/bin/httpd\0 425005159021 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/mps/libssl3.so /usr/apache/bin/httpd\0 425005159381 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/libthread.so.1 /usr/apache/bin/httpd\0 425005159749 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/mps/libsoftokn3.so /usr/apache/bin/httpd\0 425005160265 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/mps/libplds4.so /usr/apache/bin/httpd\0 425005162040 2005 Sep 21 09:27:28 0 15269 3 0 /usr/lib/mps/cpu/sparcv8plus/libnspr_flt4.so /usr/apache/bin/httpd\0 425005168873 2005 Sep 21 09:27:28 0 15269 3 0 /var/run/ldap_cache_door /usr/apache/bin/httpd\0 425005171924 2005 Sep 21 09:27:28 0 15269 -1 2 /u01/app/oracle/ocommon/nls/admin/data/lx1boot.nlb /usr/apache/bin/httpd\0 We can see the httpd process is opening all these files over and over again. What we're interested in here is the FD (file descriptor) has a '-1' value (which means error) for the '/u01/app/oracle/ocommon/nls/admin/data/lx1boot.nlb' file. I check to see if that file exists: {callacct:/u01/app/oracle} ls /u01/app/oracle/ocommon/nls/admin/data/lx1boot.nlb /u01/app/oracle/ocommon/nls/admin/data/lx1boot.nlb: No such file or directory I find where that file is: {callacct:/u01/app/oracle} find /u01 -name *lx1boot.nlb* /u01/app/oracle/OraHome1/ocommon/nls/admin/data/lx1boot.nlb The problem is the program is looking for /u01/app/oracle/ocommon when it should be /u01/app/oracle/OraHome1/ocommon This should be simple, just create a symlink to link it to that directory: {callacct:/u01/app/oracle} ln -s /u01/app/oracle/OraHome1/ocommon ocommon {callacct:/u01/app/oracle} ls -al total 16 drwxrwxr-x 6 oracle oinstall 512 Sep 21 09:46 . drwx--x--x 3 oracle root 512 Sep 19 09:46 .. -rwxr-xr-x 1 oracle oinstall 12 Sep 19 10:05 .login drwxr-xr-x 28 oracle oinstall 512 Sep 19 10:57 OraHome1 drwxrwxr-x 4 oracle oinstall 512 Sep 19 10:55 jre lrwxrwxrwx 1 root root 32 Sep 21 09:46 ocommon -> /u01/app/oracle/OraHome1/ocommon drwxrwxr-x 12 oracle oinstall 512 Sep 19 10:58 oraInventory drwxrwxr-x 6 oracle oinstall 512 Sep 19 10:55 oui Now the link is made, run opensnoop again to see if its still complaining: {callacct:/root/DTraceToolkit-0.83} ./opensnoop -a TIME STRTIME UID PID FD ERR PATH ARGS 426269122833 2005 Sep 21 09:48:32 0 16167 3 0 /etc/default/nss /usr/apache/bin/httpd\0 426269123107 2005 Sep 21 09:48:32 0 16167 3 0 /etc/nsswitch.conf /usr/apache/bin/httpd\0 426269123948 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/nss_files.so.1 /usr/apache/bin/httpd\0 426269125739 2005 Sep 21 09:48:32 0 16167 3 0 /etc/group /usr/apache/bin/httpd\0 426269126027 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/nss_ldap.so.1 /usr/apache/bin/httpd\0 426269126375 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/libsldap.so.1 /usr/apache/bin/httpd\0 426269126819 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/libldap.so.5 /usr/apache/bin/httpd\0 426269127479 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/libsasl.so.1 /usr/apache/bin/httpd\0 426269127914 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/mps/libnspr4.so /usr/apache/bin/httpd\0 426269128235 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/mps/libplc4.so /usr/apache/bin/httpd\0 426269128437 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/mps/libnss3.so /usr/apache/bin/httpd\0 426269128746 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/mps/libssl3.so /usr/apache/bin/httpd\0 426269129088 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/libthread.so.1 /usr/apache/bin/httpd\0 426269129445 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/mps/libsoftokn3.so /usr/apache/bin/httpd\0 426269129865 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/mps/libplds4.so /usr/apache/bin/httpd\0 426269142191 2005 Sep 21 09:48:32 0 16165 4 0 /usr/share/lib/zoneinfo/US/Central /usr/sbin/dtrace -n \n /*\n * Command line arguments\n */\n inline int OPT_dump \t\0 426269131883 2005 Sep 21 09:48:32 0 16167 3 0 /usr/lib/mps/cpu/sparcv8plus/libnspr_flt4.so /usr/apache/bin/httpd\0 426269138505 2005 Sep 21 09:48:32 0 16167 3 0 /var/run/ldap_cache_door /usr/apache/bin/httpd\0 426269141428 2005 Sep 21 09:48:32 0 16167 5 0 /u01/app/oracle/ocommon/nls/admin/data/lx1boot.nlb /usr/apache/bin/httpd\0 426269141760 2005 Sep 21 09:48:32 0 16167 5 0 /u01/app/oracle/ocommon/nls/admin/data/lx00001.nlb /usr/apache/bin/httpd\0 426269141930 2005 Sep 21 09:48:32 0 16167 5 0 /u01/app/oracle/ocommon/nls/admin/data/lx20369.nlb /usr/apache/bin/httpd\0 426269142360 2005 Sep 21 09:48:32 0 16167 5 0 /u01/app/oracle/ocommon/nls/admin/data/lx10001.nlb /usr/apache/bin/httpd\0 426269142722 2005 Sep 21 09:48:32 0 16167 5 0 /u01/app/oracle/ocommon/nls/admin/data/lx20001.nlb /usr/apache/bin/httpd\0 426269142811 2005 Sep 21 09:48:32 0 16167 -1 2 /u01/app/oracle/oracore/mesg/lrmus.msb /usr/apache/bin/httpd\0 426269142849 2005 Sep 21 09:48:32 0 16167 -1 2 /u01/app/oracle/oracore/mesg/lrmus.msb /usr/apache/bin/httpd\0 426269143055 2005 Sep 21 09:48:32 0 16167 -1 2 /u01/app/oracle/ldap/mesg/ldapus.msb /usr/apache/bin/httpd\0 426269143088 2005 Sep 21 09:48:32 0 16167 -1 2 /u01/app/oracle/ldap/mesg/ldapus.msb /usr/apache/bin/httpd\0 ^C It's not complaining about our 'lx1boot.nlb' file anymore but now it is about files in /u01/app/oracle/ldap and /u01/app/oracle/oracore From looking at the previous directories in /u01/app/oracle we know that oracore and ldap don't exist and probably need to symlink those too: {callacct:/u01/app/oracle} ls OraHome1 jre ocommon oraInventory oui {callacct:/u01/app/oracle} ln -s /u01/app/oracle/OraHome1/oracore oracore {callacct:/u01/app/oracle} ln -s /u01/app/oracle/OraHome1/ldap ldap {callacct:/u01/app/oracle} ls -al total 20 drwxrwxr-x 6 oracle oinstall 512 Sep 21 09:54 . drwx--x--x 3 oracle root 512 Sep 19 09:46 .. -rwxr-xr-x 1 oracle oinstall 12 Sep 19 10:05 .login drwxr-xr-x 28 oracle oinstall 512 Sep 19 10:57 OraHome1 drwxrwxr-x 4 oracle oinstall 512 Sep 19 10:55 jre lrwxrwxrwx 1 root root 29 Sep 21 09:54 ldap -> /u01/app/oracle/OraHome1/ldap lrwxrwxrwx 1 root root 32 Sep 21 09:46 ocommon -> /u01/app/oracle/OraHome1/ocommon drwxrwxr-x 12 oracle oinstall 512 Sep 19 10:58 oraInventory lrwxrwxrwx 1 root root 32 Sep 21 09:54 oracore -> /u01/app/oracle/OraHome1/oracore drwxrwxr-x 6 oracle oinstall 512 Sep 19 10:55 oui Running opensnoop again there's no more errors showing up and vmstat shows cpu idle back down to 100: {callacct:/u01/app/oracle} vmstat 1 kthr memory page disk faults cpu r b w swap free re mf pi po fr de sr m1 m1 m1 m2 in sy cs us sy id 0 0 0 17074528 445528 89 107 14 8 8 0 2 2 2 2 0 446 2551 669 1 1 98 0 0 0 16934600 154056 2 26 0 0 0 0 0 0 0 0 0 227 165 128 0 0 99 0 0 0 16934600 154056 0 0 0 0 0 0 0 0 0 0 0 229 85 117 0 0 100 0 0 0 16934600 154056 0 0 0 0 0 0 0 0 0 0 0 225 58 118 0 0 100 0 0 0 16934600 154056 0 0 0 0 0 0 0 0 0 0 0 231 63 107 0 0 100 Obviously, load averages go back down as well: {callacct:/u01/app/oracle} w 9:57am up 4 day(s), 22:33, 3 users, load average: 0.02, 0.10, 0.12 And no more defunct processes.