Troubleshooting TFA Start

The Oracle Clusterware 21.3 installation sets up Oracle Trace File Analyzer (TFA). TFA is controlled through a dedicated systemd service:

systemctl list-units *tfa*
UNIT               LOAD   ACTIVE SUB     DESCRIPTION
oracle-tfa.service loaded active running Oracle Trace File Analyzer

One of the benefits of TFA is OSWatcher, an extremely useful Oracle utility for collecting and archiving performance data:

ps -e -o cmd | grep OSW
/bin/sh ./OSWatcher.sh 30 48 NONE /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive
/bin/sh ./OSWatcherFM.sh 48 /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive

TFA wasn’t initialized properly on one of our servers, where OSWatcher wasn’t running:

ps -e -o cmd | grep OSW | grep -v grep

The TFA service, however, was reported as running:

systemctl list-units oracle-tfa.service
UNIT               LOAD   ACTIVE SUB     DESCRIPTION
oracle-tfa.service loaded active running Oracle Trace File Analyzer
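
The mismatch above (service active, OSWatcher absent) can be checked for in one step. The following is only a minimal sketch: the function name check_tfa_osw is made up for illustration, and it assumes that `systemctl is-active` and `pgrep -f` are available on the host and that the OSWatcher command line contains the string "OSWatcher", as in the ps output shown earlier.

```shell
#!/bin/sh
# Hypothetical helper (not part of TFA): report whether the TFA service
# and OSWatcher agree with each other.
check_tfa_osw() {
    # is-active --quiet returns 0 only if the unit is in the "active" state
    if ! systemctl is-active --quiet oracle-tfa.service; then
        echo "oracle-tfa.service is not active"
        return 2
    fi
    # pgrep -f matches against the full command line and excludes itself,
    # so no "grep -v grep" is needed
    if pgrep -f 'OSWatcher' >/dev/null 2>&1; then
        echo "OK: oracle-tfa.service is active and OSWatcher is running"
        return 0
    fi
    echo "WARNING: oracle-tfa.service is active but OSWatcher is not running"
    return 1
}
```

Calling check_tfa_osw on the faulty server would be expected to print the WARNING line and return 1, which makes it easy to plug into a monitoring check.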

I traced the creation of new processes during service startup to get an idea of where the workflow might have gotten stuck:

sudo /usr/share/bcc/tools/execsnoop

The init process started:

PCOMM            PID    PPID   RET ARGS
init.tfa         1462210 1        0 /etc/init.d/init.tfa run >/dev/null 2>&1 

But all it did was go to sleep right after reading the file /opt/oracle.ahf/tfa/install/TFAMainrun:

PCOMM            PID    PPID   RET ARGS
init.tfa         1462210 1        0 /etc/init.d/init.tfa run >/dev/null 2>&1
cat              [...]   1462210   0 /bin/cat /opt/oracle.ahf/tfa/install/TFAMainrun
sleep            1462233 1462210   0 /bin/sleep 10
sleep            1462343 1462210   0 /bin/sleep 30

The file TFAMainrun isn’t documented. Since the process went to sleep right after reading it, I assumed that the file contained some control information that put the process to sleep. The file contained the string “stop”:

cat /opt/oracle.ahf/tfa/install/TFAMainrun
stop

The file had last been modified just before the server crashed.
In contrast, on another server, where everything was running properly, the file contained the string “start”:

cat /opt/oracle.ahf/tfa/install/TFAMainrun
start
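
To compare servers, it helps to print the state value together with the file's modification time. This is just a sketch under a few assumptions: the function name show_tfa_state is hypothetical, and `stat -c '%y'` is the GNU coreutils syntax, so it applies to Linux hosts only.

```shell
#!/bin/sh
# Hypothetical helper (not part of TFA): print the content and the
# last-modification time of the TFAMainrun file.
show_tfa_state() {
    # Default to the path used in this post; another path can be passed in
    file=${1:-/opt/oracle.ahf/tfa/install/TFAMainrun}
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    printf 'state: %s\n' "$(cat "$file")"
    # GNU stat: %y prints the time of the last data modification
    printf 'mtime: %s\n' "$(stat -c '%y' "$file")"
}
```

The mtime can then be compared with the time of the crash, for instance against the reboot records listed by `last -x reboot`.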

After writing “start” into TFAMainrun on the faulty server and restarting the TFA service, OSWatcher started properly:

systemctl stop oracle-tfa.service
echo start > /opt/oracle.ahf/tfa/install/TFAMainrun
systemctl start oracle-tfa.service
ps -e -o pid,ppid,cmd | grep OSW
1488483       1 /bin/sh ./OSWatcher.sh 30 48 NONE /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive
1488967 1488483 /bin/sh ./OSWatcherFM.sh 48 /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive

execsnoop captured TFA starting OSWatcher:

PCOMM            PID    PPID   RET ARGS
tfactl.tfa       1487812 1487810   0 /opt/oracle.ahf/tfa/bin/tfactl.tfa -initstart

perl             1487831 1487812   0 /bin/perl /opt/oracle.ahf/tfa/bin/tfactl.pl -initstart

sh               1488463 1487831   0 /bin/sh -c cd /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/oswbb;su grid -c './startOSWbb.sh 30 48 NONE
su               1488464 1488463   0 /usr/bin/su grid -c ./startOSWbb.sh 30 48 NONE /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive

OSWatcher.sh     1488483 1        0 ./OSWatcher.sh 30 48 NONE /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive
OSWatcherFM.sh   1488967 1488483   0 ./OSWatcherFM.sh 48 /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive

In summary, the content of the file /opt/oracle.ahf/tfa/install/TFAMainrun affects TFA initialization. The file is rewritten on every TFA start and stop. If a tfactl operation is interrupted, for example by a server crash, the file might be left with the wrong value, which can disrupt the next service start. You can recover from the error by writing the string “start” into the TFAMainrun file. Disclaimer: this is an undocumented procedure.
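
The recovery can be wrapped in a small guard so that the file is only touched when it really holds something other than “start”. This is a sketch, not an Oracle-supported tool: the function name fix_tfa_state and the .bak backup copy are my own additions, and since the file format is undocumented, the only assumption is the one observed above, namely that a healthy server has the single word start in it.

```shell
#!/bin/sh
# Hypothetical helper (not part of TFA): set the TFAMainrun file to "start"
# unless it already contains that value.
fix_tfa_state() {
    # Default to the path used in this post; another path can be passed in
    file=${1:-/opt/oracle.ahf/tfa/install/TFAMainrun}
    [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
    current=$(cat "$file")
    if [ "$current" = "start" ]; then
        echo "unchanged"
        return 0
    fi
    # Keep the previous content for later analysis (illustrative choice)
    cp -p "$file" "$file.bak" || return 1
    echo start > "$file" && echo "rewritten (was: ${current:-empty})"
}

# Intended use, as root, mirroring the manual steps in this post:
#   systemctl stop oracle-tfa.service
#   fix_tfa_state
#   systemctl start oracle-tfa.service
```

Stopping the service first matters because, as observed, TFA rewrites the file on every start and stop.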

Nenad Noveljic
