The Oracle Clusterware 21.3 installation sets up Oracle Trace File Analyzer (TFA). TFA is controlled through a dedicated systemd service:
systemctl list-units *tfa*
UNIT LOAD ACTIVE SUB DESCRIPTION
oracle-tfa.service loaded active running Oracle Trace File Analyzer
One of the benefits of TFA is OSWatcher, an extremely useful Oracle utility for collecting and archiving performance data:
ps -e -o cmd | grep OSW
/bin/sh ./OSWatcher.sh 30 48 NONE /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive
/bin/sh ./OSWatcherFM.sh 48 /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive
TFA wasn’t initialized properly on one of our servers – OSWatcher wasn’t running:
ps -e -o cmd | grep OSW | grep -v grep
The TFA service, however, was up and running:
systemctl list-units oracle-tfa.service
UNIT LOAD ACTIVE SUB DESCRIPTION
oracle-tfa.service loaded active running Oracle Trace File Analyzer
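This kind of mismatch (service reported as running, but no OSWatcher process) is easy to miss, so it can be worth checking for it explicitly. The following is a minimal sketch, not an Oracle-provided check. The function names count_oswatcher and tfa_health are made up for this illustration, and the sketch assumes that a healthy node always shows an OSWatcher.sh process like the one above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of TFA or AHF.

# Count OSWatcher main processes in a process listing read from stdin.
# The bracket in the pattern keeps grep from matching its own command line,
# and OSWatcherFM.sh is not counted because the pattern requires "OSWatcher.sh".
count_oswatcher() {
    grep -c '[O]SWatcher\.sh' || true
}

# Report a mismatch between the service state and the OSWatcher process count.
# $1: service state as printed by `systemctl is-active`
# $2: number of OSWatcher processes
tfa_health() {
    if [ "$1" = "active" ] && [ "$2" -eq 0 ]; then
        echo "MISMATCH: service is active, but OSWatcher is not running"
        return 1
    fi
    echo "no mismatch: service=$1, OSWatcher processes=$2"
}

# Possible usage on a cluster node:
#   tfa_health "$(systemctl is-active oracle-tfa.service)" \
#              "$(ps -e -o cmd | count_oswatcher)"
```

The non-zero return code on a mismatch makes the check usable from cron or a monitoring agent.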
I traced the creation of new processes during service startup to get an idea of where the workflow might have gotten stuck:
sudo /usr/share/bcc/tools/execsnoop
The init process started:
PCOMM PID PPID RET ARGS
init.tfa 1462210 1 0 /etc/init.d/init.tfa run >/dev/null 2>&1
But all it did was sleep immediately after reading the file /opt/oracle.ahf/tfa/install/TFAMainrun:
PCOMM PID PPID RET ARGS
init.tfa 1462210 1 0 /etc/init.d/init.tfa run >/dev/null 2>&1 1462210 0 /bin/cat /opt/oracle.ahf/tfa/install/TFAMainrun
sleep 1462233 1462210 0 /bin/sleep 10
sleep 1462343 1462210 0 /bin/sleep 30
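On a busy server the execsnoop output contains many unrelated processes. One way to narrow it down is to save the output to a file and keep only the rows spawned by a given parent. The sketch below assumes the default execsnoop column layout shown above (PCOMM PID PPID RET ARGS) and a command name without spaces. The function name children_of and the file name execsnoop.log are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: print only the execsnoop rows whose PPID column
# (third field) equals the PID given as $1. The header row is dropped
# because its third field is the literal string "PPID".
children_of() {
    awk -v ppid="$1" '$3 == ppid'
}

# Possible usage, with the init.tfa PID from the trace above:
#   sudo /usr/share/bcc/tools/execsnoop > execsnoop.log   # stop with Ctrl-C
#   children_of 1462210 < execsnoop.log
```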
The file TFAMainrun isn’t documented. Since the process went to sleep right after reading this file, I assumed that it contained some control information that made the process go to sleep. The file contained the string “stop”:
cat /opt/oracle.ahf/tfa/install/TFAMainrun
stop
The file was last modified just before the server crashed.
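A modification time older than the current boot suggests that the file has not been rewritten since the server came back up. This can be checked roughly as follows. This is a sketch under assumptions: it relies on GNU stat and on the btime line in /proc/stat, and the function names boot_epoch and modified_before_boot are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helpers, assuming Linux with GNU coreutils.

# Print the time of the current boot in epoch seconds (btime in /proc/stat).
boot_epoch() {
    awk '/^btime/ { print $2 }' /proc/stat
}

# Succeed if the file $1 was last modified before the time $2 (epoch seconds).
# Returns 2 if the modification time cannot be read.
modified_before_boot() {
    file_mtime=$(stat -c %Y "$1") || return 2
    [ "$file_mtime" -lt "$2" ]
}

# Possible usage:
#   if modified_before_boot /opt/oracle.ahf/tfa/install/TFAMainrun "$(boot_epoch)"; then
#       echo "TFAMainrun has not been rewritten since the last boot"
#   fi
```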
In contrast, on another server where everything was running properly, the file contained the text “start”:
cat /opt/oracle.ahf/tfa/install/TFAMainrun
start
After writing “start” into TFAMainrun on the faulty server and restarting the TFA service, OSWatcher started properly:
systemctl stop oracle-tfa.service
echo start > /opt/oracle.ahf/tfa/install/TFAMainrun
systemctl start oracle-tfa.service
ps -e -o pid,ppid,cmd | grep OSW
1488483 1 /bin/sh ./OSWatcher.sh 30 48 NONE /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive
1488967 1488483 /bin/sh ./OSWatcherFM.sh 48 /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive
execsnoop captured TFA starting OSWatcher:
PCOMM PID PPID RET ARGS
tfactl.tfa 1487812 1487810 0 /opt/oracle.ahf/tfa/bin/tfactl.tfa -initstart
perl 1487831 1487812 0 /bin/perl /opt/oracle.ahf/tfa/bin/tfactl.pl -initstart
sh 1488463 1487831 0 /bin/sh -c cd /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/oswbb;su grid -c './startOSWbb.sh 30 48 NONE
su 1488464 1488463 0 /usr/bin/su grid -c ./startOSWbb.sh 30 48 NONE /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive
OSWatcher.sh 1488483 1 0 ./OSWatcher.sh 30 48 NONE /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive
OSWatcherFM.sh 1488967 1488483 0 ./OSWatcherFM.sh 48 /u00/oracle/GI/gridbase/oracle.ahf/data/repository/suptools/host/oswbb/grid/archive
In summary, the content of the file /opt/oracle.ahf/tfa/install/TFAMainrun affects TFA initialization. The file is rewritten on every TFA start and stop. If an operation performed by tfactl is interrupted, for example by a server crash, the file might be left with incorrect content that disrupts the next service start. You can recover from the error by writing the string “start” into the TFAMainrun file. Disclaimer: this is an undocumented procedure.
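The recovery can be wrapped into a slightly more defensive form that only touches the file when it actually contains “stop” and keeps a copy of the old content. This is a sketch of my own, not an Oracle-supported procedure. The function name reset_marker and the default backup location /tmp are assumptions made for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the marker file only when it holds "stop".
# $1: path to the marker file
#     (on the servers above: /opt/oracle.ahf/tfa/install/TFAMainrun)
# $2: backup directory (optional, defaults to /tmp)
reset_marker() {
    marker=$1
    backup_dir=${2:-/tmp}
    [ -f "$marker" ] || { echo "no such file: $marker" >&2; return 2; }
    state=$(cat "$marker")
    if [ "$state" != "stop" ]; then
        echo "marker contains '$state', leaving it unchanged"
        return 0
    fi
    # Keep the old content, with a timestamp in the backup file name.
    cp -p "$marker" "$backup_dir/$(basename "$marker").$(date +%Y%m%d%H%M%S).bak"
    echo start > "$marker"
    echo "marker changed from 'stop' to 'start'"
}

# Possible usage as root, mirroring the manual steps above:
#   systemctl stop oracle-tfa.service
#   reset_marker /opt/oracle.ahf/tfa/install/TFAMainrun
#   systemctl start oracle-tfa.service
```

Writing with a redirect into the existing file keeps its owner and permissions, which is why the sketch does not replace the file with a new one.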