Troubleshooting a Networking Issue from the Database Perspective

SQL*Net message from client

A firewall blocking an incoming connection is one of the networking issues most commonly reported to database teams. This problem is easy to diagnose because the behavior is consistent and easily reproducible – session creation hangs every time. Database administrators usually confirm it by trying to open a telnet session to the listener port.
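
For instance, a quick reachability check of the listener might look like this (the host name and the default listener port 1521 are placeholders for the actual environment):

telnet dbserver 1521

or, equivalently, with netcat:

nc -zv dbserver 1521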

Recently, we were troubleshooting another type of networking problem, one that occurred randomly. The application session connected to the database and successfully executed several queries, but at some point a query execution hung. After realizing that the issue only happened after some period of inactivity, we managed to reproduce it with sqlplus:

sqlplus user/password@db

SQL*Plus: Release 19.0.0.0.0 - Production on Fri Sep 17 15:38:00 2021
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.

Last Successful login time: Fri Sep 17 2021 15:34:44 +02:00

Connected to:
Oracle Database 21c Enterprise Edition Release 21.0.0.0.0 - Production
Version 21.3.0.0.0

SQL> -- WAIT SOME TIME BEFORE EXECUTING THE QUERY
SQL> select 1 from dual ;

The query was hanging, but the session was INACTIVE, waiting on SQL*Net message from client. That was the first indication of broken communication.
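
A minimal sketch of how this can be verified from another session (the username filter is a placeholder):

-- check what the suspect session is currently waiting on
select sid, status, event, seconds_in_wait
  from v$session
 where username = 'USER';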

Stacks

The stacks from the sqlplus client and the dedicated server process, respectively, confirmed the communication problem – both processes were waiting to read from the TCP socket:

#0  0x00007f814bfb875d in read () from /lib64/libpthread.so.0
#1  0x00007f815051e070 in snttread () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#2  0x00007f814da0b243 in nttrd () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#3  0x00007f814d8deb17 in nsprecv () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#4  0x00007f814d8e7492 in nsrdr () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#5  0x00007f814ec642c9 in nsfull_pkt_rcv () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#6  0x00007f814ec6b58e in nsfull_brc () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#7  0x00007f8150503116 in nioqrc () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#8  0x00007f815053e376 in ttcdrv () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#9  0x00007f815050d2ad in nioqwa () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#10 0x00007f81504e1467 in upirtrc () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#11 0x00007f81504ec824 in kpurcsc () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#12 0x00007f81504e4d69 in kpuexec () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#13 0x00007f81504e0b79 in OCIStmtExecute () from /u00/oracle/orabase/Ora193000/lib/libclntsh.so.19.1
#14 0x00007f8151442182 in afioci () from /u00/oracle/orabase/Ora193000/lib/libsqlplus.so
#15 0x00007f815143a6a2 in afinbu () from /u00/oracle/orabase/Ora193000/lib/libsqlplus.so
#16 0x00007f815146efca in afixeqqry () from /u00/oracle/orabase/Ora193000/lib/libsqlplus.so
#17 0x00007f815146d3d2 in afixeqsql () from /u00/oracle/orabase/Ora193000/lib/libsqlplus.so
#18 0x00007f815146ce01 in afixeqr () from /u00/oracle/orabase/Ora193000/lib/libsqlplus.so
#19 0x00007f8151401998 in aficfd () from /u00/oracle/orabase/Ora193000/lib/libsqlplus.so
#20 0x00007f8151400b8c in aficdr () from /u00/oracle/orabase/Ora193000/lib/libsqlplus.so
#21 0x00007f815142e4c3 in afidrv () from /u00/oracle/orabase/Ora193000/lib/libsqlplus.so
#22 0x0000000000400d50 in main ()

#0  0x00007f19cb24d8c2 in read () from /lib64/libpthread.so.0
#1  0x0000000015414d85 in read ()
#2  0x000000001540eae0 in snttread ()
#3  0x0000000006f7ffe2 in nttrd ()
#4  0x0000000006e6c8d7 in nsprecv ()
#5  0x0000000006e738f0 in nsrdr ()
#6  0x00000000144173ec in nsfull_pkt_rcv ()
#7  0x000000001441ecce in nsfull_brc ()
#8  0x00000000153f703f in nioqrc ()
#9  0x0000000014eec366 in opikndf2 ()
#10 0x0000000003001988 in opitsk ()
#11 0x00000000030070c0 in opiino ()
#12 0x0000000014eef578 in opiodr ()
#13 0x0000000002ffda4b in opidrv ()
#14 0x0000000003e8b4b5 in sou2o ()
#15 0x0000000000e91280 in opimai_real ()
#16 0x0000000003e988fc in ssthrdmain ()
#17 0x0000000000e910c4 in main ()
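
Such stacks can be collected with standard OS tooling, for example pstack or gdb. A minimal sketch, assuming pstack is available and the process IDs are known (the query for the server PID and the placeholder PIDs are only illustrative):

-- find the OS PID (SPID) of the dedicated server process for a given session
select p.spid
  from v$process p, v$session s
 where p.addr = s.paddr
   and s.sid  = :sid;

# on the client and on the database server, respectively
pstack <client_pid>
pstack <server_spid>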

But we still didn’t know who stopped communicating.

SQL*Net Trace

Therefore, we configured SQL*Net trace on both client and server. The client entries below didn’t match any entries in the server trace:

(4182797376) [17-SEP-2021 15:49:38:923] nspsend: 88 48 8E 24 CE E3 30 F0  |.H.$..0.|
(4182797376) [17-SEP-2021 15:49:38:923] nspsend: B5 D2 D6 DB 4A 63 D6 A5  |....Jc..|
(4182797376) [17-SEP-2021 15:49:38:923] nspsend: 54 91 76 CA D3 E7 87 07  |T.v.....|
(4182797376) [17-SEP-2021 15:49:38:923] nspsend: 1B 6C 06 01              |.l..    |
(4182797376) [17-SEP-2021 15:49:38:923] nspsend: 348 bytes to transport
(4182797376) [17-SEP-2021 15:49:38:923] nspsend: normal exit
(4182797376) [17-SEP-2021 15:49:38:923] nsdofls: exit (0)
(4182797376) [17-SEP-2021 15:49:38:923] snsbitts_ts: entry
(4182797376) [17-SEP-2021 15:49:38:923] snsbitts_ts: acquired the bit
(4182797376) [17-SEP-2021 15:49:38:923] snsbitts_ts: normal exit
(4182797376) [17-SEP-2021 15:49:38:923] nsdo: nsctxrnk=0
(4182797376) [17-SEP-2021 15:49:38:923] snsbitcl_ts: entry
(4182797376) [17-SEP-2021 15:49:38:923] snsbitcl_ts: normal exit
(4182797376) [17-SEP-2021 15:49:38:923] nsdo: normal exit
(4182797376) [17-SEP-2021 15:49:38:923] nsfull_pkt_rcv: entry
(4182797376) [17-SEP-2021 15:49:38:923] nsfull_pkt_rcv: cid=0, opcode=85, *bl=0, *what=0, uflgs=0x0, cflgs=0x0
(4182797376) [17-SEP-2021 15:49:38:923] snsbitts_ts: entry
(4182797376) [17-SEP-2021 15:49:38:923] snsbitts_ts: acquired the bit
(4182797376) [17-SEP-2021 15:49:38:923] snsbitts_ts: normal exit
(4182797376) [17-SEP-2021 15:49:38:923] nsfull_pkt_rcv: rank=64, nsctxrnk=0
(4182797376) [17-SEP-2021 15:49:38:923] snsbitcl_ts: entry
(4182797376) [17-SEP-2021 15:49:38:923] snsbitcl_ts: normal exit
(4182797376) [17-SEP-2021 15:49:38:923] nsfull_pkt_rcv: nsctx: state=8, flg=0x400d, mvd=0
(4182797376) [17-SEP-2021 15:49:38:923] nsfull_pkt_rcv: gtn=943, gtc=943, ptn=10, ptc=8074
(4182797376) [17-SEP-2021 15:49:38:923] snsbitts_ts: entry
(4182797376) [17-SEP-2021 15:49:38:923] snsbitts_ts: acquired the bit
(4182797376) [17-SEP-2021 15:49:38:923] snsbitts_ts: normal exit
(4182797376) [17-SEP-2021 15:49:38:923] snsbitcl_ts: entry
(4182797376) [17-SEP-2021 15:49:38:923] snsbitcl_ts: normal exit
(4182797376) [17-SEP-2021 15:49:38:923] nsfull_pkt_rcv: switching to application buffer
(4182797376) [17-SEP-2021 15:49:38:923] nsrdr: entry
(4182797376) [17-SEP-2021 15:49:38:923] nsrdr: recving a packet
(4182797376) [17-SEP-2021 15:49:38:923] nsprecv: entry
(4182797376) [17-SEP-2021 15:49:38:923] nsprecv: reading from transport...
(4182797376) [17-SEP-2021 15:49:38:923] nttrd: entry

As we can see, the client sent data and then started reading from the TCP socket. At the same time, the server didn’t receive anything. The SQL*Net traces confirmed that the communication was being interrupted in the network layer.
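
For reference, a minimal sketch of the sqlnet.ora settings that enable such tracing on the client and on the server, respectively (the trace directories are placeholders; level 16 corresponds to SUPPORT):

# client sqlnet.ora
TRACE_LEVEL_CLIENT = 16
TRACE_DIRECTORY_CLIENT = /tmp/sqlnet_client
DIAG_ADR_ENABLED = OFF   # so that the TRACE_DIRECTORY_* settings take effect

# server sqlnet.ora
TRACE_LEVEL_SERVER = 16
TRACE_DIRECTORY_SERVER = /tmp/sqlnet_server
DIAG_ADR_ENABLED = OFF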

TCP Dump

So far, we had used Oracle utilities to reproduce the problem. Since we suspected a problem at the TCP level, we tried to reproduce the issue with netcat, which provides simple client/server communication, and to observe the traffic with tcpdump. It’s always good to keep the test case as lean as possible.

First, start the server in verbose mode on port 50000:

nc -vl 50000
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Listening on :::50000
Ncat: Listening on 0.0.0.0:50000

Then, send data from the client to the server:

nc servername 50000
Hallo1
Hallo2    <-- pause here for a while before typing the next line
Hallo3

The netcat server shows that it received the first two messages, but didn’t get the third:

Ncat: Connection from 10.144.221.18.
Ncat: Connection from 10.144.221.18:36578.
Hallo1
Hallo2

You can run tcpdump in concurrent sessions on the client and the server, respectively, to observe the TCP packet flow on both machines:

sudo tcpdump -nnnvv -i any host servername
sudo tcpdump -nnvvv -i any host clientname

ARP table

The packets for the first message came through:

...
15:21:08.625809 IP (tos 0x0, ttl 64, id 48380, offset 0, flags [DF], proto TCP (6), length 47)
    x.x.221.18.46406 > x.x.225.51.50000: Flags [P.], cksum 0xd387 (incorrect -> 0x89c7), seq 1:8, ack 1, win 229, length 7
15:21:08.626177 IP (tos 0x0, ttl 64, id 63851, offset 0, flags [DF], proto TCP (6), length 40)
    x.x.225.51.50000 > x.x.221.18.46406: Flags [.], cksum 0xd6bd (correct), seq 1, ack 8, win 502, length 0

(x.x.221.18 is the client; x.x.225.51 is the database server.)

The packets from the database server stopped arriving at the client after the following ARP request:

15:21:13.930354 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has x.x.221.18 tell x.x.221.119, length 42
15:21:13.930369 ARP, Ethernet (len 6), IPv4 (len 4), Reply x.x.221.18 is-at z:z:z:z:z:z, length 28

A new address appeared in the tcpdump output – x.x.221.119 – even though we had never referenced it explicitly. It is the IP address of the second network interface card (NIC), eth1, on the database server.

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet x.x.221.119  netmask 255.255.255.0  broadcast x.x.221.255
        ether y:y:y:y:y:07  txqueuelen 1000  (Ethernet)

The primary NIC – the one we talked to when opening the connection – is:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet x.x.225.51  netmask 255.255.255.0  broadcast x.x.225.255
        ether y:y:y:y:y:06  txqueuelen 1000  (Ethernet)

The ARP entry for the eth0 IP address (x.x.225.51) had been updated with the MAC address of the eth1 NIC:

arp -a
...
server-alias-eth0 (x.x.225.51) at y:y:y:y:y:07 [ether] on eth0
server-alias-eth1 (x.x.221.119) at y:y:y:y:y:07 [ether] on eth0
...

We can see the retries after the ARP table update:

15:21:21.393795 IP (tos 0x0, ttl 64, id 48381, offset 0, flags [DF], proto TCP (6), length 47)
    x.x.221.18.46406 > x.x.225.51.50000: Flags [P.], cksum 0xd387 (incorrect -> 0x89bf), seq 8:15, ack 1, win 229, length 7
15:21:21.594648 IP (tos 0x0, ttl 64, id 48382, offset 0, flags [DF], proto TCP (6), length 47)
    x.x.221.18.46406 > x.x.225.51.50000: Flags [P.], cksum 0xd387 (incorrect -> 0x89bf), seq 8:15, ack 1, win 229, length 7
15:21:21.795684 IP (tos 0x0, ttl 64, id 48383, offset 0, flags [DF], proto TCP (6), length 47)
    x.x.221.18.46406 > x.x.225.51.50000: Flags [P.], cksum 0xd387 (incorrect -> 0x89bf), seq 8:15, ack 1, win 229, length 7
...

But why did only some clients have the problem? And why could the same clients communicate with other database servers?

Since all three IP addresses involved in this communication belonged to the same subnet, some special routing presumably kicked in and produced the bad ARP table entry. Instead of drilling into the genesis of the wrong ARP entry, we assigned an IP address from another subnet to the NIC eth1, which fixed the issue. A separate subnet for the second interface had been the requirement anyway, one we overlooked when configuring the server.
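
A minimal sketch of the remediation with the ip utility (the new eth1 address is a placeholder, and the client interface name is an assumption); removing the stale ARP entry on the client makes the fix effective immediately:

# on the database server: move eth1 to a different subnet (new address is a placeholder)
ip addr del x.x.221.119/24 dev eth1
ip addr add x.x.230.119/24 dev eth1

# on the client: drop the stale ARP entry for the server's eth0 address
ip neigh del x.x.225.51 dev eth0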

Summary

The following symptoms might indicate that TCP packets are not reaching their destination:

  • The application is executing a query, but the session is waiting on SQL*Net message from client.
  • Both the application and the database server process are reading from the TCP socket.
  • The client-side SQL*Net trace shows that data was sent, but the corresponding entries are missing in the server trace.

You can reproduce the issue with netcat and capture the traffic with tcpdump to confirm that TCP packets are not arriving, and then investigate further.

In our case, a wrong ARP table entry was causing the problem.

Nenad Noveljic
