Recently, I got involved in troubleshooting a communication problem, which was reported as “large messages are consistently failing when sent to an externally managed service”.
Before doing anything else, I wanted evidence that we were dealing with a problem in the TCP/IP stack. So I started a network trace on the Windows box where the messaging client was running:
netsh trace start persistent=yes capture=yes tracefile=c:\temp\nettrace-boot.etl
After reproducing the failure and stopping the trace with netsh trace stop, I opened the trace file in Microsoft Network Monitor and applied the filter tcp.port == dst_port to remove the noise.
It was immediately obvious that the following packet was being retransmitted because the ACK from the server was never received:
TCP: Flags=...AP..., SrcPort=src_port, DstPort=dst_port, PayloadLen=1456, Seq=1079708713 - 1079710169, Ack=2984094674, Win=7244
TCP:[ReTransmit #3417] Flags=...AP..., SrcPort=src_port, DstPort=dst_port, PayloadLen=1456, Seq=1079708713 - 1079710169, Ack=2984094674, Win=7244
TCP:[ReTransmit #3417] Flags=...AP..., SrcPort=src_port, DstPort=dst_port, PayloadLen=1456, Seq=1079708713 - 1079710169, Ack=2984094674, Win=7244
TCP:[ReTransmit #3417] Flags=...AP..., SrcPort=src_port, DstPort=dst_port, PayloadLen=1456, Seq=1079708713 - 1079710169, Ack=2984094674, Win=7244
TCP:[ReTransmit #3417] Flags=...AP..., SrcPort=src_port, DstPort=dst_port, PayloadLen=1456, Seq=1079708713 - 1079710169, Ack=2984094674, Win=7244
TCP:[ReTransmit #3417] Flags=...AP..., SrcPort=src_port, DstPort=dst_port, PayloadLen=1456, Seq=1079708713 - 1079710169, Ack=2984094674, Win=7244
...
TCP: Flags=...A.R.., SrcPort=src_port, DstPort=dst_port, PayloadLen=0, Seq=1079710169, Ack=2984094749, Win=0
In the end, the client sent a reset after 5 unacknowledged retransmits. The smaller packets were transferred correctly; only the big ones were failing.
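The retransmission pattern in the trace above can also be picked out programmatically. The following is a minimal sketch, assuming the Network Monitor summary lines have been exported as plain text; the function names are made up for illustration and are not part of any tool. It treats a segment as retransmitted when the same sequence range shows up more than once.

```python
import re
from collections import Counter

# Matches the sequence range of a data segment in a summary line,
# e.g. "Seq=1079708713 - 1079710169". Lines without a range (such as
# the final reset, which carries no payload) are ignored.
SEQ_RANGE = re.compile(r"Seq=(\d+) - (\d+)")


def count_by_seq_range(summary_lines):
    """Count how many times each (start, end) sequence range was sent."""
    counts = Counter()
    for line in summary_lines:
        match = SEQ_RANGE.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


def retransmitted(summary_lines):
    """Return the ranges sent more than once (the count includes the first send)."""
    return {rng: n for rng, n in count_by_seq_range(summary_lines).items() if n > 1}


# Sample input: a few lines copied from the trace above.
sample = [
    "TCP: Flags=...AP..., SrcPort=src_port, DstPort=dst_port, PayloadLen=1456, Seq=1079708713 - 1079710169, Ack=2984094674, Win=7244",
    "TCP:[ReTransmit #3417] Flags=...AP..., SrcPort=src_port, DstPort=dst_port, PayloadLen=1456, Seq=1079708713 - 1079710169, Ack=2984094674, Win=7244",
    "TCP:[ReTransmit #3417] Flags=...AP..., SrcPort=src_port, DstPort=dst_port, PayloadLen=1456, Seq=1079708713 - 1079710169, Ack=2984094674, Win=7244",
    "TCP: Flags=...A.R.., SrcPort=src_port, DstPort=dst_port, PayloadLen=0, Seq=1079710169, Ack=2984094749, Win=0",
]

if __name__ == "__main__":
    for (start, end), sends in retransmitted(sample).items():
        print(f"bytes {start}-{end} ({end - start} bytes) sent {sends} times")
```

Grouping by sequence range rather than relying on the [ReTransmit] label keeps the sketch independent of how a particular tool annotates its output.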
The packet in the example above carried 1456 bytes of TCP payload. With at least 40 bytes of IP and TCP headers on top, that comes very close to, though still below, the usual maximum transmission unit (MTU) size of 1500 bytes.
The trace also showed that the “Do not fragment” flag was set, which prevents network components from automatically fragmenting packets that exceed the MTU:
DF: (.1..............) Do not fragment
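For readers unfamiliar with the bit display above: in IPv4 the flags share a 16-bit field with the fragment offset, and "Do not fragment" is the second-highest bit of that field. Here is a small sketch of how that bit can be tested and rendered in the same dotted style; the helper names are illustrative only.

```python
# Bit masks for the 16-bit IPv4 "flags + fragment offset" field (RFC 791).
DF_MASK = 0x4000  # Don't Fragment
MF_MASK = 0x2000  # More Fragments


def dont_fragment(flags_and_offset: int) -> bool:
    """True when the Don't Fragment bit is set."""
    return bool(flags_and_offset & DF_MASK)


def as_bit_pattern(flags_and_offset: int, mask: int) -> str:
    """Show only the masked bit of the field and print dots everywhere else."""
    bits = format(flags_and_offset & 0xFFFF, "016b")
    mask_bits = format(mask, "016b")
    return "".join(b if m == "1" else "." for b, m in zip(bits, mask_bits))


if __name__ == "__main__":
    field = 0x4000  # DF set, fragment offset 0
    print(dont_fragment(field), as_bit_pattern(field, DF_MASK))
```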
The network topology was rather complex and included a VPN. A VPN adds its own header to each packet, so packets that leave the client close to 1500 bytes in size can easily exceed the MTU once encapsulated. Since they may not be fragmented, they get rejected by a network component further along the path.
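The arithmetic behind this can be sketched in a few lines. The header sizes assume IPv4 and TCP without options. The VPN overhead of 60 bytes is purely a hypothetical figure for illustration, since the actual overhead depends on the VPN technology and was not measured here.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options

# Hypothetical value, chosen only to illustrate the effect.
ASSUMED_VPN_OVERHEAD = 60


def ip_packet_size(tcp_payload: int) -> int:
    """Size of the IP packet that carries the given TCP payload."""
    return tcp_payload + IPV4_HEADER + TCP_HEADER


def fits(tcp_payload: int, vpn_overhead: int, path_mtu: int = 1500) -> bool:
    """True when the packet, once wrapped by the VPN, still fits the path MTU."""
    return ip_packet_size(tcp_payload) + vpn_overhead <= path_mtu


if __name__ == "__main__":
    payload = 1456  # PayloadLen of the failing packet in the trace
    print("plain packet size:", ip_packet_size(payload))
    print("fits without VPN:", fits(payload, 0))
    print("fits with assumed VPN overhead:", fits(payload, ASSUMED_VPN_OVERHEAD))
```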
At this point, it became apparent that the MTU size on the Windows machine needed to be lowered to compensate for the VPN overhead. Setting the MTU to 1380 indeed resolved the issue.
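To see what the lower setting buys, here is a short sketch that computes the largest TCP payload per packet and the room left for encapsulation. It assumes a 1500-byte MTU on the rest of the path and 40 bytes of IPv4 and TCP headers without options; the function names are illustrative.

```python
IP_AND_TCP_HEADERS = 40  # bytes, assuming IPv4 and TCP without options


def max_tcp_payload(local_mtu: int) -> int:
    """Largest TCP payload that fits in one packet at the given MTU."""
    return local_mtu - IP_AND_TCP_HEADERS


def encapsulation_headroom(local_mtu: int, path_mtu: int = 1500) -> int:
    """Bytes left for VPN headers before the path MTU is exceeded."""
    return path_mtu - local_mtu


if __name__ == "__main__":
    mtu = 1380  # the value that resolved the issue
    print("max TCP payload:", max_tcp_payload(mtu))
    print("headroom for VPN headers:", encapsulation_headroom(mtu))
```

One common way to apply such a value on Windows is netsh interface ipv4 set subinterface "<interface name>" mtu=1380 store=persistent, where the interface name is a placeholder for the actual adapter.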