Today I ran into a problem with NFS stalling on a copy from a remote machine to IRIX. NFS generally worked, but while copying a file it would always stall at exactly the same place. The stall position differed from file to file, but it was completely repeatable for a given file and far larger than the MTU (e.g. one file would always stall at 44 MB).
The client (receiving side) was a Challenge S running IRIX 6.5.19 on 10 Mbit half-duplex Ethernet via an AUI transceiver, and the server (sending side) was FreeBSD 13.0 on 1 Gbit full-duplex Ethernet.
The three captures below show what this looked like on the network, from the sender's perspective. The first shows normal NFS traffic with no retries, up until the point of the stall. In the second, the receiver issues the next read and gets a similar response from the sender. In the third, however, the receiver seemingly ignores the data that was sent and keeps re-issuing the call for the configured maximum number of retries, before timing out and giving up.
The number of retries can be increased arbitrarily, but the read never succeeds. This is with no other network traffic, and it is always repeatable at the same position in the file being read. Other NFS traffic (e.g. a directory listing) would succeed while this read was in timeout / retransmission, indicating that the problem was not general NFS connectivity.
Code:
36996 43.296638 172.18.0.5 172.18.0.3 NFS 178 V3 READ Call (Reply In 37008), FH: 0xc071e444 Offset: 46612480 Len: 16384
36997 43.296701 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=0, ID=133e) [Reassembled in #37008]
36998 43.296704 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=1480, ID=133e) [Reassembled in #37008]
36999 43.296706 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=2960, ID=133e) [Reassembled in #37008]
37000 43.296708 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=4440, ID=133e) [Reassembled in #37008]
37001 43.296711 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=5920, ID=133e) [Reassembled in #37008]
37002 43.296713 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=7400, ID=133e) [Reassembled in #37008]
37003 43.296716 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=8880, ID=133e) [Reassembled in #37008]
37004 43.296718 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=10360, ID=133e) [Reassembled in #37008]
37005 43.296720 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=11840, ID=133e) [Reassembled in #37008]
37006 43.296723 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=13320, ID=133e) [Reassembled in #37008]
37007 43.296726 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=14800, ID=133e) [Reassembled in #37008]
37008 43.296728 172.18.0.3 172.18.0.5 NFS 274 V3 READ Reply (Call In 36996) Len: 16384
Code:
37009 43.313577 172.18.0.5 172.18.0.3 NFS 178 V3 READ Call (Reply In 37021), FH: 0xc071e444 Offset: 46628864 Len: 16384
37010 43.313640 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=0, ID=133f) [Reassembled in #37021]
37011 43.313643 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=1480, ID=133f) [Reassembled in #37021]
37012 43.313646 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=2960, ID=133f) [Reassembled in #37021]
37013 43.313648 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=4440, ID=133f) [Reassembled in #37021]
37014 43.313650 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=5920, ID=133f) [Reassembled in #37021]
37015 43.313653 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=7400, ID=133f) [Reassembled in #37021]
37016 43.313656 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=8880, ID=133f) [Reassembled in #37021]
37017 43.313658 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=10360, ID=133f) [Reassembled in #37021]
37018 43.313660 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=11840, ID=133f) [Reassembled in #37021]
37019 43.313663 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=13320, ID=133f) [Reassembled in #37021]
37020 43.313666 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=14800, ID=133f) [Reassembled in #37021]
37021 43.313668 172.18.0.3 172.18.0.5 NFS 274 V3 READ Reply (Call In 37009) Len: 16384
Code:
37035 44.411279 172.18.0.5 172.18.0.3 NFS 178 [RPC retransmission of #37009]V3 READ Call (Reply In 37021), FH: 0xc071e444 Offset: 46628864 Len: 16384
37036 44.411365 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=0, ID=1341) [Reassembled in #37047]
37037 44.411368 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=1480, ID=1341) [Reassembled in #37047]
37038 44.411371 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=2960, ID=1341) [Reassembled in #37047]
37039 44.411373 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=4440, ID=1341) [Reassembled in #37047]
37040 44.411375 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=5920, ID=1341) [Reassembled in #37047]
37041 44.411378 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=7400, ID=1341) [Reassembled in #37047]
37042 44.411381 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=8880, ID=1341) [Reassembled in #37047]
37043 44.411383 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=10360, ID=1341) [Reassembled in #37047]
37044 44.411386 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=11840, ID=1341) [Reassembled in #37047]
37045 44.411389 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=13320, ID=1341) [Reassembled in #37047]
37046 44.411392 172.18.0.3 172.18.0.5 IPv4 1514 Fragmented IP protocol (proto=UDP 17, off=14800, ID=1341) [Reassembled in #37047]
37047 44.411394 172.18.0.3 172.18.0.5 NFS 274 [RPC duplicate of #37021]V3 READ Reply (Call In 37009) Len: 16384
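As a sanity check on the captures above, the fragment offsets and frame sizes can be reproduced with a little arithmetic. This is only a sketch: the total IP payload of 16520 bytes is inferred from the capture (last fragment at offset 14800 + 1480 = 16280, carrying 274 - 14 - 20 = 240 bytes), and the function name `fragment_layout` is made up for illustration.

```python
ETH_HEADER = 14   # Ethernet II header, bytes
IP_HEADER = 20    # IPv4 header without options, bytes
MTU = 1500        # standard Ethernet MTU


def fragment_layout(ip_payload_len, mtu=MTU):
    """Return (offset, frame_len) pairs for one fragmented IPv4 datagram.

    ip_payload_len counts the UDP header and everything after it.
    """
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    layout = []
    offset = 0
    while offset < ip_payload_len:
        chunk = min(per_fragment, ip_payload_len - offset)
        layout.append((offset, ETH_HEADER + IP_HEADER + chunk))
        offset += chunk
    return layout


# Assumption: total IP payload of one READ reply, inferred from the capture.
READ_REPLY_PAYLOAD = 16280 + 240

layout = fragment_layout(READ_REPLY_PAYLOAD)
for offset, frame_len in layout:
    print(f"off={offset:<6} frame={frame_len}")
print(len(layout), "frames per 16384-byte READ reply")
```

This gives eleven 1514-byte frames followed by one 274-byte frame, which matches the twelve frames per reply in the captures. Every 16384-byte read therefore arrives as a burst of twelve back-to-back frames, and the receiver has to see all twelve to reassemble the reply.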
It will sometimes get one or two reads past the stall point, but that looks more like a few reads already being in flight than any actual progress.
I am not entirely sure what causes this: an IRIX bug, or an artifact of the fast sender (GigE) feeding a slow receiver (10M) and potentially overflowing switch buffers? However, reducing rsize in the IRIX (client) mount options, to reduce the number of outstanding fragments, made things work. rsize=4096 was the highest value that worked for me when doubling up from rsize=1024 (the first power of two below the MTU). I am curious if anyone has encountered similar issues?
I did not reduce wsize on the IRIX client, because presumably the fast link and modern FreeBSD NFS server (which would be the recipient of any writes) have no such issues.
Notably, I did not have these issues on other IRIX machines connected by faster links, but I don't know whether I was using the same IRIX release (I may have been using 6.5.26).
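To put rough numbers on the "fewer outstanding fragments" idea, here is a small sketch of how large each READ reply burst is for different rsize values, and how long it would take to drain onto a 10 Mbit link. It assumes the 136 bytes of UDP + RPC/NFS reply overhead inferred from the capture (16520 - 16384) stays the same for every rsize, and it ignores Ethernet preamble, FCS and inter-frame gap. The names `burst` and `drain_ms` are just for illustration, and none of this proves that buffer overflow is the actual cause.

```python
import math

ETH_HEADER = 14        # Ethernet II header, bytes
IP_HEADER = 20         # IPv4 header without options, bytes
MTU = 1500             # standard Ethernet MTU
REPLY_OVERHEAD = 136   # assumption: UDP + RPC/NFS reply headers, from the capture


def burst(rsize, mtu=MTU, overhead=REPLY_OVERHEAD):
    """Return (frames, bytes_on_wire) for one READ reply of rsize bytes."""
    ip_payload = rsize + overhead
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    frames = math.ceil(ip_payload / per_fragment)
    wire_bytes = ip_payload + frames * (ETH_HEADER + IP_HEADER)
    return frames, wire_bytes


def drain_ms(wire_bytes, link_bits_per_s):
    """Time in milliseconds to serialize wire_bytes onto a link."""
    return wire_bytes * 8 / link_bits_per_s * 1000


for rsize in (1024, 2048, 4096, 8192, 16384):
    frames, wire_bytes = burst(rsize)
    print(f"rsize={rsize:<6} frames={frames:<3} "
          f"10M: {drain_ms(wire_bytes, 10e6):6.2f} ms  "
          f"1G: {drain_ms(wire_bytes, 1e9):6.3f} ms")
```

Under these assumptions, rsize=16384 means 12 frames (about 13.5 ms of 10 Mbit wire time) handed to the switch in a fraction of a millisecond, while rsize=4096 means only 3 frames (about 3.5 ms). That is at least consistent with the smaller rsize being easier on whatever sits between the two link speeds.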