IBM Spectrum Scale troubleshooting
For initial data collection, IBM requests that analysis data be gathered using the following procedure.
The steps below gather all the documentation you can provide for
first-time data capture of an unknown problem. Perform these steps
for all your performance/hang/unknown GPFS issues WHILE the
problem is occurring. Commands are executed from one node. Which
nodes the documentation is collected from depends on the working collective created below.
1) Gather waiters and create the working collective. It can be useful to get multiple looks at the waiters and how they change over time, so repeating the first mmlsnode command (with the -L) several times as you proceed through the steps below may help, especially if the issue is purely performance (no hangs).
mmlsnode -N waiters > /tmp/waiters.wcoll
mmdsh -N /tmp/waiters.wcoll "mkdir /tmp/mmfs 2>/dev/null"
mmlsnode -N waiters -L | sort -nk 4,4 > /tmp/mmfs/service.allwaiters.$(date +"%m%d%H%M%S")
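To illustrate what the sort above does: -nk 4,4 orders lines numerically by the fourth whitespace-delimited field, which in typical mmlsnode -L waiter output holds the wait time in seconds, so the longest waiters end up at the bottom of the allwaiters file. The sample lines below are fabricated for illustration only; they are not real waiter output.

```shell
# Numeric sort on field 4 only (-nk 4,4); fabricated sample waiter lines.
printf '%s\n' \
  'node1 0x111 waiting 12.04 seconds' \
  'node2 0x222 waiting 1.20 seconds' \
  'node3 0x333 waiting 120.50 seconds' | sort -nk 4,4
```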
View the allwaiters and waiters.wcoll files to verify that they are not empty.
If either (or both) files are empty, the issue you are seeing is not GPFS waiting on any of its threads. The data to gather in this case will vary. Do not continue with these steps. Report this to the Service person, who will determine the best course of action and which documentation is needed.
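The emptiness check above can be scripted rather than eyeballed: the shell test [ -s FILE ] is true only when FILE exists and is non-empty. A minimal sketch, using throwaway demo files instead of the real collection output:

```shell
# Sketch of the non-empty check; demo files stand in for the real
# waiters.wcoll and allwaiters files produced in step 1.
demo=$(mktemp -d)
printf 'node1\nnode2\n' > "$demo/waiters.wcoll"   # non-empty: safe to continue
: > "$demo/allwaiters"                            # empty: stop and call Service
for f in "$demo/waiters.wcoll" "$demo/allwaiters"; do
  if [ -s "$f" ]; then
    echo "OK: $f is non-empty, continue"
  else
    echo "EMPTY: $f -- stop and contact IBM Service"
  fi
done
rm -r "$demo"
```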
2) Gather internaldump from all nodes in the working collective
For performance/non-hangs:
mmdsh -N /tmp/waiters.wcoll "/usr/lpp/mmfs/bin/mmfsadm saferdump all > /tmp/mmfs/service.\$(hostname -s).safer.dumpall.\$(date +"%m%d%H%M%S")"
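Note the escaped \$( ... ) in the command above: the backslash stops the submitting shell from expanding hostname -s and date locally, so each remote node substitutes its own hostname and timestamp into the dump file name. A small demonstration of the principle, with bash -c standing in for the remote execution that mmdsh performs (an assumption for illustration; mmdsh actually runs the string on each node in the collective):

```shell
# Single quotes keep $(hostname -s) unexpanded here, just as \$ does
# inside the mmdsh double-quoted string; it expands only in the child
# shell, which plays the role of the remote node.
remote='echo "service.$(hostname -s).dumpall"'
bash -c "$remote"
```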
3) If this is a performance problem, capture a 60-second mmfs trace from the nodes in the working collective.
mmtracectl --start --aix-trace-buffer-size=64M --trace-file-size=128M -N /tmp/waiters.wcoll ; sleep 60; mmtracectl --stop -N /tmp/waiters.wcoll
4) Gather a gpfs.snap from the same nodes.
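The source does not spell out the snap command; the sketch below only prints (dry-run) a plausible invocation rather than running it against a live cluster. Both the /usr/lpp/mmfs/bin path and the -N node-list option are assumptions based on the standard GPFS layout and on the -N usage in the earlier steps; confirm against your level's gpfs.snap documentation.

```shell
# Dry run: print the assumed gpfs.snap invocation for the working
# collective built in step 1 instead of executing it.
SNAP=/usr/lpp/mmfs/bin/gpfs.snap   # assumed standard install path
WCOLL=/tmp/waiters.wcoll           # working collective from step 1
echo "$SNAP -N $WCOLL"
```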