[Issues] Groningen Millipede cluster end of maintenance
Ewout Helmich
helmich at astro.rug.nl
Tue Feb 18 09:36:55 CET 2014
The Groningen millipede cluster is up and running again.
Regards,
Ewout Helmich
-------------------------------------------------------------------------------------------
Dear millipede-user,
Today we did some scheduled maintenance on our millipede-cluster:
- set access to /data_old to read-only
- changed storage-parameters of /data (number of threads 32 -> 128 per storage-node)
- powered down all compute-nodes
- IB switch check
- restarted IB switch (error-counters reset to 0)
- IB switch check
- rebooted head-node
- powered on all compute-nodes
- checked compute-nodes (2 nodes with a bad IB connection)
- opened up scheduler (enabled queues)
- checked IB-logs (error-counters)
Two nodes (node210 and node182) were identified as broken (lots of
errors on the IB interface). If a single node has a bad connection to the
IB switch, it can cause lots of problems inside the switch, which also
affects other nodes/storage.
All other nodes booted up normally and after some checks we opened up
the queues.
The system is now fully active/busy. Hopefully the stability of the
nodes/IB-interconnect will now be better.
Please let us know if you see problems/misbehaving jobs/nodes.
--
Kind regards, Ger Strikwerda
Computer operator
Rijksuniversiteit Groningen
Donald Smits Centrum voor Informatie Technologie
Unit Serverinfrastructuur
Zernikeborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276
"God is hard, God is fair, some men he gave brains,
others he gave hair"