[Issues] Groningen Millipede cluster end of maintenance

Ewout Helmich helmich at astro.rug.nl
Tue Feb 18 09:36:55 CET 2014


The Groningen millipede cluster is up and running again.

Regards,
Ewout Helmich

-------------------------------------------------------------------------------------------

Dear millipede-user,

Today we did some scheduled maintenance on our millipede-cluster:

set access to /data_old to read-only
changed storage parameters of /data (number of threads 32 -> 128 per storage-node)
powered down all compute-nodes
IB switch check
restarted IB switch (error-counters reset to 0)
IB switch check
rebooted head-node
powered on all compute-nodes
checked compute-nodes (2 nodes with a bad IB connection)
opened up scheduler (enabled queues)
checked IB-logs (error-counters)

Two nodes (node210 and node182) were identified as being broken (lots of
errors on the IB interface). If a single node has a bad connection to the
IB switch, it can cause lots of problems inside the switch, which also
affect other nodes/storage.

All other nodes booted up normally, and after some checks we opened up
the queues.
The system is now fully active/busy. Hopefully the stability of the
nodes/IB-interconnect will now be better.

Please let us know if you see problems/misbehaving jobs/nodes.

-- 
Kind regards, Ger Strikwerda

Computer Operator
Rijksuniversiteit Groningen
Donald Smits Centrum voor Informatie Technologie
Unit Serverinfrastructuur

Zernikeborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276

"God is hard, God is fair, some men he gave brains,
  others he gave hair"

