============================
Ceph Cluster Troubleshooting
============================

This is a stub covering troubleshooting and various other tasks in/on a ceph
cluster.

PG States
=========

* http://docs.ceph.com/docs/master/rados/operations/pg-states/
* http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/

Check cluster health
====================

.. code-block:: bash

   # Check cluster health
   ceph health detail

Fix inconsistent PGs
====================

.. code-block:: bash

   # Check which PG is inconsistent
   ceph health detail

   # Repair PG
   ceph pg repair ${PG}

Powering on a (failed) Cluster node
===================================

First, check that everything is ok:

.. code-block:: none

   * Hardware ok
   * Container Host ok:
     * has network
     * has fsck'ed
     * all containers are started
     * storage*.ceph.bfh.ch: has ceph osds mounted
   * Container ok:
     * is started
     * has network

Then, start the ceph daemons manually:

.. code-block:: bash

   # master.ceph.bfh.ch: check if ceph-log and ceph-info are working

   # mon${NUMBER}.ceph.bfh.ch
   service ceph-mon@mon${NUMBER} start
   service ceph-mgr@mon${NUMBER} start

   # mds${NUMBER}.ceph.bfh.ch
   service ceph-mds@mds${NUMBER} start

   # rgw${NUMBER}.ceph.bfh.ch
   service ceph-radosgw@rgw.rgw${NUMBER} start

   # smb.ceph.bfh.ch
   service smbd start

   # storage${NUMBER}.ceph.bfh.ch
   # - start all osds
   systemctl start ceph.target
   # - check individual osds
   systemctl status ceph-osd@${NUMBER}

Stop OSD Rebalance
==================

Before powercycling a storage node, always set noout. After a successful
restart and peering of the daemons, unset it.

.. code-block:: bash

   # stop any automatic rebalancing
   ceph osd set noout
   ceph osd set norecover
   ceph osd set norebalance
   ceph osd set nobackfill
   ceph osd set nodown
   ceph osd set pause

   # enable normal rebalancing again
   ceph osd unset pause
   ceph osd unset nodown
   ceph osd unset nobackfill
   ceph osd unset norebalance
   ceph osd unset norecover
   ceph osd unset noout

Stop Scrubbing
==============

.. code-block:: bash

   # stop any automatic scrubbing
   ceph osd set noscrub
   ceph osd set nodeep-scrub

MDS Fails
=========

There is a bug in Ceph 11.2 (kraken) which makes the MDSs fail completely
under some circumstances, see:

* Short: https://lists.bfh.ch/pipermail/bfh-linux-announce/2017-May/000040.html
* Long: https://lists.bfh.ch/pipermail/bfh-linux-announce/2017-May/000041.html

If all MDSs fail or are unresponsive due to the above problem, the effect is
that cephfs is down and restarting an MDS does not work. This is how to
recover:

.. code-block:: none

   1. stop all ceph mds processes (not the containers, just the ceph mds
      services)
   2. reboot the host systems of the containers that use cephfs heavily, in
      order to empty the cephfs request queues:
      - moodle.bfh.ch resp. compute{3,4}.linux.bfh.ch
      - *.lfe.bfh.ch resp. compute{1,2}.linux.bfh.ch
   3. stop the services that use cephfs heavily, in order to empty the
      cephfs request queues:
      - nfs daemon on nfs.ceph.bfh.ch
      - smb daemon on smb.ceph.bfh.ch
   4. start the first mds manually to try to recover things (this logs
      extensively to /var/log/ceph):
      # ceph-mds -i mds1 -f --debug-mds=20 --debug-journaler=10
   5. if this helps, start the mds2 and mds3 instances as usual.
   6. if this doesn't help and mds1 doesn't stay up:active, then save the
      journal on master.ceph.bfh.ch:
      # cephfs-journal-tool journal export backup.bin
      try to recover some entries:
      # cephfs-journal-tool event recover_dentries summary
      and then eventually flush it:
      # cephfs-journal-tool journal reset
   7. after journal flushing, you should be able to start mds2 and mds3
      normally. Once they're up, you can terminate mds1 (which was started
      in the foreground as above) and start it again as usual.
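To see whether the recovery worked, verify that mds1 reaches up:active
before starting the other daemons. A quick sketch using the standard ceph
CLI (not part of the original procedure, but generic commands):

.. code-block:: bash

   # short MDS map summary; mds1 should show up as up:active
   ceph mds stat

   # overall cluster status, including mon/osd/mds health
   ceph -s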
Replace a disk
==============

First, let's remove the failed OSD completely.

.. code-block:: bash

   # take the failed OSD out
   ceph osd out ${NUMBER}

   # stop the osd.${NUMBER} daemon on the respective storage host
   service ceph-osd@${NUMBER} stop

   # unmount the disk
   umount /var/lib/ceph/osd/ceph-${NUMBER}

   # remove osd from crush map
   ceph osd crush remove osd.${NUMBER}

   # remove cephx key
   ceph auth del osd.${NUMBER}

   # mark osd as down
   ceph osd down osd.${NUMBER}

   # remove the osd
   ceph osd rm osd.${NUMBER}

Second, let's go to the datacenter and replace the disk.

Third, let's add the new OSD:

.. code-block:: bash

   # partition new disk (from master)
   ceph-deploy osd prepare --zap-disk ${HOST}:/dev/${DEVICE}

   # activate OSD
   ceph-deploy osd activate ${HOST}:/dev/${DEVICE}

Note: if re-adding an existing ceph disk fails (ceph-deploy complains about
an already mounted disk and errors out), then before zapping make sure that
the partition is unmounted, and wipe the first chunks of the disk:

.. code-block:: bash

   dd if=/dev/zero of=/dev/${DEVICE} count=8192 bs=1024k

Resetting a SATA device
=======================

.. code-block:: bash

   # HOST="$(readlink /sys/block/${DEVICE} | awk -F/ '{ print $6 }')"
   # echo 1 > /sys/block/${DEVICE}/device/delete
   # echo "- - -" > /sys/class/scsi_host/${HOST}/scan

Moving to SSD journal
=====================

This is just a stub, as we're doing this only once anyway.

.. code-block:: bash

   # update system
   apt-get update
   apt-get upgrade --yes
   apt-get dist-upgrade --yes
   apt-get clean
   apt-get autoremove --purge

   # keep stretch kernel as backup
   apt-get install --yes linux-image-4.9.0-3-amd64

.. code-block:: bash

   # shuffle SSDs
   service netdata stop
   mv /srv/local /root

   # comment out /dev/md2
   vi /etc/fstab

   umount /srv
   mv /root/local /srv/local
   mdadm --stop /dev/md2

   # repartition SSDs
   cfdisk /dev/sda
   cfdisk /dev/sdb

   # reboot
   update-grub
   update-initramfs -t -c -k all
   reboot

.. code-block:: bash

   # remove old kernels
   apt-get remove --yes --purge linux-image-4.9.0-1-amd64
   apt-get remove --yes --purge linux-image-4.9.0-2-amd64

   # reworking /srv
   mdadm --create --verbose /dev/md2 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
   mkfs.ext4 -Ldata /dev/md2
   tune2fs -c0 -i0 -m0 /dev/md2
   service netdata stop
   mv /srv/local /root

   # uncomment /dev/md2
   vi /etc/fstab

   mount /srv
   mv /root/local /srv/local
   service netdata start
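Before flushing the OSD journals below, it is worth confirming that the
re-created /srv array has finished its initial sync. A quick sketch with
standard mdadm tooling (an added check, not part of the original notes):

.. code-block:: bash

   # resync progress of all arrays; wait until md2 shows no resync activity
   cat /proc/mdstat

   # detailed view of the new array; "State : clean" means the sync is done
   mdadm --detail /dev/md2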
.. code-block:: bash

   # stop OSD rebalance
   ceph osd set noout

   # stop all OSDs, flush their journals and unmount them
   # (N = highest OSD number on this host)
   for OSD in $(seq 1 N); do
       echo "OSD ${OSD}" && \
       service ceph-osd@${OSD} stop && sleep 1 && \
       ceph-osd -i ${OSD} --flush-journal && sleep 1 && \
       umount /var/lib/ceph/osd/ceph-${OSD}
   done

   # check no OSD is mounted anymore
   df -h

   # create the journal devices
   mdadm --create --verbose /dev/md3 --level=1 --raid-devices=2 /dev/sda5 /dev/sdb5
   mdadm --create --verbose /dev/md4 --level=1 --raid-devices=2 /dev/sda6 /dev/sdb6
   mdadm --create --verbose /dev/md5 --level=1 --raid-devices=2 /dev/sda7 /dev/sdb7
   mdadm --create --verbose /dev/md6 --level=1 --raid-devices=2 /dev/sda8 /dev/sdb8
   mdadm --create --verbose /dev/md7 --level=1 --raid-devices=2 /dev/sda9 /dev/sdb9
   mdadm --create --verbose /dev/md8 --level=1 --raid-devices=2 /dev/sda10 /dev/sdb10
   mdadm --create --verbose /dev/md9 --level=1 --raid-devices=2 /dev/sda11 /dev/sdb11
   mdadm --create --verbose /dev/md10 --level=1 --raid-devices=2 /dev/sda12 /dev/sdb12
   mdadm --create --verbose /dev/md11 --level=1 --raid-devices=2 /dev/sda13 /dev/sdb13
   mdadm --create --verbose /dev/md12 --level=1 --raid-devices=2 /dev/sda14 /dev/sdb14
   mdadm --create --verbose /dev/md13 --level=1 --raid-devices=2 /dev/sda15 /dev/sdb15
   mdadm --create --verbose /dev/md14 --level=1 --raid-devices=2 /dev/sda16 /dev/sdb16

   # let the initial sync run at full speed
   for MD in md{3,4,5,6,7,8,9,10,11,12,13,14}; do
       echo 999999 > /sys/block/${MD}/md/sync_speed_max
   done
   for MD in md{3,4,5,6,7,8,9,10,11,12,13,14}; do
       echo 999999 > /sys/block/${MD}/md/sync_speed_min
   done

   chown ceph:ceph /dev/md{3,4,5,6,7,8,9,10,11,12,13,14}

.. code-block:: bash

   # add OSD entries with custom journal devices
   vi /etc/ceph/ceph.conf

   # add stubs in /etc/fstab for documentation
   vi /etc/fstab

.. code-block:: bash

   # create the new journals and start the OSDs again
   for OSD in $(seq 1 N); do echo "OSD: ${OSD}" && ceph-osd -i ${OSD} --mkjournal; done
   for OSD in $(seq 1 N); do echo "OSD: ${OSD}" && service ceph-osd@${OSD} start; done

Remove mds damage (DANGEROUS, DON'T DO THAT)
============================================

.. code-block:: bash

   # extract the ids from "damage ls" and remove each damage entry
   for i in $(ceph tell mds.0 damage ls | sed -e 's|ino|\n|g' | awk -F"id" '{ print $2 }' | sed -e 's|":||g' -e 's|,"||g'); do
       ceph tell mds.0 damage rm $i
   done

Reformat a cephfs
=================

There is no such thing as a mkfs.ceph; the usual way to "format" a cephfs is
to remove all involved pools and re-create them. However, if you don't want
to do that (e.g. because you don't want the pool IDs to increase, for
cosmetic reasons), here's how to do it:

* stop all mds daemons
* remove all objects in the involved pools (for a faster alternative, see
  the sketch after this list):

  .. code-block:: bash

     for OBJECT in $(rados -p foo.cephfs.metadata ls); do rados -p foo.cephfs.metadata rm ${OBJECT}; done
     for OBJECT in $(rados -p foo.cephfs.data ls); do rados -p foo.cephfs.data rm ${OBJECT}; done

* re-create the fs:

  .. code-block:: bash

     ceph fs new foo.cephfs foo.cephfs.metadata foo.cephfs.data

* start all mds daemons
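Deleting objects one by one can be slow on large pools. The rados CLI also
has a purge subcommand that removes every object in a pool in one call; a
sketch, assuming the installed rados version supports it (check
``rados --help`` first):

.. code-block:: bash

   # DANGER: irreversibly deletes every object in the named pools
   rados purge foo.cephfs.metadata --yes-i-really-really-mean-it
   rados purge foo.cephfs.data --yes-i-really-really-mean-it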