Ceph Cluster Troubleshooting

This is a stub for troubleshooting and various other tasks on a ceph cluster.

Check cluster health

# Check cluster health
ceph health detail
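
For a broader view than health detail, the standard status commands give an overview of monitors, OSDs, placement groups and capacity:

# overall cluster status
ceph -s

# per-OSD up/down and in/out state
ceph osd tree

# pool and cluster space usage
ceph df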

Fix inconsistent PGs

# Check which PG is inconsistent
ceph health detail

# Repair PG
ceph pg repair ${PG}
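
Optionally, the inconsistent objects inside the affected PG can be inspected before or after the repair (the PG id comes from the health detail output above):

# list inconsistent objects in the PG
rados list-inconsistent-obj ${PG} --format=json-pretty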

Powering on a (failed) Cluster node

First, check that everything is ok (a command sketch follows the list):

* Hardware ok
* Container Host ok:
  * has network
  * has fsck'ed
  * all containers are started
  * storage*.ceph.bfh.ch: has ceph osds mounted
* Container ok:
  * is started
  * has network
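
A minimal sketch of how these checks might look on the shell; the container tooling (LXC assumed here) and mount paths depend on the actual setup:

# network up?
ip addr show

# ceph OSD filesystems mounted? (storage nodes)
df -h | grep /var/lib/ceph/osd

# all containers started? (assuming LXC-style tooling)
lxc-ls --fancy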

Then, start the ceph daemons manually:

# master.ceph.bfh.ch
check if ceph-log and ceph-info are working

# mon${NUMBER}.ceph.bfh.ch
service ceph-mon@mon${NUMBER} start
service ceph-mgr@mon${NUMBER} start

# mds${NUMBER}.ceph.bfh.ch
service ceph-mds@mds${NUMBER} start

# rgw${NUMBER}.ceph.bfh.ch
service ceph-radosgw@rgw.rgw${NUMBER} start

# smb.ceph.bfh.ch
service smbd start

# storage${NUMBER}.ceph.bfh.ch
# - start all osds
systemctl start ceph.target
# - check individual osds
systemctl status ceph-osd@${NUMBER}
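
Once the daemons are running, verify from the master that everything has rejoined the cluster:

# check that mons, mds and osds are all back
ceph -s
ceph osd stat
ceph mds stat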

Stop OSD Rebalance

Before powercycling a storage node, always set noout. After successful restart and peering of the daemons, unset it.
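
The noout flag itself is set and cleared like the other flags:

# keep OSDs from being marked out while the node is down
ceph osd set noout
# allow marking out again once the node is back and peered
ceph osd unset noout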

# stop any automatic rebalancing
ceph osd set norecover
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set nodown
ceph osd set pause
# enable normal rebalancing again
ceph osd unset pause
ceph osd unset nodown
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover

Stop Scrubbing

# stop any automatic scrubbing
ceph osd set noscrub
ceph osd set nodeep-scrub
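# enable automatic scrubbing again
ceph osd unset noscrub
ceph osd unset nodeep-scrub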

MDS Fails

There is a bug in Ceph 11.2 (kraken) which makes the MDSs fail completely under some circumstances, see:

If all MDSs fail or are unresponsive due to the above problem, the effect is that cephfs is down and restarting an MDS does not work.

This is how to recover:

1. stop all ceph mds processes (not the containers, just the ceph mds services; see the command sketch after this list)

2. reboot the host systems of containers that use cephfs heavily, in order to empty the cephfs request queues:
   - moodle.bfh.ch resp. compute{3,4}.linux.bfh.ch
   - *.lfe.bfh.ch resp. compute{1,2}.linux.bfh.ch

3. stop the services that use cephfs heavily, in order to empty the cephfs request queues:
   - nfs daemon on nfs.ceph.bfh.ch
   - smb daemon on smb.ceph.bfh.ch

4. start the first mds manually to try to recover things (this logs extensively to /var/log/ceph):
   # ceph-mds -i mds1 -f --debug-mds=20 --debug-journaler=10

5. if this helps, start mds2 and mds3 instances as usual.

6. if this doesn't help and mds1 doesn't stay up:active, then save the journal on master.ceph.bfh.ch:
   # cephfs-journal-tool journal export backup.bin

   try to recover some entries:
   # cephfs-journal-tool event recover_dentries summary

   and then, if that is not enough, reset the journal:
   # cephfs-journal-tool journal reset

7. after resetting the journal, you should be able to start mds2 and mds3 normally.
   once they're up, you can terminate mds1 (which is started in the foreground as above),
   and start it again as usual.
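
For step 1, stopping the MDS services uses the same service names as in the start commands above; the MDS state can be watched from the master:

# on each mds${NUMBER}.ceph.bfh.ch: stop only the mds service
service ceph-mds@mds${NUMBER} stop

# on master.ceph.bfh.ch: watch the mds map state
ceph mds stat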

Replace a disk

First, let’s remove the failed OSD completely.

# remove the failed OSD
ceph osd out ${NUMBER}

# stop the osd.${NUMBER} daemon on the respective storage host
service ceph-osd@${NUMBER} stop

# unmount the disk
umount /var/lib/ceph/osd/ceph-${NUMBER}

# remove osd from crush map
ceph osd crush remove osd.${NUMBER}

# remove cephx key
ceph auth del osd.${NUMBER}

# mark osd as down
ceph osd down osd.${NUMBER}

# remove the osd
ceph osd rm osd.${NUMBER}

Second, let’s go to the datacenter and replace the disk.

Third, let’s add the new OSD:

# partition new disk (from master)
ceph-deploy osd prepare --zap-disk ${HOST}:/dev/${DEVICE}

# activate OSD
ceph-deploy osd activate ${HOST}:/dev/${DEVICE}

Note: if re-adding a previously used ceph disk fails (ceph-deploy complains that the disk is already mounted and errors out), make sure the partition is unmounted and wipe the first chunks of the disk before zapping:

dd if=/dev/zero of=/dev/${DEVICE} count=8192 bs=1024k
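
Either way, once the new OSD has been activated, check that it shows up in the tree and that backfilling starts:

# the new osd should appear and data should start to backfill
ceph osd tree
ceph -w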

Resetting a SATA device

# HOST="$(readlink /sys/block/${DEVICE} | awk -F/ '{ print $6 }')"
# echo 1 > /sys/block/${DEVICE}/device/delete
# echo "- - -" > /sys/class/scsi_host/${HOST}/scan
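
After the rescan, the device should show up again; kernel messages and the block device list can confirm this:

# dmesg | tail
# lsblk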

Moving to SSD journal

This is just a stub, as we’re doing this only once anyway.

# update system
apt-get update
apt-get upgrade --yes
apt-get dist-upgrade --yes
apt-get clean
apt-get autoremove --purge

# keep stretch kernel as backup
apt-get install --yes linux-image-4.9.0-3-amd64
# shuffle SSDs
service netdata stop
mv /srv/local /root
# comment /dev/md2
vi /etc/fstab
umount /srv
mv /root/local /srv/local
mdadm --stop /dev/md2

# repartition SSDs
cfdisk /dev/sda
cfdisk /dev/sdb

# reboot
update-grub
update-initramfs -t -c -k all
reboot
# remove old kernels
apt-get remove --yes --purge linux-image-4.9.0-1-amd64
apt-get remove --yes --purge linux-image-4.9.0-2-amd64

# reworking /srv
mdadm --create --verbose /dev/md2 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
mkfs.ext4 -Ldata /dev/md2
tune2fs -c0 -i0 -m0 /dev/md2

service netdata stop
mv /srv/local /root
# uncomment /dev/md2
vi /etc/fstab
mount /srv
mv /root/local /srv/local
service netdata start
# Stopping OSD rebalance
ceph osd set noout

# Stopping OSDs
for OSD in $(seq 1 N); do echo "OSD ${OSD}" && service ceph-osd@${OSD} stop && sleep 1 && ceph-osd -i ${OSD} --flush-journal && sleep 1 && umount /var/lib/ceph/osd/ceph-${OSD}; done
# check no OSD is mounted anymore
df -h

# Creating journal devices
mdadm --create --verbose /dev/md3 --level=1 --raid-devices=2 /dev/sda5 /dev/sdb5
mdadm --create --verbose /dev/md4 --level=1 --raid-devices=2 /dev/sda6 /dev/sdb6
mdadm --create --verbose /dev/md5 --level=1 --raid-devices=2 /dev/sda7 /dev/sdb7
mdadm --create --verbose /dev/md6 --level=1 --raid-devices=2 /dev/sda8 /dev/sdb8
mdadm --create --verbose /dev/md7 --level=1 --raid-devices=2 /dev/sda9 /dev/sdb9
mdadm --create --verbose /dev/md8 --level=1 --raid-devices=2 /dev/sda10 /dev/sdb10
mdadm --create --verbose /dev/md9 --level=1 --raid-devices=2 /dev/sda11 /dev/sdb11
mdadm --create --verbose /dev/md10 --level=1 --raid-devices=2 /dev/sda12 /dev/sdb12
mdadm --create --verbose /dev/md11 --level=1 --raid-devices=2 /dev/sda13 /dev/sdb13
mdadm --create --verbose /dev/md12 --level=1 --raid-devices=2 /dev/sda14 /dev/sdb14
mdadm --create --verbose /dev/md13 --level=1 --raid-devices=2 /dev/sda15 /dev/sdb15
mdadm --create --verbose /dev/md14 --level=1 --raid-devices=2 /dev/sda16 /dev/sdb16

for MD in md{3,4,5,6,7,8,9,10,11,12,13,14}; do echo 999999 > /sys/block/${MD}/md/sync_speed_max; done
for MD in md{3,4,5,6,7,8,9,10,11,12,13,14}; do echo 999999 > /sys/block/${MD}/md/sync_speed_min; done

chown ceph:ceph /dev/md{3,4,5,6,7,8,9,10,11,12,13,14}
# add OSD entries with custom journal devices
vi /etc/ceph/ceph.conf
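# a sketch of the per-OSD journal entries in ceph.conf (the osd-to-md
# mapping below is just an example, adjust to the actual layout):
#   [osd.1]
#   osd journal = /dev/md3
#   [osd.2]
#   osd journal = /dev/md4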

# add stubs in /etc/fstab for documentation
vi /etc/fstab
for OSD in $(seq 1 N); do echo "OSD: ${OSD}" && ceph-osd -i ${OSD} --mkjournal; done
for OSD in $(seq 1 N); do echo "OSD: ${OSD}" && service ceph-osd@${OSD} start; done
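
To finish, clear the noout flag that was set at the beginning of this procedure, once all OSDs are up and peered again:

# re-enable marking OSDs out
ceph osd unset noout
ceph -s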

Remove mds damage (DANGEROUS, DON’T DO THAT)

for i in $(ceph tell mds.0 damage ls |sed -e 's|ino|\n|g' | awk -F"id" '{ print $2 }' | sed -e 's|":||g' -e 's|,"||g'); do ceph tell mds.0 damage rm $i; done
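
If jq is available, the same loop is less fragile against the JSON output (assuming each damage entry carries an id field, as the sed/awk parsing above implies):

for i in $(ceph tell mds.0 damage ls | jq -r '.[].id'); do ceph tell mds.0 damage rm $i; done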

Reformat a cephfs

There is no such thing as a mkfs.ceph and the usual way to “format” a cephfs is to remove all involved pools and re-create them.

However, if you don’t want to do that (e.g. because you don’t want the pool IDs to increase, which is a purely cosmetic concern), here’s how to do it:

  • stop all mds daemons
  • remove all objects in the involved pools:
    for OBJECT in $(rados -p foo.cephfs.metadata ls); do rados -p foo.cephfs.metadata rm ${OBJECT}; done
    for OBJECT in $(rados -p foo.cephfs.data ls); do rados -p foo.cephfs.data rm ${OBJECT}; done
  • re-create the fs: ceph fs new foo.cephfs foo.cephfs.metadata foo.cephfs.data
  • start all mds daemons