============================
Ceph Cluster Troubleshooting
============================

This is a stub covering troubleshooting and various other tasks in/on a ceph
cluster.

PG States
=========

* http://docs.ceph.com/docs/master/rados/operations/pg-states/
* http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/

Check cluster health
====================

.. code-block:: bash

   # Check cluster health
   ceph health detail

Fix inconsistent PGs
====================

.. code-block:: bash

   # Check which PG is inconsistent
   ceph health detail

   # Repair PG
   ceph pg repair ${PG}

Powering on a (failed) Cluster node
===================================

First, check that everything is ok:

.. code-block:: none

   * Hardware ok
   * Container Host ok:
     * has network
     * has fsck'ed
     * all containers are started
     * storage*.ceph.bfh.ch: has ceph osds mounted
   * Container ok:
     * is started
     * has network

Then, start the ceph daemons manually:

.. code-block:: bash

   # master.ceph.bfh.ch: check if ceph-log and ceph-info are working

   # mon${NUMBER}.ceph.bfh.ch
   service ceph-mon@mon${NUMBER} start
   service ceph-mgr@mon${NUMBER} start

   # mds${NUMBER}.ceph.bfh.ch
   service ceph-mds@mds${NUMBER} start

   # rgw${NUMBER}.ceph.bfh.ch
   service ceph-radosgw@rgw.rgw${NUMBER} start

   # smb.ceph.bfh.ch
   service smbd start

   # storage${NUMBER}.ceph.bfh.ch
   # - start all osds
   systemctl start ceph.target
   # - check individual osds
   systemctl status ceph-osd@${NUMBER}

Stop OSD Rebalance
==================

Before powercycling a storage node, always set noout. After a successful
restart and peering of the daemons, unset it.

.. code-block:: bash

   # stop any automatic rebalancing
   ceph osd set noout
   ceph osd set norecover
   ceph osd set norebalance
   ceph osd set nobackfill
   ceph osd set nodown
   ceph osd set pause

   # enable normal rebalancing again
   ceph osd unset pause
   ceph osd unset nodown
   ceph osd unset nobackfill
   ceph osd unset norebalance
   ceph osd unset norecover
   ceph osd unset noout

Stop Scrubbing
==============

.. code-block:: bash

   # stop any automatic scrubbing
   ceph osd set noscrub
   ceph osd set nodeep-scrub

MDS Fails
=========

There is a bug in Ceph 11.2 (kraken) which makes the MDSs fail completely
under some circumstances, see:

* Short: https://lists.bfh.ch/pipermail/bfh-linux-announce/2017-May/000040.html
* Long: https://lists.bfh.ch/pipermail/bfh-linux-announce/2017-May/000041.html

If all MDSs fail or are unresponsive due to the above problem, the effect is
that cephfs is down and restarting an MDS does not work. This is how to
recover:

.. code-block:: none

   1. stop all ceph mds processes (not the containers, just the ceph mds
      services)
   2. reboot the host systems of the containers that use cephfs heavily, in
      order to empty the cephfs request queues:
      - moodle.bfh.ch resp. compute{3,4}.linux.bfh.ch
      - *.lfe.bfh.ch resp. compute{1,2}.linux.bfh.ch
   3. stop the services that use cephfs heavily, in order to empty the
      cephfs request queues:
      - nfs daemon on nfs.ceph.bfh.ch
      - smb daemon on smb.ceph.bfh.ch
   4. start the first mds manually to try to recover things (this logs
      extensively to /var/log/ceph):
      # ceph-mds -i mds1 -f --debug-mds=20 --debug-journaler=10
   5. if this helps, start the mds2 and mds3 instances as usual.
   6. if this doesn't help and mds1 doesn't stay up:active, then save the
      journal on master.ceph.bfh.ch:
      # cephfs-journal-tool journal export backup.bin
      try to recover some entries:
      # cephfs-journal-tool event recover_dentries summary
      and then eventually flush it:
      # cephfs-journal-tool journal reset
   7. after journal flushing, you should be able to start mds2 and mds3
      normally. Once they're up, you can terminate mds1 (which was started
      in the foreground as above) and start it again as usual.
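To see whether the recovery worked, verify that mds1 reaches up:active
before starting the other daemons. A quick sketch using the standard ceph
CLI (not part of the original procedure, but generic commands):

.. code-block:: bash

   # short MDS map summary; mds1 should show up as up:active
   ceph mds stat

   # overall cluster status, including mon/osd/mds health
   ceph -s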
Replace a disk
==============

First, let's remove the failed OSD completely.

.. code-block:: bash

   # take the failed OSD out
   ceph osd out ${NUMBER}

   # stop the osd.${NUMBER} daemon on the respective storage host
   service ceph-osd@${NUMBER} stop

   # unmount the disk
   umount /var/lib/ceph/osd/ceph-${NUMBER}

   # remove osd from crush map
   ceph osd crush remove osd.${NUMBER}

   # remove cephx key
   ceph auth del osd.${NUMBER}

   # mark osd as down
   ceph osd down osd.${NUMBER}

   # remove the osd
   ceph osd rm osd.${NUMBER}

Second, let's go to the datacenter and replace the disk.

Third, let's add the new OSD:

.. code-block:: bash

   # partition new disk (from master)
   ceph-deploy osd prepare --zap-disk ${HOST}:/dev/${DEVICE}

   # activate OSD
   ceph-deploy osd activate ${HOST}:/dev/${DEVICE}

Note: if re-adding an existing ceph disk fails (ceph-deploy complains about
an already mounted disk and errors out), then before zapping make sure that
the partition is unmounted, and wipe the first chunks of the disk:

.. code-block:: bash

   dd if=/dev/zero of=/dev/${DEVICE} count=8192 bs=1024k

Resetting a SATA device
=======================

.. code-block:: bash

   # HOST="$(readlink /sys/block/${DEVICE} | awk -F/ '{ print $6 }')"
   # echo 1 > /sys/block/${DEVICE}/device/delete
   # echo "- - -" > /sys/class/scsi_host/${HOST}/scan

Moving to SSD journal
=====================

This is just a stub, as we're doing this only once anyway.

.. code-block:: bash

   # update system
   apt-get update
   apt-get upgrade --yes
   apt-get dist-upgrade --yes
   apt-get clean
   apt-get autoremove --purge

   # keep stretch kernel as backup
   apt-get install --yes linux-image-4.9.0-3-amd64

.. code-block:: bash

   # shuffle SSDs
   service netdata stop
   mv /srv/local /root

   # comment out /dev/md2
   vi /etc/fstab

   umount /srv
   mv /root/local /srv/local
   mdadm --stop /dev/md2

   # repartition SSDs
   cfdisk /dev/sda
   cfdisk /dev/sdb

   # reboot
   update-grub
   update-initramfs -t -c -k all
   reboot

.. code-block:: bash

   # remove old kernels
   apt-get remove --yes --purge linux-image-4.9.0-1-amd64
   apt-get remove --yes --purge linux-image-4.9.0-2-amd64

   # reworking /srv
   mdadm --create --verbose /dev/md2 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
   mkfs.ext4 -Ldata /dev/md2
   tune2fs -c0 -i0 -m0 /dev/md2
   service netdata stop
   mv /srv/local /root

   # uncomment /dev/md2
   vi /etc/fstab

   mount /srv
   mv /root/local /srv/local
   service netdata start
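Before flushing the OSD journals below, it is worth confirming that the
re-created /srv array has finished its initial sync. A quick sketch with
standard mdadm tooling (an added check, not part of the original notes):

.. code-block:: bash

   # resync progress of all arrays; wait until md2 shows no resync activity
   cat /proc/mdstat

   # detailed view of the new array; "State : clean" means the sync is done
   mdadm --detail /dev/md2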
.. code-block:: bash

   # stop OSD rebalance
   ceph osd set noout

   # stop all OSDs, flush their journals and unmount them
   # (N = highest OSD number on this host)
   for OSD in $(seq 1 N); do
       echo "OSD ${OSD}" && \
       service ceph-osd@${OSD} stop && sleep 1 && \
       ceph-osd -i ${OSD} --flush-journal && sleep 1 && \
       umount /var/lib/ceph/osd/ceph-${OSD}
   done

   # check no OSD is mounted anymore
   df -h

   # create the journal devices
   mdadm --create --verbose /dev/md3 --level=1 --raid-devices=2 /dev/sda5 /dev/sdb5
   mdadm --create --verbose /dev/md4 --level=1 --raid-devices=2 /dev/sda6 /dev/sdb6
   mdadm --create --verbose /dev/md5 --level=1 --raid-devices=2 /dev/sda7 /dev/sdb7
   mdadm --create --verbose /dev/md6 --level=1 --raid-devices=2 /dev/sda8 /dev/sdb8
   mdadm --create --verbose /dev/md7 --level=1 --raid-devices=2 /dev/sda9 /dev/sdb9
   mdadm --create --verbose /dev/md8 --level=1 --raid-devices=2 /dev/sda10 /dev/sdb10
   mdadm --create --verbose /dev/md9 --level=1 --raid-devices=2 /dev/sda11 /dev/sdb11
   mdadm --create --verbose /dev/md10 --level=1 --raid-devices=2 /dev/sda12 /dev/sdb12
   mdadm --create --verbose /dev/md11 --level=1 --raid-devices=2 /dev/sda13 /dev/sdb13
   mdadm --create --verbose /dev/md12 --level=1 --raid-devices=2 /dev/sda14 /dev/sdb14
   mdadm --create --verbose /dev/md13 --level=1 --raid-devices=2 /dev/sda15 /dev/sdb15
   mdadm --create --verbose /dev/md14 --level=1 --raid-devices=2 /dev/sda16 /dev/sdb16

   # let the initial sync run at full speed
   for MD in md{3,4,5,6,7,8,9,10,11,12,13,14}; do
       echo 999999 > /sys/block/${MD}/md/sync_speed_max
   done
   for MD in md{3,4,5,6,7,8,9,10,11,12,13,14}; do
       echo 999999 > /sys/block/${MD}/md/sync_speed_min
   done

   chown ceph:ceph /dev/md{3,4,5,6,7,8,9,10,11,12,13,14}

.. code-block:: bash

   # add OSD entries with custom journal devices
   vi /etc/ceph/ceph.conf

   # add stubs in /etc/fstab for documentation
   vi /etc/fstab

.. code-block:: bash

   # create the new journals and start the OSDs again
   for OSD in $(seq 1 N); do echo "OSD: ${OSD}" && ceph-osd -i ${OSD} --mkjournal; done
   for OSD in $(seq 1 N); do echo "OSD: ${OSD}" && service ceph-osd@${OSD} start; done

Remove mds damage (DANGEROUS, DON'T DO THAT)
============================================

.. code-block:: bash

   # extract the ids from "damage ls" and remove each damage entry
   for i in $(ceph tell mds.0 damage ls | sed -e 's|ino|\n|g' | awk -F"id" '{ print $2 }' | sed -e 's|":||g' -e 's|,"||g'); do
       ceph tell mds.0 damage rm $i
   done

Reformat a cephfs
=================

There is no such thing as a mkfs.ceph; the usual way to "format" a cephfs is
to remove all involved pools and re-create them. However, if you don't want
to do that (e.g. because you don't want the pool IDs to increase, for
cosmetic reasons), here's how to do it:

* stop all mds daemons
* remove all objects in the involved pools (for a faster alternative, see
  the sketch after this list):

  .. code-block:: bash

     for OBJECT in $(rados -p foo.cephfs.metadata ls); do rados -p foo.cephfs.metadata rm ${OBJECT}; done
     for OBJECT in $(rados -p foo.cephfs.data ls); do rados -p foo.cephfs.data rm ${OBJECT}; done

* re-create the fs:

  .. code-block:: bash

     ceph fs new foo.cephfs foo.cephfs.metadata foo.cephfs.data

* start all mds daemons
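Deleting objects one by one can be slow on large pools. The rados CLI also
has a purge subcommand that removes every object in a pool in one call; a
sketch, assuming the installed rados version supports it (check
``rados --help`` first):

.. code-block:: bash

   # DANGER: irreversibly deletes every object in the named pools
   rados purge foo.cephfs.metadata --yes-i-really-really-mean-it
   rados purge foo.cephfs.data --yes-i-really-really-mean-it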