ceph-bluestore-tool Usage Guide

References:
https://github.com/ceph/ceph/blob/master/doc/man/8/ceph-bluestore-tool.rst
https://blog.csdn.net/weixin_39757040/article/details/111702162

1. Introduction

ceph-bluestore-tool is a utility for performing low-level administrative operations on a BlueStore instance.

[root@node-1 ceph-objectstore-tool-test]# ceph-bluestore-tool -h
All options:

Options:
  -h [ --help ]          produce help message
  --path arg             bluestore path
  --out-dir arg          output directory
  -l [ --log-file ] arg  log file
  --log-level arg        log level (30=most, 20=lots, 10=some, 1=little)
  --dev arg              device(s)
  --devs-source arg      bluefs-dev-migrate source device(s)
  --dev-target arg       target/resulting device
  --deep arg             deep fsck (read all data)
  -k [ --key ] arg       label metadata key name
  -v [ --value ] arg     label metadata value
  --allocator arg        allocator to inspect: 'block'/'bluefs-wal'/'bluefs-db'
                         /'bluefs-slow'

Positional options:
  --command arg          fsck, repair, quick-fix, bluefs-export, 
                         bluefs-bdev-sizes, bluefs-bdev-expand, 
                         bluefs-bdev-new-db, bluefs-bdev-new-wal, 
                         bluefs-bdev-migrate, show-label, set-label-key, 
                         rm-label-key, prime-osd-dir, bluefs-log-dump, 
                         free-dump, free-score
                         
Synopsis
ceph-bluestore-tool command [ --dev device ... ] [ --path osd path ] [ --out-dir dir ] [ --log-file | -l filename ] [ --deep ]
ceph-bluestore-tool fsck|repair --path osd path [ --deep ]
ceph-bluestore-tool show-label --dev device ...
ceph-bluestore-tool prime-osd-dir --dev device --path osd path
ceph-bluestore-tool bluefs-export --path osd path --out-dir dir
ceph-bluestore-tool bluefs-bdev-new-wal --path osd path --dev-target new-device
ceph-bluestore-tool bluefs-bdev-new-db --path osd path --dev-target new-device
ceph-bluestore-tool bluefs-bdev-migrate --path osd path --dev-target new-device --devs-source device1 [--devs-source device2]
ceph-bluestore-tool free-dump|free-score --path osd path [ --allocator block/bluefs-wal/bluefs-db/bluefs-slow ]
ceph-bluestore-tool reshard --path osd path --sharding new sharding [ --sharding-ctrl control string ]
ceph-bluestore-tool show-sharding --path osd path          

Before using the tool, stop the corresponding OSD service:

[root@node-1 ceph-objectstore-tool-test]# systemctl stop ceph-osd@0

After finishing, restart the corresponding OSD service:

[root@node-1 ceph-objectstore-tool-test]# systemctl restart ceph-osd@0

2. Examples

Metadata and data checking (--deep true enables full data checking)

This actually calls BlueStore's fsck() function, which has deep, repair, and other options supporting different levels of checking. The work includes: updating the super-prefixed records in the KV DB (ondisk_format, min_alloc_size, etc.; see BlueStore::_upgrade_super()), replaying deferred transactions, removing stale blobs, removing invalid pextents, and updating the BlueFS filesystem metadata.
So fsck mainly checks and repairs BlueStore metadata.

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ fsck [--deep true]


[root@localhost build]# ./bin/ceph-bluestore-tool --log-level 10 --log-file /root/bluestore-tool.log --path /var/lib/ceph/osd/ceph-admin/ fsck --deep true
fsck success

Repair

Implemented via the same BlueStore fsck() function, with repair enabled.

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ [repair|quick-fix] [--deep true]

[root@localhost build]# ./bin/ceph-bluestore-tool --log-level 10 --log-file /root/bluestore-tool.log --path /var/lib/ceph/osd/ceph-admin/ repair --deep true
repair success

Export BlueFS

Export all data stored in BlueFS (excluding BlueFS's own metadata), i.e. the RocksDB files.

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ --out-dir ./bluefs-export bluefs-export

[root@localhost bin]# ./ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-admin/ --out-dir /root/bluefs_export bluefs-export
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-admin/block -> /var/lib/ceph/osd/ceph-admin/block
db/
db/000028.sst
db/000031.sst
db/000032.sst
db/CURRENT
db/IDENTITY
db/LOCK
db/MANIFEST-000061
db/OPTIONS-000061
db/OPTIONS-000064
db.slow/
db.wal/
db.wal/000062.log
sharding/
sharding/def

Show BlueFS device sizes

BlueFS defines five device slots: BDEV_WAL, BDEV_DB, BDEV_SLOW, BDEV_NEWWAL, BDEV_NEWDB. This command reports the space usage of each of these devices in turn; slots without a device are not shown.
In the example below there is only one device.

# total size = get_block_device_size(id) - block_reserved[id]
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ bluefs-bdev-sizes

[root@localhost bin]# ./ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-admin/ bluefs-bdev-sizes
1 : device size 0x1900000000 : using 0x4a2000(4.6 MiB)

Expand BlueFS (WAL, DB). After enlarging the underlying disk, run this command to update the recorded block device size.

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ bluefs-bdev-expand
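
A hypothetical workflow on an LVM-backed OSD (the lvextend target and size are illustrative; adapt them to your layout):

# grow the underlying logical volume first, then let BlueFS pick up the new size
systemctl stop ceph-osd@0
lvextend -L +10G /dev/ceph-vg/osd-block-0
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ bluefs-bdev-expand
systemctl start ceph-osd@0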

Add a WAL device. This only works when no WAL device exists yet; to grow an existing one, use bluefs-bdev-expand.

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ --dev-target new-device bluefs-bdev-new-wal
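
A hypothetical invocation (the WAL size must be passed on the command line, as explained in the pitfall note further below; the target LV path is illustrative):

CEPH_ARGS="--bluestore_block_wal_size=2147483648" ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 --dev-target /dev/new_wal/new_wal bluefs-bdev-new-wal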

Add a DB device. This only works when no DB device exists yet; to grow an existing one, use bluefs-bdev-expand.

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ --dev-target new-device bluefs-bdev-new-db

Migrate BlueFS. Migration requires that the source path already contains a DB or WAL device; otherwise use the new-db or new-wal commands.

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ bluefs-bdev-migrate --dev-target new-device --devs-source device1 [--devs-source device2]

# Pitfall: the code requires bluestore_block_db_size or bluestore_block_wal_size to be set, but ceph-bluestore-tool does not parse the local ceph.conf, so putting the option in any config file has no effect; you will get "please set Ceph bluestore-block-wal-size config parameter". Pass it via CEPH_ARGS or on the command line instead.
# bluestore_block_db_size and bluestore_block_wal_size do not affect the size of the created db/wal device; by default the whole device (minus reserved space) is used.
CEPH_ARGS="--bluestore_block_db_size=1073741824 --bluestore_block_db_create=true" ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-0 bluefs-bdev-new-db --dev-target /dev/new_db/new_db
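
Similarly, a hypothetical migration of the existing DB onto a new, larger device (paths illustrative; migrating to a brand-new device also creates a DB volume, so the size is passed the same way):

CEPH_ARGS="--bluestore_block_db_size=2147483648" ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 bluefs-bdev-migrate --devs-source /var/lib/ceph/osd/ceph-0/block.db --dev-target /dev/new_db/new_db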

Show block device label information

# prints all information contained in the device's superblock (label); with --path, this is the label of the main (SLOW) device
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ show-label

[root@localhost bin]# ./ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-admin/ show-label
inferring bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-admin/block": {
        "osd_uuid": "f120452a-2b51-4b34-9f1a-75830c6f4c94",
        "size": 107374182400,
        "btime": "2021-06-08T13:38:12.390345+0800",
        "description": "main",
        "bfm_blocks": "26214400",
        "bfm_blocks_per_key": "128",
        "bfm_bytes_per_block": "4096",
        "bfm_size": "107374182400",
        "bluefs": "1",
        "kv_backend": "rocksdb",
        "mkfs_done": "yes"
    }
}

Dump the free extents (free-space fragments) of a block device

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ [ --allocator block/bluefs-wal/bluefs-db/bluefs-slow ] free-dump

[root@localhost bin]# ./ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-admin/ --allocator block free-dump
block:                                  // device slot being inspected
{
    "capacity": 107374182400,           // total capacity
    "alloc_unit": 4096,                 // min_alloc_size
    "alloc_type": "hybrid",             // allocator type
    "alloc_name": "block",              // allocator name
    "extents": [                        // all free pextents
        {
            "offset": "0x2000",
            "length": "0xe000"
        },
        {
            "offset": "0x460000",
            "length": "0x10000"
        },
        {
            "offset": "0x480000",
            "length": "0x10000"
        },
        {
            "offset": "0x4c0000",
            "length": "0x60000"
        },
        {
            "offset": "0x530000",
            "length": "0x18ffad0000"
        }
    ]
}

Show the free-space fragmentation score of a block device. The lower the rating, the fewer free fragments there are (i.e. larger contiguous free blocks) and the better the performance.

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ [ --allocator block/bluefs-wal/bluefs-db/bluefs-slow ] free-score

[root@localhost bin]# ./ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-admin/ --allocator block free-score
block:
{
    "fragmentation_rating": 5.3269319151622657e-06
}

Prime the OSD working directory (rewrite the block device label metadata into /var/lib/ceph/osd/ceph-0)

# If the symlink already points to the correct location, nothing is done; if not, it is replaced. If the existing path is not a symlink, the tool exits with EEXIST.
ceph-bluestore-tool prime-osd-dir --dev /dev/ceph-c7590218-d31d-4b95-9ec9-16c4ee38812b/osd-block-e8164817-4103-47e6-b451-37d30d9785f8 --path /var/lib/ceph/osd/ceph-0
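
Afterwards the OSD directory should contain the files rebuilt from the label metadata; per the prime-osd-dir source analysis in section 3, these include whoami, keyring (from osd_key), ceph_fsid, fsid, type, and ready. An illustrative listing:

ls /var/lib/ceph/osd/ceph-0
# block  ceph_fsid  fsid  keyring  ready  type  whoami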

Set a block device label entry (the total label size must not exceed 4 KiB)

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 --dev /var/lib/ceph/osd/ceph-0/block set-label-key -k key -v value

[root@localhost bin]# ./ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-admin/ --dev /var/lib/ceph/osd/ceph-admin/block set-label-key -k mykey -v myvalue
[root@localhost bin]# ./ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-admin/ show-label
inferring bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-admin/block": {
        "osd_uuid": "f120452a-2b51-4b34-9f1a-75830c6f4c94",
        "size": 107374182400,
        "btime": "2021-06-08T13:38:12.390345+0800",
        "description": "main",
        "bfm_blocks": "26214400",
        "bfm_blocks_per_key": "128",
        "bfm_bytes_per_block": "4096",
        "bfm_size": "107374182400",
        "bluefs": "1",
        "kv_backend": "rocksdb",
        "mkfs_done": "yes",
        "mykey": "myvalue"
    }
}

Remove a block device label entry

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 --dev /var/lib/ceph/osd/ceph-0/block rm-label-key -k key

[root@localhost bin]# ./ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-admin/ --dev /var/lib/ceph/osd/ceph-admin/block rm-label-key -k mykey
[root@localhost bin]# ./ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-admin/ show-label
inferring bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-admin/block": {
        "osd_uuid": "f120452a-2b51-4b34-9f1a-75830c6f4c94",
        "size": 107374182400,
        "btime": "2021-06-08T13:38:12.390345+0800",
        "description": "main",
        "bfm_blocks": "26214400",
        "bfm_blocks_per_key": "128",
        "bfm_bytes_per_block": "4096",
        "bfm_size": "107374182400",
        "bluefs": "1",
        "kv_backend": "rocksdb",
        "mkfs_done": "yes"
    }
}

3. Source Code Analysis

main

// src/os/bluestore/bluestore_tool.cc
int main(int argc, char **argv) {
  // parse command-line arguments and validate them
  // initialize the global context
  auto cct = global_init(NULL, args, CEPH_ENTITY_TYPE_CLIENT,
                         CODE_ENVIRONMENT_UTILITY,
                         CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);

  common_init_finish(cct.get());
  // instantiate BlueStore
  BlueStore bluestore(cct.get(), path);
  // dispatch to the handler for the requested command
}

fsck|repair|quick-fix

  int fsck(bool deep) override {
    return _fsck(deep ? FSCK_DEEP : FSCK_REGULAR, false);
  }

  int repair(bool deep) override {
    return _fsck(deep ? FSCK_DEEP : FSCK_REGULAR, true);
  }

  int quick_fix() override {
    return _fsck(FSCK_SHALLOW, true);
  }

/** Below is the introductory comment from _fsck(); the code itself is not analyzed in detail.
An overview for currently implemented repair logics 
performed in fsck in two stages: detection(+preparation) and commit.
Detection stage (in processing order):
  (Issue -> Repair action to schedule)
  - Detect undecodable keys for Shared Blobs -> Remove
  - Detect undecodable records for Shared Blobs -> Remove 
    (might trigger missed Shared Blob detection below)
  - Detect stray records for Shared Blobs -> Remove
  - Detect misreferenced pextents -> Fix
    Prepare Bloom-like filter to track cid/oid -> pextent 
    Prepare list of extents that are improperly referenced
    Enumerate Onode records that might use 'misreferenced' pextents
    (Bloom-like filter applied to reduce computation)
      Per each questionable Onode enumerate all blobs and identify broken ones 
      (i.e. blobs having 'misreferences')
      Rewrite each broken blob data by allocating another extents and 
      copying data there
      If blob is shared - unshare it and mark corresponding Shared Blob 
      for removal
      Release previously allocated space
      Update Extent Map
  - Detect missed Shared Blobs -> Recreate
  - Detect undecodable deferred transaction -> Remove
  - Detect Freelist Manager's 'false free' entries -> Mark as used
  - Detect Freelist Manager's leaked entries -> Mark as free
  - Detect statfs inconsistency - Update
  Commit stage (separate DB commit per each step):
  - Apply leaked FM entries fix
  - Apply 'false free' FM entries fix
  - Apply 'Remove' actions
  - Apply fix for misreference pextents
  - Apply Shared Blob recreate 
    (can be merged with the step above if misreferences were detected)
  - Apply StatFS update
*/

prime-osd-dir

else if (action == "prime-osd-dir") {
  bluestore_bdev_label_t label;
  // read the first 4 KiB block of the device and decode it into a bluestore_bdev_label_t
  /*  
    struct bluestore_bdev_label_t {
    uuid_d osd_uuid;     ///< osd uuid
    uint64_t size = 0;   ///< device size
    utime_t btime;       ///< birth time
    std::string description;  ///< device description

    std::map<std::string,std::string> meta; ///< {read,write}_meta() content from ObjectStore

    void encode(ceph::buffer::list& bl) const;
    void decode(ceph::buffer::list::const_iterator& p);
    void dump(ceph::Formatter *f) const;
    static void generate_test_instances(std::list<bluestore_bdev_label_t*>& o);
  };
  */
  int r = BlueStore::_read_bdev_label(cct.get(), devs.front(), &label);

  // kludge some things into the map that we want to populate into
  // target dir
  label.meta["path_block"] = devs.front();
  label.meta["type"] = "bluestore";
  label.meta["fsid"] = stringify(label.osd_uuid);
  // in the given directory, create files such as whoami and keyring, writing the corresponding values from label.meta
  for (auto kk : {
        "whoami",
        "osd_key",
        "ceph_fsid",
        "fsid",
        "type",
        "ready"}) {
    string k = kk;
    auto i = label.meta.find(k);
    if (i == label.meta.end()) {
      continue;
    }
    string p = path + "/" + k;
    string v = i->second;
    if (k == "osd_key") {
      p = path + "/keyring";
      v = "[osd.";
      v += label.meta["whoami"];
      v += "]\nkey = " + i->second;
    }
    v += "\n";
    int fd = ::open(p.c_str(), O_CREAT | O_TRUNC | O_WRONLY | O_CLOEXEC, 0600);
    // (error handling elided)
    int r = safe_write(fd, v.c_str(), v.size());
    if (r < 0) {
      exit(EXIT_FAILURE);
    }
    ::close(fd);
  }
}

show-label

// print the label information (the bluestore_bdev_label_t struct) in JSON format
else if (action == "show-label") {
    JSONFormatter jf(true);
    jf.open_object_section("devices");
    for (auto &i : devs) {
      bluestore_bdev_label_t label;
      int r = BlueStore::_read_bdev_label(cct.get(), i, &label);
      if (r < 0) {
        cerr << "unable to read label for " << i << ": "
             << cpp_strerror(r) << std::endl;
        exit(EXIT_FAILURE);
      }
      jf.open_object_section(i.c_str());
      label.dump(&jf);
      jf.close_section();
    }
    jf.close_section();
    jf.flush(cout);
  }

bluefs-bdev-sizes

// reserve: label (4k) + bluefs super (4k), which means we start at 8k.
// bluefs-bdev-sizes = block_size_total - block_size_reserve
// bluefs-bdev-sizes-used = bluefs-bdev-sizes - block_size_free
else if (action == "bluefs-bdev-sizes") {
  BlueStore bluestore(cct.get(), path);
  bluestore.dump_bluefs_sizes(cout);
}

void BlueFS::dump_block_extents(ostream& out)
{
  for (unsigned i = 0; i < MAX_BDEV; ++i) {
    if (!bdev[i]) {
      continue;
    }
    auto total = get_total(i);
    auto free = get_free(i);

    out << i << " : device size 0x" << std::hex << total
        << " : using 0x" << total - free
	<< std::dec << "(" << byte_u_t(total - free) << ")";
    out << "\n";
  }
}
uint64_t BlueFS::_get_total(unsigned id) const
{
  ceph_assert(id < bdev.size());
  ceph_assert(id < block_reserved.size());
  return get_block_device_size(id) - block_reserved[id];
}

bluefs-bdev-expand

// Note: only expansion is supported, in place on the existing device; shrinking is not supported.
// The bdev objects in the BlueStore instance re-detect the device size, the newly added space is handed to the FreelistManager instance and persisted to the KV DB (RocksDB) as a transaction, and finally the device label is updated.
int BlueStore::expand_devices(ostream& out)
{
  int r = _open_db_and_around(true);
  for (auto devid : { BlueFS::BDEV_WAL, BlueFS::BDEV_DB}) {
    if (devid == bluefs_layout.shared_bdev ) {
      continue;
    }
    // get the (possibly enlarged) device size
    uint64_t size = bluefs->get_block_device_size(devid);
    if (size == 0) {
      // no bdev
      continue;
    }

    string p = get_device_path(devid);
    const char* path = p.c_str();

    if (bluefs->bdev_support_label(devid)) {
      // update the device label (the new size is rewritten into the first 4 KiB block)
      if (_set_bdev_label_size(p, size) >= 0) {
        out << devid << " : size label updated to " << size << std::endl;
      }
    }
  }
  // size previously recorded by the FreelistManager
  uint64_t size0 = fm->get_size();
  uint64_t size = bdev->get_size();
  if (size0 < size) {
    // update the size recorded by the FreelistManager
    _write_out_fm_meta(size);
    if (bdev->supported_bdev_label()) {
      if (_set_bdev_label_size(path, size) >= 0) {
        out << bluefs_layout.shared_bdev
          << " : size label updated to " << size
          << std::endl;
      }
    }
    _close_db_and_around(true);
    // remount in read/write mode to sync the expansion changes
    r = _mount();
    ceph_assert(r == 0);
    umount();
  } else {
    _close_db_and_around(true);
  }
  return r;
}

bluefs-export

// walk every file (and directory) in the BlueFS filesystem, build its path, and export the contents of each file into the specified local directory
// BlueFS has only two directory levels; see《ceph设计原理与实现》for background
else if (action == "bluefs-export") {
    BlueFS *fs = open_bluefs_readonly(cct.get(), path, devs);

    vector <string> dirs;
    int r = fs->readdir("", &dirs);
    if (r < 0) {
      cerr << "readdir in root failed: " << cpp_strerror(r) << std::endl;
      exit(EXIT_FAILURE);
    }

    if (::access(out_dir.c_str(), F_OK)) {
      r = ::mkdir(out_dir.c_str(), 0755);
      if (r < 0) {
        r = -errno;
        cerr << "mkdir " << out_dir << " failed: " << cpp_strerror(r) << std::endl;
        exit(EXIT_FAILURE);
      }
    }

    for (auto &dir : dirs) {
      if (dir[0] == '.')
        continue;
      cout << dir << "/" << std::endl;
      vector <string> ls;
      r = fs->readdir(dir, &ls);
      if (r < 0) {
        cerr << "readdir " << dir << " failed: " << cpp_strerror(r) << std::endl;
        exit(EXIT_FAILURE);
      }
      string full = out_dir + "/" + dir;
      if (::access(full.c_str(), F_OK)) {
        r = ::mkdir(full.c_str(), 0755);
        if (r < 0) {
          r = -errno;
          cerr << "mkdir " << full << " failed: " << cpp_strerror(r) << std::endl;
          exit(EXIT_FAILURE);
        }
      }
      for (auto &file : ls) {
        if (file[0] == '.')
          continue;
        cout << dir << "/" << file << std::endl;
        uint64_t size;
        utime_t mtime;
        r = fs->stat(dir, file, &size, &mtime);
        if (r < 0) {
          cerr << "stat " << file << " failed: " << cpp_strerror(r) << std::endl;
          exit(EXIT_FAILURE);
        }
        string path = out_dir + "/" + dir + "/" + file;
        int fd = ::open(path.c_str(), O_CREAT | O_WRONLY | O_TRUNC | O_CLOEXEC, 0644);
        if (fd < 0) {
          r = -errno;
          cerr << "open " << path << " failed: " << cpp_strerror(r) << std::endl;
          exit(EXIT_FAILURE);
        }
        if (size > 0) {
          BlueFS::FileReader *h;
          r = fs->open_for_read(dir, file, &h, false);
          if (r < 0) {
            cerr << "open_for_read " << dir << "/" << file << " failed: "
                 << cpp_strerror(r) << std::endl;
            exit(EXIT_FAILURE);
          }
          int pos = 0;
          int left = size;
          while (left) {
            bufferlist bl;
            r = fs->read(h, pos, left, &bl, NULL);
            if (r <= 0) {
              cerr << "read " << dir << "/" << file << " from " << pos
                   << " failed: " << cpp_strerror(r) << std::endl;
              exit(EXIT_FAILURE);
            }
            int rc = bl.write_fd(fd);
            if (rc < 0) {
              cerr << "write to " << path << " failed: "
                   << cpp_strerror(r) << std::endl;
              exit(EXIT_FAILURE);
            }
            pos += r;
            left -= r;
          }
          delete h;
        }
        ::close(fd);
      }
    }
    fs->umount();
    delete fs;
  }

bluefs-bdev-new-db|bluefs-bdev-new-wal

// infer the block device link files from the given OSD directory:
// block: path/block
// block.wal: path/block.wal
// block.db: path/block.db
void inferring_bluefs_devices(vector <string> &devs, std::string &path) {
  cout << "inferring bluefs devices from bluestore path" << std::endl;
  for (auto fn : {"block", "block.wal", "block.db"}) {
    string p = path + "/" + fn;
    struct stat st;
    if (::stat(p.c_str(), &st) == 0) {
      devs.push_back(p);
    }
  }
}

// determine whether a db or wal device is present
// by reading label.description from each block device
// when there is only a main (slow) device, main is also treated as the DB device
void parse_devices(
    CephContext *cct,
    const vector <string> &devs,
    map<string, int> *got,
    bool *has_db,
    bool *has_wal) {
  string main;
  bool was_db = false;
  if (has_wal) {
    *has_wal = false;
  }
  if (has_db) {
    *has_db = false;
  }
  for (auto &d : devs) {
    bluestore_bdev_label_t label;
    int r = BlueStore::_read_bdev_label(cct, d, &label);

    int id = -1;
    if (label.description == "main")
      main = d;
    else if (label.description == "bluefs db") {
      id = BlueFS::BDEV_DB;
      was_db = true;
      if (has_db) {
        *has_db = true;
      }
    } else if (label.description == "bluefs wal") {
      id = BlueFS::BDEV_WAL;
      if (has_wal) {
        *has_wal = true;
      }
    }
    if (id >= 0) {
      got->emplace(d, id);
    }
  }
  if (main.length()) {
    // with only main present, main serves as BlueFS::BDEV_DB
    int id = was_db ? BlueFS::BDEV_SLOW : BlueFS::BDEV_DB;
    got->emplace(main, id);
  }
}
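
The description strings parse_devices keys on ("main", "bluefs db", "bluefs wal") can be seen directly in show-label output. A hypothetical check on a dedicated DB device (device path illustrative):

ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block.db
# look for "description": "bluefs db" in the JSON output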

// only a brand-new device can be added, i.e. there can be at most one wal device and one db device
else if (action == "bluefs-bdev-new-db" || action == "bluefs-bdev-new-wal") {
  map<string, int> cur_devs_map;
  bool need_db = action == "bluefs-bdev-new-db";

  bool has_wal = false;
  bool has_db = false;
  char target_path[PATH_MAX] = "";
  // determine which devices already exist
  parse_devices(cct.get(), devs, &cur_devs_map, &has_db, &has_wal);

  // Create either DB or WAL volume
  int r = EXIT_FAILURE;
  // note: the db/wal size must be supplied on the command line: bluestore_block_db_size / bluestore_block_wal_size
  if (need_db && cct->_conf->bluestore_block_db_size == 0) {
    cerr << "DB size isn't specified, "
            "please set Ceph bluestore-block-db-size config parameter "
         << std::endl; 
  } else if (!need_db && cct->_conf->bluestore_block_wal_size == 0) {
    cerr << "WAL size isn't specified, "
            "please set Ceph bluestore-block-wal-size config parameter "
         << std::endl;
  } else {
    BlueStore bluestore(cct.get(), path);
    // add the new BlueFS device
    r = bluestore.add_new_bluefs_device(
        need_db ? BlueFS::BDEV_NEWDB : BlueFS::BDEV_NEWWAL,
        target_path);
    return r;
  }
} 

// add a new device to BlueFS
int BlueStore::add_new_bluefs_device(int id, const string &dev_path) {
  int r;

  // open the KV DB (RocksDB) along with the superblock, FreelistManager, and Allocator
  r = _open_db_and_around(true);

  if (id == BlueFS::BDEV_NEWWAL) {
    string p = path + "/block.wal";
    // create the symlink path/block.wal -> dev_path
    r = _setup_block_symlink_or_file("block.wal", dev_path,
                                     cct->_conf->bluestore_block_wal_size,
                                     true);
    // add the block device and write its label, marking it as WAL
    r = bluefs->add_block_device(BlueFS::BDEV_NEWWAL, p,
                                 cct->_conf->bdev_enable_discard,
                                 BDEV_LABEL_BLOCK_SIZE);

    if (bluefs->bdev_support_label(BlueFS::BDEV_NEWWAL)) {
      r = _check_or_set_bdev_label(
          p,
          bluefs->get_block_device_size(BlueFS::BDEV_NEWWAL),
          "bluefs wal",
          true);
    }
    // record that BlueFS now has a dedicated WAL
    bluefs_layout.dedicated_wal = true;
  } else if (id == BlueFS::BDEV_NEWDB) {
    string p = path + "/block.db";
    r = _setup_block_symlink_or_file("block.db", dev_path,
                                     cct->_conf->bluestore_block_db_size,
                                     true);
    ceph_assert(r == 0);

    r = bluefs->add_block_device(BlueFS::BDEV_NEWDB, p,
                                 cct->_conf->bdev_enable_discard,
                                 SUPER_RESERVED);
    ceph_assert(r == 0);

    if (bluefs->bdev_support_label(BlueFS::BDEV_NEWDB)) {
      r = _check_or_set_bdev_label(
          p,
          bluefs->get_block_device_size(BlueFS::BDEV_NEWDB),
          "bluefs db",
          true);
      ceph_assert(r == 0);
    }
    // record shared_bdev and dedicated_db
    bluefs_layout.shared_bdev = BlueFS::BDEV_SLOW;
    bluefs_layout.dedicated_db = true;
  }
  // remount BlueFS
  bluefs->umount();
  bluefs->mount();
  // flush the BlueFS layout and related log records to disk, making the change permanent
  r = bluefs->prepare_new_device(id, bluefs_layout);
  ceph_assert(r == 0);
  // close the KV DB and related structures
  _close_db_and_around(true);
  return r;
}

bluefs-bdev-migrate

// migrating onto an existing WAL device is not allowed; the following three migrations are supported:
// 1. wal -> old db
// 2. db -> new db
// 3. wal -> new wal
else if (action == "bluefs-bdev-migrate") {
    map<string, int> cur_devs_map;
    set<int> src_dev_ids;
    map<string, int> src_devs;
    // infer the device paths: wal, db, main
    parse_devices(cct.get(), devs, &cur_devs_map, nullptr, nullptr);
    // check whether the target device is already one of the current BlueFS devices
    auto i = cur_devs_map.find(dev_target);

    // two migration cases:
    // case 1: the target device is already a BlueFS device
    // case 2: the target device is not yet part of BlueFS
    if (i != cur_devs_map.end()) {
      // Migrate to an existing BlueFS volume

      auto dev_target_id = i->second;
      // the target must not be the existing WAL device
      if (dev_target_id == BlueFS::BDEV_WAL) {
        // currently we're unable to migrate to WAL device since there is no space
        // reserved for superblock
        cerr << "Migrate to WAL device isn't supported." << std::endl;
        exit(EXIT_FAILURE);
      }

      BlueStore bluestore(cct.get(), path);
      // calls BlueFS::device_migrate_to_existing
      int r = bluestore.migrate_to_existing_bluefs_device(
          src_dev_ids,
          dev_target_id);
      return r;
    } else {
      // Migrate to a new BlueFS volume
      // via creating either DB or WAL volume
      char target_path[PATH_MAX] = "";
      int dev_target_id;
      if (src_dev_ids.count(BlueFS::BDEV_DB)) {
        // if we have DB device in the source list - we create DB device
        // (and may be remove WAL).
        dev_target_id = BlueFS::BDEV_NEWDB;
      } else if (src_dev_ids.count(BlueFS::BDEV_WAL)) {
        dev_target_id = BlueFS::BDEV_NEWWAL;
      } else {
        // migrating the slow volume is not supported
        cerr << "Unable to migrate Slow volume to new location, "
                "please allocate new DB or WAL with "
                "--bluefs-bdev-new-db(wal) command"
             << std::endl;
        exit(EXIT_FAILURE);
      }

      BlueStore bluestore(cct.get(), path);

      bool need_db = dev_target_id == BlueFS::BDEV_NEWDB;
      // calls BlueFS::device_migrate_to_new
      int r = bluestore.migrate_to_new_bluefs_device(
          src_dev_ids,
          dev_target_id,
          target_path);
      
      return r;
    }
  }

free-dump|free-score

else if (action == "free-dump" || action == "free-score") {
    AdminSocket *admin_socket = g_ceph_context->get_admin_socket();

    for (auto alloc_name : allocs_name) {
      ceph::bufferlist in, out;
      ostringstream err;
      // run the allocator command through the OSD's admin socket machinery
      int r = admin_socket->execute_command(
          // e.g. {"prefix": "bluestore allocator dump block"}
          {"{\"prefix\": \"bluestore allocator " + action_name + " " + alloc_name + "\"}"},
          in, err, &out);
      cout << alloc_name << ":" << std::endl;
      cout << std::string(out.c_str(), out.length()) << std::endl;
    }
    bluestore.cold_close();
  }
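
The strings assembled above are ordinary admin-socket commands registered by the allocator (see the call() handler below), so assuming the OSD is running and its admin socket is reachable, the same data can be fetched without stopping it:

ceph daemon osd.0 bluestore allocator dump block
ceph daemon osd.0 bluestore allocator score block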

// the function actually invoked, located in src/os/bluestore/Allocator.cc
int call(std::string_view command,
	   const cmdmap_t& cmdmap,
	   Formatter *f,
	   std::ostream& ss,
	   bufferlist& out) override {
    int r = 0;
    if (command == "bluestore allocator dump " + name) {
      f->open_object_section("allocator_dump");
      // report the capacity, block size, allocator type, and allocator name
      f->dump_unsigned("capacity", alloc->get_capacity());
      f->dump_unsigned("alloc_unit", alloc->get_block_size());
      f->dump_string("alloc_type", alloc->get_type());
      f->dump_string("alloc_name", name);

      f->open_array_section("extents");
      auto iterated_allocation = [&](size_t off, size_t len) {
        ceph_assert(len > 0);
        f->open_object_section("free");
        char off_hex[30];
        char len_hex[30];
        snprintf(off_hex, sizeof(off_hex) - 1, "0x%lx", off);
        snprintf(len_hex, sizeof(len_hex) - 1, "0x%lx", len);
        f->dump_string("offset", off_hex);
        f->dump_string("length", len_hex);
        f->close_section();
      };
      alloc->dump(iterated_allocation);
      f->close_section();
      f->close_section();
    } else if (command == "bluestore allocator score " + name) {
      f->open_object_section("fragmentation_score");
      // see get_fragmentation_score() of the allocator implementation in use
      f->dump_float("fragmentation_rating", alloc->get_fragmentation_score());
      f->close_section();
    } else if (command == "bluestore allocator fragmentation " + name) {
      f->open_object_section("fragmentation");
      // see get_fragmentation() of the allocator implementation in use
      f->dump_float("fragmentation_rating", alloc->get_fragmentation());
      f->close_section();
    } else {
      ss << "Invalid command" << std::endl;
      r = -ENOSYS;
    }
    return r;
  }