How is the object-to-PG mapping implemented in the source code of Ceph's CRUSH algorithm?

2026-04-12 00:21


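I. Preface


In the previous post, "A Deep Dive into CRUSH (2): Manually Building a Ceph Cluster and Reading and Writing Files with librados", we used a small C client program, rados_write, to write an object into a Ceph test cluster. This post steps through the entire write path of that client under gdb in order to analyze how CRUSH does its calculation.

Mapping a RADOS object takes two stages:

  • Stage 1: object to PG
  • Stage 2: PG to OSD

To keep the length manageable, this post covers only stage 1, the object-to-PG mapping.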

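II. Object-to-PG mapping


The function-by-function analysis below is long, so the flow chart of the object-to-PG mapping comes first, to be followed alongside the code.

[Figure: flow chart of the object-to-PG mapping]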

1. Basic GDB setup and installing the debug dependencies

This walkthrough assumes basic familiarity with gdb. It is simple to use and well documented online, so the basics are not covered here.

Basic gdb configuration:

    # cat ~/.gdbinit
    set history save on
    set print pretty on

set print pretty on makes gdb print structures in an indented, readable layout.

Debugging the program depends on many debuginfo packages. Configure the debuginfo yum repository, then install the packages that gdb asks for.

    # cat /etc/yum.repos.d/CentOS-Debug.repo
    [debug]
    name=CentOS-$releasever - DebugInfo
    baseurl=debuginfo.centos.org/$releasever/$basearch/
    gpgcheck=0
    enabled=1
    protect=1
    priority=1

Debuginfo packages needed to debug rados_write:

    # debuginfo-install glibc-2.17-260.el7.x86_64 \
        libblkid-2.23.2-59.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 \
        libibverbs-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64 \
        libnl-1.1.4-3.el7.x86_64 librados2-12.2.8-0.el7.x86_64 \
        libstdc++-4.8.5-36.el7.x86_64 \
        libuuid-2.23.2-59.el7.x86_64 lttng-ust-2.4.1-4.el7.x86_64 \
        nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-7.el7_5.x86_64 \
        nss-softokn-3.36.0-5.el7_5.x86_64 \
        nss-softokn-freebl-3.36.0-5.el7_5.x86_64 \
        nss-util-3.36.0-1.el7_5.x86_64 sqlite-3.7.17-8.el7.x86_64 \
        userspace-rcu-0.7.16-1.el7.x86_64 \
        zlib-1.2.7-18.el7.x86_64

2. Analyzing the object-to-PG call stack

The whole debugging session is recorded below, with an explanation at each step, so this section is long.

Start gdb on the rados_write program:

    # gdb ./rados_write
    GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
    Copyright (C) 2013 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-redhat-linux-gnu".
    For bug reporting instructions, please see:
    <www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /home/user/CEPH_related/librados_example/rados_write...done.

With gdb started, the next step is to set a breakpoint. crush_do_rule is chosen because it marks the end of the object-to-PG flow: by the time it is called, the placement value x has already been computed. It is also a very well-known function, and the author picked it after reading other people's posts.

    (gdb) b crush_do_rule            // set the breakpoint
    Function "crush_do_rule" not defined.
    Make breakpoint pending on future shared library load? (y or [n]) y
    Breakpoint 1 (crush_do_rule) pending.
    (gdb) run                        // start the program
    Starting program: /home/user/CEPH_related/librados_example/./rados_write
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    [New Thread 0x7fffeb651700 (LWP 3600)]
    Created a cluster handle.
    Read the config file.
    Read the command line arguments.
    [New Thread 0x7fffea6d4700 (LWP 3601)]
    [New Thread 0x7fffe9ed3700 (LWP 3602)]
    [New Thread 0x7fffe96d2700 (LWP 3603)]
    [New Thread 0x7fffe8ed1700 (LWP 3604)]
    [New Thread 0x7fffdbfff700 (LWP 3605)]
    [New Thread 0x7fffdb7fe700 (LWP 3606)]
    [New Thread 0x7fffdaffd700 (LWP 3607)]
    [New Thread 0x7fffda7fc700 (LWP 3608)]
    [New Thread 0x7fffd9ffb700 (LWP 3609)]
    [New Thread 0x7fffd97fa700 (LWP 3610)]
    [New Thread 0x7fffd8ff9700 (LWP 3611)]
    [New Thread 0x7fffb7fff700 (LWP 3612)]
    [New Thread 0x7fffb77fe700 (LWP 3613)]
    Connected to the cluster.
    Created I/O context.

    // execution stops at the breakpoint
    Breakpoint 1, crush_do_rule (map=0x7fffcc010040, ruleno=0, x=-2119696504,
        result=0x7fffffffd5d8, result_max=3, weight=0x7fffcc00fea0, weight_max=3,
        cwin=0x7fffffffd508, choose_args=0x0)
        at /home/user/ceph/src/crush/mapper.c:887
    887     {
    Missing separate debuginfos, use: debuginfo-install libibverbs-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64
    (gdb) bt                         // show the current call stack
    #0  crush_do_rule (map=0x7fffcc010040, ruleno=0, x=-2119696504,
        result=0x7fffffffd5d8, result_max=3, weight=0x7fffcc00fea0, weight_max=3,
        cwin=0x7fffffffd508, choose_args=0x0)
        at /home/user/ceph/src/crush/mapper.c:887
    #1  0x00007fffeee59274 in do_rule<std::vector<unsigned int, mempool::pool_allocator<(mempool::pool_index_t)15, unsigned int> > > (
        choose_args_index=<optimized out>, weight=std::vector of length 3, capacity 4 = {...},
        maxout=3, out=std::vector of length 0, capacity 0, x=-2119696504,
        rule=<optimized out>, this=<optimized out>)
        at /home/user/ceph/src/crush/CrushWrapper.h:1502
    #2  OSDMap::_pg_to_raw_osds (this=this@entry=0x78b960, pool=..., pg=...,
        osds=osds@entry=0x7fffffffd6d0, ppps=ppps@entry=0x7fffffffd6c4)
        at /home/user/ceph/src/osd/OSDMap.cc:2061
    #3  0x00007fffeee5b036 in OSDMap::_pg_to_up_acting_osds (this=0x78b960, pg=...,
        up=up@entry=0x7fffffffd890, up_primary=up_primary@entry=0x7fffffffd834,
        acting=acting@entry=0x7fffffffd8b0,
        acting_primary=acting_primary@entry=0x7fffffffd838,
        raw_pg_to_pg=raw_pg_to_pg@entry=true)
        at /home/user/ceph/src/osd/OSDMap.cc:2295
    #4  0x00007ffff7b42377 in pg_to_up_acting_osds (acting_primary=0x7fffffffd838,
        acting=0x7fffffffd8b0, up_primary=0x7fffffffd834, up=0x7fffffffd890,
        pg=..., this=<optimized out>)
        at /home/user/ceph/src/osd/OSDMap.h:1159
    #5  Objecter::_calc_target (this=this@entry=0x78b2b0, t=t@entry=0x7acf18,
        con=con@entry=0x0, any_change=any_change@entry=false)
        at /home/user/ceph/src/osdc/Objecter.cc:2846
    #6  0x00007ffff7b4ff89 in Objecter::_op_submit (this=this@entry=0x78b2b0,
        op=op@entry=0x7acef0, sul=..., ptid=ptid@entry=0x7fffffffdcc8)
        at /home/user/ceph/src/osdc/Objecter.cc:2393
    #7  0x00007ffff7b5d128 in Objecter::_op_submit_with_budget (this=this@entry=0x78b2b0,
        op=op@entry=0x7acef0, sul=..., ptid=ptid@entry=0x7fffffffdcc8,
        ctx_budget=ctx_budget@entry=0x0)
        at /home/user/ceph/src/osdc/Objecter.cc:2307
    #8  0x00007ffff7b5d3aa in Objecter::op_submit (this=0x78b2b0, op=0x7acef0,
        ptid=0x7fffffffdcc8, ctx_budget=0x0)
        at /home/user/ceph/src/osdc/Objecter.cc:2274
    #9  0x00007ffff7b0c4f0 in librados::IoCtxImpl::operate (this=this@entry=0x7a1ec0,
        oid=..., o=o@entry=0x7fffffffe020, pmtime=pmtime@entry=0x0, flags=flags@entry=0)
        at /home/user/ceph/src/librados/IoCtxImpl.cc:715
    #10 0x00007ffff7b17f5a in librados::IoCtxImpl::write (this=this@entry=0x7a1ec0,
        oid=..., bl=..., len=len@entry=16, off=off@entry=0)
        at /home/user/ceph/src/librados/IoCtxImpl.cc:648
    #11 0x00007ffff7ae241d in rados_write (io=0x7a1ec0, o=<optimized out>,
        buf=0x401039 "Hello World!", len=16, off=0)
        at /home/user/ceph/src/librados/librados.cc:3691
    #12 0x0000000000400d0c in main (argc=1, argv=0x7fffffffe2e8) at rados_write.c:75

Reading the backtrace from the bottom up gives the following call chain:

```

  1. main
  2. rados_write
  3. librados::IoCtxImpl::write
  4. librados::IoCtxImpl::operate
  5. Objecter::op_submit
  6. Objecter::_op_submit_with_budget
  7. Objecter::_op_submit
  8. Objecter::_calc_target
  9. pg_to_up_acting_osds
  10. _pg_to_up_acting_osds
  11. _pg_to_raw_osds
  12. do_rule
  13. crush_do_rule
```

How librados packages the request is not of interest here, only how the object-to-PG calculation is done. So the session is restarted with a new breakpoint, and the whole process is debugged starting from Objecter::_calc_target.

  • 1. Objecter::_calc_target: the entry point; computes the object's hash value ps

    (gdb) b Objecter::_calc_target
    (gdb) run
    Breakpoint 1, Objecter::_calc_target (this=this@entry=0x78b280, t=t@entry=0x7ace08,
        con=con@entry=0x0, any_change=any_change@entry=false)
        at /home/user/ceph/src/osdc/Objecter.cc:2774
    2774    {

Execution stops at the breakpoint. Switch to the TUI to follow the source; pressing Ctrl+x and then a toggles the TUI.

    /home/user/ceph/src/osdc/Objecter.cc
        2773  int Objecter::_calc_target(op_target_t *t, Connection *con, bool any_change)
     B+ 2774  {
        2775    // rwlock is locked
        2776    bool is_read = t->flags & CEPH_OSD_FLAG_READ;
        2777    bool is_write = t->flags & CEPH_OSD_FLAG_WRITE;
        2778    t->epoch = osdmap->get_epoch();   // get the current osdmap epoch
        2779    ldout(cct,20) << __func__ << " epoch " << t->epoch
        2780                  << " base " << t->base_oid << " " << t->base_oloc
        2781                  << " precalc_pgid " << (int)t->precalc_pgid
        2782                  << " pgid " << t->base_pgid
        2783                  << (is_read ? " is_read" : "")
        2784                  << (is_write ? " is_write" : "")
        2785                  << dendl;
        2786
        2787    const pg_pool_t *pi = osdmap->get_pg_pool(t->base_oloc.pool);   // look up the pool by pool id
        2788    if (!pi) {
        2789      t->osd = -1;
        2790      return RECALC_OP_TARGET_POOL_DNE;
        2791    }
      > 2792    ldout(cct,30) << __func__ << " base pi " << pi
        2793                  << " pg_num " << pi->get_pg_num() << dendl;
        2794
        2795    bool force_resend = false;
        2796    if (osdmap->get_epoch() == pi->last_force_op_resend) {
    multi-thre Thread 0x7ffff In: Objecter::_calc_target Line: 2792 PC: 0x7ffff7b41f6c
    (gdb)

Step through the code line by line with n. Here I want to print pg_pool_t *pi and op_target_t *t. pg_pool_t is the pool structure and holds all pool-related information.

    (gdb) p *pi
    $13 = {
      static APPLICATION_NAME_CEPHFS = 0x7fffef10f7f9 "cephfs",
      static APPLICATION_NAME_RBD = 0x7fffef10598b "rbd",
      static APPLICATION_NAME_RGW = 0x7fffef105954 "rgw",
      flags = 1,
      type = 1 '\001',
      size = 3 '\003',
      min_size = 1 '\001',
      crush_rule = 0 '\000',
      object_hash = 2 '\002',
      pg_num = 8,
      pgp_num = 8,
      properties = std::map with 0 elements,
      erasure_code_profile = "",
      last_change = 17,
      last_force_op_resend = 0,
      last_force_op_resend_preluminous = 0,
      snap_seq = {
        val = 0
      },
      snap_epoch = 0,
      auid = 0,
      crash_replay_interval = 0,
      quota_max_bytes = 0,
      quota_max_objects = 0,
      snaps = std::map with 0 elements,
      removed_snaps = {
        _size = 0,
        m = std::map with 0 elements
      },
      pg_num_mask = 7,
      pgp_num_mask = 7,
      tiers = std::set with 0 elements,
      tier_of = -1,
      read_tier = -1,
      write_tier = -1,
      cache_mode = pg_pool_t::CACHEMODE_NONE,
      target_max_bytes = 0,
      target_max_objects = 0,
      cache_target_dirty_ratio_micro = 400000,
      cache_target_dirty_high_ratio_micro = 600000,
      cache_target_full_ratio_micro = 800000,
      cache_min_flush_age = 0,
      cache_min_evict_age = 0,
      hit_set_params = {
        _vptr.Params = 0x7ffff7dcdb90 <vtable for HitSet::Params+16>,
        impl = {
          px = 0x0
        }
      },
      hit_set_period = 0,
      hit_set_count = 0,
      use_gmt_hitset = true,
      min_read_recency_for_promote = 0,
      min_write_recency_for_promote = 0,
      hit_set_grade_decay_rate = 0,
      hit_set_search_last_n = 0,
      stripe_width = 0,
      expected_num_objects = 0,
      fast_read = false,
      opts = {
        opts = std::map with 0 elements
      },
      application_metadata = std::map with 1 elements = {
        ["rgw"] = std::map with 0 elements
      },
      grade_table = std::vector of length 0, capacity 0
    }

op_target_t wraps the information for the whole write operation, including the object name and the id of the target pool.

    (gdb) p *t
    $19 = {
      flags = 32,
      epoch = 24,
      base_oid = {
        name = "test-write"
      },
      base_oloc = {
        pool = 3,
        key = "",
        nspace = "",
        hash = -1
      },
      target_oid = {
        name = "test-write"
      },
      target_oloc = {
        pool = 3,
        key = "",
        nspace = "",
        hash = -1
      },
      precalc_pgid = false,
      pool_ever_existed = false,
      base_pgid = {
        m_pool = 0,
        m_seed = 0,
        m_preferred = -1,
        static calc_name_buf_size = 36 '$'
      },
      pgid = {
        m_pool = 0,
        m_seed = 0,
        m_preferred = -1,
        static calc_name_buf_size = 36 '$'
      },
      actual_pgid = {
        pgid = {
          m_pool = 0,
          m_seed = 0,
          m_preferred = -1,
          static calc_name_buf_size = 36 '$'
        },
        shard = {
          id = -1 '\377',
          static NO_SHARD = {
            id = -1 '\377',
            static NO_SHARD = <same as static member of an already seen type>
          }
        },
        static calc_name_buf_size = 40 '('
      },
      pg_num = 0,
      pg_num_mask = 0,
      up = std::vector of length 0, capacity 0,
      acting = std::vector of length 0, capacity 0,
      up_primary = -1,
      acting_primary = -1,
      size = -1,
      min_size = -1,
      sort_bitwise = false,
      recovery_deletes = false,
      used_replica = false,
      paused = false,
      osd = -1,
      last_force_resend = 0
    }

Take a closer look at t->target_oloc:

    target_oloc = {
      pool = 3,      // id, within the cluster, of the pool passed to rados_write
      key = "",      // locator key; empty because we did not supply one
      nspace = "",   // namespace, separates objects in different namespaces of the same pool
      hash = -1      // explicit hash position; -1 means unset, so the object name gets hashed
    }

The namespace is fed to the hash together with the object name. If namespaces are unfamiliar, see the post "Ceph rados namespace的设计" (The design of Ceph RADOS namespaces).

Keep stepping with n until execution reaches osdmap->object_locator_to_pg.

  • 1.1. OSDMap::object_locator_to_pg: computes the object's placement seed

My guess is that the value is called a seed because it is not yet the final result.

    /home/user/ceph/src/osdc/Objecter.cc
        2812        t->target_oloc.pool = pi->write_tier;
        2813      pi = osdmap->get_pg_pool(t->target_oloc.pool);
        2814      if (!pi) {
        2815        t->osd = -1;
        2816        return RECALC_OP_TARGET_POOL_DNE;
        2817      }
        2818    }
        2819
        2820    pg_t pgid;
        2821    if (t->precalc_pgid) {
        2822      assert(t->flags & CEPH_OSD_FLAG_IGNORE_OVERLAY);
        2823      assert(t->base_oid.name.empty()); // make sure this is a pg op
        2824      assert(t->base_oloc.pool == (int64_t)t->base_pgid.pool());
        2825      pgid = t->base_pgid;
        2826    } else {
        2827      int ret = osdmap->object_locator_to_pg(t->target_oid, t->target_oloc,
      > 2828                                             pgid);
        2829      if (ret == -ENOENT) {
        2830        t->osd = -1;
        2831        return RECALC_OP_TARGET_POOL_DNE;
        2832      }
        2833    }
        2834    ldout(cct,20) << __func__ << " target " << t->target_oid << " "
        2835                  << t->target_oloc << " -> pgid " << pgid << dendl;
    multi-thre Thread 0x7ffff In: Objecter::_calc_target Line: 2828 PC: 0x7ffff7b42748
    (gdb) n
    (gdb) p t->target_oid
    $21 = {
      name = "test-write"
    }
    (gdb) p t->target_oloc
    $22 = {
      pool = 3,
      key = "",
      nspace = "",
      hash = -1
    }

The arguments to object_locator_to_pg are t->target_oid, t->target_oloc and pgid. Step into object_locator_to_pg:

        2000  int OSDMap::object_locator_to_pg(
        2001    const object_t& oid, const object_locator_t& loc, pg_t &pg) const
        2002  {
        2003    if (loc.hash >= 0) {   // hash is -1 here, as the arguments above show
        2004      if (!get_pg_pool(loc.get_pool())) {
        2005        return -ENOENT;
        2006      }
        2007      pg = pg_t(loc.hash, loc.get_pool());
        2008      return 0;
        2009    }
      > 2010    return map_to_pg(loc.get_pool(), oid.name, loc.key, loc.nspace, &pg);
        2011  }
    multi-thre Thread 0x7ffff In: OSDMap::object_locator_to_pg Line: 2010 PC: 0x7fffeee46240
    OSDMap::object_locator_to_pg (this=0x78b930, oid=..., loc=..., pg=...)
        at /home/user/ceph/src/osd/OSDMap.cc:2002

It then calls map_to_pg with the pool id, the object name, the object key, the namespace, and the pg to fill in.

  • 1.1.1. OSDMap::map_to_pg

        1979  // mapping
        1980  int OSDMap::map_to_pg(
        1981    int64_t poolid,
        1982    const string& name,
        1983    const string& key,
        1984    const string& nspace,
        1985    pg_t *pg) const
        1986  {
        1987    // calculate ps (placement seed)
        1988    const pg_pool_t *pool = get_pg_pool(poolid);   // look up the pool by pool id
        1989    if (!pool)
        1990      return -ENOENT;
        1991    ps_t ps;
        1992    if (!key.empty())                      // did the client specify a key for the write?
        1993      ps = pool->hash_key(key, nspace);
        1994    else
      > 1995      ps = pool->hash_key(name, nspace);   // no key given, so the object name serves as the key
        1996    *pg = pg_t(ps, poolid);
        1997    return 0;
        1998  }
    multi-thre Thread 0x7ffff In: OSDMap::map_to_pg Line: 1995 PC: 0x7fffeee4615d
    (gdb) n
    (gdb) p key.empty()
    $25 = true

Because no key was set for the written object, the object name is used as the key and hashed together with nspace. Next, let's see what pool->hash_key(name, nspace) does.
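The branch order seen in object_locator_to_pg and map_to_pg can be condensed into a small sketch. It is a simplified illustration only: Locator, SeedSource and choose_seed_source are hypothetical names, not Ceph types, and the actual hashing is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for object_locator_t.
struct Locator {
  int64_t pool;
  std::string key;     // optional locator key
  std::string nspace;  // optional namespace
  int64_t hash;        // explicit hash position, -1 when unset
};

// Describes where the placement seed comes from.
struct SeedSource {
  bool precomputed;       // true: loc.hash is used as the seed directly
  uint32_t seed;          // meaningful only when precomputed is true
  std::string hash_text;  // text to hash when precomputed is false
};

// Same decision order as the traced code: an explicit hash wins,
// then the locator key, and finally the object name.
inline SeedSource choose_seed_source(const std::string& oid_name,
                                     const Locator& loc) {
  if (loc.hash >= 0)
    return {true, static_cast<uint32_t>(loc.hash), ""};
  if (!loc.key.empty())
    return {false, 0, loc.key};
  return {false, 0, oid_name};
}
```

For the values in this trace (hash = -1, empty key), the sketch selects the object name "test-write" as the text to hash, which agrees with key.empty() printing true above.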

  • 1.1.1.1. pg_pool_t::hash_key

        1374  uint32_t pg_pool_t::hash_key(const string& key, const string& ns) const
        1375  {
        1376    if (ns.empty())   // is the namespace empty?
      > 1377      return ceph_str_hash(object_hash, key.data(), key.length());
        1378    int nsl = ns.length();
        1379    int len = key.length() + nsl + 1;
        1380    char buf[len];
        1381    memcpy(&buf[0], ns.data(), nsl);
        1382    buf[nsl] = '\037';
        1383    memcpy(&buf[nsl+1], key.data(), key.length());
        1384    return ceph_str_hash(object_hash, &buf[0], len);
        1385  }
    multi-thre Thread 0x7ffff In: pg_pool_t::hash_key Line: 1377 PC: 0x7fffeee88360

The function branches on whether a namespace is present, which makes the role of the namespace clear: a non-empty namespace is placed in front of the key, separated by the byte '\037', and the combined buffer is hashed. object_hash is the hash algorithm type. Here it is 2, the default, which selects the rjenkins case in ceph_str_hash. Next, step into ceph_str_hash(object_hash, key.data(), key.length()).
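To make the buffer layout concrete, here is a minimal sketch of the byte string that hash_key hands to ceph_str_hash. The helper name hash_input is hypothetical; it uses std::string instead of the variable-length char array in the original and does not perform the hash itself.

```cpp
#include <cassert>
#include <string>

// Builds the input for the string hash: the key alone when the namespace
// is empty, otherwise namespace + '\037' + key.
// '\037' is octal for 0x1F, the ASCII unit separator.
inline std::string hash_input(const std::string& key, const std::string& ns) {
  if (ns.empty())
    return key;
  std::string buf;
  buf.reserve(ns.size() + 1 + key.size());
  buf.append(ns);
  buf.push_back('\037');
  buf.append(key);
  return buf;
}
```

The separator keeps ("ab" in namespace "c") distinct from ("b" in namespace "ca"), which plain concatenation would not, so objects with the same name in different namespaces generally hash to different seeds.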

  • 1.1.1.1.1. unsigned ceph_str_hash: string to hash

          95  unsigned ceph_str_hash(int type, const char *s, unsigned len)
          96  {   // select the hash algorithm
          97    switch (type) {
          98    case CEPH_STR_HASH_LINUX:
          99      return ceph_str_hash_linux(s, len);
         100    case CEPH_STR_HASH_RJENKINS:
      >  101      return ceph_str_hash_rjenkins(s, len);
         102    default:
         103      return -1;
         104    }
         105  }
    (gdb) s
    ceph_str_hash (type=2, s=0x72cc48 "test-write", len=10)
        at /home/user/ceph/src/common/ceph_hash.cc:97

This dispatches to the ceph_str_hash_rjenkins(s, len) hash. We stop here, because the implementation of the hash itself is not the goal of this post, and return to the caller, OSDMap::map_to_pg.


        1980  int OSDMap::map_to_pg(
        1981    int64_t poolid,
        1982    const string& name,
        1983    const string& key,
        1984    const string& nspace,
        1985    pg_t *pg) const
        1986  {
        1987    // calculate ps (placement seed)
        1988    const pg_pool_t *pool = get_pg_pool(poolid);
        1989    if (!pool)
        1990      return -ENOENT;
        1991    ps_t ps;
        1992    if (!key.empty())
        1993      ps = pool->hash_key(key, nspace);
        1994    else
        1995      ps = pool->hash_key(name, nspace);
      > 1996    *pg = pg_t(ps, poolid);   // combine the computed ps and the poolid into a pg
        1997    return 0;
        1998  }
        1999
    multi-thre Thread 0x7ffff In: OSDMap::map_to_pg Line: 1996 PC: 0x7fffeee46165
    (gdb) p *pg
    $32 = {
      m_pool = 3,
      m_seed = 207473616,
      m_preferred = -1,
      static calc_name_buf_size = 36 '$'
    }

Here the function builds a pg from the ps (placement seed) produced by the hash and the poolid. Its contents are:

    pg = {
      m_pool = 3,           // pool id
      m_seed = 207473616,   // seed
      m_preferred = -1,
      static calc_name_buf_size = 36 '$'
    }

At this point all we have is the object's hash value.

  • 2. Continuing through _calc_target

        2845    osdmap->pg_to_up_acting_osds(pgid, &up, &up_primary,
      > 2846                                 &acting, &acting_primary);
        2847    bool sort_bitwise = osdmap->test_flag(CEPH_OSDMAP_SORTBITWISE);
        2848    bool recovery_deletes = osdmap->test_flag(CEPH_OSDMAP_RECOVERY_DELETES);
        2849    unsigned prev_seed = ceph_stable_mod(pgid.ps(), t->pg_num, t->pg_num_mask);
        2850    pg_t prev_pgid(prev_seed, pgid.pool());
        2851    if (any_change && PastIntervals::is_new_interval(
        2852          t->acting_primary,
        2853          acting_primary,

This step enters pg_to_up_acting_osds. Its input is the pg computed in the previous step.

  • 2.1. void pg_to_up_acting_osds

        1157  void pg_to_up_acting_osds(pg_t pg, vector<int> *up, int *up_primary,
        1158                            vector<int> *acting, int *acting_primary) const {
      > 1159    _pg_to_up_acting_osds(pg, up, up_primary, acting, acting_primary);
        1160  }
    pg_to_up_acting_osds (acting_primary=<optimized out>, acting=<optimized out>,
        up_primary=<optimized out>, up=<optimized out>, pg=..., this=<optimized out>)
        at /home/user/ceph/src/osd/OSDMap.h:1159
    (gdb)

Next, step into _pg_to_up_acting_osds.

  • 2.1.1. OSDMap::_pg_to_up_acting_osds

    void OSDMap::_pg_to_up_acting_osds(
      const pg_t& pg, vector<int> *up, int *up_primary,
      vector<int> *acting, int *acting_primary,
      bool raw_pg_to_pg) const
    {
      const pg_pool_t *pool = get_pg_pool(pg.pool());
      if (!pool ||
          (!raw_pg_to_pg && pg.ps() >= pool->get_pg_num())) {
        if (up)
          up->clear();
        if (up_primary)
          *up_primary = -1;
        if (acting)
          acting->clear();
        if (acting_primary)
          *acting_primary = -1;
        return;
      }
      vector<int> raw;
      vector<int> _up;
      vector<int> _acting;
      int _up_primary;
      int _acting_primary;
      ps_t pps;
      _get_temp_osds(*pool, pg, &_acting, &_acting_primary);   // checks for a pg_temp entry; if one exists, its OSDs are used
      if (_acting.empty() || up || up_primary) {
        _pg_to_raw_osds(*pool, pg, &raw, &pps);                // start computing the hash of the pg
        _apply_upmap(*pool, pg, &raw);
        _raw_to_up_osds(*pool, raw, &_up);
        _up_primary = _pick_primary(_up);
        _apply_primary_affinity(pps, *pool, &_up, &_up_primary);
        if (_acting.empty()) {
          _acting = _up;
          if (_acting_primary == -1) {
            _acting_primary = _up_primary;
          }
        }

The function first checks whether a pg_temp entry already selects the OSDs. If there is none, the normal PG-to-OSD selection runs through _pg_to_raw_osds(*pool, pg, &raw, &pps).
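The fallback from pg_temp to the CRUSH result can be sketched as follows. This is a deliberately reduced model, not the Ceph implementation: Mapping and resolve_up_acting are hypothetical names, upmap and primary affinity are ignored, and the up primary is assumed to be simply the first OSD in the up set.

```cpp
#include <cassert>
#include <vector>

// Hypothetical result type holding both OSD sets and their primaries.
struct Mapping {
  std::vector<int> up;
  int up_primary;
  std::vector<int> acting;
  int acting_primary;
};

// temp_acting / temp_primary model what a pg_temp lookup returned
// (empty vector and -1 when there is no pg_temp entry);
// crush_up models the up set derived from the CRUSH calculation.
inline Mapping resolve_up_acting(const std::vector<int>& temp_acting,
                                 int temp_primary,
                                 const std::vector<int>& crush_up) {
  Mapping m;
  m.up = crush_up;
  m.up_primary = crush_up.empty() ? -1 : crush_up.front();  // simplification
  m.acting = temp_acting;
  m.acting_primary = temp_primary;
  if (m.acting.empty()) {
    // No pg_temp override: the acting set is the up set.
    m.acting = m.up;
    if (m.acting_primary == -1)
      m.acting_primary = m.up_primary;
  }
  return m;
}
```

With no pg_temp entry the acting set equals the up set, while a pg_temp entry replaces only the acting set and leaves the up set as computed.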

  • 2.1.1.1. OSDMap::_pg_to_raw_osds: obtains the PG-to-OSD mapping

        2049  void OSDMap::_pg_to_raw_osds(
        2050    const pg_pool_t& pool, pg_t pg,
        2051    vector<int> *osds,
        2052    ps_t *ppps) const
        2053  {
        2054    // map to osds[]
      > 2055    ps_t pps = pool.raw_pg_to_pps(pg);   // placement ps: the hash computed from the pg and the pool id
        2056    unsigned size = pool.get_size();
        2057
        2058    // what crush rule?
        2059    int ruleno = crush->find_rule(pool.get_crush_rule(), pool.get_type(), size);
        2060    if (ruleno >= 0)
        2061      crush->do_rule(ruleno, pps, *osds, size, osd_weight, pg.pool());
        2062
        2063    _remove_nonexistent_osds(pool, *osds);
        2064
        2065    if (ppps)
        2066      *ppps = pps;
        2067  }
        2068
    multi-thre Thread 0x7ffff In: OSDMap::_pg_to_raw_osds Line: 2055 PC: 0x7fffeee590f8
    (gdb) s
    OSDMap::_pg_to_raw_osds (this=this@entry=0x78b950, pool=..., pg=...,
        osds=osds@entry=0x7fffffffd6d0, ppps=ppps@entry=0x7fffffffd6c4)
        at /home/user/ceph/src/osd/OSDMap.cc:2055
    (gdb) p pg
    $7 = {m_pool = 3, m_seed = 207473616, m_preferred = -1, static calc_name_buf_size = 36 '$'}

Execution now enters pool.raw_pg_to_pps(pg). The pgid obtained earlier is the input here, and the result is the final hash of the object-to-PG mapping, called pps.

  • 2.1.1.1.1. pg_pool_t::raw_pg_to_pps(pg_t pg)

        1406  ps_t pg_pool_t::raw_pg_to_pps(pg_t pg) const
        1407  {
        1408    if (flags & FLAG_HASHPSPOOL) {
        1409      // Hash the pool id so that pool PGs do not overlap.
                  // (the pool id is hashed in so that PGs of different pools are not confused)
        1410      return
        1411        crush_hash32_2(CRUSH_HASH_RJENKINS1,
      > 1412                       ceph_stable_mod(pg.ps(), pgp_num, pgp_num_mask),
        1413                       pg.pool());
        1414    } else {
        1415      // Legacy behavior; add ps and pool together.  This is not a great
        1416      // idea because the PGs from each pool will essentially overlap on
        1417      // top of each other: 0.5 == 1.4 == 2.3 == ...
        1418      return
        1419        ceph_stable_mod(pg.ps(), pgp_num, pgp_num_mask) +
        1420        pg.pool();
        1421    }
        1422  }
        1423
    multi-thre Thread 0x7ffff In: pg_pool_t::raw_pg_to_pps Line: 1412 PC: 0x7fffeee883d5

First, ceph_stable_mod is applied to the seed in pgid together with pgp_num and pgp_num_mask. Note that pgp_num is used here, not pg_num. ceph_stable_mod is fundamental to PG mapping and PG splitting. It is not covered in detail here and will get its own post later. This step yields the actual pg id. We skip the implementation of that function and go straight to crush_hash32_2.
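Although the function is not stepped into, a short sketch helps to see what the stable mod does with the values in this trace. The body of ceph_stable_mod does not appear in the session above; the version below follows the definition in the Ceph headers as best recalled, so treat it as an assumption and compare it with your own tree. The name stable_mod is a local stand-in.

```cpp
#include <cassert>

// x:     the raw placement seed
// b:     the number of PGs (pgp_num in the traced call)
// bmask: the smallest (2^n - 1) that is >= b - 1, e.g. 7 for b = 8
// If the masked value is a valid PG number it is used as is; otherwise one
// fewer bit is kept, which folds the value onto an existing PG.
inline int stable_mod(int x, int b, int bmask) {
  assert(b > 0);
  if ((x & bmask) < b)
    return x & bmask;
  return x & (bmask >> 1);
}
```

With the traced values, stable_mod(207473616, 8, 7) is 0, because 207473616 is a multiple of 16 and its low three bits are zero. That matches a=0 in the crush_hash32_2 call that follows. For a PG count that is not a power of two, say b = 12 with bmask = 15, a seed ending in 13 does not fit and folds to 13 & 7 = 5, while a seed ending in 11 stays at 11. This folding is why raising the PG count only moves objects out of the PGs that are split.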

  • 2.1.1.1.1.1. __u32 crush_hash32_2(int type, __u32 a, __u32 b): derives a unique hash for the pg from the pg id and the pool id

         103  __u32 crush_hash32_2(int type, __u32 a, __u32 b)
         104  {
      >  105    switch (type) {
         106    case CRUSH_HASH_RJENKINS1:
         107      return crush_hash32_rjenkins1_2(a, b);
         108    default:
         109      return 0;
         110    }
         111  }
    multi-thre Thread 0x7ffff In: crush_hash32_2 Line: 105 PC: 0x7fffef057600
    (gdb)
    crush_hash32_2 (type=0, a=0, b=3) at /home/user/ceph/src/crush/hash.c:105

This calls the crush_hash32_rjenkins1_2 hash directly, with the pg id and the pool id as inputs. At this point we can return: we have pps, that is hash(pgid, poolid), and control goes back to _pg_to_raw_osds.
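The two branches of raw_pg_to_pps can be contrasted in a small sketch. It is an illustration under stated assumptions: placement_pps and toy_mix are hypothetical names, the ps passed in is assumed to have gone through the stable mod already, and toy_mix is a trivial stand-in mixer, not the rjenkins hash that Ceph uses.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

using Hash2 = std::function<uint32_t(uint32_t, uint32_t)>;

// Stand-in two-input mixer for demonstration only (NOT crush_hash32_rjenkins1_2).
inline uint32_t toy_mix(uint32_t a, uint32_t b) {
  return a * 31u + b;
}

// folded_ps:  placement seed after the stable mod
// hashpspool: models the FLAG_HASHPSPOOL pool flag
inline uint32_t placement_pps(uint32_t folded_ps, int64_t pool,
                              bool hashpspool, const Hash2& hash2) {
  if (hashpspool)
    return hash2(folded_ps, static_cast<uint32_t>(pool));
  // Legacy behavior: plain addition, so different pools collide.
  return folded_ps + static_cast<uint32_t>(pool);
}
```

In the legacy branch, pool 0 with ps 5, pool 1 with ps 4 and pool 2 with ps 3 all give 5, which is the "0.5 == 1.4 == 2.3" overlap that the source comment warns about. Hashing the pair instead keeps the pool id as a separate input, so with a good hash function PGs of different pools do not systematically land on the same value. In the trace, the inputs to the real hash are (0, 3).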

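III. Summary


That completes the analysis of the object-to-PG mapping. In brief, the process is:

  • hash(obj, nspace): computes the object's hash value ps. Note that the namespace takes part in the hash as well.
  • ceph_stable_mod(ps, pgp_num, pgp_num_mask): yields the actual pg id. This function is also central to PG splitting and migration in Ceph.
  • hash(pg, pool_id): hashes the pg together with the pool id to get the unique value pps, which is the input used to select OSDs.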

IV. References


  • ceph source code
  • "CEPH CRUSH 算法源码分析", a translation of "CEPH CRUSH algorithm source code analysis"

Reposted from: www.dovefi.com/post/%E6%B7%B1%E5%85%A5%E7%90%86%E8%A7%A3crush3object%E8%87%B3pg%E6%98%A0%E5%B0%84%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90/


本文共计4669个文字,预计阅读时间需要19分钟。

Ceph Crush算法中,Object至PG映射的源码是如何实现深入理解的?

一. 前言在上篇文章《深入理解crush(2)——手动编译ceph集群并使用librados读写文件》中,初步使用了C语言客户端程序和rados_write将文件写入到ceph测试集群中。现在,开始通过使用gdb进行调试。

一. 前言


在上一篇《深入理解crush(2)—-手动编译ceph集群并使用librados读写文件》博文中,初步使用了c语言客户端程序rados_write,写入文件到ceph测试集群中,现在开始通过使用gdb debug客户端程序rados_write的整个写入流程,来分析crush的计算过程。

ceph rados对象的映射过程分为两个阶段: - 第一阶段:object 到PG的映射 - 第二阶段:PG 到OSD的映射

考虑到篇幅的问题,这一篇博文先分析第一阶段,object 到PG的映射

二. Object 到 PG的映射


接下来的函数分析会非常的冗长,所以这里先将object 到PG的映射的流程图先贴出来,方便对照着流程图去分析代码。

1. GDB的基本设置,安装gdb调试的一些依赖包

在此之前,需要学会gdb的基本使用,gdb的使用比较简单,网上资料也有比较多的,这里就不在详细说明,

设置gdb的基本配置

# cat ~/.gdbinit set history save on set print pretty on

set print pretty on: 为了美化输出

gdb debug程序的时候会依赖很多包,需要配置debuginfo 的yum 源,并的下载gdb提示的相应依赖

# cat /etc/yum.repos.d/CentOS-Debug.repo [debug] name=CentOS-$releasever - DebugInfo baseurl=debuginfo.centos.org/$releasever/$basearch/ gpgcheck=0 enabled=1 protect=1 priority=1

debugrados_write程序需要的依赖

# debuginfo-install glibc-2.17-260.el7.x86_64 \ libblkid-2.23.2-59.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 \ libibverbs-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64 \ libnl-1.1.4-3.el7.x86_64 librados2-12.2.8-0.el7.x86_64 \ libstdc++-4.8.5-36.el7.x86_64 \ libuuid-2.23.2-59.el7.x86_64 lttng-ust-2.4.1-4.el7.x86_64 \ nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-7.el7_5.x86_64 \ nss-softokn-3.36.0-5.el7_5.x86_64 \ nss-softokn-freebl-3.36.0-5.el7_5.x86_64 \ nss-util-3.36.0-1.el7_5.x86_64 sqlite-3.7.17-8.el7.x86_64 \ userspace-rcu-0.7.16-1.el7.x86_64 \ zlib-1.2.7-18.el7.x86_64 \

2. 分析object 到 PG的函数栈

接下来会将debug的整个过程都记录下来,每一步加上解释和说明,所以篇幅会特别的长,

使用gdb 调试rados_write程序

# gdb ./rados_write GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/user/CEPH_related/librados_example/rados_write...done.

启动程序后,需要设置断点,这里选择的是crush_do_rule函数,因为这个函数是object–>到PG流程的终点,而且这个函数太出名了,笔者也是通过学习其他人的博文,才选择了这个函数断点。

(gdb) b crush_do_rule // 设置断点 Function "crush_do_rule" not defined. Make breakpoint pending on future shared library load? (y or [n]) y Breakpoint 1 (crush_do_rule) pending. (gdb) run // 开始运行程序 Starting program: /home/user/CEPH_related/librados_example/./rados_write [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". [New Thread 0x7fffeb651700 (LWP 3600)] Created a cluster handle. Read the config file. Read the command line arguments. [New Thread 0x7fffea6d4700 (LWP 3601)] [New Thread 0x7fffe9ed3700 (LWP 3602)] [New Thread 0x7fffe96d2700 (LWP 3603)] [New Thread 0x7fffe8ed1700 (LWP 3604)] [New Thread 0x7fffdbfff700 (LWP 3605)] [New Thread 0x7fffdb7fe700 (LWP 3606)] [New Thread 0x7fffdaffd700 (LWP 3607)] [New Thread 0x7fffda7fc700 (LWP 3608)] [New Thread 0x7fffd9ffb700 (LWP 3609)] [New Thread 0x7fffd97fa700 (LWP 3610)] [New Thread 0x7fffd8ff9700 (LWP 3611)] [New Thread 0x7fffb7fff700 (LWP 3612)] [New Thread 0x7fffb77fe700 (LWP 3613)] Connected to the cluster. Created I/O context. 
// 在断点处卡住 Breakpoint 1, crush_do_rule (map=0x7fffcc010040, ruleno=0, x=-2119696504, result=0x7fffffffd5d8, result_max=3, weight=0x7fffcc00fea0, weight_max=3, cwin=0x7fffffffd508, choose_args=0x0) at /home/user/ceph/src/crush/mapper.c:887 887 { Missing separate debuginfos, use: debuginfo-install libibverbs-1.1.8mlnx1-OFED.3.3.0.0.9.33100.x86_64 (gdb) bt // 查看当前的函数堆栈 #0 crush_do_rule (map=0x7fffcc010040, ruleno=0, x=-2119696504, result=0x7fffffffd5d8, result_max=3, weight=0x7fffcc00fea0, weight_max=3, cwin=0x7fffffffd508, choose_args=0x0) at /home/user/ceph/src/crush/mapper.c:887 #1 0x00007fffeee59274 in do_rule<std::vector<unsigned int, mempool::pool_allocator<(mempool::pool_index_t)15, unsigned int> > > ( choose_args_index=<optimized out>, weight=std::vector of length 3, capacity 4 = {...}, maxout=3, out=std::vector of length 0, capacity 0, x=-2119696504, rule=<optimized out>, this=<optimized out>) at /home/user/ceph/src/crush/CrushWrapper.h:1502 #2 OSDMap::_pg_to_raw_osds (this=this@entry=0x78b960, pool=..., pg=..., osds=osds@entry=0x7fffffffd6d0, ppps=ppps@entry=0x7fffffffd6c4) at /home/user/ceph/src/osd/OSDMap.cc:2061 #3 0x00007fffeee5b036 in OSDMap::_pg_to_up_acting_osds (this=0x78b960, pg=..., up=up@entry=0x7fffffffd890, up_primary=up_primary@entry=0x7fffffffd834, acting=acting@entry=0x7fffffffd8b0, acting_primary=acting_primary@entry=0x7fffffffd838, raw_pg_to_pg=raw_pg_to_pg@entry=true) at /home/user/ceph/src/osd/OSDMap.cc:2295 #4 0x00007ffff7b42377 in pg_to_up_acting_osds (acting_primary=0x7fffffffd838, acting=0x7fffffffd8b0, up_primary=0x7fffffffd834, up=0x7fffffffd890, pg=..., this=<optimized out>) at /home/user/ceph/src/osd/OSDMap.h:1159 #5 Objecter::_calc_target (this=this@entry=0x78b2b0, t=t@entry=0x7acf18, con=con@entry=0x0, any_change=any_change@entry=false) at /home/user/ceph/src/osdc/Objecter.cc:2846 #6 0x00007ffff7b4ff89 in Objecter::_op_submit (this=this@entry=0x78b2b0, op=op@entry=0x7acef0, sul=..., ptid=ptid@entry=0x7fffffffdcc8) at 
/home/user/ceph/src/osdc/Objecter.cc:2393 #7 0x00007ffff7b5d128 in Objecter::_op_submit_with_budget (this=this@entry=0x78b2b0, op=op@entry=0x7acef0, sul=..., ptid=ptid@entry=0x7fffffffdcc8, ctx_budget=ctx_budget@entry=0x0) at /home/user/ceph/src/osdc/Objecter.cc:2307 #8 0x00007ffff7b5d3aa in Objecter::op_submit (this=0x78b2b0, op=0x7acef0, ptid=0x7fffffffdcc8, ctx_budget=0x0) at /home/user/ceph/src/osdc/Objecter.cc:2274 #9 0x00007ffff7b0c4f0 in librados::IoCtxImpl::operate (this=this@entry=0x7a1ec0, oid=..., o=o@entry=0x7fffffffe020, pmtime=pmtime@entry=0x0, flags=flags@entry=0) at /home/user/ceph/src/librados/IoCtxImpl.cc:715 #10 0x00007ffff7b17f5a in librados::IoCtxImpl::write (this=this@entry=0x7a1ec0, oid=..., bl=..., len=len@entry=16, off=off@entry=0) at /home/user/ceph/src/librados/IoCtxImpl.cc:648 #11 0x00007ffff7ae241d in rados_write (io=0x7a1ec0, o=<optimized out>, buf=0x401039 "Hello World!", len=16, off=0) at /home/user/ceph/src/librados/librados.cc:3691 #12 0x0000000000400d0c in main (argc=1, argv=0x7fffffffe2e8) at rados_write.c:75

从上面的一系列操作下来,得到的函数流程如下

```

  1. main
  2. rados_write
  3. librados::IoCtxImpl::operate
  4. Objecter::op_submit
  5. Objecter::_op_submit_with_budget
  6. Objecter::_op_submit
  7. Objecter::_calc_target
  8. pg_to_up_acting_osds
  9. _pg_to_up_acting_osds
  10. _pg_to_raw_osds
  11. do_rule
  12. crush_do_rule ```

因为这里并不关心librados是如何封装请求,只关心object到pg的计算过程,所以这里决定从Objecter::_calc_target函数开始debug 整个过程,重新开始,然后再次设置断点

  • 1、Objecter::_calc_target: 开始的入口,计算 object的hash值ps

(gdb) b Objecter::_calc_target (gdb) run Breakpoint 1, Objecter::_calc_target (this=this@entry=0x78b280, t=t@entry=0x7ace08, con=con@entry=0x0, any_change=any_change@entry=false) at /home/user/ceph/src/osdc/Objecter.cc:2774 2774 {

Execution stops at the breakpoint. Now open the TUI to follow the code; Ctrl+x followed by a toggles the TUI view.

/home/user/ceph/src/osdc/Objecter.cc
   2773 int Objecter::_calc_target(op_target_t *t, Connection *con, bool any_change)
B+ 2774 {
   2775   // rwlock is locked
   2776   bool is_read = t->flags & CEPH_OSD_FLAG_READ;
   2777   bool is_write = t->flags & CEPH_OSD_FLAG_WRITE;
   2778   t->epoch = osdmap->get_epoch(); // get the current osdmap epoch
   2779   ldout(cct,20) << __func__ << " epoch " << t->epoch
   2780                 << " base " << t->base_oid << " " << t->base_oloc
   2781                 << " precalc_pgid " << (int)t->precalc_pgid
   2782                 << " pgid " << t->base_pgid
   2783                 << (is_read ? " is_read" : "")
   2784                 << (is_write ? " is_write" : "")
   2785                 << dendl;
   2786
   2787   const pg_pool_t *pi = osdmap->get_pg_pool(t->base_oloc.pool); // look up the pool by its pool id
   2788   if (!pi) {
   2789     t->osd = -1;
   2790     return RECALC_OP_TARGET_POOL_DNE;
   2791   }
 > 2792   ldout(cct,30) << __func__ << " base pi " << pi
   2793                 << " pg_num " << pi->get_pg_num() << dendl;
   2794
   2795   bool force_resend = false;
   2796   if (osdmap->get_epoch() == pi->last_force_op_resend) {
multi-thre Thread 0x7ffff In: Objecter::_calc_target Line: 2792 PC: 0x7ffff7b41f6c
(gdb)

Step through the code line by line with n. Here I want to print the contents of pg_pool_t *pi and op_target_t *t. pg_pool_t is the pool structure and holds all pool-related information.

(gdb) p *pi $13 = { static APPLICATION_NAME_CEPHFS = 0x7fffef10f7f9 "cephfs", static APPLICATION_NAME_RBD = 0x7fffef10598b "rbd", static APPLICATION_NAME_RGW = 0x7fffef105954 "rgw", flags = 1, type = 1 '\001', size = 3 '\003', min_size = 1 '\001', crush_rule = 0 '\000', object_hash = 2 '\002', pg_num = 8, pgp_num = 8, properties = std::map with 0 elements, erasure_code_profile = "", last_change = 17, last_force_op_resend = 0, last_force_op_resend_preluminous = 0, snap_seq = { val = 0 }, snap_epoch = 0, auid = 0, crash_replay_interval = 0, quota_max_bytes = 0, quota_max_objects = 0, snaps = std::map with 0 elements, removed_snaps = { _size = 0, m = std::map with 0 elements }, pg_num_mask = 7, pgp_num_mask = 7, tiers = std::set with 0 elements, tier_of = -1, read_tier = -1, write_tier = -1, cache_mode = pg_pool_t::CACHEMODE_NONE, target_max_bytes = 0, target_max_objects = 0, cache_target_dirty_ratio_micro = 400000, cache_target_dirty_high_ratio_micro = 600000, cache_target_full_ratio_micro = 800000, cache_min_flush_age = 0, cache_min_evict_age = 0, hit_set_params = { _vptr.Params = 0x7ffff7dcdb90 <vtable for HitSet::Params+16>, impl = { px = 0x0 } }, hit_set_period = 0, hit_set_count = 0, use_gmt_hitset = true, min_read_recency_for_promote = 0, min_write_recency_for_promote = 0, hit_set_grade_decay_rate = 0, hit_set_search_last_n = 0, stripe_width = 0, expected_num_objects = 0, fast_read = false, opts = { opts = std::map with 0 elements }, application_metadata = std::map with 1 elements = { ["rgw"] = std::map with 0 elements }, grade_table = std::vector of length 0, capacity 0 }

op_target_t is the structure that wraps the whole write operation; it includes the object name and the id of the pool being written to.

(gdb) p *t $19 = { flags = 32, epoch = 24, base_oid = { name = "test-write" }, base_oloc = { pool = 3, key = "", nspace = "", hash = -1 }, target_oid = { name = "test-write" }, target_oloc = { pool = 3, key = "", nspace = "", hash = -1 }, precalc_pgid = false, pool_ever_existed = false, base_pgid = { m_pool = 0, m_seed = 0, m_preferred = -1, static calc_name_buf_size = 36 '$' }, pgid = { m_pool = 0, m_seed = 0, m_preferred = -1, static calc_name_buf_size = 36 '$' }, actual_pgid = { pgid = { m_pool = 0, m_seed = 0, m_preferred = -1, static calc_name_buf_size = 36 '$' }, shard = { id = -1 '\377', static NO_SHARD = { id = -1 '\377', static NO_SHARD = <same as static member of an already seen type> } }, static calc_name_buf_size = 40 '(' }, pg_num = 0, pg_num_mask = 0, up = std::vector of length 0, capacity 0, acting = std::vector of length 0, capacity 0, up_primary = -1, acting_primary = -1, size = -1, min_size = -1, sort_bitwise = false, recovery_deletes = false, used_replica = false, paused = false, osd = -1, last_force_resend = 0 }

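Let's look at t->target_oloc here:

target_oloc = {
  pool = 3,     // id within the cluster of the pool passed to rados_write
  key = "",     // locator key; empty because we did not supply one
  nspace = "",  // namespace, separates objects of different namespaces within one pool
  hash = -1     // explicit hash position; -1 means none was given, so the seed is computed from the name
}

The namespace is fed into the hash together with the object name. If namespaces are unclear, see the blog post "Ceph rados namespace的设计" (The design of Ceph rados namespaces).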

Continue single-stepping with n; at this point we step into osdmap->object_locator_to_pg.

  • 1.1. OSDMap::object_locator_to_pg: computes the object-to-PG seed

My guess is that it is named seed because this step does not yet produce the final result.

/home/user/ceph/src/osdc/Objecter.cc
   2812     t->target_oloc.pool = pi->write_tier;
   2813     pi = osdmap->get_pg_pool(t->target_oloc.pool);
   2814     if (!pi) {
   2815       t->osd = -1;
   2816       return RECALC_OP_TARGET_POOL_DNE;
   2817     }
   2818   }
   2819
   2820   pg_t pgid;
   2821   if (t->precalc_pgid) {
   2822     assert(t->flags & CEPH_OSD_FLAG_IGNORE_OVERLAY);
   2823     assert(t->base_oid.name.empty()); // make sure this is a pg op
   2824     assert(t->base_oloc.pool == (int64_t)t->base_pgid.pool());
   2825     pgid = t->base_pgid;
   2826   } else {
   2827     int ret = osdmap->object_locator_to_pg(t->target_oid, t->target_oloc,
 > 2828                                            pgid);
   2829     if (ret == -ENOENT) {
   2830       t->osd = -1;
   2831       return RECALC_OP_TARGET_POOL_DNE;
   2832     }
   2833   }
   2834   ldout(cct,20) << __func__ << " target " << t->target_oid << " "
   2835                 << t->target_oloc << " -> pgid " << pgid << dendl;
multi-thre Thread 0x7ffff In: Objecter::_calc_target Line: 2828 PC: 0x7ffff7b42748
(gdb) n
(gdb) p t->target_oid
$21 = { name = "test-write" }
(gdb) p t->target_oloc
$22 = { pool = 3, key = "", nspace = "", hash = -1 }

The arguments to object_locator_to_pg are t->target_oid, t->target_oloc and pgid. Step into object_locator_to_pg:

   2000 int OSDMap::object_locator_to_pg(
   2001   const object_t& oid, const object_locator_t& loc, pg_t &pg) const
   2002 {
   2003   if (loc.hash >= 0) { // as the arguments above show, hash is -1 here
   2004     if (!get_pg_pool(loc.get_pool())) {
   2005       return -ENOENT;
   2006     }
   2007     pg = pg_t(loc.hash, loc.get_pool());
   2008     return 0;
   2009   }
 > 2010   return map_to_pg(loc.get_pool(), oid.name, loc.key, loc.nspace, &pg);
   2011 }
multi-thre Thread 0x7ffff In: OSDMap::object_locator_to_pg Line: 2010 PC: 0x7fffeee46240
OSDMap::object_locator_to_pg (this=0x78b930, oid=..., loc=..., pg=...) at /home/user/ceph/src/osd/OSDMap.cc:2002

From here it calls map_to_pg with the pool id, the object name, the object key, the namespace and the pg_t to fill in.

  • 1.1.1. OSDMap::map_to_pg

   1979 // mapping
   1980 int OSDMap::map_to_pg(
   1981   int64_t poolid,
   1982   const string& name,
   1983   const string& key,
   1984   const string& nspace,
   1985   pg_t *pg) const
   1986 {
   1987   // calculate ps (placement seed)
   1988   const pg_pool_t *pool = get_pg_pool(poolid); // look up the pool by poolid
   1989   if (!pool)
   1990     return -ENOENT;
   1991   ps_t ps;
   1992   if (!key.empty()) // did the client specify a key for this write?
   1993     ps = pool->hash_key(key, nspace);
   1994   else
 > 1995     ps = pool->hash_key(name, nspace); // no key given, so the object name is used as the key
   1996   *pg = pg_t(ps, poolid);
   1997   return 0;
   1998 }
multi-thre Thread 0x7ffff In: OSDMap::map_to_pg Line: 1995 PC: 0x7fffeee4615d
(gdb) n
(gdb) p key.empty()
$25 = true

Because no key was set for the written object, the object name is used as the key and hashed together with nspace. Let's see what pool->hash_key(name, nspace) does.
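The selection order seen so far (an explicit hash in the locator, then the locator key, then the object name) can be sketched as follows. This is a simplified illustration, not Ceph code: `Locator`, `SeedSource`, `seed_source` and `seed_string` are hypothetical names invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for object_locator_t (not the real Ceph type).
struct Locator {
  int64_t pool = -1;
  std::string key;     // optional locator key
  std::string nspace;  // optional namespace
  int64_t hash = -1;   // explicit placement seed; -1 means "not set"
};

// Where the placement seed comes from.
enum class SeedSource { ExplicitHash, LocatorKey, ObjectName };

SeedSource seed_source(const Locator& loc) {
  if (loc.hash >= 0) return SeedSource::ExplicitHash;   // shortcut in object_locator_to_pg
  if (!loc.key.empty()) return SeedSource::LocatorKey;  // map_to_pg prefers the key
  return SeedSource::ObjectName;                        // otherwise the object name
}

// The string that would be passed to hash_key(); empty when the hash is explicit.
std::string seed_string(const std::string& name, const Locator& loc) {
  switch (seed_source(loc)) {
    case SeedSource::ExplicitHash: return "";
    case SeedSource::LocatorKey:   return loc.key;
    case SeedSource::ObjectName:   return name;
  }
  return "";
}
```

For the write traced in this post (no key, hash = -1), `seed_string("test-write", loc)` returns the object name itself.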

  • 1.1.1.1. pg_pool_t::hash_key

   1374 uint32_t pg_pool_t::hash_key(const string& key, const string& ns) const
   1375 {
   1376   if (ns.empty()) // is the namespace empty?
 > 1377     return ceph_str_hash(object_hash, key.data(), key.length());
   1378   int nsl = ns.length();
   1379   int len = key.length() + nsl + 1;
   1380   char buf[len];
   1381   memcpy(&buf[0], ns.data(), nsl);
   1382   buf[nsl] = '\037';
   1383   memcpy(&buf[nsl+1], key.data(), key.length());
   1384   return ceph_str_hash(object_hash, &buf[0], len);
   1385 }
multi-thre Thread 0x7ffff In: pg_pool_t::hash_key Line: 1377 PC: 0x7fffeee88360

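The function handles the two cases differently depending on whether a namespace is set, which should make the role of the namespace clear. object_hash is the hash algorithm type; here it is 2, the default. Continue into ceph_str_hash(object_hash, key.data(), key.length()).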
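To make the namespace handling concrete, here is a minimal sketch of the byte string that ends up being hashed, following the buffer layout in hash_key above. The helper name `hash_input` is hypothetical, and the hash itself is deliberately left out.

```cpp
#include <cassert>
#include <string>

// Bytes handed to the string hash, following the layout in pg_pool_t::hash_key:
// the key alone when the namespace is empty, otherwise ns + '\037' + key.
std::string hash_input(const std::string& key, const std::string& ns) {
  if (ns.empty()) return key;
  std::string buf;
  buf.reserve(ns.size() + 1 + key.size());
  buf += ns;
  buf += '\037';  // ASCII unit separator (0x1f) between namespace and key
  buf += key;
  return buf;
}
```

Because the separator sits between the two parts, the same object name in two different namespaces produces different hash inputs, and so will generally produce different placement seeds.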

  • 1.1.1.1.1. unsigned ceph_str_hash: string to hash

     95 unsigned ceph_str_hash(int type, const char *s, unsigned len)
     96 { // select the hash algorithm
     97   switch (type) {
     98   case CEPH_STR_HASH_LINUX:
     99     return ceph_str_hash_linux(s, len);
    100   case CEPH_STR_HASH_RJENKINS:
 >  101     return ceph_str_hash_rjenkins(s, len);
    102   default:
    103     return -1;
    104   }
    105 }
(gdb) s
ceph_str_hash (type=2, s=0x72cc48 "test-write", len=10) at /home/user/ceph/src/common/ceph_hash.cc:97

This calls the ceph_str_hash_rjenkins(s, len) hash. We stop here: the hash implementation itself is not the goal of this post. Back to the entry point, OSDMap::map_to_pg.


   1980 int OSDMap::map_to_pg(
   1981   int64_t poolid,
   1982   const string& name,
   1983   const string& key,
   1984   const string& nspace,
   1985   pg_t *pg) const
   1986 {
   1987   // calculate ps (placement seed)
   1988   const pg_pool_t *pool = get_pg_pool(poolid);
   1989   if (!pool)
   1990     return -ENOENT;
   1991   ps_t ps;
   1992   if (!key.empty())
   1993     ps = pool->hash_key(key, nspace);
   1994   else
   1995     ps = pool->hash_key(name, nspace);
 > 1996   *pg = pg_t(ps, poolid); // combine the computed ps and the poolid into a pg_t
   1997   return 0;
   1998 }
   1999
multi-thre Thread 0x7ffff In: OSDMap::map_to_pg Line: 1996 PC: 0x7fffeee46165
(gdb) p *pg
$32 = { m_pool = 3, m_seed = 207473616, m_preferred = -1, static calc_name_buf_size = 36 '$' }

Here the function builds a pg_t from the ps (placement seed) produced by the hash and the poolid. Its contents are:

pg = {
  m_pool = 3,          // pool id
  m_seed = 207473616,  // seed
  m_preferred = -1,
  static calc_name_buf_size = 36 '$'
}

So far we have only obtained the object's hash value.

  • 2. Continue debugging _calc_target

   2845   osdmap->pg_to_up_acting_osds(pgid, &up, &up_primary,
 > 2846                                &acting, &acting_primary);
   2847   bool sort_bitwise = osdmap->test_flag(CEPH_OSDMAP_SORTBITWISE);
   2848   bool recovery_deletes = osdmap->test_flag(CEPH_OSDMAP_RECOVERY_DELETES);
   2849   unsigned prev_seed = ceph_stable_mod(pgid.ps(), t->pg_num, t->pg_num_mask);
   2850   pg_t prev_pgid(prev_seed, pgid.pool());
   2851   if (any_change && PastIntervals::is_new_interval(
   2852         t->acting_primary,
   2853         acting_primary,

This step enters pg_to_up_acting_osds; its input is the pg_t computed in the previous step.

  • 2.1. void pg_to_up_acting_osds

   1157 void pg_to_up_acting_osds(pg_t pg, vector<int> *up, int *up_primary,
   1158                           vector<int> *acting, int *acting_primary) const {
 > 1159   _pg_to_up_acting_osds(pg, up, up_primary, acting, acting_primary);
   1160 }
pg_to_up_acting_osds (acting_primary=<optimized out>, acting=<optimized out>, up_primary=<optimized out>, up=<optimized out>, pg=..., this=<optimized out>) at /home/user/ceph/src/osd/OSDMap.h:1159
(gdb)

Then step into _pg_to_up_acting_osds:

  • 2.1.1. OSDMap::_pg_to_up_acting_osds

void OSDMap::_pg_to_up_acting_osds(
  const pg_t& pg, vector<int> *up, int *up_primary,
  vector<int> *acting, int *acting_primary,
  bool raw_pg_to_pg) const
{
  const pg_pool_t *pool = get_pg_pool(pg.pool());
  if (!pool ||
      (!raw_pg_to_pg && pg.ps() >= pool->get_pg_num())) {
    if (up)
      up->clear();
    if (up_primary)
      *up_primary = -1;
    if (acting)
      acting->clear();
    if (acting_primary)
      *acting_primary = -1;
    return;
  }
  vector<int> raw;
  vector<int> _up;
  vector<int> _acting;
  int _up_primary;
  int _acting_primary;
  ps_t pps;
  _get_temp_osds(*pool, pg, &_acting, &_acting_primary); // check for a pg_temp entry; if one exists, its OSDs are used
  if (_acting.empty() || up || up_primary) {
    _pg_to_raw_osds(*pool, pg, &raw, &pps); // start computing the hash of the pg
    _apply_upmap(*pool, pg, &raw);
    _raw_to_up_osds(*pool, raw, &_up);
    _up_primary = _pick_primary(_up);
    _apply_primary_affinity(pps, *pool, &_up, &_up_primary);
    if (_acting.empty()) {
      _acting = _up;
      if (_acting_primary == -1) {
        _acting_primary = _up_primary;
      }
    }

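First, the acting OSDs are taken from pg_temp if an entry exists. If there is none (or the caller asked for the up set), the normal PG-to-OSD selection runs through _pg_to_raw_osds(*pool, pg, &raw, &pps).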
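The fallback between pg_temp and the computed up set can be reduced to a few lines. The sketch below is a simplification under stated assumptions: `choose_acting` is a hypothetical helper, OSD ids are plain ints, and the primary selection, upmap and primary-affinity steps from the listing above are ignored.

```cpp
#include <cassert>
#include <vector>

// Simplified view of the acting-set choice in _pg_to_up_acting_osds:
// a non-empty pg_temp entry overrides the computed up set.
std::vector<int> choose_acting(const std::vector<int>& pg_temp,
                               const std::vector<int>& up) {
  return pg_temp.empty() ? up : pg_temp;
}
```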

  • 2.1.1.1. OSDMap::_pg_to_raw_osds: obtain the PG-to-OSD mapping

   2049 void OSDMap::_pg_to_raw_osds(
   2050   const pg_pool_t& pool, pg_t pg,
   2051   vector<int> *osds,
   2052   ps_t *ppps) const
   2053 {
   2054   // map to osds[]
 > 2055   ps_t pps = pool.raw_pg_to_pps(pg); // placement ps: the hash over the pg seed and pool id is computed here
   2056   unsigned size = pool.get_size();
   2057
   2058   // what crush rule?
   2059   int ruleno = crush->find_rule(pool.get_crush_rule(), pool.get_type(), size);
   2060   if (ruleno >= 0)
   2061     crush->do_rule(ruleno, pps, *osds, size, osd_weight, pg.pool());
   2062
   2063   _remove_nonexistent_osds(pool, *osds);
   2064
   2065   if (ppps)
   2066     *ppps = pps;
   2067 }
   2068
multi-thre Thread 0x7ffff In: OSDMap::_pg_to_raw_osds Line: 2055 PC: 0x7fffeee590f8
(gdb) s
OSDMap::_pg_to_raw_osds (this=this@entry=0x78b950, pool=..., pg=..., osds=osds@entry=0x7fffffffd6d0, ppps=ppps@entry=0x7fffffffd6c4) at /home/user/ceph/src/osd/OSDMap.cc:2055
(gdb) p pg
$7 = {m_pool = 3, m_seed = 207473616, m_preferred = -1, static calc_name_buf_size = 36 '$'}

This enters pool.raw_pg_to_pps(pg). Remember the pgid obtained above? It is the input here. The result is the complete object-to-PG hash, called pps.

  • 2.1.1.1.1. pg_pool_t::raw_pg_to_pps(pg_t pg)

   1406 ps_t pg_pool_t::raw_pg_to_pps(pg_t pg) const
   1407 {
   1408   if (flags & FLAG_HASHPSPOOL) {
   1409     // Hash the pool id so that pool PGs do not overlap. (The pool id is mixed in so that PGs of different pools are not confused.)
   1410     return
   1411       crush_hash32_2(CRUSH_HASH_RJENKINS1,
 > 1412                      ceph_stable_mod(pg.ps(), pgp_num, pgp_num_mask),
   1413                      pg.pool());
   1414   } else {
   1415     // Legacy behavior; add ps and pool together. This is not a great
   1416     // idea because the PGs from each pool will essentially overlap on
   1417     // top of each other: 0.5 == 1.4 == 2.3 == ...
   1418     return
   1419       ceph_stable_mod(pg.ps(), pgp_num, pgp_num_mask) +
   1420       pg.pool();
   1421   }
   1422 }
   1423
multi-thre Thread 0x7ffff In: pg_pool_t::raw_pg_to_pps Line: 1412 PC: 0x7fffeee883d5

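Here a stable mod is first applied to the seed in pgid, using pgp_num and pgp_num_mask. Note that this is pgp_num, not pg_num. ceph_stable_mod is the basis of PG mapping and PG splitting; it is not covered in detail here and will get its own post later. This step yields the real PG id (in this pool pg_num and pgp_num are both 8, so folding with either one gives the same value). We do not step into that function and go straight to crush_hash32_2.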
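For readers who want to see the fold before the dedicated post, here is a small sketch of the stable mod as it is applied above. `stable_mod` mirrors the logic of ceph_stable_mod(x, b, bmask); `pg_mask` is a hypothetical helper that derives the mask value gdb printed as pg_num_mask and pgp_num_mask (7 for a count of 8).

```cpp
#include <cassert>
#include <cstdint>

// Smallest (2^n - 1) mask that covers num - 1, e.g. 8 -> 7, 12 -> 15.
uint32_t pg_mask(uint32_t num) {
  assert(num > 0);
  uint32_t mask = 0;
  while (mask < num - 1) mask = (mask << 1) | 1;
  return mask;
}

// Stable modulo: keep the low bits selected by bmask if they name an
// existing bucket (< b); otherwise fall back to one bit fewer.
uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
  if ((x & bmask) < b) return x & bmask;
  return x & (bmask >> 1);
}
```

With the values from this debugging session, `stable_mod(207473616, 8, 7)` is 0, which matches the `a=0` argument that gdb shows for crush_hash32_2. When the count is not a power of two, for example 12 with mask 15, a seed whose low four bits are 13 falls back to 13 & 7 = 5, so only part of the PGs change when the count grows.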

  • 2.1.1.1.1.1. __u32 crush_hash32_2(int type, __u32 a, __u32 b): derive the PG's unique hash value from the PG id and the pool id

    103 __u32 crush_hash32_2(int type, __u32 a, __u32 b)
    104 {
 >  105   switch (type) {
    106   case CRUSH_HASH_RJENKINS1:
    107     return crush_hash32_rjenkins1_2(a, b);
    108   default:
    109     return 0;
    110   }
    111 }
multi-thre Thread 0x7ffff In: crush_hash32_2 Line: 105 PC: 0x7fffef057600
(gdb)
crush_hash32_2 (type=0, a=0, b=3) at /home/user/ceph/src/crush/hash.c:105

Here crush_hash32_rjenkins1_2 is called directly, with the PG id and the pool id as input. At this point we can return: we have pps, that is hash(pgid, poolid), and are back in _pg_to_raw_osds.
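Why the pool id is hashed in rather than simply added becomes clear from the legacy branch of raw_pg_to_pps shown earlier. The sketch below reproduces only that legacy arithmetic; `legacy_pps` is a hypothetical name, and the rjenkins hash used by the FLAG_HASHPSPOOL branch is intentionally not reimplemented here.

```cpp
#include <cassert>
#include <cstdint>

// Same fold as ceph_stable_mod(x, b, bmask).
uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
  if ((x & bmask) < b) return x & bmask;
  return x & (bmask >> 1);
}

// Legacy (non-HASHPSPOOL) placement seed: folded seed plus pool id.
uint32_t legacy_pps(uint32_t seed, uint32_t pool,
                    uint32_t pgp_num, uint32_t pgp_num_mask) {
  return stable_mod(seed, pgp_num, pgp_num_mask) + pool;
}
```

With pgp_num = 8, the PGs 0.5, 1.4 and 2.3 all get the legacy value 5, which is exactly the overlap the source comment warns about: equal pps values are fed to CRUSH as the same input, so those PGs would tend to land on the same OSDs.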

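III. Summary


That completes the analysis of the object-to-PG mapping. In short:

  • hash(obj, nspace): obtains the object's hash value ps; note that the namespace also takes part in the hash
  • ceph_stable_mod(ps, pgp_num, pgp_num_mask): obtains the real PG id; this function is also a core part of PG splitting and migration in Ceph
  • hash(pg, pool_id): hashes the PG and the pool id into a unique value pps, which is the input used to select OSDs.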
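The middle step can be replayed with the numbers captured in gdb. The sketch below starts from the already-hashed seed (the string hash itself is not reimplemented) and prints the PG in the `<pool>.<hex seed>` form Ceph tools usually display; `pgid_string` and `object_pgid` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Same fold as ceph_stable_mod(x, b, bmask).
uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
  if ((x & bmask) < b) return x & bmask;
  return x & (bmask >> 1);
}

// Format a PG id as "<pool>.<seed in hex>".
std::string pgid_string(int64_t pool, uint32_t seed) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%lld.%x", static_cast<long long>(pool), seed);
  return buf;
}

// Raw placement seed -> folded PG id for a pool with the given pg_num.
std::string object_pgid(int64_t pool, uint32_t raw_seed,
                        uint32_t pg_num, uint32_t pg_num_mask) {
  return pgid_string(pool, stable_mod(raw_seed, pg_num, pg_num_mask));
}
```

For the object test-write in pool 3 (m_seed = 207473616, pg_num = 8, pg_num_mask = 7), the raw PG is 3.c5dcbd0 and the folded PG is 3.0.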

IV. References


  • ceph source code
  • CEPH CRUSH 算法源码分析 (original: CEPH CRUSH algorithm source code analysis)

Reposted from: www.dovefi.com/post/%E6%B7%B1%E5%85%A5%E7%90%86%E8%A7%A3crush3object%E8%87%B3pg%E6%98%A0%E5%B0%84%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90/