How to Efficiently Automate HBase Cluster Deployment with Ansible
Background
Out of data-security concerns, we built a low-cost in-house time-series storage system to hold historical market data.
The system borrows InfluxDB's columnar storage and compression strategy, and builds on HBase for massive storage capacity.
Since our ops colleagues had no experience running the Hadoop stack, the deployment fell to me, the developer, moonlighting as the operator.
Choosing a Hadoop Distribution
There are not many viable options today:
- CDH: currently the default choice for small and mid-sized companies
- Ambari: the most flexible and customizable distribution
- Apache: the original, vanilla distribution
Drawbacks of CDH:
- Component versions are outdated and lack newer APIs
- JDK versions are restricted, so you miss the performance gains of newer JDKs
- Many known but unfixed bugs lie in wait for future operations
- Newer CDH releases are no longer free, so there is no free upgrade path
Drawbacks of Ambari:
- Sparse documentation and a difficult build (the dated front-end components fail to compile outright)
- The project has been retired and will no longer be maintained
Drawbacks of Apache:
- Deployment is involved, and version compatibility is easy to get wrong
- Monitoring is incomplete; building your own takes some hands-on effort
Our constraints:
- Strict compliance requirements: licensing disputes must be avoided
- Small cluster: fewer than 50 nodes
- No in-house Hadoop development capability, so we cannot patch bugs ourselves
- Query performance matters, ideally with ZGC or ShenandoahGC
We therefore settled on building the HBase cluster from the vanilla Apache distribution.
Version Selection
The HBase stack versions are:
- Adoptium JDK
- HBase 2.4.11 (JDK 17)
- Hadoop 3.2.3 (JDK 8)
- Zookeeper 3.6.3 (JDK 17)
Hadoop 3.3.x dropped the native snappy and lz4 codecs (related link), and the latest stable HBase 2.4.x has not yet adapted to that change, so we stayed on the Hadoop 3.2.x line.
Hadoop 3.2.x in turn depends on the Zookeeper 3.4.14 client, which cannot run on JDK 14 or later (see the referenced case), so Hadoop itself runs on JDK 8.
Zookeeper version: 3.6.x is the lowest release line with built-in Prometheus metrics, and newer Zookeeper servers remain compatible with older clients, so we chose it. Since 3.6.x officially supports JDK 11, we could confidently run it on JDK 17.
JDK distribution: JDK 17 is the first LTS release with production-ready ZGC. Because Oracle JDK 17 does not ship ShenandoahGC, we settled on Adoptium JDK. Some people have shared notes on running the CDH build of HBase on JDK 15, but it requires a patch; see the appendix for the steps.
Ops Tooling
To offset how hard the Apache distribution is to operate, we rely on two efficient open-source tools.
Ansible: a simple, easy-to-use automation tool
- Idempotent deployments reduce the chance of errors during rollout
- Communicates over ssh, so it is low-intrusion and agent-free
- Playbooks document ops procedures, making handover easy
The dividing line between Ansible versions is 2.9.x, the last series to support Python 2.x. To fit the existing ops environment, we chose it.
That said, if you can, upgrade to Python 3.x and a newer Ansible: some bugs are fixed only in new releases and never backported.
Prometheus: the new generation of monitoring and alerting
- PromQL provides flexible, efficient queries
- Ships with its own TSDB and AlertManager, keeping the deployment architecture simple
- Rich ecosystem of components
- Metrics are collected via JMX Exporter
- Metrics are visualized with Grafana
With no legacy baggage here, we went straight to the latest version.
Configuration Walkthrough
To keep configuration changes traceable, the deployment scripts are maintained in a dedicated Git repository with the layout below:
.
├── hosts
├── ansible.cfg
├── book
│ ├── config-hadoop.yml
│ ├── config-hbase.yml
│ ├── config-metrics.yml
│ ├── config-zk.yml
│ ├── install-hadoop.yml
│ ├── sync-host.yml
│ └── vars.yml
├── conf
│ ├── hadoop
│ │ ├── core-site.xml
│ │ ├── hdfs-site.xml
│ │ ├── mapred-site.xml
│ │ ├── workers
│ │ └── yarn-site.xml
│ ├── hbase
│ │ ├── backup-masters
│ │ ├── hbase-site.xml
│ │ └── regionservers
│ ├── metrics
│ │ ├── exports
│ │ │ ├── hmaster.yml
│ │ │ ├── jmx_exporter.yml
│ │ │ └── regionserver.yml
│ │ └── targets
│ │ ├── hadoop-cluster.yml
│ │ ├── hbase-cluster.yml
│ │ └── zk-cluster.yml
│ └── zk
│ ├── myid
│ └── zoo.cfg
└── repo
├── hadoop
│ ├── apache-zookeeper-3.6.3-bin.tar.gz
│ ├── hadoop-3.2.3.tar.gz
│ ├── hbase-2.4.11-bin.tar.gz
│ ├── hbase-2.4.11-src.tar.gz
│ ├── hbase-server-2.4.11.jar
│ ├── OpenJDK17U-jdk_x64_linux_hotspot_17.0.2_8.tar.gz
│ ├── OpenJDK8U-jdk_x64_linux_hotspot_8u322b06.tar.gz
│ └── repo.md5
└── metrics
└── jmx_prometheus_javaagent-0.16.1.jar
What each directory holds:
- repo: binary artifacts used for deployment
- book: ansible-playbook automation scripts
- conf: configuration templates for the HBase stack components
Hosts are grouped to make cluster planning easier:
[newborn]
[nodes]
172.20.72.1 hostname='my.hadoop1 my.hbase1 my.zk1'
172.20.72.2 hostname='my.hadoop2 my.hbase2 my.zk2'
172.20.72.3 hostname='my.hadoop3 my.hbase3 my.zk3'
172.20.72.4 hostname='my.hadoop4 my.hbase4'
[zk_nodes]
my.zk1 ansible_host=172.30.73.209 myid=1
my.zk2 ansible_host=172.30.73.210 myid=2
my.zk3 ansible_host=172.30.73.211 myid=3
[hadoop_nodes]
my.hadoop[1:4]
[namenodes]
my.hadoop1 id=nn1 rpc_port=8020
...
conf/hadoop directory
core-site.xml
<?xml version="1.0"?>
<configuration>
<!-- Default filesystem -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://{{ hdfs_name }}</value>
</property>
<!-- Data storage directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>{{ hadoop_data_dir }}</value>
</property>
<!-- Web UI user (the default user dr.who cannot upload files) -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>{{ ansible_user }}</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<configuration>
<!-- NameNode data directory -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.tmp.dir}/name</value>
</property>
<!-- DataNode data directory -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.tmp.dir}/data</value>
</property>
<!-- JournalNode edits directory (absolute path, no file:// prefix) -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>${hadoop.tmp.dir}/journal</value>
</property>
<!-- HDFS nameservice (cluster) name -->
<property>
<name>dfs.nameservices</name>
<value>{{ hdfs_name }}</value>
</property>
<!-- NameNode list for the nameservice -->
<property>
<name>dfs.ha.namenodes.{{hdfs_name}}</name>
<value>{{ groups['namenodes'] | map('extract', hostvars) | map(attribute='id') | join(',') }}</value>
</property>
<!-- NameNode RPC addresses -->
{% for host in groups['namenodes'] %}
<property>
<name>dfs.namenode.rpc-address.{{hdfs_name}}.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['rpc_port']}}</value>
</property>
{% endfor %}
<!-- NameNode HTTP addresses -->
{% for host in groups['namenodes'] %}
<property>
<name>dfs.namenode.http-address.{{hdfs_name}}.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['http_port']}}</value>
</property>
{% endfor %}
<!-- Shared edits directory (JournalNode quorum) -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://{{groups['journalnodes'] | zip( groups['journalnodes']|map('extract', hostvars)|map(attribute='journal_port') )| map('join', ':') | join(';') }}/{{hdfs_name}}</value>
</property>
<!-- Failover proxy class (clients use it to locate the active NameNode) -->
<property>
<name>dfs.client.failover.proxy.provider.my-hdfs</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fencing method (guarantees a single active NameNode) -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- Private key used by the sshfence method -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/{{ ansible_user }}/.ssh/id_rsa</value>
</property>
<!-- Enable automatic failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- NameNode handler (worker) thread count -->
<property>
<name>dfs.namenode.handler.count</name>
<value>21</value>
</property>
</configuration>
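The Jinja2 pipeline that renders the shared edits URI above (`zip` the journalnode hosts with their ports, `join` each pair with `:`, then `join` the pairs with `;`) is dense. A plain-Python simulation shows the string it produces; the hostnames and journal port below are illustrative, not taken from the real inventory.

```python
# Simulation of: groups['journalnodes'] | zip(ports) | map('join', ':') | join(';')
# Hostnames and the journal port are made up for illustration.
journalnodes = ["my.hadoop1", "my.hadoop2", "my.hadoop3"]
journal_ports = ["8485", "8485", "8485"]
hdfs_name = "my-hdfs"

pairs = zip(journalnodes, journal_ports)       # [(host, port), ...]
joined = ";".join(":".join(p) for p in pairs)  # "host:port;host:port;..."
shared_edits = "qjournal://%s/%s" % (joined, hdfs_name)
print(shared_edits)
```

This mirrors the `qjournal://host:port;host:port/nameservice` format HDFS expects for `dfs.namenode.shared.edits.dir`.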
yarn-site.xml
<?xml version="1.0"?>
<configuration>
<!-- Enable ResourceManager HA -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<!-- YARN cluster name -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>{{yarn_name}}</value>
</property>
<!-- ResourceManager list -->
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>{{ groups['resourcemanagers'] | map('extract', hostvars) | map(attribute='id') | join(',') }}</value>
</property>
<!-- ResourceManager addresses -->
{% for host in groups['resourcemanagers'] %}
<property>
<name>yarn.resourcemanager.hostname.{{hostvars[host]['id']}}</name>
<value>{{host}}</value>
</property>
{% endfor %}
<!-- ResourceManager internal communication addresses -->
{% for host in groups['resourcemanagers'] %}
<property>
<name>yarn.resourcemanager.address.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['peer_port']}}</value>
</property>
{% endfor %}
<!-- Address NodeManagers use to reach the ResourceManager -->
{% for host in groups['resourcemanagers'] %}
<property>
<name>yarn.resourcemanager.resource-tracker.address.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['tracker_port']}}</value>
</property>
{% endfor %}
<!-- Address ApplicationMasters use to request resources -->
{% for host in groups['resourcemanagers'] %}
<property>
<name>yarn.resourcemanager.scheduler.address.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['scheduler_port']}}</value>
</property>
{% endfor %}
<!-- ResourceManager web UI -->
{% for host in groups['resourcemanagers'] %}
<property>
<name>yarn.resourcemanager.webapp.address.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['web_port']}}</value>
</property>
{% endfor %}
<!-- Enable ResourceManager state recovery -->
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<!-- Zookeeper quorum -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>{{ groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',') }}</value>
</property>
<!-- Persist RM state in the Zookeeper cluster -->
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<!-- Reduce threads handling client scheduler requests -->
<property>
<name>yarn.resourcemanager.scheduler.client.thread-count</name>
<value>10</value>
</property>
<!-- Disable hardware auto-detection on NodeManagers (shared nodes) -->
<property>
<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
<value>false</value>
</property>
<!-- CPU vcores a NodeManager offers to containers -->
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
<!-- Count physical cores instead of logical processors (optional) -->
<property>
<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
<value>false</value>
</property>
<!-- Limit NodeManager memory usage -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- Container memory floor -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<!-- Container memory ceiling -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<!-- Container vcore floor -->
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<!-- Container vcore ceiling -->
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
</property>
<!-- Disable virtual-memory checks -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Ratio of virtual to physical memory -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<!-- Enable the shuffle aux service for MapReduce (optional) -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
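The `regex_replace` filter used for `yarn.resourcemanager.zk-address` above (and again later in `hbase.zookeeper.quorum`) appends the client port to every ZooKeeper hostname. The same transformation in Python's `re`, with the hostnames from the `[zk_nodes]` group:

```python
import re

# Equivalent of: groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',')
zk_nodes = ["my.zk1", "my.zk2", "my.zk3"]
quorum = ",".join(re.sub(r"^(.+)$", r"\1:2181", h) for h in zk_nodes)
print(quorum)
# my.zk1:2181,my.zk2:2181,my.zk3:2181
```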
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- MapReduce classpath -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<!-- MapReduce JVM options (must stay on one line) -->
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx1024m --add-opens java.base/java.lang=ALL-UNNAMED</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>--add-opens java.base/java.lang=ALL-UNNAMED -verbose:gc -Xloggc:/tmp/@taskid@.gc</value>
</property>
</configuration>
workers
{% for host in groups['datanodes'] %}
{{ host }}
{% endfor %}
conf/hbase directory
hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.tmp.dir</name>
<value>./tmp</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://{{ hdfs_name }}/hbase</value>
</property>
<property>
<name>hbase.master.maxclockskew</name>
<value>180000</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>{{ groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',') }}</value>
</property>
</configuration>
regionservers
{% for host in groups['regionservers'] %}
{{ host }}
{% endfor %}
backup-masters
{% for host in groups['hmasters'][1:] %}
{{ host }}
{% endfor %}
conf/metrics/exports directory
jmx_exporter.yml
---
# github.com/prometheus/jmx_exporter
startDelaySeconds: 5
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
# ignore service
- pattern: Hadoop<service=(\w+), name=([\w-.]+), sub=(\w+)><>([\w._]+)
name: $4
labels:
name: "$2"
group: "$3"
attrNameSnakeCase: true
# ignore service
- pattern: Hadoop<service=(\w+), name=(\w+)-([^<]+)><>([\w._]+)
name: $4
labels:
name: "$2"
entity: "$3"
attrNameSnakeCase: true
# ignore service
- pattern: Hadoop<service=(\w+), name=([^<]+)><>([\w._]+)
name: $3
labels:
name: "$2"
attrNameSnakeCase: true
- pattern: .+
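jmx_exporter matches flattened MBean attribute names of the form `domain<key=value, ...><>attribute` against these patterns. A rough Python check against the third rule above shows how the capture groups map to metric name and labels; the sample MBean name below is hypothetical, chosen only to illustrate the mapping.

```python
import re

# Third rule from jmx_exporter.yml:
#   pattern: Hadoop<service=(\w+), name=([^<]+)><>([\w._]+)
pattern = re.compile(r"Hadoop<service=(\w+), name=([^<]+)><>([\w._]+)")

# Hypothetical flattened MBean attribute as jmx_exporter would present it.
mbean = "Hadoop<service=HBase, name=MetricsSystem><>num_active_sources"
m = pattern.match(mbean)
metric_name = m.group(3)        # name: $3
labels = {"name": m.group(2)}   # labels: name: "$2"  ($1, the service, is dropped)
print(metric_name, labels)
```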
hmaster.yml
---
startDelaySeconds: 5
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
blacklistObjectNames:
- "Hadoop:service=HBase,name=JvmMetrics*"
- "Hadoop:service=HBase,name=RegionServer,*"
rules:
- pattern: Hadoop<service=HBase, name=Master, sub=(\w+)><>([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
name: $2
labels:
group: "$1"
stat: "$3"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=Master, sub=(\w+)><>([\w._]+)
name: $2
labels:
group: "$1"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=Master><>([\w._]+)
name: $1
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+), sub=(\w+)><>([\w._]+)
name: $3
labels:
name: "$1"
group: "$2"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+)><>([\w._]+)
name: $2
labels:
name: "$1"
attrNameSnakeCase: true
- pattern: .+
regionserver.yml
---
startDelaySeconds: 5
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
blacklistObjectNames:
- "Hadoop:service=HBase,name=JvmMetrics*"
- "Hadoop:service=HBase,name=Master,*"
rules:
- pattern: Hadoop<service=HBase, name=RegionServer, sub=Regions><>namespace_([\w._]+)_table_([\w._]+)_region_(\w+)_metric_([\w._]+)
name: $4
labels:
group: Regions
namespace: "$1"
table: "$2"
region: "$3"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=Tables><>namespace_([\w._]+)_table_([\w._]+)_columnfamily_([\w._]+)_metric_([\w._]+)
name: $4
labels:
group: Tables
namespace: "$1"
table: "$2"
column_family: "$3"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>namespace_([\w._]+)_table_([\w._]+)_metric_([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
name: $4
labels:
group: "$1"
namespace: "$2"
table: "$3"
stat: "$5"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>namespace_([\w._]+)_table_([\w._]+)_metric_([\w._]+)
name: $4
labels:
group: "$1"
namespace: "$2"
table: "$3"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
name: $2
labels:
group: "$1"
stat: "$3"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>([\w._]+)
name: $2
labels:
group: "$1"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+), sub=(\w+)><>([\w._]+)
name: $3
labels:
name: "$1"
group: "$2"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+)><>([\w._]+)
name: $2
labels:
name: "$1"
attrNameSnakeCase: true
- pattern: .+
conf/metrics/targets directory
zk-cluster.yml
- targets:
{% for host in groups['zk_nodes'] %}
- {{ host }}:7000
{% endfor %}
labels:
service: zookeeper
hadoop-cluster.yml
- targets:
{% for host in groups['namenodes'] %}
- {{ host }}:{{ namenode_metrics_port }}
{% endfor %}
labels:
role: namenode
service: hdfs
- targets:
{% for host in groups['datanodes'] %}
- {{ host }}:{{ datanode_metrics_port }}
{% endfor %}
labels:
role: datanode
service: hdfs
- targets:
{% for host in groups['journalnodes'] %}
- {{ host }}:{{ journalnode_metrics_port }}
{% endfor %}
labels:
role: journalnode
service: hdfs
- targets:
{% for host in groups['resourcemanagers'] %}
- {{ host }}:{{ resourcemanager_metrics_port }}
{% endfor %}
labels:
role: resourcemanager
service: yarn
- targets:
{% for host in groups['datanodes'] %}
- {{ host }}:{{ nodemanager_metrics_port }}
{% endfor %}
labels:
role: nodemanager
service: yarn
hbase-cluster.yml
- targets:
{% for host in groups['hmasters'] %}
- {{ host }}:{{ hmaster_metrics_port }}
{% endfor %}
labels:
role: hmaster
service: hbase
- targets:
{% for host in groups['regionservers'] %}
- {{ host }}:{{ regionserver_metrics_port }}
{% endfor %}
labels:
role: regionserver
service: hbase
book directory
vars.yml
hdfs_name: my-hdfs
yarn_name: my-yarn
sync-host.yml
---
- name: Config Hostname & SSH Keys
hosts: nodes
connection: local
gather_facts: no
any_errors_fatal: true
vars:
hostnames: |
{% for h in groups['nodes'] if hostvars[h].hostname is defined %}{{h}} {{ hostvars[h].hostname }}
{% endfor %}
tasks:
- name: test connectivity
ping:
connection: ssh
- name: change local hostname
become: true
blockinfile:
dest: '/etc/hosts'
marker: "# {mark} ANSIBLE MANAGED HOSTNAME"
block: '{{ hostnames }}'
run_once: true
- name: sync remote hostname
become: true
blockinfile:
dest: '/etc/hosts'
marker: "# {mark} ANSIBLE MANAGED HOSTNAME"
block: '{{ hostnames }}'
connection: ssh
- name: fetch exist status
stat:
path: '~/.ssh/id_rsa'
register: ssh_key_path
connection: ssh
- name: generate ssh key
openssh_keypair:
path: '~/.ssh/id_rsa'
comment: '{{ ansible_user }}@{{ inventory_hostname }}'
type: rsa
size: 2048
state: present
force: no
connection: ssh
when: not ssh_key_path.stat.exists
- name: collect ssh key
command: ssh {{ansible_user}}@{{ansible_host|default(inventory_hostname)}} 'cat ~/.ssh/id_rsa.pub'
register: host_keys # cache data in hostvars[hostname].host_keys
changed_when: false
- name: create temp file
tempfile:
state: file
suffix: _keys
register: temp_ssh_keys
changed_when: false
run_once: true
- name: save ssh key ({{temp_ssh_keys.path}})
blockinfile:
dest: "{{temp_ssh_keys.path}}"
block: |
{% for h in groups['nodes'] if hostvars[h].host_keys is defined %}
{{ hostvars[h].host_keys.stdout }}
{% endfor %}
changed_when: false
run_once: true
- name: deploy ssh key
vars:
ssh_keys: "{{ lookup('file', temp_ssh_keys.path).split('\n') | select('match', '^ssh') | join('\n') }}"
authorized_key:
user: "{{ ansible_user }}"
key: "{{ ssh_keys }}"
state: present
connection: ssh
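The `select('match', '^ssh')` filter in the "deploy ssh key" task keeps only the public-key lines from the temp file, discarding the blockinfile markers and blank lines. A stdlib sketch of that filtering; the file content below is invented for illustration:

```python
# Mirror of: lookup('file', path).split('\n') | select('match', '^ssh') | join('\n')
raw = """# BEGIN ANSIBLE MANAGED BLOCK
ssh-rsa AAAAB3NzaC1... user@my.hadoop1
ssh-rsa AAAAB3NzaC1... user@my.hadoop2

# END ANSIBLE MANAGED BLOCK"""
# Keep only lines starting with "ssh" (the collected public keys).
keys = "\n".join(line for line in raw.split("\n") if line.startswith("ssh"))
print(keys)
```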
install-hadoop.yml
---
- name: Install Hadoop Package
hosts: newborn
gather_facts: no
any_errors_fatal: true
vars:
local_repo: '../repo/hadoop'
remote_repo: '~/repo/hadoop'
package_info:
- {src: 'OpenJDK17U-jdk_x64_linux_hotspot_17.0.2_8.tar.gz', dst: 'java/jdk-17.0.2+8', home: 'jdk17'}
- {src: 'OpenJDK8U-jdk_x64_linux_hotspot_8u322b06.tar.gz', dst: 'java/jdk8u322-b06', home: 'jdk8'}
- {src: 'apache-zookeeper-3.6.3-bin.tar.gz', dst: 'apache/zookeeper-3.6.3', home: 'zookeeper'}
- {src: 'hbase-2.4.11-bin.tar.gz', dst: 'apache/hbase-2.4.11',home: 'hbase'}
- {src: 'hadoop-3.2.3.tar.gz', dst: 'apache/hadoop-3.2.3', home: 'hadoop'}
tasks:
- name: test connectivity
ping:
- name: copy hadoop package
copy:
src: '{{ local_repo }}'
dest: '~/repo'
- name: prepare directory
become: true # become root
file:
state: directory
path: '{{ deploy_dir }}/{{ item.dst }}'
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
mode: 0775
recurse: yes
with_items: '{{ package_info }}'
- name: create link
become: true # become root
file:
state: link
src: '{{ deploy_dir }}/{{ item.dst }}'
dest: '{{ deploy_dir }}/{{ item.home }}'
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
with_items: '{{ package_info }}'
- name: install package
unarchive:
src: '{{ remote_repo }}/{{ item.src }}'
dest: '{{ deploy_dir }}/{{ item.dst }}'
remote_src: yes
extra_opts:
- --strip-components=1
with_items: '{{ package_info }}'
- name: config /etc/profile
become: true
blockinfile:
dest: '/etc/profile'
marker: "# {mark} ANSIBLE MANAGED PROFILE"
block: |
export JAVA_HOME={{ deploy_dir }}/jdk8
export HADOOP_HOME={{ deploy_dir }}/hadoop
export HBASE_HOME={{ deploy_dir }}/hbase
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$PATH
- name: config zkEnv.sh
lineinfile:
path: '{{ deploy_dir }}/zookeeper/bin/zkEnv.sh'
line: 'JAVA_HOME={{ deploy_dir }}/jdk17'
insertafter: '^#\!\/usr\/bin'
firstmatch: yes
- name: config hadoop-env.sh
blockinfile:
dest: '{{ deploy_dir }}/hadoop/etc/hadoop/hadoop-env.sh'
marker: "# {mark} ANSIBLE MANAGED DEFAULT HADOOP ENV"
block: |
export JAVA_HOME={{ deploy_dir }}/jdk8
- name: config hbase-env.sh
blockinfile:
dest: '{{ deploy_dir }}/hbase/conf/hbase-env.sh'
marker: "# {mark} ANSIBLE MANAGED DEFAULT HBASE ENV"
block: |
export JAVA_HOME={{ deploy_dir }}/jdk17
export HBASE_MANAGES_ZK=false
export HBASE_LIBRARY_PATH={{ deploy_dir }}/hadoop/lib/native
export HBASE_OPTS="$HBASE_OPTS --add-exports=java.base/jdk.internal.access=ALL-UNNAMED --add-exports=java.base/jdk.internal=ALL-UNNAMED --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED --add-exports=java.base/sun.security.pkcs=ALL-UNNAMED --add-exports=java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.lang.reflect=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/jdk.internal=ALL-UNNAMED --add-opens java.base/jdk.internal.misc=ALL-UNNAMED --add-opens java.base/jdk.internal.access=ALL-UNNAMED"
- name: patch hbase
copy:
src: '{{ local_repo }}/hbase-server-2.4.11.jar'
dest: '{{ deploy_dir }}/hbase/lib'
backup: no
force: yes
- name: link hadoop config
file:
state: link
src: '{{ deploy_dir }}/hadoop/etc/hadoop/{{ item }}'
dest: '{{ deploy_dir }}/hbase/conf/{{ item }}'
with_items:
- core-site.xml
- hdfs-site.xml
- name: add epel-release repo
shell: 'sudo yum -y install epel-release && sudo yum makecache'
- name: install native library
shell: 'sudo yum -y install snappy snappy-devel lz4 lz4-devel libzstd libzstd-devel'
- name: check hadoop native
shell: '{{ deploy_dir }}/hadoop/bin/hadoop checknative -a'
register: hadoop_checknative
failed_when: false
changed_when: false
ignore_errors: yes
environment:
JAVA_HOME: '{{ deploy_dir }}/jdk8'
- name: hadoop native status
debug:
msg: "{{ hadoop_checknative.stdout_lines }}"
- name: check hbase native
shell: '{{ deploy_dir }}/hbase/bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker'
register: hbase_checknative
failed_when: false
changed_when: false
ignore_errors: yes
environment:
JAVA_HOME: '{{ deploy_dir }}/jdk17'
HBASE_LIBRARY_PATH: '{{ deploy_dir }}/hadoop/lib/native'
- name: hbase native status
debug:
msg: "{{ hbase_checknative.stdout_lines|select('match', '^[^0-9]') | list }}"
- name: test native compression
shell: '{{ deploy_dir }}/hbase/bin/hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/test {{ item }}'
register: 'compression'
failed_when: false
changed_when: false
ignore_errors: yes
environment:
JAVA_HOME: '{{ deploy_dir }}/jdk17'
HBASE_LIBRARY_PATH: '{{ deploy_dir }}/hadoop/lib/native'
with_items:
- snappy
- lz4
- name: native compression status
vars:
results: "{{ compression | json_query('results[*].{type:item, result:stdout}') }}"
debug:
msg: |
{% for r in results %} {{ r.type }} => {{ r.result == 'SUCCESS' }} {% endfor %}
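The `json_query('results[*].{type:item, result:stdout}')` JMESPath projection above reshapes the variable registered by the looped CompressionTest task. An equivalent reshaping in plain Python; the sample register data is hypothetical (real stdout may contain extra log lines):

```python
# What the json_query projection extracts from the registered loop results.
compression = {
    "results": [
        {"item": "snappy", "stdout": "SUCCESS", "rc": 0},
        {"item": "lz4", "stdout": "SUCCESS", "rc": 0},
    ]
}
results = [{"type": r["item"], "result": r["stdout"]} for r in compression["results"]]
for r in results:
    print("%s => %s" % (r["type"], r["result"] == "SUCCESS"))
# snappy => True
# lz4 => True
```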
config-zk.yml
---
- name: Change Zk Config
hosts: zk_nodes
gather_facts: no
any_errors_fatal: true
vars:
template_dir: ../conf/zk
zk_home: '{{ deploy_dir }}/zookeeper'
zk_data_dir: '{{ zk_home }}/status/data'
zk_data_log_dir: '{{ zk_home }}/status/logs'
tasks:
- name: Create data directory
file:
state: directory
path: '{{ item }}'
recurse: yes
with_items:
- '{{ zk_data_dir }}'
- '{{ zk_data_log_dir }}'
- name: Init zookeeper myid
template:
src: '{{ template_dir }}/myid'
dest: '{{ zk_data_dir }}'
- name: Update zookeeper env
become: true
blockinfile:
dest: '{{ zk_home }}/bin/zkEnv.sh'
marker: "# {mark} ANSIBLE MANAGED ZK ENV"
block: |
export SERVER_JVMFLAGS="-Xmx1G -XX:+UseShenandoahGC -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608"
notify:
- Restart zookeeper service
- name: Update zookeeper config
template:
src: '{{ template_dir }}/zoo.cfg'
dest: '{{ zk_home }}/conf'
notify:
- Restart zookeeper service
handlers:
- name: Restart zookeeper service
shell:
cmd: '{{ zk_home }}/bin/zkServer.sh restart'
config-hadoop.yml
---
- name: Change Hadoop Config
hosts: hadoop_nodes
gather_facts: no
any_errors_fatal: true
vars:
template_dir: ../conf/hadoop
hadoop_home: '{{ deploy_dir }}/hadoop'
hadoop_conf_dir: '{{ hadoop_home }}/etc/hadoop'
hadoop_data_dir: '{{ data_dir }}/hadoop'
tasks:
- name: Include common vars
include_vars: file=vars.yml
- name: Create data directory
become: true
file:
state: directory
path: '{{ hadoop_data_dir }}'
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
mode: 0775
recurse: yes
- name: Sync hadoop config
template:
src: '{{ template_dir }}/{{ item }}'
dest: '{{ hadoop_conf_dir }}/{{ item }}'
with_items:
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml
- workers
- name: Config hadoop env
blockinfile:
dest: '{{ hadoop_conf_dir }}/hadoop-env.sh'
marker: "# {mark} ANSIBLE MANAGED HADOOP ENV"
block: |
export HADOOP_PID_DIR={{ hadoop_home }}/pid
export HADOOP_LOG_DIR={{ hadoop_data_dir }}/logs
JVM_OPTS="-XX:+AlwaysPreTouch"
export HDFS_JOURNALNODE_OPTS="-Xmx1G $JVM_OPTS $HDFS_JOURNALNODE_OPTS"
export HDFS_NAMENODE_OPTS="-Xmx4G $JVM_OPTS $HDFS_NAMENODE_OPTS"
export HDFS_DATANODE_OPTS="-Xmx8G $JVM_OPTS $HDFS_DATANODE_OPTS"
- name: Config yarn env
blockinfile:
dest: '{{ hadoop_conf_dir }}/yarn-env.sh'
marker: "# {mark} ANSIBLE MANAGED YARN ENV"
block: |
JVM_OPTS=""
export YARN_RESOURCEMANAGER_OPTS="$JVM_OPTS $YARN_RESOURCEMANAGER_OPTS"
export YARN_NODEMANAGER_OPTS="$JVM_OPTS $YARN_NODEMANAGER_OPTS"
config-hbase.yml
---
- name: Change HBase Config
hosts: hbase_nodes
gather_facts: no
any_errors_fatal: true
vars:
template_dir: ../conf/hbase
hbase_home: '{{ deploy_dir }}/hbase'
hbase_conf_dir: '{{ hbase_home }}/conf'
hbase_data_dir: '{{ data_dir }}/hbase'
hbase_log_dir: '{{ hbase_data_dir }}/logs'
hbase_gc_log_dir: '{{ hbase_log_dir }}/gc'
tasks:
- name: Include common vars
include_vars: file=vars.yml
- name: Create data directory
become: true
file:
state: directory
path: '{{ item }}'
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
mode: 0775
recurse: yes
with_items:
- '{{ hbase_data_dir }}'
- '{{ hbase_log_dir }}'
- '{{ hbase_gc_log_dir }}'
- name: Sync hbase config
template:
src: '{{ template_dir }}/{{ item }}'
dest: '{{ hbase_conf_dir }}/{{ item }}'
with_items:
- hbase-site.xml
- backup-masters
- regionservers
- name: Config hbase env
blockinfile:
dest: '{{ hbase_conf_dir }}/hbase-env.sh'
marker: "# {mark} ANSIBLE MANAGED HBASE ENV"
block: |
export HBASE_LOG_DIR={{ hbase_log_dir }}
export HBASE_OPTS="-Xss256k -XX:+UseShenandoahGC -XX:+AlwaysPreTouch $HBASE_OPTS"
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xlog:gc:{{hbase_gc_log_dir}}/gc-hmaster-%p-%t.log"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xlog:gc:{{hbase_gc_log_dir}}/gc-hregion-%p-%t.log"
config-metrics.yml
---
- name: Install Metrics Package
hosts: "{{ groups['hadoop_nodes'] + groups['hbase_nodes'] }}"
gather_facts: no
any_errors_fatal: true
vars:
local_repo: '../repo/metrics'
remote_repo: '~/repo/metrics'
template_dir: ../conf/metrics
default_conf: jmx_exporter.yml
export_tmpl: '{{template_dir}}/exports'
target_tmpl: '{{template_dir}}/targets'
metrics_dir: '{{ deploy_dir }}/prometheus'
hadoop_home: '{{ deploy_dir }}/hadoop'
hbase_home: '{{ deploy_dir }}/hbase'
jmx_exporter: 'jmx_prometheus_javaagent-0.16.1.jar'
agent_path: '{{ metrics_dir }}/{{ jmx_exporter }}'
namenode_metrics_port: 7021
datanode_metrics_port: 7022
journalnode_metrics_port: 7023
resourcemanager_metrics_port: 7024
nodemanager_metrics_port: 7025
historyserver_metrics_port: 7026
hmaster_metrics_port: 7027
regionserver_metrics_port: 7028
host_to_ip: |
{ {% for h in groups['nodes'] %} {% for n in hostvars[h]['hostname'].split() %}
"{{ n }}" : "{{ h }}" ,
{% endfor %} {% endfor %} }
hadoop_metrics:
- { env: 'HDFS_NAMENODE_OPTS', conf: 'namenode.yml', port: '{{namenode_metrics_port}}', }
- { env: 'HDFS_DATANODE_OPTS', conf: 'datanode.yml', port: '{{datanode_metrics_port}}'}
- { env: 'HDFS_JOURNALNODE_OPTS', conf: 'journalnode.yml', port: '{{journalnode_metrics_port}}' }
- { env: 'YARN_RESOURCEMANAGER_OPTS', conf: 'resourcemanager.yml', port: '{{resourcemanager_metrics_port}}' }
- { env: 'YARN_NODEMANAGER_OPTS', conf: 'nodemanager.yml', port: '{{nodemanager_metrics_port}}' }
- { env: 'MAPRED_HISTORYSERVER_OPTS', conf: 'historyserver.yml', port: '{{historyserver_metrics_port}}' }
hbase_metrics:
- { env: 'HBASE_MASTER_OPTS', conf: 'hmaster.yml', port: '{{hmaster_metrics_port}}' }
- { env: 'HBASE_REGIONSERVER_OPTS', conf: 'regionserver.yml', port: '{{regionserver_metrics_port}}'}
tasks:
- name: test connectivity
ping:
- name: copy metrics package
copy:
src: '{{ local_repo }}'
dest: '~/repo'
- name: ensure metrics dir
become: true
file:
path: '{{ metrics_dir }}'
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
state: directory
- name: install jmx exporter
copy:
src: '{{ remote_repo }}/{{ jmx_exporter }}'
dest: '{{ metrics_dir }}/{{ jmx_exporter }}'
remote_src: yes
- name: fetch exist exporter config
stat:
path: '{{ export_tmpl }}/{{ item }}'
with_items: "{{ (hadoop_metrics + hbase_metrics) | map(attribute='conf') | list }}"
register: metric_tmpl
run_once: yes
connection: local
- name: update hadoop exporter config
vars:
metrics_ip: '{{host_to_ip[inventory_hostname]}}'
metrics_port: '{{ item.port }}'
custom_tmpl: "{{ item.conf in (metric_tmpl | json_query('results[?stat.exists].item')) }}"
template:
src: '{{ export_tmpl }}/{{ item.conf if custom_tmpl else default_conf }}'
dest: '{{ metrics_dir }}/{{ item.conf }}'
with_items: '{{ hadoop_metrics }}'
when: inventory_hostname in groups['hadoop_nodes']
- name: update hbase exporter config
vars:
metrics_ip: '{{host_to_ip[inventory_hostname]}}'
metrics_port: '{{ item.port }}'
custom_tmpl: "{{ item.conf in (metric_tmpl | json_query('results[?stat.exists].item')) }}"
template:
src: '{{ export_tmpl }}/{{ item.conf if custom_tmpl else default_conf }}'
dest: '{{ metrics_dir }}/{{ item.conf }}'
with_items: '{{ hbase_metrics }}'
when: inventory_hostname in groups['hbase_nodes']
- name: config hadoop-env.sh
blockinfile:
dest: '{{ deploy_dir }}/hadoop/etc/hadoop/hadoop-env.sh'
marker: "# {mark} ANSIBLE MANAGED DEFAULT HADOOP METRIC ENV"
block: |
{% for m in hadoop_metrics %}
export {{m.env}}="-javaagent:{{agent_path}}={{m.port}}:{{metrics_dir}}/{{m.conf}} ${{m.env}}"
{% endfor %}
when: inventory_hostname in groups['hadoop_nodes']
- name: config hbase-env.sh
blockinfile:
dest: '{{ deploy_dir }}/hbase/conf/hbase-env.sh'
marker: "# {mark} ANSIBLE MANAGED DEFAULT HBASE METRIC ENV"
block: |
{% for m in hbase_metrics %}
export {{m.env}}="-javaagent:{{agent_path}}={{m.port}}:{{metrics_dir}}/{{m.conf}} ${{m.env}}"
{% endfor %}
when: inventory_hostname in groups['hbase_nodes']
- name: ensure generated target dir
file:
path: '/tmp/gen-prometheus-targets'
state: directory
run_once: yes
connection: local
- name: generate target config to /tmp/gen-prometheus-targets
template:
src: '{{ target_tmpl }}/{{ item }}'
dest: '/tmp/gen-prometheus-targets/{{ item }}'
with_items:
- hadoop-cluster.yml
- hbase-cluster.yml
- zk-cluster.yml
run_once: yes
connection: local
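The `host_to_ip` variable in config-metrics.yml renders a JSON-style mapping from every alias declared in `hostname=` back to the node's IP, so the exporter config can bind metrics to the right address. Rebuilt in plain Python from a sample of the `[nodes]` group:

```python
# Rebuild the host_to_ip mapping from config-metrics.yml in plain Python.
# Inventory sample mirrors the [nodes] group: IP -> space-separated aliases.
nodes = {
    "172.20.72.1": "my.hadoop1 my.hbase1 my.zk1",
    "172.20.72.2": "my.hadoop2 my.hbase2 my.zk2",
}
host_to_ip = {alias: ip for ip, aliases in nodes.items() for alias in aliases.split()}
print(host_to_ip["my.hbase2"])
# 172.20.72.2
```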
Operational Steps
Set up the control node
- Install Ansible
- SSH host-key prompting must be disabled, otherwise later install steps may hang
Initialize the machines
- Edit the hosts inventory (entries must be IP addresses)
- [nodes] lists every node in the cluster; [newborn] lists nodes that have no packages installed yet
- Run ansible-playbook book/sync-host.yml
- Run ansible-playbook book/install-hadoop.yml
- Edit the hosts inventory: empty the [newborn] group
Deploy Zookeeper
- Edit the hosts inventory (ansible_user and myid must be set)
- [zk_nodes] lists every ZK node in the cluster
- Adjust the JVM options in book/config-zk.yml, then run ansible-playbook book/config-zk.yml
Deploy Hadoop
- Edit the hosts inventory
- [hadoop_nodes] lists every Hadoop node; [namenodes] lists every NameNode (id, rpc_port, ... must be set)
...
Appendix: installing Ansible
Install pip for Python 2.7:
curl bootstrap.pypa.io/pip/2.7/get-pip.py -o get-pip.py
python get-pip.py --user
pip -V
Install the build dependencies:
sudo yum install -y gcc glibc-devel zlib-devel rpm-build openssl-devel
sudo yum install -y python-devel python-yaml python-jinja2 python2-jmespath
Build and install
Python 2 is only supported up to the Ansible 2.9 series, so a current Ansible cannot be installed via yum.
Download the ansible 2.9.27 source and build it locally:
wget releases.ansible.com/ansible/ansible-2.9.27.tar.gz
tar -xf ansible-2.9.27.tar.gz
pushd ansible-2.9.27/
python setup.py build
sudo python setup.py install
popd
ansible --version
Set up passwordless SSH
- Generate a key pair on the control node
ssh-keygen -t rsa -b 3072
cat ~/.ssh/id_rsa.pub
- Authorize access on each managed node
cat <<EOF >> ~/.ssh/authorized_keys
ssh-rsa XXX
EOF
- Disable SSH host-key prompting on the managed nodes
vim /etc/ssh/ssh_config
# add after Host *
Host *
    StrictHostKeyChecking no
Install Prometheus
Create the prometheus user:
sudo useradd --no-create-home --shell /bin/false prometheus # 授予sudo权限 sudo visudo prometheus ALL=(ALL) NOPASSWD:ALL在官网找到下载链接
wget github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz tar -xvf prometheus-2.35.0.linux-amd64.tar.gz && sudo mv prometheus-2.35.0.linux-amd64 /usr/local/prometheus-2.35.0 sudo mkdir -p /data/prometheus/tsdb sudo mkdir -p /etc/prometheus sudo ln -s /usr/local/prometheus-2.35.0 /usr/local/prometheus sudo mv /usr/local/prometheus/prometheus.yml /etc/prometheus sudo chown -R prometheus:prometheus /usr/local/prometheus/ sudo chown -R prometheus:prometheus /data/prometheus sudo chown -R prometheus:prometheus /etc/prometheus添加到系统服务 (配置格式)
sudo vim /etc/systemd/system/prometheus.service
# 新增以下内容
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/prometheus/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/data/prometheus/tsdb \
    --web.listen-address=:9090

[Install]
WantedBy=multi-user.target
启动服务
sudo systemctl start prometheus.service
# 查看服务状态
systemctl status prometheus.service
# 查看日志
sudo journalctl -u prometheus
# 测试
curl 127.0.0.1:9090
修改配置 prometheus.yml
scrape_configs:
  - job_name: "prometheus"
    file_sd_configs:
      - files:
          - targets/prometheus-*.yml
        refresh_interval: 1m
  - job_name: "zookeeper"
    file_sd_configs:
      - files:
          - targets/zk-cluster.yml
        refresh_interval: 1m
    metric_relabel_configs:
      - action: replace
        source_labels: ["instance"]
        target_label: "instance"
        regex: "([^:]+):.*"
        replacement: "$1"
  - job_name: "hadoop"
    file_sd_configs:
      - files:
          - targets/hadoop-cluster.yml
        refresh_interval: 1m
    metric_relabel_configs:
      - action: replace
        source_labels: ["__name__"]
        target_label: "__name__"
        regex: "Hadoop_[^_]*_(.*)"
        replacement: "$1"
      - action: replace
        source_labels: ["instance"]
        target_label: "instance"
        regex: "([^:]+):.*"
        replacement: "$1"
  - job_name: "hbase"
    file_sd_configs:
      - files:
          - targets/hbase-cluster.yml
        refresh_interval: 1m
    metric_relabel_configs:
      - action: replace
        source_labels: ["instance"]
        target_label: "instance"
        regex: "([^:]+):.*"
        replacement: "$1"
      - action: replace
        source_labels: ["stat"]
        target_label: "stat"
        regex: "(.*)th_percentile"
        replacement: "p$1"
增加 targets
pushd /etc/prometheus/targets
sudo tee -a prometheus-servers.yml <<EOF
- targets:
  - localhost:9090
  labels:
    service: prometheus
EOF
sudo tee -a zk-cluster.yml <<EOF
- targets:
  - my.zk1:7000
  - my.zk2:7000
  - my.zk3:7000
  labels:
    service: zookeeper
EOF
sudo tee -a hadoop-cluster.yml <<EOF
- targets:
  - my.hadoop1:7021
  - my.hadoop2:7021
  labels:
    role: namenode
    service: hdfs
- targets:
  - my.hadoop1:7022
  - my.hadoop2:7022
  - my.hadoop3:7022
  - my.hadoop4:7022
  labels:
    role: datanode
    service: hdfs
- targets:
  - my.hadoop1:7023
  - my.hadoop2:7023
  - my.hadoop3:7023
  labels:
    role: journalnode
    service: hdfs
- targets:
  - my.hadoop3:7024
  - my.hadoop4:7024
  labels:
    role: resourcemanager
    service: yarn
- targets:
  - my.hadoop1:7025
  - my.hadoop2:7025
  - my.hadoop3:7025
  - my.hadoop4:7025
  labels:
    role: nodemanager
    service: yarn
EOF
sudo tee -a hbase-cluster.yml <<EOF
- targets:
  - my.hbase1:7027
  - my.hbase2:7027
  labels:
    app: hmaster
    service: hbase
- targets:
  - my.hbase1:7028
  - my.hbase2:7028
  - my.hbase3:7028
  - my.hbase4:7028
  labels:
    app: regionserver
    service: hbase
EOF
安装 Grafana
安装服务
在官网找到下载链接(选择 OSS 版):
wget https://dl.grafana.com/oss/release/grafana-8.5.0-1.x86_64.rpm
sudo yum install grafana-8.5.0-1.x86_64.rpm
# 查看安装后生成的配置文件
rpm -ql grafana
修改配置 grafana.ini
sudo vim /etc/grafana/grafana.ini
# 存储路径
[paths]
data = /data/grafana/data
logs = /data/grafana/logs
# 管理员账号
[security]
admin_user = admin
admin_password = admin
启动 grafana 服务
sudo mkdir -p /data/grafana/{data,logs} && sudo chown -R grafana:grafana /data/grafana
sudo systemctl start grafana-server
systemctl status grafana-server
# 测试
curl 127.0.0.1:3000
配置 LDAP
修改配置文件 grafana.ini
sudo vim /etc/grafana/grafana.ini
# 开启 LDAP
[auth.ldap]
enabled = true
# 调整日志等级为 debug 方便调试(可选)
[log]
level = debug
增加 ldap 配置(参考)
sudo vim /etc/grafana/ldap.toml
[[servers]]
# LDAP 服务
host = "ldap.service.com"
port = 389
# 访问授权
bind_dn = "cn=ldap_sync,cn=Users,dc=staff,dc=my,dc=com"
bind_password = """???"""
# 查找范围
search_filter = "(sAMAccountName=%s)"
search_base_dns = ["ou=Employees,dc=staff,dc=my,dc=com"]
# 用户信息映射
[servers.attributes]
name = "givenname"
surname = "cn"
username = "cn"
email = "mail"
# 权限映射相关配置,此处忽略...
启动 grafana 服务
systemctl restart grafana-server
# 在界面登录,并观察日志(需要 ctrl + G 定位到末尾)
sudo journalctl -u grafana-server
配置 Dashboard
添加数据源
使用 admin 账号登录,添加 Prometheus 作为数据源:
Configuration (侧边栏) -> Data sources (进入子页面) -> Add data source (蓝色按钮) -> Prometheus (列表选项) -> 填写 URL

wget http://www.benf.org/other/cfr/cfr-0.152.jar
# 反编译 class
java -jar cfr-0.152.jar hbase-server-2.4.11/org/apache/hadoop/hbase/fs/HFileSystem.class > A.java
java -jar cfr-0.152.jar patch/org/apache/hadoop/hbase/fs/HFileSystem.class > B.java
# 查看修改是否成功
diff A.java B.java
# 检查完毕后,将 patch 后的 class 文件打包进 hbase-server-2.4.11.jar 包
cd patch
jar -uf ../hbase-server-2.4.11.jar org/apache/hadoop/hbase/fs/HFileSystem.class
配置详解
为了保证配置变更的可追溯性,使用 Git 新建了一个工程来维护部署脚本,整个工程的目录结构如下:
.
├── hosts
├── ansible.cfg
├── book
│ ├── config-hadoop.yml
│ ├── config-hbase.yml
│ ├── config-metrics.yml
│ ├── config-zk.yml
│ ├── install-hadoop.yml
│ ├── sync-host.yml
│ └── vars.yml
├── conf
│ ├── hadoop
│ │ ├── core-site.xml
│ │ ├── hdfs-site.xml
│ │ ├── mapred-site.xml
│ │ ├── workers
│ │ └── yarn-site.xml
│ ├── hbase
│ │ ├── backup-masters
│ │ ├── hbase-site.xml
│ │ └── regionservers
│ ├── metrics
│ │ ├── exports
│ │ │ ├── hmaster.yml
│ │ │ ├── jmx_exporter.yml
│ │ │ └── regionserver.yml
│ │ └── targets
│ │ ├── hadoop-cluster.yml
│ │ ├── hbase-cluster.yml
│ │ └── zk-cluster.yml
│ └── zk
│ ├── myid
│ └── zoo.cfg
└── repo
├── hadoop
│ ├── apache-zookeeper-3.6.3-bin.tar.gz
│ ├── hadoop-3.2.3.tar.gz
│ ├── hbase-2.4.11-bin.tar.gz
│ ├── hbase-2.4.11-src.tar.gz
│ ├── hbase-server-2.4.11.jar
│ ├── OpenJDK17U-jdk_x64_linux_hotspot_17.0.2_8.tar.gz
│ ├── OpenJDK8U-jdk_x64_linux_hotspot_8u322b06.tar.gz
│ └── repo.md5
└── metrics
└── jmx_prometheus_javaagent-0.16.1.jar
各个目录的作用
- repo :存储用于部署的二进制文件
- book :存储 ansible-playbook 的自动化脚本
- conf :存储 HBase 组件的配置模板
对主机进行分类,便于规划集群部署:
[newborn]
[nodes]
172.20.72.1 hostname='my.hadoop1 my.hbase1 my.zk1'
172.20.72.2 hostname='my.hadoop2 my.hbase2 my.zk2'
172.20.72.3 hostname='my.hadoop3 my.hbase3 my.zk3'
172.20.72.4 hostname='my.hadoop4 my.hbase4'
[zk_nodes]
my.zk1 ansible_host=172.30.73.209 myid=1
my.zk2 ansible_host=172.30.73.210 myid=2
my.zk3 ansible_host=172.30.73.211 myid=3
[hadoop_nodes]
my.hadoop[1:4]
[namenodes]
my.hadoop1 id=nn1 rpc_port=8020
…
conf/hadoop 目录
core-site.xml
<?xml version="1.0"?>
<configuration>
<!-- 指定 HDFS 集群访问入口 -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://{{ hdfs_name }}</value>
</property>
<!-- 指定数据存储目录 -->
<property>
<name>hadoop.tmp.dir</name>
<value>{{ hadoop_data_dir }}</value>
</property>
<!-- 指定 Web 用户权限(默认用户 dr.who 无法上传文件) -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>{{ ansible_user }}</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<configuration>
<!-- NameNode 数据存储目录 -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.tmp.dir}/name</value>
</property>
<!-- DataNode 数据存储目录 -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.tmp.dir}/data</value>
</property>
<!-- JournalNode 数据存储目录(绝对路径,不能带 file://) -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>${hadoop.tmp.dir}/journal</value>
</property>
<!-- HDFS 集群名称 -->
<property>
<name>dfs.nameservices</name>
<value>{{ hdfs_name }}</value>
</property>
<!-- 集群 NameNode 节点列表 -->
<property>
<name>dfs.ha.namenodes.{{hdfs_name}}</name>
<value>{{ groups['namenodes'] | map('extract', hostvars) | map(attribute='id') | join(',') }}</value>
</property>
<!-- NameNode RPC 地址 -->
{% for host in groups['namenodes'] %}
<property>
<name>dfs.namenode.rpc-address.{{hdfs_name}}.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['rpc_port']}}</value>
</property>
{% endfor %}
<!-- NameNode HTTP 地址 -->
{% for host in groups['namenodes'] %}
<property>
<name>dfs.namenode.http-address.{{hdfs_name}}.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['http_port']}}</value>
</property>
{% endfor %}
<!-- NameNode 共享编辑日志目录 (JournalNode 集群) -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://{{ groups['journalnodes'] | zip( groups['journalnodes']|map('extract', hostvars)|map(attribute='journal_port') )| map('join', ':') | join(';') }}/{{hdfs_name}}</value>
</property>
<!-- fail-over 代理类 (client 通过 proxy 来确定 Active NameNode) -->
<property>
<name>dfs.client.failover.proxy.provider.{{ hdfs_name }}</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- 隔离机制 (保证只存在唯一的 Active NameNode) -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- SSH 隔离机制依赖的登录秘钥 -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/{{ ansible_user }}/.ssh/id_rsa</value>
</property>
<!-- 启用自动故障转移 -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- NameNode 工作线程数量 -->
<property>
<name>dfs.namenode.handler.count</name>
<value>21</value>
</property>
</configuration>
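上面模板中反复出现 `groups['namenodes'] | map('extract', hostvars) | map(attribute='id') | join(',')` 这样的过滤器链,含义不太直观。下面用一小段纯 Python 演示它的语义(groups 与 hostvars 为假设的示例数据,并非真实 inventory):

```python
# 纯 Python 演示 Jinja2 过滤器链的语义(示例数据为假设值)
groups = {"namenodes": ["my.hadoop1", "my.hadoop2"]}
hostvars = {
    "my.hadoop1": {"id": "nn1", "rpc_port": 8020},
    "my.hadoop2": {"id": "nn2", "rpc_port": 8020},
}

# map('extract', hostvars) 相当于逐个取 hostvars[host]
# map(attribute='id') 相当于取出每个 dict 的 id 字段
namenode_ids = ",".join(hostvars[h]["id"] for h in groups["namenodes"])
print(namenode_ids)  # nn1,nn2
```

渲染结果正是 dfs.ha.namenodes.{{hdfs_name}} 需要的 "nn1,nn2" 形式。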
yarn-site.xml
<?xml version="1.0"?>
<configuration>
<!-- 启用 ResourceManager HA -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<!-- YARN 集群名称 -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>{{yarn_name}}</value>
</property>
<!-- ResourceManager 节点列表 -->
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>{{ groups['resourcemanagers'] | map('extract', hostvars) | map(attribute='id') | join(',') }}</value>
</property>
<!-- ResourceManager 地址 -->
{% for host in groups['resourcemanagers'] %}
<property>
<name>yarn.resourcemanager.hostname.{{hostvars[host]['id']}}</name>
<value>{{host}}</value>
</property>
{% endfor %}
<!-- ResourceManager 内部通信地址 -->
{% for host in groups['resourcemanagers'] %}
<property>
<name>yarn.resourcemanager.address.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['peer_port']}}</value>
</property>
{% endfor %}
<!-- NM 访问 ResourceManager 地址 -->
{% for host in groups['resourcemanagers'] %}
<property>
<name>yarn.resourcemanager.resource-tracker.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['tracker_port']}}</value>
</property>
{% endfor %}
<!-- AM 向 ResourceManager 申请资源地址 -->
{% for host in groups['resourcemanagers'] %}
<property>
<name>yarn.resourcemanager.scheduler.address.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['scheduler_port']}}</value>
</property>
{% endfor %}
<!-- ResourceManager Web 入口 -->
{% for host in groups['resourcemanagers'] %}
<property>
<name>yarn.resourcemanager.webapp.address.{{hostvars[host]['id']}}</name>
<value>{{host}}:{{hostvars[host]['web_port']}}</value>
</property>
{% endfor %}
<!-- 启用自动故障转移 -->
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<!-- 指定 Zookeeper 列表 -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>{{ groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',') }}</value>
</property>
<!-- 将状态信息存储在 Zookeeper 集群-->
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<!-- 减少 ResourceManager 处理 Client 请求的线程-->
<property>
<name>yarn.resourcemanager.scheduler.client.thread-count</name>
<value>10</value>
</property>
<!-- 禁止 NodeManager 自适应硬件配置(非独占节点)-->
<property>
<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
<value>false</value>
</property>
<!-- NodeManager 给容器分配的 CPU 核数-->
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
<!-- NodeManager 使用物理核计算 CPU 数量(可选)-->
<property>
<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
<value>false</value>
</property>
<!-- 减少 NodeManager 使用内存-->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- 容器内存下限 -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<!-- 容器内存上限 -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<!-- 容器CPU下限 -->
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<!-- 容器CPU上限 -->
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
</property>
<!-- 关闭虚拟内存检查 -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- 设置虚拟内存和物理内存的比例 -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<!-- NodeManager 在 MR 过程中使用 Shuffle(可选)-->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
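按上面 yarn-site.xml 的取值(NodeManager 4096MB / 4 vcores,容器内存 1024–2048MB、CPU 1–2 核),可以粗略估算单个 NodeManager 能同时运行的容器数。下面是一个简单的验算草稿(仅为示意,实际还受调度器与 AM 占用影响):

```python
# 按 yarn-site.xml 中的取值估算单个 NodeManager 的容器数(示意)
nm_memory_mb, nm_vcores = 4096, 4          # NodeManager 可分配内存与 CPU
min_alloc_mb, max_alloc_mb = 1024, 2048    # 容器内存下限 / 上限
min_vcores, max_vcores = 1, 2              # 容器 CPU 下限 / 上限

# 按最小规格申请:内存、CPU 各自允许的容器数取较小值
small = min(nm_memory_mb // min_alloc_mb, nm_vcores // min_vcores)
# 按最大规格申请
large = min(nm_memory_mb // max_alloc_mb, nm_vcores // max_vcores)
print(small, large)  # 4 2
```

即每个节点最少可跑 2 个大容器、最多 4 个小容器,与上面内存/CPU 上下限的配置意图一致。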
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- MapReduce 运行在 YARN 上 -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- MapReduce Classpath -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<!-- MapReduce JVM 参数(不允许换行) -->
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx1024m --add-opens java.base/java.lang=ALL-UNNAMED</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>--add-opens java.base/java.lang=ALL-UNNAMED -verbose:gc -Xloggc:/tmp/@taskid@.gc</value>
</property>
</configuration>
workers
{% for host in groups['datanodes'] %}
{{ host }}
{% endfor %}
conf/hbase 目录
hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.tmp.dir</name>
<value>./tmp</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://{{ hdfs_name }}/hbase</value>
</property>
<property>
<name>hbase.master.maxclockskew</name>
<value>180000</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>{{ groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',') }}</value>
</property>
</configuration>
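hbase.zookeeper.quorum 用 `map('regex_replace','^(.+)$','\\1:2181')` 给每个 ZK 主机名补上端口。下面用 Python 的 re 模块演示同样的替换逻辑(主机名为假设的示例值):

```python
import re

# 模拟 Jinja2 表达式:
# groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',')
zk_nodes = ["my.zk1", "my.zk2", "my.zk3"]  # 假设的主机名
quorum = ",".join(re.sub(r"^(.+)$", r"\1:2181", h) for h in zk_nodes)
print(quorum)  # my.zk1:2181,my.zk2:2181,my.zk3:2181
```

yarn-site.xml 中的 yarn.resourcemanager.zk-address 也是同一个表达式,两处渲染结果完全一致。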
regionservers
{% for host in groups['regionservers'] %}
{{ host }}
{% endfor %}
backup-masters
{% for host in groups['hmasters'][1:] %}
{{ host }}
{% endfor %}
conf/metrics/exports 目录
jmx_exporter.yml
---
# https://github.com/prometheus/jmx_exporter
startDelaySeconds: 5
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
# ignore service
- pattern: Hadoop<service=(\w+), name=([\w-.]+), sub=(\w+)><>([\w._]+)
name: $4
labels:
name: "$2"
group: "$3"
attrNameSnakeCase: true
# ignore service
- pattern: Hadoop<service=(\w+), name=(\w+)-([^<]+)><>([\w._]+)
name: $4
labels:
name: "$2"
entity: "$3"
attrNameSnakeCase: true
# ignore service
- pattern: Hadoop<service=(\w+), name=([^<]+)><>([\w._]+)
name: $3
labels:
name: "$2"
attrNameSnakeCase: true
- pattern: .+
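jmx_exporter 的 rules 按顺序匹配,第一条命中的规则生效,末尾的 `.+` 是兜底规则。下面用 Python 正则演示第三条规则的捕获组如何拆出指标名与 name 标签(MBean 名为假设的示例值):

```python
import re

# 演示 jmx_exporter 规则 Hadoop<service=(\w+), name=([^<]+)><>([\w._]+) 的捕获语义
pattern = re.compile(r"Hadoop<service=(\w+), name=([^<]+)><>([\w._]+)")
mbean = "Hadoop<service=NameNode, name=FSNamesystemState><>num_live_data_nodes"  # 示例值

m = pattern.match(mbean)
metric_name = m.group(3)  # 对应规则中的 name: $3
label_name = m.group(2)   # 对应规则中的 labels.name: "$2"
print(metric_name)  # num_live_data_nodes
print(label_name)   # FSNamesystemState
```

即该 MBean 会被导出为指标 num_live_data_nodes{name="FSNamesystemState"}。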
hmaster.yml
---
startDelaySeconds: 5
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
blacklistObjectNames:
- "Hadoop:service=HBase,name=JvmMetrics*"
- "Hadoop:service=HBase,name=RegionServer,*"
rules:
- pattern: Hadoop<service=HBase, name=Master, sub=(\w+)><>([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
name: $2
labels:
group: "$1"
stat: "$3"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=Master, sub=(\w+)><>([\w._]+)
name: $2
labels:
group: "$1"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=Master><>([\w._]+)
name: $1
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+), sub=(\w+)><>([\w._]+)
name: $3
labels:
name: "$1"
group: "$2"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+)><>([\w._]+)
name: $2
labels:
name: "$1"
attrNameSnakeCase: true
- pattern: .+
regionserver.yml
---
startDelaySeconds: 5
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
blacklistObjectNames:
- "Hadoop:service=HBase,name=JvmMetrics*"
- "Hadoop:service=HBase,name=Master,*"
rules:
- pattern: Hadoop<service=HBase, name=RegionServer, sub=Regions><>namespace_([\w._]+)_table_([\w._]+)_region_(\w+)_metric_([\w._]+)
name: $4
labels:
group: Regions
namespace: "$1"
table: "$2"
region: "$3"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=Tables><>namespace_([\w._]+)_table_([\w._]+)_columnfamily_([\w._]+)_metric_([\w._]+)
name: $4
labels:
group: Tables
namespace: "$1"
table: "$2"
column_family: "$3"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>namespace_([\w._]+)_table_([\w._]+)_metric_([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
name: $4
labels:
group: "$1"
namespace: "$2"
table: "$3"
stat: "$5"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>namespace_([\w._]+)_table_([\w._]+)_metric_([\w._]+)
name: $4
labels:
group: "$1"
namespace: "$2"
table: "$3"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
name: $2
labels:
group: "$1"
stat: "$3"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>([\w._]+)
name: $2
labels:
group: "$1"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+), sub=(\w+)><>([\w._]+)
name: $3
labels:
name: "$1"
group: "$2"
attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+)><>([\w._]+)
name: $2
labels:
name: "$1"
attrNameSnakeCase: true
- pattern: .+
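上面的规则会把 `_99th_percentile` 之类的后缀拆进 stat 标签,Prometheus 端的 metric_relabel_configs 再用 `(.*)th_percentile` -> `p$1` 把它改写成更紧凑的形式。下面用 Python 演示这条改写规则(样本 stat 值为假设):

```python
import re

# 演示 prometheus.yml 中 stat 标签的改写:regex "(.*)th_percentile" -> "p$1"
stats = ["99th_percentile", "99.9th_percentile", "mean"]  # 示例取值
rewritten = [re.sub(r"^(.*)th_percentile$", r"p\1", s) for s in stats]
print(rewritten)  # ['p99', 'p99.9', 'mean']
```

不匹配的值(如 mean)保持原样,这正是 relabel 的 replace 动作只在 regex 命中时生效的效果。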
conf/metrics/targets 目录
zk-cluster.yml
- targets:
{% for host in groups['zk_nodes'] %}
- {{ host }}:7000
{% endfor %}
labels:
service: zookeeper
hadoop-cluster.yml
- targets:
{% for host in groups['namenodes'] %}
- {{ host }}:{{ namenode_metrics_port }}
{% endfor %}
labels:
role: namenode
service: hdfs
- targets:
{% for host in groups['datanodes'] %}
- {{ host }}:{{ datanode_metrics_port }}
{% endfor %}
labels:
role: datanode
service: hdfs
- targets:
{% for host in groups['journalnodes'] %}
- {{ host }}:{{ journalnode_metrics_port }}
{% endfor %}
labels:
role: journalnode
service: hdfs
- targets:
{% for host in groups['resourcemanagers'] %}
- {{ host }}:{{ resourcemanager_metrics_port }}
{% endfor %}
labels:
role: resourcemanager
service: yarn
- targets:
{% for host in groups['datanodes'] %}
- {{ host }}:{{ nodemanager_metrics_port }}
{% endfor %}
labels:
role: nodemanager
service: yarn
hbase-cluster.yml
- targets:
{% for host in groups['hmasters'] %}
- {{ host }}:{{ hmaster_metrics_port }}
{% endfor %}
labels:
role: hmaster
service: hbase
- targets:
{% for host in groups['regionservers'] %}
- {{ host }}:{{ regionserver_metrics_port }}
{% endfor %}
labels:
role: regionserver
service: hbase
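这些模板渲染出的就是 Prometheus file_sd 所需的目标组结构:一组 host:port 加一组静态标签。下面用 Python 构造同样的数据结构做个示意(主机与端口为假设的示例值):

```python
# 演示 hadoop-cluster.yml 模板渲染出的 file_sd 目标组结构(示例值)
namenodes = ["my.hadoop1", "my.hadoop2"]
namenode_metrics_port = 7021

target_group = {
    "targets": [f"{h}:{namenode_metrics_port}" for h in namenodes],
    "labels": {"role": "namenode", "service": "hdfs"},
}
print(target_group["targets"])  # ['my.hadoop1:7021', 'my.hadoop2:7021']
```

Prometheus 周期性重读这些文件(refresh_interval: 1m),因此扩容节点后只需重新渲染模板,无需重启 Prometheus。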
book 目录
vars.yml
hdfs_name: my-hdfs
yarn_name: my-yarn
sync-host.yml
---
- name: Config Hostname & SSH Keys
hosts: nodes
connection: local
gather_facts: no
any_errors_fatal: true
vars:
hostnames: |
{% for h in groups['nodes'] if hostvars[h].hostname is defined %}{{h}} {{ hostvars[h].hostname }}
{% endfor %}
tasks:
- name: test connectivity
ping:
connection: ssh
- name: change local hostname
become: true
blockinfile:
dest: '/etc/hosts'
marker: "# {mark} ANSIBLE MANAGED HOSTNAME"
block: '{{ hostnames }}'
run_once: true
- name: sync remote hostname
become: true
blockinfile:
dest: '/etc/hosts'
marker: "# {mark} ANSIBLE MANAGED HOSTNAME"
block: '{{ hostnames }}'
connection: ssh
- name: fetch exist status
stat:
path: '~/.ssh/id_rsa'
register: ssh_key_path
connection: ssh
- name: generate ssh key
openssh_keypair:
path: '~/.ssh/id_rsa'
comment: '{{ ansible_user }}@{{ inventory_hostname }}'
type: rsa
size: 2048
state: present
force: no
connection: ssh
when: not ssh_key_path.stat.exists
- name: collect ssh key
command: ssh {{ansible_user}}@{{ansible_host|default(inventory_hostname)}} 'cat ~/.ssh/id_rsa.pub'
register: host_keys # cache data in hostvars[hostname].host_keys
changed_when: false
- name: create temp file
tempfile:
state: file
suffix: _keys
register: temp_ssh_keys
changed_when: false
run_once: true
- name: save ssh key ({{temp_ssh_keys.path}})
blockinfile:
dest: "{{temp_ssh_keys.path}}"
block: |
{% for h in groups['nodes'] if hostvars[h].host_keys is defined %}
{{ hostvars[h].host_keys.stdout }}
{% endfor %}
changed_when: false
run_once: true
- name: deploy ssh key
vars:
ssh_keys: "{{ lookup('file', temp_ssh_keys.path).split('\n') | select('match', '^ssh') | join('\n') }}"
authorized_key:
user: "{{ ansible_user }}"
key: "{{ ssh_keys }}"
state: present
connection: ssh
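deploy ssh key 一步里的 `ssh_keys` 变量通过 `split('\n') | select('match', '^ssh') | join('\n')` 从临时文件里筛出公钥行、丢掉 blockinfile 的 marker 注释。下面用纯 Python 演示同样的过滤逻辑(文件内容为假设的示例):

```python
# 演示 ssh_keys 变量的过滤链:按行拆分 -> 保留 ^ssh 开头的行 -> 重新拼接
collected = """# BEGIN ANSIBLE MANAGED BLOCK
ssh-rsa AAAA... user@my.hadoop1

ssh-rsa BBBB... user@my.hadoop2
# END ANSIBLE MANAGED BLOCK"""  # 临时文件内容(示例)

ssh_keys = "\n".join(line for line in collected.split("\n") if line.startswith("ssh"))
print(ssh_keys)
```

这样 authorized_key 模块拿到的就是干净的多行公钥列表,注释与空行都被过滤掉了。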
install-hadoop.yml
---
- name: Install Hadoop Package
hosts: newborn
gather_facts: no
any_errors_fatal: true
vars:
local_repo: '../repo/hadoop'
remote_repo: '~/repo/hadoop'
package_info:
- {src: 'OpenJDK17U-jdk_x64_linux_hotspot_17.0.2_8.tar.gz', dst: 'java/jdk-17.0.2+8', home: 'jdk17'}
- {src: 'OpenJDK8U-jdk_x64_linux_hotspot_8u322b06.tar.gz', dst: 'java/jdk8u322-b06', home: 'jdk8'}
- {src: 'apache-zookeeper-3.6.3-bin.tar.gz', dst: 'apache/zookeeper-3.6.3', home: 'zookeeper'}
- {src: 'hbase-2.4.11-bin.tar.gz', dst: 'apache/hbase-2.4.11',home: 'hbase'}
- {src: 'hadoop-3.2.3.tar.gz', dst: 'apache/hadoop-3.2.3', home: 'hadoop'}
tasks:
- name: test connectivity
ping:
- name: copy hadoop package
copy:
src: '{{ local_repo }}'
dest: '~/repo'
- name: prepare directory
become: true # become root
file:
state: directory
path: '{{ deploy_dir }}/{{ item.dst }}'
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
mode: 0775
recurse: yes
with_items: '{{ package_info }}'
- name: create link
become: true # become root
file:
state: link
src: '{{ deploy_dir }}/{{ item.dst }}'
dest: '{{ deploy_dir }}/{{ item.home }}'
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
with_items: '{{ package_info }}'
- name: install package
unarchive:
src: '{{ remote_repo }}/{{ item.src }}'
dest: '{{ deploy_dir }}/{{ item.dst }}'
remote_src: yes
extra_opts:
- --strip-components=1
with_items: '{{ package_info }}'
- name: config /etc/profile
become: true
blockinfile:
dest: '/etc/profile'
marker: "# {mark} ANSIBLE MANAGED PROFILE"
block: |
export JAVA_HOME={{ deploy_dir }}/jdk8
export HADOOP_HOME={{ deploy_dir }}/hadoop
export HBASE_HOME={{ deploy_dir }}/hbase
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$PATH
- name: config zkEnv.sh
lineinfile:
path: '{{ deploy_dir }}/zookeeper/bin/zkEnv.sh'
line: 'JAVA_HOME={{ deploy_dir }}/jdk17'
insertafter: '^#\!\/usr\/bin'
firstmatch: yes
- name: config hadoop-env.sh
blockinfile:
dest: '{{ deploy_dir }}/hadoop/etc/hadoop/hadoop-env.sh'
marker: "# {mark} ANSIBLE MANAGED DEFAULT HADOOP ENV"
block: |
export JAVA_HOME={{ deploy_dir }}/jdk8
- name: config hbase-env.sh
blockinfile:
dest: '{{ deploy_dir }}/hbase/conf/hbase-env.sh'
marker: "# {mark} ANSIBLE MANAGED DEFAULT HBASE ENV"
block: |
export JAVA_HOME={{ deploy_dir }}/jdk17
export HBASE_MANAGES_ZK=false
export HBASE_LIBRARY_PATH={{ deploy_dir }}/hadoop/lib/native
export HBASE_OPTS="$HBASE_OPTS --add-exports=java.base/jdk.internal.access=ALL-UNNAMED --add-exports=java.base/jdk.internal=ALL-UNNAMED --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED --add-exports=java.base/sun.security.pkcs=ALL-UNNAMED --add-exports=java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.lang.reflect=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/jdk.internal=ALL-UNNAMED --add-opens java.base/jdk.internal.misc=ALL-UNNAMED --add-opens java.base/jdk.internal.access=ALL-UNNAMED"
- name: patch hbase
copy:
src: '{{ local_repo }}/hbase-server-2.4.11.jar'
dest: '{{ deploy_dir }}/hbase/lib'
backup: no
force: yes
- name: link hadoop config
file:
state: link
src: '{{ deploy_dir }}/hadoop/etc/hadoop/{{ item }}'
dest: '{{ deploy_dir }}/hbase/conf/{{ item }}'
with_items:
- core-site.xml
- hdfs-site.xml
- name: add epel-release repo
shell: 'sudo yum -y install epel-release && sudo yum makecache'
- name: install native libary
shell: 'sudo yum -y install snappy snappy-devel lz4 lz4-devel libzstd libzstd-devel'
- name: check hadoop native
shell: '{{ deploy_dir }}/hadoop/bin/hadoop checknative -a'
register: hadoop_checknative
failed_when: false
changed_when: false
ignore_errors: yes
environment:
JAVA_HOME: '{{ deploy_dir }}/jdk8'
- name: hadoop native status
debug:
msg: "{{ hadoop_checknative.stdout_lines }}"
- name: check hbase native
shell: '{{ deploy_dir }}/hbase/bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker'
register: hbase_checknative
failed_when: false
changed_when: false
ignore_errors: yes
environment:
JAVA_HOME: '{{ deploy_dir }}/jdk17'
HBASE_LIBRARY_PATH: '{{ deploy_dir }}/hadoop/lib/native'
- name: hbase native status
debug:
msg: "{{ hbase_checknative.stdout_lines|select('match', '^[^0-9]') | list }}"
- name: test native compression
shell: '{{ deploy_dir }}/hbase/bin/hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/test {{ item }}'
register: 'compression'
failed_when: false
changed_when: false
ignore_errors: yes
environment:
JAVA_HOME: '{{ deploy_dir }}/jdk17'
HBASE_LIBRARY_PATH: '{{ deploy_dir }}/hadoop/lib/native'
with_items:
- snappy
- lz4
- name: native compression status
vars:
results: "{{ compression | json_query('results[*].{type:item, result:stdout}') }}"
debug:
msg: |
{% for r in results %} {{ r.type }} => {{ r.result == 'SUCCESS' }} {% endfor %}
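上一步用 `json_query('results[*].{type:item, result:stdout}')` 把 loop 注册变量整理成 {type, result} 列表。下面用纯 Python 演示这个 JMESPath 投影的语义(register 数据为假设的示例):

```python
# 演示 json_query('results[*].{type:item, result:stdout}') 的语义(示例数据)
compression = {
    "results": [  # register 变量中 loop 每次执行的结果
        {"item": "snappy", "stdout": "SUCCESS"},
        {"item": "lz4", "stdout": "SUCCESS"},
    ]
}

results = [{"type": r["item"], "result": r["stdout"]} for r in compression["results"]]
for r in results:
    print(f"{r['type']} => {r['result'] == 'SUCCESS'}")
# snappy => True
# lz4 => True
```

debug 任务里的 Jinja 循环打印的正是这种 "type => 是否成功" 的形式。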
config-zk.yml
---
- name: Change Zk Config
hosts: zk_nodes
gather_facts: no
any_errors_fatal: true
vars:
template_dir: ../conf/zk
zk_home: '{{ deploy_dir }}/zookeeper'
zk_data_dir: '{{ zk_home }}/status/data'
zk_data_log_dir: '{{ zk_home }}/status/logs'
tasks:
- name: Create data directory
file:
state: directory
path: '{{ item }}'
recurse: yes
with_items:
- '{{ zk_data_dir }}'
- '{{ zk_data_log_dir }}'
- name: Init zookeeper myid
template:
src: '{{ template_dir }}/myid'
dest: '{{ zk_data_dir }}'
- name: Update zookeeper env
become: true
blockinfile:
dest: '{{ zk_home }}/bin/zkEnv.sh'
marker: "# {mark} ANSIBLE MANAGED ZK ENV"
block: |
export SERVER_JVMFLAGS="-Xmx1G -XX:+UseShenandoahGC -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608"
notify:
- Restart zookeeper service
- name: Update zookeeper config
template:
src: '{{ template_dir }}/zoo.cfg'
dest: '{{ zk_home }}/conf'
notify:
- Restart zookeeper service
handlers:
- name: Restart zookeeper service
shell:
cmd: '{{ zk_home }}/bin/zkServer.sh restart'
config-hadoop.yml
---
- name: Change Hadoop Config
hosts: hadoop_nodes
gather_facts: no
any_errors_fatal: true
vars:
template_dir: ../conf/hadoop
hadoop_home: '{{ deploy_dir }}/hadoop'
hadoop_conf_dir: '{{ hadoop_home }}/etc/hadoop'
hadoop_data_dir: '{{ data_dir }}/hadoop'
tasks:
- name: Include common vars
include_vars: file=vars.yml
- name: Create data directory
become: true
file:
state: directory
path: '{{ hadoop_data_dir }}'
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
mode: 0775
recurse: yes
- name: Sync hadoop config
template:
src: '{{ template_dir }}/{{ item }}'
dest: '{{ hadoop_conf_dir }}/{{ item }}'
with_items:
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml
- workers
- name: Config hadoop env
blockinfile:
dest: '{{ hadoop_conf_dir }}/hadoop-env.sh'
marker: "# {mark} ANSIBLE MANAGED HADOOP ENV"
block: |
export HADOOP_PID_DIR={{ hadoop_home }}/pid
export HADOOP_LOG_DIR={{ hadoop_data_dir }}/logs
JVM_OPTS="-XX:+AlwaysPreTouch"
export HDFS_JOURNALNODE_OPTS="-Xmx1G $JVM_OPTS $HDFS_JOURNALNODE_OPTS"
export HDFS_NAMENODE_OPTS="-Xmx4G $JVM_OPTS $HDFS_NAMENODE_OPTS"
export HDFS_DATANODE_OPTS="-Xmx8G $JVM_OPTS $HDFS_DATANODE_OPTS"
- name: Config yarn env
blockinfile:
dest: '{{ hadoop_conf_dir }}/yarn-env.sh'
marker: "# {mark} ANSIBLE MANAGED YARN ENV"
block: |
JVM_OPTS=""
export YARN_RESOURCEMANAGER_OPTS="$JVM_OPTS $YARN_RESOURCEMANAGER_OPTS"
export YARN_NODEMANAGER_OPTS="$JVM_OPTS $YARN_NODEMANAGER_OPTS"
config-hbase.yml
---
- name: Change HBase Config
hosts: hbase_nodes
gather_facts: no
any_errors_fatal: true
vars:
template_dir: ../conf/hbase
hbase_home: '{{ deploy_dir }}/hbase'
hbase_conf_dir: '{{ hbase_home }}/conf'
hbase_data_dir: '{{ data_dir }}/hbase'
hbase_log_dir: '{{ hbase_data_dir }}/logs'
hbase_gc_log_dir: '{{ hbase_log_dir }}/gc'
tasks:
- name: Include common vars
include_vars: file=vars.yml
- name: Create data directory
become: true
file:
state: directory
path: '{{ item }}'
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
mode: 0775
recurse: yes
with_items:
- '{{ hbase_data_dir }}'
- '{{ hbase_log_dir }}'
- '{{ hbase_gc_log_dir }}'
- name: Sync hbase config
template:
src: '{{ template_dir }}/{{ item }}'
dest: '{{ hbase_conf_dir }}/{{ item }}'
with_items:
- hbase-site.xml
- backup-masters
- regionservers
- name: Config hbase env
blockinfile:
dest: '{{ hbase_conf_dir }}/hbase-env.sh'
marker: "# {mark} ANSIBLE MANAGED HBASE ENV"
block: |
export HBASE_LOG_DIR={{ hbase_log_dir }}
export HBASE_OPTS="-Xss256k -XX:+UseShenandoahGC -XX:+AlwaysPreTouch $HBASE_OPTS"
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xlog:gc:{{hbase_gc_log_dir}}/gc-hmaster-%p-%t.log"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xlog:gc:{{hbase_gc_log_dir}}/gc-hregion-%p-%t.log"
config-metrics.yml
---
- name: Install Metrics Package
hosts: "{{ groups['hadoop_nodes'] + groups['hbase_nodes'] }}"
gather_facts: no
any_errors_fatal: true
vars:
local_repo: '../repo/metrics'
remote_repo: '~/repo/metrics'
template_dir: ../conf/metrics
default_conf: jmx_exporter.yml
export_tmpl: '{{template_dir}}/exports'
target_tmpl: '{{template_dir}}/targets'
metrics_dir: '{{ deploy_dir }}/prometheus'
hadoop_home: '{{ deploy_dir }}/hadoop'
hbase_home: '{{ deploy_dir }}/hbase'
jmx_exporter: 'jmx_prometheus_javaagent-0.16.1.jar'
agent_path: '{{ metrics_dir }}/{{ jmx_exporter }}'
namenode_metrics_port: 7021
datanode_metrics_port: 7022
journalnode_metrics_port: 7023
resourcemanager_metrics_port: 7024
nodemanager_metrics_port: 7025
historyserver_metrics_port: 7026
hmaster_metrics_port: 7027
regionserver_metrics_port: 7028
host_to_ip: |
{ {% for h in groups['nodes'] %} {% for n in hostvars[h]['hostname'].split() %}
"{{ n }}" : "{{ h }}" ,
{% endfor %} {% endfor %} }
hadoop_metrics:
- { env: 'HDFS_NAMENODE_OPTS', conf: 'namenode.yml', port: '{{namenode_metrics_port}}', }
- { env: 'HDFS_DATANODE_OPTS', conf: 'datanode.yml', port: '{{datanode_metrics_port}}'}
- { env: 'HDFS_JOURNALNODE_OPTS', conf: 'journalnode.yml', port: '{{journalnode_metrics_port}}' }
- { env: 'YARN_RESOURCEMANAGER_OPTS', conf: 'resourcemanager.yml', port: '{{resourcemanager_metrics_port}}' }
- { env: 'YARN_NODEMANAGER_OPTS', conf: 'nodemanager.yml', port: '{{nodemanager_metrics_port}}' }
- { env: 'MAPRED_HISTORYSERVER_OPTS', conf: 'historyserver.yml', port: '{{historyserver_metrics_port}}' }
hbase_metrics:
- { env: 'HBASE_MASTER_OPTS', conf: 'hmaster.yml', port: '{{hmaster_metrics_port}}' }
- { env: 'HBASE_REGIONSERVER_OPTS', conf: 'regionserver.yml', port: '{{regionserver_metrics_port}}'}
tasks:
- name: test connectivity
ping:
- name: copy metrics package
copy:
src: '{{ local_repo }}'
dest: '~/repo'
- name: ensure metrics dir
become: true
file:
path: '{{ metrics_dir }}'
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
state: directory
- name: install jmx exporter
copy:
src: '{{ remote_repo }}/{{ jmx_exporter }}'
dest: '{{ metrics_dir }}/{{ jmx_exporter }}'
remote_src: yes
- name: fetch existing exporter configs
stat:
path: '{{ export_tmpl }}/{{ item }}'
with_items: "{{ (hadoop_metrics + hbase_metrics) | map(attribute='conf') | list }}"
register: metric_tmpl
run_once: yes
connection: local
- name: update hadoop exporter config
vars:
metrics_ip: '{{host_to_ip[inventory_hostname]}}'
metrics_port: '{{ item.port }}'
custom_tmpl: "{{ item.conf in (metric_tmpl | json_query('results[?stat.exists].item')) }}"
template:
src: '{{ export_tmpl }}/{{ item.conf if custom_tmpl else default_conf }}'
dest: '{{ metrics_dir }}/{{ item.conf }}'
with_items: '{{ hadoop_metrics }}'
when: inventory_hostname in groups['hadoop_nodes']
- name: update hbase exporter config
vars:
metrics_ip: '{{host_to_ip[inventory_hostname]}}'
metrics_port: '{{ item.port }}'
custom_tmpl: "{{ item.conf in (metric_tmpl | json_query('results[?stat.exists].item')) }}"
template:
src: '{{ export_tmpl }}/{{ item.conf if custom_tmpl else default_conf }}'
dest: '{{ metrics_dir }}/{{ item.conf }}'
with_items: '{{ hbase_metrics }}'
when: inventory_hostname in groups['hbase_nodes']
- name: config hadoop-env.sh
blockinfile:
dest: '{{ deploy_dir }}/hadoop/etc/hadoop/hadoop-env.sh'
marker: "# {mark} ANSIBLE MANAGED DEFAULT HADOOP METRIC ENV"
block: |
{% for m in hadoop_metrics %}
export {{m.env}}="-javaagent:{{agent_path}}={{m.port}}:{{metrics_dir}}/{{m.conf}} ${{m.env}}"
{% endfor %}
when: inventory_hostname in groups['hadoop_nodes']
- name: config hbase-env.sh
blockinfile:
dest: '{{ deploy_dir }}/hbase/conf/hbase-env.sh'
marker: "# {mark} ANSIBLE MANAGED DEFAULT HBASE METRIC ENV"
block: |
{% for m in hbase_metrics %}
export {{m.env}}="-javaagent:{{agent_path}}={{m.port}}:{{metrics_dir}}/{{m.conf}} ${{m.env}}"
{% endfor %}
when: inventory_hostname in groups['hbase_nodes']
- name: ensure generated target dir
file:
path: '/tmp/gen-prometheus-targets'
state: directory
run_once: yes
connection: local
- name: generate target config to /tmp/gen-prometheus-targets
template:
src: '{{ target_tmpl }}/{{ item }}'
dest: '/tmp/gen-prometheus-targets/{{ item }}'
with_items:
- hadoop-cluster.yml
- hbase-cluster.yml
- zk-cluster.yml
run_once: yes
connection: local
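For illustration, here is roughly what the blockinfile task above renders into hadoop-env.sh. The /opt deploy directory is an assumed placeholder; the agent jar name, config file names, and ports follow the playbook vars:

```shell
# Hypothetical fragment rendered into hadoop-env.sh by the blockinfile task,
# assuming deploy_dir=/opt (so metrics_dir=/opt/prometheus).
# BEGIN ANSIBLE MANAGED DEFAULT HADOOP METRIC ENV
export HDFS_NAMENODE_OPTS="-javaagent:/opt/prometheus/jmx_prometheus_javaagent-0.16.1.jar=7021:/opt/prometheus/namenode.yml $HDFS_NAMENODE_OPTS"
export HDFS_DATANODE_OPTS="-javaagent:/opt/prometheus/jmx_prometheus_javaagent-0.16.1.jar=7022:/opt/prometheus/datanode.yml $HDFS_DATANODE_OPTS"
# END ANSIBLE MANAGED DEFAULT HADOOP METRIC ENV
```

Because the javaagent flag is prepended and the previous value is appended, re-running the play keeps any manually added options intact.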
Operation Steps

Configure the control node

- Install Ansible
- Disable the SSH host-key prompt; this is mandatory, otherwise later installation steps may hang

Initialize the machines

- Edit the hosts inventory (entries must be IP addresses)
  - [nodes] lists every node in the cluster
  - [newborn] lists the nodes that do not yet have the installation packages
- Run ansible-playbook book/sync-host.yml
- Run ansible-playbook book/install-hadoop.yml
- Edit the hosts inventory
  - [newborn] clear this group
- Edit the hosts inventory (ansible_user and myid must be configured)
  - [zk_nodes] lists every ZK node in the cluster
- Edit book/config-zk.yml to adjust the JVM parameters
- Run ansible-playbook book/config-zk.yml
- Edit the hosts inventory
  - [hadoop_nodes] lists every Hadoop node in the cluster
  - [namenodes] lists every NameNode (id, rpc_port, … must be configured)

Install Ansible

Install pip for Python 2.7:

    curl bootstrap.pypa.io/pip/2.7/get-pip.py -o get-pip.py
    python get-pip.py --user
    pip -V
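For reference, the inventory groups mentioned in the steps above might be laid out as in the following sketch. Every IP address, hostname, username, and port here is a placeholder, not a value from the original setup; the variable names (hostname, myid, ansible_user, id, rpc_port) follow the playbooks:

```shell
# Write a hypothetical hosts inventory illustrating the groups used above.
# All addresses and variable values are placeholders.
cat <<'EOF' > hosts
[nodes]
192.168.1.11 hostname=my.hadoop1
192.168.1.12 hostname=my.hadoop2

[newborn]
192.168.1.12 hostname=my.hadoop2

[zk_nodes]
192.168.1.21 ansible_user=deploy myid=1
192.168.1.22 ansible_user=deploy myid=2
192.168.1.23 ansible_user=deploy myid=3

[hadoop_nodes]
192.168.1.11
192.168.1.12

[namenodes]
192.168.1.11 id=nn1 rpc_port=8020
192.168.1.12 id=nn2 rpc_port=8020
EOF
grep -c '^\[' hosts   # prints 5 (number of group headers)
```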
- Install the dependency libraries:

    sudo yum install -y gcc glibc-devel zlib-devel rpm-build openssl-devel
    sudo yum install -y python-devel python-yaml python-jinja2 python2-jmespath

Build and install

On Python 2 only the Ansible 2.9 series is supported, and it cannot be installed via yum, so download the ansible 2.9.27 source and build it locally:

    wget releases.ansible.com/ansible/ansible-2.9.27.tar.gz
    tar -xf ansible-2.9.27.tar.gz
    pushd ansible-2.9.27/
    python setup.py build
    sudo python setup.py install
    popd
    ansible --version
Configure passwordless SSH login

- Generate a key pair on the control machine:

    ssh-keygen -t rsa -b 3072
    cat ~/.ssh/id_rsa.pub

- Authorize access on each managed machine:

    cat <<EOF >> ~/.ssh/authorized_keys
    ssh-rsa XXX
    EOF

- Disable the SSH host-key prompt on the managed machines:

    vim /etc/ssh/ssh_config
    # add the following under Host *
    Host *
        StrictHostKeyChecking no
Install Prometheus

Create a prometheus user:

    sudo useradd --no-create-home --shell /bin/false prometheus
    # grant sudo privileges
    sudo visudo
    prometheus ALL=(ALL) NOPASSWD:ALL

Find the download link on the official site:

    wget github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
    tar -xvf prometheus-2.35.0.linux-amd64.tar.gz && sudo mv prometheus-2.35.0.linux-amd64 /usr/local/prometheus-2.35.0
    sudo mkdir -p /data/prometheus/tsdb
    sudo mkdir -p /etc/prometheus
    sudo ln -s /usr/local/prometheus-2.35.0 /usr/local/prometheus
    sudo mv /usr/local/prometheus/prometheus.yml /etc/prometheus
    sudo chown -R prometheus:prometheus /usr/local/prometheus/
    sudo chown -R prometheus:prometheus /data/prometheus
    sudo chown -R prometheus:prometheus /etc/prometheus

Register it as a system service (configuration format):

    sudo vim /etc/systemd/system/prometheus.service
    # add the following content
    [Unit]
    Description=Prometheus Server
    Documentation=prometheus.io/docs/introduction/overview/
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/usr/local/prometheus/prometheus \
        --config.file=/etc/prometheus/prometheus.yml \
        --storage.tsdb.path=/data/prometheus/tsdb \
        --web.listen-address=:9090

    [Install]
    WantedBy=multi-user.target

Start the service:

    sudo systemctl start prometheus.service
    # check service status
    systemctl status prometheus.service
    # view logs
    sudo journalctl -u prometheus
    # smoke test
    curl 127.0.0.1:9090
Edit the configuration file prometheus.yml:

    scrape_configs:
      - job_name: "prometheus"
        file_sd_configs:
          - files:
              - targets/prometheus-*.yml
            refresh_interval: 1m

      - job_name: "zookeeper"
        file_sd_configs:
          - files:
              - targets/zk-cluster.yml
            refresh_interval: 1m
        metric_relabel_configs:
          - action: replace
            source_labels: ["instance"]
            target_label: "instance"
            regex: "([^:]+):.*"
            replacement: "$1"

      - job_name: "hadoop"
        file_sd_configs:
          - files:
              - targets/hadoop-cluster.yml
            refresh_interval: 1m
        metric_relabel_configs:
          - action: replace
            source_labels: ["__name__"]
            target_label: "__name__"
            regex: "Hadoop_[^_]*_(.*)"
            replacement: "$1"
          - action: replace
            source_labels: ["instance"]
            target_label: "instance"
            regex: "([^:]+):.*"
            replacement: "$1"

      - job_name: "hbase"
        file_sd_configs:
          - files:
              - targets/hbase-cluster.yml
            refresh_interval: 1m
        metric_relabel_configs:
          - action: replace
            source_labels: ["instance"]
            target_label: "instance"
            regex: "([^:]+):.*"
            replacement: "$1"
          - action: replace
            source_labels: ["stat"]
            target_label: "stat"
            regex: "(.*)th_percentile"
            replacement: "p$1"
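The metric_relabel_configs above rewrite label values with fully anchored regexes: the port is stripped from instance, the Hadoop_*_ prefix is dropped from metric names, and percentile stats are normalized. As a rough sketch of each rule (sed -E stands in here for Prometheus's RE2 engine, with the anchors written out explicitly):

```shell
# Demonstrate the effect of each relabel rule with sed.
# Prometheus regexes are fully anchored, hence the explicit ^...$.
echo "my.hbase1:7028" | sed -E 's/^([^:]+):.*$/\1/'                     # -> my.hbase1
echo "Hadoop_NameNode_FilesTotal" | sed -E 's/^Hadoop_[^_]*_(.*)$/\1/'  # -> FilesTotal
echo "99.9th_percentile" | sed -E 's/^(.*)th_percentile$/p\1/'          # -> p99.9
```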
Add the scrape targets:

    pushd /etc/prometheus/targets

    # note: "sudo cat <<EOF >> file" would not elevate the shell redirection,
    # so tee -a is used to append as root
    sudo tee -a prometheus-servers.yml <<EOF
    - targets:
        - localhost:9090
      labels:
        service: prometheus
    EOF

    sudo tee -a zk-cluster.yml <<EOF
    - targets:
        - my.zk1:7000
        - my.zk2:7000
        - my.zk3:7000
      labels:
        service: zookeeper
    EOF

    sudo tee -a hadoop-cluster.yml <<EOF
    - targets:
        - my.hadoop1:7021
        - my.hadoop2:7021
      labels:
        role: namenode
        service: hdfs
    - targets:
        - my.hadoop1:7022
        - my.hadoop2:7022
        - my.hadoop3:7022
        - my.hadoop4:7022
      labels:
        role: datanode
        service: hdfs
    - targets:
        - my.hadoop1:7023
        - my.hadoop2:7023
        - my.hadoop3:7023
      labels:
        role: journalnode
        service: hdfs
    - targets:
        - my.hadoop3:7024
        - my.hadoop4:7024
      labels:
        role: resourcemanager
        service: yarn
    - targets:
        - my.hadoop1:7025
        - my.hadoop2:7025
        - my.hadoop3:7025
        - my.hadoop4:7025
      labels:
        role: nodemanager
        service: yarn
    EOF

    sudo tee -a hbase-cluster.yml <<EOF
    - targets:
        - my.hbase1:7027
        - my.hbase2:7027
      labels:
        app: hmaster
        service: hbase
    - targets:
        - my.hbase1:7028
        - my.hbase2:7028
        - my.hbase3:7028
        - my.hbase4:7028
      labels:
        app: regionserver
        service: hbase
    EOF
Install Grafana

Install the service. Find the download link on the official site (choose the OSS edition):

    wget dl.grafana.com/oss/release/grafana-8.5.0-1.x86_64.rpm
    sudo yum install grafana-8.5.0-1.x86_64.rpm
    # list the files installed by the package
    rpm -ql grafana

Edit the configuration file grafana.ini:

    sudo vim /etc/grafana/grafana.ini

    # storage paths
    [paths]
    data = /data/grafana/data
    logs = /data/grafana/logs

    # administrator account
    [security]
    admin_user = admin
    admin_password = admin

Start the grafana service:

    sudo mkdir -p /data/grafana/{data,logs} && sudo chown -R grafana:grafana /data/grafana
    sudo systemctl start grafana-server
    systemctl status grafana-server
    # smoke test
    curl 127.0.0.1:3000
Configure LDAP

Edit the configuration file grafana.ini:

    sudo vim /etc/grafana/grafana.ini

    # enable LDAP
    [auth.ldap]
    enabled = true

    # raise the log level to debug for easier troubleshooting (optional)
    [log]
    level = debug

Add the ldap configuration (see the reference):

    sudo vim /etc/grafana/ldap.toml

    [[servers]]
    # LDAP server
    host = "ldap.service.com"
    port = 389

    # bind credentials
    bind_dn = "cn=ldap_sync,cn=Users,dc=staff,dc=my,dc=com"
    bind_password = """???"""

    # search scope
    search_filter = "(sAMAccountName=%s)"
    search_base_dns = ["ou=Employees,dc=staff,dc=my,dc=com"]

    # user attribute mapping
    [servers.attributes]
    name = "givenname"
    surname = "cn"
    username = "cn"
    email = "mail"

    # permission mapping configuration omitted here...

Restart the grafana service:

    systemctl restart grafana-server
    # log in via the web UI and watch the logs (press G to jump to the end)
    sudo journalctl -u grafana-server
Configure Dashboards

Add the data source. Log in with the admin account and add Prometheus as a data source:

    Configuration (sidebar) -> Data sources (opens sub-page) -> Add data source (blue button) -> Prometheus (list entry) -> fill in …

Appendix: verifying the HBase patch

Download the CFR decompiler (www.benf.org/other/cfr/cfr-0.152.jar), then:

    # decompile the original and patched class files
    java -jar cfr-0.152.jar hbase-server-2.4.11/org/apache/hadoop/hbase/fs/HFileSystem.class > A.java
    java -jar cfr-0.152.jar patch/org/apache/hadoop/hbase/fs/HFileSystem.class > B.java
    # check whether the modification succeeded
    diff A.java B.java
    # after checking, repack the patched class into hbase-server-2.4.11.jar
    cd patch
    jar -uf ../hbase-server-2.4.11.jar org/apache/hadoop/hbase/fs/HFileSystem.class

