Wang's Notes

A record of my programming journey

From the point, cast a ray in any direction and count how many times it crosses the polygon's boundary; if the number of crossings is odd, the point is inside the polygon, otherwise it is outside.

Odd number of crossings

The point is inside the polygon


Even number of crossings

The point is outside the polygon


When the ray passes through a vertex

Case 1

The two edges joined at the intersected vertex lie on opposite sides of the ray; this case counts as a crossing


The ray passes through vertex D; the two vertices connected to D are A and C, with A above the ray and C below it, so this counts as a crossing

Case 2

The two edges joined at the intersected vertex lie on the same side of the ray; this case does not count as a crossing


The ray passes through vertex D; the two vertices connected to D are A and C, and both A and C are below the ray (on the same side), so this does not count as a crossing

Java implementation

Never mind, I'll just go find someone else's code...
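Still, here is a minimal sketch of the ray-casting test described above, casting a horizontal ray to the right from the point; the Point class and method names are placeholders of my own, not from any particular library.

import java.util.List;

public class PointInPolygon {

    public static class Point {
        public final double x;
        public final double y;
        public Point(double x, double y) { this.x = x; this.y = y; }
    }

    // Casts a horizontal ray from p to the right and counts edge crossings.
    // An edge is counted only when its two endpoints straddle the ray
    // (one strictly above, one not), which matches the vertex rule above:
    // a vertex whose neighbours lie on the same side contributes 0 or 2 crossings.
    public static boolean contains(List<Point> polygon, Point p) {
        boolean inside = false;
        int n = polygon.size();
        for (int i = 0, j = n - 1; i < n; j = i++) {
            Point a = polygon.get(i);
            Point b = polygon.get(j);
            if ((a.y > p.y) != (b.y > p.y)) {
                // x coordinate where edge (a, b) crosses the horizontal line y = p.y
                double crossX = (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x;
                if (p.x < crossX) {
                    inside = !inside;
                }
            }
        }
        return inside;
    }
}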


Notes from testing a throwaway Maven project that depends on an artifact that does not exist

Tested with the mvn compile command


Tests without a mirror configured

No mirror is configured in Maven's settings.xml

1. No repository configured

The default central repository is used

Depending on a SNAPSHOT-version artifact

Rejected outright: the central repository does not serve snapshots, so Maven does not even contact it

[ERROR] Failed to execute goal on project csv2excel: Could not resolve dependencies for project com.wwh.csv2excel:csv2excel:jar:0.1-SNAPSHOT: Could not find artifact com.wwh.csv2excel:csv2excel:jar:0.7-SNAPSHOT -> [Help 1]

Depending on a non-SNAPSHOT artifact

Maven tries to download from the central repository, then reports that the artifact cannot be found

[ERROR] Failed to execute goal on project csv2excel: Could not resolve dependencies for project com.wwh.csv2excel:csv2excel:jar:0.1-SNAPSHOT: Could not find artifact com.wwh.csv2excel:csv2excel:jar:0.8 in central (https://repo.maven.apache.org/maven2) -> [Help 1]

2. One repository configured, with snapshots enabled

<repositories>
    <repository>
        <id>nexusRep</id>
        <url>http://nexus.vvsmart.com:8081/nexus/content/groups/public</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>

Depending on a SNAPSHOT-version artifact

Maven tries to download from the configured repository, then reports that the artifact cannot be found

[ERROR] Failed to execute goal on project csv2excel: Could not resolve dependencies for project com.wwh.csv2excel:csv2excel:jar:0.1-SNAPSHOT: Could not find artifact com.wwh.csv2excel:csv2excel:jar:1.2-SNAPSHOT in nexusRep (http://nexus.vvsmart.com:8081/nexus/content/groups/public) -> [Help 1]

Depending on a non-SNAPSHOT artifact

Maven tries the configured repository first, then the central repository, and finally reports that the dependency cannot be resolved

Downloading from nexusRep: http://nexus.vvsmart.com:8081/nexus/content/groups/public/com/wwh/csv2excel/csv2excel/1.4/csv2excel-1.4.pom
Downloading from central: https://repo.maven.apache.org/maven2/com/wwh/csv2excel/csv2excel/1.4/csv2excel-1.4.pom
[WARNING] The POM for com.wwh.csv2excel:csv2excel:jar:1.4 is missing, no dependency information available
Downloading from nexusRep: http://nexus.vvsmart.com:8081/nexus/content/groups/public/com/wwh/csv2excel/csv2excel/1.4/csv2excel-1.4.jar
Downloading from central: https://repo.maven.apache.org/maven2/com/wwh/csv2excel/csv2excel/1.4/csv2excel-1.4.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 11.304 s
[INFO] Finished at: 2020-05-09T21:02:10+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project csv2excel: Could not resolve dependencies for project com.wwh.csv2excel:csv2excel:jar:0.1-SNAPSHOT: Could not find artifact com.wwh.csv2excel:csv2excel:jar:1.4 in nexusRep (http://nexus.vvsmart.com:8081/nexus/content/groups/public) -> [Help 1]

3. Multiple repositories configured, with snapshots enabled

Three repositories are configured; their addresses do not actually exist, since this only verifies the order in which Maven tries to download the artifact

[INFO] --------------------------------[ jar ]---------------------------------
Downloading from nexusRep: http://nexus.vvsmart.com:8081/nexus/content/groups/public/com/wwh/csv2excel/csv2excel/1.5/csv2excel-1.5.pom
Downloading from aaa: http://192.168.0.11:8081/nexus/content/groups/public/com/wwh/csv2excel/csv2excel/1.5/csv2excel-1.5.pom
Downloading from bbb: http://192.168.0.12:8081/nexus/content/groups/public/com/wwh/csv2excel/csv2excel/1.5/csv2excel-1.5.pom
Downloading from central: https://repo.maven.apache.org/maven2/com/wwh/csv2excel/csv2excel/1.5/csv2excel-1.5.pom

Conclusions

If a repository does not enable snapshots, Maven will not try to download SNAPSHOT artifacts from it
The configured repositories take priority and are tried in the order they are declared; Maven's central repository is tried last


Tests with a mirror configured

<mirror>
    <id>nexus</id>
    <mirrorOf>*</mirrorOf>
    <url>http://nexus.vvsmart.com:8081/nexus/content/groups/public</url>
</mirror>

A single address mirrors all repositories

1. No repository configured

Depending on a SNAPSHOT-version artifact

By default this resolves against the central repository, which rejects it outright because snapshots are not enabled there; Maven reports that the dependency cannot be resolved and never contacts the mirror server

[ERROR] Failed to execute goal on project csv2excel: Could not resolve dependencies for project com.wwh.csv2excel:csv2excel:jar:0.1-SNAPSHOT: Could not find artifact com.wwh.csv2excel:csv2excel:jar:0.4-SNAPSHOT -> [Help 1]

Depending on a non-SNAPSHOT artifact

Maven tries to download from the mirror address, then reports that the artifact cannot be found

[ERROR] Failed to execute goal on project csv2excel: Could not resolve dependencies for project com.wwh.csv2excel:csv2excel:jar:0.1-SNAPSHOT: Could not find artifact com.wwh.csv2excel:csv2excel:jar:0.3 in nexus (http://nexus.vvsmart.com:8081/nexus/content/groups/public) -> [Help 1]

2. One repository configured, with snapshots enabled

Both cases go to the mirror address, which then reports that the artifact cannot be found

Effect of the repository id

When the repository id is central

Depending on a non-existent SNAPSHOT artifact, Maven tries to download it from the mirror address

[ERROR] Failed to execute goal on project csv2excel: Could not resolve dependencies for project com.wwh.csv2excel:csv2excel:jar:0.1-SNAPSHOT: Could not find artifact com.wwh.csv2excel:csv2excel:jar:0.5-SNAPSHOT in nexus (http://nexus.vvsmart.com:8081/nexus/content/groups/public) -> [Help 1]
When the repository id is test

Depending on a non-existent SNAPSHOT artifact, Maven tries to download it from the mirror address

[ERROR] Failed to execute goal on project csv2excel: Could not resolve dependencies for project com.wwh.csv2excel:csv2excel:jar:0.1-SNAPSHOT: Could not find artifact com.wwh.csv2excel:csv2excel:jar:0.6-SNAPSHOT in nexus (http://nexus.vvsmart.com:8081/nexus/content/groups/public) -> [Help 1]

Conclusions

Maven still first checks whether the locally configured repository enables snapshots before deciding whether it needs to download from that repository; only then does the request go to the mirror server

Whether the repository id is set to central (to override the central-repository configuration in the super POM) does not really matter, since requests go to the mirror server either way; it is enough that at least one repository enables snapshots

Also, once a mirror is configured, the repository URL no longer matters, since it is never used

Problem description

Every so often (roughly once a month), Kafka in the production environment fails and becomes unreachable. All three servers are healthy and the Kafka processes are still running, but one node becomes disconnected from the other two, which looks like a split-brain; the logs on the healthy nodes show that they cannot connect to the faulty node

Once the problem occurs, the cluster does not recover on its own

Restarting the faulty node resolves the problem

Environment

Cluster

A Kafka cluster of three machines

  • 192.168.1.217 RD-MQ-01
  • 192.168.1.218 RD-MQ-02
  • 192.168.1.219 RD-MQ-03

Kafka version

kafka_2.11-0.10.1.0

ZooKeeper version

3.4.10-4181

ZooKeeper is deployed standalone and is not shared with other applications

Java version

[root@RD-MQ-01 ~]# java -version
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)

Operating system version

CentOS 6.8

[root@RD-MQ-02 bin]# uname -a
Linux RD-MQ-02 2.6.32-642.el6.x86_64 #1 SMP Tue May 10 17:27:01 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[root@RD-MQ-02 bin]# cat /etc/redhat-release
CentOS release 6.8 (Final)

Relevant configuration

Kafka

############################# Server Basics #############################
broker.id=218
delete.topic.enable=true

############################# Socket Server Settings #############################
host.name=192.168.1.218
listeners=PLAINTEXT://192.168.1.218:9092
num.network.threads=4
num.io.threads=4
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600

############################# Log Basics #############################
log.dirs=/opt/data/kafka
num.partitions=32

############################# Internal Topic Settings #############################
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=3

############################# Log Flush Policy #############################
# Flush data to disk after this many messages have been written by producers
log.flush.interval.messages=2000
# Flush data to disk every 1 second
log.flush.interval.ms=1000

############################# Log Retention Policy #############################
# Log retention time and size; whichever limit is hit first applies. The size here is 200 GB
log.retention.hours=72
log.retention.bytes=214748364800
# Maximum message size
message.max.byte=1
# Replication factor. Too few replicas hurts data availability, while too many just wastes storage; 2-3 is generally recommended.
default.replication.factor=3
# Compression type used by producers for compressing data
compression.type=gzip
# Maximum number of bytes a replica fetches per request
replica.fetch.max.bytes=5242880

# Replicas start a number of fetcher threads to sync data locally; num.replica.fetchers controls how many.
# The fetchers for each partition share offsets, which keeps the one-to-one relationship between consumer and partition at any given time while letting us raise throughput by adding fetcher threads.
num.replica.fetchers=4

# Segment files are 1 GB, which helps reclaim disk space quickly and speeds up loading on restart (if segments are too small there are many files, and Kafka scans all data files under log.dirs with a single thread at startup)
log.segment.bytes=1073741824

log.retention.check.interval.ms=300000

############################# Zookeeper #############################
zookeeper.connect=192.168.1.217:5181,192.168.1.218:5181,192.168.1.219:5181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000
rebalance.backoff.ms=2000
rebalance.max.retries=10

Error logs

Relevant log entries from an occurrence of the problem in the development environment

Kafka log on 217

[2020-04-22 10:43:36,369] WARN [ReplicaFetcherThread-1-219], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@5641fe72 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 219 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

Kafka log on 218

[2020-04-22 10:59:42,687] WARN [ReplicaFetcherThread-1-219], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@1722c608 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 219 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

Kafka log on 219

[root@RD-MQ-03 logs]# tail -f server.log
[2020-04-22 10:12:50,317] INFO Rolled new log segment for '__consumer_offsets-3' in 26 ms. (kafka.log.Log)
[2020-04-22 10:14:00,179] INFO Deleting segment 485381953 from log __consumer_offsets-3. (kafka.log.Log)
[2020-04-22 10:14:00,179] INFO Deleting segment 0 from log __consumer_offsets-3. (kafka.log.Log)
[2020-04-22 10:14:00,181] INFO Deleting index /opt/data/kafka/__consumer_offsets-3/00000000000485381953.index.deleted (kafka.log.OffsetIndex)
[2020-04-22 10:14:00,181] INFO Deleting index /opt/data/kafka/__consumer_offsets-3/00000000000000000000.index.deleted (kafka.log.OffsetIndex)
[2020-04-22 10:14:00,181] INFO Deleting index /opt/data/kafka/__consumer_offsets-3/00000000000485381953.timeindex.deleted (kafka.log.TimeIndex)
[2020-04-22 10:14:00,181] INFO Deleting index /opt/data/kafka/__consumer_offsets-3/00000000000000000000.timeindex.deleted (kafka.log.TimeIndex)
[2020-04-22 10:14:01,478] INFO Deleting segment 486188481 from log __consumer_offsets-3. (kafka.log.Log)
[2020-04-22 10:14:01,517] INFO Deleting index /opt/data/kafka/__consumer_offsets-3/00000000000486188481.index.deleted (kafka.log.OffsetIndex)
[2020-04-22 10:14:01,517] INFO Deleting index /opt/data/kafka/__consumer_offsets-3/00000000000486188481.timeindex.deleted (kafka.log.TimeIndex)

ZooKeeper logs

The ZooKeeper logs look normal, and the ZooKeeper data files are normal (sizes are normal and cleaned up periodically)

Troubleshooting notes

Checking the network

All three machines can in fact reach one another

ping

# 217

[root@RD-MQ-01 ~]# ping 192.168.1.219
PING 192.168.1.219 (192.168.1.219) 56(84) bytes of data.
64 bytes from 192.168.1.219: icmp_seq=1 ttl=64 time=1.23 ms
64 bytes from 192.168.1.219: icmp_seq=2 ttl=64 time=1.79 ms

# 218

[root@RD-MQ-02 ~]# ping 192.168.1.219
PING 192.168.1.219 (192.168.1.219) 56(84) bytes of data.
64 bytes from 192.168.1.219: icmp_seq=1 ttl=64 time=1.76 ms
64 bytes from 192.168.1.219: icmp_seq=2 ttl=64 time=1.28 ms

# 219

[root@RD-MQ-03 ~]# ping 192.168.1.218
PING 192.168.1.218 (192.168.1.218) 56(84) bytes of data.
64 bytes from 192.168.1.218: icmp_seq=1 ttl=64 time=0.987 ms
64 bytes from 192.168.1.218: icmp_seq=2 ttl=64 time=1.18 ms

telnet

telnet also connects

[root@RD-MQ-02 ~]# telnet 192.168.1.219 9092
Trying 192.168.1.219...
Connected to 192.168.1.219.
Escape character is '^]'.

aa
^]
telnet> q
Connection closed.

netstat

[root@RD-MQ-02 ~]# netstat -an | grep 219:9092
tcp 0 0 ::ffff:192.168.1.218:36841 ::ffff:192.168.1.219:9092 FIN_WAIT2
tcp 0 0 ::ffff:192.168.1.218:36840 ::ffff:192.168.1.219:9092 FIN_WAIT2
tcp 0 0 ::ffff:192.168.1.218:36842 ::ffff:192.168.1.219:9092 ESTABLISHED
tcp 0 0 ::ffff:192.168.1.218:36839 ::ffff:192.168.1.219:9092 FIN_WAIT2
tcp 0 0 ::ffff:192.168.1.218:36837 ::ffff:192.168.1.219:9092 FIN_WAIT2
tcp 0 0 ::ffff:192.168.1.218:36844 ::ffff:192.168.1.219:9092 ESTABLISHED
tcp 0 0 ::ffff:192.168.1.218:36838 ::ffff:192.168.1.219:9092 FIN_WAIT2
tcp 0 0 ::ffff:192.168.1.218:36843 ::ffff:192.168.1.219:9092 ESTABLISHED

Connections keep being established and then closed

Checking ZooKeeper

ZooKeeper reads and writes work normally

All 3 Kafka brokers are visible in ZooKeeper

[zk: 192.168.1.218:5181(CONNECTED) 4] ls /brokers/ids
[217, 218, 219]

Checking Kafka Manager

Only the broker information can be queried

Topics and Consumers cannot be viewed

Checking with the Kafka command-line tools

topic list

Listing the topics works on each of the three machines

./kafka-topics.sh --list --zookeeper 192.168.1.217:5181

topic describe

Pick a test topic, st.topic.yery.test, and describe it:

[root@RD-MQ-01 bin]#  ./kafka-topics.sh --zookeeper 192.168.1.217:5181 --topic st.topic.yery.test --describe
Topic:st.topic.yery.test PartitionCount:32 ReplicationFactor:3 Configs:
Topic: st.topic.yery.test Partition: 0 Leader: 218 Replicas: 218,217,219 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 1 Leader: 219 Replicas: 219,218,217 Isr: 219
Topic: st.topic.yery.test Partition: 2 Leader: 217 Replicas: 217,219,218 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 3 Leader: 218 Replicas: 218,219,217 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 4 Leader: 219 Replicas: 219,217,218 Isr: 219,217,218
Topic: st.topic.yery.test Partition: 5 Leader: 217 Replicas: 217,218,219 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 6 Leader: 218 Replicas: 218,217,219 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 7 Leader: 219 Replicas: 219,218,217 Isr: 219
Topic: st.topic.yery.test Partition: 8 Leader: 217 Replicas: 217,219,218 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 9 Leader: 218 Replicas: 218,219,217 Isr: 218,217,219
Topic: st.topic.yery.test Partition: 10 Leader: 219 Replicas: 219,217,218 Isr: 219
Topic: st.topic.yery.test Partition: 11 Leader: 217 Replicas: 217,218,219 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 12 Leader: 218 Replicas: 218,217,219 Isr: 218,217,219
Topic: st.topic.yery.test Partition: 13 Leader: 219 Replicas: 219,218,217 Isr: 219,217,218
Topic: st.topic.yery.test Partition: 14 Leader: 217 Replicas: 217,219,218 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 15 Leader: 218 Replicas: 218,219,217 Isr: 218,217,219
Topic: st.topic.yery.test Partition: 16 Leader: 219 Replicas: 219,217,218 Isr: 219,217,218
Topic: st.topic.yery.test Partition: 17 Leader: 217 Replicas: 217,218,219 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 18 Leader: 218 Replicas: 218,217,219 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 19 Leader: 219 Replicas: 219,218,217 Isr: 219,217,218
Topic: st.topic.yery.test Partition: 20 Leader: 217 Replicas: 217,219,218 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 21 Leader: 218 Replicas: 218,219,217 Isr: 218,217,219
Topic: st.topic.yery.test Partition: 22 Leader: 219 Replicas: 219,217,218 Isr: 219,217,218
Topic: st.topic.yery.test Partition: 23 Leader: 217 Replicas: 217,218,219 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 24 Leader: 218 Replicas: 218,217,219 Isr: 218,217,219
Topic: st.topic.yery.test Partition: 25 Leader: 219 Replicas: 219,218,217 Isr: 219
Topic: st.topic.yery.test Partition: 26 Leader: 217 Replicas: 217,219,218 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 27 Leader: 218 Replicas: 218,219,217 Isr: 218,217,219
Topic: st.topic.yery.test Partition: 28 Leader: 219 Replicas: 219,217,218 Isr: 219,217,218
Topic: st.topic.yery.test Partition: 29 Leader: 217 Replicas: 217,218,219 Isr: 217,218,219
Topic: st.topic.yery.test Partition: 30 Leader: 218 Replicas: 218,217,219 Isr: 218,217,219
Topic: st.topic.yery.test Partition: 31 Leader: 219 Replicas: 219,218,217 Isr: 219,217,218

consumer-groups list

Listing consumer groups

[root@RD-MQ-02 bin]# ./kafka-consumer-groups.sh --bootstrap-server 192.168.1.217:9092 --list
by.consumer.online.storage
dev.consumer.binlog.analysis
by.consumer.alarm.platform
dp.command.consumer
qa.consumer.binlog.install
qa.consumer.binlog.zone
dp.device.consumer
by.consumer.alarm.zone.normal
qa.consumer.binlog.business

consumer-groups describe

Hangs; the consumer group's details cannot be listed

[root@RD-MQ-02 bin]# ./kafka-consumer-groups.sh --bootstrap-server 192.168.1.217:9092 --group dp.device.consumer --describe
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER

Trying another consumer group:
only two partitions' consumption info is shown, although there should actually be 32 partitions

[root@RD-MQ-01 bin]# ./kafka-consumer-groups.sh --bootstrap-server 192.168.1.217:9092,192.168.1.218:9092,192.168.1.219:9092 --group by.consumer.online.storage --describe
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER
by.consumer.online.storage by.topic.dp.online.device.message.info 0 515 515 0 consumer-1_/192.168.1.216
by.consumer.online.storage by.topic.dp.online.device.message.info 1 11 11 0 consumer-1_/192.168.1.216

Testing with the Kafka console tools

Test with the console producer and consumer

Consumer

No data can be consumed at all, and no errors are reported

./kafka-console-consumer.sh --bootstrap-server 192.168.1.217:9092,192.168.1.218:9092,192.168.1.219:9092 --topic by.topic.install.car.device --from-beginning

Producer

Sending data produces errors

[root@RD-MQ-02 bin]# ./kafka-console-producer.sh --broker-list 192.168.1.217:9092,192.168.1.218:9092 --topic by.topic.install.car.device
111111
2222
3333
[2020-04-22 17:09:45,589] WARN Got error produce response with correlation id 2 on topic-partition by.topic.install.car.device-19, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2020-04-22 17:09:45,592] WARN Got error produce response with correlation id 2 on topic-partition by.topic.install.car.device-1, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2020-04-22 17:09:45,592] WARN Got error produce response with correlation id 1 on topic-partition by.topic.install.car.device-10, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
5
[2020-04-22 17:09:48,368] WARN Got error produce response with correlation id 6 on topic-partition by.topic.install.car.device-28, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2020-04-22 17:09:48,369] WARN Got error produce response with correlation id 5 on topic-partition by.topic.install.car.device-10, retrying (1 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2020-04-22 17:09:48,369] WARN Got error produce response with correlation id 4 on topic-partition by.topic.install.car.device-19, retrying (1 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2020-04-22 17:09:48,369] WARN Got error produce response with correlation id 4 on topic-partition by.topic.install.car.device-1, retrying (1 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2020-04-22 17:09:49,978] WARN Got error produce response with correlation id 8 on topic-partition by.topic.install.car.device-1, retrying (0 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2020-04-22 17:09:49,978] WARN Got error produce response with correlation id 8 on topic-partition by.topic.install.car.device-10, retrying (0 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2020-04-22 17:09:49,979] WARN Got error produce response with correlation id 8 on topic-partition by.topic.install.car.device-19, retrying (0 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2020-04-22 17:09:49,979] WARN Got error produce response with correlation id 8 on topic-partition by.topic.install.car.device-28, retrying (1 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2020-04-22 17:09:51,587] ERROR Error when sending message to topic by.topic.install.car.device with key: null, value: 4 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
[2020-04-22 17:09:51,593] ERROR Error when sending message to topic by.topic.install.car.device with key: null, value: 6 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
[2020-04-22 17:09:51,593] ERROR Error when sending message to topic by.topic.install.car.device with key: null, value: 4 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
[2020-04-22 17:09:51,594] WARN Got error produce response with correlation id 10 on topic-partition by.topic.install.car.device-28, retrying (0 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2020-04-22 17:09:53,192] ERROR Error when sending message to topic by.topic.install.car.device with key: null, value: 1 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.

Checking process information

Number of file handles held by each process

217
[root@RD-MQ-01 opt]# lsof -p 13517 | wc -l
12824
218
[root@RD-MQ-02 ~]# lsof -p 13557 | wc -l
12819
219
[root@RD-MQ-03 ~]# lsof -p 81068 | wc -l
326639

219 is clearly holding an extremely large number of file handles

Checking network connections on port 9092

217

There are many connections in the TIME_WAIT state

tcp        0      0 ::ffff:192.168.1.217:9092   ::ffff:192.168.1.215:47810  ESTABLISHED 13517/java          
tcp 0 0 ::ffff:192.168.1.217:54504 ::ffff:192.168.1.218:9092 TIME_WAIT -
tcp 0 53321 ::ffff:192.168.1.217:33408 ::ffff:192.168.1.219:9092 FIN_WAIT1 -
tcp 0 0 ::ffff:192.168.1.217:9092 ::ffff:192.168.1.218:55812 ESTABLISHED 13517/java
tcp 0 64905 ::ffff:192.168.1.217:39404 ::ffff:192.168.1.219:9092 FIN_WAIT1 -
tcp 0 54769 ::ffff:192.168.1.217:33158 ::ffff:192.168.1.219:9092 FIN_WAIT1 -
tcp 0 0 ::ffff:192.168.1.217:9092 ::ffff:192.168.1.216:35251 ESTABLISHED 13517/java
tcp 0 0 ::ffff:192.168.1.217:39720 ::ffff:192.168.1.217:9092 TIME_WAIT -
tcp 0 0 ::ffff:192.168.1.217:54466 ::ffff:192.168.1.218:9092 TIME_WAIT -
tcp 0 48081 ::ffff:192.168.1.217:33639 ::ffff:192.168.1.219:9092 FIN_WAIT1 -
tcp 0 46505 ::ffff:192.168.1.217:39272 ::ffff:192.168.1.219:9092 FIN_WAIT1 -
tcp 0 0 ::ffff:192.168.1.217:39646 ::ffff:192.168.1.217:9092 TIME_WAIT -
tcp 0 47953 ::ffff:192.168.1.217:33365 ::ffff:192.168.1.219:9092 FIN_WAIT1 -
tcp 0 0 ::ffff:192.168.1.217:54494 ::ffff:192.168.1.218:9092 TIME_WAIT -
tcp 0 53449 ::ffff:192.168.1.217:33037 ::ffff:192.168.1.219:9092 FIN_WAIT1 -
tcp 0 0 ::ffff:192.168.1.217:40217 ::ffff:192.168.1.219:9092 FIN_WAIT2 -
[root@RD-MQ-01 ~]# netstat -antp | grep 9092 | wc -l
324
218

Basically normal

tcp        0      0 ::ffff:192.168.1.218:9092   ::ffff:192.168.1.216:48369  ESTABLISHED 13557/java          
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.214:42207 ESTABLISHED 13557/java
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.216:47984 ESTABLISHED 13557/java
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.216:44771 ESTABLISHED 13557/java
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.215:42045 ESTABLISHED 13557/java
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.215:51056 ESTABLISHED 13557/java
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.214:37368 ESTABLISHED 13557/java
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.217:59185 ESTABLISHED 13557/java
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.216:58880 ESTABLISHED 13557/java
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.216:44758 ESTABLISHED 13557/java
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.219:55019 ESTABLISHED 13557/java
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.215:35721 ESTABLISHED 13557/java
tcp 0 0 ::ffff:192.168.1.218:9092 ::ffff:192.168.1.219:55014 ESTABLISHED 13557/java
[root@RD-MQ-02 ~]# netstat -antp | grep 9092 | wc -l
77
219

There are a large number of connections in the CLOSE_WAIT state

tcp      247      0 ::ffff:192.168.1.219:9092   ::ffff:192.168.1.217:58837  CLOSE_WAIT  81068/java          
tcp 178232 0 ::ffff:192.168.1.219:9092 ::ffff:192.168.1.217:34750 ESTABLISHED 81068/java
tcp 55 0 ::ffff:192.168.1.219:9092 ::ffff:192.168.1.213:55954 CLOSE_WAIT 81068/java
tcp 238 0 ::ffff:192.168.1.219:9092 ::ffff:192.168.1.217:40862 CLOSE_WAIT 81068/java
tcp 226 0 ::ffff:192.168.1.219:9092 ::ffff:192.168.1.217:58967 CLOSE_WAIT 81068/java
tcp 14535 0 ::ffff:192.168.1.219:9092 ::ffff:192.168.1.218:49425 CLOSE_WAIT 81068/java
tcp 14986 0 ::ffff:192.168.1.219:9092 ::ffff:192.168.1.218:49330 CLOSE_WAIT 81068/java
tcp 221 0 ::ffff:192.168.1.219:9092 ::ffff:192.168.1.217:37406 CLOSE_WAIT 81068/java
tcp 223 0 ::ffff:192.168.1.219:9092 ::ffff:192.168.1.217:36455 CLOSE_WAIT 81068/java
tcp 14620 0 ::ffff:192.168.1.219:9092 ::ffff:192.168.1.217:59460 CLOSE_WAIT 81068/java
tcp 247 0 ::ffff:192.168.1.219:9092 ::ffff:192.168.1.217:59087 CLOSE_WAIT 81068/java

Far more than on the other two machines

[root@RD-MQ-03 opt]# netstat -antp | grep 9092 | wc -l
5589

Checking thread stacks and memory

Memory usage

Basically normal

217
ps aux|grep java|grep "kafka"|awk '{print $6}'     
2625240
218
ps aux|grep java|grep "kafka"|awk '{print $6}'  
2071136
219
ps aux|grep java|grep "kafka"|awk '{print $6}'  
1901924

jstack

A deadlock was found on 219

Found one Java-level deadlock:
=============================
"executor-Heartbeat":
waiting to lock monitor 0x00007f972c03db78 (object 0x000000070dbce4c8, a kafka.coordinator.GroupMetadata),
which is held by "group-metadata-manager-0"
"group-metadata-manager-0":
waiting to lock monitor 0x00007f979420e438 (object 0x0000000715a00000, a java.util.LinkedList),
which is held by "kafka-request-handler-3"
"kafka-request-handler-3":
waiting to lock monitor 0x00007f972c03db78 (object 0x000000070dbce4c8, a kafka.coordinator.GroupMetadata),
which is held by "group-metadata-manager-0"

Java stack information for the threads listed above:
===================================================
"executor-Heartbeat":
at kafka.coordinator.GroupCoordinator.onExpireHeartbeat(GroupCoordinator.scala:739)
- waiting to lock <0x000000070dbce4c8> (a kafka.coordinator.GroupMetadata)
at kafka.coordinator.DelayedHeartbeat.onExpiration(DelayedHeartbeat.scala:33)
at kafka.server.DelayedOperation.run(DelayedOperation.scala:107)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"group-metadata-manager-0":
at kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched(DelayedOperation.scala:308)
- waiting to lock <0x0000000715a00000> (a java.util.LinkedList)
at kafka.server.DelayedOperationPurgatory.checkAndComplete(DelayedOperation.scala:234)
at kafka.server.ReplicaManager.tryCompleteDelayedProduce(ReplicaManager.scala:199)
at kafka.cluster.Partition.tryCompleteDelayedRequests(Partition.scala:374)
at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:457)
at kafka.coordinator.GroupMetadataManager$$anonfun$cleanupGroupMetadata$1$$anonfun$apply$10.apply(GroupMetadataManager.scala:600)
at kafka.coordinator.GroupMetadataManager$$anonfun$cleanupGroupMetadata$1$$anonfun$apply$10.apply(GroupMetadataManager.scala:593)
at scala.Option.foreach(Option.scala:257)
at kafka.coordinator.GroupMetadataManager$$anonfun$cleanupGroupMetadata$1.apply(GroupMetadataManager.scala:593)
- locked <0x000000070dbce4c8> (a kafka.coordinator.GroupMetadata)
at kafka.coordinator.GroupMetadataManager$$anonfun$cleanupGroupMetadata$1.apply(GroupMetadataManager.scala:579)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at kafka.utils.Pool$$anon$1.foreach(Pool.scala:89)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at kafka.utils.Pool.foreach(Pool.scala:26)
at kafka.coordinator.GroupMetadataManager.cleanupGroupMetadata(GroupMetadataManager.scala:579)
at kafka.coordinator.GroupMetadataManager$$anonfun$1.apply$mcV$sp(GroupMetadataManager.scala:101)
at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"kafka-request-handler-3":
at kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$putCacheCallback$2(GroupMetadataManager.scala:301)
- waiting to lock <0x000000070dbce4c8> (a kafka.coordinator.GroupMetadata)
at kafka.coordinator.GroupMetadataManager$$anonfun$prepareStoreOffsets$1.apply(GroupMetadataManager.scala:357)
at kafka.coordinator.GroupMetadataManager$$anonfun$prepareStoreOffsets$1.apply(GroupMetadataManager.scala:357)
at kafka.server.DelayedProduce.onComplete(DelayedProduce.scala:123)
at kafka.server.DelayedOperation.forceComplete(DelayedOperation.scala:70)
at kafka.server.DelayedProduce.tryComplete(DelayedProduce.scala:105)
at kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched(DelayedOperation.scala:315)
- locked <0x0000000715a4d358> (a kafka.server.DelayedProduce)
- locked <0x0000000715a00000> (a java.util.LinkedList)
at kafka.server.DelayedOperationPurgatory.checkAndComplete(DelayedOperation.scala:234)
at kafka.server.ReplicaManager.tryCompleteDelayedProduce(ReplicaManager.scala:199)
at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:909)
at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:902)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:902)
at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:475)
at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:523)
at kafka.server.KafkaApis.handle(KafkaApis.scala:79)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.

jmap

Dump the heap of the Kafka process on 219

jmap -dump:format=b,file=kafka-219-jmap.dump 81068

Preserve the evidence for now and analyze it later if necessary; file size: 665 MB

Related material

Close matches

KAFKA-3994

https://issues.apache.org/jira/browse/KAFKA-3994
The executor-Heartbeat thread and the kafka-request-handler thread deadlock with each other
The deadlock report differs slightly from ours: there is no group-metadata-manager thread

The affected version is one below the one we are running

Affects: 0.10.0.0
Fixed in: 0.10.1.1, 0.10.2.0

KAFKA-4399

https://issues.apache.org/jira/browse/KAFKA-4399
The deadlock report has one difference from ours: there is no executor-Heartbeat thread

Affects: 0.10.1.0
Fixed in: 0.10.1.1, 0.10.2.0

KAFKA-4478

https://issues.apache.org/jira/browse/KAFKA-4478
The heartbeat executor, the group metadata manager and the request handler are deadlocked with one another
The deadlock stack traces here match ours exactly

Affects: 0.10.1.0
Fixed in: none (resolved as part of KAFKA-3994)

KAFKA-4562

https://issues.apache.org/jira/browse/KAFKA-4562

The deadlock is the same as ours

Affects: 0.10.1.0
Fixed in: none (resolved as part of KAFKA-3994)


Highly related

Mailing list thread 1

https://users.kafka.apache.narkive.com/7ekMKZSX/deadlock-using-latest-0-10-1-kafka-release

A three-thread deadlock with stack traces like ours, and file handles exhausted

The author is confident that the patch from KAFKA-3994 resolves the problem.

Related: KAFKA-3994

Mailing list thread 2

https://users.kafka.apache.narkive.com/mP9QlK2j/connectivity-problem-with-controller-breaks-cluster

The problem as described matches ours

Related: KAFKA-4477

KAFKA-4477

https://issues.apache.org/jira/browse/KAFKA-4477

Our faulty node's ISR also shrank, and this issue likewise mentions file handle counts and the deadlock

Affects: 0.10.1.0
Fixed in: 0.10.1.1

Conclusion

A deadlock inside the Kafka broker caused the cluster failure
According to the official notes, this problem was resolved in issue [KAFKA-4399]
The problem can be fixed by upgrading the existing Kafka installation to 0.10.1.1

Kafka 0.10.1.1 release notes
https://archive.apache.org/dist/kafka/0.10.1.1/RELEASE_NOTES.html

The official release notes record that bug [KAFKA-4399] was fixed

  • [KAFKA-3994] - Deadlock between consumer heartbeat expiration and offset commit.
  • [KAFKA-4399] - Deadlock between cleanupGroupMetadata and offset commit

Upgrade

Kafka server side

Upgrade the server side by one minor version to 0.10.1.1
https://archive.apache.org/dist/kafka/0.10.1.1/RELEASE_NOTES.html
This release is mainly bug fixes with no new features, so our data should be fully compatible; we only need to swap the server package, copy over the configuration files, and then restart the Kafka services one at a time.

Official upgrade documentation

http://kafka.apache.org/0101/documentation.html#upgrade

Application client side

Upgrade all producers and consumers by one minor version to 0.10.1.1

<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.1.1</version>
</dependency>

The application code should not need any changes, since the API has not changed; only the dependency package needs to be upgraded
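For example, a smoke-test producer against the upgraded cluster; this is only a sketch assuming the standard 0.10.x producer API, and the topic is the test topic used above.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UpgradeSmokeTest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.1.217:9092,192.168.1.218:9092,192.168.1.219:9092");
        props.put("acks", "all");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // exactly the same producer code as before the upgrade; only the kafka-clients version in the pom changes
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("st.topic.yery.test", "key", "hello after upgrade"));
        }
    }
}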

Update the dependency in the parent project's pom, tidy up the child projects, and rebuild and redeploy

Other issues

Some projects may still use an older API or connection style, e.g. connecting through ZooKeeper instead of directly to Kafka; those projects will need code changes

Upgrade step log

Download the package

wget https://archive.apache.org/dist/kafka/0.10.1.1/kafka_2.11-0.10.1.1.tgz

The Scala version chosen here is 2.11, the same as the one previously installed

Extract to a new directory
Keep using the original configuration locations
Note: mind the location of the data files (log.dirs)

Prepare a producer and a consumer

First start a console producer to generate data

./kafka-console-producer.sh --broker-list 192.168.1.217:9092,192.168.1.218:9092,192.168.1.219:9092 --topic st.topic.yery.test

Then start a console consumer to consume the data

./kafka-console-consumer.sh --bootstrap-server 192.168.1.217:9092,192.168.1.218:9092,192.168.1.219:9092 --topic st.topic.yery.test --from-beginning

Keep both running unchanged throughout the upgrade

Stop node 219

Shut down node 219 first and watch the cluster state:
217 and 218 log errors, but the cluster as a whole keeps running normally

Verification

The producer can still produce data normally

The consumer can still consume data normally

The consumer prints a warning

In Kafka Manager only two brokers (217 and 218) are visible
Looking at topic st.topic.yery.test:
all partition leaders have moved to 217 and 218
In Sync Replicas contains only (217, 218)

Update the configuration files

Copy the original configuration files into kafka_2.11-0.10.1.1

cd kafka_2.11-0.10.1.0

# copy the startup script (written by ourselves)
cp start.sh ../kafka_2.11-0.10.1.1/

# copy the configuration files
cd config

cp server.properties ../../kafka_2.11-0.10.1.1/config/
cp log4j.properties ../../kafka_2.11-0.10.1.1/config/

Start the new Kafka version on 219

Start 219 directly (the 0.10.1.1 build)

Watching the logs on 219, you can see it recovering data

Once data recovery completes, it rejoins the cluster

After it rejoins, 217 and 218 log that the partition ISRs are expanded from [217,218] to [217,218,219]

In Kafka Manager, broker 219 shows as having joined the cluster
Looking at topic [st.topic.yery.test]:
partition leaders are now spread across 217, 218 and 219
In Sync Replicas shows (217, 218, 219)

Verify again

At this point, producing and consuming from the command line both work normally

Application test

Verify directly against the binlog application
The application is tested without upgrading its Kafka client library

In the database, modify record x in the by_device table, setting its remark to 'kafka update'
In the cache admin tool, record x's remark shows as 'kafka update'

For the corresponding topic qa.topic.binlog.base-by_device,
the Sum of partition offsets also increased by 1, showing that messages are produced and consumed normally

The application works fine even without upgrading its kafka-clients dependency

Then upgrade 217 and 218

Replace the Kafka installation on 217 and 218 using the same steps, reusing each node's original configuration files

Shut down and restart the Kafka processes one at a time, wait for data recovery, and the cluster returns to a normal state

Verify old data

After the upgrade completes, consume topic [st.topic.yery.test] again from the beginning

./kafka-console-consumer.sh --bootstrap-server 192.168.1.217:9092,192.168.1.218:9092,192.168.1.219:9092 --topic st.topic.yery.test --from-beginning

The data is intact; the upgrade is complete


Addendum

Upgrade notes

Upgrading from 0.8.x, 0.9.x, 0.10.0.x, 0.10.1.x or 0.10.2.x to 0.11.0.0

http://kafka.apache.org/0102/documentation.html#upgrade

Kafka 0.11.0.0 introduces a new message format version and wire protocol changes. By following the recommended rolling upgrade plan below, you can guarantee no downtime during the upgrade. However, please review the notable changes in 0.11.0.0 before upgrading.

Starting with version 0.10.2, Java clients (producers and consumers) have been able to talk to older brokers. Version 0.11.0 clients can talk to brokers running version 0.10.0 or newer. However, if your brokers are older than 0.10.0, you must upgrade all brokers in the Kafka cluster before upgrading your clients. Version 0.11.0 brokers support 0.8.x and newer clients.


###################################


Problems after the upgrade

This is a trap: after the upgrade there is still a deadlock problem, a new one

A deadlock between DelayedProduce and GroupMetadata

Found one Java-level deadlock:
=============================
"executor-Heartbeat":
waiting to lock monitor 0x00007f113c00b678 (object 0x00000000c8ab9028, a kafka.server.DelayedProduce),
which is held by "kafka-scheduler-5"
"kafka-scheduler-5":
waiting to lock monitor 0x00007f1918013848 (object 0x00000000ea18c4d8, a kafka.coordinator.GroupMetadata),
which is held by "executor-Heartbeat"

Java stack information for the threads listed above:

Bug:
https://issues.apache.org/jira/browse/KAFKA-5970

Fixed in:
0.11.0.2, 1.0.0

There is no fix for this in the 0.10.x line, and the 0.11.x line in turn appears to have a memory leak problem...

Fortunately the probability of hitting this is very low; it has only happened once, in the test environment

IP Addresses

IPv4 uses 32-bit addresses, i.e. 32 bits (each 0 or 1) split into 4 groups of 8 bits; each group is converted from binary to decimal, and the address is written in dotted-decimal notation
For example: 192.168.1.1 = 11000000.10101000.00000001.00000001

Numerically, an 8-bit binary number maxes out at 255, so the largest possible IP address is 255.255.255.255
The total number of IPv4 addresses is 2^32 = 4294967296, only a little over 4 billion, and some of those are set aside for other uses, which is why IPv6 was introduced later (an IPv6 address is 128 bits long, 4 times the length of an IPv4 address)
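A small sketch (my own helper, not a library call) that converts a dotted-decimal address into the binary form shown above:

public class IpToBinary {
    public static void main(String[] args) {
        String ip = "192.168.1.1";
        StringBuilder binary = new StringBuilder();
        for (String octet : ip.split("\\.")) {
            // pad each octet out to 8 bits
            String bits = Integer.toBinaryString(Integer.parseInt(octet));
            binary.append("00000000".substring(bits.length())).append(bits).append('.');
        }
        binary.setLength(binary.length() - 1); // drop the trailing dot
        // prints: 192.168.1.1 = 11000000.10101000.00000001.00000001
        System.out.println(ip + " = " + binary);
    }
}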

IP address structure

The address format is: IP address = network address + host address, or IP address = network address + subnet address

Network address + host address = 32 bits, forming a complete IP address

When the Internet was first designed, to make addressing easier and to build networks hierarchically, each IP address was given two identifiers (IDs): a network ID and a host ID. All hosts on the same physical network share the same network ID, and every host on that network (workstations, servers, routers and so on) has its own host ID. Based on the network ID, IP addresses are divided into 5 classes: Class A, Class B, Class C, Class D and Class E.

Subnet mask

Simply put, it is used to separate the network address from the host address
If a bit in the mask is 1, that bit is masked, i.e. it belongs to the network address
It can be written as 255.255.255.0 or as /24; both mean the same thing: the first 24 bits of the address are masked. For example, for the address 192.168.1.1/24 the network address is 192.168.1 and the host part can only be 1~254
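As a quick illustration of the masking, a sketch (helper names are my own) that applies a /24 mask to 192.168.1.1 and splits it into network and host parts:

public class SubnetMaskDemo {

    // dotted-decimal string to a 32-bit integer
    static int toInt(String ip) {
        int value = 0;
        for (String octet : ip.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }

    static String toDotted(int value) {
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }

    public static void main(String[] args) {
        int address = toInt("192.168.1.1");
        int mask    = toInt("255.255.255.0"); // the same as /24

        // the masked bits are the network address, the remaining bits are the host part
        System.out.println("network: " + toDotted(address & mask)); // 192.168.1.0
        System.out.println("host:    " + (address & ~mask));        // 1
    }
}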

Class   | Prefix | Range                       | Private addresses                            | Default mask  | Notes
Class A | 0      | 0.0.0.0 ~ 126.255.255.255   | 10.0.0.0 ~ 10.255.255.255                    | 255.0.0.0     |
Class B | 10     | 128.0.0.0 ~ 191.255.255.255 | 172.16.0.0 ~ 172.31.255.255                  | 255.255.0.0   |
Class C | 110    | 192.0.0.0 ~ 223.255.255.255 | 192.168.0.0 ~ 192.168.255.255                | 255.255.255.0 |
Class D | 1110   | 224.0.0.0 ~ 239.255.255.255 | 239.0.0.0 ~ 239.255.255.255 (locally scoped) |               | multicast addresses
Class E | 11110  | 240.0.0.0 ~ 247.255.255.255 |                                              |               | reserved

1. Class A IP addresses

A Class A IP address consists of a 1-byte network address and a 3-byte host address; the highest bit of the network address must be 0, and the range runs from 1.0.0.0 to 126.0.0.0. There are 126 usable Class A networks, each able to hold over 100 million hosts.

[decimal] 126 = [binary] 01111110
127 is reserved: 127.0.0.1 is the internal loopback address, while network number 0 means the local host itself and cannot be routed

2. Class B IP addresses

A Class B IP address consists of a 2-byte network address and a 2-byte host address; the highest bits of the network address must be 10, and the range runs from 128.0.0.0 to 191.255.255.255. There are 16382 usable Class B networks, each able to hold over 60 thousand hosts.

[decimal] 128 = [binary] 10000000
[decimal] 191 = [binary] 10111111

3. Class C IP addresses

A Class C IP address consists of a 3-byte network address and a 1-byte host address; the highest bits of the network address must be 110, and the range runs from 192.0.0.0 to 223.255.255.255. There are over 2.09 million Class C networks, each able to hold 254 hosts.

[decimal] 192 = [binary] 11000000
[decimal] 223 = [binary] 11011111

4. Class D addresses are used for multicast

Class D IP addresses start with 1110 in the first byte and are specially reserved. They do not point to a particular network; these addresses are currently used for multicast. A multicast address addresses a group of computers at once, identifying a group of computers that share a common protocol.

5. Class E IP addresses

Start with 11110 and are reserved for future use.

The all-zeros address (0.0.0.0) refers to the current host. The all-ones IP address (255.255.255.255) is the broadcast address of the current subnet.

Broadcast address

A broadcast address is an address used to send to every workstation on a network at the same time. In a TCP/IP network, an IP address whose host ID part is all ones is a broadcast address: a packet sent to it is delivered to every computer covered by that host ID range. For example, for the 10.1.1.0 (255.255.255.0) network the broadcast address is 10.1.1.255 (255 is binary 11111111); when a packet is sent with destination address 10.1.1.255, it is delivered to every computer on that network segment.

Types

A broadcast address targets all hosts within a network

1) Limited broadcast
It is not routed, but it is delivered to every host on the same physical network segment
The address with both the network and host fields all ones, i.e. 255.255.255.255

2) Directed broadcast
A network broadcast is routed and is sent to every host on the specified network
The network field of the IP address identifies that network and the host field is normally all ones, e.g. 192.168.10.255

Broadcast and addresses

  1. In the TCP/IP stack, only UDP at the transport layer can broadcast.
  2. Broadcasts only reach hosts within the same subnet; broadcast packets do not pass through routers.
  3. The UDP broadcast address is 255.255.255.255.
  4. In the Winsock implementation there is a socket option controlling whether broadcasting is allowed; it must be enabled with setsockopt (a Java equivalent is sketched after the next paragraph).
  5. Once enabled, every packet sent to 255.255.255.255 with sendto is broadcast.

Many LANs define a special reserved address called the broadcast address. When the destination address field in a frame header contains the broadcast address, the frame is received by every computer on the LAN. This process is called broadcasting.
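In Java, the Winsock setsockopt option mentioned above corresponds to DatagramSocket.setBroadcast; a minimal sender sketch (the port number 9999 is arbitrary):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class BroadcastSender {
    public static void main(String[] args) throws Exception {
        byte[] payload = "hello broadcast".getBytes("UTF-8");
        try (DatagramSocket socket = new DatagramSocket()) {
            // equivalent of enabling SO_BROADCAST via setsockopt
            socket.setBroadcast(true);
            DatagramPacket packet = new DatagramPacket(
                    payload, payload.length,
                    InetAddress.getByName("255.255.255.255"), 9999); // arbitrary port
            // every host on the local subnet listening on UDP port 9999 receives this
            socket.send(packet);
        }
    }
}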

1: maven-jar-plugin

<build>
    <plugins>
        <plugin>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>com.xxx.xxx.Main</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>

maven-jar-plugin is used to generate part of the META-INF/MANIFEST.MF file

com.xxx.xxx.Main

specifies the Main-Class entry in MANIFEST.MF
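The class named in <mainClass> just needs an ordinary main method; a minimal sketch, with the package and class name following the placeholder used above:

package com.xxx.xxx;

public class Main {
    public static void main(String[] args) {
        // entry point referenced by Main-Class in MANIFEST.MF, so the jar can be run with java -jar
        System.out.println("started via java -jar");
    }
}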

Excluding a directory

<plugin>
    <artifactId>maven-jar-plugin</artifactId>
    <configuration>
        <!-- exclude the groovy script directory from the jar -->
        <excludes>
            <exclude>groovy/</exclude>
        </excludes>
    </configuration>
</plugin>

Excluding files; wildcards can be used

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <version>2.4</version>
    <configuration>
        <!-- exclude these files from the jar -->
        <excludes>
            <exclude>log4j.properties</exclude>
            <exclude>conf.properties</exclude>
            <exclude>system.properties</exclude>
            <exclude>logback.xml</exclude>
            <exclude>log4j2.xml</exclude>
        </excludes>
    </configuration>
</plugin>

2: maven-assembly-plugin

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>2.6</version>
    <configuration>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
            <manifest>
                <mainClass>com.xxx.xxx.Main</mainClass>
            </manifest>
        </archive>
    </configuration>
    <executions>
        <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Here we did not specify

<descriptor>src/main/assembly/assembly.xml</descriptor>  

but instead used

<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>

This jar-with-dependencies is one of the assembly plugin's pre-defined, built-in assembly descriptor references

Dependencies in the default compile scope are packed into the jar, along with all of their transitive dependencies, so some jars can end up very large; change a dependency's scope to provided and it will not be packed in.

Specifying the output file name

<configuration>
<finalName>CSV转Excel工具</finalName>
<appendAssemblyId>false</appendAssemblyId>
......


The usual approach

Most projects are fairly large and are not packaged as a single jar; instead they are packaged as a tar archive containing startup scripts, configuration files and so on

Using the assembly plugin

A typical directory layout looks like this:


└─assembly
│ assembly.xml

├─bin
│ dump.sh
│ restart.sh
│ server.sh
│ start.sh
│ stop.sh

└─conf
log4j.properties
system.properties

Configure the assembly plugin in the pom

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>2.6</version>
    <configuration>
        <descriptor>src/main/assembly/assembly.xml</descriptor>
    </configuration>
    <executions>
        <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>

assembly.xml

<assembly>
    <id>assembly</id>
    <formats>
        <format>tar.gz</format>
    </formats>
    <includeBaseDirectory>true</includeBaseDirectory>
    <fileSets>
        <fileSet>
            <directory>src/main/assembly/bin</directory>
            <outputDirectory>bin</outputDirectory>
            <fileMode>0755</fileMode>
        </fileSet>
        <fileSet>
            <directory>src/main/assembly/conf</directory>
            <outputDirectory>conf</outputDirectory>
            <fileMode>0644</fileMode>
        </fileSet>
    </fileSets>
    <dependencySets>
        <dependencySet>
            <outputDirectory>lib</outputDirectory>
        </dependencySet>
    </dependencySets>
</assembly>

Problem

When a time is inserted into the database, it ends up 13 hours short (sometimes 11 hours)

When read back, the missing hours are automatically added again, so the Date object that is read is correct

Environment

Linux

# uname -a
Linux had1 2.6.32-754.3.5.el6.x86_64 #1 SMP Tue Aug 14 20:46:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/centos-release 
CentOS release 6.10 (Final)

MySQL version

5.7.20

show variables like "%time_zone%";
Variable_name      Value
system_time_zone   CST
time_zone          SYSTEM

Cause

CST time zone problem

The MySQL driver version was changed to:
mysql-connector-java 6.0.2

Most likely the 6.x driver changed how it determines the time zone, which is what causes this

Time standards

GMT

Greenwich Mean Time

UTC

Coordinated Universal Time

DST

Daylight Saving Time

CST

CST, however, can stand for any of the following 4 different time zones:

Central Standard Time (USA) UT-6:00

Central Standard Time (Australia) UT+9:30

China Standard Time UT+8:00

Cuba Standard Time UT-4:00

As you can see, CST can refer to the standard time of the USA, Australia, China, or Cuba.
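The JDK runs into the same ambiguity: the legacy three-letter ID "CST" is mapped to US Central Time (UTC-6), not China Standard Time, which is consistent with the roughly half-day shift seen above. A quick check:

import java.util.TimeZone;

public class CstAmbiguity {
    public static void main(String[] args) {
        // the legacy three-letter ID "CST" resolves to US Central Time (UTC-6)
        TimeZone cst = TimeZone.getTimeZone("CST");
        System.out.println(cst.getRawOffset() / 3600000.0);      // -6.0

        // China Standard Time has to be requested by its region ID
        TimeZone shanghai = TimeZone.getTimeZone("Asia/Shanghai");
        System.out.println(shanghai.getRawOffset() / 3600000.0); // 8.0
    }
}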

Solutions

1. Change the driver version

Switch back to 5.1.45, which we had been using all along

2. Specify the time zone in the connection string

serverTimezone

jdbc.url=jdbc:mysql://192.168.1.211:3306/analysis?useUnicode=true&characterEncoding=UTF-8&useSSL=false&serverTimezone=Asia/Shanghai
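A minimal round-trip sketch to verify the fix; the table and column names here are hypothetical, only the connection URL matters:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Timestamp;

public class TimezoneCheck {
    public static void main(String[] args) throws Exception {
        // serverTimezone tells the 6.x+ driver what the server's "CST" actually means
        String url = "jdbc:mysql://192.168.1.211:3306/analysis"
                + "?useUnicode=true&characterEncoding=UTF-8&useSSL=false&serverTimezone=Asia/Shanghai";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            // hypothetical table: t_time_check(id INT, created_at DATETIME)
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO t_time_check(id, created_at) VALUES (?, ?)")) {
                ps.setInt(1, 1);
                ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
                ps.executeUpdate();
            }
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT created_at FROM t_time_check WHERE id = 1")) {
                while (rs.next()) {
                    // should print the same wall-clock time that was just inserted
                    System.out.println(rs.getTimestamp("created_at"));
                }
            }
        }
    }
}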

3. Set it in MySQL

# vim /etc/my.cnf

## add under the [mysqld] section
default-time_zone = '+8:00'

## restart MySQL so the new time zone takes effect
# /etc/init.d/mysqld restart

Distance when the latitude stays the same and the longitude differs by 1 in the last decimal place

Test results:

6 decimal places, precision: 0.1 m

0.10139558886691133

5 decimal places, precision: 1 m

1.0139558860891593

4 decimal places, precision: 10 m

10.139558845411655

3 decimal places, precision: 101 m

101.39558846680167

3 decimal places (latitude and longitude both change by 1), precision: 150 m

150.4837860345681

2 decimal places, precision: 1014 m

1013.9558844430293

Test code:

public class DistanceCalcUtil {

    // Earth radius in metres
    private static final double EARTH_RADIUS = 6370996.81;

    // degrees to radians
    private static double radian(double d) {
        return d * Math.PI / 180.0;
    }

    /**
     * Calculates the distance between two points.
     *
     * @param lat1 latitude of the first point
     * @param lng1 longitude of the first point
     * @param lat2 latitude of the second point
     * @param lng2 longitude of the second point
     * @return distance in metres
     */
    public static double distanceOfTwoPoints(double lat1, double lng1, double lat2, double lng2) {
        double radLat1 = radian(lat1);
        double radLat2 = radian(lat2);
        double a = radLat1 - radLat2;
        double b = radian(lng1) - radian(lng2);
        double s = 2 * Math.asin(Math
                .sqrt(Math.pow(Math.sin(a / 2), 2) + Math.cos(radLat1) * Math.cos(radLat2) * Math.pow(Math.sin(b / 2), 2)));
        s = s * EARTH_RADIUS;
        return s;
        // s = Math.round(s * 10000) / 10000;
        // Integer distance = Integer.parseInt(new
        // java.text.DecimalFormat("0").format(s));
        // return distance;
    }

    public static void main(String[] args) {
        System.out.println("Changing the longitude by 1 in the last decimal place:");

        System.out.println("6 decimal places, precision: 0.1 m");
        System.out.println(distanceOfTwoPoints(24.234532, 124.123223, 24.234532, 124.123224));

        System.out.println("5 decimal places, precision: 1 m");
        System.out.println(distanceOfTwoPoints(24.234532, 124.12322, 24.234532, 124.12323));

        System.out.println("4 decimal places, precision: 10 m");
        System.out.println(distanceOfTwoPoints(24.234532, 124.1232, 24.234532, 124.1233));

        System.out.println("3 decimal places, precision: 101 m");
        System.out.println(distanceOfTwoPoints(24.234532, 124.123, 24.234532, 124.124));

        System.out.println("3 decimal places (latitude and longitude both change by 1), precision: 150 m");
        System.out.println(distanceOfTwoPoints(24.234, 124.123, 24.235, 124.124));

        System.out.println("2 decimal places, precision: 1014 m");
        System.out.println(distanceOfTwoPoints(24.234532, 124.12, 24.234532, 124.13));
    }
}

Storing latitude/longitude points for the whole country at 3 decimal places

Measuring on the map, draw a rectangle roughly covering the whole country

Width: 4,000,000 m
Height: 2,500,000 m

Converted into latitude/longitude grid points, that is roughly 1 billion points

As a rough estimate, storing latitude/longitude points for the whole country (3 decimal places, about 100 m precision) would take about 1 billion records
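A quick back-of-the-envelope check of that estimate, assuming (as measured above) that 0.001 degrees is roughly 100 metres:

public class GridEstimate {
    public static void main(String[] args) {
        double metersPerStep = 100.0;        // ~100 m per 0.001 degree, from the measurements above
        double widthMeters   = 4_000_000.0;  // width of the rectangle covering the country
        double heightMeters  = 2_500_000.0;  // height of the rectangle covering the country

        long stepsX = Math.round(widthMeters / metersPerStep);  // ~40,000
        long stepsY = Math.round(heightMeters / metersPerStep); // ~25,000
        long points = stepsX * stepsY;                          // ~1,000,000,000

        System.out.println(stepsX + " x " + stepsY + " = " + points + " points");
    }
}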

Taking a Spring Boot application as an example

startup.sh

#!/bin/sh
#
# startup script for Linux
#

# change to the script's directory
cd `dirname $0`


## start the application
nohup java -Dspring.profiles.active=test -jar wwh-test-1.0-SNAPSHOT.jar > wwh-test.out 2>&1 &

shutdown.sh

#!/bin/bash


# wildcard pattern for the jar name
jar_pattern="wwh-test.*.jar"

# find the PIDs of the running jar
pids=$(ps -ef | grep "$jar_pattern" | grep -v grep | awk '{print $2}')

# if any matching processes are running, stop them
if [ -n "$pids" ]; then
    echo "Stopping processes matching $jar_pattern..."
    for pid in $pids; do
        echo "Stopping process $pid"
        kill -15 "$pid"
    done

    # loop until each process has actually exited
    for pid in $pids; do
        while kill -0 "$pid" >/dev/null 2>&1; do
            echo -n "."
            sleep 1
        done
        echo "Process $pid has stopped."
    done

    echo "All processes stopped."
else
    echo "No process matching $jar_pattern found."
fi

