记一次RocketMQ Client超时问题排查
org.apache.rocketmq.shaded.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall started aft
问题的一种可能
昨天同事A遇到一个诡异问题,在代码中加入消费rocketmq的代码后,在本地运行好好的程序,部署到云上测试集群A1会报出以下异常:
time,cluster,namespace,service,pod,log
2023-12-14 09:36:00,cluster-test,did-test,jiankunking-test,jiankunking-test-6b86565dbb-lzt9d,2023-12-14 09:36:00.595 INFO 1 --- [ XNIO-1 task-1] c.h.d.i.rocket.PdmRocketMqConsumer : [TID:95e19b4533204903a985c90b068cddcc_64_17025177603380033] --- 构建mq5.0消费者:proxy:10.163.240.237:8080, topic:product_material_topic_uat_center, group:product_material_group_center
2023-12-14 09:36:40,cluster-test,did-test,jiankunking-test,jiankunking-test-6b86565dbb-lzt9d,2023-12-14 09:36:39.995 ERROR 1 --- [ XNIO-1 task-1] c.h.d.i.rocket.PdmRocketMqConsumer : [TID:95e19b4533204903a985c90b068cddcc_64_17025177603380033] --- 构建mq5.0消费者异常:proxy:10.163.240.237:8080, topic:product_material_topic_uat_center, group:product_material_group_center
2023-12-14 09:36:40,cluster-test,did-test,jiankunking-test,jiankunking-test-6b86565dbb-lzt9d,
2023-12-14 09:36:40,cluster-test,did-test,jiankunking-test,jiankunking-test-6b86565dbb-lzt9d,java.lang.IllegalStateException: Expected the service PushConsumerImpl-0 [FAILED] to be RUNNING, but the service has FAILED
at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:381) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:305) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractIdleService.awaitRunning(AbstractIdleService.java:165) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.client.java.impl.consumer.PushConsumerBuilderImpl.build(PushConsumerBuilderImpl.java:128) ~[rocketmq-client-java-5.0.5.jar!/:na]
at com.jiankunking.issuemanage.rocket.PdmRocketMqConsumer.initConsumer(PdmRocketMqConsumer.java:89) ~[classes!/:na]
at com.jiankunking.issuemanage.controller.IssueTagInfoController.initRocketMq$original$l28F75DZ(IssueTagInfoController.java:103) ~[classes!/:na]
at com.jiankunking.issuemanage.controller.IssueTagInfoController.initRocketMq$original$l28F75DZ$accessor$0D0TSRNs(IssueTagInfoController.java) ~[classes!/:na]
at com.jiankunking.issuemanage.controller.IssueTagInfoController$auxiliary$Rnw73z4x.call(Unknown Source) ~[classes!/:na]
at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:86) ~[skywalking-agent.jar:8.12.0]
at com.jiankunking.issuemanage.controller.IssueTagInfoController.initRocketMq(IssueTagInfoController.java) ~[classes!/:na]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:na]
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[na:na]
at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:205) ~[spring-web-5.3.24.jar!/:5.3.24]
at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:150) ~[spring-web-5.3.24.jar!/:5.3.24]
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:117) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:808) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1071) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:964) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:645) ~[javax.servlet-api-4.0.1.jar!/:4.0.1]
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:750) ~[javax.servlet-api-4.0.1.jar!/:4.0.1]
at io.undertow.servlet.handlers.ServletHandler.handleRequest(ServletHandler.java:74) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:129) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100) ~[spring-web-5.3.24.jar!/:5.3.24]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.24.jar!/:5.3.24]
at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93) ~[spring-web-5.3.24.jar!/:5.3.24]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.24.jar!/:5.3.24]
at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201) ~[spring-web-5.3.24.jar!/:5.3.24]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.24.jar!/:5.3.24]
at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.FilterHandler.handleRequest(FilterHandler.java:84) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.security.ServletSecurityRoleHandler.handleRequest(ServletSecurityRoleHandler.java:62) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.ServletChain$1.handleRequest(ServletChain.java:68) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.ServletDispatchingHandler.handleRequest(ServletDispatchingHandler.java:36) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.RedirectDirHandler.handleRequest(RedirectDirHandler.java:68) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.security.SSLInformationAssociationHandler.handleRequest(SSLInformationAssociationHandler.java:117) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.security.ServletAuthenticationCallHandler.handleRequest(ServletAuthenticationCallHandler.java:57) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.security.handlers.AbstractConfidentialityHandler.handleRequest(AbstractConfidentialityHandler.java:46) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.security.ServletConfidentialityConstraintHandler.handleRequest(ServletConfidentialityConstraintHandler.java:64) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.security.handlers.AuthenticationMechanismsHandler.handleRequest(AuthenticationMechanismsHandler.java:60) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.security.CachedAuthenticatedSessionHandler.handleRequest(CachedAuthenticatedSessionHandler.java:77) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.security.handlers.AbstractSecurityContextAssociationHandler.handleRequest(AbstractSecurityContextAssociationHandler.java:43) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.SendErrorPageHandler.handleRequest(SendErrorPageHandler.java:52) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.ServletInitialHandler.handleFirstRequest(ServletInitialHandler.java:275) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.ServletInitialHandler.access$100(ServletInitialHandler.java:79) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:134) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:131) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.core.ServletRequestContextThreadSetupAction$1.call(ServletRequestContextThreadSetupAction.java:48) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.core.ContextClassLoaderSetupAction$1.call(ContextClassLoaderSetupAction.java:43) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.ServletInitialHandler.dispatchRequest(ServletInitialHandler.java:255) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.ServletInitialHandler.access$000(ServletInitialHandler.java:79) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.servlet.handlers.ServletInitialHandler$1.handleRequest(ServletInitialHandler.java:100) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.server.Connectors.executeRootHandler(Connectors.java:387) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:852) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
at org.apache.skywalking.apm.plugin.undertow.v2x.SWRunnable.run(SWRunnable.java:45) ~[na:na]
at org.apache.skywalking.apm.plugin.undertow.v2x.SWRunnable.run(SWRunnable.java:45) ~[na:na]
at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35) ~[jboss-threads-3.1.0.Final.jar!/:3.1.0.Final]
at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:2019) ~[jboss-threads-3.1.0.Final.jar!/:3.1.0.Final]
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1558) ~[jboss-threads-3.1.0.Final.jar!/:3.1.0.Final]
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1449) ~[jboss-threads-3.1.0.Final.jar!/:3.1.0.Final]
at org.xnio.XnioWorker$WorkerThreadFactory$1$1.run(XnioWorker.java:1282) ~[xnio-api-3.8.7.Final.jar!/:3.8.7.Final]
at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
Caused by: java.util.concurrent.ExecutionException: org.apache.rocketmq.shaded.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall started after deadline exceeded: -2.591386886s from now
at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:588) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:547) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:91) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.client.java.impl.ClientImpl.startUp(ClientImpl.java:188) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.client.java.impl.consumer.PushConsumerImpl.startUp(PushConsumerImpl.java:161) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractIdleService$DelegateService$1.run(AbstractIdleService.java:62) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.com.google.common.util.concurrent.Callables.lambda$threadRenaming$3(Callables.java:103) ~[rocketmq-client-java-5.0.5.jar!/:na]
... 1 common frames omitted
Caused by: org.apache.rocketmq.shaded.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall started after deadline exceeded: -2.591386886s from now
at org.apache.rocketmq.shaded.io.grpc.Status.asRuntimeException(Status.java:539) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:544) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[rocketmq-client-java-5.0.5.jar!/:na]
at org.apache.rocketmq.shaded.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[rocketmq-client-java-5.0.5.jar!/:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
... 1 common frames omitted
由于本地代码跟服务端代码完全一样,在加上错误提示:
Expected the service PushConsumerImpl-0 [FAILED] to be RUNNING, but the service has FAILED
org.apache.rocketmq.shaded.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall started after deadline exceeded: -2.591386886s from now
同事第一怀疑的点是集群网络环境,这里我的第一反应不会是网络问题,为啥这么想呢?
因为如果是网络问题,那么集群内肯定会有大量的报错,而不仅仅是这一个服务。
那么问题出现在哪呢?难道是代码的问题?
@Slf4j
@Component
@Configuration
@EnableConfigurationProperties(RocketMqProperties.class)
public class DemoConsumer {
private final ClientServiceProvider provider = ClientServiceProvider.loadService();
private final List<PushConsumer> pushConsumers = Collections.synchronizedList(new LinkedList<>());
private final RestTemplate restTemplate;
private final String endpoints;
private final String accessKey;
private final String secretKey;
private final String topic;
private final String group;
@Autowired
public DemoConsumer(RestTemplate restTemplate, RocketMqProperties properties) {
this.restTemplate = restTemplate;
this.accessKey = properties.getAccessKey();
this.secretKey = properties.getSecretKey();
this.endpoints = properties.getEndpoints();
this.topic = properties.getTopic();
this.group = properties.getGroup();
listener();
}
public void listener() {
try {
log.info("开始构建mq5.0消费者");
SessionCredentialsProvider sessionCredentialsProvider = new StaticSessionCredentialsProvider(accessKey, secretKey);
ClientConfiguration clientConfiguration = ClientConfiguration.newBuilder()
.setEndpoints(endpoints)
.setCredentialProvider(sessionCredentialsProvider)
.build();
PushConsumer pushConsumer = provider.newPushConsumerBuilder()
.setClientConfiguration(clientConfiguration)
.setConsumerGroup(group)
.setSubscriptionExpressions(Collections.singletonMap(topic, new FilterExpression("*", FilterExpressionType.TAG)))
.setMessageListener(messageView -> {
String tagInfo = messageView.getTag().orElse("");
String code = StandardCharsets.UTF_8.decode(messageView.getBody()).toString();
log.info("tagInfo:{} 消息内容:{}", tagInfo, StandardCharsets.UTF_8.decode(messageView.getBody()));
if (StringUtils.isBlank(tagInfo) || StringUtils.isBlank(code)) {
return ConsumeResult.SUCCESS;
}
// 拆分tagInfo 获取素材类型
return ConsumeResult.SUCCESS;
}).build();
log.info("构建mq5.0消费者成功:proxy:{}, topic:{}, group:{}", endpoints, topic, group);
pushConsumers.add(pushConsumer);
} catch (ClientException e) {
log.info("构建mq5.0消费者失败:proxy:{}, topic:{}, group:{}, error:{}", endpoints, topic, group, e);
}
}
@PreDestroy
public void destroy() {
pushConsumers.forEach(pushConsumer -> {
try (pushConsumer) {
log.info("关闭mq5.0消费者:proxy:{}, topic:{}, group:{}", endpoints, topic, group);
} catch (Exception e) {
log.error("关闭mq5.0消费者失败:proxy:{}, topic:{}, group:{}, error:{}", endpoints, topic, group, e);
}
});
}
}
我比对了下官方demo,发现这里跟官方demo一模一样。
https://github.com/apache/rocketmq-clients/blob/master/java/client/src/main/java/org/apache/rocketmq/client/java/example/PushConsumerExample.java
而且另一个同事B一模一样的代码在另一个集群B1能运行起来。
难道真的是网络问题?
上述异常是在服务启动的时候抛出的,导致服务一直没有起来,这时候容器还有创建成功,也就没有对应的进程,所以无法通过抓包的方式排除网络问题。
我让同事将rocketmq初始化的逻辑调整为调用api触发,这样可以保证服务能够正常启动,然后我也可以通过网络抓包的方式来获取网络流程。
k8s集群中抓包的方式参见:https://jiankunking.blog.csdn.net/article/details/125189956
抓包截图如下:
从图中可以看出网络连接正常建立,并在2.7秒左右的时间服务主动RST连接,这个时间也就是服务抛出异常的时间。
从网络抓包,也验证了咱们的想法,网络至少在TCP层面是没有问题的。
虽然抓包已经验证了,但我还是将有问题的镜像部署到部署到集群B1发现还是有问题。
那么问题会出在哪里?排查一度陷入困境。
这期间也尝试过
- 将rocketmq的地址改成https://…
- 关闭、开启enableSsl
- 等 问题依然没有解决
该服务在测试环境的配置如下:
内存上限:800m 请求600m
CPU上限0.3核 请求0.1核
在测试环境中该服务的启动时间长达4分多钟,会不会cpu资源太少,服务启动的时候JIT占用太多资源,导致服务请求rocketmq的网络连接一直得不到调度,从而出现超时呢?
增加CPU上限到1核,服务不到一分钟就启动了,但异常依旧。
排查又卡主了,这时候想起了一个点
1、容器云中的应用我们会默认注入skywalking、prometheus等agent
2、skywalking在我之前不严谨的测试中发现会占用200~300m的内存
3、java 命令会被复写,默认堆内存是内存上限的65%
800*0.65=520m,再减掉200~300m,发现留给服务的内存大约220~320m之间,那么会不会是内存太少导致的问题呢?
逐步提升内存请求的上限到1200m,发现错误消失。
再回头看异常日志,发现在出现rocketmq异常的时候,HikariPool创建数据库连接也出现了问题
HikariPool初始化时机是在rocketmq init之后
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.098 WARN 1 --- [ XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@5c4719b1 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.101 WARN 1 --- [ XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@33b9b407 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.103 WARN 1 --- [ XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@59c237c9 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.196 WARN 1 --- [ XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@4c517517 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.198 WARN 1 --- [ XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@28bf77bd (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.203 WARN 1 --- [ XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@59d2c94b (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.296 WARN 1 --- [ XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@a835f53 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
然后,告诉同事问题解决方案,以为这就结束了…
然而下午同事发现将内存的上限改回800m服务也可以正常启动了,难道上午哪里出了问题?
先用之前问题镜像,降低内存上限,服务真的正常启动了…
既然服务现在能用之前的配置起来了,那么在没有默认注入agent的情况下,服务到底多大内存能正常启动呢?
先替换启动命令,移除默认注入的skywalking、prometheus等agent
java2
-server
-Xms300m
-Xmx300m
-jar
application.jar
--spring.profiles.active=test
发现服务300m就能正常启动。
然后换回带有agent的启动命令,发现这时:
- 内存上限=请求=500m的时候服务可以正常启动
- 内存上限=请求=450m的时候会出现上述的异常
- 内存上限=请求=400m的时候会出现容器OOMKilled的异常
到这里可以明确出两个结论
- 异常是由于内存不够产生的
- 产生异常的临界值没有明确
- 怀疑两点:
- skywalking等注入agent内存占用受别的因素影响会发生较大波动
- 服务自身环境上午数据与下午数据占用内存不一致
- 怀疑两点:
这里也怀疑过是集群超卖了内存请求,怀疑经排查
spec:
ratio:
limits.cpu: 500
limits.memory: 500
requests.cpu: 100
requests.memory: 100
请求没有超卖,超卖的是上限。
由于时间所限,暂未对服务内存占用分析及skywalking等agent内存占用分析,等有时间再续。
开放原子开发者工作坊旨在鼓励更多人参与开源活动,与志同道合的开发者们相互交流开发经验、分享开发心得、获取前沿技术趋势。工作坊有多种形式的开发者活动,如meetup、训练营等,主打技术交流,干货满满,真诚地邀请各位开发者共同参与!
更多推荐
所有评论(0)