One Possible Cause of the Problem

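Yesterday colleague A ran into a strange problem: after adding code that consumes from rocketmq, the program ran fine locally, but once deployed to the cloud test cluster A1 it reported the following exception: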

time,cluster,namespace,service,pod,log
2023-12-14 09:36:00,cluster-test,did-test,jiankunking-test,jiankunking-test-6b86565dbb-lzt9d,2023-12-14 09:36:00.595  INFO 1 --- [  XNIO-1 task-1] c.h.d.i.rocket.PdmRocketMqConsumer       : [TID:95e19b4533204903a985c90b068cddcc_64_17025177603380033] --- 构建mq5.0消费者:proxy:10.163.240.237:8080, topic:product_material_topic_uat_center, group:product_material_group_center
2023-12-14 09:36:40,cluster-test,did-test,jiankunking-test,jiankunking-test-6b86565dbb-lzt9d,2023-12-14 09:36:39.995 ERROR 1 --- [  XNIO-1 task-1] c.h.d.i.rocket.PdmRocketMqConsumer       : [TID:95e19b4533204903a985c90b068cddcc_64_17025177603380033] --- 构建mq5.0消费者异常:proxy:10.163.240.237:8080, topic:product_material_topic_uat_center, group:product_material_group_center
2023-12-14 09:36:40,cluster-test,did-test,jiankunking-test,jiankunking-test-6b86565dbb-lzt9d,
2023-12-14 09:36:40,cluster-test,did-test,jiankunking-test,jiankunking-test-6b86565dbb-lzt9d,java.lang.IllegalStateException: Expected the service PushConsumerImpl-0 [FAILED] to be RUNNING, but the service has FAILED
	at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:381) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:305) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractIdleService.awaitRunning(AbstractIdleService.java:165) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.client.java.impl.consumer.PushConsumerBuilderImpl.build(PushConsumerBuilderImpl.java:128) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at com.jiankunking.issuemanage.rocket.PdmRocketMqConsumer.initConsumer(PdmRocketMqConsumer.java:89) ~[classes!/:na]
	at com.jiankunking.issuemanage.controller.IssueTagInfoController.initRocketMq$original$l28F75DZ(IssueTagInfoController.java:103) ~[classes!/:na]
	at com.jiankunking.issuemanage.controller.IssueTagInfoController.initRocketMq$original$l28F75DZ$accessor$0D0TSRNs(IssueTagInfoController.java) ~[classes!/:na]
	at com.jiankunking.issuemanage.controller.IssueTagInfoController$auxiliary$Rnw73z4x.call(Unknown Source) ~[classes!/:na]
	at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:86) ~[skywalking-agent.jar:8.12.0]
	at com.jiankunking.issuemanage.controller.IssueTagInfoController.initRocketMq(IssueTagInfoController.java) ~[classes!/:na]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:na]
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
	at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[na:na]
	at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:205) ~[spring-web-5.3.24.jar!/:5.3.24]
	at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:150) ~[spring-web-5.3.24.jar!/:5.3.24]
	at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:117) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
	at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
	at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:808) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
	at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
	at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1071) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
	at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:964) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
	at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
	at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:645) ~[javax.servlet-api-4.0.1.jar!/:4.0.1]
	at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883) ~[spring-webmvc-5.3.24.jar!/:5.3.24]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:750) ~[javax.servlet-api-4.0.1.jar!/:4.0.1]
	at io.undertow.servlet.handlers.ServletHandler.handleRequest(ServletHandler.java:74) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:129) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100) ~[spring-web-5.3.24.jar!/:5.3.24]
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.24.jar!/:5.3.24]
	at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93) ~[spring-web-5.3.24.jar!/:5.3.24]
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.24.jar!/:5.3.24]
	at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201) ~[spring-web-5.3.24.jar!/:5.3.24]
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.24.jar!/:5.3.24]
	at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.FilterHandler.handleRequest(FilterHandler.java:84) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.security.ServletSecurityRoleHandler.handleRequest(ServletSecurityRoleHandler.java:62) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.ServletChain$1.handleRequest(ServletChain.java:68) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.ServletDispatchingHandler.handleRequest(ServletDispatchingHandler.java:36) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.RedirectDirHandler.handleRequest(RedirectDirHandler.java:68) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.security.SSLInformationAssociationHandler.handleRequest(SSLInformationAssociationHandler.java:117) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.security.ServletAuthenticationCallHandler.handleRequest(ServletAuthenticationCallHandler.java:57) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.security.handlers.AbstractConfidentialityHandler.handleRequest(AbstractConfidentialityHandler.java:46) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.security.ServletConfidentialityConstraintHandler.handleRequest(ServletConfidentialityConstraintHandler.java:64) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.security.handlers.AuthenticationMechanismsHandler.handleRequest(AuthenticationMechanismsHandler.java:60) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.security.CachedAuthenticatedSessionHandler.handleRequest(CachedAuthenticatedSessionHandler.java:77) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.security.handlers.AbstractSecurityContextAssociationHandler.handleRequest(AbstractSecurityContextAssociationHandler.java:43) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.SendErrorPageHandler.handleRequest(SendErrorPageHandler.java:52) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.ServletInitialHandler.handleFirstRequest(ServletInitialHandler.java:275) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.ServletInitialHandler.access$100(ServletInitialHandler.java:79) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:134) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:131) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.core.ServletRequestContextThreadSetupAction$1.call(ServletRequestContextThreadSetupAction.java:48) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.core.ContextClassLoaderSetupAction$1.call(ContextClassLoaderSetupAction.java:43) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.ServletInitialHandler.dispatchRequest(ServletInitialHandler.java:255) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.ServletInitialHandler.access$000(ServletInitialHandler.java:79) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.servlet.handlers.ServletInitialHandler$1.handleRequest(ServletInitialHandler.java:100) ~[undertow-servlet-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.server.Connectors.executeRootHandler(Connectors.java:387) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
	at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:852) ~[undertow-core-2.2.20.Final.jar!/:2.2.20.Final]
	at org.apache.skywalking.apm.plugin.undertow.v2x.SWRunnable.run(SWRunnable.java:45) ~[na:na]
	at org.apache.skywalking.apm.plugin.undertow.v2x.SWRunnable.run(SWRunnable.java:45) ~[na:na]
	at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35) ~[jboss-threads-3.1.0.Final.jar!/:3.1.0.Final]
	at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:2019) ~[jboss-threads-3.1.0.Final.jar!/:3.1.0.Final]
	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1558) ~[jboss-threads-3.1.0.Final.jar!/:3.1.0.Final]
	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1449) ~[jboss-threads-3.1.0.Final.jar!/:3.1.0.Final]
	at org.xnio.XnioWorker$WorkerThreadFactory$1$1.run(XnioWorker.java:1282) ~[xnio-api-3.8.7.Final.jar!/:3.8.7.Final]
	at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
Caused by: java.util.concurrent.ExecutionException: org.apache.rocketmq.shaded.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall started after deadline exceeded: -2.591386886s from now
	at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:588) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:547) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:91) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.client.java.impl.ClientImpl.startUp(ClientImpl.java:188) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.client.java.impl.consumer.PushConsumerImpl.startUp(PushConsumerImpl.java:161) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.com.google.common.util.concurrent.AbstractIdleService$DelegateService$1.run(AbstractIdleService.java:62) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.com.google.common.util.concurrent.Callables.lambda$threadRenaming$3(Callables.java:103) ~[rocketmq-client-java-5.0.5.jar!/:na]
	... 1 common frames omitted
Caused by: org.apache.rocketmq.shaded.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall started after deadline exceeded: -2.591386886s from now
	at org.apache.rocketmq.shaded.io.grpc.Status.asRuntimeException(Status.java:539) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:544) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at org.apache.rocketmq.shaded.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[rocketmq-client-java-5.0.5.jar!/:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
	... 1 common frames omitted

The local code was identical to the code deployed on the server, and the key error messages were:

Expected the service PushConsumerImpl-0 [FAILED] to be RUNNING, but the service has FAILED

org.apache.rocketmq.shaded.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall started after deadline exceeded: -2.591386886s from now

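My colleague's first suspicion was the cluster network environment. My own first reaction was that it would not be a network problem. Why?

Because if it were a network problem, there would certainly be a large number of errors across the cluster, not just in this one service.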
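
The wording of the gRPC error is informative on its own. "ClientCall started after deadline exceeded: -2.591386886s from now" suggests that the deadline had already passed by about 2.59 seconds when the client got around to starting the call, so that call would have failed on the client side rather than being rejected by the server. The following is a minimal sketch of that kind of check, not the gRPC implementation; the class `DeadlineSketch`, its methods, and the 3-second timeout are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: models "deadline already expired when the call starts".
final class DeadlineSketch {

    /** Time left until the deadline in nanoseconds; negative once the deadline has passed. */
    static long remainingNanos(long deadlineNanos, long nowNanos) {
        return deadlineNanos - nowNanos;
    }

    /** A call whose deadline has already expired fails locally and is never sent. */
    static String startCall(long deadlineNanos, long nowNanos) {
        long remaining = remainingNanos(deadlineNanos, nowNanos);
        if (remaining <= 0) {
            return String.format(Locale.ROOT,
                    "DEADLINE_EXCEEDED: call started %.9fs after its deadline", -remaining / 1e9);
        }
        return "SENT";
    }

    public static void main(String[] args) {
        long created = 0L;
        // Assumed timeout of 3 s, fixed when the request is created.
        long deadline = created + TimeUnit.SECONDS.toNanos(3);
        // The thread that starts the call only gets to run about 5.59 s later.
        long delayedStart = created + 5_591_386_886L;
        System.out.println(startCall(deadline, delayedStart));
        // A call that starts 1 s after creation is still within its deadline.
        System.out.println(startCall(deadline, created + TimeUnit.SECONDS.toNanos(1)));
    }
}
```

Read this way, the error points at the client process being slow to run its own code (for example because of long pauses or resource starvation) at least as much as at the network.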

So where was the problem? Could it be the code?

@Slf4j
@Component
@Configuration
@EnableConfigurationProperties(RocketMqProperties.class)
public class DemoConsumer {

    private final ClientServiceProvider provider = ClientServiceProvider.loadService();
    private final List<PushConsumer> pushConsumers = Collections.synchronizedList(new LinkedList<>());

    private final RestTemplate restTemplate;
    private final String endpoints;
    private final String accessKey;
    private final String secretKey;
    private final String topic;
    private final String group;

    @Autowired
    public DemoConsumer(RestTemplate restTemplate, RocketMqProperties properties) {
        this.restTemplate = restTemplate;
        this.accessKey = properties.getAccessKey();
        this.secretKey = properties.getSecretKey();
        this.endpoints = properties.getEndpoints();
        this.topic = properties.getTopic();
        this.group = properties.getGroup();
        listener();
    }

    public void listener() {
        try {
            log.info("Start building the mq5.0 consumer");
            SessionCredentialsProvider sessionCredentialsProvider = new StaticSessionCredentialsProvider(accessKey, secretKey);
            ClientConfiguration clientConfiguration = ClientConfiguration.newBuilder()
                    .setEndpoints(endpoints)
                    .setCredentialProvider(sessionCredentialsProvider)
                    .build();
            PushConsumer pushConsumer = provider.newPushConsumerBuilder()
                    .setClientConfiguration(clientConfiguration)
                    .setConsumerGroup(group)
                    .setSubscriptionExpressions(Collections.singletonMap(topic, new FilterExpression("*", FilterExpressionType.TAG)))
                    .setMessageListener(messageView -> {
                        String tagInfo = messageView.getTag().orElse("");
                        // Decode the body once and reuse it for logging and validation
                        String code = StandardCharsets.UTF_8.decode(messageView.getBody()).toString();
                        log.info("tagInfo:{} message body:{}", tagInfo, code);
                        if (StringUtils.isBlank(tagInfo) || StringUtils.isBlank(code)) {
                            return ConsumeResult.SUCCESS;
                        }
                        // Split tagInfo to get the material type
                        return ConsumeResult.SUCCESS;
                    }).build();
            log.info("mq5.0 consumer built successfully: proxy:{}, topic:{}, group:{}", endpoints, topic, group);
            pushConsumers.add(pushConsumer);
        } catch (ClientException e) {
            // Log failures at ERROR level; the trailing Throwable is logged with its stack trace
            log.error("Failed to build the mq5.0 consumer: proxy:{}, topic:{}, group:{}", endpoints, topic, group, e);
        }
    }


    @PreDestroy
    public void destroy() {
        pushConsumers.forEach(pushConsumer -> {
            // try-with-resources closes the consumer when the block exits
            try (pushConsumer) {
                log.info("Closing the mq5.0 consumer: proxy:{}, topic:{}, group:{}", endpoints, topic, group);
            } catch (Exception e) {
                log.error("Failed to close the mq5.0 consumer: proxy:{}, topic:{}, group:{}", endpoints, topic, group, e);
            }
        });
    }
}

I compared it with the official demo and found that the code here is the same as the official demo.

https://github.com/apache/rocketmq-clients/blob/master/java/client/src/main/java/org/apache/rocketmq/client/java/example/PushConsumerExample.java

Moreover, colleague B's identical code ran fine in another cluster, B1.

Could it really be a network problem?

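The exception above was thrown during service startup, so the service never came up. At that point the container had not been created successfully, so there was no corresponding process, and it was impossible to rule out a network problem by capturing packets.

I asked my colleague to change the rocketmq initialization so that it is triggered by an API call. This way the service can start normally, and I can then capture packets to observe the network flow.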
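
The idea of deferring initialization until an explicit trigger can be sketched roughly as follows. This is a framework-free illustration under my own assumptions, not the code my colleague wrote: `OnDemandInitializer` and `trigger()` are hypothetical names, and in the real service `trigger()` would be called from a controller method, with the init action being the consumer-building logic.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: runs an expensive init action on demand instead of at startup.
final class OnDemandInitializer {
    private final AtomicBoolean started = new AtomicBoolean(false);
    private final Runnable initAction;

    OnDemandInitializer(Runnable initAction) {
        this.initAction = initAction;
    }

    /**
     * Runs the init action if it has not run successfully yet.
     * Returns true if this call performed the initialization, false if it was already done.
     */
    boolean trigger() {
        if (!started.compareAndSet(false, true)) {
            return false; // already initialized (or being initialized) by an earlier call
        }
        try {
            initAction.run();
            return true;
        } catch (RuntimeException e) {
            started.set(false); // a failed attempt may be retried by a later call
            throw e;
        }
    }

    public static void main(String[] args) {
        OnDemandInitializer init =
                new OnDemandInitializer(() -> System.out.println("building consumer..."));
        System.out.println(init.trigger()); // first call performs the initialization
        System.out.println(init.trigger()); // later calls are no-ops
    }
}
```

Because a failure no longer happens in a constructor during application startup, the process stays alive and can be inspected with a packet capture while the failing call is reproduced.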

For how to capture packets in a k8s cluster, see: https://jiankunking.blog.csdn.net/article/details/125189956

The packet capture screenshot is as follows:
[Screenshot: packet capture]

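The capture shows that the network connection was established normally, and that at about 2.7 seconds the service actively reset (RST) the connection, which is also when the service threw the exception.

The packet capture also confirmed our idea: the network is fine, at least at the TCP level.

Although the capture had already confirmed this, I still deployed the problematic image to cluster B1, and the problem was still there.

So where could the problem be? The investigation was stuck for a while.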

During this time we also tried:

  • Changing the rocketmq address to https://…
  • Turning enableSsl off and on

and so on, but the problem remained unsolved.

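The service's configuration in the test environment is as follows:

Memory limit: 800m, request: 600m
CPU limit: 0.3 cores, request: 0.1 cores

In the test environment the service took more than 4 minutes to start. Could it be that with so little CPU, the JIT consumed too many resources during startup, so the service's network calls to rocketmq never got scheduled in time, resulting in a timeout?

After raising the CPU limit to 1 core, the service started in under a minute, but the exception persisted.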

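The investigation was stuck again. Then I remembered a few points:

1. For applications on the container cloud, we inject agents such as skywalking and prometheus by default
2. In my earlier, not very rigorous tests, skywalking took up 200~300m of memory
3. The java command is overridden, and the default heap size is 65% of the memory limit

800*0.65=520m; subtracting another 200~300m leaves roughly 220~320m for the service. Could the problem be caused by too little memory?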
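
The rough estimate above can be written out as a small calculation. This is only a restatement of the arithmetic in the text; `HeapBudget` and its method names are hypothetical, and like the estimate itself it simply subtracts the agent footprint from the heap figure, without claiming anything about where the agents' memory actually lives.

```java
// Hypothetical helper that reproduces the back-of-the-envelope estimate from the text.
final class HeapBudget {

    /** Default heap size in MiB given the container memory limit and the heap fraction. */
    static long heapMiB(long limitMiB, double heapFraction) {
        return Math.round(limitMiB * heapFraction);
    }

    /** Range {min, max} in MiB left for the service after subtracting the agent footprint. */
    static long[] leftForService(long heapMiB, long agentMinMiB, long agentMaxMiB) {
        return new long[] { heapMiB - agentMaxMiB, heapMiB - agentMinMiB };
    }

    public static void main(String[] args) {
        long heap = heapMiB(800, 0.65);                 // 65% of the 800m limit
        long[] left = leftForService(heap, 200, 300);   // skywalking estimated at 200~300m
        System.out.println("heap=" + heap + "m, left for service=" + left[0] + "~" + left[1] + "m");
    }
}
```

With the same assumptions, a 1200m limit gives a 780m heap and roughly 480~580m for the service.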

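After gradually raising the memory limit to 1200m, the error disappeared.

Looking back at the exception logs, I found that when the rocketmq exception occurred, HikariPool also had problems with its database connections.

HikariPool is initialized after the rocketmq init.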

2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.098  WARN 1 --- [  XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase          : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@5c4719b1 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.101  WARN 1 --- [  XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase          : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@33b9b407 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.103  WARN 1 --- [  XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase          : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@59c237c9 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.196  WARN 1 --- [  XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase          : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@4c517517 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.198  WARN 1 --- [  XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase          : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@28bf77bd (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.203  WARN 1 --- [  XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase          : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@59d2c94b (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
2023-12-14 09:43:24,cluster-test,did-test,jiankunking,jiankunking-6b86565dbb-lzt9d,2023-12-14 09:43:24.296  WARN 1 --- [  XNIO-1 task-1] com.zaxxer.hikari.pool.PoolBase          : [TID:95e19b4533204903a985c90b068cddcc_65_17025182033670077] --- HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl@a835f53 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.

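I then told my colleague the solution and thought that was the end of it…

However, in the afternoon my colleague found that after changing the memory limit back to 800m, the service also started normally. Had something gone wrong in the morning?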

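First, I used the earlier problematic image and lowered the memory limit, and the service really did start normally…

Since the service could now start with the earlier configuration, how much memory does it actually need to start normally when no agents are injected by default?

First, replace the startup command to remove the agents injected by default, such as skywalking and prometheus: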

java2
-server
-Xms300m
-Xmx300m
-jar
application.jar
--spring.profiles.active=test

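The service starts normally with 300m.

Then, after switching back to the startup command with the agents:

  • With memory limit = request = 500m, the service starts normally
  • With memory limit = request = 450m, the exception above occurs
  • With memory limit = request = 400m, the container is OOMKilled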

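At this point two conclusions can be drawn:

  • The exception is caused by insufficient memory
  • The threshold at which the exception appears is not yet clear
    • Two suspicions:
      • The memory used by injected agents such as skywalking fluctuates considerably under the influence of other factors
      • The memory used by the service's own data differed between the morning and the afternoon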
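
One way to start telling these two suspicions apart would be to log the JVM's own view of its memory at a few points during startup, once with and once without the agents. The sketch below uses only the standard `java.lang.management` API; the class name `MemorySnapshot` is hypothetical, and it only covers memory pools the JVM reports, so native allocations outside those pools would need other tools.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: prints heap and non-heap usage as reported by the JVM.
final class MemorySnapshot {

    /** Converts bytes to whole MiB (integer division). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** One-line summary of current heap and non-heap usage. */
    static String describe() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        MemoryUsage nonHeap = bean.getNonHeapMemoryUsage();
        // getMax() may be -1 when no maximum is defined, so it is printed in raw bytes.
        return "heap used=" + toMiB(heap.getUsed()) + "MiB"
                + " committed=" + toMiB(heap.getCommitted()) + "MiB"
                + " maxBytes=" + heap.getMax()
                + "; non-heap used=" + toMiB(nonHeap.getUsed()) + "MiB"
                + " committed=" + toMiB(nonHeap.getCommitted()) + "MiB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Comparing such snapshots across runs would show whether the variation comes from the heap, from non-heap areas such as metaspace and code cache, or from somewhere the JVM does not report.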

I also suspected that the cluster had oversold memory requests. After checking the configuration:

spec:
  ratio:
    limits.cpu: 500
    limits.memory: 500
    requests.cpu: 100
    requests.memory: 100

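Requests are not oversold; what is oversold is the limits.

Due to limited time, I have not yet analyzed the memory usage of the service itself or of skywalking and the other agents. To be continued when time permits.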
