GetRecords blocked for more than 10 minutes #1037
Hello @nonexu, thank you for reaching out to us. Just looking at your small snippet, are you only sleeping if there is an error? How often do you make a call to the service if there isn't an error?
@xibz It looks like this issue. I will make some modifications and test.
Hello @nonexu, I am going to go ahead and close this. If you are still having issues with this, please let us know.
Hello, @xibz

```
2017-01-19 00:22:13.580 main.go:93 [Info ] aws sdk debug info: [DEBUG: Request kinesis/GetRecords Details:
2017-01-19 00:34:18.304 main.go:93 [Info ] aws sdk debug info: [DEBUG: Response kinesis/GetRecords Details:
```
@nonexu - We log immediately when we have the body, IF you set the logger to log the body. Perhaps try profiling to see where the time is actually being spent.
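For reference, a minimal sketch of turning on request/response body logging in aws-sdk-go, which is what produces the `[DEBUG ...]` lines shown above (the region value is just an example):

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

// newDebugClient builds a Kinesis client that logs each request and
// response, including the HTTP body, as soon as the SDK has read it.
func newDebugClient() *kinesis.Kinesis {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:   aws.String("us-west-2"), // example region
		LogLevel: aws.LogLevel(aws.LogDebugWithHTTPBody),
	}))
	return kinesis.New(sess)
}
```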
@xibz Another question: the SDK debug log shows the header and body immediately when it receives them, so why is there no timeout? I tried setting an HTTP timeout of 3 minutes when initializing the SDK, but it did not fix this issue.
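For what it's worth, here is a minimal sketch of the kind of client-level timeout being described, set through a custom `http.Client` on the config; the 3-minute value simply mirrors what was tried above:

```go
import (
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

// newClientWithTimeout builds a Kinesis client whose underlying HTTP
// client gives up after 3 minutes. This bounds the whole request,
// including reading the response body.
func newClientWithTimeout() *kinesis.Kinesis {
	sess := session.Must(session.NewSession(&aws.Config{
		HTTPClient: &http.Client{Timeout: 3 * time.Minute},
	}))
	return kinesis.New(sess)
}
```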
@nonexu, I was wondering about those times. Those are sent back from the service.

```go
req, out := svc.GetRecordsRequest(&kinesis.GetRecordsInput{
	// values here
})
req.Handlers.Send.PushBack(func(r *request.Request) {
	fmt.Println("Time:", time.Now().UTC())
})
```

If the times are correct, I'd profile next. Please let me know if you have any issues with the profiling and/or with the snippet above.
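As a usage note, the same timing handler can also be attached at the client level instead of per request, so it fires for every call made through that client; a sketch, assuming an existing `*kinesis.Kinesis` client:

```go
import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

// addSendTimer registers the timing handler once on the client, so every
// request sent through svc prints a timestamp after its HTTP round trip.
func addSendTimer(svc *kinesis.Kinesis) {
	svc.Handlers.Send.PushBack(func(r *request.Request) {
		fmt.Println(r.Operation.Name, "completed at:", time.Now().UTC())
	})
}
```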
@xibz

```go
func (c *Kinesis) GetRecords(input *GetRecordsInput) (*GetRecordsOutput, error) {
	req, out := c.GetRecordsRequest(input)
	req.Handlers.Send.PushBack(func(r *request.Request) {
		fmt.Println("Time:", time.Now().UTC())
	})
	err := req.Send()
	return out, err
}
```

The issue was reproduced again; it took more than 1 hour this time.
Do you need more information? Some additional information about this issue is attached. Thank you in advance!
@nonexu - Looks like the service is returning the incorrect time. I would mention that in the ticket. I believe this is a service issue, but I think we should leave this issue open until it is solved. Please let us know if you need anything or any further information.
Thank you, I only want to know the reason for and solution to the issue.
@nonexu - The service team just got back to me and is asking for the instance id of the EC2 instance that this is occurring on. Can you please provide that?
@xibz Thanks for your help!
@xibz Thanks for your idea! Maybe I need to initialize a new Kinesis client when the iterator expires, but that is just a workaround.
@nonexu you shouldn't need a new client, but you may need to retrieve a new iterator if the current one expires.
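To illustrate the point about refreshing the iterator rather than recreating the client, here is a rough sketch of fetching a new shard iterator after an expired-iterator error; the stream/shard parameters and the choice of `LATEST` as the restart position are assumptions:

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

// isExpiredIterator reports whether err is the service's expired-iterator error.
func isExpiredIterator(err error) bool {
	if aerr, ok := err.(awserr.Error); ok {
		return aerr.Code() == "ExpiredIteratorException"
	}
	return false
}

// refreshIterator asks the service for a new iterator on the same shard.
// Restarting at LATEST is only one option; AFTER_SEQUENCE_NUMBER with the
// last processed sequence number avoids skipping records.
func refreshIterator(svc *kinesis.Kinesis, stream, shardID string) (string, error) {
	out, err := svc.GetShardIterator(&kinesis.GetShardIteratorInput{
		StreamName:        aws.String(stream),
		ShardId:           aws.String(shardID),
		ShardIteratorType: aws.String("LATEST"),
	})
	if err != nil {
		return "", err
	}
	return *out.ShardIterator, nil
}
```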
@nonexu - Additionally, how long does it take, typically, to experience a long delay?
@xibz Sorry for the delayed response; I was on leave for the past few days.
@nonexu - The service team is looking into it, but requires some information. I'll let them know how often this occurs. Thank you for providing the necessary information, and no worries about the delayed response!
@xibz
@xibz It is blocked now for 5 minutes on i-093416b7e6559c70b.
@nonexu - I have asked them to check. I will let you know when they get back to me.
@xibz Here is detailed information about this occurrence.
Hello @nonexu, the service team just got back to me. So, it looks like it is what we suspected earlier. I am unsure if the sleep I suggested in the loop is enough, but that instance was running at 100% CPU. I would profile and observe the instance to see what is causing the high CPU utilization.
@xibz Thanks for your feedback.
@nonexu - Sorry, I think I may have been unclear here. The issue here isn't the service or the SDK, but how it is being used. I suggested that a sleep would help. If it hasn't, I suggest profiling to see why your code base is utilizing 100% CPU. Please let us know if there is any more that we can do.
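In case it helps, a minimal sketch of exposing Go's built-in profiler so the hot loop can be inspected while the CPU sits at 100%; the port is just an example:

```go
import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

// startProfiler serves pprof endpoints in the background. While the CPU is
// pegged, capture a 30-second CPU profile with:
//   go tool pprof http://localhost:6060/debug/pprof/profile
func startProfiler() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}
```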
@xibz There is a 1-second sleep when it fails.
@nonexu - Yeah, it is suspicious that it only affects certain instances. Without being able to see all the instances and how they are configured, I can't draw any conclusions. I suggest looking into why the CPU usage spikes to 100%; that may give more context as to why it affects some instances and not others.
@xibz
@nonexu - You are only sleeping if there is an error or if the record length is 0. You need to add a sleep in your loop that is always hit. I just moved it to the beginning.

```go
for {
	time.Sleep(time.Second)
	params := &kinesis.GetRecordsInput{
		ShardIterator: aws.String(shardIterator), // Required
		Limit:         aws.Int64(1000),
	}
	fmt.Println("start get data, shardId:", shardId)
	result, err := this.client.GetRecords(params)
	fmt.Println("api return")
	if err != nil {
		fmt.Println("get record error:", err.Error())
		continue
	}
	recordLen := len(result.Records)
	fmt.Println("get records length:", recordLen)
	if recordLen > 0 {
		fmt.Println(result.Records[0])
	}
	shardIterator = *result.NextShardIterator
}
```

Try that and let us know if that helps!
@xibz Thanks for your reply.
@nonexu - Thank you for keeping us up to date. What was the CPU load? And how long was the delay?
@xibz It was delayed for 10 minutes this time.
@xibz Could you test this case on a server with the same OS version and configuration as i-093416b7e6559c70b?
Thank you @nonexu for all the information. I've forwarded your graph to the service team. I will also try to reproduce this on my end using your code.
@nonexu - The service team has asked for sar metrics of the instance during use to capture this delay. Can you please provide that?
Hello @nonexu, if you are still having issues with this, please feel free to reopen this. I am going to close this for now until we have further data.
Hi, we've merged in PR #1166, which adds retrying of connection reset errors discovered during unmarshaling of the response body, and read timeouts for Kinesis API Get operations. The read timeouts should prevent your application from hanging for significant amounts of time.
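Separately from that PR's built-in read timeouts, callers on SDK versions with context support can also bound a single call themselves; a sketch, assuming an existing client and iterator, with an illustrative 30-second deadline:

```go
import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

// getRecordsWithDeadline bounds one GetRecords call so the caller never
// hangs long enough for the shard iterator to expire.
func getRecordsWithDeadline(svc *kinesis.Kinesis, iterator string) (*kinesis.GetRecordsOutput, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	return svc.GetRecordsWithContext(ctx, &kinesis.GetRecordsInput{
		ShardIterator: aws.String(iterator),
	})
}
```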
Hi,
I encountered an issue where a call to GetRecords took more than 10 minutes.
As a result, there was an "Iterator expired" error the next time records were fetched.
Everything worked correctly when I started my application, and the issue appeared after several minutes.
According to the AWS documentation, GetRecords should not block.
I enabled the Kinesis API debug info; the log is attached.
Have you encountered this issue, and how can it be solved?
Thank you in advance!
debug_info.log.gz

The text was updated successfully, but these errors were encountered: