Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow log request performance #539

Open
deathalt opened this issue Jul 24, 2024 · 13 comments
Open

Slow log request performance #539

deathalt opened this issue Jul 24, 2024 · 13 comments
Labels

Comments

@deathalt
Copy link
Contributor

log format:

{"body":"body","traceid":"1","spanid":"1","severity":"Information","flags":1,"attributes":{"AccountId":1,"ActionId":"UUID","ActionName":"Controller","Stat":"1","Id":"2","ConnectionId":"123","GId":0,"ParentId":"UUID","RequestId":"UUID","RequestPath":"/path","SpanId":"1323aff603f45021","TraceId":"03016d03700e03ec8b2892038bacfd92","{OriginalFormat}":"INIT"},"resources":{"deployment.environment":"Stage","service.instance.id":"UUID","service.name":"Service","team":"Team","telemetry.sdk.language":"dotnet","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.9.0"},"instrumentation_scope":{"name":"Controller"}}

Loki request:

{level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= 85577686 | json body="body", Environment="resources["deployment.environment"]", Scope="instrumentation_scope["name"]" | Environment = Stage

Took more then 30 seconds and crush then.

{level=~"WARN|INFO|ERROR|FATAL", job="Service"}

works fine.

{level=~"WARN|INFO|ERROR|FATAL", job="Service"} | json body="body"

7 days gap -> 15.2 seconds response time, "Logs volume" in grafana 502

{level=~"WARN|INFO|ERROR|FATAL", job="Service"} | json body="body", Environment="resources[\"deployment.environment\"]", Scope="instrumentation_scope[\"name\"]" | Environment = Stage``

7 days gap -> 30 seconds response time -> 502 Client.Timeout exceeded while awaiting headers

Any chance to speed it up?

@akvlad
Copy link
Collaborator

akvlad commented Jul 29, 2024

Hello @deathalt

sum by (job) (count_over_time({level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= `85577686` [1d]))

can you count this over the last 7 days?

@akvlad
Copy link
Collaborator

akvlad commented Jul 29, 2024

Another question is I see no use of

body="body", Scope="instrumentation_scope[\"name\"]"

do you really need to extract these two labels?

@deathalt
Copy link
Contributor Author

@akvlad

sum by (job) (count_over_time({level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= 85577686 [1d]))

took 1.83 mins and fails then

image

Another question is I see no use of
body="body", Scope="instrumentation_scope["name"]"

I cannot answer, because it is one our dev use case, and I think he just want to get this label in log body

@akvlad
Copy link
Collaborator

akvlad commented Jul 30, 2024

I have to know the amount of data we deal with so I can reproduce it locally.

@deathalt what about

sum by (job) (count_over_time({level=~"WARN|INFO|ERROR|FATAL", job="Service"} [1d]))

?

@deathalt
Copy link
Contributor Author

@akvlad

image

1.59s

@akvlad
Copy link
Collaborator

akvlad commented Jul 30, 2024

@deathalt Clickhouse is definitely unable to parse about 2.5B rows in 1 minute.
Please increase the timeouts untill

sum by (job) (count_over_time({level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= ``85577686`` [1d]))

starts passing.
Then please try something like:

{level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= `85577686` |~ `"deployment.environment":\\s*"Stage"`

Thanks for the information. I'll try to ingest a comparable amount of data and try the request.

@deathalt
Copy link
Contributor Author

@akvlad

I don't understand what the problem is.
In my settings

  FASTIFY_BODYLIMIT: 15242880
  FASTIFY_REQUESTTIMEOUT: 0
  FASTIFY_MAXREQUESTS: 0

but I still get

Get “http://qryn-read.qryn.svc:3100/loki/api/v1/query_range?
net/http: timeout awaiting response headers (Client.Timeout exceeded while awaiting headers)

~1.5 minutes after the request

sum by (job) (count_over_time({level=~“WARN|INFO|ERROR|FATAL”, job=“Service”} |= 85577686[1d])))

@lmangani
Copy link
Collaborator

lmangani commented Jul 31, 2024

@deathalt no errors in the qryn logs preceding this response? Just to be sure could you also confirm you have no proxy between the client and qryn?

@deathalt
Copy link
Contributor Author

deathalt commented Jul 31, 2024

@lmangani

│ qryn-read-5bcf8468c5-mvqf4 {"level":30,"time":1722422596334,"pid":19,"hostname":"qryn-read-5bcf8468c5-mvqf4","name":"qryn","reqId":"req-1w","res":{"statusCode":200},"msg":"stream closed prematurely"}                                                                                                                                               │
│ qryn-read-5bcf8468c5-mvqf4 {"level":50,"time":1722422596334,"pid":19,"hostname":"qryn-read-5bcf8468c5-mvqf4","name":"qryn","reqId":"req-1w","err":"premature close\nError: premature close\n    at onclosenexttick (/app/node_modules/end-of-stream/index.js:54:86)\n    at process.processTicksAndRejections (node:internal/process/task_queues:77:1 │
│ 1)","msg":"premature close"} 

only this

clickhouse -> consul -> qryn

no proxy at all

@lmangani
Copy link
Collaborator

Is consul providing the connectivity when the queries run? I'm not familiar with it.
If it is involved with providing networking are you sure it doesn't introduce any session timers between client/server?

@deathalt
Copy link
Contributor Author

Ok, there was default clickhouse timeouts

Here you can find result of

sum by (job) (count_over_time({level=~“WARN|INFO|ERROR|FATAL”, job=“Service”} |= 85577686[1d])))

image

@lmangani
Copy link
Collaborator

lmangani commented Jul 31, 2024

@deathalt thanks for the update. Which ClickHouse setting was affecting this? I'll add it to the configuration notes.
Did this positively affect any of the previously reported failures as a side-effect?

@akvlad
Copy link
Collaborator

akvlad commented Jul 31, 2024

@deathalt so you're trying to filter about 350 rows out of 3.3B . Mmmm...
There are not a lot of ways to improve the raw power of grep.
Maybe you can manually configure skip indexes on samples_v3 table. It can improve LIKE '%.....%' of |= 85577686 part

It's an interesting experiment to import 1B of rows and check the difference between skipIndexes and nothing.

@lmangani lmangani added the 3.x label Dec 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants