2022-07-22

Thanos Compactor Crashing - Download blocks timeout resulting in block overlap?

Copied from my issue here. Looking to post here to see if I can get any attention because I'm stumped on this.

Thanos Compactor is crashing seemingly regularly after running for some period of time, timing out and then later experiences overlaps in the object store, seemingly due to the crash from before.

Thanos, Prometheus and Golang version used:

Thanos: v0.27.0

Prometheus: 2.25.2

Golang: go1.15.10

Here's my docker-compose for good measure:

services:
  sidecar:
    image: thanosio/thanos:v0.27.0
    container_name: thanos-sidecar
    command:
      - "sidecar"
      - "--debug.name=sidecar"
      - "--grpc-address=0.0.0.0:10901"
      - "--http-address=0.0.0.0:10902"
      - "--prometheus.url=http://172.17.0.1:9090"
      - "--tsdb.path=/data/"
      - "--objstore.config-file=/bucket.yml"
      - "--shipper.upload-compacted"
    ports:
      - 10902:10902
    restart: always
    volumes:
      - /data:/data
      - ./azure_bucket_config.yml:/bucket.yml
    networks:
      - thanos

  store:
    image: thanosio/thanos:v0.27.0
    container_name: thanos-store
    restart: always
    command:
      - "store"
      - "--grpc-address=0.0.0.0:10905"
      - "--http-address=0.0.0.0:10906"
      - "--data-dir=/data/store"
      - "--objstore.config-file=/bucket.yml"
    ports:
      - 10906:10906
    volumes:
      - store:/data
      - ./azure_bucket_config.yml:/bucket.yml
    networks:
      - thanos

  query:
    image: thanosio/thanos:v0.27.0
    container_name: thanos-query
    command:
      - "query"
      - "--grpc-address=0.0.0.0:10903"
      - "--http-address=0.0.0.0:10904"
      - "--query.replica-label=prometheus"
      - "--store=thanos-store:10905"
      - "--query.auto-downsampling"
    ports:
      - 10904:10904
    depends_on:
      - sidecar
      - store
    networks:
      - thanos
    restart: always

  compactor:
    image: thanosio/thanos:v0.27.0
    container_name: thanos-compactor
    command:
      - "compact"
      - "--wait"
      - "--data-dir=/tmp/thanos-compact"
      - "--objstore.config-file=/bucket.yml"
      - "--http-address=0.0.0.0:10907"
      - "--retention.resolution-raw=182d"
      - "--retention.resolution-5m=1y"
      - "--retention.resolution-1h=2y"
    ports:
      - 10907:10907
    volumes:
      - compact:/tmp
      - ./azure_bucket_config.yml:/bucket.yml
      - ./fs_backup_config.yml:/backup.yml
    depends_on:
      - sidecar
      - store
    networks:
      - thanos
    restart: always

networks:
  thanos: {}
volumes:
  store: {}
  compact: {}

Object Storage Provider: Azure Blob Storage

Bucket config
type: AZURE
config:
  storage_account: "xxxx"
  storage_account_key: "xxxx"
  container: "prometheus4"
  endpoint: ""
  max_retries: 0
  msi_resource: ""
  user_assigned_id: ""
  pipeline_config:
    max_tries: 0
    try_timeout: 0s
    retry_delay: 0s
    max_retry_delay: 0s
  reader_config:
    max_retry_requests: 0
  http_config:
    idle_conn_timeout: 0s
    response_header_timeout: 0s
    insecure_skip_verify: false
    tls_handshake_timeout: 0s
    expect_continue_timeout: 0s
    max_idle_conns: 0
    max_idle_conns_per_host: 0
    max_conns_per_host: 0
    tls_config:
      ca_file: ""
      cert_file: ""
      key_file: ""
      server_name: ""
      insecure_skip_verify: false
    disable_compression: false
prefix: "prometheus4"

Full logs to relevant components:

Initial Crash and Reboot - 2022-07-19 5:03 UTC
level=info ts=2022-07-19T05:00:57.370656812Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=339.294198ms duration_ms=339 cached=91 returned=91 partial=0
level=info ts=2022-07-19T05:01:57.342172313Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=311.118213ms duration_ms=311 cached=91 returned=91 partial=0
level=info ts=2022-07-19T05:02:57.354952621Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=323.230806ms duration_ms=323 cached=91 returned=91 partial=0
level=warn ts=2022-07-19T05:03:14.1241479Z caller=objstore.go:289 msg="failed to remove file on partial dir download error" file=/tmp/thanos-compact/downsample/01G29KQFEVC2QW98275P3EHWD9 err="remove /tmp/thanos-compact/downsample/01G29KQFEVC2QW98275P3EHWD9: directory not empty"
level=warn ts=2022-07-19T05:03:14.253037996Z caller=intrumentation.go:67 msg="changing probe status" status=not-ready reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: download block 01G29KQFEVC2QW98275P3EHWD9: get file 01G29KQFEVC2QW98275P3EHWD9/chunks/000132: cannot download blob, address: https://prometheuseast.blob.core.windows.net/prometheus4/prometheus4/01G29KQFEVC2QW98275P3EHWD9/chunks/000132: context deadline exceeded"
level=info ts=2022-07-19T05:03:14.267923621Z caller=http.go:84 service=http/server component=compact msg="internal server is shutting down" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: download block 01G29KQFEVC2QW98275P3EHWD9: get file 01G29KQFEVC2QW98275P3EHWD9/chunks/000132: cannot download blob, address: https://prometheuseast.blob.core.windows.net/prometheus4/prometheus4/01G29KQFEVC2QW98275P3EHWD9/chunks/000132: context deadline exceeded"
level=info ts=2022-07-19T05:03:14.278933395Z caller=http.go:103 service=http/server component=compact msg="internal server is shutdown gracefully" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: download block 01G29KQFEVC2QW98275P3EHWD9: get file 01G29KQFEVC2QW98275P3EHWD9/chunks/000132: cannot download blob, address: https://prometheuseast.blob.core.windows.net/prometheus4/prometheus4/01G29KQFEVC2QW98275P3EHWD9/chunks/000132: context deadline exceeded"
level=info ts=2022-07-19T05:03:14.279031558Z caller=intrumentation.go:81 msg="changing probe status" status=not-healthy reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: download block 01G29KQFEVC2QW98275P3EHWD9: get file 01G29KQFEVC2QW98275P3EHWD9/chunks/000132: cannot download blob, address: https://prometheuseast.blob.core.windows.net/prometheus4/prometheus4/01G29KQFEVC2QW98275P3EHWD9/chunks/000132: context deadline exceeded"
level=error ts=2022-07-19T05:03:14.321547079Z caller=main.go:158 err="downsampling to 5 min: download block 01G29KQFEVC2QW98275P3EHWD9: get file 01G29KQFEVC2QW98275P3EHWD9/chunks/000132: cannot download blob, address: https://prometheuseast.blob.core.windows.net/prometheus4/prometheus4/01G29KQFEVC2QW98275P3EHWD9/chunks/000132: context deadline exceeded\nfirst pass of downsampling failed\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:440\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:476\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:75\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:475\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581\nerror executing compaction\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:503\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:75\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:475\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:158\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"
level=info ts=2022-07-19T05:03:28.2214006Z caller=factory.go:50 msg="loading bucket configuration"
level=info ts=2022-07-19T05:03:28.61335506Z caller=compact.go:387 msg="retention policy of raw samples is enabled" duration=4368h0m0s
level=info ts=2022-07-19T05:03:28.613473154Z caller=compact.go:394 msg="retention policy of 5 min aggregated samples is enabled" duration=8760h0m0s
level=info ts=2022-07-19T05:03:28.61350257Z caller=compact.go:397 msg="retention policy of 1 hour aggregated samples is enabled" duration=17520h0m0s
level=info ts=2022-07-19T05:03:28.615536041Z caller=compact.go:615 msg="starting compact node"
level=info ts=2022-07-19T05:03:28.615580162Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2022-07-19T05:03:28.616391532Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2022-07-19T05:03:28.616423401Z caller=http.go:73 service=http/server component=compact msg="listening for requests and metrics" address=0.0.0.0:10907
level=info ts=2022-07-19T05:03:28.6176218Z caller=tls_config.go:195 service=http/server component=compact msg="TLS is disabled." http2=false
level=info ts=2022-07-19T05:03:28.625083387Z caller=compact.go:1268 msg="start sync of metas"
level=info ts=2022-07-19T05:03:30.30407491Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.688172615s duration_ms=1688 cached=92 returned=92 partial=0
level=info ts=2022-07-19T05:03:31.83913099Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.534884123s duration_ms=1534 cached=92 returned=92 partial=0
level=info ts=2022-07-19T05:03:32.116512785Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=3.500685699s duration_ms=3500 cached=92 returned=47 partial=0

Compaction Starts - Overlapped Blocks Halting Compaction - 2022-07-19 13:30 UTC

level=info ts=2022-07-19T13:30:39.589842814Z caller=compact.go:1273 msg="start of GC"
level=info ts=2022-07-19T13:30:39.590738529Z caller=compact.go:1296 msg="start of compactions"
level=error ts=2022-07-19T13:30:39.66560371Z caller=compact.go:486 msg="critical error detected; halting" err="compaction: group 0@8403025132557603590: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1658116800000, maxt: 1658124000000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQQ7FFAVMBYHD7EEBDYH79, mint: 1658116800000, maxt: 1658124000000, range: 2h0m0s>\n[mint: 1658131200000, maxt: 1658138400000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQRJ4DP1QXEX6MMVSJ88X3, mint: 1658131200000, maxt: 1658138400000, range: 2h0m0s>\n[mint: 1658044800000, maxt: 1658052000000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQHG3JRPNQXFYK643Y1F0Q, mint: 1658044800000, maxt: 1658052000000, range: 2h0m0s>\n[mint: 1658073600000, maxt: 1658080800000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQM4XZJH5BYH118CEP2AWP, mint: 1658073600000, maxt: 1658080800000, range: 2h0m0s>\n[mint: 1658066400000, maxt: 1658073600000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQKMDTW9C7AFMB2FM67E46, mint: 1658066400000, maxt: 1658073600000, range: 2h0m0s>\n[mint: 1658080800000, maxt: 1658088000000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQMNE9H6QNZJA4XAESFEWQ, mint: 1658080800000, maxt: 1658088000000, range: 2h0m0s>\n[mint: 1658088000000, maxt: 1658095200000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQN5X1FYNB5DZ6NKPRYYF3, mint: 1658088000000, maxt: 1658095200000, range: 2h0m0s>\n[mint: 1658095200000, maxt: 1658102400000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQNNVR4GYETFJNXBHHN7RA, mint: 1658095200000, maxt: 1658102400000, range: 2h0m0s>\n[mint: 1658102400000, maxt: 1658109600000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQP5FNZE93V91HE19BE9NS, mint: 1658102400000, maxt: 1658109600000, range: 2h0m0s>\n[mint: 1658109600000, maxt: 1658116800000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQPPPZCWDGRW5WQHTQ7CRQ, mint: 1658109600000, maxt: 1658116800000, range: 2h0m0s>\n[mint: 1658052000000, maxt: 1658059200000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQJD3VM384NDV7G0KF0TGM, mint: 1658052000000, maxt: 1658059200000, range: 2h0m0s>\n[mint: 1658059200000, maxt: 1658066400000, range: 2h0m0s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQK1PVP7PZWZDD34SE29FK, mint: 1658059200000, maxt: 1658066400000, range: 2h0m0s>\n[mint: 1658124000002, maxt: 1658131200000, range: 1h59m59s, blocks: 2]: <ulid: 01G8ADQB9EACGF27TJX4426ZZD, mint: 1658016000000, maxt: 1658188800000, range: 48h0m0s>, <ulid: 01G8AQQPNV65T33D285NXJCWJE, mint: 1658124000002, maxt: 1658131200000, range: 1h59m59s>"

This isn't the first time I've experienced this, and I know I can solve it by going in and manually removing the overlapped blocks. I have done that twice before, but each time this crash occurs and then overlapped blocks seem to arise afterwards.

Here's the meta.json for the first two overlapped blocks. It looks like the sidecar uploaded the conflicting block, but there's no errors in the sidecar.

Block 01G8ADQB9EACGF27TJX4426ZZD meta.json

{
    "ulid": "01G8ADQB9EACGF27TJX4426ZZD",
    "minTime": 1658016000000,
    "maxTime": 1658188800000,
    "stats": {
        "numSamples": 5667734569,
        "numSeries": 352368,
        "numChunks": 47768530
    },
    "compaction": {
        "level": 3,
        "sources": [
            "01G851JCXS09NZT8CPCY5AHY5B",
            "01G863X15QW488C5R7Z4BZ152W",
            "01G86ARRE2FAR3ZY97RSKSXG60",
            "01G86HMFNR3PX5X5X8MYTPMAMZ",
            "01G86RG6XVDGGET9FAR7X13M5A",
            "01G86ZBY5SCB4YP4D1RYEA76SR",
            "01G8767NDRGX33QYWSJYNMT5P6",
            "01G87D3CP3SW9PTWB80YT8WJTP",
            "01G87KZ3XR8ECMDMBYW71YP0C4",
            "01G87TTV6768PW4DWFPBAE4S4E",
            "01G881PJDRY8374W44YRTM52Q1",
            "01G885KERJY442EJ36G10EFFVB",
            "01G885N9ADNV6WZD1RWBDDJHWC",
            "01G885Q88ZB70TMRWFRBG7CYK3",
            "01G885S39FNMZ1VCA6PJ3KS4HW",
            "01G888J9P20F42QWWM11STJEZK",
            "01G88FE0Y2D9GEG8SF20TG8R3N",
            "01G88P9R5RXFJMR7PJZ5J092ZA",
            "01G88X5FDSAWJYRD2X1YVY5WGC",
            "01G89416NR1A53R1Q6PGPWZ0FX",
            "01G89AWXZA7C1H41AGY8P93ZYD",
            "01G89HRN62EJASHRBRKEW11AV9",
            "01G89RMCE95MB5D7F97RZ1BEDN",
            "01G89ZG3NDE8WDBJSZEC3ADMPV"
        ],
        "parents": [
            {
                "ulid": "01G898BMCA4N2KPDBQPYBATGGE",
                "minTime": 1658016000000,
                "maxTime": 1658044800000
            },
            {
                "ulid": "01G898J50T0KVQPMADAQ8H8RX1",
                "minTime": 1658044800000,
                "maxTime": 1658073600000
            },
            {
                "ulid": "01G8995AK55QDGX98Q00J7X3R1",
                "minTime": 1658073600000,
                "maxTime": 1658102400000
            },
            {
                "ulid": "01G899SJMW46VRKGVM5806QY94",
                "minTime": 1658102400000,
                "maxTime": 1658131200000
            },
            {
                "ulid": "01G89DA51XWY21PPSA0KRJMBVF",
                "minTime": 1658131200000,
                "maxTime": 1658160000000
            },
            {
                "ulid": "01G8ADEYQKW5QWMBD9NDHZEJNR",
                "minTime": 1658160000000,
                "maxTime": 1658188800000
            }
        ]
    },
    "version": 1,
    "thanos": {
        "labels": {
            "id": "prometheus4"
        },
        "downsample": {
            "resolution": 0
        },
        "source": "compactor",
        "segment_files": [
            "000001",
            "000002",
            "000003",
            "000004",
            "000005",
            "000006",
            "000007",
            "000008",
            "000009",
            "000010",
            "000011",
            "000012",
            "000013",
            "000014",
            "000015",
            "000016",
            "000017",
            "000018",
            "000019",
            "000020"
        ],
        "files": [
            {
                "rel_path": "chunks/000001",
                "size_bytes": 536870753
            },
            {
                "rel_path": "chunks/000002",
                "size_bytes": 536869730
            },
            {
                "rel_path": "chunks/000003",
                "size_bytes": 536869544
            },
            {
                "rel_path": "chunks/000004",
                "size_bytes": 536869478
            },
            {
                "rel_path": "chunks/000005",
                "size_bytes": 536870664
            },
            {
                "rel_path": "chunks/000006",
                "size_bytes": 536870217
            },
            {
                "rel_path": "chunks/000007",
                "size_bytes": 536870435
            },
            {
                "rel_path": "chunks/000008",
                "size_bytes": 536870734
            },
            {
                "rel_path": "chunks/000009",
                "size_bytes": 536870209
            },
            {
                "rel_path": "chunks/000010",
                "size_bytes": 536870805
            },
            {
                "rel_path": "chunks/000011",
                "size_bytes": 536870086
            },
            {
                "rel_path": "chunks/000012",
                "size_bytes": 536870358
            },
            {
                "rel_path": "chunks/000013",
                "size_bytes": 536869918
            },
            {
                "rel_path": "chunks/000014",
                "size_bytes": 536869493
            },
            {
                "rel_path": "chunks/000015",
                "size_bytes": 536870520
            },
            {
                "rel_path": "chunks/000016",
                "size_bytes": 536870318
            },
            {
                "rel_path": "chunks/000017",
                "size_bytes": 536870035
            },
            {
                "rel_path": "chunks/000018",
                "size_bytes": 536869542
            },
            {
                "rel_path": "chunks/000019",
                "size_bytes": 536870567
            },
            {
                "rel_path": "chunks/000020",
                "size_bytes": 255732013
            },
            {
                "rel_path": "index",
                "size_bytes": 367206459
            },
            {
                "rel_path": "meta.json"
            }
        ]
    }
}

Block 01G8AQQ7FFAVMBYHD7EEBDYH79 meta.json

{
    "ulid": "01G8AQQ7FFAVMBYHD7EEBDYH79",
    "minTime": 1658116800000,
    "maxTime": 1658124000000,
    "stats": {
        "numSamples": 167502131,
        "numSeries": 174658,
        "numChunks": 1413697
    },
    "compaction": {
        "level": 1,
        "sources": [
            "01G8AQQ7FFAVMBYHD7EEBDYH79"
        ],
        "parents": [
            {
                "ulid": "01G881PJDRY8374W44YRTM52Q1",
                "minTime": 1658116800000,
                "maxTime": 1658124000000
            }
        ]
    },
    "version": 1,
    "thanos": {
        "labels": {
            "id": "prometheus4"
        },
        "downsample": {
            "resolution": 0
        },
        "source": "sidecar",
        "segment_files": [
            "000001"
        ],
        "files": [
            {
                "rel_path": "chunks/000001",
                "size_bytes": 367378075
            },
            {
                "rel_path": "index",
                "size_bytes": 25854440
            },
            {
                "rel_path": "meta.json"
            }
        ]
    }
}

It may also be worth mentioning that it looks like the block each conflict occurs in, 01G8ADQB9EACGF27TJX4426ZZD, was uploaded before the blocks uploaded by the sidecar. So maybe the sidecar is the root of the issue? Could it be because of the --shipper.upload-compacted flag? I had this flag enabled to migrate from historical Prometheus TSDB, but it should be done now. Would this be causing this kind of behaviour?

Block 01G8ADQB9EACGF27TJX4426ZZD, uploaded 2022-07-18 image

Block 01G8AQQ7FFAVMBYHD7EEBDYH79, uploaded 2022-07-19 image

Edit: Tried re-creating the sidecar without --shipper.upload-compacted and manually resolved the conflict in the object store. Left overnight and found that there was no timeout, however compaction was still halted because of overlapped blocks, so I can rule this out.



No comments:

Post a Comment