[Bug][Connector-V2][Hbase] HBase source only scans the first split/region, read count << HBase count #10286

@yzeng1618

Search before asking

  • I have searched the issues and found no similar issues.

What happened

When running an HBase -> HBase pipeline with the SeaTunnel connector-v2 HBase source, the job finishes successfully but the Total Read Count is much smaller than the real row count in HBase.

  • In the HBase shell, count 'assign_cf_table', {COLUMNS=>'cf1', CACHE=>10000} returns 10,000,000 rows.
  • The SeaTunnel job summary shows a Total Read Count / Total Write Count of about 2,460,778.
  • Logs show that multiple splits were assigned (e.g. "Assigning 4 splits to subtask: 0"), but the source finishes quickly.
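
As a quick sanity check on the numbers above, the read count can be compared against what a single split would contribute. This is only a rough sketch: it assumes the 4 assigned splits hold roughly equal numbers of rows, which I have not verified, and the helper name read_fraction is just for illustration.

```python
# Figures quoted in this report
hbase_rows = 10_000_000      # HBase shell count
seatunnel_rows = 2_460_778   # approximate Total Read Count from the job summary
assigned_splits = 4          # from the log line "Assigning 4 splits to subtask: 0"


def read_fraction(read: int, total: int) -> float:
    """Fraction of the table that the source actually read."""
    return read / total


fraction = read_fraction(seatunnel_rows, hbase_rows)

# Assumption: the splits are roughly the same size, so one split ~ 1/4 of the table.
expected_if_one_split = 1 / assigned_splits

print(f"fraction read:          {fraction:.4f}")
print(f"one split out of {assigned_splits}:     {expected_if_one_split:.4f}")
print(f"rows missing:           {hbase_rows - seatunnel_rows}")
```

The fraction read is close to one quarter, which would fit a source that scans only one of the four assigned splits.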

SeaTunnel Version

2.3.12

SeaTunnel Config

env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  Hbase {
    zookeeper_quorum = "<zk1:2181,zk2:2181,zk3:2181>"
    table = "assign_cf_table"
    caching = 100000
    batch = 100
    cache_blocks = false

    # kerberos configs (masked)
    hbase.client.kerberos.principal = "<principal>"
    hbase.client.keytab.file = "<keytab>"
    krb5_path = "<krb5.conf>"

    hbase_extra_config = {
      "hbase.security.authentication" = "kerberos"
      "hadoop.security.authentication" = "kerberos"
      "hbase.master.kerberos.principal" = "hbase/_HOST@REALM"
      "hbase.regionserver.kerberos.principal" = "hbase/_HOST@REALM"
      "hbase.rpc.protection" = "authentication"
      "hbase.zookeeper.useSasl" = "false"
    }

    schema = {
      columns = [
        { name = "rowkey" type = string },
        { name = "cf1:id" type = string },
        { name = "cf1:name" type = string }
      ]
    }
  }
}

sink {
  Hbase {
    zookeeper_quorum = "<zk1:2181,zk2:2181,zk3:2181>"
    table = "assign_cf_table3"
    rowkey_column = ["rowkey"]
    family_name { all_columns = "cf1" }

    hbase_extra_config = {
      "hbase.security.authentication" = "kerberos"
      "hadoop.security.authentication" = "kerberos"
      "hbase.master.kerberos.principal" = "hbase/_HOST@REALM"
      "hbase.regionserver.kerberos.principal" = "hbase/_HOST@REALM"
      "hbase.rpc.protection" = "authentication"
      "hbase.zookeeper.useSasl" = "false"
    }
  }
}

Running Command

./bin/seatunnel.sh -e flink -c /path/to/hbase2hbase.conf

Error Exception

N/A (job finishes successfully; only read/write count mismatch)
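
Since no exception is raised, my guess is that the reader signals completion after draining its first split. The toy simulation below only illustrates that suspected pattern; it is not SeaTunnel's actual reader code, and the function names (make_splits, read_first_split_only, read_all_splits) are hypothetical.

```python
from collections import deque


def make_splits(total_rows: int, n_splits: int) -> list:
    """Partition row ids 0..total_rows-1 into n_splits contiguous ranges."""
    step = total_rows // n_splits
    bounds = [i * step for i in range(n_splits)] + [total_rows]
    return [range(bounds[i], bounds[i + 1]) for i in range(n_splits)]


def read_first_split_only(splits: list) -> int:
    """Suspected faulty pattern: finish after the first assigned split."""
    pending = deque(splits)
    count = 0
    if pending:
        for _ in pending.popleft():
            count += 1
    # Any splits still left in `pending` are never scanned.
    return count


def read_all_splits(splits: list) -> int:
    """Expected pattern: keep polling until no split is pending."""
    pending = deque(splits)
    count = 0
    while pending:
        for _ in pending.popleft():
            count += 1
    return count


splits = make_splits(total_rows=1000, n_splits=4)
print(read_first_split_only(splits))  # 250
print(read_all_splits(splits))        # 1000
```

In this toy setup both readers finish without error, but the first one reports only a quarter of the rows, which matches the shape of the mismatch described above.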

Zeta or Flink or Spark Version

Flink 1.20.1

Java or Scala Version

Java 8

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
