Conversation

@fivetran-surabhisingh
Collaborator

Jira ticket

Closes <ADD TICKET LINK HERE, EACH PR MUST BE LINKED TO A JIRA TICKET>

Description of Change

<MENTION A SHORT DESCRIPTION OF YOUR CHANGES HERE>

Testing

<MENTION ABOUT YOUR TESTING DETAILS HERE, ATTACH SCREENSHOTS IF NEEDED (WITHOUT PII)>

Checklist

Some tips and links to help validate your PR:

  • Tested the connector with the fivetran debug command.
  • Added/updated the example-specific README.md file; refer here for the template.
  • Followed the Python Coding Standards; refer here.

@fivetran-surabhisingh fivetran-surabhisingh self-assigned this Oct 31, 2025
@fivetran-surabhisingh fivetran-surabhisingh requested a review from a team as a code owner October 31, 2025 18:21
@fivetran-surabhisingh fivetran-surabhisingh added the hackathon label (For all the PRs related to the internal Fivetran 2025 Connector SDK Hackathon) on Oct 31, 2025
@github-actions github-actions bot added the size/M label (PR size: medium) on Oct 31, 2025
@github-actions

🧹 Python Code Quality Check

⚠️ Flake8 has detected issues; please fix them before merging:

📎 Download full report from workflow artifacts.

📌 Only Python files changed in this PR were checked.

🔍 See how this check works

This comment is auto-updated with every commit.

@fivetran-sahilkhirwal
Contributor

We already have a ready-to-use example in the examples repo here:
https://github.com/fivetran/fivetran_connector_sdk/tree/main/connectors/redshift/large_data_volume

Contributor

Copilot AI left a comment


Pull Request Overview

This PR attempts to add a Redshift multithreading connector example, but the implementation uses an incompatible API pattern and is missing critical required files and functionality.

Key issues:

  • Uses a decorator-based API (@connector, config.Config(), schema.Schema()) that does not exist in the Fivetran Connector SDK
  • Missing all required files: configuration.json, README.md, and requirements.txt
  • Violates SDK multithreading guidelines by calling SDK operations from worker threads
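
For reference, a minimal sketch of the shape that last guideline implies — worker threads only fetch rows into a queue, and the main thread alone performs SDK operations. This assumes the Operations as op import named later in this review; fetch_batch, run, and the thread/batch counts are illustrative, not from the PR:

import queue
import threading

from fivetran_connector_sdk import Operations as op

def fetch_worker(offset, row_queue):
    # Worker threads only read from Redshift; no SDK calls happen here.
    for row in fetch_batch(offset):  # fetch_batch is a hypothetical helper that runs the SELECT
        row_queue.put(row)

def run(state):
    row_queue = queue.Queue()
    workers = [threading.Thread(target=fetch_worker, args=(i * 100, row_queue)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Only the main thread calls upsert and checkpoint, per the multithreading guidelines.
    while not row_queue.empty():
        row = row_queue.get()
        op.upsert(table="redshift_table", data={"id": str(row[0]), "data": row})
    op.checkpoint(state)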

Comment on lines +6 to +57
import threading
import psycopg2
from fivetran_connector_sdk import connector, config, state, records, log, schema

CONFIG = config.Config(
    host=config.StringField(),
    port=config.IntegerField(default=5439),
    database=config.StringField(),
    user=config.StringField(),
    password=config.SecretField(),
    threads=config.IntegerField(default=4)
)

SCHEMA = schema.Schema(
    name="redshift_table",
    columns={
        "id": schema.StringColumn(),
        "data": schema.JSONColumn(),
    }
)

@connector(
    name="RedshiftMultithreadingConnector",
    version="0.1.0",
    config=CONFIG,
    schema=SCHEMA,
)
def run_connector(ctx: state.Context):
    def worker(offset):
        conn = psycopg2.connect(
            host=ctx.config.host,
            dbname=ctx.config.database,
            user=ctx.config.user,
            password=ctx.config.password,
            port=ctx.config.port,
        )
        cur = conn.cursor()
        cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")
        for row in cur.fetchall():
            records.write("redshift_table", {"id": row[0], "data": row})
        conn.close()

    threads = []
    for i in range(ctx.config.threads):
        t = threading.Thread(target=worker, args=(i * 100,))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

    return ctx.update_state({"last_sync": "now"})

Copilot AI Nov 20, 2025


This connector uses an incorrect API pattern for the Fivetran Connector SDK. The SDK does not support decorator-based connectors with @connector, config.Config(), or schema.Schema() patterns.

The correct pattern requires:

  1. Import: from fivetran_connector_sdk import Connector, Operations as op, Logging as log
  2. Define update(configuration: dict, state: dict) function
  3. Define schema(configuration: dict) function
  4. Initialize: connector = Connector(update=update, schema=schema)

Please refer to the template at template_example_connector/connector.py or existing examples like connectors/redshift/simple_redshift_connector/connector.py for the correct structure.

Suggested change
import threading
import psycopg2
from fivetran_connector_sdk import connector, config, state, records, log, schema

CONFIG = config.Config(
    host=config.StringField(),
    port=config.IntegerField(default=5439),
    database=config.StringField(),
    user=config.StringField(),
    password=config.SecretField(),
    threads=config.IntegerField(default=4)
)

SCHEMA = schema.Schema(
    name="redshift_table",
    columns={
        "id": schema.StringColumn(),
        "data": schema.JSONColumn(),
    }
)

@connector(
    name="RedshiftMultithreadingConnector",
    version="0.1.0",
    config=CONFIG,
    schema=SCHEMA,
)
def run_connector(ctx: state.Context):
    def worker(offset):
        conn = psycopg2.connect(
            host=ctx.config.host,
            dbname=ctx.config.database,
            user=ctx.config.user,
            password=ctx.config.password,
            port=ctx.config.port,
        )
        cur = conn.cursor()
        cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")
        for row in cur.fetchall():
            records.write("redshift_table", {"id": row[0], "data": row})
        conn.close()

    threads = []
    for i in range(ctx.config.threads):
        t = threading.Thread(target=worker, args=(i * 100,))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

    return ctx.update_state({"last_sync": "now"})

import threading  # For multithreaded data extraction
import psycopg2  # For connecting to Amazon Redshift
import json  # For reading configuration from JSON file
from fivetran_connector_sdk import Connector  # For connector initialization
from fivetran_connector_sdk import Logging as log  # For logging
from fivetran_connector_sdk import Operations as op  # For upsert and checkpoint operations

# Constants for configuration keys and batch size
__BATCH_SIZE = 100  # Number of records per thread
__TABLE_NAME = "redshift_table"
__MAX_THREADS = 16  # Maximum allowed threads


def validate_configuration(configuration: dict):
    """
    Validate the configuration dictionary to ensure it contains all required parameters.
    This function is called at the start of the update method to ensure that the connector has all necessary configuration values.
    Args:
        configuration: a dictionary that holds the configuration settings for the connector.
    Raises:
        ValueError: if any required configuration parameter is missing or invalid.
    """
    required_configs = ["host", "port", "database", "user", "password", "threads"]
    for key in required_configs:
        if key not in configuration:
            raise ValueError(f"Missing required configuration value: {key}")
    if not isinstance(configuration["port"], int) or not (0 < configuration["port"] < 65536):
        raise ValueError("Port must be a valid integer between 1 and 65535.")
    if not isinstance(configuration["threads"], int) or not (1 <= configuration["threads"] <= __MAX_THREADS):
        raise ValueError(f"Threads must be an integer between 1 and {__MAX_THREADS}.")
    # Additional validation can be added here as needed


def schema(configuration: dict):
    """
    Define the schema function which lets you configure the schema your connector delivers.
    See the technical reference documentation for more details on the schema function:
    https://fivetran.com/docs/connectors/connector-sdk/technical-reference#schema
    Args:
        configuration: a dictionary that holds the configuration settings for the connector.
    """
    return [
        {
            "table": __TABLE_NAME,
            "primary_key": ["id"],
            "columns": {
                "id": "STRING",
                "data": "JSON"
            }
        }
    ]


def update(configuration: dict, state: dict):
    """
    Define the update function which lets you configure how your connector fetches data.
    See the technical reference documentation for more details on the update function:
    https://fivetran.com/docs/connectors/connector-sdk/technical-reference#update
    Args:
        configuration: a dictionary that holds the configuration settings for the connector.
        state: a dictionary that holds the state of the connector.
    """
    log.warning("Example: DATABASE : Redshift Multithreading")
    validate_configuration(configuration)

    # Use threading to fetch data in parallel from Redshift
    threads = []
    thread_errors = []
    results_lock = threading.Lock()

    def worker(offset):
        """
        Worker function to fetch a batch of records from Redshift.
        Args:
            offset: The offset for the SQL query LIMIT/OFFSET.
        """
        try:
            conn = psycopg2.connect(
                host=configuration["host"],
                dbname=configuration["database"],
                user=configuration["user"],
                password=configuration["password"],
                port=configuration["port"],
            )
            cur = conn.cursor()
            # Fetch a batch of records using LIMIT and OFFSET
            cur.execute(f"SELECT * FROM some_table LIMIT {__BATCH_SIZE} OFFSET {offset}")
            rows = cur.fetchall()
            for row in rows:
                # The 'upsert' operation is used to insert or update data in the destination table.
                # The first argument is the name of the destination table.
                # The second argument is a dictionary containing the record to be upserted.
                op.upsert(
                    table=__TABLE_NAME,
                    data={"id": str(row[0]), "data": row}
                )
            conn.close()
        except Exception as e:
            with results_lock:
                thread_errors.append(str(e))
                log.severe(f"Thread failed with error: {e}")

    num_threads = configuration["threads"]
    for i in range(num_threads):
        t = threading.Thread(target=worker, args=(i * __BATCH_SIZE,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    if thread_errors:
        raise RuntimeError(f"One or more threads failed: {thread_errors}")

    # Save the progress by checkpointing the state. This is important for ensuring that the sync process can resume
    # from the correct position in case of next sync or interruptions.
    # Learn more about how and where to checkpoint by reading our best practices documentation
    # (https://fivetran.com/docs/connectors/connector-sdk/best-practices#largedatasetrecommendation).
    state["last_sync"] = "now"
    op.checkpoint(state)


# Create the connector object using the schema and update functions
connector = Connector(update=update, schema=schema)

# Check if the script is being run as the main module.
# This is Python's standard entry method allowing your script to be run directly from the command line or IDE 'run' button.
# This is useful for debugging while you write your code. Note this method is not called by Fivetran when executing your connector in production.
# Please test using the Fivetran debug command prior to finalizing and deploying your connector.
if __name__ == "__main__":
    # Open the configuration.json file and load its contents
    with open("configuration.json", "r") as f:
        configuration = json.load(f)
    # Test the connector locally
    connector.debug()

Comment on lines +35 to +46
conn = psycopg2.connect(
    host=ctx.config.host,
    dbname=ctx.config.database,
    user=ctx.config.user,
    password=ctx.config.password,
    port=ctx.config.port,
)
cur = conn.cursor()
cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")
for row in cur.fetchall():
    records.write("redshift_table", {"id": row[0], "data": row})
conn.close()

Copilot AI Nov 20, 2025


Missing error handling and retry logic. Database queries can fail due to transient network issues, connection timeouts, or temporary database unavailability. There is no try-except block or retry mechanism with exponential backoff.

Example with retry logic:

import time

__MAX_RETRIES = 5

for attempt in range(__MAX_RETRIES):
    try:
        conn = psycopg2.connect(...)
        cursor = conn.cursor()
        cursor.execute(query)
        break
    except (psycopg2.OperationalError, psycopg2.InterfaceError) as e:
        if attempt == __MAX_RETRIES - 1:
            raise
        sleep_time = min(60, 2 ** attempt)
        log.warning(f"Retry {attempt + 1}/{__MAX_RETRIES} after {sleep_time}s: {e}")
        time.sleep(sleep_time)

    port=ctx.config.port,
)
cur = conn.cursor()
cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")

Copilot AI Nov 20, 2025


SQL injection vulnerability: The offset parameter is directly interpolated into the SQL query using an f-string without sanitization. While offset comes from controlled code in this example, this pattern is unsafe and violates security best practices.

Use parameterized queries instead:

cur.execute("SELECT * FROM some_table LIMIT %s OFFSET %s", (100, offset))
Suggested change
cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")
cur.execute("SELECT * FROM some_table LIMIT %s OFFSET %s", (100, offset))

for t in threads:
    t.join()

return ctx.update_state({"last_sync": "now"})

Copilot AI Nov 20, 2025


Missing checkpointing logic. State is only updated once at the end via ctx.update_state({"last_sync": "now"}), which means:

  1. If the connector fails mid-sync, all progress is lost
  2. The state value "now" is not meaningful for incremental syncs
  3. No progress tracking during the sync

The SDK requires checkpointing at appropriate intervals (e.g., after processing each batch) using op.checkpoint(state) with meaningful state values like last processed record ID or timestamp.
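
A minimal sketch of that batching-plus-checkpoint shape, assuming an ordered query and the op import from the corrected code above; sync_batches and the last_id state key are illustrative names, not part of the PR:

from fivetran_connector_sdk import Operations as op

__BATCH_SIZE = 1000

def sync_batches(cursor, state: dict):
    # cursor is assumed to be executing a query ordered by id.
    while True:
        rows = cursor.fetchmany(__BATCH_SIZE)
        if not rows:
            break
        for row in rows:
            op.upsert(table="redshift_table", data={"id": str(row[0]), "data": row})
            state["last_id"] = str(row[0])  # a meaningful progress marker, unlike "now"
        # Checkpoint after each batch so an interrupted sync resumes here instead of restarting.
        op.checkpoint(state)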

Comment on lines +34 to +46
def worker(offset):
    conn = psycopg2.connect(
        host=ctx.config.host,
        dbname=ctx.config.database,
        user=ctx.config.user,
        password=ctx.config.password,
        port=ctx.config.port,
    )
    cur = conn.cursor()
    cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")
    for row in cur.fetchall():
        records.write("redshift_table", {"id": row[0], "data": row})
    conn.close()

Copilot AI Nov 20, 2025


Missing logging. The SDK requires using from fivetran_connector_sdk import Logging as log and logging important events such as:

  • Connection establishment: log.info(f"Connected to Redshift at {host}:{port}")
  • Progress updates: log.info(f"Processing batch {batch_num}")
  • Retry attempts: log.warning(f"Retry {attempt}/{max_retries} due to: {error}")
  • Errors: log.severe(f"Failed to fetch data: {error}")

Never use Python's built-in logging module or print() statements.
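
In context, a hedged sketch of where those calls could sit inside the worker; the configuration dict mirrors the corrected code above, while the batch_num parameter is an illustrative assumption:

import psycopg2
from fivetran_connector_sdk import Logging as log

def worker(configuration: dict, offset: int, batch_num: int):
    try:
        # Connection parameters come from the validated configuration dict.
        conn = psycopg2.connect(
            host=configuration["host"],
            port=configuration["port"],
            dbname=configuration["database"],
            user=configuration["user"],
            password=configuration["password"],
        )
        log.info(f"Connected to Redshift at {configuration['host']}:{configuration['port']}")
        log.info(f"Processing batch {batch_num}")
        # ... fetch rows and hand them off for upserting ...
        conn.close()
    except Exception as e:
        log.severe(f"Failed to fetch data: {e}")
        raise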

)
cur = conn.cursor()
cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")
for row in cur.fetchall():

Copilot AI Nov 20, 2025


Memory safety violation: fetchall() loads the entire result set into memory, which can cause memory overflow with large datasets.

Use server-side cursors with fetchmany(batch_size) instead:

__BATCH_SIZE = 1000
cursor = connection.cursor(name='server_side_cursor')
cursor.execute(query)
while True:
    rows = cursor.fetchmany(__BATCH_SIZE)
    if not rows:
        break
    for row in rows:
        op.upsert(table, row_to_dict(row))
    op.checkpoint(state)

Comment on lines +1 to +57
"""
Redshift Multithreading connector for Fivetran Connector SDK.
Demonstrates threaded extraction from Amazon Redshift.
"""

import threading
import psycopg2
from fivetran_connector_sdk import connector, config, state, records, log, schema

CONFIG = config.Config(
host=config.StringField(),
port=config.IntegerField(default=5439),
database=config.StringField(),
user=config.StringField(),
password=config.SecretField(),
threads=config.IntegerField(default=4)
)

SCHEMA = schema.Schema(
name="redshift_table",
columns={
"id": schema.StringColumn(),
"data": schema.JSONColumn(),
}
)

@connector(
name="RedshiftMultithreadingConnector",
version="0.1.0",
config=CONFIG,
schema=SCHEMA,
)
def run_connector(ctx: state.Context):
def worker(offset):
conn = psycopg2.connect(
host=ctx.config.host,
dbname=ctx.config.database,
user=ctx.config.user,
password=ctx.config.password,
port=ctx.config.port,
)
cur = conn.cursor()
cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")
for row in cur.fetchall():
records.write("redshift_table", {"id": row[0], "data": row})
conn.close()

threads = []
for i in range(ctx.config.threads):
t = threading.Thread(target=worker, args=(i * 100,))
t.start()
threads.append(t)

for t in threads:
t.join()

return ctx.update_state({"last_sync": "now"})

Copilot AI Nov 20, 2025


Missing the required validate_configuration() function. This function must validate configuration values for correctness (e.g., port ranges, valid formats, required fields) and raise ValueError with descriptive error messages for invalid configurations.

Example:

def validate_configuration(configuration: dict):
    """
    Validate the configuration dictionary to ensure it contains all required parameters.
    Args:
        configuration: a dictionary that holds the configuration settings for the connector.
    Raises:
        ValueError: if any required configuration parameter is missing or invalid.
    """
    required_configs = ["host", "database", "user", "password"]
    for key in required_configs:
        if key not in configuration:
            raise ValueError(f"Missing required configuration value: {key}")
    
    port = configuration.get("port", 5439)
    if not isinstance(port, int) or port < 1 or port > 65535:
        raise ValueError(f"Invalid port number: {port}. Must be between 1 and 65535.")

This function should be called at the start of the update() function.

Comment on lines +35 to +46
conn = psycopg2.connect(
    host=ctx.config.host,
    dbname=ctx.config.database,
    user=ctx.config.user,
    password=ctx.config.password,
    port=ctx.config.port,
)
cur = conn.cursor()
cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")
for row in cur.fetchall():
    records.write("redshift_table", {"id": row[0], "data": row})
conn.close()

Copilot AI Nov 20, 2025


Resource leak: Database connections are not properly closed in case of exceptions. If an error occurs during query execution or data processing, the connection remains open.

Use context managers or try-finally blocks:

conn = None
try:
    conn = psycopg2.connect(...)
    cur = conn.cursor()
    # ... process data
finally:
    if conn:
        conn.close()

Or better, use psycopg2's context manager:

with psycopg2.connect(...) as conn:
    with conn.cursor() as cur:
        # ... process data
Suggested change
conn = psycopg2.connect(
    host=ctx.config.host,
    dbname=ctx.config.database,
    user=ctx.config.user,
    password=ctx.config.password,
    port=ctx.config.port,
)
cur = conn.cursor()
cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")
for row in cur.fetchall():
    records.write("redshift_table", {"id": row[0], "data": row})
conn.close()

# Use context managers to ensure the connection and cursor are always closed, even if an exception occurs.
with psycopg2.connect(
    host=ctx.config.host,
    dbname=ctx.config.database,
    user=ctx.config.user,
    password=ctx.config.password,
    port=ctx.config.port,
) as conn:
    with conn.cursor() as cur:
        cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")
        for row in cur.fetchall():
            records.write("redshift_table", {"id": row[0], "data": row})

    port=ctx.config.port,
)
cur = conn.cursor()
cur.execute(f"SELECT * FROM some_table LIMIT 100 OFFSET {offset}")

Copilot AI Nov 20, 2025


The pagination logic with fixed LIMIT/OFFSET is inefficient and can lead to data inconsistency issues:

  1. Performance: OFFSET becomes slower as the offset increases because the database must scan and skip all previous rows
  2. Data consistency: If rows are inserted/updated between thread executions, some data may be missed or duplicated
  3. Not incremental: This approach always reads all data, not just new/updated records

For proper incremental syncs, use:

  • A replication key (e.g., updated_at timestamp or auto-incrementing ID)
  • WHERE clause filtering: WHERE updated_at > last_synced_timestamp ORDER BY updated_at
  • Track progress in state: state["last_updated_at"]

See connectors/redshift/simple_redshift_connector/connector.py for a proper incremental sync pattern.
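
A hedged sketch of that incremental pattern, assuming an updated_at replication column on some_table; the cur and state arguments mirror the update function used elsewhere in this review, and incremental_sync is an illustrative name:

from fivetran_connector_sdk import Operations as op

def incremental_sync(cur, state: dict):
    # Resume from the last replication-key value seen; the epoch default covers the first sync.
    last_synced = state.get("last_updated_at", "1970-01-01 00:00:00")
    cur.execute(
        "SELECT id, data, updated_at FROM some_table WHERE updated_at > %s ORDER BY updated_at",
        (last_synced,),
    )
    while True:
        rows = cur.fetchmany(1000)
        if not rows:
            break
        for row in rows:
            op.upsert(table="redshift_table", data={"id": str(row[0]), "data": row[1]})
            state["last_updated_at"] = str(row[2])  # track the replication key, not "now"
        op.checkpoint(state)  # resume point if the sync is interrupted mid-way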


import threading
import psycopg2
from fivetran_connector_sdk import connector, config, state, records, log, schema

Copilot AI Nov 20, 2025


Import of 'log' is not used.

Suggested change
from fivetran_connector_sdk import connector, config, state, records, log, schema
from fivetran_connector_sdk import connector, config, state, records, schema

Collaborator

@fivetran-rishabhghosh fivetran-rishabhghosh left a comment


Please address the Copilot comments and add a PR description.

