Connection Cleanup #56

@type1fool

Problem

I have been using PHP ETL with the Parallel extension's functional API to ingest large CSV datasets. It has been reliable and fast, but several hundred database connections can remain open until the ingest completes. I looked through the source code for this ETL package and found no methods or documentation for removing connections.

For context, this is how the ingest script works:

  • Stream contents of the CSV file into a temporary $batch array
  • Once 5K rows have been loaded into the batch, push a task into $tasks array
  • At end of file, load remaining batch into a task
  • Each task runs ETL in a parallel process using a closure
    • Initialize new ETL instance
    • addConnection() for each target database
    • extract() & transform() the batch data
    • load() the batch into the DB
    • run() ETL
    • unset($etl)
    • return (also tried exit)
  • Await completion of all tasks
  • Move to next dataset...

Though I can't share any actual code, those closure bullet points are essentially what happens.
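As a rough illustration of the batching steps only (not the actual script), here is a minimal sketch. The helper name `readBatches` and the comma delimiter are assumptions made for the example. Dispatching each batch to the Parallel extension and the ETL calls inside the closure are left out.

```php
<?php
declare(strict_types=1);

/**
 * Yield arrays of at most $size rows read from an open CSV stream.
 * Hypothetical helper for illustration; it is not part of PHP ETL.
 *
 * @param resource $handle Stream opened with fopen()
 */
function readBatches($handle, int $size): Generator
{
    $batch = [];
    // Length 0 means "no line length limit" for fgetcsv().
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv() returns [null] for a blank line
        }
        $batch[] = $row;
        if (count($batch) === $size) {
            yield $batch; // a full batch is ready to become a task
            $batch = [];
        }
    }
    if ($batch !== []) {
        yield $batch; // remaining rows at end of file
    }
}
```

Each yielded batch would then be handed to one task, for example `foreach (readBatches($handle, 5000) as $batch) { ... }`, so only one batch is held in memory at a time on the producing side.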

Inside the ETL closure, I have tried exit and return after $etl->run() completes, and I have tried unsetting the ETL instance in the closure. Still, the processes and DB connections remain open.

Documentation for the Parallel extension could be more robust.

Request

The Manager class would benefit from a removeConnection or destroyConnection method that removes the connection from $connections. Would that terminate the PDO connection?
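To make the request concrete, here is a minimal sketch of what such a method could look like. `ConnectionRegistry` is a hypothetical stand-in, not the package's actual Manager class, and its method signatures are assumptions. As far as I understand PDO, a connection is closed only once the PDO object is destroyed, which requires that no references to it remain (including any PDOStatement objects created from it), so removing the entry would help only if nothing else still holds the connection.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the package's Manager; names and shape are assumptions.
final class ConnectionRegistry
{
    /** @var array<string, PDO> */
    private array $connections = [];

    public function addConnection(PDO $pdo, string $name = 'default'): void
    {
        $this->connections[$name] = $pdo;
    }

    /**
     * Drop the registry's reference to a connection.
     * Returns false when no connection is registered under $name.
     * The underlying connection closes only if this was the last reference.
     */
    public function removeConnection(string $name = 'default'): bool
    {
        if (!isset($this->connections[$name])) {
            return false;
        }
        unset($this->connections[$name]);
        return true;
    }
}
```

Inside a task closure, this would mean calling `removeConnection()` for each name after `run()`, and also unsetting any local variables that still point at the same PDO object.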

I'm happy to open a PR if this would work. I would also take advice on using persistent connections with ETL and Parallel.
