
Sometimes one worker goes rogue and doesn't stop at the end, making my build fail #128

@pelletencate


I'm running ci-queue in a somewhat unusual setup. I've got a Ruby on Rails app with a long test suite, and I'm running 8 concurrent instances of RSpec on Heroku CI, all on the same dyno, which also runs Redis in-dyno.

I start the whole chain with the following script:

# Fall back to the in-dyno Redis when REDIS_URL is unset or empty.
export CI_QUEUE_URL="${REDIS_URL:-redis://127.0.0.1:6379}"

for num in $(seq 1 "$PARALLEL_COUNT"); do
  # Worker 1 uses DATABASE_URL as-is; workers 2..N append their number to it.
  RUBYOPT="-W0" \
  CI_NODE_INDEX=$((num - 1)) \
  DATABASE_URL=$([ "$num" -ne 1 ] && echo "$DATABASE_URL$num" || echo "$DATABASE_URL") \
  rspec-queue \
    --timeout 180 \
    --max-consecutive-failures 10 \
    --max-requeues 5 \
    --requeue-tolerance 5 "$@" &
done

# Block until every background worker has exited, then print the report.
wait
rspec-queue --report

Every now and then when this runs on Heroku CI, 7 of the 8 workers finish at about the same time, but the 8th seems to keep running until Heroku kills the build after 2 hours. A build that passes usually takes about 10-15 minutes. It's as if that worker doesn't understand that the queue is done.
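As a stopgap while the root cause is unknown, each worker could be given a hard upper bound so that a stuck one fails the build in minutes rather than hours. The following is only a sketch: `run_bounded` is a hypothetical helper, not part of ci-queue, and it assumes the coreutils `timeout` command is available on the dyno.

```shell
# run_bounded LIMIT_SECONDS CMD [ARGS...]
# Hypothetical helper: runs CMD under coreutils `timeout`.
# Returns CMD's own exit status, or 124 if the limit was reached and
# CMD was stopped with SIGTERM. If CMD ignores SIGTERM, SIGKILL
# follows 10 seconds later.
run_bounded() {
  limit="$1"
  shift
  timeout -k 10 "$limit" "$@"
}
```

Placing `timeout -k 10 1800` directly before `rspec-queue` in the loop would have the same effect; the 1800-second limit is an assumed value, chosen only because it is well above the usual 10-15 minutes. This does not explain why the worker keeps running, it only limits the damage.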

Randomized with seed 57482

.

Finished in 9 minutes 51 seconds (files took 10.22 seconds to load)

85 examples, 0 failures, 1 pending

Randomized with seed 15517

< THIS IS WHERE I EXPECT THE REPORT >

.......................................................................................................................-----> test command `bin/test` failed with signal: terminated

While there are 8 instances running, the word Finished only shows up 7 times in the log, so I assume the output above shows workers 6 and 7 finishing, and the dots that follow come from worker 8.
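The count of finished workers can be checked mechanically against a saved copy of the build log. This is a minimal sketch, assuming the log has been saved to a file and that each RSpec summary line starts with `Finished in ` at the beginning of a line, as in the excerpt above; `count_finished` is a hypothetical helper name.

```shell
# count_finished LOGFILE
# Prints the number of lines in LOGFILE that start with "Finished in ".
# `grep -c` exits non-zero when the count is 0, so `|| true` keeps the
# helper usable under `set -e`.
count_finished() {
  grep -c '^Finished in ' "$1" || true
}
```

Comparing the printed number with `$PARALLEL_COUNT` shows how many workers never reported a summary.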

I have no idea how to debug this further, but I'd love to add more details if someone can point me in the right direction.
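One way to gather more detail would be to stop guessing which worker printed which dots by giving every worker its own log file. The sketch below assumes nothing about ci-queue itself; `run_logged` and the `LOG_DIR` variable are hypothetical names introduced for illustration.

```shell
# run_logged INDEX CMD [ARGS...]
# Hypothetical helper: runs CMD with stdout and stderr redirected to
# worker-INDEX.log inside LOG_DIR (default: the current directory) and
# returns CMD's exit status.
run_logged() {
  log_file="${LOG_DIR:-.}/worker-$1.log"
  shift
  "$@" > "$log_file" 2>&1
}
```

Inside the loop this would wrap the `rspec-queue` invocation, with `$num` as the index. Since the output no longer reaches the build log directly, the files would have to be printed (for example with `cat`) after `wait` returns.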


A little more research

I've compared the output of a successful build and a failed build.

  1. A successful build had 8 workers finish with a total of 1216 examples, an average of 152 per worker.
  2. A failed build had 7 workers finish with a total of 1194 examples reported.
    • This means that by the time they finished, the 8th worker had processed only 22 examples, which is remarkably few compared to the average of about 170 processed by each of the other 7. (This could potentially corroborate your theory.) However:
    • After that, the 8th worker drew 130 dots, which I can only imagine refer to specs that had already passed in other workers.
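The figures above can be reproduced with shell integer arithmetic. This sketch assumes both builds ran the same 1216 examples, which is what the estimate of 22 for the 8th worker rests on; the variable names are chosen for illustration only.

```shell
successful_total=1216   # examples reported by all 8 workers, passing build
failed_total=1194       # examples reported by the 7 workers that finished

avg_ok=$((successful_total / 8))              # 152
avg_failed=$((failed_total / 7))              # 170 (integer division of 170.57...)
remaining=$((successful_total - failed_total)) # 22

echo "average per worker, passing build:         $avg_ok"
echo "average per finished worker, failed build: $avg_failed"
echo "examples not reported by those 7 workers:  $remaining"
```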
