Description
Reference: SMAlgo-314
Please fill out the form below.
System Information
- Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): Factorization Machine
- Framework Version: 1.33.0
- Python Version: 3.6; conda_python3 kernel
- CPU or GPU: CPU
- Python SDK Version: 1.33.0
- Are you using a custom image: the 'factorization-machines' image, used from a SageMaker Jupyter notebook.
Describe the problem
We are trying to produce recommendations with the SageMaker Factorization Machines algorithm. We feed the model a sparse matrix of 45,000 rows and 15,000 columns, and training completes successfully. The batch transform job then fails while the notebook is blocked in wait(). The exception points us to the logs, where the message is: "Unable to get response from algorithm".
Minimal repro / logs
Please provide any logs and a bare minimum reproducible test case, as this will be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
EXCEPTION OUTPUT:
ValueError Traceback (most recent call last)
in ()
13 print(datetime.datetime.now().time())
14
---> 15 fmTr.wait()
16 print(datetime.datetime.now().time())
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
205 def wait(self):
206 self._ensure_last_transform_job()
--> 207 self.latest_transform_job.wait()
208
209 def _ensure_last_transform_job(self):
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
304
305 def wait(self):
--> 306 self.sagemaker_session.wait_for_transform_job(self.job_name)
307
308 @staticmethod
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in wait_for_transform_job(self, job, poll)
1004 """
1005 desc = _wait_until(lambda: _transform_job_status(self.sagemaker_client, job), poll)
-> 1006 self._check_job_status(job, desc, "TransformJobStatus")
1007 return desc
1008
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
1026 reason = desc.get("FailureReason", "(No reason provided)")
1027 job_type = status_key_name.replace("JobStatus", " job")
-> 1028 raise ValueError("Error for {} {}: {} Reason: {}".format(job_type, job, status, reason))
1029
1030 def wait_for_endpoint(self, endpoint, poll=5):
ValueError: Error for Transform job factorization-machines-2019-08-01-09-40-45-581: Failed Reason: InternalServerError: We encountered an internal error. Please try again.
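For readers tracing where the final message comes from, here is a minimal sketch that mirrors the three lines of `_check_job_status` visible in the traceback. The function name `format_job_failure` and the hand-written `desc` dictionary are hypothetical; the real SDK method receives `desc` from the service and raises the string as a `ValueError` rather than returning it.

```python
def format_job_failure(job, desc, status_key_name):
    """Build the failure message the same way the traceback lines 1026-1028 do."""
    status = desc[status_key_name]
    # Fall back to a placeholder when the service gives no reason.
    reason = desc.get("FailureReason", "(No reason provided)")
    # "TransformJobStatus" -> "Transform job"
    job_type = status_key_name.replace("JobStatus", " job")
    return "Error for {} {}: {} Reason: {}".format(job_type, job, status, reason)


# Hypothetical job description, filled in with the values reported above.
desc = {
    "TransformJobStatus": "Failed",
    "FailureReason": "InternalServerError: We encountered an internal error. "
                     "Please try again.",
}
message = format_job_failure(
    "factorization-machines-2019-08-01-09-40-45-581", desc, "TransformJobStatus"
)
print(message)
```

So the `InternalServerError` text is simply the job's `FailureReason` field passed through; the more specific message only appears in the job logs.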
LOG MESSAGE :
2019-08-01T09:44:02.787:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
2019-08-01T09:45:48.275:[sagemaker logs]: (...bucket and key...)/BATCH_jobName.csv000.json: Unable to get response from algorithm
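Since the log reports MaxPayloadInMB=6 with BatchStrategy=MULTI_RECORD, one thing that may be worth checking is how large each input line is relative to that limit. The sketch below is only a rough estimate under several assumptions: the record shapes (a sparse `keys`/`shape`/`values` record and a dense `features` list) are assumed from the documented SageMaker inference formats and are not taken from our actual input file, the example keys and values are made up, and 6 MB is treated as 6 * 1024 * 1024 bytes. It does not establish the cause of the failure.

```python
import json

MAX_PAYLOAD_MB = 6   # from the "[sagemaker logs]" line above
N_COLUMNS = 15000    # from the problem description


def sparse_record(keys, values, n_columns=N_COLUMNS):
    # Assumed sparse JSON shape: only the non-zero columns are written out.
    return {"data": {"features": {"keys": keys,
                                  "shape": [n_columns],
                                  "values": values}}}


def dense_record(keys, values, n_columns=N_COLUMNS):
    # Assumed dense JSON shape: every column is written out, zeros included.
    row = [0.0] * n_columns
    for key, value in zip(keys, values):
        row[key] = value
    return {"features": row}


def line_size_bytes(record):
    # Size of one serialized line, plus one byte for the newline separator.
    return len(json.dumps(record).encode("utf-8")) + 1


# Made-up example row with three non-zero entries.
keys, values = [26, 182, 232], [1.0, 1.0, 4.0]

limit_bytes = MAX_PAYLOAD_MB * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
sparse_bytes = line_size_bytes(sparse_record(keys, values))
dense_bytes = line_size_bytes(dense_record(keys, values))

print("sparse line:", sparse_bytes, "bytes;", limit_bytes // sparse_bytes, "lines per payload")
print("dense line: ", dense_bytes, "bytes;", limit_bytes // dense_bytes, "lines per payload")
```

Running the same measurement on a few real lines of the input file would show whether a multi-record batch stays comfortably under the payload limit.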
- Exact command to reproduce:
fmTr = fm.transformer(
    instance_count=1,
    instance_type='ml.c4.xlarge',  # 'ml.m4.xlarge',
    strategy='MultiRecord',
    assemble_with='Line',
    output_path='s3://' + bucket + '/' + outputPath,
)
fmTr.transform(batch_input_s3, content_type='application/json', split_type='Line')
fmTr.wait()