Closed
Description
Describe the bug
Model monitor baseline processing job crashes on CSV files.
To reproduce
Start from a SageMaker notebook that follows the SageMaker tutorial for training XGBoost on MNIST, then add the Model Monitor tutorial to the same notebook.
Download mnist:
import pickle, gzip, urllib.request, json
import numpy as np
# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
Create dataset:
import numpy as np
import boto3
from sagemaker import get_execution_role
role = get_execution_role()
region = boto3.Session().region_name
bucket='myBucket' # Replace with your s3 bucket name
prefix = 'sagemaker/xgboost-mnist' # Used as part of the path in the bucket where you store data
def convert_data():
    data_partitions = [('train', train_set), ('validation', valid_set), ('test', test_set)]
    for data_partition_name, data_partition in data_partitions:
        print('{}: {} {}'.format(data_partition_name, data_partition[0].shape, data_partition[1].shape))
        labels = [t.tolist() for t in data_partition[1]]
        features = [t.tolist() for t in data_partition[0]]
        if data_partition_name != 'test':
            # Prepend the label as the first column
            examples = np.insert(features, 0, labels, axis=1)
        else:
            examples = features
        #print(examples[50000,:])
        # np.savetxt takes the format string through the fmt keyword
        np.savetxt('data.csv', examples, delimiter=',', fmt='%f')
        key = "{}/{}/examples.csv".format(prefix,data_partition_name)
        url = 's3://{}/{}'.format(bucket, key)
        boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_file('data.csv')
        print('Done writing to {}'.format(url))
convert_data()
Create baseline:
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat
my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)
my_default_monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri+'examples.csv', ## filename must have csv extension
    dataset_format=DatasetFormat.csv(header=False), ## mnist has no header
    output_s3_uri=baseline_results_uri,
    wait=True
)
Expected behavior
The processing job completes successfully and Model Monitor infers the float type.
Screenshots or logs
- If the '%f' format is not passed to np.savetxt, the processing job succeeds, but the inferred type is string because the values are written in scientific notation. See results in screenshot:
Note that dumping the dataset to CSV using pandas.DataFrame.to_csv() produces the same error.
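To check locally whether a CSV file contains values in scientific notation before handing it to the baseline job, something along these lines may help. This is only a sketch: the helper name count_scientific_cells and the sample strings are hypothetical and not part of the original report, and the regular expression is one reasonable definition of "scientific notation", not the rule Model Monitor itself applies.

```python
import csv
import io
import re

# Matches tokens such as 1.171875e-02 or 3E5 (an exponent part is required).
SCI_NOTATION = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)[eE][+-]?\d+$")


def count_scientific_cells(csv_text):
    """Return how many cells in the CSV text are written in scientific notation."""
    reader = csv.reader(io.StringIO(csv_text))
    return sum(
        1
        for row in reader
        for cell in row
        if SCI_NOTATION.match(cell.strip())
    )


# Hypothetical sample rows: one in exponent form, one in fixed-point form.
sample_exponent = "0.000000000000000000e+00,1.171875000000000000e-02\n"
sample_fixed = "0.000000,0.011719\n"

print(count_scientific_cells(sample_exponent))
print(count_scientific_cells(sample_fixed))
```

A non-zero count on a sample of the file suggests the columns could be inferred as string, as described above.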
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 1.55.3
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): Model Monitor processing job
- Framework version: 468650794304.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-model-monitor-analyzer
- Python version: 3.6.5
- CPU or GPU: CPU
- Custom Docker image (Y/N): N
- Running from SageMaker instance on a ml.m4.10xlarge
Additional context
I ran through the SageMaker tutorial training XGBoost on mnist, and I added the model monitor tutorial to the notebook.
Create dataset:
- had to add the .csv extension to the filename
- had to pass the '%f' format to np.savetxt to avoid writing in scientific notation; otherwise the inferred type is string in Model Monitor.
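The effect of the format string can be illustrated without NumPy. np.savetxt applies a printf-style format to each value, and its default is '%.18e', which is where the scientific notation comes from. The sketch below mimics that per-value formatting with plain Python; format_row is a hypothetical helper and the row values are made up for illustration.

```python
def format_row(values, fmt):
    """Join values with commas, applying a printf-style format to each one."""
    return ",".join(fmt % v for v in values)


# Hypothetical row of pixel-like values in [0, 1].
row = [0.0, 0.01171875, 1.0]

default_line = format_row(row, "%.18e")  # same format string np.savetxt uses by default
fixed_line = format_row(row, "%f")       # fixed-point, six decimal places

print(default_line)
print(fixed_line)
```

Note that '%f' rounds to six decimal places, so a value such as 0.01171875 is written as 0.011719; a wider format like '%.8f' may be preferable if that precision matters.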

