Model monitor baseline processing job crashes on CSV file #1652

@G-ecs

Describe the bug
Model monitor baseline processing job crashes on CSV files.

To reproduce
In a SageMaker notebook, I followed the SageMaker tutorial that trains XGBoost on MNIST, then added the Model Monitor tutorial to the same notebook.

Download MNIST:

import pickle, gzip, urllib.request, json
import numpy as np

# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

Create dataset:

import numpy as np
import boto3
from sagemaker import get_execution_role

role = get_execution_role()

region = boto3.Session().region_name

bucket='myBucket' # Replace with your s3 bucket name
prefix = 'sagemaker/xgboost-mnist' # Used as part of the path in the bucket where you store data

def convert_data():
    data_partitions = [('train', train_set), ('validation', valid_set), ('test', test_set)]
    for data_partition_name, data_partition in data_partitions:
        print('{}: {} {}'.format(data_partition_name, data_partition[0].shape, data_partition[1].shape))
        labels = [t.tolist() for t in data_partition[1]]
        features = [t.tolist() for t in data_partition[0]]
        
        if data_partition_name != 'test':
            examples = np.insert(features, 0, labels, axis=1)
        else:
            examples = features
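        # print(examples[50000, :])

        # np.savetxt takes the format string as `fmt` (there is no `format` keyword);
        # '%f' writes fixed-point values instead of scientific notation
        np.savetxt('data.csv', examples, delimiter=',', fmt='%f')
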
        key = "{}/{}/examples.csv".format(prefix,data_partition_name)
        url = 's3://{}/{}'.format(bucket, key)
        boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_file('data.csv')
        print('Done writing to {}'.format(url))
        
convert_data()

Create baseline:

from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

my_default_monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri+'examples.csv',            ## filename must have csv extension
    dataset_format=DatasetFormat.csv(header=False),            ## mnist has no header
    output_s3_uri=baseline_results_uri,
    wait=True
)
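
The snippet above uses baseline_data_uri and baseline_results_uri without defining them. A minimal sketch of how they could be built from the bucket and prefix used in convert_data(), assuming the baseline is the 'train' partition and that results go under a hypothetical 'baselining/results' prefix (both are assumptions, not taken from the original notebook):

```python
def build_s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI ending in '/'."""
    key = "/".join(p.strip("/") for p in parts if p)
    return "s3://{}/{}/".format(bucket, key)

bucket = "myBucket"                 # same placeholder as in the issue
prefix = "sagemaker/xgboost-mnist"  # same prefix as in the issue

# Assumption: the 'train' partition written by convert_data() is the baseline
baseline_data_uri = build_s3_uri(bucket, prefix, "train")
# Assumption: hypothetical output location for the baseline results
baseline_results_uri = build_s3_uri(bucket, prefix, "baselining/results")

print(baseline_data_uri + "examples.csv")
```

The trailing '/' matters only because the call concatenates baseline_data_uri with 'examples.csv' directly.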

Expected behavior
Successful Processing job with float inferred type in Model Monitor.

Screenshots or logs

  1. If `fmt='%f'` is not set, the processing job succeeds, but the inferred type is string because the values are written in scientific notation. See the results in the screenshot:

Screenshot 2020-06-26 at 13 26 07

  2. If `fmt='%f'` is set, the processing job fails:

Screenshot 2020-06-26 at 13 28 49

Note that dumping the dataset to CSV using pandas.DataFrame.to_csv() produces the same error.
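
To rule out a malformed file before submitting the job, a quick local check along these lines may help. This is only a sketch: check_numeric_csv is a hypothetical helper, and it does not reproduce the Model Monitor failure; it only verifies that every row has the same number of fields and that every field parses as a float.

```python
import csv
import io

def check_numeric_csv(text_stream):
    """Return (n_rows, n_cols) if all rows have equal width and all fields
    parse as float; raise ValueError otherwise."""
    n_cols = None
    n_rows = 0
    for line_no, row in enumerate(csv.reader(text_stream), start=1):
        if n_cols is None:
            n_cols = len(row)
        elif len(row) != n_cols:
            raise ValueError(
                "line {}: expected {} fields, got {}".format(line_no, n_cols, len(row))
            )
        for field in row:
            float(field)  # raises ValueError on a non-numeric field
        n_rows += 1
    return n_rows, n_cols

# In-memory stand-in for the first rows of data.csv
sample = io.StringIO("5.000000,0.000000,0.500000\n0.000000,1.000000,0.250000\n")
print(check_numeric_csv(sample))  # (2, 3)
```

For the real file, the same function can be called on `open('data.csv', newline='')`.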

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 1.55.3
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Model Monitor processing job
  • Framework version: 468650794304.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-model-monitor-analyzer
  • Python version: 3.6.5
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N
  • Running from a SageMaker instance on an ml.m4.10xlarge

Additional context
I ran through the SageMaker tutorial that trains XGBoost on MNIST and added the Model Monitor tutorial to the notebook.

Changes I made in the "create dataset" step:

  1. Had to add the .csv extension to the filename.
  2. Had to add fmt='%f' to np.savetxt to avoid writing values in scientific notation; otherwise the inferred type is string in Model Monitor.
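
For reference, the difference between the two output styles can be seen locally without SageMaker. The sketch below uses a tiny made-up array rather than MNIST; to_csv_text is a hypothetical helper. np.savetxt takes the format string through its `fmt` keyword, and its default is '%.18e', which is where the scientific notation comes from.

```python
import io
import numpy as np

def to_csv_text(array, **savetxt_kwargs):
    """Write `array` as comma-separated text and return it as a string."""
    buf = io.StringIO()
    np.savetxt(buf, array, delimiter=",", **savetxt_kwargs)
    return buf.getvalue()

values = np.array([[0.0, 0.5], [1.0, 0.25]])  # made-up data, not MNIST

default_text = to_csv_text(values)          # default fmt='%.18e'
fixed_text = to_csv_text(values, fmt="%f")  # fixed-point, 6 decimals

print(default_text.splitlines()[0])  # 0.000000000000000000e+00,5.000000000000000000e-01
print(fixed_text.splitlines()[0])    # 0.000000,0.500000
```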
