Model monitor baseline processing job crashes on CSV file

**Describe the bug**
Model monitor baseline processing job crashes on CSV files.

**To reproduce**
In a SageMaker notebook, I followed the SageMaker [tutorial](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data-transform.html) that trains XGBoost on MNIST, then added the steps from the [model monitor](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-capture.html) tutorial to the same notebook.


`Download mnist`: 
```Python
import pickle, gzip, urllib.request, json
import numpy as np

# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
```
`create dataset`: 
```Python
import numpy as np
import boto3
from sagemaker import get_execution_role

role = get_execution_role()

region = boto3.Session().region_name

bucket='myBucket' # Replace with your s3 bucket name
prefix = 'sagemaker/xgboost-mnist' # Used as part of the path in the bucket where you store data

def convert_data():
    data_partitions = [('train', train_set), ('validation', valid_set), ('test', test_set)]
    for data_partition_name, data_partition in data_partitions:
        print('{}: {} {}'.format(data_partition_name, data_partition[0].shape, data_partition[1].shape))
        labels = [t.tolist() for t in data_partition[1]]
        features = [t.tolist() for t in data_partition[0]]
        
        if data_partition_name != 'test':
            examples = np.insert(features, 0, labels, axis=1)
        else:
            examples = features
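        # Write fixed-point values; np.savetxt's keyword is `fmt` (there is no `format`
        # argument), and its default '%.18e' produces scientific notation.
        np.savetxt('data.csv', examples, delimiter=',', fmt='%f')
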
        key = "{}/{}/examples.csv".format(prefix,data_partition_name)
        url = 's3://{}/{}'.format(bucket, key)
        boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_file('data.csv')
        print('Done writing to {}'.format(url))
        
convert_data()
```
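Before uploading, it may help to look at what actually ended up in `data.csv`. The following is a minimal sketch of a local check, not part of the SageMaker SDK; `inspect_csv` and its return keys are hypothetical names introduced here. It reports the column counts seen in the first rows, whether any value is written in scientific notation, and whether every value parses as a float.

```python
import csv
import re

# Matches tokens such as 5.000000000000000000e-01 or 1E+00
SCI_NOTATION = re.compile(r'^[+-]?\d+(\.\d*)?[eE][+-]?\d+$')


def inspect_csv(path, max_rows=5):
    """Summarise the first `max_rows` rows of a CSV file (hypothetical helper)."""
    column_counts = set()
    has_scientific = False
    all_numeric = True
    with open(path, newline='') as f:
        for i, row in enumerate(csv.reader(f)):
            if i >= max_rows:
                break
            column_counts.add(len(row))
            for token in row:
                token = token.strip()
                if SCI_NOTATION.match(token):
                    has_scientific = True
                try:
                    float(token)
                except ValueError:
                    all_numeric = False
    return {
        'column_counts': sorted(column_counts),
        'has_scientific_notation': has_scientific,
        'all_numeric': all_numeric,
    }

# Example call, once data.csv exists locally:
# print(inspect_csv('data.csv'))
```

More than one entry in `column_counts` would mean the rows are ragged, which is worth ruling out before blaming the processing job.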

`create baseline`:
```Python
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

my_default_monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri+'examples.csv',            ## filename must have csv extension
    dataset_format=DatasetFormat.csv(header=False),            ## mnist has no header
    output_s3_uri=baseline_results_uri,
    wait=True
)
```
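The snippet above uses `baseline_data_uri` and `baseline_results_uri` without showing how they were defined. As a rough sketch only, they could be built from the `bucket` and `prefix` used earlier. The helper `s3_uri`, the choice of the `train` partition as the baseline, and the `baselining/results` path are all assumptions for illustration, not values taken from the notebook.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI with a trailing slash."""
    key = '/'.join(p.strip('/') for p in parts if p)
    return 's3://{}/{}/'.format(bucket, key)


bucket = 'myBucket'                 # same placeholder as in the dataset step
prefix = 'sagemaker/xgboost-mnist'  # same prefix as in the dataset step

# Assumption: the training partition serves as the baseline dataset.
baseline_data_uri = s3_uri(bucket, prefix, 'train')
# Assumption: hypothetical output location for the baseline results.
baseline_results_uri = s3_uri(bucket, prefix, 'baselining', 'results')

print(baseline_data_uri + 'examples.csv')
```

The trailing slash matters here because `suggest_baseline` is called with `baseline_data_uri+'examples.csv'`.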

**Expected behavior**
Successful Processing job with float inferred type in Model Monitor.

**Screenshots or logs**

1. If `fmt='%f'` is not passed to `np.savetxt`, the processing job succeeds, but the inferred type is `string`, apparently because the values are written in scientific notation. See the results in the screenshot:
<img width="565" alt="Screenshot 2020-06-26 at 13 26 07" src="https://user-images.githubusercontent.com/60920333/86157431-4b3df580-baff-11ea-888b-7081df168730.png">
2. If `fmt='%f'` is passed to `np.savetxt`, the processing job fails:
<img width="1072" alt="Screenshot 2020-06-26 at 13 28 49" src="https://user-images.githubusercontent.com/60920333/86157586-893b1980-baff-11ea-98f2-4e51b506a7dc.png">

Note that dumping the dataset to CSV using `pandas.DataFrame.to_csv()` produces the same error.
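For reference, the difference between the two cases above comes down to how each value is rendered as text. `np.savetxt` defaults to `fmt='%.18e'` and applies the format with printf-style `%` formatting, so the effect can be sketched without NumPy. The sample values below are made up for illustration.

```python
# np.savetxt's documented default format string
DEFAULT_SAVETXT_FMT = '%.18e'

# Illustrative row: a label followed by two pixel intensities
row = [5.0, 0.0, 0.5]

scientific_line = ','.join(DEFAULT_SAVETXT_FMT % v for v in row)
fixed_line = ','.join('%f' % v for v in row)

print(scientific_line)
print(fixed_line)
```

The first line uses exponents for every value (including the label), while the second uses plain decimals with six fractional digits.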

**System information**
A description of your system. Please provide:
- **SageMaker Python SDK version**: 1.55.3
- **Framework name (eg. PyTorch) or algorithm (eg. KMeans)**: Model Monitor processing job
- **Framework version**: 468650794304.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-model-monitor-analyzer
- **Python version**: 3.6.5
- **CPU or GPU**: CPU
- **Custom Docker image (Y/N)**: N
- Running from SageMaker instance on a `ml.m4.10xlarge`

**Additional context**
I ran through the SageMaker [tutorial](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data-transform.html) training XGBoost on mnist, and I added the [model monitor](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-capture.html) tutorial to the notebook.

`create dataset`: 

1. I had to add the `.csv` extension to the filename.
2. I had to add `fmt='%f'` to `np.savetxt` to avoid writing values in scientific notation; otherwise the inferred type in Model Monitor is `string`.

