- Creating from scratch via Java methods
- Creating from a serialized JSON representation
- Creating from sample data (inferring)
- Schema validation
- Writing a Schema to a File
You can build a Schema instance from scratch or modify an existing one:

```java
Schema schema = new Schema();

Field nameField = new StringField("name");
schema.addField(nameField);

Field coordinatesField = new GeopointField("coordinates");
schema.addField(coordinatesField);

System.out.println(schema.asJson());
// {"fields":[{"name":"name","format":"default","description":"","type":"string","title":""},{"name":"coordinates","format":"default","description":"","type":"geopoint","title":""}]}
```

You can also build a Schema instance with JSONObject instances instead of Field instances:
```java
Schema schema = new Schema(); // By default strict=false validation

JSONObject nameFieldJsonObject = new JSONObject();
nameFieldJsonObject.put("name", "name");
nameFieldJsonObject.put("type", Field.FIELD_TYPE_STRING);
schema.addField(nameFieldJsonObject);

// Because strict=false, an invalid Field definition will be included.
// The error will be logged/tracked in the error list schema.getErrors().
JSONObject invalidFieldJsonObject = new JSONObject();
invalidFieldJsonObject.put("name", "id");
invalidFieldJsonObject.put("type", Field.FIELD_TYPE_INTEGER);
invalidFieldJsonObject.put("format", "invalid");
schema.addField(invalidFieldJsonObject);

JSONObject coordinatesFieldJsonObject = new JSONObject();
coordinatesFieldJsonObject.put("name", "coordinates");
coordinatesFieldJsonObject.put("type", Field.FIELD_TYPE_GEOPOINT);
coordinatesFieldJsonObject.put("format", Field.FIELD_FORMAT_ARRAY);
schema.addField(coordinatesFieldJsonObject);

System.out.println(schema.asJson());
/*
{"fields":[
  {"name":"name","format":"default","type":"string"},
  {"name":"id","format":"invalid","type":"integer"},
  {"name":"coordinates","format":"array","type":"geopoint"}
]}
*/
```

When using the addField method, the schema undergoes validation after every field addition.
If adding a field causes the schema to fail validation, then the field is automatically removed.
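This validate-then-roll-back behavior can be illustrated with a small standalone sketch. Note that the class below is hypothetical and uses a deliberately simple validity rule (no duplicate field names); it is not the library's actual Schema implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of "validate after every addition, remove on failure".
// Hypothetical class; the real Schema class validates against Table Schema rules.
public class StrictSchemaSketch {
    private final List<String> fieldNames = new ArrayList<>();

    // Adds the field, re-validates the whole schema, and removes the field
    // again if the schema is no longer valid.
    public boolean addField(String name) {
        fieldNames.add(name);
        if (!isValid()) {
            fieldNames.remove(fieldNames.size() - 1); // roll back the addition
            return false;
        }
        return true;
    }

    // Toy validity rule for the sketch: field names must be unique.
    public boolean isValid() {
        return fieldNames.stream().distinct().count() == fieldNames.size();
    }

    public List<String> getFieldNames() {
        return fieldNames;
    }
}
```

A second addField("name") call fails validation, so the duplicate is removed and the schema keeps only the first field.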
Alternatively, you might want to build your Schema by loading the schema definition from a JSON file:

```java
String schemaFilePath = "/path/to/schema/file/schema.json";
Schema schema = new Schema(schemaFilePath, true); // enforce validation with strict=true
```

If you don't have a schema for a CSV file and don't want to define one manually, you can auto-generate one:
```java
String csvData = "id,name,age\n1,John,30\n2,Jane,25\n3,Bob,35";
Schema schema = Schema.infer(csvData, StandardCharsets.UTF_8);
```

The type inference algorithm traverses all of the table's rows and attempts to cast every single value to the available types. Each successful cast increments a popularity score for the type in question; at the end, the highest-scoring type is chosen for each field.
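The scoring idea can be sketched in plain Java. This is an illustrative re-implementation, not the library's TypeInferrer code; the type names, cast order, and the simplification of counting only the narrowest successful cast are all assumptions of the sketch:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative popularity-score type inference for one column (not library code).
public class InferSketch {
    // Try the narrowest cast first; return the first type that succeeds.
    static String castType(String value) {
        try { Integer.parseInt(value); return "integer"; } catch (NumberFormatException ignored) {}
        try { Double.parseDouble(value); return "number"; } catch (NumberFormatException ignored) {}
        return "string"; // every value casts to string
    }

    // Traverse all values, score each successful cast, return the best-scoring type.
    static String inferColumnType(List<String> columnValues) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (String value : columnValues) {
            scores.merge(castType(value), 1, Integer::sum);
        }
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("string");
    }

    public static void main(String[] args) {
        System.out.println(inferColumnType(List.of("1", "2", "3")));   // integer
        System.out.println(inferColumnType(List.of("a", "b", "1")));   // string
    }
}
```

A real inferrer is more careful (for example, an integer also casts to number, and dates and booleans are tried as well), but the traverse-score-pick-winner shape is the same.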
You can infer a schema from:
- a CSV file
- a URL pointing to a CSV file
- a String containing CSV data
- a JSON array node
- a String array containing multiple CSV data sets
- a File List
- a URL List
- mixed data types (CSV, JSON, etc.)
If you have more than one CSV file, you can use Schema.infer() to check that all files have the same schema:
```java
File testFile = getResourceFile("/testsuite-data/files/csv/1mb.csv");
File testFile2 = getResourceFile("/testsuite-data/files/csv/10mb.csv");
List<File> fileList = Arrays.asList(testFile, testFile2);
Schema schema = Schema.infer(fileList, StandardCharsets.UTF_8);
```

If the CSV files have different headers, the Schema.infer() call will throw an Exception because there is no common schema that can be inferred from the files.
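The failure mode can be illustrated with a minimal header-comparison sketch (hypothetical class and exception choice; the library's actual check differs):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: inferring one schema over multiple CSV sources requires identical headers.
public class CommonSchemaSketch {
    static String[] commonHeaders(List<String[]> headersPerFile) {
        String[] first = headersPerFile.get(0);
        for (String[] headers : headersPerFile) {
            if (!Arrays.equals(first, headers)) {
                throw new IllegalStateException("Cannot infer a common schema: headers differ");
            }
        }
        return first;
    }
}
```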
If you want to infer a schema from a file and then work with the data, it can be helpful not to use the static Schema.infer() method, but to first create a Table instance and then infer the schema from it:
```java
URL url = new URL("https://raw.githubusercontent.com/frictionlessdata/tableschema-java/master" +
        "/src/test/resources/fixtures/data/simple_data.csv");
Table table = Table.fromSource(url);
Schema schema = table.inferSchema();
System.out.println(schema.asJson());
// {"fields":[{"name":"id","format":"","description":"","title":"","type":"integer","constraints":{}},{"name":"title","format":"","description":"","title":"","type":"string","constraints":{}}]}
```

When dealing with large tables, you might want to limit the number of rows that the inference algorithm processes:

```java
// Only process the first 25 rows for type inference.
Schema schema = table.inferSchema(25);
```

If List<Object[]> data and String[] headers are available, the schema can also be inferred via a Schema instance:
```java
JSONObject inferredSchema = schema.infer(data, headers);
```

A row limit can also be set:

```java
JSONObject inferredSchema = schema.infer(data, headers, 25);
```

Using an instance of Table or Schema to infer a schema invokes the same method on the TypeInferrer singleton:

```java
TypeInferrer.getInstance().infer(data, headers, 25);
```

To make sure a schema complies with the Table Schema specification, we can validate each custom schema against the official Table Schema schema:
```java
JSONObject schemaJsonObj = new JSONObject();
Field idField = new IntegerField("id");
schemaJsonObj.put("fields", new JSONArray());
schemaJsonObj.getJSONArray("fields").put(idField.asJson());

Schema schema = Schema.fromJson(schemaJsonObj.toString(), true);
System.out.println(schema.isValid());
// true
```

You can write a Schema into a JSON file:
```java
Schema schema = new Schema();

Field nameField = new StringField("name");
schema.addField(nameField);

Field coordinatesField = new GeopointField("coordinates");
schema.addField(coordinatesField);

schema.writeJson(new File("schema.json"));
```