Generating Schemas from Examples with jtd-infer
JSON Type Definition, aka RFC 8927, is an easy-to-learn, standardized way to define a schema for JSON data. You can use JSON Typedef to portably validate data across programming languages, create dummy data, generate code, and more.
jtd-infer is a tool that can generate a JSON Typedef schema from example data.
It lives on GitHub.
This article covers why jtd-infer may be useful to you and how to install it,
then walks through examples of using jtd-infer on a few real-world datasets.
Why inferring schemas is useful
Inferring a schema from example data is useful because it makes it easier to
onboard onto JSON Typedef. If you have an existing system that works with JSON,
you can pass some of its JSON inputs or outputs into jtd-infer and get a
schema out. From there, you can start validating data against that
schema or generating data structures from it.
That means you can start seeing the value of JSON Typedef in your existing
system in just a few minutes.
jtd-infer can also be useful if you prefer a “code-first” approach,
where you first write your JSON data, and once you have an example of the data
in hand you start to write a schema. Some people find this easier to think about
than diving straight into writing a JSON Typedef schema.
Installing jtd-infer
If you’re on macOS, the easiest way to install jtd-infer is via Homebrew:
brew install jsontypedef/jsontypedef/jtd-infer
For all other platforms, you can install a prebuilt binary from the latest
release of jtd-infer.
Supported platforms are:
- Windows (x86_64-pc-windows-gnu.zip)
- macOS (x86_64-apple-darwin.zip)
- Linux (x86_64-unknown-linux-gnu.zip)
Finally, you can also install jtd-infer from source. See the jtd-infer repo
for more information if you go this route.
Using jtd-infer
You can always run jtd-infer --help to get details, but at a high level here’s
how you usually use jtd-infer:
- First, you need example data to infer from. jtd-infer will work on any sequence of JSON messages, so you can use any JSON file that contains one or more JSON messages in sequence. If your JSON messages are in multiple files, a useful trick is that cat dir/*.json will output every JSON file in dir concatenated together. That produces a sequence of JSON messages, so jtd-infer can work with it.
- Then, run jtd-infer on that data.
- If the output of the previous step could be improved by using an “enum”, a “values”, or a “discriminator” schema, give jtd-infer some additional hints using --enum-hint, --values-hint, or --discriminator-hint.
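To make “a sequence of JSON messages” concrete, here is a minimal Python sketch of reading several concatenated JSON values from one string, which is the shape of input that cat dir/*.json produces. It only illustrates the input format and is not how jtd-infer itself reads data. The function name iter_json_messages and the sample text are made up for this example.

```python
import json


def iter_json_messages(text):
    """Yield each JSON value from a string holding several concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[pos].isspace():
            pos += 1
            continue
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Hypothetical contents of two files, joined the way `cat` would join them.
concatenated = (
    '{"name": "Abraham Lincoln", "isAdmin": true}\n'
    '{"name": "William Sherman",\n "isAdmin": false}\n'
)
messages = list(iter_json_messages(concatenated))
print(len(messages))
```

Messages do not need to sit one per line. Any whitespace between complete JSON values is skipped.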
Let’s start with a very small example. Say you have this in a file called
in.json:
{ "name": "Abraham Lincoln", "isAdmin": true }
If you run jtd-infer in.json, you will get this output:
{
"properties": {
"name": { "type": "string" },
"isAdmin": { "type": "boolean" }
}
}
If some properties in in.json aren’t present in every message, jtd-infer will
guess that those properties are optional. So if in.json instead contained:
{ "name": "Abraham Lincoln", "isAdmin": true }
{ "name": "William Sherman", "isAdmin": false, "middleName": "Tecumseh" }
Then jtd-infer in.json will output (note: the result here is formatted for
clarity):
{
"properties": {
"isAdmin": {
"type": "boolean"
},
"name": {
"type": "string"
}
},
"optionalProperties": {
"middleName": {
"type": "string"
}
}
}
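The required-versus-optional split above can be pictured with a short Python sketch: a key that appears in every example is treated as required, and a key that appears in only some is treated as optional. This is an illustration of the idea under that assumption, not jtd-infer’s actual implementation, and split_properties is a hypothetical helper name.

```python
def split_properties(examples):
    """Split keys into those present in every example and those present in only some.

    Assumes `examples` is a non-empty list of dicts.
    """
    key_sets = [set(example) for example in examples]
    required = set.intersection(*key_sets)
    seen = set.union(*key_sets)
    return sorted(required), sorted(seen - required)


examples = [
    {"name": "Abraham Lincoln", "isAdmin": True},
    {"name": "William Sherman", "isAdmin": False, "middleName": "Tecumseh"},
]
required, optional = split_properties(examples)
print(required, optional)
```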
Changing the default number type
By default, jtd-infer will guess the narrowest, most specific number type for
your data. For example, if you show jtd-infer numbers like 1, 2, and 28,
then it will guess you’re working with unsigned, 8-bit numbers:
echo "[1, 2, 28]" | jtd-infer
{ "elements": { "type": "uint8" } }
This may be undesirable behavior for you. There are two ways you can address this:
- Make your example data more representative of the range of numerical values you support. If you support negative or fractional numbers as input, try to make sure those appear in your examples. If there’s a maximum or minimum value input numbers can take on, try to include those too.
- Tell jtd-infer to guess a particular number type by default using --default-number-type.
All --default-number-type does is tell jtd-infer what number type to guess
when it sees a number. If your suggested default number type doesn’t fit with
the data, jtd-infer will fall back to guessing a number type on its own.
For instance, here’s how --default-number-type affects the previous example:
echo "[1, 2, 28]" | jtd-infer --default-number-type=int32
{ "elements": { "type": "int32" } }
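The “narrowest type, with an optional default” behavior can be sketched in Python as follows. This is a simplified model and not jtd-infer’s actual algorithm. The candidate order (unsigned before signed at each width) is an assumption chosen to agree with the uint8 result shown earlier, and the names INTEGER_RANGES, fits, and guess_number_type are hypothetical.

```python
# Integer types defined by JSON Typedef, narrowest first.
# Assumption: unsigned is tried before signed at each width.
INTEGER_RANGES = {
    "uint8": (0, 255),
    "int8": (-128, 127),
    "uint16": (0, 65535),
    "int16": (-32768, 32767),
    "uint32": (0, 4294967295),
    "int32": (-2147483648, 2147483647),
}


def fits(type_name, numbers):
    """Return True if every number can be represented by `type_name`."""
    if type_name in ("float32", "float64"):
        return True  # simplification: treat floats as accepting any number
    low, high = INTEGER_RANGES[type_name]
    return all(float(n).is_integer() and low <= n <= high for n in numbers)


def guess_number_type(numbers, default=None):
    """Use `default` if it fits the data, otherwise pick the narrowest fit."""
    if default is not None and fits(default, numbers):
        return default
    for type_name in INTEGER_RANGES:
        if fits(type_name, numbers):
            return type_name
    return "float64"


print(guess_number_type([1, 2, 28]))
print(guess_number_type([1, 2, 28], default="int32"))
print(guess_number_type([1, 2.5], default="int32"))
```

The last call shows the fallback: int32 cannot hold 2.5, so the default is ignored and the guess widens to float64.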
If you’re using jtd-infer to retroactively schema-ify the inputs to a
JavaScript-based application, it could make sense to tell jtd-infer to assume
everything is a float64, because JavaScript parses every JSON number as a
64-bit float. You would do that by passing --default-number-type=float64.
Giving jtd-infer hints
By default, jtd-infer will never output “enum”, “values”, or “discriminator”
schemas. This is done on purpose, to make jtd-infer’s behavior as predictable
as possible; rather than trying to guess if something is an enum versus just
some generic string, jtd-infer assumes everything is a string unless you tell
it it’s an enum.
To get jtd-infer to output enum, values, or discriminator schemas, you can use
--enum-hint, --values-hint, or --discriminator-hint. These next sections
will show you how to do that with some examples.
Using enum hints
Let’s use a real-world dataset. The Nobel Prize organization maintains an API of every Nobel prize. You can access that data by running:
curl http://api.nobelprize.org/v1/prize.json
You can run jtd-infer on this data like so:
curl http://api.nobelprize.org/v1/prize.json | jtd-infer
Which outputs this schema:
{
"properties": {
"prizes": {
"elements": {
"properties": {
"year": {
"type": "string"
},
"category": {
"type": "string"
}
},
"optionalProperties": {
"laureates": {
"elements": {
"properties": {
"motivation": {
"type": "string"
},
"id": {
"type": "string"
},
"firstname": {
"type": "string"
},
"share": {
"type": "string"
}
},
"optionalProperties": {
"surname": {
"type": "string"
}
}
}
},
"overallMotivation": {
"type": "string"
}
}
}
}
}
}
What’s of interest to us is the category property in that schema. As you may
already know, there are six categories of Nobel Prize; as a result, the
category property only ever takes on one of six values. This is a perfect
use-case for an enum. We can tell jtd-infer that the category property is an
enum, and it’ll do the rest for us:
curl http://api.nobelprize.org/v1/prize.json | jtd-infer --enum-hint=/prizes/-/category
The value we pass to --enum-hint is a path inside the example data. The “-” in
the path is a wildcard; it matches any property of an object or any element of
an array. When we run that command, we get this result:
{
"properties": {
"prizes": {
"elements": {
"properties": {
"year": {
"type": "string"
},
"category": {
"enum": [
"chemistry",
"economics",
"peace",
"physics",
"medicine",
"literature"
]
}
},
"optionalProperties": {
"overallMotivation": {
"type": "string"
},
"laureates": {
"elements": {
"properties": {
"firstname": {
"type": "string"
},
"id": {
"type": "string"
},
"motivation": {
"type": "string"
},
"share": {
"type": "string"
}
},
"optionalProperties": {
"surname": {
"type": "string"
}
}
}
}
}
}
}
}
}
Now category is an enum schema, and we can see all six Nobel Prize
categories in the schema.
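To see how a hint path with a “-” wildcard selects values, here is a small Python sketch that walks a path such as /prizes/-/category and collects every value it reaches. It is a simplified model of the path matching, not jtd-infer’s implementation: array elements are only reached through the wildcard, values_at is a hypothetical helper, and the sample data is made up in the same shape as the Nobel Prize response.

```python
def values_at(data, path):
    """Collect every value reached by `path`, where "-" matches any key or element.

    An empty path selects the root. Simplification: array elements are only
    reachable through the "-" wildcard.
    """
    segments = path.split("/")[1:] if path else []
    current = [data]
    for segment in segments:
        next_level = []
        for node in current:
            if isinstance(node, dict):
                if segment == "-":
                    next_level.extend(node.values())
                elif segment in node:
                    next_level.append(node[segment])
            elif isinstance(node, list) and segment == "-":
                next_level.extend(node)
        current = next_level
    return current


# Made-up sample in the same shape as the API response.
sample = {
    "prizes": [
        {"year": "2020", "category": "physics"},
        {"year": "2020", "category": "peace"},
        {"year": "2019", "category": "physics"},
    ]
}
enum_values = sorted(set(values_at(sample, "/prizes/-/category")))
print(enum_values)
```

The distinct values found at the hinted path are what end up listed in the enum schema.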
Using values hints
Like with the “enum” example above, let’s use a real-world dataset as an example. The British government maintains an API of every bank holiday in the UK. You can access it by running:
curl https://www.gov.uk/bank-holidays.json
You can run jtd-infer on this data like so:
curl https://www.gov.uk/bank-holidays.json | jtd-infer
Which outputs this schema:
{
"properties": {
"scotland": {
"properties": {
"events": {
"elements": {
"properties": {
"bunting": {
"type": "boolean"
},
"notes": {
"type": "string"
},
"date": {
"type": "string"
},
"title": {
"type": "string"
}
}
}
},
"division": {
"type": "string"
}
}
},
"england-and-wales": {
"properties": {
"events": {
"elements": {
"properties": {
"date": {
"type": "string"
},
"bunting": {
"type": "boolean"
},
"title": {
"type": "string"
},
"notes": {
"type": "string"
}
}
}
},
"division": {
"type": "string"
}
}
},
"northern-ireland": {
"properties": {
"division": {
"type": "string"
},
"events": {
"elements": {
"properties": {
"notes": {
"type": "string"
},
"date": {
"type": "string"
},
"title": {
"type": "string"
},
"bunting": {
"type": "boolean"
}
}
}
}
}
}
}
}
This output is correct, but it is also a bit verbose. The top-level keys in
the object are scotland, england-and-wales, and northern-ireland, and
their values all have the exact same schema. It could make more sense to say
that this data is really a map/dictionary from divisions of the UK to details
about that division, not a “struct”.
To do that, we can tell jtd-infer that it should use a “values” schema at the
top level. Here’s how you can do that:
curl https://www.gov.uk/bank-holidays.json | jtd-infer --values-hint=
We give --values-hint an empty string as a value. That’s a special-case
value; it tells jtd-infer that we’re talking about the root of the
data, not any property within the data. When we run that command, we get this
result:
{
"values": {
"properties": {
"events": {
"elements": {
"properties": {
"date": {
"type": "string"
},
"bunting": {
"type": "boolean"
},
"notes": {
"type": "string"
},
"title": {
"type": "string"
}
}
}
},
"division": {
"type": "string"
}
}
}
}
Which is a bit clearer, and makes it obvious that the top-level properties all have the same sort of data for their value.
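One rough way to decide whether an object is a good candidate for a values hint is to check whether all of its values have the same shape. The Python sketch below does that by comparing the key sets of the values. It is a heuristic for the reader to apply, not something jtd-infer does (jtd-infer only uses “values” when told to). The helper names and sample data are made up for illustration.

```python
def shape_of(value):
    """Describe a value by its sorted keys if it is an object, else by its type name."""
    if isinstance(value, dict):
        return tuple(sorted(value))
    return type(value).__name__


def looks_like_map(obj):
    """Return True if every value of `obj` has the same shape."""
    shapes = {shape_of(value) for value in obj.values()}
    return len(shapes) == 1


# Made-up sample in the same shape as the bank holidays response.
holidays = {
    "scotland": {"division": "scotland", "events": []},
    "england-and-wales": {"division": "england-and-wales", "events": []},
    "northern-ireland": {"division": "northern-ireland", "events": []},
}
person = {"name": "Abraham Lincoln", "isAdmin": True}

print(looks_like_map(holidays))
print(looks_like_map(person))
```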
Using discriminator hints
Discriminated unions are very common in “event log” JSON payloads, such as in an
activity feed or an event sourcing architecture.
Real-world examples include GitHub’s Events API and AWS’s EventBridge API.
However, most of these real-world examples are a bit too complex to serve as
good examples here. So in this section, we’ll use a made-up but realistic
example to illustrate how to use --discriminator-hint.
Let’s say you have an activity event log in your product, and every activity event is a JSON message. Examples of this message include:
{ "eventType": "USER_CREATED", "id": "users/123" }
{ "eventType": "USER_CREATED", "id": "users/456" }
{ "eventType": "USER_PAYMENT_PLAN_CHANGED", "id": "users/789", "plan": "PAID" }
{ "eventType": "USER_PAYMENT_PLAN_CHANGED", "id": "users/123", "plan": "FREE" }
{ "eventType": "USER_DELETED", "id": "users/456", "softDelete": false }
If you put those messages in a file called in.json and then ran
jtd-infer in.json, you’d get this output:
{
"properties": {
"id": {
"type": "string"
},
"eventType": {
"type": "string"
}
},
"optionalProperties": {
"plan": {
"type": "string"
},
"softDelete": {
"type": "boolean"
}
}
}
This is a decent first guess, but what we’d prefer is for jtd-infer to give us
a “discriminator” schema keyed off of eventType. That way, we can have a
schema that will let us be confident that for USER_PAYMENT_PLAN_CHANGED
events, plan will always be present.
To achieve this, we can give jtd-infer a hint: that the eventType property
is a discriminator property. So when we run:
jtd-infer in.json --discriminator-hint=/eventType
We get this output:
{
"discriminator": "eventType",
"mapping": {
"USER_CREATED": {
"properties": {
"id": {
"type": "string"
}
}
},
"USER_DELETED": {
"properties": {
"id": {
"type": "string"
},
"softDelete": {
"type": "boolean"
}
}
},
"USER_PAYMENT_PLAN_CHANGED": {
"properties": {
"plan": {
"type": "string"
},
"id": {
"type": "string"
}
}
}
}
}
Which is much more precise, and reveals more clearly what is really going on with these messages.
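The effect of a discriminator hint can be pictured as grouping the messages by the tag property and then inferring a separate schema per group. The Python sketch below does only the grouping and the required-key step, using the example events from this section. It is an illustration of the idea, not jtd-infer’s implementation, and the function names are hypothetical.

```python
from collections import defaultdict


def group_by_tag(messages, tag):
    """Group messages by the value of `tag`, dropping the tag itself from each message."""
    groups = defaultdict(list)
    for message in messages:
        rest = {key: value for key, value in message.items() if key != tag}
        groups[message[tag]].append(rest)
    return dict(groups)


def required_keys(examples):
    """Keys present in every example of a (non-empty) group."""
    return sorted(set.intersection(*(set(example) for example in examples)))


events = [
    {"eventType": "USER_CREATED", "id": "users/123"},
    {"eventType": "USER_CREATED", "id": "users/456"},
    {"eventType": "USER_PAYMENT_PLAN_CHANGED", "id": "users/789", "plan": "PAID"},
    {"eventType": "USER_PAYMENT_PLAN_CHANGED", "id": "users/123", "plan": "FREE"},
    {"eventType": "USER_DELETED", "id": "users/456", "softDelete": False},
]
mapping = {
    tag_value: required_keys(group)
    for tag_value, group in group_by_tag(events, "eventType").items()
}
print(mapping)
```

Because each event type is examined on its own, plan comes out as required for USER_PAYMENT_PLAN_CHANGED instead of optional across all messages.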
Providing multiple hints
Although the previous examples all used the hint arguments separately, you can also use them together, or give any hint multiple times. For example, this data:
{ "id": "123", "kind": "LEGACY", "status": "OK", "tags": {"foo": "bar" }}
{ "id": "456", "kind": "LEGACY", "status": "ERROR", "tags": {"baz": "quux" }}
{ "id": "789", "kind": "MODERN", "status": "OK", "tags": {"asdf": "hjkl" }}
Would, by default, lead jtd-infer to infer this schema:
{
"properties": {
"tags": {
"optionalProperties": {
"foo": {
"type": "string"
},
"asdf": {
"type": "string"
},
"baz": {
"type": "string"
}
}
},
"kind": {
"type": "string"
},
"status": {
"type": "string"
},
"id": {
"type": "string"
}
}
}
But if you wanted both kind and status to be treated as enums, and tags
to be treated as a map/dictionary, you could invoke jtd-infer as:
jtd-infer --enum-hint=/kind --enum-hint=/status --values-hint=/tags
And you’ll get this output instead:
{
"properties": {
"id": {
"type": "string"
},
"tags": {
"values": {
"type": "string"
}
},
"kind": {
"enum": ["LEGACY", "MODERN"]
},
"status": {
"enum": ["ERROR", "OK"]
}
}
}