5 min read · Centrali Team

Turn Schemaless JSON Into a Typed Schema — Automatically

Run schema discovery on any schemaless collection to detect fields, infer types, and build a validated schema from real data — no manual property definition required.

Tutorial · Feature

TL;DR: Centrali's schema discovery scans your schemaless collections, detects every field across your records, infers types with confidence scores, and lets you accept suggestions to build a validated schema — no manual property definition required. Run it from the console's AI tab in one click.


Schemaless collections are great for ingesting unpredictable data. Webhook events from six different providers. API responses you don't fully control. CSV imports where the columns keep changing. You write whatever JSON shows up, and Centrali stores it.

But sooner or later, you want structure. You want the console to show typed columns instead of raw JSON. You want filters and Smart Queries to know that amount is a number and created is a date. You want validation to catch bad records before they land.

That's what schema discovery does. It reads your existing records, infers a schema from real data, and lets you review and accept it — field by field or all at once.

Prerequisites

Any schemaless collection with records works. For this walkthrough we'll use a collection called webhook-events-demo that holds 28 webhook events from Stripe, GitHub, Shopify, Twilio, Slack, and SendGrid — each with a different shape. The same flow applies to any schemaless collection, webhook-related or not.

Step 1: Open the AI Tab

Navigate to your collection in the console and click the AI tab. You'll see three sub-tabs: Validation, Insights, and Schema Discovery. Click Schema Discovery.

AI tab with Schema Discovery sub-tab selected, showing empty state with Scan Existing Records and Trigger Inference buttons

The page has three parts:

  • Schema Discovery Configuration — set the discovery mode (Strict, Schemaless, or Auto-Evolving) and the batch size for automatic inference
  • Action buttons — Scan Existing Records (analyze what's in the database) or Trigger Inference (process buffered records)
  • Schema Suggestions — the results. Empty until you run a scan

Step 2: Run a Scan

Click Scan Existing Records. This samples up to 50 records from your collection (configurable up to 200) and analyzes them for field names, types, and patterns.

The scan runs in the background. Within a few seconds, the summary cards update:

Summary cards showing Total 28, Pending 28, Accepted 0, Rejected 0, Buffer 0

  • Total — how many suggestions have been generated across all scans
  • Pending — suggestions waiting for your review
  • Accepted — suggestions you've approved (properties added to the collection)
  • Rejected — suggestions you've dismissed
  • Buffer — records queued for the next automatic inference batch
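Conceptually, a scan walks the sampled records and tallies every field it encounters. Here's a minimal sketch of that pass (an illustration of the idea, not Centrali's actual implementation):

```python
from collections import defaultdict

def scan_records(records, sample_limit=50):
    """Tally every top-level field across a sample of records."""
    field_counts = defaultdict(int)
    for record in records[:sample_limit]:
        for field in record:
            field_counts[field] += 1
    return dict(field_counts)

# Two webhook events with different shapes
events = [
    {"provider": "stripe", "amount": 2000, "customerId": "cus_123"},
    {"provider": "github", "repo": "acme/app", "tag": "v1.2.0"},
]
# Every unique field is discovered, even ones that appear only once
print(scan_records(events))
```

This is why fields that exist on only a handful of records still show up as suggestions: the scan counts occurrences rather than requiring a field to appear everywhere.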

Step 3: Review Suggestions

Below the summary cards, the Schema Suggestions table shows every field the scan discovered:

Suggestions table showing discovered fields across multiple providers — ip, prerelease, deliveryId, timestamp, sender, repo, tag, callStatus, direction, duration, customerId, livemode, eventId, amount, financialStatus, customerEmail — with inferred types, sample values, and confidence scores

Look at the variety in that table. callStatus from Twilio records. repo from GitHub. financialStatus from Shopify. customerId from Stripe. bounceReason from SendGrid. The scan found every unique field across every provider — 28 records, 28 distinct fields inferred.

Each suggestion includes:

| Column | What it tells you |
| --- | --- |
| Operation | New Field for fields not on the schema yet; Update Type if the inferred type differs from an existing property |
| Field | The field name as it appears in your records |
| Type | The inferred type — string, number, boolean, datetime, object, or array |
| Samples | Up to 3 example values from your actual data |
| Confidence | How confident the inference is, based on consistency across sampled records. Most will be 100% |
| Status | pending, accepted, or rejected |
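The Type column comes from looking at actual values. A rough sketch of how a single value might be mapped to one of those types (the real engine is more sophisticated, but the shape of the logic is similar — note the datetime check on ISO-8601-looking strings):

```python
from datetime import datetime

def infer_type(value):
    """Map a JSON value to one of the suggestion types."""
    if isinstance(value, bool):      # check bool before number: bool is a number subtype in Python
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if isinstance(value, str):
        try:                         # ISO-8601 strings are promoted to datetime
            datetime.fromisoformat(value.replace("Z", "+00:00"))
            return "datetime"
        except ValueError:
            return "string"
    return "string"

print(infer_type(19.99))                   # number
print(infer_type("2024-05-01T12:00:00Z"))  # datetime
print(infer_type({"id": 1}))               # object
```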

The confidence score is your friend. Fields that appear in every record with the same type hit 100%. Fields that only appear in a subset — like duration (only on call events) or bounceReason (only on email bounces) — still get picked up, often at 100% because when they do appear, they're consistent.
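One plausible way to compute such a score — an illustrative formula, not necessarily the exact one Centrali uses — is the share of a field's observed values that agree with its dominant type:

```python
from collections import Counter

def confidence(observed_types):
    """Fraction of observations matching the most common inferred type."""
    if not observed_types:
        return 0.0
    counts = Counter(observed_types)
    return counts.most_common(1)[0][1] / len(observed_types)

# duration only appears on call events, but is always a number there
print(confidence(["number", "number", "number"]))  # 1.0 -> shown as 100%
print(confidence(["number", "string"]))            # 0.5 -> mixed types lower the score
```

Under a definition like this, a field that appears in only three records still scores 100% as long as those three values share a type — which matches the behavior described above.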

You can filter the table by status using the toggle at the top right — useful when you've already accepted some and want to focus on what's left.

Step 4: Accept or Reject

For each suggestion, you have two choices:

  • Accept — adds the field as a typed property on the collection. It becomes a filterable, sortable column in the Records view and is available in Smart Queries.
  • Reject — dismisses the suggestion. The field still exists in your records' data, but it won't become a formal property.

To accept many at once, check the boxes on the suggestions you want, then click Accept All Selected. You'll get a confirmation dialog:

Accept 9 Suggestions confirmation modal with Cancel and Accept buttons

Confirm and the properties are added in one transaction.

Step 5: See the Result

After accepting suggestions, go back to the Schema tab. Your collection now has typed properties — all discovered from real data, not defined manually:

Schema tab showing 9 accepted properties with types: provider, currency, eventType, shopDomain, customerEmail, customerId, eventId, status — all strings — plus amount as number

The Records tab also updates — columns that were previously hidden in raw JSON now appear as first-class sortable, filterable fields:

Records view showing typed columns for provider, currency, eventType, shopDomain, customerEmail, customerId populated across records from all six providers

Stripe records show their customerId, Shopify records show their shopDomain and customerEmail, GitHub records show their eventType — all in the same typed grid. No coding, no SDK calls, no manual schema design.

The Three Schema Modes

Schema discovery works differently depending on your collection's mode:

| Mode | Behavior | When to use |
| --- | --- | --- |
| Schemaless | No validation. Any JSON accepted. Discovery suggests fields from what it finds. | Starting out. You don't know the shape yet. |
| Auto-evolving | Known fields are validated. Unknown fields are accepted and buffered for discovery. | Your schema is mostly stable, but new fields occasionally appear. |
| Strict | Full validation. Extra fields are rejected. | Schema is locked down. No surprises. |
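The three modes differ mainly in how they treat fields that aren't on the schema. A simplified sketch of that decision, assuming a schema represented as a map of field names to types (known-field type validation is omitted for brevity):

```python
def validate(record, schema, mode):
    """Return (accepted, buffered_fields) for a record under a given mode."""
    unknown = [f for f in record if f not in schema]
    if mode == "schemaless":
        return True, unknown            # accept anything; unknowns feed discovery
    if mode == "auto-evolving":
        return True, unknown            # known fields validated elsewhere; unknowns buffered
    if mode == "strict":
        return (len(unknown) == 0), []  # any extra field rejects the record
    raise ValueError(f"unknown mode: {mode}")

schema = {"provider": "string", "amount": "number"}
event = {"provider": "stripe", "amount": 2000, "livemode": True}

print(validate(event, schema, "auto-evolving"))  # (True, ['livemode'])
print(validate(event, schema, "strict"))         # (False, [])
```

The same record sails through auto-evolving mode (with livemode queued for discovery) but bounces off strict mode — which is exactly the trade-off the table above describes.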

The typical progression:

  1. Start schemaless — ingest freely while you figure out the data shape
  2. Run schema discovery — let the system learn from your real data
  3. Accept suggestions — build the schema from what's actually there
  4. Switch to auto-evolving — validate known fields, still accept new ones
  5. Eventually go strict — once the schema is stable and you want hard validation

You can change the mode at any time from the Schema Discovery Configuration section at the top of the AI tab. No data migration needed — existing records are unaffected.

Automatic vs. Manual Discovery

Schema discovery works two ways:

Automatic (buffer-based): In schemaless and auto-evolving modes, every new record is added to a buffer. Once the buffer reaches the batch size (default: 10 records), inference runs automatically and generates suggestions. You don't have to click anything — suggestions appear as data flows in.

Manual (scan-based): Click "Scan Existing Records" to analyze what's already in the collection. This is useful when you've imported data in bulk, or when you want to re-scan after changing the configuration. The scan samples up to 200 records in a single pass (50 by default).

Both paths produce the same suggestions table. The difference is timing — automatic runs continuously as data arrives, manual runs when you ask.
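The automatic path can be pictured as a simple buffer that flushes to the inference engine whenever it reaches the batch size. A sketch of that behavior (a hypothetical class, not the Centrali SDK):

```python
class DiscoveryBuffer:
    """Queue incoming records; run inference once batch_size is reached."""

    def __init__(self, batch_size=10):
        self.batch_size = batch_size
        self.buffer = []
        self.batches_inferred = 0

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self._infer(self.buffer)
            self.buffer = []        # buffer resets after each inference run

    def _infer(self, records):
        # the real engine would generate schema suggestions here
        self.batches_inferred += 1

buf = DiscoveryBuffer(batch_size=10)
for i in range(25):
    buf.add({"eventId": i})
print(buf.batches_inferred, len(buf.buffer))  # 2 batches run, 5 records still buffered
```

After 25 records with the default batch size of 10, inference has run twice and five records sit in the buffer awaiting the next batch — mirroring the Buffer count on the summary cards.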

Configuration

The Schema Discovery Configuration section lets you tune the inference engine:

| Setting | Default | What it controls |
| --- | --- | --- |
| Batch size | 10 | How many buffered records trigger automatic inference |
| Max keys per record | 100 | Upper limit on fields analyzed per record |
| Max nesting depth | 3 | How deep into nested objects the scan looks |
| Max string length sampled | 1,000 | Longest string value considered for type inference |

The defaults work well for most collections. Raise the batch size if you're receiving high-volume events and don't want inference running after every 10 records.
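To see what the nesting-depth setting bounds, here's a sketch of depth-limited field extraction, assuming nested fields are tracked as dotted paths (an illustration of the concept, not Centrali's internals):

```python
def extract_fields(record, max_depth=3, prefix="", depth=1):
    """Collect dotted field paths, stopping at max_depth levels."""
    fields = []
    for key, value in record.items():
        path = f"{prefix}{key}"
        fields.append(path)
        # recurse into nested objects only while under the depth limit
        if isinstance(value, dict) and depth < max_depth:
            fields.extend(extract_fields(value, max_depth, f"{path}.", depth + 1))
    return fields

event = {"customer": {"address": {"geo": {"lat": 1.0}}}}
print(extract_fields(event, max_depth=3))
# ['customer', 'customer.address', 'customer.address.geo'] -- geo's contents lie beyond depth 3
```

With the default depth of 3, deeply nested structures are truncated rather than exhaustively walked, which keeps scans fast on large, deeply nested payloads.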

What's Next

Start building with schema discovery

Building something with Centrali and want to share feedback about this feature?

Email feedback@centrali.io