Scene Timestamps
What it does
Use this when you already know the sequence of scenes in your video and need to place them on the narration timeline without manual scrubbing. Upload one narration audio file plus an ordered list of scenes, each with the required startCueText and anchorText, and the endpoint returns scene start times plus one recommended transition between each pair of consecutive scenes.
In practice, this helps you move from “we know what the scenes should be” to “we know roughly where each scene should start in the edit”. The result is easier timeline prep for editors, agents, and export tooling that need stable cut-point suggestions instead of raw transcript data.
How it works
Treat this endpoint as a timeline alignment step. You provide the scene plan, the API places that plan onto the narration audio and gives you machine-friendly timing data back.
The important part for a caller is that the returned scenes preserve the order of your metadata.scenes[]. The API does not invent missing scenes and it does not reshuffle them. Instead, it returns a startMs for each scene plus one transition recommendation between every adjacent pair.
If your narration starts later in a longer edit, send startOffsetMs and the returned timestamps are already shifted for that downstream timeline. If you also have the full narration text, pass it together with top-level languageCode to improve alignment robustness on names, jargon, or noisier speech.
Keep both cue fields short and scene-specific. startCueText is the strongest public hint for where a scene should already feel active. anchorText stays useful as the broader script-facing fallback when start-cue onset evidence is weaker.
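As a sketch, assembling the metadata form field in Python might look like the following. The helper name build_metadata is illustrative, not part of any official SDK; only the field names (startOffsetMs, languageCode, scenes, startCueText, anchorText, narrationText) come from the endpoint itself.

```python
import json

def build_metadata(scenes, start_offset_ms=0, language_code=None, narration_text=None):
    """Assemble the JSON string for the 'metadata' multipart field.
    Illustrative helper, not part of any official SDK."""
    for scene in scenes:
        # Both cue fields are required for every scene.
        if not scene.get("startCueText") or not scene.get("anchorText"):
            raise ValueError(f"scene {scene.get('id')!r} is missing a required cue field")
    metadata = {"startOffsetMs": start_offset_ms, "scenes": scenes}
    if language_code:
        metadata["languageCode"] = language_code
    if narration_text:
        metadata["narrationText"] = narration_text
    return json.dumps(metadata)

# Send the result as the 'metadata' form field alongside the audio file,
# exactly as in the cURL example.
payload = build_metadata(
    scenes=[{
        "id": "prep",
        "startCueText": "We start by laying out everything",
        "anchorText": "We lay out the tomatoes, basil, garlic, and olive oil.",
    }],
    start_offset_ms=2000,
    language_code="en",
)
```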
What comes back
- data.scenes[] includes every input scene, and each item gets a timeline startMs.
- data.transitions[] has length N - 1: one boundary for each adjacent scene pair.
- startOffsetMs is applied to the returned timestamps. Use it when your narration starts later in the full edit.
- No transcript is exposed. The public response contains only timeline outputs and request metadata.
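Assuming the response shape shown in the response excerpt further down, pairing each transition with its adjacent scenes might look like this (the function name is illustrative):

```python
def scene_boundaries(data):
    """Pair each of the N - 1 transitions with its adjacent scenes.
    Uses ordered array position (index) as the stable key."""
    scenes = data["scenes"]
    return [
        (scenes[t["fromSceneIndex"]]["id"],
         scenes[t["toSceneIndex"]]["id"],
         t["recommendedMs"])
        for t in data["transitions"]
    ]

# Values taken from the response excerpt in this document.
data = {
    "scenes": [
        {"index": 0, "id": "prep", "startMs": 2000},
        {"index": 1, "id": "cooking", "startMs": 16500},
        {"index": 2, "id": "plating", "startMs": 33200},
    ],
    "transitions": [
        {"index": 0, "fromSceneIndex": 0, "toSceneIndex": 1, "recommendedMs": 16500},
        {"index": 1, "fromSceneIndex": 1, "toSceneIndex": 2, "recommendedMs": 33200},
    ],
}
boundaries = scene_boundaries(data)
```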
Why use it?
- Find rough cut points without manual scrubbing. Good for narrated explainers, tutorials, and talking-head workflows.
- Generate machine-friendly timing output. Response structure is stable enough for editors, internal tools, and follow-up export steps.
- Keep your own scene plan. The API aligns the order you provide; it does not invent scenes or re-sequence them.
- Use confidence windows downstream. Wider intervals signal that a cut may need UI review or snapping logic.
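The last point can be sketched as a small triage rule. The thresholds below are arbitrary assumptions for illustration, not API guidance; tune them per workflow.

```python
def needs_review(transition, max_interval_ms=2000, min_score=0.8):
    """Flag a cut for manual review when the confidence window is wide
    or the score is low. Thresholds are illustrative."""
    conf = transition["confidence"]
    width = conf["intervalMs"]["to"] - conf["intervalMs"]["from"]
    return width > max_interval_ms or conf["score"] < min_score

# Confidence values from the response excerpt in this document.
t1 = {"confidence": {"score": 0.82, "intervalMs": {"from": 15800, "to": 17400}}}
t2 = {"confidence": {"score": 0.76, "intervalMs": {"from": 32400, "to": 34100}}}
```

With these thresholds, t1 (narrow window, higher score) would be auto-accepted while t2 (lower score) would be routed to review.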
Examples
cURL example
curl -X POST 'https://api.creatornode.io/postproduction/v1/scene-timestamps' \
-H 'X-API-Key: YOUR_KEY' \
-F 'audio=@cooking-narration.mp3' \
-F 'metadata={"startOffsetMs":2000,"languageCode":"en","scenes":[{"id":"prep","startCueText":"We start by laying out everything","anchorText":"We lay out the tomatoes, basil, garlic, and olive oil."},{"id":"cooking","startCueText":"Everything then hits the hot pan","anchorText":"Everything hits the hot pan with olive oil."},{"id":"plating","startCueText":"Finally, the dish is plated","anchorText":"The dish is plated and finished with herbs."}],"narrationText":"We start by laying out everything we need. Fresh tomatoes, basil, garlic, and olive oil are all ready on the board. Everything then hits the hot pan. Finally, the dish is plated and finished with herbs."}'

Response excerpt
{
"success": true,
"data": {
"startOffsetMs": 2000,
"scenes": [
{ "index": 0, "id": "prep", "label": "Scene 1", "startMs": 2000 },
{ "index": 1, "id": "cooking", "label": "Scene 2", "startMs": 16500 },
{ "index": 2, "id": "plating", "label": "Scene 3", "startMs": 33200 }
],
"transitions": [
{
"index": 0,
"fromSceneIndex": 0,
"toSceneIndex": 1,
"recommendedMs": 16500,
"confidence": { "score": 0.82, "intervalMs": { "from": 15800, "to": 17400 } }
},
{
"index": 1,
"fromSceneIndex": 1,
"toSceneIndex": 2,
"recommendedMs": 33200,
"confidence": { "score": 0.76, "intervalMs": { "from": 32400, "to": 34100 } }
}
]
},
"meta": {
"requestId": "req_scene_123",
"processingTimeMs": 842,
"tier": "premium",
"audio": { "mimeType": "audio/mpeg", "sizeBytes": 724680, "durationMs": 45300 }
}
}

Typical workflow
1. Call /v1/describe-scenes first if your workflow starts from images.
2. Take data.scenes[].anchorText and data.scenes[].startCueText from that response.
3. Send those cue fields here with the narration audio.
4. Use the returned startMs and transitions in your editor or export pipeline.

Tips & tricks
- Scene cue fields are required. This endpoint expects both startCueText and anchorText for every scene.
- Keep startCueText early. It should represent the earliest short narration phrase after which the scene should already feel active.
- Provide narrationText when available. The API still runs speech-to-text, but the extra script context improves alignment on names, jargon, and noisy audio.
- Set languageCode for non-English audio. The top-level hint can improve transcription accuracy.
- Make cue text distinct from scene to scene. Repetitive cues create more ambiguous boundaries.
- Treat index as canonical. Client IDs are echoed back, but ordered array position is still the stable key.
- Use confidence intervals, not only the single timestamp. They are the best signal for whether a cut should be auto-accepted or reviewed.
- See the interactive schema. Full OpenAPI reference: Scene Timestamps docs.
Cost & Limits
| Feature | Detail |
|---|---|
| Minimum charge | 13 credits for requests up to 5 minutes of audio and up to 10 scenes, including the first started audio block |
| Additional audio duration | +3 credits for each additional started 5-minute audio block after the first |
| Scene count | +1 credit per additional 10 scenes after the first 10 |
| Input format | multipart/form-data (metadata JSON + one audio file) |
| Best paired with | Describe Scenes when your workflow starts from images instead of scene text |
Tier Limits
| Limit | Free | Premium |
|---|---|---|
| Max audio size | 5 MB | 25 MB |
| Max audio duration | 5 min | 20 min |
| Max scenes | 10 | 100 |
| Max narration length | 5 000 chars | 50 000 chars |
| Max per-scene text field | 300 chars | 500 chars |
| Max total scene text | 4 000 chars | 40 000 chars |
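The per-scene and total text limits above can be checked client-side before spending credits. This is a minimal sketch using the limits from the table; the helper name and error messages are illustrative.

```python
# Tier limits copied from the table above; helper name is illustrative.
TIER_LIMITS = {
    "free":    {"max_scenes": 10,  "max_scene_field_chars": 300, "max_total_scene_chars": 4_000},
    "premium": {"max_scenes": 100, "max_scene_field_chars": 500, "max_total_scene_chars": 40_000},
}

def check_scene_text(scenes, tier="free"):
    """Return a list of tier-limit violations for the scene text fields."""
    limits = TIER_LIMITS[tier]
    problems = []
    if len(scenes) > limits["max_scenes"]:
        problems.append(f"too many scenes: {len(scenes)}")
    total_chars = 0
    for scene in scenes:
        for field in ("startCueText", "anchorText"):
            text = scene.get(field, "")
            total_chars += len(text)
            if len(text) > limits["max_scene_field_chars"]:
                problems.append(f"{scene.get('id')}:{field} exceeds "
                                f"{limits['max_scene_field_chars']} chars")
    if total_chars > limits["max_total_scene_chars"]:
        problems.append("total scene text over limit")
    return problems
```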
Pricing is driven by audio duration first, then scene count. The 13-credit minimum already includes the first started 5-minute audio block. Examples: 4m30s with 8 scenes costs 13 credits; 11 minutes with 8 scenes costs 19 credits; 11 minutes with 25 scenes costs 21 credits. Audio file size remains a limit, not a billing dimension.
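The pricing rules can be sketched as a small calculator, assuming "started block" means ceiling division (the function name is illustrative):

```python
import math

def estimate_credits(audio_ms, scene_count):
    """Estimate credit cost: 13-credit minimum covers the first started
    5-minute audio block and the first 10 scenes; +3 credits per additional
    started 5-minute block, +1 credit per additional started block of 10 scenes."""
    audio_blocks = max(1, math.ceil(audio_ms / (5 * 60 * 1000)))
    scene_blocks = max(1, math.ceil(scene_count / 10))
    return 13 + 3 * (audio_blocks - 1) + 1 * (scene_blocks - 1)
```

This reproduces the three worked examples above: 4m30s with 8 scenes is 13 credits, 11 minutes with 8 scenes is 19, and 11 minutes with 25 scenes is 21.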