Scene Timestamps
What it does
Use this when you already know the sequence of scenes in your video and need to place them on the narration timeline without manual scrubbing. Upload one narration audio file plus an ordered list of scenes, each with the required startCueText and anchorText, and the endpoint returns scene start times plus one recommended transition between each pair of consecutive scenes.
In practice, this helps you move from “we know what the scenes should be” to “we know roughly where each scene should start in the edit”. The result is easier timeline prep for editors, agents, and export tooling that need stable cut-point suggestions instead of raw transcript data.
How it works
Treat this endpoint as a timeline alignment step. You provide the scene plan, the API places that plan onto the narration audio and gives you machine-friendly timing data back.
The important part for a caller is that the returned scenes preserve the order of your metadata.scenes[]. The API does not invent missing scenes and it does not reshuffle them. Instead, it returns a startMs for each scene plus one transition recommendation between every adjacent pair.
If your narration starts later in a longer edit, send startOffsetMs and the returned timestamps are already shifted for that downstream timeline. If you also have the full narration text, pass it together with top-level languageCode to improve alignment robustness on names, jargon, or noisier speech.
Keep both cue fields short and scene-specific. startCueText is the strongest public hint for where a scene should already feel active. anchorText stays useful as the broader script-facing fallback when start-cue onset evidence is weaker.
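As a sketch, assembling the metadata form field in Python might look like the following. The helper name build_metadata is illustrative, not part of any official SDK; only the field names (startOffsetMs, languageCode, scenes, startCueText, anchorText, narrationText) come from the endpoint itself.

```python
import json

def build_metadata(scenes, start_offset_ms=0, language_code=None, narration_text=None):
    """Assemble the JSON string for the 'metadata' multipart field.
    Illustrative helper, not part of any official SDK."""
    for scene in scenes:
        # Both cue fields are required for every scene.
        if not scene.get("startCueText") or not scene.get("anchorText"):
            raise ValueError(f"scene {scene.get('id')!r} is missing a required cue field")
    metadata = {"startOffsetMs": start_offset_ms, "scenes": scenes}
    if language_code:
        metadata["languageCode"] = language_code
    if narration_text:
        metadata["narrationText"] = narration_text
    return json.dumps(metadata)

# Send the result as the 'metadata' form field alongside the audio file,
# exactly as in the cURL example.
payload = build_metadata(
    scenes=[{
        "id": "prep",
        "startCueText": "We start by laying out everything",
        "anchorText": "We lay out the tomatoes, basil, garlic, and olive oil.",
    }],
    start_offset_ms=2000,
    language_code="en",
)
```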
What comes back
- data.scenes[] includes every input scene, and each item gets a timeline startMs.
- data.transitions[] has length N - 1: one boundary for each adjacent scene pair.
- startOffsetMs is applied to the returned timestamps. Use it when your narration starts later in the full edit.
- No transcript is exposed. The public response contains only timeline outputs and request metadata.
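Assuming the response shape shown in the response excerpt further down, pairing each transition with its adjacent scenes might look like this (the function name is illustrative):

```python
def scene_boundaries(data):
    """Pair each of the N - 1 transitions with its adjacent scenes.
    Uses ordered array position (index) as the stable key."""
    scenes = data["scenes"]
    return [
        (scenes[t["fromSceneIndex"]]["id"],
         scenes[t["toSceneIndex"]]["id"],
         t["recommendedMs"])
        for t in data["transitions"]
    ]

# Values taken from the response excerpt in this document.
data = {
    "scenes": [
        {"index": 0, "id": "prep", "startMs": 2000},
        {"index": 1, "id": "cooking", "startMs": 16500},
        {"index": 2, "id": "plating", "startMs": 33200},
    ],
    "transitions": [
        {"index": 0, "fromSceneIndex": 0, "toSceneIndex": 1, "recommendedMs": 16500},
        {"index": 1, "fromSceneIndex": 1, "toSceneIndex": 2, "recommendedMs": 33200},
    ],
}
boundaries = scene_boundaries(data)
```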
Why use it?
- Find rough cut points without manual scrubbing. Good for narrated explainers, tutorials, and talking-head workflows.
- Generate machine-friendly timing output. Response structure is stable enough for editors, internal tools, and follow-up export steps.
- Keep your own scene plan. The API aligns the order you provide; it does not invent scenes or re-sequence them.
- Use confidence windows downstream. Wider intervals signal that a cut may need UI review or snapping logic.
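The last point can be sketched as a small triage rule. The thresholds below are arbitrary assumptions for illustration, not API guidance; tune them per workflow.

```python
def needs_review(transition, max_interval_ms=2000, min_score=0.8):
    """Flag a cut for manual review when the confidence window is wide
    or the score is low. Thresholds are illustrative."""
    conf = transition["confidence"]
    width = conf["intervalMs"]["to"] - conf["intervalMs"]["from"]
    return width > max_interval_ms or conf["score"] < min_score

# Confidence values from the response excerpt in this document.
t1 = {"confidence": {"score": 0.82, "intervalMs": {"from": 15800, "to": 17400}}}
t2 = {"confidence": {"score": 0.76, "intervalMs": {"from": 32400, "to": 34100}}}
```

With these thresholds, t1 (narrow window, higher score) would be auto-accepted while t2 (lower score) would be routed to review.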
Examples
cURL example
curl -X POST 'https://api.creatornode.io/postproduction/v1/scene-timestamps' \
-H 'X-API-Key: YOUR_KEY' \
-F 'audio=@cooking-narration.mp3' \
-F 'metadata={"startOffsetMs":2000,"languageCode":"en","scenes":[{"id":"prep","startCueText":"We start by laying out everything","anchorText":"We lay out the tomatoes, basil, garlic, and olive oil."},{"id":"cooking","startCueText":"Everything then hits the hot pan","anchorText":"Everything hits the hot pan with olive oil."},{"id":"plating","startCueText":"Finally, the dish is plated","anchorText":"The dish is plated and finished with herbs."}],"narrationText":"We start by laying out everything we need. Fresh tomatoes, basil, garlic, and olive oil are all ready on the board. Everything then hits the hot pan. Finally, the dish is plated and finished with herbs."}'

Response excerpt
{
"success": true,
"data": {
"startOffsetMs": 2000,
"scenes": [
{ "index": 0, "id": "prep", "label": "Scene 1", "startMs": 2000 },
{ "index": 1, "id": "cooking", "label": "Scene 2", "startMs": 16500 },
{ "index": 2, "id": "plating", "label": "Scene 3", "startMs": 33200 }
],
"transitions": [
{
"index": 0,
"fromSceneIndex": 0,
"toSceneIndex": 1,
"recommendedMs": 16500,
"confidence": { "score": 0.82, "intervalMs": { "from": 15800, "to": 17400 } }
},
{
"index": 1,
"fromSceneIndex": 1,
"toSceneIndex": 2,
"recommendedMs": 33200,
"confidence": { "score": 0.76, "intervalMs": { "from": 32400, "to": 34100 } }
}
]
},
"meta": {
"requestId": "req_scene_123",
"processingTimeMs": 842,
"tier": "premium",
"audio": { "mimeType": "audio/mpeg", "sizeBytes": 724680, "durationMs": 45300 }
}
}

Typical workflow
1. Call /v1/describe-scenes first if your workflow starts from images.
2. Take data.scenes[].anchorText and data.scenes[].startCueText from that response.
3. Send those cue fields here with the narration audio.
4. Use the returned startMs and transitions in your editor or export pipeline.

Tips & tricks
- Scene cue fields are required. This endpoint expects both startCueText and anchorText for every scene.
- Keep startCueText early. It should represent the earliest short narration phrase after which the scene should already feel active.
- Provide narrationText when available. The API still runs speech-to-text, but the extra script context improves alignment on names, jargon, and noisy audio.
- Set languageCode for non-English audio. The top-level hint can improve transcription accuracy.
- Make cue text distinct from scene to scene. Repetitive cues create more ambiguous boundaries.
- Treat index as canonical. Client IDs are echoed back, but ordered array position is still the stable key.
- Use confidence intervals, not only the single timestamp. They are the best signal for whether a cut should be auto-accepted or reviewed.
- See the interactive schema. Full OpenAPI reference: Scene Timestamps docs.
Cost & Limits
| Feature | Detail |
|---|---|
| Minimum charge | 13 credits for requests up to 5 minutes of audio and up to 10 scenes, including the first started audio block |
| Additional audio duration | +3 credits for each additional started 5-minute audio block after the first |
| Scene count | +1 credit per additional 10 scenes after the first 10 |
| Input format | multipart/form-data (metadata JSON + one audio file) |
| Best paired with | Describe Scenes when your workflow starts from images instead of scene text |
Tier Limits
| Limit | Free | Premium |
|---|---|---|
| Max audio size | 5 MB | 25 MB |
| Max audio duration | 5 min | 20 min |
| Max scenes | 10 | 100 |
| Max narration length | 5 000 chars | 50 000 chars |
| Max per-scene text field | 300 chars | 500 chars |
| Max total scene text | 4 000 chars | 40 000 chars |
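The per-scene and total text limits above can be checked client-side before spending credits. This is a minimal sketch using the limits from the table; the helper name and error messages are illustrative.

```python
# Tier limits copied from the table above; helper name is illustrative.
TIER_LIMITS = {
    "free":    {"max_scenes": 10,  "max_scene_field_chars": 300, "max_total_scene_chars": 4_000},
    "premium": {"max_scenes": 100, "max_scene_field_chars": 500, "max_total_scene_chars": 40_000},
}

def check_scene_text(scenes, tier="free"):
    """Return a list of tier-limit violations for the scene text fields."""
    limits = TIER_LIMITS[tier]
    problems = []
    if len(scenes) > limits["max_scenes"]:
        problems.append(f"too many scenes: {len(scenes)}")
    total_chars = 0
    for scene in scenes:
        for field in ("startCueText", "anchorText"):
            text = scene.get(field, "")
            total_chars += len(text)
            if len(text) > limits["max_scene_field_chars"]:
                problems.append(f"{scene.get('id')}:{field} exceeds "
                                f"{limits['max_scene_field_chars']} chars")
    if total_chars > limits["max_total_scene_chars"]:
        problems.append("total scene text over limit")
    return problems
```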
Pricing is driven by audio duration first, then scene count. The 13-credit minimum already includes the first started 5-minute audio block. Examples: 4m30s with 8 scenes costs 13 credits; 11 minutes with 8 scenes costs 19 credits; 11 minutes with 25 scenes costs 21 credits. Audio file size remains a limit, not a billing dimension.
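The pricing rules can be sketched as a small calculator, assuming "started block" means ceiling division (the function name is illustrative):

```python
import math

def estimate_credits(audio_ms, scene_count):
    """Estimate credit cost: 13-credit minimum covers the first started
    5-minute audio block and the first 10 scenes; +3 credits per additional
    started 5-minute block, +1 credit per additional started block of 10 scenes."""
    audio_blocks = max(1, math.ceil(audio_ms / (5 * 60 * 1000)))
    scene_blocks = max(1, math.ceil(scene_count / 10))
    return 13 + 3 * (audio_blocks - 1) + 1 * (scene_blocks - 1)
```

This reproduces the three worked examples above: 4m30s with 8 scenes is 13 credits, 11 minutes with 8 scenes is 19, and 11 minutes with 25 scenes is 21.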