===============================================================
 Subtitle overlays, hardware-accelerated decoding and playbin
===============================================================

Status: EARLY DRAFT / BRAINSTORMING

=== 1. Background ===

Subtitles can be muxed in containers or come from an external source.

Subtitles come in many shapes and colours. Usually they are either
text-based (incl. 'pango markup') or bitmap-based (e.g. DVD subtitles
and the most common form of DVB subs). Bitmap-based subtitles are
usually compressed in some way, typically with some form of run-length
encoding.

Subtitles are currently decoded and rendered in subtitle-format-specific
overlay elements. These elements have two sink pads (one for raw video
and one for the subtitle format in question) and one raw video source pad.

They will take care of synchronising the two input streams, and of
decoding and rendering the subtitles on top of the raw video stream.
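
For illustration, this is how such an overlay element is typically
wired up on the command line (a minimal sketch; the subtitle file name
is made up, and textoverlay is just one example of such an element):

  gst-launch-0.10 filesrc location=subtitles.srt ! subparse ! overlay. \
      videotestsrc ! textoverlay name=overlay ! xvimagesink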

Digression: one could theoretically have dedicated decoder/render elements
that output an AYUV or ARGB image, and then let a videomixer element do
the actual overlaying, but this is not very efficient, because it requires
us to allocate and blend whole pictures (1920x1080 AYUV = 8MB,
1280x720 AYUV = 3.6MB, 720x576 AYUV = 1.6MB) even if the overlay region
is only a small rectangle at the bottom. This wastes memory and CPU.
We could do something better by introducing a new format that only
encodes the region(s) of interest, but we don't have such a format yet,
and are not necessarily keen to rewrite this part of the playbin logic
at this point - and since we can't change existing elements' behaviour,
we would need to introduce new elements for this.

Playbin2 supports outputting compressed formats, i.e. it does not
force decoding to a raw format, but is happy to output in a non-raw
format as long as the sink supports that as well.

In the case of certain hardware-accelerated decoding APIs, we will make
use of that functionality. However, the decoder will then not output a
raw video format, but some kind of hardware-/API-specific format (in the
caps), and the buffers will reference hardware-/API-specific objects
that the hardware-/API-specific sink knows how to handle.


=== 2. The Problem ===

In the case of such hardware-accelerated decoding, the decoder will not
output raw pixels that can easily be manipulated. Instead, it will
output hardware/API-specific objects that can later be used to render
a frame using the same API.

Even if we could transform such a buffer into raw pixels, we most
likely would want to avoid that, in order to avoid the need to
map the data back into system memory (and then later back to the GPU).
It's much better to upload the much smaller encoded data to the GPU/DSP
and then leave it there until rendered.

Currently playbin only supports subtitles on top of raw decoded video.
It will try to find a suitable overlay element from the plugin registry
based on the input subtitle caps and the rank. (It is assumed that we
will be able to convert any raw video format into any format required
by the overlay using a converter such as videoconvert.)
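
A sketch of how such a registry lookup might be done using the factory
list API (assuming GStreamer >= 0.10.31; playbin's actual filtering
logic differs in its details, e.g. it also looks at the element klass):

  /* Find candidate elements whose sink pads can accept the subtitle
   * caps; sorting the result by rank is omitted here. */
  static GList *
  find_overlay_factories (const GstCaps * subtitle_caps)
  {
    GList *factories, *compatible;

    /* all element factories with at least marginal rank */
    factories = gst_element_factory_list_get_elements
        (GST_ELEMENT_FACTORY_TYPE_ANY, GST_RANK_MARGINAL);

    /* keep those with a sink pad template that can accept the
     * subtitle caps */
    compatible = gst_element_factory_list_filter (factories,
        subtitle_caps, GST_PAD_SINK, FALSE);

    gst_plugin_feature_list_free (factories);

    /* caller frees with gst_plugin_feature_list_free() */
    return compatible;
  }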

It will not render subtitles if the video sent to the sink is not
raw YUV or RGB, or if conversions have been disabled by setting the
native-video flag on playbin.

Subtitle rendering is considered an important feature. Enabling
hardware-accelerated decoding by default should not lead to a major
feature regression in this area.

This means that we need to support subtitle rendering on top of
non-raw video.


=== 3. Possible Solutions ===

The goal is to keep knowledge of the subtitle format within the
format-specific GStreamer plugins, and knowledge of any specific
video acceleration API within the GStreamer plugins implementing
that API. We do not want to make the pango/dvbsuboverlay/dvdspu/kate
plugins link to libva/libvdpau/etc., and we do not want to make
the vaapi/vdpau plugins link to all of libpango/libkate/libass etc.


Multiple possible solutions come to mind:

(a) backend-specific overlay elements

    e.g. vaapitextoverlay, vdpautextoverlay, vaapidvdspu, vdpaudvdspu,
    vaapidvbsuboverlay, vdpaudvbsuboverlay, etc.

    This assumes the overlay can be done directly on the backend-specific
    object passed around.

    The main drawback of this solution is that it leads to a lot of
    code duplication and may also lead to uncertainty about distributing
    certain duplicated pieces of code. The code duplication is pretty
    much unavoidable, since making textoverlay, dvbsuboverlay, dvdspu,
    kate, assrender, etc. available in the form of base classes to derive
    from is not really an option. Similarly, one would not really want
    the vaapi/vdpau plugin to depend on a bunch of other libraries
    such as libpango, libkate, libtiger, libass, etc.

    One could add some new kind of overlay plugin feature, though, in
    combination with a generic base class of some sort, but in order
    to accommodate all the different cases and formats one would end
    up with a quite convoluted/tricky API.

    (Of course there could also be a GstFancyVideoBuffer that provides
    an abstraction for such accelerated video objects and that could
    provide an API to add overlays to it in a generic way, but in the
    end this is just a less generic variant of (c), and it is not clear
    that there are real benefits to a specialised solution vs. a more
    generic one.)


(b) convert the backend-specific object to raw pixels and then overlay

    Even where technically possible, this is most likely very
    inefficient.


(c) attach the overlay data to the backend-specific video frame buffers
    in a generic way and do the actual overlaying/blitting later in
    backend-specific code such as the video sink (or an accelerated
    encoder/transcoder)

    In this case, the actual overlay rendering (i.e. the actual text
    rendering or decoding of DVD/DVB data into pixels) is done in the
    subtitle-format-specific GStreamer plugin. All knowledge about
    the subtitle format is then contained in the overlay plugin,
    and all knowledge about the video backend in the backend-specific
    plugin.

    The main question then is how to get the overlay pixels (and
    we will only deal with pixels here) from the overlay element
    to the video sink.

    This could be done in multiple ways: one could send custom
    events downstream with the overlay data, or one could attach
    the overlay data directly to the video buffers in some way.

    Sending inline events has the advantage that it is fairly
    transparent to any elements between the overlay element and
    the video sink: if an effects plugin creates a new video
    buffer for the output, nothing special needs to be done to
    maintain the subtitle overlay information, since the overlay
    data is not attached to the buffer. However, it slightly
    complicates things at the sink, which would also need to
    look for the new event in question instead of just processing
    everything in its buffer render function.
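
    A rough sketch of the event-based variant (the structure name and
    fields are made up for illustration, and GST_TYPE_VIDEO_OVERLAY_COMPOSITION
    assumes the composition object from section 4 has a registered GType):

      static gboolean
      send_overlay_composition_event (GstPad * srcpad,
          GstVideoOverlayComposition * comp)
      {
        GstStructure *s;

        /* wrap the refcounted composition in a custom downstream event */
        s = gst_structure_new ("application/x-gst-overlay-composition",
            "composition", GST_TYPE_VIDEO_OVERLAY_COMPOSITION, comp, NULL);

        return gst_pad_push_event (srcpad,
            gst_event_new_custom (GST_EVENT_CUSTOM_DOWNSTREAM, s));
      }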

    If one attaches the overlay data to the buffer directly, any
    element between overlay and video sink that creates a new
    video buffer would need to be aware of the overlay data
    attached to it and copy it over to the newly-created buffer.

    One would have to implement a special kind of new query
    (e.g. a FEATURE query) that is not passed on automatically by
    gst_pad_query_default(), in order to make sure that all elements
    downstream will handle the attached overlay data. (This is only
    a problem if we also want to attach overlay data to raw video
    pixel buffers; for new non-raw types we can just make support
    mandatory and be done with it, and for existing non-raw types
    nothing changes anyway if subtitles don't work. We need to
    maintain backwards compatibility for existing raw video
    pipelines such as: ... decoder ! suboverlay ! encoder ...)

    Even though it is slightly more work, attaching the overlay
    information to buffers seems more intuitive than sending it
    interleaved as events. And buffers stored or passed around
    (e.g. via the "last-buffer" property in the sink when doing
    screenshots via playbin) then always contain all the
    information needed.


(d) create a video/x-raw-*-delta format and use a backend-specific
    videomixer

    This possibility was hinted at already in the digression in
    section 1. It would satisfy the goal of keeping subtitle format
    knowledge in the subtitle plugins and video backend knowledge
    in the video backend plugin. It would also add a concept that
    might be generally useful (think ximagesrc capture with xdamage).
    However, it would require adding foorender variants of all the
    existing overlay elements, and changing playbin to that new
    design, which is somewhat intrusive. And given the general
    nature of such a new format/API, we would need to take a lot
    of care to be able to accommodate all possible use cases when
    designing the API, which makes it considerably more ambitious.
    Lastly, we would need to write videomixer variants for the
    various accelerated video backends as well.


Overall (c) appears to be the most promising solution. It is the least
intrusive and should be fairly straightforward to implement with
reasonable effort, requiring only small changes to existing elements
and no new elements.

Doing the final overlaying in the sink, as opposed to in a videomixer
or overlay element in the middle of the pipeline, has other advantages:

- if video frames need to be dropped, e.g. for QoS reasons,
  we could also skip the actual subtitle overlaying and
  possibly the decoding/rendering as well, if the
  implementation and API allow for that to be delayed.

- the sink often knows the actual size of the window/surface/screen
  the output video is rendered to. This *may* make it possible to
  render the overlay image at a higher resolution than the input
  video, solving a long-standing issue with pixelated subtitles on
  top of low-resolution videos that are then scaled up in the sink.
  This would of course require the rendering to be delayed instead
  of just attaching an AYUV/ARGB/RGBA blob of pixels to the video
  buffer in the overlay element, but that could all be supported.

- if the video backend / sink has support for high-quality text
  rendering (clutter?) we could just pass the text or pango markup
  to the sink and let it do the rest (this is unlikely to be
  supported in the general case - text and glyph rendering is
  hard; also, we don't really want to make up our own text markup
  system, and pango markup is probably too limited for complex
  karaoke stuff).


=== 4. API needed ===

(a) Representation of subtitle overlays to be rendered

    We need to pass the overlay pixels from the overlay element to the
    sink somehow. Whatever the exact mechanism, let's assume we pass
    a refcounted GstVideoOverlayComposition struct or object.

    A composition is made up of one or more overlays/rectangles.

    In the simplest case an overlay rectangle is just a blob of
    RGBA/ABGR [FIXME?] or AYUV pixels with positioning info and other
    metadata, and there is only one rectangle to render.

    We're keeping the naming generic ("OverlayFoo" rather than
    "SubtitleFoo") here, since this might also be handy for
    other use cases such as e.g. logo overlays. It is not
    designed for full-fledged video stream mixing though.

    /* Note: don't mind the exact implementation details, they'll be
     * hidden */

    /* FIXME: might be confusing in 0.11 though, since GstXOverlay was
     * renamed to GstVideoOverlay in 0.11; not much we can do about
     * that, maybe we can rename GstVideoOverlay to something better */

    struct GstVideoOverlayComposition
    {
      guint num_rectangles;
      GstVideoOverlayRectangle ** rectangles;

      /* lowest rectangle sequence number still used by the upstream
       * overlay element. This way a renderer maintaining some kind of
       * rectangles <-> surface cache can know when to free cached
       * surfaces/rectangles. */
      guint min_seq_num_used;

      /* sequence number for the composition (same series as rectangles) */
      guint seq_num;
    };

    struct GstVideoOverlayRectangle
    {
      /* Position on the video frame and dimensions of the output
       * rectangle, in output frame terms (already adjusted for the
       * PAR of the output frame). x/y can be negative (the overlay
       * will be clipped then). */
      gint x, y;
      guint render_width, render_height;

      /* Dimensions of the overlay pixels */
      guint width, height, stride;

      /* The PAR of the overlay pixels */
      guint par_n, par_d;

      /* Format of the pixels: GST_VIDEO_FORMAT_ARGB on big-endian
       * systems and BGRA on little-endian systems (i.e. pixels are
       * treated as 32-bit values with alpha always in the most
       * significant byte and blue in the least significant byte).
       *
       * FIXME: does anyone actually use AYUV in practice? (we do
       * in our utility function to blend on top of raw video)
       * What about AYUV and endianness? Do we always have [A][Y][U][V]
       * in memory? */
      /* FIXME: maybe use our own enum? */
      GstVideoFormat format;

      /* Refcounted blob of memory, no caps or timestamps */
      GstBuffer *pixels;

      /* FIXME: how to express a source like text or pango markup?
       * (just add a source type enum + a source buffer with the data)
       *
       * FOR 0.10: always send pixel blobs, but attach the source data
       * as well (reason: if downstream changes, we can't renegotiate
       * that properly if we just query the supported formats once at
       * the start). The sink will just ignore the pixels and use the
       * pango markup from the source data if it supports that.
       *
       * FOR 0.11: the overlay should query the formats (pango markup,
       * pixels) supported by downstream and then only send those. We
       * can renegotiate via the reconfigure event.
       */

      /* sequence number: useful for backends/renderers/sinks that want
       * to maintain a cache of rectangles <-> surfaces. The value of
       * min_seq_num_used in the composition tells the renderer which
       * rectangles have expired. */
      guint seq_num;

      /* FIXME: we also need a (private) way to cache converted/scaled
       * pixel blobs */
    };

(a1) Overlay consumer API:

    How this could work in a video sink that supports scaling of
    textures (pseudo-code):

      gst_foo_sink_render () {
        /* assume only one composition for now */
        if video_buffer has composition:
          composition = video_buffer.get_composition()

          for each rectangle in composition:
            if rectangle.source_data_type == PANGO_MARKUP:
              actor = text_from_pango_markup (rectangle.get_source_data())
            else:
              pixels = rectangle.get_pixels_unscaled (FORMAT_RGBA, ...)
              actor = texture_from_rgba (pixels, ...)

            ... position + scale on top of the video surface ...
      }

(a2) Overlay producer API:

    e.g. a logo or subpicture overlay: we have pixels and stuff them
    into a rectangle (pseudo-code, with buf being the current video
    buffer):

      if (logoverlay->cached_composition == NULL) {
        comp = composition_new ();

        rect = rectangle_new (format, pixels_buf,
            width, height, stride, par_n, par_d,
            x, y, render_width, render_height);

        /* composition adds its own ref for the rectangle */
        composition_add_rectangle (comp, rect);
        rectangle_unref (rect);

        /* buffer adds its own ref for the composition */
        video_buffer_attach_composition (buf, comp);

        /* we keep our own ref to the composition and save it for later */
        logoverlay->cached_composition = comp;
      } else {
        video_buffer_attach_composition (buf, logoverlay->cached_composition);
      }

    FIXME: also add some API to modify the render position/dimensions of
    a rectangle (probably requires creation of a new rectangle, unless
    we handle writability like with other mini objects).
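
    One possible shape for that API (a hypothetical helper in
    copy-on-write style, reusing the rectangle_new() pseudo-API from
    above):

      /* Returns a new rectangle that shares the pixel blob with the
       * original but has different render geometry. The new rectangle
       * gets a new sequence number (assigned by rectangle_new), so
       * renderers won't reuse cached scaled surfaces for it. */
      GstVideoOverlayRectangle *
      rectangle_copy_with_render_rectangle (GstVideoOverlayRectangle * r,
          gint x, gint y, guint render_width, guint render_height)
      {
        return rectangle_new (r->format, r->pixels,
            r->width, r->height, r->stride, r->par_n, r->par_d,
            x, y, render_width, render_height);
      }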

(b) Fallback overlay rendering/blitting on top of raw video

    Eventually we want to use this overlay mechanism not only for
    hardware-accelerated video, but also for plain old raw video,
    either at the sink or in the overlay element directly.

    Apart from the advantages listed earlier in section 3, this
    allows us to consolidate in one location a lot of the
    overlaying/blitting code that is currently repeated in every
    single overlay element. This makes it considerably easier to
    support a whole range of raw video formats out of the box, add
    SIMD-optimised rendering using ORC, or handle corner cases
    correctly.

    (Note: a side-effect of overlaying raw video at the video sink is
    that if e.g. a screenshotter gets the last buffer via the
    last-buffer property of basesink, it would get an image without
    the subtitles on top. This could probably be fixed by
    re-implementing the property in GstVideoSink though. Playbin2
    could handle this internally as well.)

    void
    gst_video_overlay_composition_blend (GstVideoOverlayComposition * comp,
                                         GstBuffer * video_buf)
    {
      g_return_if_fail (gst_buffer_is_writable (video_buf));
      g_return_if_fail (GST_BUFFER_CAPS (video_buf) != NULL);

      ... parse video_buf caps into BlendVideoFormatInfo ...

      for each rectangle in the composition: {

        if (gst_video_format_is_yuv (video_buf_format)) {
          overlay_format = FORMAT_AYUV;
        } else if (gst_video_format_is_rgb (video_buf_format)) {
          overlay_format = FORMAT_ARGB;
        } else {
          /* FIXME: grayscale? */
          return;
        }

        /* this will scale and convert AYUV<->ARGB if needed */
        pixels = rectangle_get_pixels_scaled (rectangle, overlay_format);

        ... clip output rectangle ...

        __do_blend (video_buf_format, video_buf->data,
            overlay_format, pixels->data,
            x, y, width, height, stride);

        gst_buffer_unref (pixels);
      }
    }


(c) Flatten all rectangles in a composition

    We cannot assume that the video backend API can handle any
    number of rectangle overlays; it is possible that it only
    supports one single overlay, in which case we need to squash
    all rectangles into one.

    However, we'll just declare this a corner case for now, and
    implement it only if someone actually needs it. It's easy
    to add later, API-wise. It might be a bit tricky if we have
    rectangles with different PARs/formats (e.g. subs and a logo),
    though we could probably always just use the code from (b)
    with a fully transparent video buffer to create a flattened
    overlay buffer, as sketched below.
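
    A sketch of that approach (helper names are hypothetical, reusing
    the pseudo-API from (a2) and the blending utility from (b)):

      /* Flatten a multi-rectangle composition into a single rectangle
       * by blending everything onto a fully transparent frame. */
      static GstVideoOverlayRectangle *
      flatten_composition (GstVideoOverlayComposition * comp,
          guint frame_width, guint frame_height)
      {
        GstVideoOverlayRectangle *rect;
        GstBuffer *canvas;

        /* transparent AYUV canvas the size of the video frame */
        canvas = alloc_transparent_ayuv_buffer (frame_width, frame_height);

        /* (b) handles scaling/conversion and clipping per rectangle */
        gst_video_overlay_composition_blend (comp, canvas);

        /* wrap the result in a single full-frame rectangle (1:1 PAR) */
        rect = rectangle_new (GST_VIDEO_FORMAT_AYUV, canvas,
            frame_width, frame_height, frame_width * 4, 1, 1,
            0, 0, frame_width, frame_height);
        gst_buffer_unref (canvas);
        return rect;
      }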

(d) core API: new FEATURE query

    For 0.10 we need to add a FEATURE query, so the overlay element
    can query whether the sink downstream and all elements between
    the overlay element and the sink support the new overlay API.
    Elements in between need to support it because the render
    positions and dimensions need to be updated if the video is
    cropped or rescaled, for example.

    In order to ensure that all elements support the new API,
    we need to drop the query in the pad default query handler
    (so it only succeeds if all elements handle it explicitly).

    We might want two variants of the feature query - one where
    all elements in the chain need to support it explicitly,
    and one where it's enough if some element downstream
    supports it.

    In 0.11 this could probably be handled via GstMeta and
    ALLOCATION queries (and/or we could simply require
    elements to be aware of this API from the start).

    There appears to be no issue with downstream possibly
    not being linked yet at the time when an overlay element
    would want to do such a query.
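
    A rough sketch of the 0.10 mechanics (the query name and structure
    fields are placeholders; real code would register the query type
    just once):

      static gboolean
      downstream_supports_overlay_composition (GstPad * srcpad)
      {
        GstQueryType feature_query_type;
        GstStructure *s;
        GstQuery *query;
        gboolean supported = FALSE;

        feature_query_type = gst_query_type_register ("feature",
            "Query features supported by all downstream elements");

        s = gst_structure_new ("feature",
            "feature", G_TYPE_STRING, "video-overlay-composition",
            "supported", G_TYPE_BOOLEAN, FALSE, NULL);
        query = gst_query_new_application (feature_query_type, s);

        /* the default query handler drops this query, so it only
         * succeeds if every element downstream handles it explicitly */
        if (gst_pad_peer_query (srcpad, query)) {
          gst_structure_get_boolean (gst_query_get_structure (query),
              "supported", &supported);
        }

        gst_query_unref (query);
        return supported;
      }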


Other considerations:

- renderers (overlays or sinks) may be able to handle only ARGB or only
  AYUV (for most graphics/hw APIs it's likely ARGB of some sort, while
  our blending utility functions will likely want the same colour space
  as the underlying raw video format, which is usually YUV of some
  sort). We need to convert where required, and should cache the
  conversion.

- renderers may or may not be able to scale the overlay. We need to
  do the scaling internally if not (simple case: just horizontal scaling
  to adjust for PAR differences; complex case: both horizontal and vertical
  scaling, e.g. if subs come from a different source than the video, or the
  video has been rescaled or cropped between overlay element and sink).

- renderers may be able to generate (possibly scaled) pixels on demand
  from the original data (e.g. a string or RLE-encoded data). We will
  ignore this for now, since this functionality can still be added later
  via API additions. The most interesting case would be to pass a pango
  markup string, since e.g. clutter can handle that natively.

- renderers may be able to write data directly on top of the video pixels
  (instead of creating an intermediate buffer with the overlay which is
  then blended on top of the actual video frame), e.g. dvdspu,
  dvbsuboverlay.

  However, in the interest of simplicity, we should probably ignore the
  fact that some elements can blend their overlays directly on top of the
  video (decoding/uncompressing them on the fly), even more so as it's
  not obvious that it's actually faster to decode the same overlay
  70-90 times, say (i.e. ca. 3 seconds of video frames), and then blend
  it 70-90 times, than to decode it once into a temporary buffer
  and then blend it directly from there, possibly SIMD-accelerated.
  Also, this is only relevant if the video is raw video and not some
  hardware-acceleration backend object.

  And ultimately it is the overlay element that decides whether to do
  the overlay right there and then or have the sink do it (if supported).
  It could decide to keep doing the overlay itself for raw video and
  only use our new API for non-raw video.

- renderers may want to make sure they only upload the overlay pixels
  once per rectangle if that rectangle recurs in subsequent frames (as
  part of the same composition or a different composition), as is
  likely. This caching of e.g. surfaces needs to be done renderer-side
  and can be accomplished based on the sequence numbers. The composition
  contains the lowest sequence number still in use upstream (an overlay
  element may want to cache created compositions+rectangles as well,
  after all, to re-use them for multiple frames); based on that, the
  renderer can expire cached objects, as sketched below. The caching
  needs to be done renderer-side because attaching renderer-specific
  objects to the rectangles won't work well given the refcounted nature
  of rectangles and compositions, which makes it unpredictable when a
  rectangle or composition will be freed and from which thread context
  it will be freed. The renderer-specific objects are likely bound to
  other types of renderer-specific contexts, and need to be managed in
  connection with those.
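
  For example, a renderer might keep a surface cache keyed by rectangle
  sequence number and expire it like this (a sketch; the cache layout
  and the BackendSurface type are made up):

    typedef struct {
      guint seq_num;
      BackendSurface *surface;   /* backend-specific surface object */
    } CacheEntry;

    /* entries older than min_seq_num_used can never be referenced
     * again by upstream, so they are safe to evict */
    static gboolean
    cache_entry_is_expired (gpointer key, gpointer value, gpointer data)
    {
      CacheEntry *entry = value;
      guint min_seq_num_used = GPOINTER_TO_UINT (data);

      return (entry->seq_num < min_seq_num_used);
    }

    static void
    expire_cached_surfaces (GHashTable * cache,
        GstVideoOverlayComposition * comp)
    {
      /* the hash table's value destroy notify frees the surfaces */
      g_hash_table_foreach_remove (cache, cache_entry_is_expired,
          GUINT_TO_POINTER (comp->min_seq_num_used));
    }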

- compositions/rectangles should internally provide a certain degree of
  thread-safety. Multiple elements (sinks, overlay element) might access
  or use the same objects from multiple threads at the same time, and it
  is expected that elements will keep a ref to compositions and
  rectangles they push downstream for a while, e.g. until the current
  subtitle composition expires.


=== 5. Future considerations ===

- alternatives: there may be multiple versions/variants of the same
  subtitle stream. On DVDs, there may be a 4:3 version and a 16:9
  version of the same subtitles. We could attach both variants and let
  the renderer pick the best one for the situation (currently we just
  use the 16:9 version). With totem, it is ultimately totem that adds
  the 'black bars' at the top/bottom, so totem also knows whether it has
  a 4:3 display and whether it can/wants to fit 4:3 subs (which may
  render on top of the bars) or not, for example.

=== 6. Misc. FIXMEs ===

TEST: these should look (roughly) alike (note the text distortion) -
needs fixing in textoverlay:

  gst-launch-0.10 \
    videotestsrc ! video/x-raw-yuv,width=640,height=480,pixel-aspect-ratio=1/1 ! textoverlay text=Hello font-desc=72 ! xvimagesink \
    videotestsrc ! video/x-raw-yuv,width=320,height=480,pixel-aspect-ratio=2/1 ! textoverlay text=Hello font-desc=72 ! xvimagesink \
    videotestsrc ! video/x-raw-yuv,width=640,height=240,pixel-aspect-ratio=1/2 ! textoverlay text=Hello font-desc=72 ! xvimagesink

~~~ THE END ~~~