| Hardware Acceleration in GStreamer 1.0 |
| -------------------------------------- |
| |
| Status : DRAFT |
| |
| |
| Preamble: |
| |
| This document serves to identify and define the various usages of |
| hardware-acceleration (hereafter hwaccel) in GStreamer 1.0, the |
| problems that arise and need to be solved, and a proposed API. |
| |
| |
| Out of scope: |
| |
| This document will initially limit itself to usage of hwaccel in the |
| field of video capture, processing and display due to their |
| complexity. |
| It is not excluded that some parts of the research could be |
| applicable to other fields (audio, text, generic media). |
| |
| This document will not cover how encoded data is parsed, nor how it |
| is fed to or obtained from the various hardware subsystems. |
| |
| |
| Overall Goal: |
| |
| Make the most of the underlying hardware features while at the same |
| time not introducing any noticeable overhead [0] and providing the |
| greatest possible flexibility of use-cases. |
| |
| |
| Secondary Goals: |
| |
| Avoid providing a system that only allows (efficient) usage of one |
| use-case and/or through a specific combination of elements. This is |
| contrary to the principles of GStreamer. |
| |
| Do not introduce any unneeded memory copies. |
| |
| Do not introduce any extra latency. |
| |
| Process data asynchronously wherever possible. |
| |
| |
| Terminology: |
| |
| Due to the limitations of the GStreamer 0.10 API, most of these |
| elements, especially sink elements, were named "non-raw video |
| elements". |
| In the rest of this document, we will no longer refer to them as |
| non-raw since they _do_ handle raw video and in GStreamer 1.0 it no |
| longer matters where the raw video is located or accessed. We will |
| prefer the term "hardware-accelerated video element". |
| |
| |
| Specificities: |
| |
| Hardware-accelerated elements differ from non-hwaccel elements in a |
| few ways: |
| |
| * They handle memory which, in the vast majority of cases, is |
| not accessible directly. |
| * The processing _can_ happen asynchronously. |
| * They _might_ be part of a GPU sub-system and therefore tightly |
| coupled to the display system. |
| |
| |
| Features handled: |
| |
| HW-accelerated elements can handle a variety of individual logical |
| features. These should, in the spirit of GStreamer, be controllable |
| in an individual fashion. |
| |
| * Video decoding and encoding |
| * Display |
| * Capture |
| * Scaling (Downscaling (preview), Upscaling (Super-resolution)) |
| * Deinterlacing (including inverse-telecine) |
| * Post-processing (Noise reduction, ...) |
| * Colorspace conversion |
| * Overlaying and compositing |
| |
| |
| Use-cases: |
| ---------- |
| |
| UC1 : HW-accelerated video decoding to counterpart sink |
| |
| Example : * VDPAU decoder to VDPAU sink |
| * libVA decoder to libVA sink |
| |
| In these situations, the HW-accelerated decoder and sink can use the |
| same API to communicate with each other and share data. |
| |
| There might be extra processing that can be applied before display |
| (deinterlacing, noise reduction, overlaying, ...) and that is |
| provided by the backing hardware. All these features should be |
| usable in a transparent fashion from GStreamer. |
| |
| They might also need to communicate/share a common context. |
| |
| |
| UC2 : HW-accelerated video decoding to different hwaccel sink |
| |
| Example : * VDPAU/libVA decoder to OpenGL-based sink |
| |
| The goal here is to end up with the decoded pictures as OpenGL |
| textures, which can then be used in an OpenGL scene (with all the |
| transformations one can do with those textures). |
| |
| GStreamer is responsible for: |
| 1) Filling the contents of those textures |
| 2) Informing the application when to use which texture at which time |
| (i.e. synchronization). |
| |
| How the textures are used is not the responsibility of GStreamer, |
| although a fallback could be possible (displaying the texture in a |
| specified X window, for example) if the application does not handle |
| the OpenGL scene. |
| |
| Efficient usage is only possible if the HW-accelerated system |
| provides an API by which one can either: |
| * Be given OpenGL texture IDs for the decoder to decode into |
| * OR 'transform' hwaccel-backed buffers into texture IDs |
| |
| Just as for UC1, some information will need to be exchanged between |
| the OpenGL-backed elements and the other HW-accelerated element. |
| |
| |
| UC3 : HW-accelerated decoding to HW-accelerated encoding |
| |
| This is needed in cases where we want to re-encode a stream from one |
| format/profile to another format/profile, for example on UPnP/DLNA |
| embedded devices. |
| |
| If the encoder and decoder are using the same backing hardware, this |
| is similar to UC1. |
| |
| If the encoder and decoder are backed by 1) different hardware but |
| there is an API allowing communication between the two, OR 2) the |
| same hardware but through different APIs, this is similar to UC2. |
| |
| If the hardware backing the encoder and the decoder has no direct |
| means of communication, then a best effort must be made to |
| introduce only one copy. The recent ongoing improvements in the |
| kernel regarding DMA usage could help in that regard, allowing one |
| piece of hardware to be aware of another. |
| |
| |
| UC4 : HW-accelerated decoding to software plugin |
| |
| Examples : * Transcoding a stream using a software encoder |
| * Applying measurement/transformations |
| * Your crazy idea here |
| * ... |
| |
| While the most common usage of HW-accelerated decoding is for |
| display, we do not want to limit users of the GStreamer framework to |
| only be able to use those plugins in some limited use-cases. Users |
| should be able to benefit from the acceleration in any use-case. |
| |
| |
| UC5 : Software element to HW-accelerated display |
| |
| Examples : * Software decoder to VA/VDPAU/GL/.. sink |
| * Visualization to VA/VDPAU/GL/... sink |
| * anything in fact |
| |
| We need to ensure in these cases that any GStreamer plugin can |
| output data to a HW-accelerated display. |
| |
| This process must not introduce any unwanted synchronization issues, |
| meaning the transfer to the backing hardware needs to happen before |
| the synchronization time in the sinks. |
| |
| |
| UC6 : HW-accelerated capture to HW-accelerated encoder |
| |
| Examples : * Camerabin usage |
| * Streaming server |
| * Video-over-IP |
| * ... |
| |
| In order to provide not only low CPU usage (through HW-accelerated |
| encoding) but also low-latency, we need to be able to have capture |
| hardware provide the data to be encoded in such a way that the |
| encoder can read it without any copy. |
| |
| Some capture APIs provide means by which the hardware can be |
| provided with a pool of buffers backed by contiguous mmap'ed |
| memory. |
| |
| |
| UC6.1 : UC6 + simultaneous preview |
| |
| Examples : Camerabin usage (preview of video/photo while shooting) |
| |
| |
| |
| Problems: |
| --------- |
| |
| P1 : Ranking of decoders |
| |
| How do we pick the best decoder available ? Do we just set the |
| ranking of hardware-accelerated plugins to higher ranks ? |
| |
| |
| P2 : Capabilities of HW-accelerated decoders |
| |
| Hardware decoders can have much tighter constraints as to what they |
| can handle (limitations in sizes, bitrate, profile, level, |
| ...). |
| |
| These limitations might be known without probing the hardware, but |
| in most cases they require querying it. |
| As much information as possible about the stream to decode is |
| therefore needed. It can be obtained through parsers, by only |
| looking for a decoder once the parser has provided extensive caps. |
| |
| |
| P3 : Finding and auto-plugging the best elements |
| |
| Taking the case where several decoders are available and several |
| sink elements are available, how do we establish which is the best |
| combination ? |
| |
| Assuming we take the highest-ranked (and compatible) decoder, how do |
| we figure out which sink element is compatible ? |
| |
| Assuming the user/application selects a specific sink, how do we |
| figure out which is the best decoder to use ? |
| |
| /!\ Caps are no longer sufficient to establish compatibility |
| |
| |
| P4 : How to handle systems that require calls to happen in one thread |
| |
| In OpenGL (for example) calls can only be done from one thread, |
| which might not be a GStreamer thread (the sink could be controlled |
| from an application thread). |
| |
| How do we properly (and safely) handle buffers and contexts ? Do we |
| create an API that allows marshalling processing into the proper |
| thread (resulting in an asynchronous API from the GStreamer point of |
| view) ? |
| |
| |
| |
| Proposed Design: |
| ---------------- |
| |
| D1 : GstCaps |
| |
| We use the "video/x-raw" GstCaps. |
| |
| The format field and other required fields are filled in the same |
| way they would be for non-HW-accelerated streams. |
| |
| |
| D2 : Buffers and memory access |
| |
| The buffers used/provided/consumed by the various HW-accelerated |
| elements must be usable with non-HW-accelerated elements. |
| |
| To that end, the GstMemory backing the various buffers must be |
| accessible via the mapping methods and therefore have the proper |
| GstAllocator implementation if so required. |
| |
| In the unlikely event that the hardware does not provide any means |
| to map the memory, or that mapping is restricted (such as on DRM |
| systems), there should still be an implementation of |
| GstMemoryMapFunction that returns NULL (and a size/maxsize of zero) |
| when called. |
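| |
| The consequence for consumers is that a failed map must be treated |
| as a normal outcome. The following is a minimal sketch in plain C, |
| not GStreamer code: Memory, MapFunc, map_system, map_unmappable and |
| try_process are hypothetical names, and the map signature is |
| deliberately simplified compared to the real GstMemoryMapFunction. |

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for GstMemory and its map function. */
typedef struct _Memory Memory;
typedef void *(*MapFunc) (Memory * mem, size_t * size);

struct _Memory
{
  MapFunc map;
  void *data;
  size_t size;
};

/* Ordinary system memory: mapping always succeeds. */
static void *
map_system (Memory * mem, size_t * size)
{
  *size = mem->size;
  return mem->data;
}

/* Memory that cannot be exposed to the CPU (for example protected
 * content): the map function exists but reports failure. */
static void *
map_unmappable (Memory * mem, size_t * size)
{
  (void) mem;
  *size = 0;
  return NULL;
}

/* What a software element would do: returns the number of bytes it
 * could access, or 0 if the memory is not mappable. */
static size_t
try_process (Memory * mem)
{
  size_t size = 0;
  void *data;

  assert (mem != NULL && mem->map != NULL);

  data = mem->map (mem, &size);
  if (data == NULL)
    return 0;                   /* not an error: fall back or skip */

  return size;
}
```

| In this sketch a software element calling try_process() on |
| unmappable memory simply gets 0 back and can fall back gracefully |
| instead of crashing on a missing map implementation. |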
| |
| |
| D3 : GstVideoMeta |
| |
| In the same way that a custom GstAllocator is required, it is |
| important that elements implement the proper GstVideoMeta API |
| wherever applicable. |
| |
| The GstVideoMeta fields should correspond to the memory returned by |
| a call to gst_buffer_map() and/or gst_video_meta_map(). |
| |
| => gst_video_meta_{map|unmap}() needs to call the |
| GstVideoMeta->{map|unmap} implementations |
| |
| |
| D4 : Custom GstMeta |
| |
| In order to pass along API and/or hardware-specific information |
| regarding the various buffers, the elements will be able to create |
| custom GstMeta. |
| |
| Ex (For VDPAU): |
| |
| struct _GstVDPAUMeta { |
| GstMeta meta; |
| |
| VdpDevice device; |
| VdpVideoSurface surface; |
| ... |
| }; |
| |
| If an element supports multiple APIs for accessing/using the data |
| (like for example VDPAU and GLX), it should attach all the |
| applicable GstMeta. |
| |
| |
| D5 : Buffer pools |
| |
| In order to: |
| * avoid expensive cycles of buffer destruction/creation, |
| * allow upstream elements to end up with the optimal buffers/memory |
| to which to upload, |
| elements should implement GstBufferPools whenever possible. |
| |
| If the backing hardware has a system by which it differentiates used |
| buffers and available buffers, the bufferpool should have the proper |
| release_buffer() and acquire_buffer() implementations. |
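| |
| The used/available bookkeeping described above can be sketched as |
| follows. This is plain C under stated assumptions, not a |
| GstBufferPool subclass: Surface, SurfacePool, POOL_SIZE and the |
| pool_* functions are hypothetical, and the sketch is neither |
| thread-safe nor blocking. |

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

/* Hypothetical stand-in for a hardware surface handle. */
typedef struct
{
  int surface_id;
  int in_use;
} Surface;

typedef struct
{
  Surface surfaces[POOL_SIZE];
} SurfacePool;

/* Allocate all surfaces once, up front. */
static void
pool_init (SurfacePool * pool)
{
  int i;

  assert (pool != NULL);

  for (i = 0; i < POOL_SIZE; i++) {
    pool->surfaces[i].surface_id = i;
    pool->surfaces[i].in_use = 0;
  }
}

/* Hand out the first available surface, or NULL if all are in use. */
static Surface *
pool_acquire (SurfacePool * pool)
{
  int i;

  for (i = 0; i < POOL_SIZE; i++) {
    if (!pool->surfaces[i].in_use) {
      pool->surfaces[i].in_use = 1;
      return &pool->surfaces[i];
    }
  }
  return NULL;
}

/* Mark a surface as available again; nothing is destroyed. */
static void
pool_release (Surface * surface)
{
  assert (surface != NULL && surface->in_use);
  surface->in_use = 0;
}
```

| A real pool would implement the same logic in the acquire_buffer() |
| and release_buffer() virtual methods, and would typically wait for |
| a free buffer rather than return NULL. |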
| |
| |
| D6 : Ahead-of-time/asynchronous uploading |
| |
| In the case where the buffers to be displayed are not on the target |
| hardware, we need to ensure the buffers are uploaded before the |
| synchronization time. If data is uploaded at the render time we will |
| end up with an unknown render latency, resulting in bad A/V |
| synchronization. |
| |
| In order for this to happen, the buffers provided by downstream |
| elements should have a GstAllocator implementation allowing |
| uploading memory on _map(GST_MAP_WRITE). |
| |
| If this uploading happens asynchronously, the GstAllocator should |
| implement a system so that if an intermediary element wishes to map |
| the memory it can do so (either by providing a cached version of the |
| memory, or by using locks). |
| |
| |
| D7 : Overlay and positioning support |
| |
| FIXME : Move to a separate design doc |
| |
| struct _GstVideoCompositingMeta { |
| GstMeta meta; |
| |
| /* zorder : Depth Position of the layer in the final scene |
| * 0 = background |
| * G_MAXUINT = foreground |
| */ |
| guint zorder; |
| |
| /* x,y : Spatial position of the layer in the final scene |
| */ |
| guint x; |
| guint y; |
| |
| /* width/height : Target width/height of the layer in the |
| * final scene. |
| */ |
| guint width; |
| guint height; |
| |
| /* basewidth/baseheight : Reference scene width/height |
| * If both values are zero, the x/y/width/height values above |
| * are to be used as absolute coordinates, regardless of the |
| * final scene's width and height. |
| * If the values are non-zero, the x/y/width/height values |
| * above should be scaled based on those values. |
| * Ex : real x position = x * scene_width / basewidth |
| * (multiply before dividing to avoid integer truncation) |
| */ |
| guint basewidth; |
| guint baseheight; |
| |
| /* alpha : Global alpha multiplier |
| * 0.0 = completely transparent |
| * 1.0 = no modification of original transparency (or opacity) |
| */ |
| gdouble alpha; |
| }; |
| |
| |
| D8 : De-interlacing support |
| |
| FIXME : Move to a separate design doc |
| |
| For systems that can apply deinterlacing, the user needs to be in |
| control of whether it should be applied or not. |
| |
| This should be done through the usage of the deinterlace element. |
| |
| In order to benefit from the HW-acceleration, downstream/upstream |
| elements need a way by which they can indicate that the |
| deinterlacing process will be applied later. |
| |
| To this end, we introduce a new GstMeta : GstDeinterlaceMeta |
| |
| typedef const gchar *GstDeinterlaceMethod; |
| |
| struct _GstDeinterlaceMeta { |
| GstMeta meta; |
| |
| GstDeinterlaceMethod method; |
| }; |
| |
| |
| D9 : Context sharing |
| |
| Re-use parts of gst-plugins-bad's videocontext ? |
| |
| |
| D10 : Non-MT-safe APIs |
| |
| If the wrapped API/system does not offer an API which is MT-safe |
| and/or usable from more than one thread (like OpenGL), we need: |
| * A system by which a global context can be provided to all elements |
| wanting to use that system, |
| * A system by which elements can serialize processing to a 3rd party |
| thread. |
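| |
| The second point can be sketched with POSIX threads. This is one |
| possible shape under stated assumptions, not a proposed GStreamer |
| API: Dispatcher, CallFunc, the dispatcher_* functions and |
| record_thread are all hypothetical names. Calls are blocking and |
| handled one at a time, and a callback must not call |
| dispatcher_call() itself. |

```c
#include <assert.h>
#include <pthread.h>

typedef void (*CallFunc) (void *data);

typedef struct
{
  pthread_mutex_t lock;
  pthread_cond_t cond;
  pthread_t thread;             /* the only thread allowed to run calls */
  CallFunc func;                /* pending call, NULL when idle */
  void *data;
  unsigned long submitted;      /* number of calls posted so far */
  unsigned long completed;      /* number of calls already executed */
  int quit;
} Dispatcher;

static void *
dispatcher_loop (void *user_data)
{
  Dispatcher *d = user_data;

  pthread_mutex_lock (&d->lock);
  while (!d->quit) {
    if (d->func != NULL) {
      d->func (d->data);        /* always runs on the dispatcher thread */
      d->func = NULL;
      d->completed++;
      pthread_cond_broadcast (&d->cond);
    } else {
      pthread_cond_wait (&d->cond, &d->lock);
    }
  }
  pthread_mutex_unlock (&d->lock);
  return NULL;
}

static void
dispatcher_start (Dispatcher * d)
{
  int rc;

  pthread_mutex_init (&d->lock, NULL);
  pthread_cond_init (&d->cond, NULL);
  d->func = NULL;
  d->data = NULL;
  d->submitted = d->completed = 0;
  d->quit = 0;
  rc = pthread_create (&d->thread, NULL, dispatcher_loop, d);
  assert (rc == 0);
  (void) rc;
}

/* Run func(data) on the dispatcher thread and wait until it is done. */
static void
dispatcher_call (Dispatcher * d, CallFunc func, void *data)
{
  unsigned long ticket;

  pthread_mutex_lock (&d->lock);
  while (d->func != NULL)       /* another caller's request is pending */
    pthread_cond_wait (&d->cond, &d->lock);
  d->func = func;
  d->data = data;
  ticket = ++d->submitted;
  pthread_cond_broadcast (&d->cond);
  while (d->completed < ticket)
    pthread_cond_wait (&d->cond, &d->lock);
  pthread_mutex_unlock (&d->lock);
}

static void
dispatcher_stop (Dispatcher * d)
{
  pthread_mutex_lock (&d->lock);
  d->quit = 1;
  pthread_cond_broadcast (&d->cond);
  pthread_mutex_unlock (&d->lock);
  pthread_join (d->thread, NULL);
  pthread_cond_destroy (&d->cond);
  pthread_mutex_destroy (&d->lock);
}

/* Example callback: records which thread executed it. */
static void
record_thread (void *data)
{
  *(pthread_t *) data = pthread_self ();
}
```

| A streaming thread would call dispatcher_call() with the work that |
| has to touch the non-MT-safe API. From the caller's point of view |
| the call is synchronous, but the work always executes on the single |
| owning thread. |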
| |
| |
| [0]: Defining "noticeable overhead" is always tricky, but it |
| essentially means that the overhead introduced by GStreamer core and |
| the element code should not exceed the overhead introduced for |
| non-hw-accelerated elements. |