 Hardware Acceleration in GStreamer 1.0
 --------------------------------------

 Status : DRAFT


 Preamble:

   This document serves to identify and define the various usages of
   hardware-acceleration (hereafter hwaccel) in GStreamer 1.0, the
   problems that arise and need to be solved, and a proposed API.


 Out of scope:

   This document will initially limit itself to usage of hwaccel in the
   field of video capture, processing and display due to their
   complexity.
   It is not excluded that some parts of the research could be
   applicable to other fields (audio, text, generic media).

   This document will not cover how encoded data is parsed and
   fed/obtained to/from the various hardware subsystems.


 Overall Goal:

   Make the most of the underlying hardware features while not
   introducing any noticeable overhead [0] and providing the greatest
   possible flexibility of use-cases.


 Secondary Goals:

   Avoid providing a system that only allows (efficient) usage of one
   use-case and/or through a specific combination of elements. This is
   contrary to the principles of GStreamer.

   Not introduce any unneeded memory copies.

   Not introduce any extra latency.

   Process data asynchronously wherever possible.


 Terminology:

   Due to the limitations of the GStreamer 0.10 API, most
   hardware-accelerated elements, especially sink elements, were named
   "non-raw video elements".
   In the rest of this document, we will no longer refer to them as
   non-raw since they _do_ handle raw video and in GStreamer 1.0 it no
   longer matters where the raw video is located or accessed. We will
   prefer the term "hardware-accelerated video element".


 Specificities:

   Hardware-accelerated elements differ from non-hwaccel elements in a
   few ways:

   * They handle memory which, in the vast majority of cases, is not
     directly accessible.
   * The processing _can_ happen asynchronously.
   * They _might_ be part of a GPU sub-system and therefore tightly
     coupled to the display system.


 Features handled:

   HW-accelerated elements can handle a variety of individual logical
   features. These should, in the spirit of GStreamer, be controllable
   in an individual fashion.

   * Video decoding and encoding
   * Display
   * Capture
   * Scaling (Downscaling (preview), Upscaling (Super-resolution))
   * Deinterlacing (including inverse-telecine)
   * Post-processing (Noise reduction, ...)
   * Colorspace conversion
   * Overlaying and compositing


 Use-cases:
 ----------

 UC1 : HW-accelerated video decoding to counterpart sink

   Example : * VDPAU decoder to VDPAU sink
             * libVA decoder to libVA sink

   In these situations, the HW-accelerated decoder and sink can use the
   same API to communicate with each other and share data.

   There might be extra processing that can be applied before display
   (deinterlacing, noise reduction, overlaying, ...) and that is
   provided by the backing hardware. All these features should be
   usable in a transparent fashion from GStreamer.

   They might also need to communicate/share a common context.


 UC2 : HW-accelerated video decoding to different hwaccel sink

   Example : * VDPAU/libVA decoder to OpenGL-based sink

   The goal here is to end up with the decoded pictures as OpenGL
   textures, which can then be used in an OpenGL scene (with all the
   transformations one can do with those textures).

   GStreamer is responsible for:
   1) Filling the contents of those textures
   2) Informing the application when to use which texture at which time
     (i.e. synchronization).

   How the textures are used is not the responsibility of GStreamer,
   although a fallback could be possible (for example, displaying the
   texture in a specified X window) if the application does not handle
   the OpenGL scene.

   Efficient usage is only possible if the HW-accelerated system
   provides an API by which one can either:
   * Be given OpenGL texture IDs for the decoder to decode into
   * OR 'transform' hwaccel-backed buffers into texture IDs

   Just as for UC1, some information will need to be exchanged between
   the OpenGL-backed elements and the other HW-accelerated element.


 UC3 : HW-accelerated decoding to HW-accelerated encoding

   This is needed in cases where we want to re-encode a stream from one
   format/profile to another format/profile, for example on UPnP/DLNA
   embedded devices.

   If the encoder and decoder are using the same backing hardware, this
   is similar to UC1.

   If the encoder and decoder are backed by 1) different hardware but
   there is an API allowing communication between the two, OR 2) the
   same hardware but through different APIs, this is similar to UC2.

   If the hardware backing the encoder and the decoder has no direct
   means of communication, then best effort must be made to introduce
   only one copy. The ongoing improvements in the kernel regarding DMA
   usage could help in that regard, by allowing one piece of hardware
   to be aware of another.


 UC4 : HW-accelerated decoding to software plugin

   Examples : * Transcoding a stream using a software encoder
              * Applying measurement/transformations
              * Your crazy idea here
              * ...

   While the most common usage of HW-accelerated decoding is for
   display, we do not want to restrict users of the GStreamer framework
   to a few limited use-cases for those plugins. Users should be able
   to benefit from the acceleration in any use-case.


 UC5 : Software element to HW-accelerated display

   Examples : * Software decoder to VA/VDPAU/GL/.. sink
              * Visualization to VA/VDPAU/GL/... sink
              * anything in fact

   We need to ensure in these cases that any GStreamer plugin can
   output data to a HW-accelerated display.

   This process must not introduce any unwanted synchronization issues,
   meaning the transfer to the backing hardware needs to happen before
   the synchronization time in the sinks.


 UC6 : HW-accelerated capture to HW-accelerated encoder

   Examples : * Camerabin usage
              * Streaming server
              * Video-over-IP
              * ...

   In order to provide not only low-cpu usage (through HW-accelerated
   encoding) but also low-latency, we need to be able to have capture
   hardware provide the data to be encoded in such a way that the
   encoder can read it without any copy.

   Some capture APIs provide means by which the hardware can be
   provided with a pool of buffers backed by mmap'ed contiguous
   memory.


 UC6.1 : UC6 + simultaneous preview

   Examples : Camerabin usage (preview of video/photo while shooting)


 Problems:
 ---------

 P1 : Ranking of decoders

   How do we pick the best decoder available ? Do we just set the
   ranking of hardware-accelerated plugins to higher ranks ?


 P2 : Capabilities of HW-accelerated decoders

   Hardware decoders can have much tighter constraints as to what they
   can handle (limitations in sizes, bitrate, profile, level,
   ...).

   These limitations might be known without probing the hardware, but
   in most cases they require querying it.
   As much information as possible about the stream to decode is
   needed. It can be obtained through parsers, by only looking for a
   decoder once the parser has provided extensive caps.
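
   As a minimal sketch of the check described above, assuming the
   parser has already exposed the stream properties: the types
   HwDecoderLimits and StreamInfo and the function
   hw_decoder_can_handle() are hypothetical names used for
   illustration only and are not part of any GStreamer API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of what a hardware decoder can handle.
 * These names do not exist in GStreamer; they only illustrate P2. */
typedef struct {
  unsigned max_width;
  unsigned max_height;
  unsigned max_bitrate;   /* bits per second */
  unsigned max_level;     /* codec level times 10, e.g. 41 for 4.1 */
} HwDecoderLimits;

/* Stream properties, as a parser could provide them in its caps. */
typedef struct {
  unsigned width;
  unsigned height;
  unsigned bitrate;       /* bits per second */
  unsigned level;         /* codec level times 10 */
} StreamInfo;

/* Returns true only if every stream property fits within the limits. */
static bool
hw_decoder_can_handle (const HwDecoderLimits *limits, const StreamInfo *info)
{
  assert (limits != NULL && info != NULL);

  return info->width <= limits->max_width
      && info->height <= limits->max_height
      && info->bitrate <= limits->max_bitrate
      && info->level <= limits->max_level;
}
```

   The less complete the StreamInfo, the less such a check can rule
   out, which is why extensive caps from the parser matter here.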


 P3 : Finding and auto-plugging the best elements

   Taking the case where several decoders are available and several
   sink elements are available, how do we establish which is the best
   combination ?

   Assuming we take the highest-ranked (and compatible) decoder, how do
   we figure out which sink element is compatible ?

   Assuming the user/application selects a specific sink, how do we
   figure out which is the best decoder to use ?

   /!\ Caps are no longer sufficient to establish compatibility


 P4 : How to handle systems that require calls to happen in one thread

   In OpenGL (for example) calls can only be done from one thread,
   which might not be a GStreamer thread (the sink could be controlled
   from an application thread).

   How do we properly (and safely) handle buffers and contexts ? Do we
   create an API that allows marshalling processing into the proper
   thread (resulting in an asynchronous API from the GStreamer point of
   view) ?


 Proposed Design:

 D1 : GstCaps

   We use the "video/x-raw" GstCaps.

   The format field and other required fields are filled in the same
   way they would be for non-HW-accelerated streams.


 D2 : Buffers and memory access

   The buffers used/provided/consumed by the various HW-accelerated
   elements must be usable with non-HW-accelerated elements.

   To that end, the GstMemory backing the various buffers must be
   accessible via the mapping methods and therefore have the proper
   GstAllocator implementation if so required.

   In the unlikely event that the hardware does not provide any means
   to map the memory, or that access is restricted (such as on DRM
   systems), there should still be an implementation of
   GstMemoryMapFunction that returns NULL (and a size/maxsize of zero)
   when called.
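
   The following sketch shows the behaviour a software consumer can
   then rely on. It deliberately does not use the real GstMemory or
   GstAllocator types: HwMemory, hw_memory_map() and sum_bytes() are
   stand-ins invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a hardware-backed memory block; NOT the real GstMemory. */
typedef struct {
  unsigned char *cpu_data;   /* NULL when the hardware cannot expose it */
  size_t size;
} HwMemory;

/* Mimics the map function described above: when the memory cannot be
 * mapped it returns NULL and reports a size of zero. */
static unsigned char *
hw_memory_map (HwMemory *mem, size_t *size)
{
  assert (mem != NULL && size != NULL);

  if (mem->cpu_data == NULL) {
    *size = 0;
    return NULL;
  }
  *size = mem->size;
  return mem->cpu_data;
}

/* A software consumer: sums the bytes if the memory is mappable and
 * returns -1 instead of crashing when it is not. */
static long
sum_bytes (HwMemory *mem)
{
  size_t size, i;
  long sum = 0;
  unsigned char *data = hw_memory_map (mem, &size);

  if (data == NULL)
    return -1;
  for (i = 0; i < size; i++)
    sum += data[i];
  return sum;
}
```

   The point of the design choice is that a non-HW-accelerated element
   gets a well-defined failure it can handle, not an undefined one.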


 D3 : GstVideoMeta

   In the same way that a custom GstAllocator is required, it is
   important that elements implement the proper GstVideoMeta API
   wherever applicable.

   The GstVideoMeta fields should correspond to the memory returned by
   a call to gst_buffer_map() and/or gst_video_meta_map().

   => gst_video_meta_{map|unmap}() needs to call the
      GstVideoMeta->{map|unmap} implementations


 D4 : Custom GstMeta

   In order to pass along API and/or hardware-specific information
   regarding the various buffers, the elements will be able to create
   custom GstMeta.

   Ex (For VDPAU):

   struct _GstVDPAUMeta {
      GstMeta         meta;

      VdpDevice       device;
      VdpVideoSurface surface;
      ...
   };

   If an element supports multiple APIs for accessing/using the data
   (like for example VDPAU and GLX), it should attach all the
   applicable GstMeta.


 D5 : Buffer pools

   In order to:
   * avoid expensive cycles of buffer destruction/creation,
   * allow upstream elements to end up with the optimal buffers/memory
     to which to upload,
   elements should implement GstBufferPools whenever possible.

   If the backing hardware has a system by which it differentiates used
   buffers and available buffers, the bufferpool should have the proper
   release_buffer() and acquire_buffer() implementations.
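
   A minimal sketch of the used/available bookkeeping mentioned above,
   assuming a fixed number of hardware surfaces. SurfacePool and its
   functions are hypothetical; a real element would put this logic
   behind the acquire_buffer() and release_buffer() implementations of
   its GstBufferPool subclass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 4

/* Hypothetical fixed-size pool of hardware surfaces. */
typedef struct {
  int surface_id[POOL_SIZE];   /* made-up hardware handles */
  bool in_use[POOL_SIZE];      /* true while a surface is handed out */
} SurfacePool;

/* Marks every surface as available and assigns it a made-up handle. */
static void
surface_pool_init (SurfacePool *pool)
{
  size_t i;

  assert (pool != NULL);
  for (i = 0; i < POOL_SIZE; i++) {
    pool->surface_id[i] = 100 + (int) i;
    pool->in_use[i] = false;
  }
}

/* Returns the index of a free surface and marks it as used,
 * or -1 when all surfaces are currently handed out. */
static int
surface_pool_acquire (SurfacePool *pool)
{
  int i;

  assert (pool != NULL);
  for (i = 0; i < POOL_SIZE; i++) {
    if (!pool->in_use[i]) {
      pool->in_use[i] = true;
      return i;
    }
  }
  return -1;
}

/* Makes a previously acquired surface available again. */
static void
surface_pool_release (SurfacePool *pool, int index)
{
  assert (pool != NULL);
  assert (index >= 0 && index < POOL_SIZE && pool->in_use[index]);
  pool->in_use[index] = false;
}
```

   Surfaces are reused rather than destroyed and recreated, which is
   the first of the two motivations listed above.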


 D6 : Ahead-of-time/asynchronous uploading

   In the case where the buffers to be displayed are not on the target
   hardware, we need to ensure the buffers are uploaded before the
   synchronization time. If data is uploaded at the render time we will
   end up with an unknown render latency, resulting in bad A/V
   synchronization.

   In order for this to happen, the buffers provided by downstream
   elements should have a GstAllocator implementation allowing
   uploading memory on _map(GST_MAP_WRITE).

   If this uploading happens asynchronously, the GstAllocator should
   implement a system so that if an intermediary element wishes to map
   the memory it can do so (either by providing a cached version of the
   memory, or by using locks).


 D7 : Overlay and positioning support

   FIXME : Move to a separate design doc

   struct _GstVideoCompositingMeta {
     GstMeta               meta;

     /* zorder : Depth Position of the layer in the final scene
      *            0 = background
      *    2**32 - 1 = foreground (largest value of a 32-bit guint)
      */
     guint                 zorder;

     /* x,y    : Spatial position of the layer in the final scene
      */
     guint                 x;
     guint                 y;

     /* width/height : Target width/height of the layer in the
      *   final scene.
      */
     guint                 width;
     guint                 height;

     /* basewidth/baseheight : Reference scene width/height
      *   If both values are zero, the x/y/width/height values above
      *   are to be used as absolute coordinates, regardless of the
      *   final scene's width and height.
      *   If the values are non-zero, the x/y/width/height values
      *   above should be scaled based on those values.
      *     Ex : real x position = x / basewidth * scene_width
      */
     guint                 basewidth;
     guint                 baseheight;

     /* alpha : Global alpha multiplier
      *   0.0 = completely transparent
      *   1.0 = no modification of original transparency (or opacity)
      */
     gdouble               alpha;
   };
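
   The scaling rule from the basewidth/baseheight comment can be
   sketched as follows. LayerPlacement, SceneRect and both functions
   are hypothetical names. The sketch also assumes that each axis is
   treated independently, so a zero reference on one axis makes that
   axis absolute; the comment above only specifies the cases where
   both values are zero or both are non-zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the positioning fields of the meta above. */
typedef struct {
  unsigned x, y, width, height;
  unsigned basewidth, baseheight;
} LayerPlacement;

/* Resulting rectangle, in final scene coordinates. */
typedef struct {
  unsigned x, y, width, height;
} SceneRect;

/* Scales one value from the reference scene to the final scene.
 * base == 0 means the value is already absolute. The multiplication
 * is done first, in a wider type, so that integer division does not
 * truncate the result to zero and the product cannot overflow. */
static unsigned
scale_coordinate (unsigned value, unsigned base, unsigned scene)
{
  if (base == 0)
    return value;
  return (unsigned) (((unsigned long long) value * scene) / base);
}

/* Resolves the placement of a layer for a given final scene size. */
static SceneRect
layer_placement_resolve (const LayerPlacement *p,
    unsigned scene_width, unsigned scene_height)
{
  SceneRect r;

  assert (p != NULL);
  r.x = scale_coordinate (p->x, p->basewidth, scene_width);
  r.y = scale_coordinate (p->y, p->baseheight, scene_height);
  r.width = scale_coordinate (p->width, p->basewidth, scene_width);
  r.height = scale_coordinate (p->height, p->baseheight, scene_height);
  return r;
}
```

   Note that evaluating "x / basewidth * scene_width" literally with
   unsigned integers would yield zero whenever x is smaller than
   basewidth, hence the reordering in scale_coordinate().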


 D8 : De-interlacing support

   FIXME : Move to a separate design doc

   For systems that can apply deinterlacing, the user needs to be in
   control of whether it should be applied or not.

   This should be done through the usage of the deinterlace element.

   In order to benefit from the HW-acceleration, downstream/upstream
   elements need a way by which they can indicate that the
   deinterlacing process will be applied later.

   To this end, we introduce a new GstMeta : GstDeinterlaceMeta

   typedef const gchar *GstDeinterlaceMethod;

   struct _GstDeinterlaceMeta {
     GstMeta              meta;

     GstDeinterlaceMethod method;
   };


 D9 : Context sharing

   Re-use parts of -bad's videocontext ?


 D10 : Non-MT-safe APIs

   If the wrapped API/system does not offer an API which is MT-safe
   and/or usable from more than one thread (like OpenGL), we need:
   * A system by which a global context can be provided to all elements
     wanting to use that system,
   * A system by which elements can serialize processing to a 3rd party
     thread.


 [0]: Defining "noticeable overhead" is always tricky, but it
 essentially means that the overhead introduced by GStreamer core and
 the element code should not exceed the overhead introduced for
 non-hw-accelerated elements.