BufferPools
-----------
This document proposes a mechanism to build pools of reusable buffers. The
proposal should improve performance and help to implement zero-copy use cases.
Last edited: 2009-09-01 Stefan Kost
Current Behaviour
-----------------
Elements either create their own buffers or request buffers from downstream via
pad_alloc. There is hardly any reuse of buffers; instead they are usually
disposed of after being rendered.
Problems
--------
- hardware based elements like to reuse buffers as they e.g.
  - mlock them (dsp)
  - establish an index<->address relation (v4l2)
- not reusing buffers has overhead and makes the run-time behaviour
  non-deterministic:
  - malloc (which usually becomes an mmap for bigger buffers and thus a
    syscall) and free (which can trigger consolidation of free lists in the
    allocator)
  - shm alloc/attach, detach/free (xvideo)
- some use cases cause memcpys:
  - not having the right amount of buffers (e.g. too few buffers in v4l2src)
  - receiving buffers of the wrong type (e.g. plain buffers in xvimagesink)
  - receiving buffers with wrong alignment (dsp)
- some use cases cause unneeded cache flushes when buffers are passed between
  user and kernel space
What is needed
--------------
Elements that sink raw data buffers of usually constant size would like to
maintain a bufferpool. These could be sinks or encoders. We need mechanisms to
select and dynamically update:
- the bufferpool owners in a pipeline
- the bufferpool sizes
- the queued buffer sizes, alignments and flags
Proposal
--------
Querying the bufferpool size and buffer alignments can work similarly to
latency queries (gst/gstbin.c:{gst_bin_query,bin_query_latency_fold}).
Aggregation is quite straightforward: the number of buffers is summed up and
for the alignment we take the MAX value.
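A minimal sketch of such a fold in C, assuming a hypothetical bufferpool query
and helper (none of these names exist yet; the latency fold in gst/gstbin.c
follows the same pattern):

  #include <gst/gst.h>

  /* assumed helper: performs the bufferpool query on one pad and extracts
   * the number of buffers and the alignment from the reply */
  static gboolean query_pad_bufferpool (GstPad * pad, guint * num_buffers,
      guint * alignment);

  typedef struct
  {
    guint num_buffers;          /* summed over all replies */
    guint alignment;            /* MAX over all replies */
  } BufferPoolFoldData;

  /* hypothetical fold function, modelled after bin_query_latency_fold() */
  static gboolean
  bufferpool_query_fold (GstPad * pad, BufferPoolFoldData * fold)
  {
    guint num = 0, align = 0;

    if (query_pad_bufferpool (pad, &num, &align)) {
      fold->num_buffers += num;
      fold->alignment = MAX (fold->alignment, align);
    }
    return TRUE;
  }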
Bins need to track which elements have been selected as bufferpool owners and
update if those are removed (FIXME: in which states?).
Bins would also need to track if elements that replied to the query are removed
and update the bufferpool configuration (event). Likewise, the addition of new
elements needs to be handled (query and, if the configuration changed, update
with an event).
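How a bin could react to such changes, sketched with the existing
"element-added"/"element-removed" signals of GstBin; query_bin_bufferpool()
and the "bufferpool-config" event are assumptions of this proposal:

  /* assumed helper: runs the aggregated bufferpool query over the bin */
  static gboolean query_bin_bufferpool (GstBin * bin, guint * num_buffers,
      guint * alignment);

  /* re-run the aggregation and distribute the new configuration */
  static void
  on_element_changed (GstBin * bin, GstElement * element, gpointer user_data)
  {
    guint num_buffers, alignment;

    if (query_bin_bufferpool (bin, &num_buffers, &alignment)) {
      GstStructure *s = gst_structure_new ("bufferpool-config",
          "num-buffers", G_TYPE_UINT, num_buffers,
          "alignment", G_TYPE_UINT, alignment, NULL);

      gst_element_send_event (GST_ELEMENT (bin),
          gst_event_new_custom (GST_EVENT_CUSTOM_DOWNSTREAM, s));
    }
  }

  /* in the bin's setup code: */
  g_signal_connect (bin, "element-added",
      G_CALLBACK (on_element_changed), NULL);
  g_signal_connect (bin, "element-removed",
      G_CALLBACK (on_element_changed), NULL);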
Bufferpool owners need to handle caps changes to keep the queued buffers valid
for the negotiated format.
The bufferpool could be a helper GObject (like we use GstAdapter). It would
manage a collection of GstBuffers. For each buffer it tracks whether it is in
use or available. The bufferpool in gst-plugins-good/sys/v4l2/gstv4l2bufferpool
might be a starting point.
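A minimal sketch of the helper's API; all names are assumptions, and the real
gstv4l2bufferpool differs in details:

  /* create a pool of 'num_buffers' buffers of 'size' bytes, each aligned
   * to 'align' bytes */
  GstBufferPool *gst_buffer_pool_new     (guint num_buffers, guint size,
                                          guint align);

  /* take an available buffer from the pool, NULL if all are in use */
  GstBuffer     *gst_buffer_pool_acquire (GstBufferPool * pool);

  /* mark a buffer as available again instead of freeing it */
  void           gst_buffer_pool_release (GstBufferPool * pool,
                                          GstBuffer * buffer);

  /* drop all queued buffers, e.g. before reallocating on a caps change */
  void           gst_buffer_pool_flush   (GstBufferPool * pool);

On a caps change the owner would flush the pool and create new buffers that
match the newly negotiated format.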
Scenarios
---------
v4l2src ! xvimagesink
~~~~~~~~~~~~~~~~~~~~~
- v4l2src would report 1 buffer (do we still want the queue-size property?)
- xvimagesink would report 1 buffer
v4l2src ! tee name=t ! queue ! xvimagesink t. ! queue ! enc ! mux ! filesink
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- v4l2src would report 1 buffer
- xvimagesink would report 1 buffer
- enc would report 1 buffer
filesrc ! demux ! queue ! dec ! xvimagesink
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- dec would report 1 buffer
- xvimagesink would report 1 buffer
Issues
------
Does it make sense to also have pools for sources, or should they always use
buffers from a downstream element?
Do we need to add +1 to the aggregated buffer count when allocating, so that
one buffer can be floating (in transit between elements)? E.g. can we push
buffers quickly enough for v4l2src ! xvimagesink to work with 2 buffers? What
about v4l2src ! queue ! xvimagesink?
More attributes on buffers are needed to reduce the overhead even further (see
the allocation sketch after this list):
- padding: when using buffers on hardware one might need to pad the end of the
  buffer to a specific alignment
- mlock: hardware that uses DMA needs the buffer memory locked into RAM; if a
  buffer is already memory-locked, it can be used by other hardware based
  elements as is
- cache flushes: hardware based elements usually need to flush CPU caches when
  sending results, as DMA-based memory writes do not update values that may be
  cached on the CPU. If no element downstream actually reads from this memory
  area, we could avoid the flushes; other hardware elements and elements with
  ANY caps (tee, queue, capsfilter) are examples of such elements.
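The padding and mlock points can be illustrated with plain POSIX calls; this
is a sketch only, not tied to any GStreamer API, and assumes 'align' is a
power of two (posix_memalign additionally requires it to be a multiple of
sizeof(void *)):

  #include <stdlib.h>
  #include <sys/mman.h>

  /* allocate buffer memory padded up to a multiple of 'align' bytes and
   * lock it into RAM so that DMA capable hardware can use it directly */
  static void *
  alloc_dma_buffer (size_t size, size_t align)
  {
    void *mem = NULL;
    size_t padded = (size + align - 1) & ~(align - 1);

    if (posix_memalign (&mem, align, padded) != 0)
      return NULL;
    if (mlock (mem, padded) != 0) {
      free (mem);
      return NULL;
    }
    return mem;
  }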