 ELEMENTS (v4lsrc, alsasrc, osssrc)
 --------
 - capturing elements should not do fps/sample rate correction themselves;
   they should timestamp buffers according to "a clock", period.

 - if the element is the clock provider:
   - timestamp buffers based on the internals of the clock it's providing,
     without calling the exposed clock functions
   - do this by getting a measure of elapsed time based on the internal clock
     that is being wrapped.  I.e., count the number of samples the *device*
     has processed/dropped/...
     If there are no underruns, the produced buffers are a contiguous data
     stream.
   - possibilities:
     - the device has a method to query for the absolute time related to
       a buffer you're about to capture or just have captured:
       Use that time as the timestamp on the capture buffer
       (it's important that this time is related to the capture buffer;
        ie. it's a time that "stands still" if you're not capturing)
     - since you're providing the clocking, but don't have the previous method,
       you should open the device with a given rate and continuously read
       samples from it, even in PAUSED.  This allows you to update an internal
       clock.
       You use this internal clock as well to timestamp the buffers going out,
       so you again form a contiguous set of buffers.
       The only acceptable way to continuously read samples then is in a private
       thread.
   - as long as no underruns happen, the flow being output is a perfect stream:
     the flow is data-contiguous and time-contiguous.
   - underruns should be handled like this:
     - if the code can detect how many samples it dropped, it should just
       send the next buffer with the new correct offset.  Ie, it produced
       a data gap, and since it provides the clock, it produces a perfect
       data gap (the timestamp will be correctly updated too).
     - if it cannot detect how many samples it dropped, there's a fallback
       algorithm.  The element uses another GstClock (for example, system clock)
       on which it corrects the skew and drift continuously as long as it
        doesn't drop.  When it detects a drop, it can get the time delta
       on the other GstClock since the last time it captured and the current
       time, and use that delta to guesstimate the number of samples dropped.
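
 The clock-provider case above can be sketched as follows.  This is only an
 illustration in plain Python, not GStreamer code: SampleCountingSource,
 Buffer and estimate_dropped are hypothetical names, and the sketch assumes
 the device reports how many samples it has processed.  Timestamps come from
 the sample count alone, so a detected drop yields a "perfect data gap".

```python
from dataclasses import dataclass

NSEC = 1_000_000_000  # nanoseconds per second


@dataclass
class Buffer:
    offset: int      # index of the first sample in this buffer
    offset_end: int  # index one past the last sample
    timestamp: int   # nanoseconds
    duration: int    # nanoseconds


class SampleCountingSource:
    """Hypothetical source that derives time from the device sample count."""

    def __init__(self, rate):
        self.rate = rate
        self.next_offset = 0

    def _time_of(self, sample_index):
        return sample_index * NSEC // self.rate

    def capture(self, n_samples, dropped=0):
        # Samples the device dropped still advance the offset, and therefore
        # the timestamp, so offset and time stay consistent across the gap.
        self.next_offset += dropped
        start = self.next_offset
        end = start + n_samples
        self.next_offset = end
        ts = self._time_of(start)
        return Buffer(start, end, ts, self._time_of(end) - ts)


def estimate_dropped(rate, elapsed_ns, n_captured):
    """Fallback: guess the drop count from elapsed time on another clock."""
    expected = elapsed_ns * rate // NSEC
    return max(0, expected - n_captured)
```

 With an 8000 Hz source, two captures of 800 samples are time-contiguous,
 and a capture after 400 dropped samples starts at offset 2000.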

 - if the element is not the clock provider
   - the element should always respect the clock it is given.
   - the element should timestamp outgoing buffers based on time given by
     the provided clock, by querying for the time on that clock, and
     comparing to the base time.
   - the element should NOT drop/add frames.  Rather, it should just
     - timestamp the buffers with the current time according to the provided
       clock
     - set the duration according to the *theoretical/nominal* framerate
     - when underruns happen (the device has lost capture data because our
       element is not reading it quickly enough), this should be detectable
       by the element through the device.  On underrun, the offset of your
       next buffer will not match the offset_end of your previous one
       (ie, the data flow is no longer contiguous).
       If the exact number of samples dropped is detectable, this is the
       difference between new offset and old offset_end.
       If it's not detectable, it should be guessed based on the elapsed time
       between now and the last capture.
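
 A minimal sketch of the non-provider rules above, in plain Python rather
 than the GStreamer API.  The function names are hypothetical, and the
 guess assumes the device nominally delivers one frame per nominal period.

```python
from fractions import Fraction

NSEC = 1_000_000_000  # nanoseconds per second


def stamp_buffer(clock_now_ns, base_time_ns, nominal_fps):
    """Timestamp from the provided clock, duration from the nominal rate."""
    timestamp = clock_now_ns - base_time_ns
    duration = int(Fraction(NSEC) / Fraction(nominal_fps))
    return timestamp, duration


def dropped_exact(prev_offset_end, new_offset):
    """Drop count when the device exposes offsets."""
    return new_offset - prev_offset_end


def dropped_guess(last_capture_ns, now_ns, nominal_fps):
    """Drop count guessed from the time elapsed since the last capture."""
    elapsed = Fraction(now_ns - last_capture_ns)
    periods = round(elapsed * Fraction(nominal_fps) / NSEC)
    # One period is accounted for by the frame that was just captured.
    return max(0, periods - 1)
```

 At 25 fps, a 200 ms gap between two captures suggests four missed frames.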

 - a second element can be responsible for making the stream time-contiguous.
   (ie, T1 + D1 = T2 for all buffers).  This way they are made
   acceptable for gapless presentation (which is useful for audio).
   - The element treats the incoming stream as data-contiguous but not
     necessarily time-contiguous.
   - If the timestamps are contiguous as well, then everything is fine and
     nothing needs to be done.  This is the case where a file is being read
     from disk, or capturing was done by an element that provided the clock.
   - If they are not contiguous, then this element must make them so.
     Since it should respect the nominal framerate, it has to stretch or
     shorten the incoming data to match the timestamps set on the data.
     For audio and video, this means it could interpolate or add/drop samples.
     For audio, resampling/interpolation is preferred.
     For video, a simple mechanism that chooses the frame with a timestamp as
     close as possible to the theoretical timestamp could be used.
   - When it receives a new buffer that is not data-contiguous with the
     previous one, the capture element dropped samples/frames.
     The adjuster can correct this by sending out as much "no-signal" data
     (for audio, e.g. silence or background noise; for video, sending out
     black frames) as it wants, since a data discontinuity is unrepairable.
     So it can use these to catch up more aggressively.
     It should just make sure that the next buffer it gets again goes
     back to respecting the nominal framerate.
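
 As an illustration of the gap handling described above, the sketch below
 fills a data discontinuity with silence.  It is plain Python with a
 hypothetical fill_gaps helper, and it makes one particular choice: it
 inserts exactly as many silent samples as the offsets say are missing.
 The text leaves the adjuster free to insert a different amount.

```python
def fill_gaps(buffers, silence=0):
    """Flatten (offset, samples) buffers, padding data gaps with silence.

    buffers is a list of (offset, samples) tuples ordered by offset.
    """
    out = []
    expected = None  # offset_end of the previous buffer
    for offset, samples in buffers:
        if expected is not None and offset > expected:
            # Data discontinuity: the capture element dropped samples.
            out.extend([silence] * (offset - expected))
        out.extend(samples)
        expected = offset + len(samples)
    return out
```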

 - To achieve the best possible long-time capture, the following can be done:
   - audiosrc captures audio and provides the clock.  It does contiguous
     timestamping by default.
   - videosrc captures video timestamped with the audiosrc's clock.  This data
     feed doesn't match the nominal framerate.  If there is an encoding format
     that supports storing the actual timestamps instead of pretending the
     data flow respects the nominal framerate, this can be corrected after
     recording.
   - at the end of recording, the absolute length in time of both streams,
     measured against a common clock, is the same or can be made the same by
     chopping off data.
   - the nominal rate of both audio and video is also known.
   - given the length and the nominal rate, we have an evenly spaced list
     of theoretical sampling points.
   - video frames can now be matched to these theoretical sampling points by
     interpolating or reusing/dropping frames.  It can choose the best
     possible algorithm for this to decrease the visible effects
     (interpolating results in blur, add/drop frames results in jerkiness).
   - with the video resampled at the theoretical framerate, and the audio
     already correct, the recording can now be muxed correctly into a format
     that implicitly assumes a data rate matching the nominal framerate.
   - One possibility is to use the GDP to store the recording, because that
     retains all of the timestamping information.
   - The process is symmetrical; if you want to use the clock provided by
     the video capturer, you can stretch/shrink the audio at the end of
     recording to match.
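
 The post-recording step above (evenly spaced sampling points, then matching
 frames to them) could look roughly like this.  It is a sketch in plain
 Python using nearest-timestamp selection, which reuses or drops frames;
 interpolation would be the alternative.  sampling_points and
 resample_nearest are hypothetical names.

```python
NSEC = 1_000_000_000  # nanoseconds per second


def sampling_points(length_ns, fps):
    """Evenly spaced theoretical timestamps for a recording of this length."""
    count = length_ns * fps // NSEC
    return [i * NSEC // fps for i in range(count)]


def resample_nearest(frames, points):
    """Pick, for each sampling point, the frame with the closest timestamp.

    frames is a non-empty list of (timestamp_ns, payload) tuples.
    """
    return [min(frames, key=lambda f: abs(f[0] - p))[1] for p in points]
```

 For example, with frames captured at 0 ms, 90 ms and 410 ms and a nominal
 rate of 5 fps over 800 ms, the last frame is reused for the 600 ms point.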

 TERMINOLOGY
 -----------
 - nominal rate
   the framerate/samplerate
   exposed in the caps; ie. the theoretical framerate of the
   data flow.  This is the fps reported by the device or set for the encoder,
   or the sampling rate of the audio device.
 - contiguous data flow
   offset_end of old buffer matches offset of new buffer
   for audio, this is a more important requirement, since you configure
   output devices for a contiguous data flow.
 - contiguous time flow
   T1 + D1 = T2
   for video, this is a more important requirement, because the sampling
   period is bigger, so it is more important to match the presentation time
 - "perfect stream"
   data and time are contiguous and match the nominal rate
   videotestsrc, sinesrc, filesrc ! decoder produce this
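
 The three definitions above translate directly into checks.  The sketch
 below is plain Python; Buf and the function names are hypothetical, and
 the rate check assumes audio-style buffers where offsets count samples.

```python
from typing import NamedTuple

NSEC = 1_000_000_000  # nanoseconds per second


class Buf(NamedTuple):
    offset: int
    offset_end: int
    timestamp: int  # T, in nanoseconds
    duration: int   # D, in nanoseconds


def is_data_contiguous(bufs):
    # offset_end of old buffer matches offset of new buffer
    return all(a.offset_end == b.offset for a, b in zip(bufs, bufs[1:]))


def is_time_contiguous(bufs):
    # T1 + D1 = T2
    return all(a.timestamp + a.duration == b.timestamp
               for a, b in zip(bufs, bufs[1:]))


def is_perfect(bufs, rate):
    # Data and time are contiguous, and durations match the nominal rate.
    matches_rate = all((b.offset_end - b.offset) * NSEC == b.duration * rate
                       for b in bufs)
    return is_data_contiguous(bufs) and is_time_contiguous(bufs) and matches_rate
```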

 NETWORK
 -------
 - elements can be synchronized by writing an NTP clock subclass that listens
   to an NTP server, and tries to match its own clock against the NTP server
   by doing gradual rate adjustment, compared with its own system clock.
 - sending audio and video over the network using tcpserversink is possible
   when the streams are made to be perfect streams and synchronized.
   Since the streams are perfect and synchronized, the timestamps transmitted
   along with the buffers can be trusted.  The client just has to make
   sure that it respects the timestamps.
 - One good way of doing that is to make an element that provides a clock
   based on the timestamps of the data stream, interpolating using another
   GstClock in between those time points.  This allows you to create
   a perfect network stream player (one that doesn't lag (increasing buffers)
   or play too fast (having an empty network queue)).
 - On the client side, a GStreamer-ish way to do that is to cut the playback
   pipeline in half, and have a decoupled element that converts
   timestamps/durations (by resampling/interpolating/...) so that the sinks
   consume data at the same rate the tcp sources provide it.
   tcpclientsrc ! theoradec ! clocker name=clocker { clocker. ! xvimagesink }
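
 The clock "based on the timestamps of the data stream" could be sketched
 like this.  It is plain Python, not a GstClock subclass: StreamClock is a
 hypothetical name, the local time is passed in explicitly, and the sketch
 ignores that a real clock would also have to stay monotonic when a new
 timestamp arrives.

```python
class StreamClock:
    """Clock driven by stream timestamps, interpolated with a local clock."""

    def __init__(self):
        self._stream_ts = None     # last timestamp seen in the stream
        self._local_at_obs = None  # local clock time when it was seen

    def observe(self, stream_ts_ns, local_now_ns):
        """Record a timestamp from the stream and when it arrived locally."""
        self._stream_ts = stream_ts_ns
        self._local_at_obs = local_now_ns

    def now(self, local_now_ns):
        """Stream time: last timestamp plus local time elapsed since then."""
        if self._stream_ts is None:
            raise RuntimeError("no stream timestamp observed yet")
        return self._stream_ts + (local_now_ns - self._local_at_obs)
```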

 SYNCHRONISATION
 ---------------
 - low rate source with high rate source:
   the high rate source can drop samples so it starts with the same phase
   as the low rate source.  This could be done in a synchronizer element.
   example:
   - audio, 8000 Hz, and video, 5 fps
   - pipeline goes to playing
   - video src does capture and receives its first frame 50 ms after playing
     -> phase is -90 or 270 degrees
   - to compensate, the equivalent of 150 ms of audio could be dropped so
     that the first videoframe's timestamp coincides with the timestamp of
     the first audio buffer
   - this should be done in the raw audio domain since it's typically not
     possible to chop off samples in the encoded domain

 - two low rate sources:
   not possible to do this correctly; maybe something in the middle can be
   found?
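
 The arithmetic in the audio/video example above can be checked with a
 short sketch.  It is plain Python with hypothetical helper names, and it
 only converts the figures given in the example (a phase of 270 degrees at
 5 fps, 150 ms of audio at 8000 Hz); it does not decide how much to drop.

```python
def phase_to_ms(phase_degrees, low_rate_hz):
    """Convert a phase of the low rate source into milliseconds."""
    period_ms = 1000 / low_rate_hz
    return (phase_degrees % 360) / 360 * period_ms


def samples_to_drop(drop_ms, sample_rate_hz):
    """Number of raw audio samples equivalent to drop_ms milliseconds."""
    return round(drop_ms * sample_rate_hz / 1000)
```

 At 5 fps one period is 200 ms, so 270 degrees (or -90) corresponds to
 150 ms, which is 1200 samples at 8000 Hz.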

 IMPROVING QUALITY
 -----------------
 - video src can capture at a higher framerate than will be encoded
 - this gives the corrector more frames to choose from or interpolate with
   to match the target framerate, reducing jerkiness.
   e.g. capturing at 15 fps for 5 fps framerate.

 LIVE CHANGES IN PIPELINE
 ------------------------
 - case 1: video recording for some time, user wants to add audio recording on
           the fly
   - user sets complete pipeline to paused
   - user adds element for audio recording
   - new element gets same base time as video element
   - on PLAYING, new element will be in sync and the first buffer produced
     will have a non-zero timestamp that is the same as the first new video
     buffer

 - case 2: video recording for some time, user wants to add in an audio file
           from disk.
   - two possible expectations:
     A) user expects the audio file to "start playing now" and be muxed
        together with the current video frames
     B) user expects the audio file to "start playing from the point where the
        video currently is" (ie, video is at 10 seconds, so mux with audio
        starting from 10 secs)
   - case A):
     - complete pipeline gets paused
     - filesrc ! dec added
     - both get base_time same as video element
     - pipeline to playing
     - all elements receive new "now" as base_time so timestamps are reset
     - muxer will receive synchronized data from both
   - case B):
     - nothing gets paused
     - filesrc ! dec added
     - both get base_time that is the current clock time
     - pipeline to playing
     - core sets
       1) - new audio part starts sending out data with timestamp 0 from start
            of file
          - muxer receives a whole set of frames from the audio side that are
            late (since the timestamps start at 0), so keeps dropping until it
            has caught up with the current set.
       OR
       2) - audio part does clock query

 THINGS TO DIG UP
 ----------------
 - is there a better way to get at "when was this frame captured" than doing
   a clock query after capturing?
   Imagine a video device with a hardware buffer of four frames.  If you
   haven't asked for a frame from it in a while, three frames could be
   queued up.  So three consecutive frame gets result in immediate returns
   with pretty much the same clock query for each of them.
   So we should find a way to get "a comparable clock time" corresponding
   to the captured frame.

 - the v4l2 API returns a gettimeofday() timestamp with each buffer.
   Given that, you can timestamp the buffer by taking the delta between the
   current system clock time and the buffer's timestamp, and subtracting
   that delta from the current time reported by the provided clock.
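
 The subtraction described above, as a sketch in plain Python.  The function
 and parameter names are hypothetical, and it assumes the system clock and
 the provided clock advance at nearly the same rate over the short age of
 the buffer.

```python
def timestamp_from_device(buffer_system_ts_ns, system_now_ns, provided_now_ns):
    """Map a device timestamp taken on the system clock to the provided clock."""
    # How long ago the buffer was captured, measured on the system clock.
    age_ns = system_now_ns - buffer_system_ts_ns
    # The same moment, expressed on the provided clock.
    return provided_now_ns - age_ns
```

 Three frames queued in hardware then get three distinct timestamps, even
 if they are fetched back to back.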